it's all serialization - but that's not a bottleneck for most web servers. i'd l...

aphyr · on July 29, 2012

Threads and processes both require a context switch, but on posix systems the thread switch is considerably less expensive. Why? Mainly because the process switch involves changing the VM address space, which means a TLB shootdown: all that hard-earned cache has to be fetched from DRAM again. You also pay a higher cost in synchronization: every message shared between processes requires crossing the kernel boundary. So not only do you have a higher memory use for shared structures and higher CPU costs for serialization, but more cache churn and context switching.

it's all serialization - but that's not a bottleneck for most web servers.

I disagree, especially for a format like JSON. In fact, every web app server I've dug into spends a significant amount of time on parsing and unparsing responses. You certainly aren't going to be doing computationally expensive tasks in Node, so messaging performance is paramount.

i'd love to hear your context-switching free multicore solution.

I claimed no such thing: only that multiprocess IPC is more expensive. Modulo syscalls, I think your best bet is gonna be n-1 threads with processor affinities taking advantage of cas/memory fence capabilities on modern hardware.

this is the sanest and most pragmatic way server a web server from multiple threads

What is this I can't even.

aphyr · on July 29, 2012

Note--think I'm wrong about these types of process switches requiring a TLB shootdown. It think it's just cache invalidation.

aphyr · on July 29, 2012

Don't believe me? Try it:

Node.js: https://gist.github.com/3200829

Clojure: https://gist.github.com/3200862

Note that I picked the really small messages here--integers, to give node the best possible serialization advantage.

    $ time node cluster.js 
    Finished with 10000000

    real 3m30.652s
    user 3m17.180s
    sys	 1m16.113s

Note the high sys time: that's IPC. Node also uses only 75% of each core. Why?

    $ pidstat -w | grep node
    11:47:47 AM     25258     48.22      2.11  node
    11:47:47 AM     25260     48.34      1.99  node

96 context switches per second.

Compare that to a multithreaded Clojure program which uses a LinkedTransferQueue--which eats 97% of each core easily. Note that the times here include ~3 seconds of compilation and jvm startup.

    $ time lein2 run queue
    10000000
    "Elapsed time: 55696.274802 msecs"

    real 0m58.540s
    user 1m16.733s
    sys	 0m6.436s

Why is this version over 3 times faster? Partly because it requires only 4 context switches per second.

    $ pidstat -tw -p 26537
    Linux 3.2.0-3-amd64 (azimuth) 	07/29/2012 	_x86_64_	(2 CPU)
    
    11:52:03 AM      TGID       TID   cswch/s nvcswch/s  Command
    11:52:03 AM     26537         -      0.00      0.00  java
    11:52:03 AM         -     26540      0.01      0.00  |__java
    11:52:03 AM         -     26541      0.01      0.00  |__java
    11:52:03 AM         -     26544      0.01      0.00  |__java
    11:52:03 AM         -     26549      0.01      0.00  |__java
    11:52:03 AM         -     26551      0.01      0.00  |__java
    11:52:03 AM         -     26552      2.16      4.26  |__java
    11:52:03 AM         -     26553      2.10      4.33  |__java

And queues are WAY slower than compare-and-set, which involves basically no context switching:

    $ time lein2 run atom
    10000000
    "Elapsed time: 969.599545 msecs"

    real 0m3.925s
    user 0m5.944s
    sys	 0m0.252s

    $ pidstat -tw -p 26717
    Linux 3.2.0-3-amd64 (azimuth) 	07/29/2012 	_x86_64_	(2 CPU)

    11:54:49 AM      TGID       TID   cswch/s nvcswch/s  Command
    11:54:49 AM     26717         -      0.00      0.00  java
    11:54:49 AM         -     26720      0.00      0.01  |__java
    11:54:49 AM         -     26728      0.01      0.00  |__java
    11:54:49 AM         -     26731      0.00      0.02  |__java
    11:54:49 AM         -     26732      0.00      0.01  |__java

TL;DR: node.js IPC is not a replacement for a real parallel VM. It allows you to solve a particular class of parallel problems (namely, those which require relatively infrequent communication) on multiple cores, but shared state is basically impossible and message passing is slow. It's a suitable tool for problems which are largely independent and where you can defer the problem of shared state to some other component, e.g. a database. Node is great for stateless web heads, but is in no way a high-performance parallel environment.

ryah · on July 30, 2012

Yep, serialized message passing between threads is slower - you didn't have to go through all that work. But it doesn't matter because that's not the bottleneck for real websites.

Also Node starts up in 35ms and doesn't require all those parentheses - both of which are waaaay more important.

KirinDave · on July 31, 2012

It is definitely a bottleneck I face daily in designing crowdflower. Our EC2 bill is much higher than it could be because we have to rely so much on process-level forking.

A great example of this is resque. It'd be great if we could have multiple resque workers per process per job type. This would save a ton of resources and greatly improve processing for very expensive classes of jobs. It's a very real-world consideration. But instead, because our architecture follows this "share nothing in my code, pass that buck to someone else" model like a religion, we waste a lot of computing resources and lose opportunities for better reliability.

What I find most confusing about this argument is that I challenge you to find me a website written in your share nothing architecture that, at the end of the day, isn't basically a bunch of CRUD and UX chrome around a big system that does in-process parallelism for performance and consistency considerations. Postgresql, MySQL, Zookeeper, Redis, Riak, DynamoDB ... all these things are where the actual heavy lifting gets done.

Given how pivotal these things are to modern websites, it's bizarre for you to suggest it is not something to consider.

bascule · on July 30, 2012

It's more than that. Several processes with small, independently garbage collected heaps are not as efficient as a single process with a large heap, parallel threads, and a modern concurrent GC (e.g. the JVM's ConcurrentMarkSweep GC)

In addition to that, processes severely inhibit the usefulness of in-process caches. Where threads would allow a single VM to have a large in-process cache, processes generally prevent such collaboration and mean you can only have multiple, duplicated, smaller in-process caches. (Yes, you could use SysV shared memory, but that's also fraught with issues)

The same goes for any type of service you would like to run inside a particular web server that could otherwise be shared among multiple threads.

rektide · on July 29, 2012

I prefer someone keep rolling for sanity loss & resume the work on isolates!

Web serving is OK & all, but I'd love if node could be an ideal runtime for petri-nets and webworker meshes too.