Threads and processes both require a context switch, but on posix systems the thread switch is considerably less expensive. Why? Mainly because the process switch involves changing the VM address space, which means a TLB shootdown: all that hard-earned cache has to be fetched from DRAM again. You also pay a higher cost in synchronization: every message shared between processes requires crossing the kernel boundary. So not only do you have a higher memory use for shared structures and higher CPU costs for serialization, but more cache churn and context switching.
it's all serialization - but that's not a bottleneck for most web servers.
I disagree, especially for a format like JSON. In fact, every web app server I've dug into spends a significant amount of time on parsing and unparsing responses. You certainly aren't going to be doing computationally expensive tasks in Node, so messaging performance is paramount.
i'd love to hear your context-switching free multicore solution.
I claimed no such thing: only that multiprocess IPC is more expensive. Modulo syscalls, I think your best bet is gonna be n-1 threads with processor affinities taking advantage of cas/memory fence capabilities on modern hardware.
this is the sanest and most pragmatic way server a web server from multiple threads
Note that I picked the really small messages here--integers, to give node the best possible serialization advantage.
$ time node cluster.js
Finished with 10000000
real 3m30.652s
user 3m17.180s
sys 1m16.113s
Note the high sys time: that's IPC. Node also uses only 75% of each core. Why?
$ pidstat -w | grep node
11:47:47 AM 25258 48.22 2.11 node
11:47:47 AM 25260 48.34 1.99 node
96 context switches per second.
Compare that to a multithreaded Clojure program which uses a LinkedTransferQueue--which eats 97% of each core easily. Note that the times here include ~3 seconds of compilation and jvm startup.
$ time lein2 run queue
10000000
"Elapsed time: 55696.274802 msecs"
real 0m58.540s
user 1m16.733s
sys 0m6.436s
Why is this version over 3 times faster? Partly because it requires only 4 context switches per second.
$ pidstat -tw -p 26537
Linux 3.2.0-3-amd64 (azimuth) 07/29/2012 _x86_64_ (2 CPU)
11:52:03 AM TGID TID cswch/s nvcswch/s Command
11:52:03 AM 26537 - 0.00 0.00 java
11:52:03 AM - 26540 0.01 0.00 |__java
11:52:03 AM - 26541 0.01 0.00 |__java
11:52:03 AM - 26544 0.01 0.00 |__java
11:52:03 AM - 26549 0.01 0.00 |__java
11:52:03 AM - 26551 0.01 0.00 |__java
11:52:03 AM - 26552 2.16 4.26 |__java
11:52:03 AM - 26553 2.10 4.33 |__java
And queues are WAY slower than compare-and-set, which involves basically no context switching:
$ time lein2 run atom
10000000
"Elapsed time: 969.599545 msecs"
real 0m3.925s
user 0m5.944s
sys 0m0.252s
$ pidstat -tw -p 26717
Linux 3.2.0-3-amd64 (azimuth) 07/29/2012 _x86_64_ (2 CPU)
11:54:49 AM TGID TID cswch/s nvcswch/s Command
11:54:49 AM 26717 - 0.00 0.00 java
11:54:49 AM - 26720 0.00 0.01 |__java
11:54:49 AM - 26728 0.01 0.00 |__java
11:54:49 AM - 26731 0.00 0.02 |__java
11:54:49 AM - 26732 0.00 0.01 |__java
TL;DR: node.js IPC is not a replacement for a real parallel VM. It allows you to solve a particular class of parallel problems (namely, those which require relatively infrequent communication) on multiple cores, but shared state is basically impossible and message passing is slow. It's a suitable tool for problems which are largely independent and where you can defer the problem of shared state to some other component, e.g. a database. Node is great for stateless web heads, but is in no way a high-performance parallel environment.
Yep, serialized message passing between threads is slower - you didn't have to go through all that work. But it doesn't matter because that's not the bottleneck for real websites.
Also Node starts up in 35ms and doesn't require all those parentheses - both of which are waaaay more important.
It is definitely a bottleneck I face daily in designing crowdflower. Our EC2 bill is much higher than it could be because we have to rely so much on process-level forking.
A great example of this is resque. It'd be great if we could have multiple resque workers per process per job type. This would save a ton of resources and greatly improve processing for very expensive classes of jobs. It's a very real-world consideration. But instead, because our architecture follows this "share nothing in my code, pass that buck to someone else" model like a religion, we waste a lot of computing resources and lose opportunities for better reliability.
What I find most confusing about this argument is that I challenge you to find me a website written in your share nothing architecture that, at the end of the day, isn't basically a bunch of CRUD and UX chrome around a big system that does in-process parallelism for performance and consistency considerations. Postgresql, MySQL, Zookeeper, Redis, Riak, DynamoDB ... all these things are where the actual heavy lifting gets done.
Given how pivotal these things are to modern websites, it's bizarre for you to suggest it is not something to consider.
It's more than that. Several processes with small, independently garbage collected heaps are not as efficient as a single process with a large heap, parallel threads, and a modern concurrent GC (e.g. the JVM's ConcurrentMarkSweep GC)
In addition to that, processes severely inhibit the usefulness of in-process caches. Where threads would allow a single VM to have a large in-process cache, processes generally prevent such collaboration and mean you can only have multiple, duplicated, smaller in-process caches. (Yes, you could use SysV shared memory, but that's also fraught with issues)
The same goes for any type of service you would like to run inside a particular web server that could otherwise be shared among multiple threads.
this is the sanest and most pragmatic way server a web server from multiple threads