
My guess would be that the poorly performing implementations are using the naive select() solution instead of one of the more scalable, event-based mechanisms like epoll or kqueue.

Pretty surprised to see node.js behave so poorly. We recently had discussions at work about which server solution to use for a multiplayer platform, and since I had recently built a socket/connection-handling server in C, they asked what I thought about node.js. My only concern was that if the underlying code uses select(), it's going to be a poor choice... but I don't think anyone got around to testing it.

Search for "C10K" for more information.




I did a bunch of these benchmarks a while back and the most important factor in a performance test like this is the VM/Language environment. Simply calling into a VM on every IO request will rapidly destroy performance no matter what your IO setup looks like.

Erlang's IO stack is just fantastic for this and is generally very well optimized. If you want to go even faster you can code the fast path in C/C++ and call out to a higher-level language as needed. A little dated but helpful: http://www.metabrew.com/article/a-million-user-comet-applica...

I think that Webbit is also pretty good; this benchmark is a little contrived.


The poller is a very important factor, but there are other things to consider as well, depending on how fast you want to go. Sometimes 'fastest' is not what you should look at, though.

Solutions like node.js, Python, etc. copy strings and buffers all the time, and this can slow down the server considerably. They don't have an honest iovec layer that avoids copying data before passing it to writev, and you cannot manage your own memory or create memory pools. These factors can have a bigger impact on performance than selecting the right poller.


That doesn't sound right to me. There are 10k clients, each sending (and the server echoing) one small timestamp record per second. An AWS medium instance has access to half a DRAM pipe (frankly this test should fit entirely in L3 cache, but let's be conservative). The back of my envelope says that the RAM bandwidth can handle about 100 KB of data copying per send or receive event. That's a staggeringly large amount of copying, literally thousands of times larger than the record itself.

Honestly I think CPU or syscall overhead is a more likely culprit.


You'd think.

Write a web server in C and a web server in python/erlang/go/node.js. You'll notice that the one written in C (or C++) runs twice as fast. If it was a syscall bottleneck then they'd achieve the same performance.

Zero-copy request handling is not possible in high-level languages without hacking.

For a quick test, run httperf against the dumbest node.js web server that returns 204.

Then run the same test against an nginx server that does nothing but return a 204:

___location / { return 204; }

Compare the difference.


The Python version uses Gevent so likely uses epoll/kqueue for this test. It'd be interesting to profile and see where the issues are in the Node.js and Python code.



