A lot of websockets in Haskell

jlouis · on Nov 5, 2015

Some comments:

Haskell will use less memory on this task. Since it has a shared heap, it doesn't have to allocate a small heap per process, and thus it is expected that it has lower overhead. Furthermore, static typing means Haskell needs less type tags and this tend to make it win. As the system grows in complexity, these things tend to even out a bit more but I would still expect Haskell to use about half the memory of Erlang.

The Elixir numbers for Phoenix sounds off. At 83765 megabytes and 1999984 connections, that is 42 kilobytes per connection. That count is about an order of magnitude over what I would expect it to be. How much of that memory is kernel allocated network buffer space, and how much is buffer space in the Erlang runtime? A "raw" process in Erlang is around 1.5 kilobytes nowadays, including stack and heap, so where do the additional 40 get allocated to? I don't think we have 20 extra processes per connection for some reason :) Definitely something to look into.

Tsung is an old application. It isn't really written in ways that makes it efficient at the network level and this shows its head in this benchmark. Furthermore, Tsung does more work than the broadcast in Haskell, so it is expected the load generator will give up long before the server. Again, measure the amount of memory allocated by the kernel and by the userland process in order to determine if it is one or the other you hit first. Still, I would expect Tsung to be the culprit.

chrismccord · on Nov 5, 2015

I don't have hard numbers on where our 40kb per client is being allocated, but to be fair, we are doing a lot more work than the Haskell example. Our "Channels" layer is a full-featured part of the framework. For each WebSocket connection, we start a WebSocket transport process that multiplexes channel servers that subscribe to different topics. Each of these live underneath a supervisor process as well that monitors the servers. So out of the gate we are starting three processes per connection (one for the supervisor, one for the transport, and one for the single "rooms:lobby" channel). We set up monitors to detect the channel crashing/closing so we can notify the client that that channel went away. These things have overhead. I think we have room to optimize our conn size, but it's worth mentioning it's not a raw WS vs WS comparison.

jlouis · on Nov 5, 2015

That would explain it.

In addition, the classic PubSub pattern here screams the Disruptor pattern (which I first heard about from Trisha Gee and Martin Thompson). The Erlang runtime has no direct good support for this kind of pattern, so you have to opt for things such a ETS to simulate it. It'll work albeit at an overhead.

chrismccord · on Nov 5, 2015

We're using ets for our PubSub layer (which sits under channels). I'm not familiar with the Disruptor patterns, but we've been extremely happy with ets. The latest optimizations to come out of our benchmarks have us sharding pusbsub subscriber by Pid into ets tables managed by pooled pubsub servers. Our PubSub layer also is distributed out of the box. So the flow is ets tables for "local" subscribers on each node, then we use a pg2 group to bridge the broadcasts across the cluster. The pg2 bridge is our next area for stress testing.

bos · on Nov 5, 2015

A small correction: GHC uses plenty of type tagging information. In fact, its metadata overhead is relatively high.

Kutta · on Nov 5, 2015

Relatively high compared to what? With GHC we have a single-word header on objects, which compares favorably to C# which usually has two-word headers, or Java, which similarly uses one word. Of course, GC-less languages like Rust or C++ have usually no tags at all, but I think it makes more sense to compare among GC-d languages.

thinkpad20 · on Nov 5, 2015

Why is this there? Is it to facilitate things like Typeable? I believe that there's no language-level way to do things like runtime type reflection. And even if there were, how would one express a complex type like (Vector (forall a. MyTypeClass a => a -> Int, String))?

I'm also curious if dependently typed languages like Idris, which presumably must be able to have runtime access to type information, handle this stuff.

chadaustin · on Nov 6, 2015

For values, laziness means there is a tag bit for whether a value is a thunk or evaluated. Sum types use tags to determine which variant is active.

For functions, because a function that takes two arguments and returns a value (a -> a -> a) has the same type as a function that takes one argument and returns a function that takes another argument that returns a value (a -> a -> a), the arity of functions is stored in the tag.

Some of these tags are eliminated by inlining but if you sit down and read some typical Haskell output you'll see a _whole lot_ of tag checks.

Source: spent a lot of time reading GHC output and writing high-performance Haskell code.

15155 · on Nov 6, 2015

In Idris, as far as I know, runtime type information is kept around by default and erased through usage-based optimization (and possibly annotation?)

http://docs.idris-lang.org/en/latest/reference/erasure.html

platz · on Nov 5, 2015

what is the state-of-the-art posix bench tool in your opinion? I'm aware of things like gatling but not sure where on the spectrum it sits in terms of features or popularity

jlouis · on Nov 5, 2015

I don't think anything has the flexibility of tsung to be honest. It can test many different protocols already. A better way would probably be to optimize it a bit for lower memory usage.

For web server benchmarking, only wrk2 by Gil Tene does things correctly. Everything else usually does coordinated omission:

Imagine you have 10.000 connections. Each connection is doing 3 req/s. Let's say one connection blocks for 1 second, which means that 2 req's should have fired on that connection "in between". wrk2 will count those two as being "late" whereas most other load generators won't count at all. This means a framework can opt to "stall" some connections in order to get better performance and fewer bad results in the upper latencies.

As an example, here are the Erlang/Cowboy numbers for such a test in wrk2:

 	Latency Distribution (HdrHistogram - Recorded Latency)
	 50.000%    6.33ms
	 75.000%   10.23ms
	 90.000%   13.73ms
	 99.000%   22.37ms
	 99.900%   31.26ms
	 99.990%   38.62ms
	 99.999%   45.09ms
	100.000%   49.60ms

Whereas the Haskell wai framework returns

         Latency Distribution (HdrHistogram - Recorded Latency)
	 50.000%    1.71ms
	 75.000%    2.64ms
	 90.000%   14.92ms
	 99.000%   38.75ms
	 99.900%  742.40ms
	 99.990%  985.60ms
	 99.999%    1.03s 
	100.000%    1.05s

Note how the median latency and the 75th percentile is better for Haskell, but that it occasionally stalls requests for quite some time, probably due to a GC pause or some other cleanup that happens and then in an unfortunate moment all has to happen at the same point in time.

If you go look at typical benchmarks their latency reporting is way off compared to this, which is a surefire way of knowing they did not account for coordinated omission.

Mind, when benchmarks disagree, the trick is to explain why this happens. It often leads to an insight in design difference.

atombender · on Nov 5, 2015

Looks like wrk 4.0 [1] now has a similar "HDR histogram" latency algorithm as wrk2 -- have you tried it out?

[1] https://github.com/TechEmpower/FrameworkBenchmarks/issues/12...

marcosdumay · on Nov 5, 2015

Quite inline with my experience.

I wrote a mail server in Haskell that, because of a bug was leaking connections. Didn't use nix, so I had the external forkIO explicit, and used a custom sockets interface, that creates a buffer for every connection. I optimized almost nothing, this was an earlier version of it.

Anyway, every time the server would reliably get to about 700k open connections before getting killed at my linode machine.

BenoitP · on Nov 5, 2015

What this benchmark shows is the lightweightness of the memory needed per websocket. 500k users is really impressive.

500k users per machine is great if they are mostly idle. This is the use case of WhatsApp, and their stats are[1]:

> Peaked at 2.8M connections per server

> 571k packets/sec

> >200k dist msgs/sec

Not every app is meant to have mostly idle users. Can a real-time MMO FPS be done in the Haskell server is the question (lets limit each user's neighbourhood to the 10 closest players).

I'd be very interested in the other corners of the envelope for the Haskell server: A requests/second benchmark over websockets, with the associated latency HdrHistogram, like [2]

[1] http://highscalability.com/blog/2014/2/26/the-whatsapp-archi...

[2] http://www.ostinelli.net/a-comparison-between-misultin-mochi...

binaryapparatus · on Nov 5, 2015

I really love Haskell and wish there is more opportunity for me to use it. But for high concurrency web projects Elixir has one big advantage -- Chris and the team doing very focused and high quality presentations. Building the community and hype surrounding the project (I say 'hype' in best possible meaning) is half job done. I really get itchy fingers to start some serious Phoenix project every time I see Elixir team at work. Respect.

vegabook · on Nov 5, 2015

Couldn't agree more. Hunting around for a new language to do large scale concurrency in, I had shortlisted Clojure and even Golang, but the Phoenix presentations (and BEAM) tipped it for me. I also appreciate the fact that the Phoenix guys constantly address the need not only to target the browser, but other types of endpoints "beyond the browser" with transport adapters, CoAP etc. Certainly has gotten me to buy a bunch of Elixir books recently to get into the ecosystem.

artursapek · on Nov 5, 2015

I love people who spend time explaining things like this. It's how I've learned a lot of what I know about computers.

axman6 · on Nov 6, 2015

I was actually a little annoyed the author didn't explain more, it felt like it was a half finished article.

windlep · on Nov 6, 2015

I have two samples of a websocket echo program in Haskell, and was unfortunately unable to make it not leak memory merely sending the same data back over the connections repeatedly. I will admit I haven't tested this lately, and I didn't use Network.Websocket because of this: https://github.com/jaspervdj/websockets/issues/72

Haskell code with tester is here: https://github.com/bbangert/ssl-ram-testing/tree/master/Hask...

(In some cases it would take some client churn to get the leak to occur, but clients come/go, a server has to survive this basic fact)

One diagnosis was that it was a memory fragmentation leak, since sure enough when I ran some debugging it reported fragmentation losses at the end. Some tweaks to the initial stack size and how much stack each increment should be done remedied it for the most part, but resulted in it taking quite a bit more memory (I believe I had to set the initial stack to at least 4k).

Using -N4 or -N6 (to turn on multi-core/cpu use in haskell) increased memory consumption quite a bit, as Haskell will now start juggling all those threads across real OS threads with its M:N scheduler (Just like Go has to do with its goroutines). It was drastically more memory efficient with -N1, even though that loses the multi-core utilization.

So far the most efficient implementation I've tested is of course... in C. Where a simple websocket echo used 9.5kb of memory per connection (including SSL, and with TCP kernel buffers set to 4k send/recv). I'm not really eager to write C code though, so we're using Python+twisted where a plain WS connection will take about 16kb per conn, or 23kb with SSL.

In this article's test, the kernel buffer for TCP send/recv was set to 1k, which I guess is fine if your payload will frequently be under that. But if you regularly are sending payloads exceeding that, the amount of wakes to keep shoveling data into the kernel buffer is going to be rather expensive.

As the Phoenix people note, its useful to keep in mind what you need to do, rather than merely how many connections you want to hold open.

codygman · on Nov 6, 2015

It looks like the Yesod websockets version fixed the issue?

jw989 · on Nov 5, 2015

This is cool, largely because of the functional perspective to dealing with pools of websockets. I personally think the next level for websockets infrastructure is building a paradigm for large pools of concurrent websocket connections.

Perhaps a microservces implementation with each step (handshake, broadcasting, http overhead) having their own cluster and communicating through channels.

jkarneges · on Nov 5, 2015

Pushpin (http://pushpin.org) may be in line with what you're thinking. It separates connection management from backend logic. In fact the project itself is a handful of microservices (Mongrel2 is used as a separate process for the HTTP handling).

timc3 · on Nov 5, 2015

Uses GNU AFFERO Version 3 license - noted to save someone else from going from happily excited to disappointment.

jkarneges · on Nov 6, 2015

Code that can be run, modified, and redistributed at no cost. There are worse deals. ;)

jw989 · on Nov 6, 2015

Wow this is very close to what I was saying. Thanks for this

jdreaver · on Nov 5, 2015

Kind of off topic, but: can you elaborate on how you handle the state file for nixops? I really want to use nixops for some smallish server deployments, but I don't want to have to share the state files with team members (or worse, keep track of it on some central machine that becomes a single point of failure). That is the only reason I am sticking to Ansible for the time being.

To anyone else who uses nixops, why does nixops use this state file? Can't they use labels like Ansible to identify deployed machines? Can't you just require the user to provide their own secrets instead of auto-generating a new keypair for each machine?

gautier · on Nov 6, 2015

Glad you asked, as we have already written this article: https://blog.wearewizards.io/how-to-use-nixops-in-a-team

jdreaver · on Nov 6, 2015

Ah, I've already read that and I didn't realize you guys wrote it :)

Thanks for the writeup!

TheIronYuppie · on Nov 5, 2015

Disclaimer: I work at Google, on Kubernetes

Really cool example! Just FYI, pre-emptible VMs would only be $0.06 (vs. the bid of $0.10 for AWS Spot Market). I mention because the author talks about cost being a concern.

https://cloud.google.com/compute/pricing#machinetype

amelius · on Nov 5, 2015

Imagine how long it would take to send a simple message to all of those websockets. Having many websockets is nice, but only if the rest of your infrastructure can deal with it.

chrismccord · on Nov 5, 2015

Probably faster than you'd link. With the linked Phoenix results, we have real payloads going out to 2M clients in 1-2s: https://www.youtube.com/watch?v=N4Duii6Yog0

jlouis · on Nov 5, 2015

On a perfect network, I presume. In reality, clients lose packets and strains the system differently. There is more to this than just perfect network conditions in a lab setting.

Refefer · on Nov 5, 2015

Clearly a chatroom with half a million individuals is unusable from pretty much every perspective. That said, a chat server with N chatrooms and a total population of 500k users sounds like a good day on IRC and well within the realm of what something like this could potentially handle.

felixgallo · on Nov 5, 2015

That depends very much on what's doing the chatting. If it's code chatting with other code -- for example, mobile devices receiving near-real-time notifications -- then half a million is just getting started.

nickpsecurity · on Nov 5, 2015

Exactly

issaria · on Nov 5, 2015

Have to upvote this for no reason :)

arrty88 · on Nov 5, 2015

bravo