
Does Erlang support structural sharing between large data structures? If not, how is it possible to efficiently pass large messages between processes on the same physical machine?



The only things that are passed "efficiently" between processes are large binary blobs. Everything[0] else is copied. Up to a point, this is efficient: a copy removes sharing, which is really good for modern caches, and it is good for garbage collection.

However, when the data becomes too large, Erlang programmers often choose to bring the computation to the data rather than the other way around. Passing a function is efficient. Keeping data cache-local in a process is efficient.

[0] Except literal data.
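A minimal sketch of "bring the computation to the data" (the module and function names are mine, not from the thread): keep the large term inside one process and ship a small fun to it, so only the result crosses the process boundary.

```erlang
%% Hypothetical sketch: a process owns a large term; callers send it a
%% fun instead of asking for the data itself.
-module(data_owner).
-export([start/1, query/2]).

%% Spawn a process that holds a (potentially large) term on its own heap.
start(Data) ->
    spawn(fun() -> loop(Data) end).

%% Ship a small fun to the data instead of copying the data out.
query(Pid, Fun) ->
    Ref = make_ref(),
    Pid ! {query, self(), Ref, Fun},
    receive
        {Ref, Result} -> Result
    end.

loop(Data) ->
    receive
        {query, From, Ref, Fun} ->
            From ! {Ref, Fun(Data)},   %% only the (small) result is copied
            loop(Data)
    end.
```

So `data_owner:query(Pid, fun erlang:length/1)` copies a fun and an integer between heaps, never the list itself.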


Passing on a local machine is optimized quite well. The API is the same as passing to a remote machine, but serialization is omitted when passing locally. A fair amount of copying still happens, but (I assume--now we're into territory that I am less sure about) that keeps the implementation robust and simple (i.e. there's no "ownership" to wrangle down in the BEAM; at the highest level of abstraction inside the VM, it's just copies of immutable structures).

For communicating quickly locally, you could use large/refcounted binaries (http://erlang.org/doc/efficiency_guide/binaryhandling.html) to avoid copying on message pass. Those generally tend to be frowned upon, though, and incur the usual headaches/perf issues of a ref-counted structure as well.
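To illustrate (a sketch, with a made-up module name): binaries larger than 64 bytes are stored off-heap and reference-counted, so sending one to another local process copies only a small handle, not the payload.

```erlang
%% Sketch: a ~10 MB binary is refcounted, so the send below copies a
%% small reference into the receiver's heap rather than 10 MB of bytes.
-module(big_bin).
-export([demo/0]).

demo() ->
    Big = binary:copy(<<"x">>, 10000000),  %% ~10 MB off-heap binary
    Self = self(),
    Pid = spawn(fun() ->
        receive
            {bin, B} -> Self ! {got, byte_size(B)}
        end
    end),
    Pid ! {bin, Big},                      %% cheap: only the handle is copied
    receive {got, Size} -> Size end.
```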

In short, I'd suggest defaulting to the assumption that Erlang is fast enough at passing messages locally for it not to matter that much in practice. Benchmark the throughput of your system locally, and if the overhead of message passing isn't a major pain point immediately, move on up the stack. If you're doing scientific computing and want to number-crunch huge data sets in parallel via some minimal-coordination memory-map-backed system, this is not the platform for you (or use a NIF).


How is it optimized well when there is still lots of copying going on? Anything in the same process space should just be a matter of passing a pointer or a pointer with very little information wrapping it up.

> Those generally tend to be frowned upon, and incur the usual headaches/perf issues of a ref-counted structure as well.

Why would reference counting be worse than copying?


I think what is frowned upon is serializing values into a refcounted binary instead of passing around native terms. If you pass around the same value repeatedly, you can probably save some copies that way, but it makes for larger code.

Nobody will complain if it's needed, but chances are, it's not needed. (It depends on what you're doing though).

Plus, refcounted binaries scare people. Until recent versions, references didn't play nicely with GC triggers, so it was easy for a very efficient proxy process to never trigger GC while still holding references to a lot of garbage. If you're not careful with references to pieces of binaries, you can do something similar too: maybe you got a huge binary but only need to save a few bytes, and it's easy to end up keeping a reference to the huge thing. (Hint: use binary:copy/1 to get a clean copy of the small piece that you want to keep in your state or in ets.)
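The sub-binary trap above can be made visible with binary:referenced_byte_size/1 (module name is mine; this is just a sketch): an 8-byte slice matched out of a 1 MB binary still pins the whole megabyte until you binary:copy/1 it.

```erlang
%% Sketch of the sub-binary trap: a small slice can pin a huge binary.
-module(bin_slice).
-export([demo/0]).

demo() ->
    Huge = binary:copy(<<"a">>, 1000000),
    %% Matching out a segment creates a sub-binary that references Huge.
    <<Keep:8/binary, _/binary>> = Huge,
    Pinned = binary:referenced_byte_size(Keep),          %% whole 1 MB
    Freed  = binary:referenced_byte_size(binary:copy(Keep)),  %% just 8 bytes
    {Pinned, Freed}.
```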


Because refcounting means a lot of cache misses on the CPU core.

One of the interesting advantages of Erlang is that its processes are small and mostly fit in the L2 cache. So you can load a whole process at once, then compute for your whole time slice, all in cache.

The result may surprise you.


Why would reference counting increase cache misses?

> One of the interesting advantage of erlang is that its processes are small and mostly fit in L2 memory.

Are you talking about data or are you talking about execution? I wouldn't think execution would need reference counting.

> So you can load a whole process at once, then compute for your whole slice, all in cache.

Whatever data is being loaded still has latency. If it is accessed linearly it can be prefetched. If it is being copied around, that won't put it into a CPU's cache any faster, and it will require using the writeback cache, not to mention all the cycles spent copying.

I'm not sure how data that already exists in memory and potentially in L3 caches, etc. is going to be hurt by reference counting.

> The result may surprise you.

Do you have results that I can look at? If not I think they will surprise both of us.


> How is it optimized well when there is still lots of copying going on? Anything in the same process space should just be a matter of passing a pointer or a pointer with very little information wrapping it up.

Sure, you could pass a pointer around. But then you'd have all the usual headaches about concurrent modification (some of this stuff is mutable when you get down to the implementation/pointer level) and garbage collection (who has it?).

It's optimized well enough that the perf penalty is minor. Copy-on-write semantics, "tail-send" optimizations (I don't know what the technical term for this is, but: data or pointers are moved if a message send is in particular positions in code, without getting the GC involved at all until later) and more are all used where they make sense.

Could it be faster if you built it around a zero-copy runtime? Sure. So could Perl. But then you'd be building your own runtime, with different priorities and tradeoffs than the one that already exists--one whose biggest benefit is that you don't have to build it yourself.

Erlang doesn't promise zero copies locally, but nor does it take the most wasteful possible path; I am not a deep BEAM expert, but am consistently surprised at how well it works. Seldom have I wondered "hmm, the runtime could do something fast/cool with this particular type of code; I wonder if it does..." and, upon checking the available sources/debugger output, been disappointed.

The runtime isn't built around hyper-efficient zero-copy semantics (it has a garbage collector, an M:N scheduler, a bloody instruction counter routine that runs every time you invoke any function at all, for heaven's sake). For all that, it's more than fast enough for anything besides high-performance scientific computing/number crunching locally. If you need to do that, use a language that lets you send pointers between threads exactly how you want. If you still want to use that from Erlang, add some reduction-count markers in your $fast_language code and hook it up to Erlang via a NIF.

The BEAM doesn't waste time/complexity more than necessary on hyper-efficient local optimizations so it can focus on its main strengths: really good concurrency and parallelism, and making a near-bulletproof, super-high-performance set of primitives for distributing your code with minimal changes. Of all the "you call this function as if it were local, but it actually runs remotely in nearly the same exact way!" tools out there, even ones invented recently with the benefit of history to learn from, Erlang/BEAM's implementation comes the closest to fulfilling that promise IMO.

> Why would reference counting be worse than copying?

Refcounting is a valid strategy in some cases, but has drawbacks as well. It doesn't handle cycles well, and requires some stop-the-world events (or lots of locking complexity) for garbage collection in concurrent environments. More details on Google, this thread goes into the differences reasonably well: https://www.quora.com/How-do-reference-counting-and-garbage-...


I believe that BEAM implements copy-on-write, so unless you create a copy with modifications it should minimize copies. Values are immutable, so this eliminates a lot of the need for copying during message passing.


When you pass a message, all the terms will be copied into the memory space owned by the receiving process. Refcounted binaries don't copy the value, just the reference of course (as well as updating the reference count information). This results in a lot of copying, but also makes process memory management very simple: when a process dies, you need to clean up the global state: decrement refcounted binaries, send notifications for linked processes and monitors, and release all the process heap memory. You don't need to inspect the process memory or worry about dangling references to it.

Refcounted binaries are (mostly) effectively copy on write though; if you create a term that's a modified form, a new value will be formed.
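A tiny sketch of that "modified form is a new value" point (hypothetical module name): an "update" to a binary builds a fresh term, and the original is untouched.

```erlang
%% Sketch: terms are immutable, so "changing" the first byte of a binary
%% produces a new binary; the original is unchanged.
-module(immut).
-export([demo/0]).

demo() ->
    Orig = binary:copy(<<0>>, 100),
    Changed = <<1, (binary:part(Orig, 1, 99))/binary>>,
    {binary:first(Orig), binary:first(Changed)}.
```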


Yes; data structures are immutable in Erlang, so the runtime system can share them (and their substructures) freely amongst its threads.


I don't believe everything is shared like this. Binaries smaller than 64 bytes are copied around between process heaps.

https://hamidreza-s.github.io/erlang%20garbage%20collection%... is a great post.


BEAM does not share structures like this. It would make garbage collection much harder. Immutability and shared nothing makes the GC simple. You couldn't compact a process heap if parts of it were in use by other processes.


I really like this blog post on Erlang's garbage collection and memory structure.

https://hamidreza-s.github.io/erlang%20garbage%20collection%...


If you want to limit yourself to a single machine: isn't that what files are for? Sharing large data between processes?


Sometimes you want to set up an efficient processing pipeline on a single machine, e.g. when writing a database engine. In that case, you don't want to serialize all communication between processes. Of course, other processes (not related to the efficient pipeline) can live on different machines.


Who said anything about serializing? To coordinate multiple processes via a file-backed interface, you still might have to wait for the disk (unless you're using /dev/shm), but you don't necessarily have to serialize/deserialize data. Most people tend to confuse the two.

You can cast bytes from a memory-mapped file just fine, if you trust the file/other writers, or are willing to sacrifice some safety. There are also a lot of serialization systems that are simple/fast enough as to basically be structs, e.g. flatbuffers: https://google.github.io/flatbuffers/. Even those still don't avoid that pesky copy, though; for that you'll have to interact with raw memory and hope it's in the right layout for your purposes.
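As a sketch of "raw bytes without a deserialization step" (the file name and record layout are invented for the example): Erlang's bit syntax lets you pattern-match fixed-layout fields straight out of a file's bytes, with no serialization library in between.

```erlang
%% Hypothetical fixed layout: a 32-bit id, a 16-bit flags field, then a
%% tag. The fields are matched directly out of the raw bytes.
-module(raw_read).
-export([demo/0]).

demo() ->
    Path = "record.bin",
    ok = file:write_file(Path, <<42:32/little, 7:16/little, "hi">>),
    {ok, Bin} = file:read_file(Path),
    <<Id:32/little, Flags:16/little, Tag/binary>> = Bin,
    ok = file:delete(Path),
    {Id, Flags, Tag}.
```

The match itself copies nothing but the small integers; Tag is a sub-binary into Bin.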



