Hacker News
ORC – Nim's new memory management (nim-lang.org)
240 points by mratsim on Dec 8, 2020 | 34 comments



A couple of questions about how this works.

1. Is "acyclic" checked by the compiler in any way? Or is the programmer basically saying "I'm responsible for making sure there are no cycles, and I accept that memory will leak if I'm wrong"? I imagine the latter, because the compiler check sounds infeasible. (Similarly, Rust's memory safety doesn't guarantee there are no leaks.)

2. How does it work with multiple threads? I see a Q/A here [1] which says

> C: We do not do atomic reference counting, we have a shared heap where you move subgraphs completely to a different thread. A graph is always owned by a single thread, there is no need to make the reference counting atomic. We have mechanisms in development to ensure this at runtime or at compile-time, inspired by the Pony programming language.

and would be curious to read the one-page rather than one-paragraph version of this...

[1] https://forum.nim-lang.org/t/5734


1. As I understand it, the {.acyclic.} pragma is a way of telling the compiler that you take full responsibility. Without the {.acyclic.} pragma, the runtime will spend extra time scanning the objects for cycles.

2. It appears that Nim will use "move ownership of isolated object graph" semantics to pass data from one thread to another. The cycle detection code for freeing an object graph is very similar to the code for moving ownership of an object graph to a different thread.

I would also like more clarification from the creators. The above is just how I understand it right now. Simpler example code would be great.
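For what it's worth, here's a minimal sketch of the pragma in use, based on the Nim manual (my own example, not from the article):

```nim
# Marking a ref type {.acyclic.} promises the compiler that no cycles
# can occur through values of this type, so ORC skips cycle scanning
# for them. If the promise is wrong, cyclic garbage will leak.
type
  Node {.acyclic.} = ref object
    data: int
    next: Node   # only ever points "forward", forming a list, not a cycle

var head = Node(data: 1, next: Node(data: 2))
```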


The isolation technique is an area I've been interested in and following. It's not too well documented, as that area is a work in progress and not really usable currently. Almost, though!

First, it's good to note that ARC improves upon standard reference counting as in Obj-C or Swift by using moves and sinks, which remove many redundant incr's/decr's. That is, if Swift hasn't added that too. The compiler can infer when ownership is moved or not. Simple ideas, but put together in a powerful fashion!
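A rough sketch of how sink parameters remove refcount traffic (the names here are my own, not from the article):

```nim
type Data = ref object
  vals: seq[int]

# `sink` declares that `store` takes ownership of its argument, so
# passing the last use of a value becomes a move: no incref/decref.
proc store(d: sink Data): int =
  d.vals.len

var d = Data(vals: @[1, 2, 3])
echo store(move d)   # explicit move; `d` may not be used afterwards
```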

As I understand the isolation check, it is a compile-time check that an object being moved to another thread is really isolated, with no external references. This can sometimes be done at compile time, similar to acyclic structures, though I don't know the algorithm used. You can also use a runtime check, which appears related to cycle collection: essentially you count whether the internal references account for all the references in a given object graph - probably equivalent in cost to an ORC cycle check for that data sub-graph. Whether done at compile time or at runtime, the plan is to wrap the object in a special 'Option'-like type 'Isolated[T]' that can be accepted by multithreaded constructs. So channels, concurrent queues, etc. can declare that they only accept data that can safely be moved between threads, using the standard move/sink annotations. This also lets users create their own.

It boils down to 'Isolated[T]' just being a data check, and you can move data between threads at your own risk. That's what I'm doing currently, since I'm doing multithreaded embedded programming and using some C queue constructs. The move/sink semantics and ARC handle the actual data movement. In theory you could use the Isolated runtime checks during testing (if your data structures are too tricky for compile-time checks) but disable them for release, similar to cycle checking: there's a setting to have ORC print detected cycles, and then in release you can turn ORC off.
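For the curious, here's roughly what the in-development std/isolation API looks like (my sketch; details may well change as the feature matures):

```nim
import std/isolation

type Msg = ref object
  payload: string

# isolate() verifies at compile time that the expression produces a
# graph with no external references; the resulting Isolated[Msg]
# could then be handed to a channel and moved to another thread.
var boxed = isolate(Msg(payload: "hello"))
var msg = extract(boxed)   # unwrap the graph on the receiving side
echo msg.payload
```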


> As I understand the isolation check, it is a compile time check

You're right that if your language has an affine type system (that statically identifies some references as being unique), then you can use static type analysis to skip the runtime analysis at some call sites (those where the transferred subgraph contains only unique references).


I think you're mostly correct about the {.acyclic.} pragma, except that the "Otherwise without" part needs a proviso: "if static type analysis doesn't prove that objects of the type cannot participate in cycles".

I've worked on a system that used a collector based on the Bacon and Rajan paper linked in the article. It's a very interesting paper. It has something like seven colors, as opposed to the classic three-color collector.

One of the colors is for objects where static type analysis shows that cycles can't exist: ClassA has references to only objects of ClassB and ClassC, ClassC has references to only objects of ClassD, and classes B and D don't contain any references, so objects of any of these classes can't participate in cycles and are acyclic types. If ClassD had references to objects of type ClassA, then there would be a possibility of cycles. Note that tree structures are very common, and static type analysis (without linear/affine types) cannot rule out cycles in recursive data structures such as trees.
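In Nim-ish types, the class layout described above looks like this (my own rendering); nothing refers back up the chain, so no cycle can form:

```nim
type
  ClassB = ref object     # no ref fields: trivially acyclic
    x: int
  ClassD = ref object     # no ref fields either
    y: int
  ClassC = ref object
    d: ClassD             # C -> D only
  ClassA = ref object
    b: ClassB             # A -> B and A -> C: still no path back to A
    c: ClassC

var a = ClassA(b: ClassB(x: 1), c: ClassC(d: ClassD(y: 2)))
```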

Edit: a quick search appears to show Nim has at least some limited support for affine types in its type system, so trees might be a bad example of where {.acyclic.} is helpful.

Presumably, the {.acyclic.} pragma is an escape hatch for those cases where the human knows some invariant that isn't visible from type analysis alone. As you allude, the GC algorithm is still correct without this information, just less efficient.

I also agree that it looks like they run a variant of the GC algorithm when passing a reference between threads: a variant that doesn't actually free unreachable objects, and that pretends to decrement the reference count on the passed object. If the algorithm shows that all objects reachable through the passed reference would have been collected had you really decremented its reference count, then it's safe to pass the reference between threads.

On a side note, my use case of a Bacon-Rajan-type collector was in a codebase for a ___domain-specific language with a functional-reactive programming model. We actually used the GC for pruning dependency graphs, to remove nodes that didn't produce externally visible effects. Someone somewhere in a big C++ codebase for the DSL interpreter missed incrementing a reference count, resulting in live nodes being pruned from the graph under rare circumstances. We put a lot of effort into trying to find the missing refcount increment, but ended up ripping out the Bacon-Rajan collector and going with a dead-simple mark-and-sweep. We only needed to run the GC on the rare occasions that the graph topology changed, so the performance difference wasn't very noticeable.


Ah, interesting to know that Nim's system uses some form of affine/linear types. I'm not sure of the extent or details, as I've been busy making use of other parts of Nim, but I really want to read up on linear types. Nim's type system is powerful enough to track write effects, which would be pretty handy for first-class immutability [1]. Not sure how much that overlaps with linear types, though, or if it's an orthogonal aspect.

    On a side note, my use case of a Bacon-Arjan type collector ...

Interesting story! I enjoy reading those, and I know the Nim team put a lot of work into getting ARC & ORC working well, which also required a lot of work on move/sink in the stdlib. I could see it being very difficult to implement in a big C++ code base. From the article it seems M&S can even be a bit faster than ARC, but with higher latency and memory usage, and it's harder to target platforms like WASM. I'm glad, as it makes Nim excellent for use on microcontrollers! No worries about shared heaps or unpredictable GC pauses.

1: https://nim-lang.org/araq/writetracking_2.html


The total CPU overhead of M&S should be proportional to the number of live objects times the rate of object allocation. (Traditionally, collections were triggered every so many allocations or so many bytes allocated, but heuristics based on the amount of free heap would give similar behavior.)

The total CPU overhead of reference counting is proportional to the rate of reference creation and destruction. You can really bring this down if your compiler is told or can figure out which references can be borrowed.

The best case for reference counting is where the compiler is told (or can infer) which references are borrowed, and the code is written in a style that mostly borrows references, and rarely mutates data structures.
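As a sketch of what "mostly borrows" looks like in practice (this reflects my understanding of Nim's ARC, not the parent's exact claim):

```nim
type Big = ref object
  payload: seq[int]

# A plain parameter is borrowed for the duration of the call, so
# under ARC no refcount is touched when `big` is passed in; only
# copies that escape the call would need an increment.
proc inspect(b: Big): int =
  b.payload.len

let big = Big(payload: @[1, 2, 3])
echo inspect(big)
```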

One interesting hybrid approach is "one bit reference counting". You have a 1-bit flag in the object header that gets set when a reference to the object gets copied. If a reference gets destroyed, and the object doesn't have the flag set, that means that the referenced object is garbage. Lots of objects never have more than one reference to them, and this little tweak allows for more immediate reclamation of many objects, which allows you to get away with doing tracing GC less often.
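A toy model of the idea (all names invented; real implementations keep the bit in the object header or in the pointer itself):

```nim
type
  Cell = object
    shared: bool   # the "one bit": set once a second reference is made
    data: int

proc copyRef(c: ptr Cell) =
  c.shared = true            # the object may now have multiple refs

proc destroyRef(c: ptr Cell) =
  if not c.shared:
    dealloc(c)               # provably unique: reclaim immediately
  # otherwise leave it for the (less frequent) tracing pass

var c = cast[ptr Cell](alloc0(sizeof(Cell)))
destroyRef(c)                # never copied, so freed right away
```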


If it's verifiable by the compiler, then it shouldn't need a pragma; it could just be optimised automatically. (OK, it might be useful as a hint.) I'm guessing there are grey areas where this becomes some kind of halting-problem-type issue, though - you won't know until you run it. Constructing a graph that you know to be a tree would be a trivial example.

Runtime checks would be possible but then you'd have to check every time you create an object.


Pretty sure no efficient run-time algorithm for general cycle detection exists. It has (AFAIK) been a big area in garbage collection for a long while, so smart people have been all over it.

Not sure if one exists for compile-time either.


I don't think so either, but the only time a cycle can be introduced is when one modifies an object to make it point to a younger (created later in time) object.

In some languages, that’s relatively rare because modifying objects after they are created is rare; immutable objects can be part of cycles, but they can’t be the cause of cycles. That may be enough to make the overhead of cycle detection small.

I guess one could keep track of those ‘dirty’ links, and use them to avoid having to detect cycles in most cases.

It might be enough to add a single bit to each object-to-object pointer indicating “this reference was set after construction of the object” (could be an unused bit in a virtual address, or the least significant bit, so it wouldn’t introduce memory overhead).

With such bits, at first sight, one would only have to check for cycles when P references a _possibly_ younger Q, and destruction of some object (indirectly) decreases P’s ref count to 1. However, things probably would get hairy when objects may have many outgoing and/or incoming pointers (consider a set of 3 objects each referencing the other two).

I’ll go and read the referenced paper (https://researcher.watson.ibm.com/researcher/files/us-bacon/...) to see how it handles this.


How do GCs collect cycles then? I'm pretty sure most modern GCs handle cycles effortlessly.


It depends on what you call ‘effortlessly’.

And, technically, GCs collect non-garbage, and throw away what remains. They don’t care whether what they throw away contains cycles or not, but have to take care that they don’t keep running in circles when they follow links between objects while collecting the non-garbage.


Graph scans are trivial to implement though.


Would you consider a full graph scan efficient? Would you like a full graph scan after every allocation, millions of times a second?

I don't understand why you said that.


The simplest way to do this is to use the low bit of the address (so long as you align your memory to at least 2 bytes this is fine as all addresses will end in 0) as a marker for whether you've already visited that pointer.

See https://en.wikipedia.org/wiki/Tracing_garbage_collection#Na%...

And yes, you're correct that GCs handle cycles without issue (that's a large part of the reason to use tracing GC over other memory management strategies like ObjC's auto-reference counting).


Don't know if any GCs use it, but if modifying the pointers is not possible, one alternative is Floyd's cycle-finding algorithm [1], which uses constant storage (two pointers).

[1]: https://en.wikipedia.org/wiki/Cycle_detection#Floyd.27s_Tort...
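For reference, the tortoise-and-hare on a singly-linked list is only a few lines (my sketch):

```nim
type Node = ref object
  next: Node

# Floyd's algorithm: a fast cursor advances two hops per slow hop;
# the two cursors can only meet again if the list loops back on itself.
proc hasCycle(head: Node): bool =
  var slow = head
  var fast = head
  while fast != nil and fast.next != nil:
    slow = slow.next
    fast = fast.next.next
    if slow == fast:
      return true
  false
```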


"Finding cycles" is not really the problem a GC is trying to solve. It's trying to find all of the non-reachable data, across a possibly cyclic graph. It just has to ensure that it doesn't loop infinitely while traversing that graph.


And that would be the reminder not to post before breakfast :)

Still, fun algorithm.


I purposefully used the word 'efficient' in my original post. Of course you can find cycles in many ways, but Floyd's (or similar) algorithms are not acceptably fast and never can be in general.


Swift leaves it in the hand of programmers and says "use weak references".

Otherwise, traditional GCs use a copying mechanism: whatever doesn't end up copied is dead.

The old reference counting GC in Nim had that extra copying pass when a type could have cycles.

That said, it's very easy to tell whether a type can have cycles at compile time: you just need to check if it refers to itself, i.e. does the Node type have a Node field (or a field that can recursively contain a Node field).
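Concretely, the check amounts to looking at the fields (my example):

```nim
type
  Interval = ref object  # no ref fields: values can never form a cycle
    lo, hi: int
  Node = ref object      # has a field of its own type: cycles are possible
    next: Node

var n = Node()
n.next = n   # the type admits this cycle, so Node is not acyclic
```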


> easy to tell if a type can have cycles or not at compile-time, you just need to check if it refers to itself

Certainly useful at times, as I suppose it may be used to ensure you don't have cycles (I imagine that could be useful for resource management), but 'can have cycles' is definitely not 'does have cycles'. It's the latter that's important here.


And the ``{.acyclic.}`` hint applies to the latter.

In the future we might have formal verification helping as well: https://nim-lang.org/docs/drnim.html


Fair question, was unclear.

Cycle detection at the point of creation (when a pointer is set). You could do a full scan of the data-structure graph after every malloc, but that's equivalent to your GC, and clearly horribly, unacceptably expensive.


You can read about the reasoning behind subgraph ownership here: https://github.com/nim-lang/RFCs/issues/244


Thanks. If I'm understanding correctly: with Isolated (which is still being built), any given object is reachable from only one thread. It can be used for message-passing but not shared memory.

It looks like they've talked about shared memory as important for efficiency, eg https://nim-lang.org/araq/concurrency.html (from back in 2013 apparently, and I don't know if it's a "canonical" project view or not) says:

> For real CPU efficiency on commodity hardware shared memory is unavoidable. Message passing doesn't cut it and ultimately doesn't address the same problems. (http://www.yosefk.com/blog/parallelism-and-concurrency-need-...) Also shared mutable memory is unavoidable in a systems programming context.

(I agree with that statement, fwiw.)

It sounds like they haven't built support for shared memory yet; correct? and as this Isolated construct is also in progress, I think there's basically no threading support in Nim yet?

Are they still intending to build shared memory as well as Isolated? Is it intended to be memory-safe (as in the Rust terminology) or not?


> It sounds like they haven't built support for shared memory yet; correct? and as this Isolated construct is also in progress, I think there's basically no threading support in Nim yet?

You can do shared memory already. There are concurrent hash tables and lists, and full threading support. But currently it's on the programmer not to mess up the GC on shared data (eg ensuring data graphs are isolated, but it's not too tricky). I use the ESP32's shared queue data structure in the esp-idf to move Nim's native JSON types between FreeRTOS tasks. It's non-copying and more efficient than normal message passing. I also use some shared globals. It all works pretty well.


It's great that Nim is pushing boundaries in the memory management space. I remember reading how Java and Go were able to scan large heaps faster and faster. But what if you don't even scan the heap? Brilliant.


I may be missing something here, but reference counting + a cycle detector is what Python has been doing for many years now?

In my experience, it's the worst of both worlds. You no longer have the full determinism of reference counting - you only retain it for non-cyclical structures, and in a large enough program, it can be easy to introduce those without noticing (with callbacks etc). But you still have the overhead of RC on every reference assignment.


An important thing is that with move semantics and good analysis for those at compile time, a lot of assignments can be moves and not need refcount changes. This fact actually comes up in Nim with ARC a lot. Refcounted refs are pretty explicit in Nim and you can use value types for the most part. That allows it to be pretty clear in your architecture what data has a refcount associated, and you can opt into that in a conscious way. The explicitness doesn't make the code look weird or unidiomatic either, it feels like a natural part of the language that it gives you the tools to do this and is clearly a core part of its design.

The way I've gone about it in Nim so far, the refcounts are actually rare, especially copies (which is where increments occur). You can also look at the generated C and it's pretty clear where the overhead is happening, and I've found that in practice it happens rarely enough that this makes a lot of sense. It's pretty different from how things go in Python. In my project I also actually just use ARC and for sure don't have cycles, and the destructor calls are deterministic.

Here's a talk that's somewhat related: refcounts in the "Lobster" programming language, and how it also elides refcount calls using compile-time logic (which ends up being roughly equivalent to a framing in terms of move semantics): https://www.youtube.com/watch?v=WUkYIdv9B8c


Also, we don't scan "live" memory, which matches your intuition -- why would we?


Reference counting is old hat.


There's a lot of praise for ORC[1], but is there any discussion of the downsides of this approach? One potential downside is increased compile times, something that hasn't ever been a problem for Nim. However, I'm not sure what kind of load ORC puts on the compiler, and how that will scale to large projects.

[1]: https://forum.nim-lang.org/t/6483


I have been using `--gc:arc`/`orc` for some time, but I haven't noticed any difference in compilation speed.


Extremely cool. Reference counting with cycle collection might be an idea whose time has finally come.





