I still don't see any comparison of Go's current collector with a generational GC with bump allocation in the nursery. The article states that copying would have been too much work to implement in 2014. Fair enough. But that's an engineering decision specific to Go's circumstances. It does not necessarily support forgoing generational garbage collection in general.
The primary advantage of generational GC is the huge throughput advantage that bump allocation in the nursery gives you. If you're not comparing against such a GC, then you're not properly comparing non-generational GC to generational GC. The write barriers you need reduce throughput, but they're ordinarily counterbalanced by the throughput gains you get from bump allocation. Of course you will take a throughput loss if your generational GC is non-copying; that's why people don't usually deploy non-copying generational GCs.
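For readers who haven't seen it spelled out: bump allocation means the allocator is little more than a pointer that moves forward through a contiguous region, which is why it is so much cheaper than free-list allocation. A minimal, hypothetical sketch in Go (the names and sizes are made up; this is not any real runtime's allocator):

```go
// A minimal sketch of bump allocation, as in the allocate() function of
// Cheney's algorithm: allocation is one bounds check plus one add.
package main

import "fmt"

type bumpRegion struct {
	buf  []byte
	next int // offset of the next free byte
}

// alloc returns n bytes from the region, or nil if the region is full
// (a real collector would trigger a collection/evacuation here).
func (r *bumpRegion) alloc(n int) []byte {
	if r.next+n > len(r.buf) {
		return nil // out of space: time to collect
	}
	p := r.buf[r.next : r.next+n]
	r.next += n // the "bump"
	return p
}

func main() {
	r := &bumpRegion{buf: make([]byte, 1024)}
	a := r.alloc(16)
	b := r.alloc(32)
	fmt.Println(len(a), len(b), r.next) // 16 32 48
}
```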
That's not my understanding of the primary win from generational GC. You can use bump allocation with an old-fashioned two-space GC. In fact that's what the allocate() function on https://en.wikipedia.org/wiki/Cheney%27s_algorithm does.
Rather, the win comes from assuming the infant mortality hypothesis - that the probability of an allocation being garbage is inversely proportional to its age. Thus you split your allocations into three logical buckets. The first bucket is recent allocations that are super quick to trace (e.g. no bigger than the CPU cache); the third bucket is long-lived allocations, ideally reachable from global roots rather than short-lived stack frames; and the second bucket is all the stuff that survives collections of the first bucket (hopefully only rooted by code in flight on the stack) and that you want to keep out of the third bucket, because the third bucket is so damn expensive to trace.
You need write barriers for generational GC, but you need them for concurrent GC anyway.
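For concreteness, this is roughly what a generational write barrier does, sketched by hand (the types and the remembered-set representation are invented for illustration; real runtimes emit this in compiled code rather than as a library call):

```go
// A hand-written sketch of a generational write barrier: every pointer
// store into the heap also records old->young edges in a remembered set,
// so a minor collection can find roots in the old generation without
// scanning it.
package main

type object struct {
	young bool
	ref   *object
}

var rememberedSet []*object // old objects that may point into the nursery

// writeRef is what a compiler-inserted barrier conceptually does on
// `dst.ref = src`.
func writeRef(dst, src *object) {
	if !dst.young && src != nil && src.young {
		rememberedSet = append(rememberedSet, dst) // remember old->young edge
	}
	dst.ref = src
}

func main() {
	old := &object{young: false}
	kid := &object{young: true}
	writeRef(old, kid)
	println(len(rememberedSet)) // 1
}
```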
Generational GC is ideal for servers that have a predictable amount of scratch allocation per request. If the first and second buckets are sized correctly and collected at the right time, and the server never makes anything reachable from the third bucket (like a cache - use reusable serialized arrays for your caches in generational GC languages, people, not weak references!), then you're in a great place. It approaches arena allocation in terms of efficiency. I wrote app servers in .NET back in the mid 2000s that barely hit 2% of CPU time in GC following these principles, and they allocated pretty freely.
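To make the "serialized arrays, not object graphs" advice concrete, here is a rough Go rendering of the idea (the original point is about .NET; the type and helper names below are made up). The long-lived cache holds flat byte slices, so the tenured generation contributes almost no pointers for the collector to trace, and the deserialized values are short-lived scratch:

```go
// Cache serialized bytes rather than pointer-rich object graphs.
package main

import (
	"encoding/json"
	"fmt"
)

type user struct {
	Name  string
	Email string
}

var cache = map[string][]byte{} // long-lived cache of serialized values

func put(key string, u user) error {
	b, err := json.Marshal(u) // flatten the object graph once
	if err != nil {
		return err
	}
	cache[key] = b
	return nil
}

func get(key string) (user, bool) {
	b, ok := cache[key]
	if !ok {
		return user{}, false
	}
	var u user
	if err := json.Unmarshal(b, &u); err != nil {
		return user{}, false
	}
	return u, true // the deserialized value is short-lived scratch
}

func main() {
	_ = put("a", user{Name: "Ada", Email: "ada@example.com"})
	u, _ := get("a")
	fmt.Println(u.Name)
}
```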
I think generational GC outside of a server request loop context, or with wildly variable allocation profiles, is less great. Not bad, mind you, but the generational infant mortality hypothesis - and, more importantly, its assumption that objects don't die in middle age - is less reliable.
(I wrote this before I finished with the actual article. The point in the article about avoiding write barriers with hashing is interesting in particular - somewhat surprised it scales to 100+G heaps, if indeed it does.)
Yeah, it depends on how you weight the improved speed of minor collections (latency) vs. the improved performance of allocation itself (throughput).
(I don't consider pure semispace GCs to be practical because of memory usage. One benefit of generational GC is that it makes Cheney scans practical by confining the semispace collection to TLABs or similar, which are by their nature small.)
That's correct. The primary performance benefit of generational GC (in terms of throughput and maybe also latency) is that it can reclaim the memory (backing the young generation) without marking or otherwise processing the vast majority of the live data (contained in the old generation).
> The write barriers you need reduce throughput, but they're ordinarily counterbalanced by the throughput gains you get from bump allocation.
Bump allocation significantly increases throughput only if you allocate a lot. If you don't, maybe you gain more by avoiding write barriers. And, as explained in the article, Go tends to allocate less because it is value-oriented (instead of reference-oriented, where most values are "boxed") and thanks to escape analysis.
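A small illustration of the value-oriented point, assuming nothing beyond standard Go: a slice of struct values is a single allocation with the elements stored inline, whereas a reference-oriented ("boxed") layout needs a separate heap object per element plus the pointers to reach them:

```go
// Value-oriented vs. reference-oriented layout of the same data.
package main

import "fmt"

type point struct{ X, Y float64 }

func main() {
	// One backing array holds 1000 points inline; nothing is boxed and
	// there are no per-element pointers for the GC to trace.
	values := make([]point, 1000)

	// The shape a reference-oriented language would use: 1000 separate
	// heap objects plus the slice of pointers to them.
	boxed := make([]*point, 1000)
	for i := range boxed {
		boxed[i] = &point{}
	}

	fmt.Println(len(values), len(boxed))
}
```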
As the article also states, it's possible to remove those write barriers by hashing. I don't see any reason why it shouldn't be possible to use the same scheme for copying generational GC.
The most important point, though, is that you don't know whether the hypothesis that Go wouldn't benefit from generational GC is true until you test it. Nobody has tested it. And there is plenty of indirect evidence to suggest that this is not true. For one, Go programmers have a habit of manually microoptimizing short-lived allocations to keep escape analysis "happy", which is something that becomes much less important with bump allocation in the nursery. Another simple observation is that .NET has the same situation regarding value types as Go does, and .NET uses a copying generational garbage collector. A final observation is that certain situations, like indirect calls [1], cause objects to be spilled to the heap unconditionally.
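A hedged illustration of the indirect-call point in Go (the function and type names are invented; the actual decisions can be checked with `go build -gcflags=-m`): when the callee is only known at run time, escape analysis has to assume a pointer argument escapes, so the value is heap-allocated even though it is only used as scratch:

```go
// Direct vs. indirect calls and their effect on escape analysis.
package main

type payload struct{ n int }

// Direct call: the compiler can see that p does not escape,
// so the payload can stay on the caller's stack.
func directUse(p *payload) int { return p.n }

// Indirect call: the compiler cannot see through fn, so &p is assumed
// to escape and the payload is heap-allocated.
func indirectUse(fn func(*payload) int) int {
	p := payload{n: 42}
	return fn(&p)
}

func main() {
	p := payload{n: 7}
	_ = directUse(&p)
	_ = indirectUse(directUse)
}
```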
But that's what the article says: "We started creating tools to help our users find objects that escaped and if it was minor they could make changes to the code and help the compiler allocate on the stack."
"Tools used by a small minority of users to find performance issues," != "Go programmers have a habit of microoptimizations to avoid heap allocations." You are being disingenuous.
> The most important point, though, is that you don't know whether the hypothesis that Go wouldn't benefit from generational GC is true until you test it. Nobody has tested it.
I agree. But you would probably agree it would be a fairly difficult endeavour to test your hypothesis? To make a fair comparison, we'd need to implement an alternative version of the Go compiler and runtime, replacing the current GC with a copying generational GC, with a similar multi-year effort in optimizing and refining it.
I would note that the JVM has done inter-procedural escape analysis for quite some time (via inlining), and the new Graal compiler does even more advanced inter-procedural escape analysis. In particular, JVMs do not do stack allocation; they do something better called scalar replacement, which allows tricks like deleting variables from the middle of a scalar-replaced object. Effectively, every field of the non-escaping object becomes a local variable that can then be studied, removed, moved around, etc. The talk says that Go doesn't do inter-procedural escape analysis at all.
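To illustrate what scalar replacement means, here is a hand-written before/after sketch, using Go syntax purely for illustration (the optimization described above is performed automatically by the JVM/Graal JIT, not written out by the programmer):

```go
// Before/after view of scalar replacement on a non-escaping object.
package main

import "fmt"

type pair struct{ a, b int }

// Before: the object exists as a unit.
func sumBefore() int {
	p := pair{a: 1, b: 2}
	return p.a + p.b
}

// After scalar replacement: the non-escaping object is dissolved into
// its fields, which become ordinary locals the optimizer can move,
// fold, or delete independently. No allocation remains.
func sumAfter() int {
	pA := 1
	pB := 2
	return pA + pB
}

func main() {
	fmt.Println(sumBefore(), sumAfter())
}
```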
Escape analysis and value types are both useful techniques, but they don't eliminate the need for a fast GC.
I would actually be quite interested in a study of whether equivalent Go/Java programs allocate more or less, especially when pitted against Graal. Presumably it's not hard to prepare such a test.
looking at go-gc it seems to be really successful. for a lot of the apps people are running, they're more concerned about high pause times and reducing tail latencies than having the best throughput. maybe crappy allocation speed and fragmentation add 1ms to every request an app handles, but avoiding random 20ms pauses is worth it. this is the same tradeoff the next generation of 'pauseless' java collectors are making: basically have mutator threads do a lot more work in exchange for lower pause times.
the weird thing is java doesn't seem to offer this as an option, because it seems like it would be something simple to add. basically CMS + jemalloc. maybe the performance is actually that bad that it would not be usable on java, where there is a lot more heap allocation going on.
the design of go generally seems to use simple solutions where other people use complex solutions. for interface dispatch, go has fat pointers, which have middling performance, while java uses the JIT to remove dynamic dispatch where it can (great performance) or falls back to searching the class for the correct interface vtable to dispatch the method call (terrible performance).
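To make the fat-pointer point concrete: a Go interface value is a two-word pair (method table pointer plus data pointer), and a call through it is an indirect call via that table rather than a devirtualized direct call. A tiny, self-contained example:

```go
// An interface value in Go is a two-word "fat pointer".
package main

import (
	"fmt"
	"unsafe"
)

type shape interface{ area() float64 }

type square struct{ side float64 }

func (s square) area() float64 { return s.side * s.side }

func main() {
	var s shape = square{side: 3}
	fmt.Println(unsafe.Sizeof(s)) // 16 on 64-bit: two words
	fmt.Println(s.area())         // dispatched through the interface's method table
}
```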
Isn't JSC kind of an exception to this? JSC has a generational non-copying GC. It uses bump-pointer allocation for empty blocks (a block = memory holding objects of the same size) and then switches to allocation via a free list if the block is full.
For implementing the generational GC they simply use sticky mark-bits.
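A rough sketch of the bump-then-free-list scheme described above, written in Go with made-up sizes and no real collector behind it: cells in a fresh block are handed out by bumping an index, and once the block's untouched cells are exhausted, allocation falls back to a free list of cells reclaimed by sweeping:

```go
// Per-size-class block allocator: bump first, free list afterwards.
package main

import "fmt"

const cellSize = 32
const cellsPerBlock = 64

type block struct {
	cells    [cellsPerBlock][cellSize]byte
	bumpNext int   // next never-used cell (bump allocation)
	freeList []int // indices of cells reclaimed by a sweep
}

func (b *block) alloc() []byte {
	if b.bumpNext < cellsPerBlock { // fast path: bump through fresh cells
		i := b.bumpNext
		b.bumpNext++
		return b.cells[i][:]
	}
	if n := len(b.freeList); n > 0 { // slow path: reuse swept cells
		i := b.freeList[n-1]
		b.freeList = b.freeList[:n-1]
		return b.cells[i][:]
	}
	return nil // block exhausted; a GC would pick another block
}

func main() {
	var b block
	for i := 0; i < cellsPerBlock; i++ {
		b.alloc()
	}
	b.freeList = append(b.freeList, 5) // pretend a sweep freed cell 5
	fmt.Println(b.alloc() != nil)      // true: served from the free list
}
```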
I don't consider what JSC does bump allocation—it's just a highly optimized traditional allocator. Most malloc implementations have bump allocation somewhere.
SpiderMonkey used to use an allocator much like that of JSC before it switched to copying generational GC, to significant performance improvements.