Getting to Go: Garbage collection and runtime issues

kjeetgill · on July 17, 2018

Now this is the kind of post I come to HN for. Deeply technical explanations of real expert work!

I'm particularly keen on some of the details around goroutines since Project Loom hopes to bring something similar to the JVM and it's nice to see what the design space looks like.

Another is that go sometimes gets dinged for not being modern enough in a bunch of different directions, e.g. no rust-like immutability/borrowing, generics, or generational GC, but there is a lot of thought put into what they are working on and where they want to focus the language.

steveklabnik · on July 17, 2018

There's so many good quotes in this article. Two of my favorites:

"It isn't that the generational hypothesis isn't true for Go, it's just that the young objects live and die young on the stack."

"... we are led to the conclusion that memory, or at least chip capacity, will follow Moore's law longer than CPUs"

Kudos to the Go team!

pjmlp · on July 17, 2018

"No more C, no long tail of bugs due to C coders not understanding GC but having a cool idea about how to copy strings."

_wldu · on July 17, 2018

I've seen more home made string implementations in C than I can remember. Most all of them were horrible. Say what you like about C++, but it certainly fixed that mess. And I love Go... I think it's the future.

pjmlp · on July 17, 2018

> Say what you like about C++, but it certainly fixed that mess.

Already in the MS-DOS days, even if string class back then was compiler framework specific, and part of ritual of passage as C++ developer.

munificent · on July 18, 2018

I've seen more home-made string implementations in C++ than C. :-/

pjmlp · on July 18, 2018

I bet all of them derive either from the days before std::string, which only started to be commonly supported after 2000, as compilers were catching up with C++98.

The typical rite of passage that every C++ developer needs to write their own string and vector classes as part of learning C before C++, instead of properly C++ as we can do nowadays.

Finally those shops that don't allow to use proper C++ on their code, rather C with Classes.

fnord123 · on July 18, 2018

Don't forget loggers! It seems like every C++ programmer wants to write their own logger. :-S

pjmlp · on July 18, 2018

Nah, I guess you mean reflection libraries. :)

paulddraper · on July 18, 2018

Ah, good ol QString

dullgiulio · on July 17, 2018

Indeed, the quote about the generational hypothesis is particularly important since this comes up at every Go GC discussion, especially when someone brings up the "more advanced" GCs in Java.

The generational hypothesis is not about the heap, it's about all allocations.

ebikelaw · on July 17, 2018

The second statement is interesting, because it contradicts the thinking within Google (platforms and capacity management) that in the long run we're all deeply screwed on the DRAM front, whereas CPU performance will probably not hit a wall any time soon.

yencabulator · on July 19, 2018

That might be a question of latency vs bandwidth. Often RAM is used for random-access workloads (duh), so latency matters. The GC is very much a sequential scan. The blog post talks about that, briefly.

ernst_klim · on July 17, 2018

What? Are Go interfaces and strings stack allocated? If not, this assertion looks dubious to me.

colonelxc · on July 17, 2018

Go uses escape analysis at compile time to determine where to allocate each variable. If the variable is shown to not 'escape' the current goroutine's stack, then it is can be allocated on the stack. Otherwise it is said that the value 'escapes' to the heap.

You can see what the escape analysis has determined for each variable by building like so: go build -gcflags "-m"

Unfortunately it can be hard to know by just looking at code if something will be stack or heap allocated. You can allocate a struct and take its pointer, and have it still end up on the stack, for instance (which is good). You can also have seemingly obvious flows for a variable that the analyzer can't reason about and it escapes when you didn't think it would. You really just have to run the tool to see if you want to be sure.

pcwalton · on July 17, 2018

> You can also have seemingly obvious flows for a variable that the analyzer can't reason about and it escapes when you didn't think it would. You really just have to run the tool to see if you want to be sure.

And this is why "get programmers to manually optimize their code to please escape analysis" is not a reasonable memory management strategy.

dullgiulio · on July 18, 2018

Nobody is suggesting that. Go has very good profiling support, and that's how performance optimization starts.

If profiling finds that there are too many allocations on a hot path, you might want to see why they escape to heap. But that's a last resort strategy.

staticassertion · on July 18, 2018

This sounds very much like "write code to please escape analysis", but with a profiling step in front of it.

kristianp · on July 18, 2018

No, there's a difference. Profiling will often show you where you're doing allocations or calling code that does allocations in the most often used code. Profiling shows where to reduce those allocations to get the biggest improvements. When making the changes, you don't try to reduce escapes, you reduce allocations.

Have a look at [1] and go down to where it says " To find out why the garbage collector is running so much".

[1] https://blog.golang.org/profiling-go-programs

ychen306 · on July 17, 2018

I am not disagreeing with you but am confused on the issue. Java also has escape analysis, so I wonder what makes the difference here.

colonelxc · on July 17, 2018

That is a good point. I don't know, so I've been googling around and came across this awesome article:

https://shipilev.net/jvm-anatomy-park/18-scalar-replacement/

It essentially says that java doesn't (ever?) allocate Objects on the stack, but it sometimes approximates it by just storing the scalar values (that would have been part of the object) on the stack, and never allocating the object (with its headers) at all. The end result is similar.

It also has a mention of an easy way to confuse the escape analysis (though Go's escape analysis has many pitfalls as well).

I'm speculating, but it could be partly that the escape analysis and resulting scalar replacement is a hotspot optimization, and therefore the Java GC has to be well tuned to handle non-jitted items. It may have something to do with what types of failure cases each of the languages have with their analysis.

pcwalton · on July 17, 2018

The HotSpot JVM does indeed do escape analysis. The primary difference is that, in Go, you can put structs and arrays inside other structs.

I tend to be skeptical of the claims that are often made that Go data structures have different GC-relevant characteristics from those of Java, or especially C#.

ychen306 · on July 17, 2018

Can you elaborate why being able to embed objects within another void this hypothesis?

voidlogic · on July 17, 2018

Java used to default to heap allocation for pretty much everything, but added escape analysis as an optimization. For Go escape analysis as been part of it since day 1. Basically Java is heap by default that uses the stack when it can. Go uses the stack unless it determines it must use the heap.

bbatha · on July 17, 2018

Most go data is value types that can be assumed to be on the stack, unless you introduce a pointer explicitly or use one of the referenceish types like interface, string, array or map. Its very easy for both the programmer and the compiler to see when data escapes.

Java on the other hand is pathologically bad for escape analysis, outside of ints and floats literally every object in java is presumed to be on the heap. Even C# has less heap pressure than java typically because of more value type options, and not type erasing greatly improves the JIT's escape analysis. After sufficient JITing Java may reach go level of garbage generation, it may not: but its not deterministic.

mike_hearn · on July 17, 2018

That's garbled in several ways I'm afraid.

Firstly type erasure and escape analysis are un-related. Whether an object escapes or not is unrelated to things like whether it's a List<Integer> or List<int>.

Secondly .NET doesn't do escape analysis at all:

https://github.com/dotnet/coreclr/issues/1784

Thirdly HotSpot will happily escape analyze and then scalar replace regular objects, it isn't limited to ints and floats. It can make a big difference in many situations.

Fourthly and finally, if it were "easy to see where data escapes" in Go there wouldn't need to be tools to help the programmer find out, and Go developers wouldn't spend time manually tweaking to try and ensure the optimisation works.

jules · on July 19, 2018

Type erasure and escape analysis are not entirely unrelated. If you have an array of pairs and store a pair into that array then the pair is considered to escape in Java because you're actually storing a pointer to your original pair. In C# that array can store a true array of pairs rather than an array of pointers to pairs. A pair you store gets copied into the array, so the original pair does not have to be considered to escape and can be stored on the stack and deallocated when the function exits.

Although C# does not have escape analysis, its value type vs reference type distinction serves roughly the same purpose, but under manual control of the programmer.

pjmlp · on July 17, 2018

Only if you are speaking about Hotspot.

GraalVM and other Java JITs (J9, Azul, PTG, ...) have much better escape analysis algorithms.

jules · on July 19, 2018

In Go you can pass and return by value. That's why it's far more effective in Go and of very limited applicability in Java. For instance if you're working with a pair of numbers and only use that pair locally in one function, then Java escape analysis works. If that pair was returned from another function or passed to another function then escape analysis fails in Java but still works in Go. In Go the escape analysis only fails if you explicitly make a pointer to that pair of numbers and pass that to another function.

steveklabnik · on July 17, 2018

Let me say that I am not super familiar with the Go source, and so this may be wrong. Please correct me if this is true!

My understanding is, the string itself is on the stack, but the data is on the heap https://github.com/golang/go/blob/2f2e8f9c81d899d8eefb1f2f98...

Likewise, the "double pointer" of an interface is on the stack, but points to the heap: https://research.swtch.com/interfaces

So, from a lifetime perspective, it's still pretty stack-like. If you make a string, and you don't refer to it from outside the function, it's gonna get noticed by escape analysis and be like anything else on the stack.

ernst_klim · on July 17, 2018

>the string itself is on the stack, but the data is on the heap

So both are regular heap allocated values? So the majority of data is still stack (edit: heap) allocated (safe ints, floats and some simple structs, I suppose), so I fail to see how ``young objects live and die young on the stack''.

It is true that some GCed languages does not have stack allocation at all, but many of them (java, haskell) have very powerful inlining capabilities to compensate it. I still don't understand why I don't need generations for allocating to-die-soon strings in a more efficient ways.

pcwalton · on July 17, 2018

You are correct. Properly-implemented copying generational GCs are the only things that work to optimize allocation of strings, arrays, and objects judged to be escaping.

steveklabnik · on July 17, 2018

> So the majority of data is still stack (edit: heap)

It doesn't matter that the string's data lives somewhere else. It matters where what owns it lives. The string itself, that is, the pointer and length, both live on the stack, and so, it ends up working like a stack allocated value.

> I still don't understand why I don't need generations for allocating to-die-soon strings in a more efficient ways.

Because the string itself isn't on the heap, and therefore, a generation wouldn't help it.

mike_hearn · on July 17, 2018

How does that follow? If the string is a pointer to a heap array plus a length, then "it ends up working like a stack allocated value" would only be true if the underlying byte array was freed immediately when a scope ended. But that isn't the case because the heap is a GCd heap, Go doesn't have explicit frees.

This is one sort of reason why stack allocation isn't a cure-all - stack allocating a wrapper doesn't help you much if the bulk of the data is still in the heap. What can help, in this situation, is if the compiler can prove (or speculate that) the array is small and then scalar replace the array. However I think doing that sort of thing requires a profiling JIT compiler because statically proving array sizes ahead of time is very difficult.

steveklabnik · on July 17, 2018

I’m guess I am assuming that it would be freed at the end of the scope. As I said upthread, it’s entirely possible that I’m incorrect here.

This kind of thing happens in Ruby a lot; the VM will allocate stuff through malloc that never actually hits the GC; see Aaron Patterson's work.

mike_hearn · on July 17, 2018

I've never seen a GCd heap where you can explicitly free things outside of the Boehm conservative GC. I haven't seen any reference to Go supporting such a thing either.

I think all that happens when a string goes out of scope is that the byte array becomes eligible for GC.

ninkendo · on July 17, 2018

That sounds trivial to fix... go cheats on all sorts of things with the built-in types (maps/slices/strings) that you can't do as a user of the language, seems they could easily implement something that explicitly frees the heap space when the string pops off... the free call could be inlined right into the compiled code by the compiler.

jules · on July 19, 2018

Go has a mark-and-sweep gc so in principle they can support explicit deallocation.

jashmatthews · on July 19, 2018

> This kind of thing happens in Ruby a lot; the VM will allocate stuff through malloc that never actually hits the GC; see Aaron Patterson's work.

IIUC there's a really huge difference here. Go can quickly allocate the byte array into the GC managed heap. MRI Ruby has to make a slow call to malloc(). Worse still, when collecting we can't just mark that memory available again. Ruby has to detect the type of each object and call free() accordingly.

yencabulator · on July 19, 2018

> My understanding is, the string itself is on the stack, but the data is on the heap

What makes you say that? Pointers to stack are totally possible in Go; just because the `str` field is a pointer doesn't mean it points to the heap.

pcwalton · on July 17, 2018

I still don't see any comparison of Go's current collector with a generational GC with bump allocation in the nursery. The article states that copying would have been too much work to implement in 2014. Fair enough. But that's an engineering decision specific to Go's circumstances. It does not necessarily support forgoing generational garbage collection in general.

The primary advantage of generational GC is the huge throughput advantage that bump allocation in the nursery gives you. If you're not comparing against such a GC, then you're not properly comparing non-generational GC to generational GC. The write barriers you need reduce throughput, but they're ordinarily counterbalanced by the throughput gains you get from bump allocation. Of course you will take a throughput loss if your generational GC is non-copying; that's why people don't usually deploy non-copying generational GCs.

barrkel · on July 17, 2018

That's not my understanding of the primary win from generational GC. You can use bump allocation with an old-fashioned two-space GC. In fact that's what the allocate() function on https://en.wikipedia.org/wiki/Cheney%27s_algorithm does.

Rather it's assuming the infant mortality hypothesis - that the probability of an allocation being garbage is inversely proportional to its age. Thus you split your allocations into three logical buckets. The first bucket is recent allocations that are super quick to trace (e.g. no bigger than CPU cache), the third bucket is long-lived allocations ideally reachable by global roots rather than short-lived stack frames; and the second bucket is all the stuff that survives collections on the first bucket (hopefully only rooted by code in flight on the stack) and you want to keep out of the third bucket because the third bucket is so damn expensive to trace.

You need write barriers for generational GC, but you need them for concurrent GC anyway.

Generational GC is ideal for servers that have a predictable amount of scratch allocation per request. If first and second buckets are sized correctly and collected at the right time, and the server never makes anything reachable from the third bucket (like a cache - you should use reusable serialized arrays for your caches in generational GC languages, people! And not weak references!), then you're in a great place. It approaches arena allocation in terms of efficiency. I've written app servers in .NET back in the mid 2000s that barely hit 2% of CPU time in GC following these principles, and they allocated pretty freely.

I think generational GC outside of a server request loop context, or with wildly variable allocation profiles, is less great. Not bad mind, but the generational infant mortality hypothesis, and more importantly, lack of middle age death, is less reliable.

(I wrote this before I finished with the actual article. The point in the article about avoiding write barriers with hashing is interesting in particular - somewhat surprised it scales to 100+G heaps, if indeed it does.)

pcwalton · on July 17, 2018

Yeah, it depends on how you weight the improved speed of minor collections (latency) vs. the improved performance of allocation itself (throughput).

(I don't consider pure semispace GCs to be practical because of memory usage. One benefit of generational GC is that it makes Cheney scans practical by confining the semispace collection to TLABs or similar, which are by their nature small.)

DblPlusUngood · on July 17, 2018

That's correct. The primary performance benefit of generational GC (in terms of throughput and maybe also latency) is that it can reclaim the memory (backing the young generation) without marking or otherwise processing the vast majority of the live data (contained in the old generation).

ngrilly · on July 17, 2018

> The write barriers you need reduce throughput, but they're ordinarily counterbalanced by the throughput gains you get from bump allocation.

Bump allocation significantly increases throughput only if you allocate a lot. If you don't, maybe you gain more by avoiding write barriers. And, as explained in the article, Go tends to allocate less because it is value-oriented (instead of reference-oriented, where most values are "boxed") and thanks to escape analysis.

pcwalton · on July 17, 2018

As the article also states, it's possible to remove those write barriers by hashing. I don't see any reason why it shouldn't be possible to use the same scheme for copying generational GC.

The most important point, though, is that you don't know whether the hypothesis that Go wouldn't benefit from generational GC is true until you test it. Nobody has tested it. And there is plenty of indirect evidence to suggest that this is not true. For one, Go programmers have a habit of manually microoptimizing short-lived allocations to keep escape analysis "happy", which is something that becomes much less important with bump allocation in the nursery. Another simple observation is that .NET has the same situation regarding value types as Go does, and .NET uses a copying generational garbage collector. A final observation is that certain situations, like indirect calls [1], cause objects to be spilled to the heap unconditionally.

[1]: https://github.com/golang/go/blob/964639cc338db650ccadeafb74...

enneff · on July 17, 2018

> Go programmers have a habit of manually microoptimizing short-lived allocations to keep escape analysis "happy"

I've been writing Go for 8+ years and have never done this, and I wouldn't approve a code review of a teammate that did this.

pcwalton · on July 17, 2018

But that's what the article says: "We started creating tools to help our users find objects that escaped and if it was minor they could make changes to the code and help the compiler allocate on the stack."

enneff · on July 17, 2018

"Tools used by a small minority of users to find performance issues," != "Go programmers have a habit of microoptimizations to avoid heap allocations." You are being disingenuous.

ngrilly · on July 19, 2018

> The most important point, though, is that you don't know whether the hypothesis that Go wouldn't benefit from generational GC is true until you test it. Nobody has tested it.

I agree. But you would probably agree it would be a fairly difficult endeavour to test your hypothesis? To make a fair comparison, we'd need to implement an alternative version of the Go compiler and runtime, replacing the current GC with a copying generational GC, with a similar multi-year effort in optimizing and refining it.

geodel · on July 17, 2018

Copying will double memory overhead. So downside is easy to see for cloud applications.

pcwalton · on July 18, 2018

Only toy copying GCs are pure semispace collectors.

mike_hearn · on July 17, 2018

I would note that the JVM has done inter-procedural escape analysis for quite some time (via inlining) and the new Graal compiler does even more advanced inter-procedural escape analysis. In particular JVMs do not do stack allocation, they do something better called scalar replacement, which allows tricks like deleting variables from the middle of an scalar-replaced object. Effectively every field of the non-escaping object become local variables that can then be studied, removed, moved around etc. The talk says that Go doesn't do inter-procedural escape analysis at all.

Escape analysis and value types are both useful techniques, but they don't eliminate the need for a fast GC.

I would actually be quite interested in a study of whether equivalent Go/Java programs allocate more or less, especially when pitted against Graal. Presumably it's not hard to prepare such a test.

benmmurphy · on July 18, 2018

looking at go-gc it seems to be really successful. like for a lot of apps people are running people are more concerned over high pause times and reducing tail latencies than having the best throughput. like maybe crappy allocation speed and fragmentation is adding 1ms to every request an app is doing but avoiding random 20ms pauses is worth it. this is the same tradeoff the next generation of 'pauseless' java collectors are taking. basically have mutator threads do a lot more work in exchange for lower pause times.

the weird thing is java doesn't seem to offer this an option because it seems like it would be something simple to add. basically CMS + jemalloc. maybe the performance is actually that bad that it would not be usable on java where there is a lot more heap allocation going on.

the design of go seems generally to use simple solutions where other people use complex solutions. like for interface dispatch go has fat pointers which has middling performance. while java uses JIT to remove the dynamic dispatch where it can which has great performance or falls back to looping through the class to find the correct interface vtable to dispatch the method call which has terrible performance.

the8472 · on July 17, 2018

Wouldn't a non-copying generational collector with bump allocation be a recipe for fragmentation?

pcwalton · on July 17, 2018

Yes. That is why they have to be copying collectors.

readittwice · on July 17, 2018

Isn't JSC kind of an exception to this? JSC has a generational non-copying GC. It uses bump-pointer allocation for empty blocks (block=memory with objects of same size) and then switches to allocation via free-list if the block is full.

For implementing the generational GC they simply use sticky mark-bits.

pcwalton · on July 17, 2018

I don't consider what JSC does bump allocation—it's just a highly optimized traditional allocator. Most malloc implementations have bump allocation somewhere.

SpiderMonkey used to use an allocator much like that of JSC before it switched to copying generational GC, to significant performance improvements.

collinf · on July 17, 2018

What an amazing read. These folks really pulled off some fantastic engineering without a JIT. It will be interesting to see if the team decides to add additional knobs to configure the Go GC for different workloads like Java has now.

kjeetgill · on July 17, 2018

Go has one BIG advantage compared to Java with respect to both JIT and GC. Go has structs. This introduces a little more complexity that Java chose (reasonably) to avoid by making everything a primitive or a reference. There isn't any "direct" objects or explicit pointers like Go has.

This may muddy up the language vs Java a little bit, but it also gives a lot more tools to work your code around the runtime's limitations. And thats a perfectly fine design point by me.

If you're going to have platform limitations (no JIT, simple GC) then I need reasonable tools when I hit them. It's much easier to avoid frequent allocations or implement arenas in Go. This takes the pressure of the runtime to be one size fits all.

I know, I know, Java is getting Value Types some day! We'll see where that lands.

nemo1618 · on July 18, 2018

I love Go's value types. If you're careful/obsessive, you can write programs that don't allocate anything on the heap. That sort of power is really nice to have in a pinch.

staticassertion · on July 17, 2018

It's then interesting to compare to C#, which also can do stack allocation, but still chooses a generational GC.

masklinn · on July 18, 2018

C# defaults to reference types, structs are not usually recommended as the default way to build your own types, only as an optimisation.

kjeetgill · on July 17, 2018

I knew it had it but I have no C# experience. Does idiomatic code use this or only more niche optimizations?

pjmlp · on July 18, 2018

All performance critical data structures are value types.

Meaning int, long, char, bool, ..., points, rectangles, pairs, colours, enumerations.

Then generics are reiffed. So each instance gets a copy if value types are used.

Additionally, you can explicitly allocate value types and arrays on the stack and there are APIs for manual memory management.

squiguy7 · on July 17, 2018

> We also do not intend to increase the GC API surface. We've had almost a decade now and we have two knobs and that feels about right. There is not an application that is important enough for us to add a new flag.

It sounds like they don't intend to add any more tuning parameters for the time being. Considering they have made it this far without many ways to adjust the garbage collector, I'll bet this remains true.

mike_hearn · on July 17, 2018

I agree it's a good and well written talk. I enjoyed reading it.

WRT knobs. Apparently they aren't going to do so, but I suspect that's a mistake.

It's apparent from reading this talk that GC design for Go has often been constrained by this "no knobs" design philosophy in ways that seem problematic. For instance they designed a GC to have the shortest pauses possible, but then kept hitting throughput problems on a classic batch job where latency doesn't matter at all (their compiler).

I wrote about this problem some time ago:

https://blog.plan99.net/modern-garbage-collection-911ef4f8bd...

The essay points out that GC tuning exists because of fundamental tradeoffs between low pause times and throughput. Java lets you tune the GC precisely to avoid the nasty situation Go got itself into, where the demands of very very different kinds of workload (super latency sensitive Google servers vs compilers) are difficult to reconcile with each other.

You can see this tension come through several times, like when they admit that they would have done a normal "enterprise" style GC as they rather condescendingly called it, but they felt Go's market position was under threat, they only had a year to make improvements but were constrained by the requirement to not slow down the Go compiler itself:

So we were limited. We had basically a year of compiler performance improvements that we could eat up by making the GC concurrent. But that was it. We couldn't slow down Go programs. That would have been untenable in 2014 ... Go also desperately needed short term success in 2015.

So they ended up with this design that they knew was kind of weird and not that great, but it let them promote low latency above all else to their users, and they sort of buried the tradeoffs they were making along the way. Which, reading between the lines, might have been related to internal staffing issues at Google - why else would a well funded project like Go have "desperately needed short term success in 2015"? It's not like they sell the Go compiler.

More quotes that support this:

As you can see if you have [request oriented collector] on and not a lot of sharing, things actually scale quite nicely. If you don't have ROC on it wasn't nearly as good.

So they had a GC that scaled better. But it got killed by, you guessed it, the needs of batch jobs:

But that wasn't good enough, we also had to make sure that ROC didn't slow down other pieces of the system. At that point there was a lot of concern about our compiler and we could not slow down our compilers. Unfortunately the compilers were exactly the programs that ROC did not do well at. We were seeing 30, 40, 50% and more slowdowns and that was unacceptable. Go is proud of how fast its compiler is so we couldn't slow the compiler down, certainly not this much.

They could have solved this problem by having two GCs for Go programs tuned for throughput and latency, or doing what the Java guys did and trying to make an adaptive GC like G1 where you can give it pause time targets: if the target is very high then it tunes for throughput otherwise it tunes for latency. But instead they seem to have sacrificed it all on the altar of "no knobs", which is a pity because they ended up introducing two knobs anyway: SetGCPercent and SetMaxHeap.

Well hey, guess what, if you use the G1 GC in Java you are only supposed to tweak two knobs too - max heap size (same as Go) and pause time target. If anything G1's knobs seem better because pause time target is a lot easier for developers to reason about than "heap overhead multiplier". In the end it's unclear their sacrifices were really worth it, although I don't know enough about Go to know what dire straits in 2015 were causing them to need such short term success.

readittwice · on July 18, 2018

I think you are right, but I can still see why they chose certain trade offs. The most common use case for Go seems to be as servers (right? I am not a Go developer...) where latency is important, non-moving GC allows them to have low latency while not having the throughput hit of read barriers. Obviously this comes with certain trade-offs: e.g. slower allocation performance. OTOH interfacing to C becomes a bit easier, although that's even without that quite complicated in the case of Go. Also the language uses interior pointers a lot, which is simply more complicated with a moving GC.

> So they had a GC that scaled better. But it got killed by, you guessed it, the needs of batch jobs

Even in the JVM where you have multiple specialized GCs, the developers also look at each GC in workloads it wasn't designed for and that also seems to influence design decisions. For me this just means they focus on the server use case (=latency) but not at the cost of 50% throughput hit for the compiler (=batch job).

kodablah · on July 17, 2018

So they had to work with a few algorithms and tweaks. I wonder if there would be value in exposing the GC (and alloc) iface so others can implement without recompiling the language. I would love to toy with a pluggable GC. Last I looked at .Net's when it was open sourced, it was a gargantuan cpp file. I know LLVM has it somewhat pluggable. We can see a lot of Go GC's work at https://github.com/golang/go/blob/master/src/runtime/mgc.go but at quick glance, I couldn't see how I could plug in an alternative impl. Surely they had to build an experimentation abstraction to test their ideas, right? Or are GC's just so specialized and use so much internal knowledge that extracting a common iface is unreasonable?

Also,

> The math is absolutely fascinating, ping me for the design docs

Ping! Publish them please.

Scaevolus · on July 17, 2018

The design docs are linked right below :)

Go 1.5 concurrent garbage collector pacing: https://golang.org/s/go15gcpacing

Proposal: Separate soft and hard heap size goal: https://github.com/golang/proposal/blob/master/design/14951-...

mike_hearn · on July 17, 2018

The latest versions of OpenJDK have a pluggable GC interface, at the source level at least - you can't literally drop a DLL into a directory and load it dynamically, but the source code is modularised quite well and there are several different GC engines available to be chosen at startup.

kodablah · on July 17, 2018

Good to know, I knew that I could choose different ones at runtime via java args, but unaware there was a runtime interface. I see cpp code at [0] and some adjacent impls. is there a single C-level h file or is it cpp only? Any hello-world trivial impl?

0 - http://hg.openjdk.java.net/jdk/jdk11/file/a0de9a3a6766/src/h...

mike_hearn · on July 17, 2018

The hello world collector is called Epsilon. Epsilon is the simplest possible collector: it never collects at all.

The next one up in complexity is called the serial collector. It's a stop the world mark/sweep design straight out of the 1980s, but it can be quite effective for small programs like command line tools, simple GUI apps etc because there's no heap overhead and no runtime overhead when not collecting.

After that you get into the more advanced algorithms. In order they'd be parallel, CMS, G1, Shenandoah/ZGC.

The different engines can be found one level up, here:

http://hg.openjdk.java.net/jdk/jdk11/file/a0de9a3a6766/src/h...

The directory you were looking at is code shared between the collectors.

kodablah · on July 17, 2018

Ah yes, I saw the adjacent impls, didn't check deeper to see Epsilon was exactly what I was looking for. Thanks. I see even the ifaces are GPL sadly as are the impls. I am not sure the classpath exception applies to the interfaces. And I fear even reading/learning from the impls lest I want to use my GPL-gained knowledge on a distributed, non-GPL project.

mike_hearn · on July 18, 2018

GPL doesn't affect knowledge, only code. You can go work on proprietary software using things you learned by reading GPLd code, don't worry about that.

pjmlp · on July 17, 2018

You are looking for "JEP 304: Garbage Collector Interface".

http://openjdk.java.net/jeps/304

Naturally this is OpenJDK specific, other JVMs might do it differently.

qazwe · on July 17, 2018

With all improvements to the GC, in 2018 Go is ready to develop real-time AAA games? In the past there is discussions that Go is not very well for game development. (GC latency)

pcwalton · on July 17, 2018

I would think the biggest barriers (no pun intended) to using Go in CPU-bound apps like games would be the lack of mature compiler optimizations, compared to GCC and LLVM. Additionally, the performance of cgo can cause problems when linking to libraries like Vulkan that can't effectively be lowered into syscalls.

I also think that throughput of memory allocation, not just latency, matters a lot more than people give it credit for.

ngrilly · on July 17, 2018

> Additionally, the performance of cgo can cause problems when linking to libraries like Vulkan that can't effectively be lowered into syscalls.

Agreed. But you also ask for very fast memory allocation, which implies a bump allocator, which implies a copying GC, which implies pinning objects shared with C through cgo (Go's foreign function interface), which would probably make cgo even slower.

> I also think that throughput of memory allocation, not just latency, matters a lot more than people give it credit for.

At 60 fps, minimizing latency is critical. You have to optimize this first. When your longest GC pause is well below 1/60 s, then you can optimize throughput of memory allocation.

Anyway, most AAA video game engines being written in C/C++, my guess is they don't use a bump allocator, which means they do quite well with allocators like tcmalloc or jemalloc, probably by allocating as much as possible early, and using object pools.

pcwalton · on July 17, 2018

I agree that these are all tradeoffs. My view is that languages like Rust and C++ exist for a reason, and demanding games are among the domains that they're most appropriate in. :)

Note that Unity is an existence proof that most games can get by just fine with high-level logic in GC.

dilap · on July 18, 2018

And with a language that is not particularly fast. And with a GC that's way worse than what Go has.

Go would be fine for lots of game dev, imo.

pcwalton · on July 18, 2018

Note that I'm not sure how FFI calls out of Unity's VM compare to those of cgo.

dilap · on July 18, 2018

ah, good point. much faster than cgo. so probably that would kill you in go.

i wonder if it's possible to speed up ffi in go...

ngrilly · on July 18, 2018

You wrote that Unity FFI is much faster than cgo. I'd be interested in knowing more about this. Why is it faster? Can you share a link to some reference material and/or benchmark?

dilap · on July 18, 2018

good question! i'm not an expert.

my understanding is go is slow because each function call has to copy over to a bigger stack (goroutines have tiny stacks to start, and grow on demand, but c code can't do that, natch) and because it has to tell the goroutine scheduler some stuff for hazy reasons.

this github issue has a lot of interesting discussion of go ffi: https://github.com/golang/go/issues/16051

these benchmarks https://github.com/dyu/ffi-overhead seem to show that a c call via cgo will be about 14 (!) times slower than a c call from c# in mono, which itself is about 2 times slower than just a plain c call.

ngrilly · on July 19, 2018

cgo doesn't copy the stack to a "bigger stack" when calling a C function. It just uses another stack, dedicated to cgo, and disallows passing pointers into the stack:

https://github.com/golang/go/issues/11089

So I think the slowness is caused by something else.

Thanks for sharing the benchmark.

dilap · on July 19, 2018

ah, that makes more sense. thanks.

Thaxll · on July 18, 2018

All major game engines use a custom allocator.

ychen306 · on July 18, 2018

I want to point out that because go has value type, it's possible to implement Arena Allocators using only safe Go, which is something one might need for these kind of games.

littlestymaar · on July 18, 2018

Without generics, you'd need to implement one allocator per type you'd like to allocate … (Because just using Interface{} defeats the point of having value types in the first place)

enqk · on July 18, 2018

It seems that frame rates of game monitors are pushing the boundaries of how low the latency should be. Imagine making a competitive multiplayer game design to run on 144hz or 200hz monitors.

blaisio · on July 18, 2018

No, Go is not (and probably never will be) a good language for writing real time programs in. It is a good language for writing video game servers though.

jy3 · on July 18, 2018

As previously stated, Go for video game servers feels like a no brainer.

koffiezet · on July 18, 2018

That sounds a bit backwards, you would need to implement the network protocol twice and keep them in sync - which can be very tricky... Having a common code-base for parsing the network protocol into internal game structure objects is a huge advantage...

pjmlp · on July 18, 2018

A big chunck of AAA game servers is already implemented in Java, .NET, Erlang.

Not all of them use C++ on the server side.

koffiezet · on July 19, 2018

Many of them use C++ libs with language bindings for 'common' code though. And let that just be one of the weakest points of Go...

pjmlp · on July 19, 2018

Yeah, that could be the case yes.

ralusek · on July 18, 2018

I think it's treated as being just fine for video game servers, my assumption was that they meant for game engine development.

skybrian · on July 17, 2018

I wonder if they'd consider changing the bit that indicates whether a word is a pointer or not based on the value rather than the type?

Interfaces in Go seem pretty inefficient for unions of small types. If you have a slice of interfaces, it's all pointers, even if many of the values fit within a word and could be stored inside the interface.

weberc2 · on July 17, 2018

IIRC they originally were stored inside of the interface, but then the GC would have to check whether they were a value or a pointer and this switching was too costly.

abacate · on July 19, 2018

The article is indeed interesting when it goes to the GC/allocation part and it's a nice read, but it really gets C++ wrong when talking about values/references. C++ is not "reference oriented language" the same way it is not an object-oriented language.

For instance, what restricts one from doing:

    int blah(myobj a);

Instead of:

    int blah(myobj &a);

For passing your a myobj value?

The usual argument is efficiency - since "myobj" may be a big and have an expensive copy constructor - and you may happen to see this pattern being commonly used in C++ programs, but restrictions? There are none. C++ even makes it possible to assert coherency with copy/assign constructors, and modern C++ supports the more efficient moving as well.

So, the point is: nothing in the language restricts you from passing things by value. I'd go further an say that blindly passing things by reference is bad practice and detrimental for simple objects - since you now have to consider lifetime/ownership of the reference you passed along, and this will probably complicate an otherwise simple piece of code.

The argument about fields in memory also makes not much sense. How is that different from C++? Keeping references/pointers in structures is possible but not exactly what one needs to do.

Otherwise the article is great I just think it misses the point when comparing to "C and C++", since they are not the same language - even though syntax is similar and there is some level of compatibility.

incadenza · on July 18, 2018

I find this stuff fascinating but so much of it went over my head. Can anybody point me in the right direction to read more about the relevant topics? Is this standard material in compiler design, or another related field?

pcwalton · on July 18, 2018

My favorite resource is: http://memorymanagement.org/

Its glossary is invaluable. Even though the site is old, GC has been around for decades and the state of the art hasn't changed much since it was written. It's opinionated in a good way, explaining why production GCs work as they do.

incadenza · on July 18, 2018

Awesome! Thanks.

riobard · on July 17, 2018

Does anyone have comparable GC latency numbers comparing Go vs Java as of 2018?

mike_hearn · on July 17, 2018

There is a discussion from a year ago that compares them here:

https://www.reddit.com/r/golang/comments/5j7phw/modern_garba...

Note that latency isn't the only metric that matters! The top comment from that discussion says this:

There were tests lately in which Go GC was almost the fastest latency wise. Go was was couple of times faster than Java in mean latency time but it had 1062 pauses comparing to Java G1 GC which had only 65 pauses. Time spent in GC was 23.6s for Go but only 2.7s in Java. There is no free launch, you need to pay for low latency with throughput.

More detail from the same guy:

"Go: 67 ms max, 1062 pauses, 23.6 s total pause, 22 ms mean pause, 91 s total runtime

Java, G1 GC, no tuning: 86 ms max, 65 pauses, 2.7 s total pause, 41 ms mean pause, 20 s total runtime"

nvarsj · on July 17, 2018

It’s not even a contest on the latency front. Golang gc targets sub ms, hotspot easily hits 100s of ms. A well tuned hotspot can reach 10s of ms pauses on average, with a long tail of 100 ms+. This is on typical apps.

pron · on July 17, 2018

1. This is not due to GC design but to language semantics (value types, which aren't yet available in Java).

2. In September '18, ZGC will be available in OpenJDK on Linux/x86 (available today in early access), which also targets sub ms pauses (and guarantees under 10ms): https://wiki.openjdk.java.net/display/zgc/Main

Note that at some point, worst-case latency becomes far less meaningful than throughput, because unless running on a realtime OS, the OS introduces bigger pauses, and it makes no sense for the GC to try and beat them.

ngrilly · on July 17, 2018

> This is not due to GC design but to language semantics (value types, which aren't yet available in Java).

The fact that Go has value types helps a lot, but the latency inferior to 0.5 ms is mainly the result of GC design, as explained in the discussed article (especially the work on removing as much as possible stop the world pauses).

> Note that at some point, worst-case latency becomes far less meaningful than throughput, because unless running on a realtime OS, the OS introduces bigger pauses, and it makes no sense for the GC to try and beat them.

This is already said in the article.

pron · on July 17, 2018

Because you usually pay for latency with throughput, you can afford the low latency achieved with a simpler design only if you have less concurrent work.

masklinn · on July 18, 2018

Go is not magic, and does in fact pay for latency with throughput: https://www.reddit.com/r/golang/comments/5j7phw/modern_garba...

> Go: 67 ms max, 1062 pauses, 23.6 s total pause, 22 ms mean pause, 91 s total runtime

> Java, G1 GC, no tuning: 86 ms max, 65 pauses, 2.7 s total pause, 41 ms mean pause, 20 s total runtime

nvarsj · on July 18, 2018

I couldn't find the benchmarks for that comment. But the Java numbers seem way off. It must have been a simple toy application, probably with allocations optimized away somewhere - I'd want to see memory allocation rates to see if they are comparable. G1 is usually higher latency than CMS in my experience, too. Try running Cassandra with G1, and see how good your pauses are. Min viable pause time target for G1 is ~200ms (the hotspot default).

BuckRogers · on July 19, 2018

I don't really understand the argument that Go is "almost" soft-realtime.. if you need that, you probably need or should just go to realtime. Use say, Rust or C++.

Otherwise it seems to me that the Java/C# model is the best design for most tasks. Which is why they're so popular, it's not a mistake.

ngrilly · on July 19, 2018

> Otherwise it seems to me that the Java/C# model is the best design for most tasks.

This is discussed in the article (basically, Google needed low latency servers):

« If you want 10 answers ask for several more and take the first 10 and those are the answers you put on your search page. If the request exceeds 50%ile reissue or forward the request to another server. If GC is about to run, refuse new requests or forward the requests to another server until GC is done. And so forth and so on.

All these are workarounds come from very clever people with very real problems but they didn't tackle the root problem of GC latency. At Google scale we had to tackle the root problem. Why?

Redundancy wasn't going to scale, redundancy costs a lot. It costs new server farms. »

pron · on July 19, 2018

> but they didn't tackle the root problem of GC latency

But they did. The new low-latency Java GCs are more sophisticated than Go's, and deliver pauses that are on the order of OS-caused pauses. The reason Go was able to achieve low latency with a relatively simple design is because 1. it suffers a hit to throughput and 2. that throughput hit, while significant, is not catastrophic because Go relies heavily on value types.

ngrilly · on July 19, 2018

> The new low-latency Java GCs

As you wrote, they are new, and weren't available when the decision was made for the Go GC.

> The reason Go was able to achieve low latency with a relatively simple design is because 1. it suffers a hit to throughput and 2. that throughput hit, while significant, is not catastrophic because Go relies heavily on value types.

And are you saying these "new low-latency Java GCs" have no tradeoffs either?

I'm sorry, but we are discussing the keynote of the International Symposium on Memory Management, which is a recognized event in the field, and you are claiming things without any substantial material to offer. Maybe you're right, but I need more than vague assertions to be convinced :-)

pron · on July 19, 2018

There is always a tradeoff in throughput (and a commercial low-latency GC has been available for Java for years, as well as realtime GCs). All I'm saying is that the reason Go is able to achieve low latency with a relatively simple design is because the language is designed to generate less garbage, so the challenge is smaller. My point is that Go's GC is not some extraordinary breakthrough in GC design (not that Hudson hadn't made some in the past) that unlocks the secret to low-pause GCs, but more of an indirect exploitation of the fact that the allocation rate is relatively low. The same design with much higher allocation rates would likely suffer an unacceptable hit to throughput.

A recent presentation on ZGC is here: https://www.youtube.com/watch?v=tShc0dyFtgw

ngrilly · on July 19, 2018

I'm not really sure what you're trying to prove here. Java is definitely a great platform and has a cutting-edge GC. Nobody contests that. Go's GC is another point in the design space, starting with different constraints (low latency and non moving). This is what makes the ISMM keynote interesting.

Hudson explains they tried to switch to a generational GC, but for this they needed a write barrier. It was difficult to optimize the write barrier, by eliding it whenever possible, because Go is moving to a system where there is a GC safepoint at every instruction (this is because goroutines can be preempted at anytime, which is not a requirement for Java threads). In other words, the GC design is also constrained by the way goroutines work.

Hudson also explains that because Go relies a lot on value types, it makes espace analysis more effective, even without interprocedural analysis, which makes generational collection less effective than in other languages using a GC.

Keeping the allocation rate low is part of Go's GC design. A language with a "much higher allocation rate" would probably lead to a different design.

Thanks for the link to the presentation on ZGC! I'll watch it soon. But I saw the slide showing the performance goals and ZGC doesn't sounds a lot better than the numbers presented by Hudson for Go:

- "10 ms max pause time" for ZGC versus "two <500 microseconds STW pauses per GC" for Go

- "15 % max throughput reduction" for Java versus "25% of the CPU during GC cycle" for Go"

By the way, I also note that ZGC is "single generation".

ngrilly · on July 19, 2018

> The new low-latency Java GCs are more sophisticated than Go's, and deliver pauses that are on the order of OS-caused pauses.

ZGC, whose explicit goal is to be a low latency GC for Java, has a goal of 10 ms max pause time, which seems way above the pauses caused by the OS (compared to two <0.5 ms pauses per GC cycle for Go). The number came from the video you shared earlier. But maybe I misunderstood something.

pron · on July 20, 2018

1. In the video they say that in practice they see average pauses of 1ms and max pauses of 4ms.

2. It's still very early days for ZGC, but there are more established low-latency Java GCs. A non-realtime one is C4[1] (used in the Zing JVM), which then added generational collection to a design that is similar to ZGC. They report[2] a few 0.5-1ms pauses per hour

[1]: https://www.azul.com/files/c4_paper_acm2.pdf

[2]: https://groups.google.com/forum/#!topic/mechanical-sympathy/...

ngrilly · on July 25, 2018

Thanks for the links.

This is definitely great engineering.

I also recommend this short presentation of ZGC: https://dinfuehr.github.io/blog/a-first-look-into-zgc/

pjmlp · on July 17, 2018

And even on the real time case, we have then the real time extensions, even though they are out of range for most devs (price wise).

pjmlp · on July 17, 2018

Unless one gets to use Azul or real time JDK.

It is easy to forget there isn't a single answer with Java.

nvarsj · on July 18, 2018

Yeah true, but C4 is a commercial product, and I've yet to meet someone that used it. But on paper, Azul's C4 GC sounds really great. I think it might require a custom kernel as well, and some other OS restrictions - which would also somewhat limit its usage as a general purpose collector.

js2 · on July 17, 2018

Not currently loadable for me due to the Google Cloud Loadbalancer outage[0]:

Google Cache (irony...):

http://webcache.googleusercontent.com/search?q=cache:BtSPqHc...

[0] https://news.ycombinator.com/item?id=17552532

geodel · on July 17, 2018

So Go set Heap to 2X and CPU 25% as GC overhead. In Java side may be due to generational/copying GC Heap is recommended 3-4X and CPU ~20%. This is good read about Java GC tuning:

https://engineering.linkedin.com/garbage-collection/garbage-...

the8472 · on July 17, 2018

The article you're linking to is outdated. It talks about java 7 and CMS, which is deprecated as of OpenJDK9. There are other collectors in OpenJDK and then there are other JVMs too. They all come with a lot of different overhead, complexity, latency and throughput tradeoffs.

amelius · on July 18, 2018

Seems like Go may become an interesting target-language for compilers.

rurban · on July 19, 2018

When they'll finally get a good generational copying collector, yes. Now it's just too slow for that.

corydominguez · on July 17, 2018

Has anyone found a video of this keynote?

pjmlp · on July 17, 2018

Very interesting.