ZGC – What's new in JDK 16 (malloc.se)
195 points by harporoeder on March 23, 2021 | 101 comments



1 ms pause times are pretty good. That's finally getting close to the point where it may no longer be the biggest factor preventing adoption in applications like core game engine code. Although at 144 Hz it's still 14% of your frame time, so it's hardly negligible.

Even if the GC is running on an otherwise idle core there are still other costs like power consumption and memory bandwidth. So you still want to minimize allocation to keep the GC workload down.

For too long GC people were touting 10 ms pause times as "low" and not bothering to go further, but truly low pause times are possible. I'd love to see a new systems language that starts by designing for extremely low-pause GC, not manual allocation or a borrow checker. I think it would be possible to make something that you could use for real time work without having to compromise on memory safety and without having to pay the complexity tax Rust takes on for the borrow checker.


> at 144 Hz it's still 14% of your frame time, so it's hardly negligible

A single skipped frame is not a big deal and will probably not be noticed. It will probably happen anyway due to scheduling quirks, resource contention with other processes, existing variation in frametime...

True realtime work requires no dynamic allocations whatsoever (which, notably, is not covariant with gc!), so I think ‘low’ pause times are an acceptable compromise. Where performance is a concern, you need to manually manage a lot of factors, among them GC/dynamic memory use. There's no runtime that can obviate that.

Granted, 1ms pause times are probably still not low enough for realtime audio, and there may be room for some carelessness there (audio being soft realtime, not hard realtime). But I think just being careful to avoid dynamic allocation on the audio thread is probably a worthwhile tradeoff.


> A single skipped frame is not a big deal and will probably not be noticed.

Some folks definitely notice this phenomenon, called a "microstutter" by that group. You can see it here:

https://testufo.com/stutter#demo=microstuttering&foreground=...


No one has mentioned how frequent the frame skips we're talking about actually are.

Is a single frameskip in an hour a problem?


A constant frame interval is better than occasional skipped frames. You don't need a super high frame rate for perceived smooth motion, but dropped frames look like stutter.


Stutter is becoming much less of a factor with variable refresh rate display technology. Modern consoles, TVs, and many monitors are being built with VRR these days, and in a few years it will probably be ubiquitous.

Unless you have a highly optimized game, you are probably not able to consistently run at a 144 Hz monitor’s native refresh rate anyway, so even without skipping frames you will see stuttering. VRR solves this problem as well.


I'm not sure if you and GP share the same notion of stutter or not. I never saw stutters when limiting the game at 24 or 30 fps while playing on a 60 Hz LCD monitor in the past. It stutters only when the fps is not constant.


VRR makes that kind of stutter essentially irrelevant, as the penalty for missing the refresh interval is basically zero. They don’t really drop frames, it’ll be displayed as soon as it’s done, so the frame will only delay for however late it actually is, instead of waiting a whole 1/60th of a second for the next refresh

Of course, we’re generally talking about much higher refresh rates, at least 120 if not more.

My display runs at 165Hz and you'd be surprised how many games could hit that a lot of the time... even on my old 1080 (and certainly on the 3080 I have now).

The most frequent issue I run in to is actually games having hard FPS caps (120 typically, sometimes 150) and refusing to actually max it out.


Is that measured somewhere / have you got a reference? In my experience 120hz dropping frames (or more precisely rendering at 100 +/- 10fps) looks much smoother and nicer than constant 60 or 30fps.


Dynamic allocations don't cause any issues with hard realtime, as long as you don't run out of memory.


Most allocators are not constant time, and are fairly slow anyway. (Actually GC tend to have faster allocators, but obviously unpredictable pauses.)

(Though there was an allocator I saw recently that promised O(1) allocations. Pretty neat idea.)


Core game code commonly uses custom allocators that do provide those semantics though.

A bump allocator that you reset every frame is O(1) and a dozen or so cycles per allocation for example.
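
For illustration, a minimal sketch of such a per-frame arena (in Java, with invented names; a real engine would do this in C/C++ over raw memory, but the idea is the same):

    import java.nio.ByteBuffer;

    // Hypothetical per-frame bump arena: allocation is a bounds check plus a pointer
    // bump, and reset() at the start of each frame reclaims everything in O(1).
    final class FrameArena {
        private final ByteBuffer backing;
        private int offset;

        FrameArena(int capacityBytes) {
            this.backing = ByteBuffer.allocateDirect(capacityBytes);
        }

        // Hands out a slice of the backing buffer, or fails if the frame budget is blown.
        ByteBuffer allocate(int sizeBytes) {
            if (offset + sizeBytes > backing.capacity()) {
                throw new IllegalStateException("frame arena exhausted");
            }
            ByteBuffer slice = backing.slice(offset, sizeBytes);
            offset += sizeBytes;
            return slice;
        }

        // Called once per frame; no per-object bookkeeping to walk or free.
        void reset() {
            offset = 0;
        }
    }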


Sure, yes. I was referring more to ‘general-purpose dynamic allocator’ (malloc or so). I agree custom memory management/reclamation techniques can be fine for RT; just semantics.


> A single skipped frame is not a big deal and will probably not be noticed.

Attitudes like this are why my phone sucks to use and I get nauseous in VR and GC devs spent so long in denial saying 10 ms pause times should be good enough. Yes, single dropped frames matter. If you don't think so then I don't want to use your software.


A single skipped frame usually means that we are talking about soft real time. And there it is absolutely acceptable; not in the average case, but e.g. on a heavily loaded computer a slight drop in audio is "appropriate". It's not an anti-missile device.

It won't make the normal case jittery, nauseating or anything like that. Also, regarding your GC devs comment, I would say that attitudes like this are the problem. The great majority of programs can do with much more than 10 ms pause times.


A slight drop in audio would be perfectly unacceptable for eg computers running concerts.


You're bringing up a specific professional use case in a general-purpose discussion. If you run concerts on anything other than dedicated hardware and software, you're going to have a bad time. If you record at home without low-latency devices and soft-rt software, you're going to have a bad time.

It's in the same category as "the performance of stock Camry is unacceptable for racing" - yes, plan accordingly when entering a race :-)


> Although at 144 Hz it's still 14% of your frame time

Well, if we believe their numbers, the worst case is 0.5ms, so about 7% of frame time at 144 Hz. Assuming their stated average pause time of 0.05ms, average pauses (and the GC isn't constantly pausing) take about 0.7% of frame time, which is negligible. Though your concern about throughput and resource usage stands. Newer programming languages could also leverage ZGC (and improve upon it) by targeting GraalVM, which additionally enables cross-language interop.


In my experience GC developers wildly underestimate their worst case, so I don't really believe that 0.5ms number. But more importantly, you should not use average pause time at all. At 144 Hz the 99th percentile frame time occurs more than once per second. If you want to avoid dropping frames you need to design for the worst case.


There is a worst case that is much worse than the 1ms mentioned here: a really big change in allocation rate, in which case the GC cycle does not finish before running out of memory (ZGC uses the allocation rate to determine when to start the next GC cycle). In that case ZGC stops allocations ("Allocation Stall" in the GC logs). But this is basically a failure mode and should not happen during normal operation at all.

Though you can configure around this in a couple of ways if you run into the issue:

1. By telling it to treat some amount of heap as the max (rather than the actual max) when it calculates when to start the next GC cycle (-XX:SoftMaxHeapSize). https://malloc.se/blog/zgc-softmaxheapsize

2. By increasing the number of concurrent GC threads so they finish their work faster (-XX:ConcGCThreads).

3. By just running a really large heap, so the "run a GC cycle even if we don't need to based on allocation rate" cycle keeps the heap in check.
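
For example, a hypothetical combination of these options might look like the command line below (the specific values are made up and should be tuned per workload):

    java -XX:+UseZGC -Xmx32g -XX:SoftMaxHeapSize=24g -XX:ConcGCThreads=4 -jar app.jar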

Though after JDK 15 we have not had to mess with any of these. Prior to that we had to adjust the soft max heap size a bit. With JDK 16 it should be even better, I guess (we should be upgrading sometime next week).


ZGC used to target 10ms as worst-case latency, and they target 3-4ms now I believe.


The article above says that the target is 1ms.


Thanks for the correction, I remembered the 3-4 ms from the inside java podcast on the ZGC.


No worries.


I think games are just a bad example. Many games have both simple ownership semantics (making manual memory management easier) and simple object lifetimes.

In a game an object is likely to live for exactly one frame, exactly n frames where n is like 2 or 3 for some kind of deferred rendering, or an entire level. Some games mightn't have levels as such, but objects still tend to be grouped together in terms of when they load/unload.

I think it is important to note this because games that need to care about performance are not using malloc. They will use some kind of arena allocation because allocating and freeing memory with malloc can be expensive. So what games want here is control over how allocation happens rather than when free is called—that is, it wouldn’t be sufficient to have a language where you called free but couldn’t use a custom allocator.

But there are standard ways to deal with these sorts of allocation patterns in a language like Java. For buffers (of eg vertices) these may just be reused. For smaller objects, pool allocation may be used which sucks but doesn’t suck much more than manual memory management.

I think it is silly to focus on games (or, for that matter, short-running command-line tools that needn’t call free at all) because the allocation pattern is so unusual and weird. I can’t think of great examples of systems that care about latency and don’t have game-like allocation patterns but maybe a complex gui application like a web browser or PowerPoint would be a good example.


Games also care heavily about data layout, which you can't really do in a GC'd language very well. Well, game engines care heavily about that, anyway. Pooling objects doesn't get you there.


There's no tradeoff between using a GC and controlling data layout. CLR already allows value types and JVM is getting support for advanced value types and specialised generics (templates) over them.


At this point, I wonder whether it's possible to architect an efficient GC that predictably runs only when alloc() is called and thus helps to avoid pauses in main loop. Of course it works best with a language having value types.


This is the standard way traditional incremental GC works. Typically allocation is done with a protocol like

1. Try to bump a pointer to allocate the memory

2. If there wasn’t enough memory, call a special function (which runs the GC) and try again

This special function may do a big gc pause if necessary or it may just do a small amount of gc work and grant the program some more memory (this way most gc work is shared more evenly with program time). In modern VMs like Java’s, concurrent gc (where gc and mutator run in separate threads possibly needing a pause for some steps) is more popular.
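
A rough sketch of that protocol, with invented names (a real VM emits the fast path as a few machine instructions in the JIT-compiled code rather than as Java):

    // Fast path: bump a pointer inside the current allocation region.
    // Slow path: do some GC work (or a full collection), then retry.
    final class BumpRegion {
        private final long end;
        private long top;

        BumpRegion(long start, long end) { this.top = start; this.end = end; }

        long allocate(int sizeBytes) {
            long result = top;
            long newTop = top + sizeBytes;
            if (newTop <= end) {                 // 1. enough room: just bump the pointer
                top = newTop;
                return result;
            }
            return allocateSlow(sizeBytes);      // 2. out of room: invoke the collector
        }

        private long allocateSlow(int sizeBytes) {
            // An incremental GC might do a bounded slice of marking/sweeping here and
            // grant a bit more memory; a stop-the-world GC would do a full collection.
            throw new UnsupportedOperationException("GC work elided in this sketch");
        }
    }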


GC will most likely never be used in demanding games. You want total control over memory. 1ms sounds OK, but you still don't know when and for how long the GC is going to kick in.


> for how long the GC is going to kick in

1ms (max, average at 50us)

And for the 'when' I'll add that the very concept of having a concurrent GC means you don't need to do a (potentially pausing) malloc right in the middle of what you're trying to do.


The kind of people who care about GC pause times have their own allocators that are as cheap as JVM allocations and cheaper on deallocation. They aren't pooh-poohing GCs and then just calling regular malloc and free.


Engines rely on smart allocators and memory pools; they usually allocate everything beforehand. You're not running malloc between two frames. Imagine a game like Battlefield if you were to allocate memory for each fired bullet.


The biggest engine in gaming, Unity, uses C#, which is GCed.


And the amount of man-hours collectively spent on working around this terrible, terrible GC is immense.

It was the worst GC implementation I've seen in my life; it could cause 0.5s GC spikes every 10 seconds on Xbox One even though we were allocating little or no memory during gameplay. The amount of pre-allocated and pooled objects was bringing it to its knees, because Unity's GC is non-generational and checks every single object every time. In the end we moved a lot of the data into native plugins written in C++. Nothing super hard, but you choose a high-level engine to avoid such issues.

I've read that in 2019 they finally added an incremental GC mode, which solves some of the issues but is still a far cry from modern GCs.


It's definitely not the biggest, and it's almost never used in "AAA" games either.


Unreal also has a GC to deal with. I've spent more time on AAA games than I'd like to admit trying to hide/mitigate/optimize the hitch.


I would say Unreal is the biggest game engine in terms of pervasiveness. Also, isn't C# just used as a scripting language in Unity? All of the heavy lifting is done by the C/C++ backend.


Unity is way way way way more common than Unreal.

Unreal is more common than it was 5 years ago, but it's still a very distant 2nd. Probably 9/10 games released on Steam in 2021 are Unity.

You may not realize it, since the devs are slightly less bad at hiding the fact that they're using Unity. (How did anyone at Unity ever think that disaster of a "launcher" was a good idea?)


Unreal uses a GC for Blueprints and C++ entities.


Yeah. These "you can't use GC in games" comments always crack me up.

https://docs.unrealengine.com/en-US/ProgrammingAndScripting/...

"Garbage collection in Unreal Engine 4 is fast and efficient, and has a number of built-in features designed to minimize overhead, such as multithreaded reachability analysis to identify orphaned Objects, and unhashing code optimized to remove Actors from containers as quickly as possible."

I always feel like there are a LOT of posers in every HN/reddit discussion of GC who have tricked out gaming rigs and are desperate to max it out in every game they play, but aren't actually game developers themselves. Whilst out in the real world there is Unreal, Unity, heck even Minecraft all making mad coin using garbage collectors at their core.


HN is mostly full of FOSS and Web development culture.

Coming from the demoscene and having had a foot in game development, I am fully aware that stuff like monetizing IP and getting a game published, no matter what and how, is much more relevant in the game development communities than whatever gets discussed over here.

Just look at e.g. the general opinion about Flash around here, and how it was embraced by the game development communities.

On Reddit there are game development forums where you will have more luck finding real-life experience.

Then there is IGDA, Gamedev, Gamasutra, Making Games, PAX, IGF and similar forums.

Fact is, not every game engine needs to be used for a Far Cry clone, and there are many ways to make money.


One of the observations I've been making is that strategies like this, of spreading the work around multiple threads, almost seem to play with measurements more than necessarily improving the cost. So yes, the "stop the world" phase is shorter & cheaper. It's unclear whether the rest of the threads have more implicit overhead to support this concurrency (more book-keeping, participating implicitly in GC, etc.). Supporting benchmarks of various workloads would be helpful to understand what tradeoffs were made.


Good observation.

This is a fundamental principle of garbage collection. You can either have low latency or high throughput. You can't get both.

Why is that?

All optimizations that improve latency come at a cost. Generally, more book keeping, more checks, more frequent garbage collections. ZGC is one of those algorithms. It adds a new check every time you access memory to see if it needs to be relocated. That increases the size of objects but also the general runtime of the application.

A similar thing happens with reference counting (which is on the extreme end of the latency/throughput tradeoff). Every time you give a shared pointer or release a shared pointer a check is performed to see if a final release needs to happen.
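
As a minimal illustration of that per-operation cost (a sketch, not how any particular runtime implements it), a reference-counted handle in Java might look like:

    import java.util.concurrent.atomic.AtomicInteger;

    // Every release has to decrement and test the count; the final release
    // actually frees the underlying resource.
    final class Ref<T extends AutoCloseable> implements AutoCloseable {
        private final T resource;
        private final AtomicInteger count = new AtomicInteger(1);

        Ref(T resource) { this.resource = resource; }

        Ref<T> retain() {
            count.incrementAndGet();
            return this;
        }

        @Override
        public void close() throws Exception {
            if (count.decrementAndGet() == 0) {  // the check paid on every release
                resource.close();
            }
        }
    }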

On the flip side, a naive mark and sweep algorithm is trivially parallelizable. The number of times you check if memory is still in use is bound by when a collection happens. In an ideal state you increase heap size until you get the desired throughput.

We get "violations" of some of these principles if we can take shortcuts or have assumptions about how memory is used and allocated. For example, the assumption that "most allocations are short lived" or the generational hypotheses leads to shorter pause times even when optimizing for throughput without a lot of extra cost. It's only costly when you've got an application that doesn't fit into that hypotheses (which is rare).

Haskell has a somewhat unique garbage collector based on the fact that all data is immutable. They can take shortcuts because older references can't refer to newer references.


> Haskell has a somewhat unique garbage collector based on the fact that all data is immutable. They can take shortcuts because older references can't refer to newer references.

When you don't mutate in OpenJDK you get essentially the same. Much of the cost of a modern GC (OpenJDK's G1, and soon probably ZGC, too) is write barriers, which need to inform the GC about reference mutations. If you don't mutate, you don't pay that cost. This is partly why applications that go to the extreme in the effort not to allocate, and end up mutating more, might actually do worse than if they'd allocated more with OpenJDK's newer GCs.

In fact, OpenJDK's GCs rely heavily on the assumption that old objects can't reference newer ones unless explicitly mutated, and so require those barriers only in old regions.
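
As a rough illustration of the kind of bookkeeping such a write barrier maintains, here is a toy card-marking table of the sort generational collectors use (names invented; real barriers are a few instructions emitted by the JIT, and G1's scheme is more involved):

    // On every reference store into an old-generation object, the barrier marks the
    // "card" covering that object, so the GC only rescans dirty cards for old->young
    // pointers instead of scanning the whole old generation.
    final class CardTable {
        private static final int CARD_SHIFT = 9;   // 512-byte cards, say
        private final byte[] cards;

        CardTable(long heapSizeBytes) {
            this.cards = new byte[(int) (heapSizeBytes >>> CARD_SHIFT) + 1];
        }

        // Called alongside every reference store; heapOffset is the updated field's address.
        void markDirty(long heapOffset) {
            cards[(int) (heapOffset >>> CARD_SHIFT)] = 1;
        }

        boolean isDirty(long heapOffset) {
            return cards[(int) (heapOffset >>> CARD_SHIFT)] != 0;
        }
    }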


> This is partly why applications that go to the extreme in the effort not to allocate and end up mutating more.

If you go to the extreme and don't allocate, you can turn the GC off.


Yes (well, sort of; you could use the Epsilon GC, which is essentially a no-op GC), but I once read of a library that allocated all of its objects at initialisation and claimed to be "GC-neutral", i.e. not add any memory management burden regardless of the GC chosen by the application using it. They were very surprised when they tried running their benchmarks with G1. OpenJDK's modern GCs are optimised for "reasonable behaviour." Allocate too much and you suffer; allocate too little (by which I mean reuse objects and mutate them a lot) and you also suffer. But as long as your behaviour is in the broad more-or-less normal range, you get some really good performance, which is only getting better with each release.


> This is a fundamental principle of garbage collection. You can either have low latency or high throughput. You can't get both.

> All optimizations that improve latency come at a cost

Huh? That's a very strange, absolutist statement. There are many cases where you have the opportunity to trade off one of latency and throughput for the other, yes. But there are also many cases where an optimization can improve both.


Optimization is probably the wrong word, fundamental algorithm might be a better one. For example, I certainly could see some SIMD optimization improving latency and throughput, that's not what I'm talking about. I'm talking about the overarching GC algorithm.

AFAIK, there's no algorithm that's both good at throughput and latency without making assumptions around how memory is used.


> Haskell has a somewhat unique garbage collector based on the fact that all data is immutable. They can take shortcuts because older references can't refer to newer references.

I don't think this is true? Because laziness is really heavily mutable under the hood. Not to mention that it has mutable references. But maybe there are some tricks in the GC I'm not aware of.


If you're interested in a fun read, they've published a paper on how they do garbage collection.

http://simonmar.github.io/bib/papers/parallel-gc.pdf


> Haskell has a somewhat unique garbage collector based on the fact that all data is immutable. They can take shortcuts because older references can't refer to newer references.

I wonder, can this be achieved for immutable data structures in the JVM, e.g. records and lists?


Yeah, the trade here is lower tail latency for increased median latency (ideally the mean is the same but mightn’t be the case).

For many systems this is desirable. It’s also the reason that incremental and concurrent gc were invented even though you could just have a gc where the mutator runs until it can’t allocate anymore, then you do a big mark and sweep, then you resume the mutator.

Decreasing tail latency can generally improve downstream systems. Let's say latency for requests is i.i.d. and I have some intermediate system which receives a request, needs to send 20 requests to your server, wait for them all to complete, and then respond. Then the median latency of my system will look like roughly the 95th percentile latency of your system (the median of the max of 20 i.i.d. latencies sits at the 0.5^(1/20) ≈ 0.966 quantile of a single request). If your system were a web server and mine a web browser, then you could pat yourself on the back for having low median latency while your users actually experience your 95th percentile latency waiting for all the resources to load.


In general, if you want the highest throughput you'll also get long pause times, since the techniques that reduce the max pause times depend on inserting barriers into the application code.

Ignoring pause times is fine for batch processing, but not ideal for interactive systems.


There is an overhead: they use higher bits in the address space to indicate various stages in the object's collection (Shenandoah has a forwarding pointer, IIRC).

This means you can't activate the compressed-pointers optimization.


> you can expect to see average GC pause times of around 0.05ms (50 µs)

This is nuts (and very well below OS jittering)


What do you mean by OS jittering specifically?


E.g. some other thread that isn't yours getting timesliced onto your core. Or something in the OS interrupt handling code wiping out most of your CPU cache, etc.

Bear in mind that by default most Linux distros will have a timeslice of 10 milliseconds, so your thread can easily skip that much time simply because it was some other thread's turn to run. JVM pauses with ZGC, and even G1 most of the time (when set to the most aggressive settings), are now so low that your pause times will be dominated by other factors.


Handwaving here, but broadly correct: modern OSes will try to arrange work/app wakeups so they happen at the same time. E.g. if I schedule a 15 ms timer and you schedule an 18 ms timer, maybe we both just get woken at 18. Or 20. Or the phone rang, so we had to kick up power to talk to the tower, so let's grab email while we're at it.


This might be an odd question, but how often does garbage collection run, and what's the usual time taken over a period of time?

Say I'm doing a drawing/game app and creating a few hundred heap objects a second that need to get garbage collected.

I have no idea how often GC runs in a typical app and how much real time it takes over, say, an hour of a semi-complex app running on average. It obviously depends on the app, but I don't even have an average number for the cost of a GC language in some typical web app.

I only know 'GCs are bad' because of the hundreds of HackerNews comments dismissing languages because they have a GC, rather than hard examples of them eating up time.


GC can be very efficient when considering the average cost over time, and is faster than reference counting for instance. It also can have nice features such as heap compaction which you can’t easily do with manual memory management.

But the main thing most folks have problems with is the random latency spikes you get with GC. The GC can start at any time in most languages, and might stop all threads in your program for maybe dozens or hundreds of ms. This would be visible to users if you are rendering frames at a constant rate in a game, since each frame takes only around 16 ms in a 60 FPS game.

That’s what’s exciting about changes like what they are doing with ZGC. They are saying the max garbage collection time is 0.5 ms in normal situations, and the average time is even lower. Most games can accommodate that without a problem.

FYI, this is also important for web servers as well. Some web servers have a huge amount stored in memory, and the GC could take hundreds of ms or even multiple seconds to collect at random times in extreme cases. This can make a web request take perceptibly longer.

Also, if you have multiple machines communicating with one another and randomly spiking in latency due to GC, then worst case latency can add up to pretty terrible numbers if you are not careful.


GC is a memory management technique with tradeoffs like all the others.

GC has many different implementations, with widely ranging properties. For example, the JVM itself currently supports at least 3 different GC implementations. There are also different kinds of collections; for example, in a generational garbage collection system you'll typically see two or three types of collection, depending on the generation (how many GC cycles they have survived) of the objects being collected. The shortest GCs in those systems are usually a couple of milliseconds, while the longest ones can be many seconds.

GC isn't always a problem. If your application isn't latency sensitive, it's not a big deal. Though if you tune your network timeouts to be too low, even something that is not really latency sensitive can have trouble, because GC can cause network connections to time out. Even if it is a latency-sensitive application, it can be OK as long as the GC's "stop the world" pauses (pauses that stop program execution) are short.

One reason you'll see people say GCs are bad is for those latency sensitive applications. For example, I previously worked on distributed datastores where low latency responses were critical. If our 99th percentile response times jumped over say 250ms, that would result in customers calling our support line in massive numbers. These datastores ran on the JVM, where at the time G1GC was the state of the art low-latency GC. If the systems were overloaded or had badly tuned GC parameters, GC times could easily spike into the seconds range.

Other considerations are GC throughput and CPU usage. GC systems can use a lot of CPU. That's often the tradeoff you'll see for these low-latency GC implementations. GC's also can put a cap on memory throughput. How much memory can the GC implementation examine with how much CPU usage with what amount of stop-the-world time tends to be the nature of the question.


> Say I'm doing a drawing/game app and creating a few hundred heap objects a second that need to get garbage collected.

Was literally my job ten years ago to optimize this and I was struggling with a GC'd language with a proprietary implementation (flash+actionscript).

The problem is not with hundreds of heap objects per-frame, the problem is that they would accumulate to the tens of thousands before the first GC trigger happens.

And the GC trigger might happen in the middle of drawing a frame, even worse, at the end of drawing a frame (which means even a 10ms pause means you miss the 16ms frame window at 60fps).

The problem that most people had was that this was unevenly distributed and janky to put it in the lingo. So you'd get 900 frames with no issues and a single frame that freezes.

So most of the problem people have with GC pauses is the unpredictability of it and the massive variations in the 99th percentile latency in the system, making it look slower than it actually is.

Most of the original GC implementations scaled poorly as memory sizes went up and the amount of possible garbage went up, until GC models started switching over to garbage-first optimizations, thread-local allocation buffers, survivor generations + heap reserves, etc. (i.e. we have lots of memory; our problem is with the object-walking overheads, so small objects with lots of references are bad).

The GC model is actually pretty okay, but it is still unpredictable enough that tuning the GC or building an application on top of a GC'd language which has strict latency requirements is hard.

However, as a counterpoint - OpenHFT.

Clearly it is possible, but it takes a lot of alignment across all the system layers, and at that point you might as well write C++, because the result is not portable enough to run anywhere anyway.


It really depends on the application and the complexity of the object graph. Short-lived objects usually have low overhead. Long-lived objects on a huge heap may cause a problem.

In the past GC had a bad reputation for increased and unpredictable latencies. In old JVMs the GC would pause execution to traverse the object graph.

In general, do not worry about GC unless you run into performance issues. If performance is a problem, run a continuous profiler such as Flight Recorder. It has very little overhead.
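
For example, Flight Recorder can be started from the command line with something like the following (option syntax from recent JDKs; adjust the duration and filename to taste):

    java -XX:StartFlightRecording=duration=120s,filename=profile.jfr -jar app.jar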


And in most cases it isn't GC which is the problem, but the program doing too many heap allocations. Cutting heap allocations down improves the speed of most programs, with or without GC.


"It depends." There is no such thing as a singular "GC", tradeoffs are everywhere, and there are more relevant metrics than just average collection time.

I recommend this paper from 1992 for an introduction of what types of GC techniques exist https://www.cs.rice.edu/~javaplt/411/15-spring/Readings/wils... It's good basic knowledge.

If you get through that you might want to peruse more modern techniques: https://gchandbook.org/ I think the book's a lot drier/harder to understand than the paper though.

As a counter-signal consider my comment to be dismissive of languages without a GC, because life is too short to deal with manual memory management outside of where you really need to do it, and in those cases anyway there are techniques you can use in GC languages for it that reduce to about the same effort as manual memory management.


Do you have any links about recent developments / GC algorithms? Thanks.


Not on hand, sorry, I don't follow it that closely. My advice if you're interested in ultra low pause time designs is to look more at ZGC and maybe Shenandoah, and look for detailed comparisons to the other JVM GCs (especially G1 which has also seen some advances since it originally was designed). I'm not aware of any other ecosystem besides the JVM's that's had as much GC innovation in the last 15 years (one has to include Azul's now-old proprietary work to get that time span since ZGC is just now catching up to it) -- not to mention exposure to real and varied workloads. And if you just restrict to OpenJDK, the last 2 or 3 years have seen a lot of improvement that even many Java programmers still have no idea of if they're still stuck in Java 8 land.


After some research I couldn't really find much of an answer.

The thing about GC is you either don't care at all, or you don't want it at all. There's rarely a case where you know how many GC cycles you can handle in a certain period. Web dev? GC all you want. Games can handle GC, but you'll likely need to be cognizant of memory use. Embedded stuff doesn't have enough memory to utilize a GC.

I'm not sure why GC languages get so much hate. I do a lot with C#, and the runtime gives a few options for controlling allocations and accessing memory, so I can usually get it to be fast enough.


Really impressive results.

Sorry for my ignorance on the topic, but will this have any impact on other JVM languages or will this mostly only benefit Java itself?

I realize even though I use JVM languages now and then I do not really know if they use their own GC implementation or make use of Java's. Does this differ between the languages maybe?


The JVM has its own garbage collector. Every language uses it.

There may be tiny differences in the way code generators and optimizers work, which mean they may not get exactly the same properties out of equivalent code. For example, if they're generating a lot of objects behind the scenes, the GC improvements might help more, or less, or even do worse.

But that's the kind of thing that's really dependent on the algorithm you've implemented. So most likely you get some benefit for free. If you don't, you'll need to benchmark to find out. The optimizers do a lot of work for you (and the JVM does a ton of language-independent optimization), but some things are up to experiment.


This will work for any language that runs on top of the JVM. That's the beauty of the JVM: improvements benefit all its languages.


Sub-millisecond GC pauses are very impressive. Though one thing that's not clear to me is whether this holds only for very large heaps, or whether it will also be great for typical service/microservice heaps in the 4-32 GB range.


ZGC pause times will be the same regardless of heap size. ZGC currently supports heaps from 8MB to 16TB. So if you have 4-32GB heaps and want low latency, then ZGC is definitely something to try.


Ah, you are the author of the article :). Thanks for replying! Does ZGC compromise on throughput compared to G1 to achieve low pause times?


ZGC in its current form trades a bit of throughput performance for better latency. This presentation provides some more details and some performance numbers (a link to the slides is also available there). https://malloc.se/blog/zgc-oracle-developer-live-2020


Hey, is there any benchmark comparing the throughput performance of ZGC vs G1 etc.? How much of a hit (performance-wise) would one take for getting this awesome pause-time limit?


Here is a quite elaborate one, though it is not totally up-to-date:

https://jet-start.sh/blog/2020/06/09/jdk-gc-benchmarks-part1


It works great even for large heap sizes. I moved my ES cluster (running with around 92G heap size) from G1GC to ZGC and saw huge improvements in GC. Best part about ZGC is you don't need to touch any GC parameter and it autotunes everything.


Whether G1 or ZGC are the best choice depends on the workload and requirements, but G1 in recent JDK versions also requires virtually no tuning (if your G1 usage had flags other than maximum heap size, maybe minimum heap size, and maybe pause target, try again without them).


>running with around 92G heap size

I'm curious about this choice. The elasticsearch documentation recommends a maximum heap slightly below 32GB [1].

Is this not a problem anymore with G1GC/ZGC, or are you simply "biting the bullet" and using 92G of heap because you can't afford to scale horizontally?

1: https://www.elastic.co/guide/en/elasticsearch/reference/7.11...


Heaps "slightly below 32GB" are usually because of the -XX:+UseCompressedOops option, which allows Java to address up to 32GB of memory with a smaller pointer. Between 32-35GB of heap, you're just paying off the savings you would have gotten with compressed object pointers, but if you keep cranking your heap further after that, you'll start getting benefits again.


This, exactly. One added issue is that ZGC does NOT support compressed oops at all.


> because you can't afford to scale horizontally?

It doesn't have to be about affordability; rather, it's often more efficient and cheaper to scale vertically first, both in monetary costs and in time/maintenance costs.


On hardware, but not on a cloud setup? We run several hundred big ES nodes on AWS, and I believe we stick to the heap sizing guidelines (though I’ve long wondered if fewer instances with giant heaps might actually work ok, too)


Cloud is trickier to price than real hardware. On real hardware, filling the ram slots is clearly cheaper than buying a second machine, if ram is the only issue. If you need to replace with higher density ram, sometimes it's more cost effective to buy a second machine. Adding more processor sockets to get more ram slots is also sometimes more, sometimes less cost effective than adding more machines. Often, you might need more processing to go with the ram, which can change the balance.

In the cloud, with defined instance types, usually more RAM comes with more of everything else, and from pricing listed at https://www.awsprices.com/ in US East, it looks like within an instance type, $/GB of RAM is usually consistent. The least expensive (per unit of RAM) class of instances is x1/x1e, which range from 122 GB to 3904 GB, so that does lean towards bigger instances being cost effective.

Exceptions I saw are c1.xlarge is less expensive than c1.medium, c4.xlarge is less than other c4 types and c4 is more expensive than others, m1.medium < m1.large == m1.xlarge < m1.small, m3.medium is more expensive than other m3, p2.16xlarge is more expensive than other p2, t2.small is less expensive than other t2. Many of these differences are a tenth of a penny per hour though.


Please specify Elasticsearch & JDK version. Also, index size and heap size per node.

From my experience, high heap sizes are unnecessary since Lucene (used by ES) has greatly reduced heap usage by moving things off-heap[1].

[1] - https://www.elastic.co/blog/significantly-decrease-your-elas...


How (and how much) did these improvements manifest? For example, did you measure consistently faster response times when running ZGC rather than G1GC? If so, by how much? I’m always looking for a way to improve ES response times for our users.


We mainly capture GC metrics and alert on them. One good thing that happened is that there are no longer GC-related alerts firing in production. Tail latency for API calls from Kibana to ES also improved.


Did you notice a change in the peak memory usage?


I think the whole point is the pause time doesn't vary with the heap size.


Is there any chance this could make its way into Graal? Particularly native image?

AOT compiled, PGO optimized, statically linked executables, with 0.5ms worst case GC pause times, sounds like the holy grail for me.


There'd have to be a new implementation rewritten from scratch for native-image (the native-image GC is actually written in Java itself, while ZGC is written in C++), but I don't see any reason why it would be impossible.

Note that Oracle does have intentions to commercialize similar features; there's already a low latency GC for native-image apps, but it's inspired by G1, not ZGC, I believe, and only available in Graal Enterprise Edition. So any such ultra-low-latency GC might be in a similar basket, unless it was implemented by an outside party or something. Or Oracle changes course on this.

If you don't use native-image and can use a normal JDK to host your Graal app (by using Graal as the JVMCI compiler), then you might be able to use ZGC today with that setup, assuming support for ZGC read barriers has been implemented in Graal.

EDIT: apparently not yet :( https://github.com/oracle/graal/issues/2149


Question: is this GC suitable for use with something like IDEA, or is it more for server workloads? Would it reduce UI lag from GC pauses accordingly?


I confirm that ZGC works with IntelliJ IDEA, and it seems to me that it makes IDEA respond quite a bit faster. It's not hard to get IntelliJ IDEs to use ZGC by editing the VM options file.
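
For anyone who wants to try it: in IntelliJ that's roughly "Help > Edit Custom VM Options", then adding something like the lines below (the UnlockExperimentalVMOptions line should only be needed if the bundled runtime is older than JDK 15, where ZGC was still experimental):

    -XX:+UnlockExperimentalVMOptions
    -XX:+UseZGC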


I tried it last week actually with OpenJDK15+Windows. With JDK16, IntelliJ didn't boot.

-XX:+UseZGC: Memory usage for my project dropped to a constant 600 megs. Using the IDE felt just as fast as the normal experience.

-XX:+UseShenandoahGC, -Xmx4g. Shenandoah GC used a constant 4 gigs of ram. It was a slower user experience for me.

In the end, I went back to the default settings, because the custom JDK changes the look and feel and I don't like it.


>-XX:+UseZGC: Memory usage for my project dropped to a constant 600 megs. Using the IDE felt just as fast as the normal experience.

Shame, I hoped it would feel faster than the normal experience, with (even infrequent) user-felt GC pauses completely eliminated.


Java again shows its superiority in the managed-language world.


>After reaching that initial 10ms goal, we re-aimed and set our target on something more ambitious. Namely that a GC pause should never be longer than 1ms. Starting with JDK 16, I’m happy to report that we’ve reached that goal too. ZGC now has O(1) pause times. In other words, they execute in constant time and do not increase with the heap, live-set, or root-set size (or anything else for that matter). Of course, we’re still at the mercy of the operating system scheduler to give GC threads CPU time. But as long as your system isn’t heavily over-provisioned, you can expect to see average GC pause times of around 0.05ms (50 µs) and max pause times of around 0.5ms (500 µs).

Very impressive and well done. Should Azul be worried?


Since ZGC is in OpenJDK it should already be available in Zulu as well

https://github.com/openjdk/jdk/tree/master/src/hotspot/share...


Azul also has a separate closed-source Zing JVM which includes their C4 collector which could be described as an "uncle" of ZGC.



