None of those games were designed for UMA nor do they benefit as the API design ...

SigmundA · on May 17, 2021

From the article: "Meanwhile, unlike the CPU side of this transition to Apple Silicon, the higher-level nature of graphics programming means that Apple isn’t nearly as reliant on devs to immediately prepare universal applications to take advantage of Apple’s GPU. To be sure, native CPU code is still going to produce better results since a workload that’s purely GPU-limited is almost unheard of, but the fact that existing Metal (and even OpenGL) code can be run on top of Apple’s GPU today means that it immediately benefits all games and other GPU-bound workloads."

https://developer.apple.com/documentation/metal/setting_reso...

https://developer.apple.com/documentation/metal/synchronizin...

"Note In a discrete memory model, synchronization speed is constrained by PCIe bandwidth. In a unified memory model, Metal may ignore synchronization calls completely because it only creates a single memory allocation for the resource. For more information about macOS memory models and managed resources, see Choosing a Resource Storage Mode in macOS."

I am not trying to minimize the other engineering improvements, however I do believe there may be less credit being given to the UMA than deserved due to past lackluster UMA offerings. As I said it will be interesting to see how far Apple can scale UMA I am not sure they can catch discrete graphics but I am starting to think they are going to try.

kllrnohj · on May 17, 2021

To leverage shared memory in Metal you have to target Metal. Otherwise take for example glTexImage2D: https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/gl...

Apple can't just hang onto that void* that's passed in as the developer is free to re-use for something else after the call. It must copy, even on a UMA system. And even if it was adjusted such that glTexImage2D took ownership of the pointer instead, there'd still be an internal copy anyway to swizzle it as linear RGBA buffers are not friendly to typical GPU workloads. This is why for example per Apple's docs above when it gets to the texture section it's like "yeah just copy & use private." So even though in theory Metal's UMA exposure would be great for games that stream textures, it still isn't because you still do a copy anyway to convert it to the GPU's internal optimal layout.

Similarly the benefits of UMA only help if transfering data is actually a significant part of the workload, which is not true for the vast majority of games. For things like gfxbench it may help speedup the load time, but during the benchmark loop all the big objects are only used on the GPU (like textures & models)

SigmundA · on May 17, 2021

I believe most of the benchmarks where Metal based in the Anand article, also PBO have been around for quiet a while in OpenGL:

https://developer.apple.com/library/archive/documentation/Gr...

Any back and forth between CPU and GPU will be faster with unified memory especially with a coherent on die cache.

This is the same model from iOS so just about anyone doing metal will already be optimizing for it same with any other mobile development.

It doesn't seem like a minor architectural difference to me:

"Comparing the two GPU architectures, TBDR has the following advantages:

It drastically saves on memory bandwidth because of the unified memory architecture. Blending happens in-register facilitated by tile processing. Color, depth and stencil buffers don’t need to be re-fetched."

https://metalkit.org/2020/07/03/wwdc20-whats-new-in-metal.ht...

kllrnohj · on May 17, 2021

> I believe most of the benchmarks where Metal based in the Anand article

But that doesn't tell you anything. Being Metal-based doesn't mean they were designed nor benefit from UMA.

Especially since, again, Apple's own recommendation on big data (read: textures) is to copy it.

> Any back and forth between CPU and GPU will be faster with unified memory especially with a coherent on die cache.

Yes, but games & gfxbench don't do this which is what I keep trying to get across. There are workloads out there that will benefit from this, but the games & benchmarks that were run & being discussed aren't them. It's like claiming the sunspider results are from wifi 6 improvements. There are web experiences that will benefit from faster wifi, but sunspider ain't one of them.

Things like GPGPU compute can benefit tremendously here, for example.

> also PBO have been around for quiet a while in OpenGL:

PBO's reduce the number of copies from 2 to 1 in some cases, not from 1 to 0. You still copy from the PBO to your texture target, but it can potentially avoid a CPU to CPU copy first. When you call glTexImage2D it doesn't necessarily do the transfer right then, it instead may copy to a different CPU buffer to later be copied to the GPU.

> "Comparing the two GPU architectures, TBDR has the following advantages:

> It drastically saves on memory bandwidth because of the unified memory architecture. Blending happens in-register facilitated by tile processing. Color, depth and stencil buffers don’t need to be re-fetched."

> https://metalkit.org/2020/07/03/wwdc20-whats-new-in-metal.ht...

Uh, that blogger seems rather confused. TBDR has nothing to do with UMA, nor is Nvidia or AMD immediate mode anymore.

Heck, Mali was doing TBDR long before it was ever used on a UMA SoC.

SigmundA · on May 18, 2021

First API's don't support it, can't pin memory (which is what a PBO does). Then oh well they are not taking advantage of it. Move the goal post much?

TBDR came to prominence in UMA mobile architectures, it's a big part of what allows it to perform so well with limited memory bandwidth. The M1 is just an evolution of Apples mobile designs and PowerVR before that.

Mali GPU's are UMA and alway have been AFAIK

https://community.arm.com/developer/tools-software/graphics/...

kllrnohj · on May 18, 2021

> First API's don't support it, can't pin memory (which is what a PBO does). Then oh well they are not taking advantage of it. Move the goal post much?

No, they don't, so no, I didn't move the goal posts at all. PBOs are a transfer object. You cannot sample from them on the GPU. The only thing you can do with PBOs is copy them to something you can use on the GPU.

As such, PBOs do not let you take advantage of UMA. In fact, their primary benefit is for non-UMA in the first place. UMA systems have no issues blocking glTexImage2D until the copy to GPU memory is done, but non-UMA ones do. And non-UMA ones are what gave us PBOs.

> TBDR came to prominence in UMA mobile architectures, it's a big part of what allows it to perform so well with limited memory bandwidth.

Support that with a theory or evidence of literally any kind. There's nothing at all in TBDR's sequence of events that has any apparent benefit from UMA.

Here: https://developer.arm.com/solutions/graphics-and-gaming/deve...

Look at that the sequence of steps. ARM doesn't even bother including a CPU in there, so which step would UMA be helping with?

What UMA can do here is improve the power efficiency by reducing the cost of sending the command buffers to the GPU, but that's not going to get you a performance improvement as those command buffers are not very big. If sending data from the CPU to GPU was such a severe bottleneck then you'd see the impact of things like reducing the PCI-E bandwidth on discreet GPUs, but you don't.

astrange · on May 17, 2021

The modern approach to textures is to precompile them, so you can hand the data straight over. It's not as common to have to convert a linear to swizzled texture, though it can happen.

Also, the Apple advice for OpenGL textures was always focused on avoiding unnecessary copies. (for instance, there's another one that could happen CPU side if your data wasn't aligned enough to get DMA'd)

One reason M1 textures use less memory is the prior systems had AMD/Intel graphic switching and so you needed to keep another copy of everything in case you switched GPUs.

bombcar · on May 17, 2021

As SigmundA points out a huge advantage Apple has is control of the APIs (Metal, etc) and the ability to structure them years ago so that the API can simply skip entire things (even when ordered to do them) as it's known it's not needed. An analogy would be a copy-on-write filesystem (or RAM!) that doesn't actually do a copy when asked to, it returns immediately with a pointer, and only copies if asked to write to it.