GPUs can now use PCIe-attached memory or SSDs to boost VRAM capacity (tomshardware.com)
53 points by ohmyblock 10 months ago | 14 comments



It's CXL, not PCIe. CXL latency is much more like a NUMA hop, which makes this much more likely to be useful than trying to use host memory over PCIe.

CXL 3.1 was the first spec to add a way for a host CPU to share its own memory (host to host) and itself be part of RDMA. It seems like that won't look exactly like any other CXL memory device, so it'll take some effort before other hosts, or even the local host, can take advantage of it. https://www.servethehome.com/cxl-3-1-specification-aims-for-...


Good job decreasing latency.

Now work on the bandwidth.

A single HBM3 module has the bandwidth of half a dozen data-center-grade PCIe 5.0 x16 NVMe drives.

A single DDR5 DIMM has the bandwidth of a pair of PCIe 5.0 x4 NVMe drives.
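
For reference, a back-of-envelope with peak interface figures (my numbers, not the parent's; real drives deliver less than their link's theoretical peak):

    /* Peak interface bandwidths, assumed: PCIe 5.0 = 32 GT/s/lane with
       128b/130b encoding, DDR5-4800 DIMM = 4800 MT/s * 8 B,
       HBM3 stack = 6.4 Gb/s/pin * 1024 pins. */
    #include <stdio.h>

    int main(void) {
        const double pcie5_lane = 32e9 * (128.0 / 130.0) / 8 / 1e9; /* ~3.94 GB/s */
        const double pcie5_x4   = 4  * pcie5_lane;                  /* ~15.8 GB/s */
        const double pcie5_x16  = 16 * pcie5_lane;                  /* ~63.0 GB/s */
        const double ddr5_dimm  = 4800e6 * 8 / 1e9;                 /*  38.4 GB/s */
        const double hbm3_stack = 6.4e9 * 1024 / 8 / 1e9;           /* 819.2 GB/s */

        printf("HBM3 stack vs PCIe 5.0 x16 link: %.1fx\n", hbm3_stack / pcie5_x16); /* ~13x  */
        printf("DDR5 DIMM  vs PCIe 5.0 x4 link:  %.1fx\n", ddr5_dimm / pcie5_x4);   /* ~2.4x */
        return 0;
    }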


"New RISC-V microprocessor can run CPU, GPU, and NPU workloads simultaneously" https://news.ycombinator.com/item?id=39938538


Perhaps this would be a good application for 3D XPoint memory that was seemingly discontinued due to lack of a compelling use case.


Optane definitely had many great uses. It had stunningly good IOPS with very low latency, fantastic endurance, and no write-amplification concerns. Optane was excellent for databases, just pricey! Far more pricey than Intel had promised initially, which was a disappointment, but still somewhat in league with enterprise SSDs of the time.

If you really wanted very low latency you needed Optane DIMMs. And that was problematic, because typically you wanted motherboards loaded with RAM. It also made it complex to figure out how to use those DIMMs, the parts of memory that would be slower but persistent. Using the DIMMs was hard.

But Optane existed as a damned fine NVMe product too! The main downside was that latency wasn't as good, because it sat behind PCIe. CXL could remove this penalty and potentially make it look more like RAM that's a NUMA hop away, which would be grand. That isn't really required to use Optane well; one can still get epic IOPS at incredibly consistent low latency and prosper. But if you do have a latency-sensitive demand, it certainly can help!

Poor Optane. I have a hard time understanding how something of such excellent value floundered so. In truth there aren't that many people who need many drive-writes-per-day, but even if you didn't, the promise was that this drive should last you a very, very long time because it had such endurance. That long-term sustainability seemed like an incredible value we simply failed to recognize and tap.


> It also made it complex to figure out how to use those DIMMs, the parts of memory that would be slower but persistent. Using the DIMMs was hard.

CXL changes the game due to its cache coherency protocol. You don't have to care, precisely because the hardware deals with this transparently. It is just one giant address space. You don't need slow OS-level page faults or page-table updates every time something is loaded or unloaded from memory.
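
As a minimal sketch of what that means for software (assuming the CXL expander shows up to Linux as a CPU-less NUMA node, node id 2 picked arbitrarily, and libnuma installed), the expander is just another place to put an allocation; no special driver calls, paging, or copy API:

    /* Build with: gcc cxl_alloc.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

        int    cxl_node = 2;          /* assumed node id of the CXL expander */
        size_t len      = 1UL << 30;  /* 1 GiB */

        void *buf = numa_alloc_onnode(len, cxl_node);
        if (!buf) { perror("numa_alloc_onnode"); return 1; }

        /* Plain loads and stores; coherency is handled by the hardware,
           so there is no explicit staging of data into "near" RAM. */
        memset(buf, 0, len);

        numa_free(buf, len);
        return 0;
    }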

The biggest problem with persistent memory is building an application with transactional semantics. All the hardware and software transactional memory is built around concurrency, not persistence. When you think about it, that is kind of backwards. Persistent memory has very loose performance requirements, since I/O is assumed to be slow. Meanwhile, parallelism and concurrency are about increasing performance, so it defeats the point if they end up slower than going without.
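
To make that concrete, here's a toy sketch (mine, x86-specific, assuming CLWB support and that the log and data live in a DAX-mapped persistent region) of the hand-rolled undo-log transaction you end up writing, where the flushes and fences order stores for durability rather than for other threads:

    /* Build with: gcc -mclwb pmem_tx.c
       Recovery rule: if log->valid is set after a crash, write log->old back
       to log->addr, then clear valid. */
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    static void persist(const void *p, size_t n) {
        for (uintptr_t a = (uintptr_t)p & ~63UL; a < (uintptr_t)p + n; a += 64)
            _mm_clwb((void *)a);   /* write the cache line back toward media */
        _mm_sfence();              /* order the write-backs before continuing */
    }

    struct undo { uint64_t *addr; uint64_t old; int valid; };

    /* Durably set *slot = val, recoverable if we crash mid-update. */
    void tx_store(struct undo *log, uint64_t *slot, uint64_t val) {
        log->addr = slot; log->old = *slot;                        /* 1. log old value */
        persist(log, sizeof *log);
        log->valid = 1;   persist(&log->valid, sizeof log->valid); /* 2. arm the log   */
        *slot = val;      persist(slot, sizeof *slot);             /* 3. do the update */
        log->valid = 0;   persist(&log->valid, sizeof log->valid); /* 4. commit        */
    }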


Every time I go through old NVMe/M.2 drives in the bins in our IT cage, I sigh and think about what could have been.


Optane was co-owned by a flash memory manufacturer. Maybe an SSD with very long longevity didn't fit their plans.

It may well be that if just one other company had managed to create/license something like Optane, both companies would have been stuck with a large overcapacity for a long time.


I'll just leave this here: https://computeexpresslink.org/wp-content/uploads/2023/12/CX...

Combined with the fact that Intel created both CXL and Optane, it stands to reason that the plan was to combine them eventually. Unfortunately, that never came to pass :(


It was discontinued because it was too expensive to be viable.


Usually cost comes down with volume, so that is also tied to the lack of uses. If a significant use case had been known, it could have been scaled to offset the investment. Things were already developed around the SSD and memory access tiers, with no substantial demand/application for something in between.


Using CPU memory to extend GPU memory seems like a more straightforward approach. Does this method provide any benefits over it?


Depends on the PCIe/DMA topology of the system, but in short: in an ideal system you can avoid the bottleneck of the CPU interconnect (e.g., AMD's Infinity Fabric) and reduce overall CPU load by (un)loading data directly from your NVMe storage to your PCIe accelerator [0]. You can also combine this with RDMA/RoCE (provided everything in the chain supports it) to build a clustered network with NVMe-oF, serving data from high-speed NVMe flash arrays to clusters of GPUs; this can even reduce cost/space/power by reducing the need for high-cost/high-power CPUs. Prior to CXL's proliferation (which realistically we haven't achieved yet), this is mostly limited to bespoke HPC systems; most consumer systems lack the PCIe lanes/topology to really make use of it in a practical way.
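
For a feel of what [0] looks like in practice, here's a hedged sketch of a cuFile read (error handling elided; the file path is made up, and it assumes a GDS-capable stack so the NVMe-to-GPU copy is DMA'd without bouncing through host RAM):

    /* Link with -lcudart -lcufile. */
    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <cuda_runtime_api.h>
    #include <cufile.h>
    #include <fcntl.h>

    int main(void) {
        const size_t len = 1UL << 30;   /* 1 GiB, read straight into VRAM */
        void *dptr;
        cudaMalloc(&dptr, len);

        cuFileDriverOpen();
        int fd = open("/data/weights.bin", O_RDONLY | O_DIRECT);  /* hypothetical path */

        CUfileDescr_t descr = {0};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        CUfileHandle_t fh;
        cuFileHandleRegister(&fh, &descr);

        cuFileBufRegister(dptr, len, 0);
        cuFileRead(fh, dptr, len, 0 /* file offset */, 0 /* device offset */);

        cuFileBufDeregister(dptr);
        cuFileHandleDeregister(fh);
        cuFileDriverClose();
        cudaFree(dptr);
        return 0;
    }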

On the consumer side, you're right: using system RAM is probably a better approach, as most consumer motherboards route the NVMe storage up to the CPU interconnect and then back "down" to the GPU (or worse, through the "southbridge" chipset(s) like on X570), so you take that hit anyway.

However, if you have a PCIe switch on board that allows data to flow directly from storage to GPU without a round trip across the CPU, then NVMe/CXL/SCM modules could theoretically be better than system RAM. It depends on the switch, retimers, muxing, topology, etc.

Regardless of what you're using for direct storage and how ideal your topology is, the MT/s and GB/s over PCIe are significantly slower than onboard VRAM (be it GDDR or especially HBM), and bandwidth-limited to boot. That doesn't mean it's useless by any means, but it's important to point out that this doesn't turn a 20GB VRAM card into a 2.02TB VRAM card just because you DirectStorage'd a 2TB drive to it, no matter how ideal the setup is. However, as PCIe bandwidth increases and storage-class-memory devices (and storage tech in general) continue to improve, it's rapidly becoming more viable. On PCIe Gen 3, you're probably shooting yourself in the foot; on PCIe Gen 6, you can realistically see a very real benefit. But again, there's a lot of "depends" here, and for now you're probably better off buying a bigger GPU or multiple GPUs if you're not on the cutting edge with the corporate credit line.

0: https://developer.nvidia.com/blog/gpudirect-storage/


I wonder if fighting with the CPU for allocation would be a bottleneck. Seems to me the only way to dedicate full bandwidth would be to have a separate PCIe link (as they've done).





