>> all sorts of edge to giant GPU deployment. Soon the GPU and its associated me...

jsheard · on Feb 22, 2024

Any seperation of a GPU from its VRAM is going to come at the expense of (a lot of) bandwidth. VRAM is only as fast as it is because the memory chips are as close as possible to the GPU, either on seperate packages immediately next to the GPU package or integrated onto the same package as the GPU itself in the fanciest stuff.

If you don't care about bandwidth you can already have a GPU access terabytes of memory across the PCIe bus, but it's too slow to be useful for basically anything. Best case you're getting 64GB/sec over PCIe 5.0 x16, when VRAM is reaching 3.3TB/sec on the highest end hardware and even mid-range consumer cards are doing >500GB/sec.

Things are headed the other way if anything, Apple and Intel are integrating RAM onto the CPU package for better performance than is possible with socketed RAM.

sandworm101 · on Feb 22, 2024

That depends on whether performance or capacity is the goal. Smaller amounts of ram closer to the processing unit makes for faster computation, but AI also presents a capacity issue. If the workload needs the space, having a boatload of less-fast ram is still preferable to offloading data to something more stable like flash. That is where bulk memory modules connected though slots may one day appear on GPUs.

duffyjp · on Feb 22, 2024

I'm having flashbacks to owning a Matrox Millenium as a kid. I never did get that 4MB vram upgrade.

https://www.512bit.net/matrox/matrox_millenium.html

mysterydip · on Feb 22, 2024

Is there a way to partition the data so that a given GPU had access to all the data it needs but the job itself was parallelized over multiple GPUs?

Thinking on the classic neural network for example, each column of nodes would only need to talk to the next column. You could group several columns per GPU and then each would process its own set of nodes. While an individual job would be slower, you could run multiple tasks in parallel, processing new inputs after each set of nodes is finished.

zettabomb · on Feb 22, 2024

Of course, this is common with LLMs which are too large to fit in any single GPU. I believe Deepspeed implements what you're referring to.

weebull · on Feb 23, 2024

No it won't. GPUs are good at ml partly because of the huge memory bandwidth. 1000s of bits wide. You won't find connectors that have that many terminals and maintain signal quality. Even putting a second bank soldered on the same signals can be enough to mess things up.

zettabomb · on Feb 22, 2024

I doubt it. The latest GPUs utilize HBM which is necessarily part of the same package as the main die. If you had a RAM slot for a GPU you might as well just go out to system RAM, way too much latency to be useful.

AnthonyMouse · on Feb 22, 2024

It isn't the latency which is the problem, it's the bandwidth. A memory socket with that much bandwidth would need a lot of pins. In principle you could just have more memory slots where each slot has its own channel. 16 channels of DDR5-8000 would have more bandwidth than the RTX 4090. But an ordinary desktop board with 16 memory channels is probably not happening. You could plausibly see that on servers however.

What's more likely is hybrid systems. Your basic desktop CPU gets e.g. 8GB of HBM, but then also has 16GB of DRAM in slots. Another CPU/APU model that fits into the same socket has 32GB of HBM (and so costs more), which you could then combine with 128GB of DRAM. Or none, by leaving the slots empty, if you want entirely HBM. A server or HEDT CPU might have 256GB of HBM and support 4TB of DRAM.

brookst · on Feb 22, 2024

Agree, this is likely future. It’s really just an extension of The existing tiered CPU cache model

ltbarcly3 · on Feb 22, 2024

I don’t think you really understand the current trends in computer architecture. Even cpus are being moved to have on package ram for higher bandwidth. Everything is the opposite of what you said.

sandworm101 · on Feb 22, 2024

Higher bandwidth but lower capacity. The real trend is different physical architectures for different compute loads. There is a place in AI for bulk albeit slower memory such as extremely large date sets that want to run internally on a discreet card without involving pci lanes.

ltbarcly3 · on Feb 24, 2024

This is also not true. You can transfer from main memory to cards plenty fast enough that it is not a bottleneck. Consumer GPU's don't even use pcie5 yet, which doubles the bandwidth of 4. Professional datacenter cards don't use pcie AT ALL, but they do put a huge amount of RAM on the package with the GPUs.