
Any separation of a GPU from its VRAM is going to come at the expense of (a lot of) bandwidth. VRAM is only as fast as it is because the memory chips are as close as possible to the GPU, either on separate packages immediately next to the GPU package or integrated onto the same package as the GPU itself in the fanciest stuff.

If you don't care about bandwidth you can already have a GPU access terabytes of memory across the PCIe bus, but it's too slow to be useful for basically anything. Best case you're getting 64GB/sec over PCIe 5.0 x16, when VRAM is reaching 3.3TB/sec on the highest end hardware and even mid-range consumer cards are doing >500GB/sec.
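To put those numbers in perspective, here's a back-of-the-envelope comparison of how long a single full pass over a hypothetical 80GB set of model weights would take at each bandwidth (the 80GB figure is made up purely for illustration):

    model_bytes = 80e9   # hypothetical 80 GB of weights, for illustration only
    links = {
        "PCIe 5.0 x16":   64e9,    # ~64 GB/sec best case
        "mid-range VRAM": 500e9,   # >500 GB/sec on consumer cards
        "high-end VRAM":  3.3e12,  # ~3.3 TB/sec top-end HBM
    }
    for name, bandwidth in links.items():
        print(f"{name}: {model_bytes / bandwidth:.3f} sec per pass over the weights")

Reading every weight once takes over a second across PCIe but a small fraction of that from VRAM, which is why offloading over the bus kills throughput.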

If anything, things are headed the other way: Apple and Intel are integrating RAM onto the CPU package for better performance than is possible with socketed RAM.




That depends on whether performance or capacity is the goal. Smaller amounts of RAM closer to the processing unit make for faster computation, but AI also presents a capacity problem. If the workload needs the space, having a boatload of less-fast RAM is still preferable to offloading data to persistent storage like flash. That is where bulk memory modules connected through slots may one day appear on GPUs.
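As a minimal sketch of that tradeoff (layer count and sizes made up; this streams weights from host RAM over the bus instead of keeping them resident in VRAM):

    import torch
    import torch.nn as nn

    # Weights live in plentiful-but-distant host RAM; each layer is copied
    # to the GPU only while it's actually being used.
    layers = [nn.Linear(4096, 4096) for _ in range(8)]  # resident in host RAM

    x = torch.randn(32, 4096, device="cuda")
    for layer in layers:
        layer.to("cuda")   # pay the transfer cost per layer
        x = layer(x)
        layer.to("cpu")    # evict to free VRAM for the next layer

Capacity-wise the model can be as big as host RAM; throughput-wise every step is gated on the slow link, which is exactly the tradeoff described above.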


I'm having flashbacks to owning a Matrox Millennium as a kid. I never did get that 4 MB VRAM upgrade.

https://www.512bit.net/matrox/matrox_millenium.html


Is there a way to partition the data so that a given GPU has access to all the data it needs, but the job itself is parallelized over multiple GPUs?

Take a classic feed-forward neural network, for example: each layer (column) of nodes only needs to talk to the next one. You could group several layers per GPU, and each GPU would process its own set of nodes. While an individual job would be slower, you could run multiple tasks in parallel, feeding in a new input as each set of layers finishes.
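What you're describing is essentially pipeline parallelism. A minimal sketch in PyTorch, assuming two CUDA devices and made-up toy layer sizes:

    import torch
    import torch.nn as nn

    # The first group of layers lives on GPU 0, the second on GPU 1;
    # only the activations cross between devices.
    class TwoStageNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
            self.stage1 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

        def forward(self, x):
            h = self.stage0(x.to("cuda:0"))
            return self.stage1(h.to("cuda:1"))

    model = TwoStageNet()
    out = model(torch.randn(32, 1024))

With micro-batching on top, stage 0 can start on the next input while stage 1 is still finishing the previous one, which is the "run multiple tasks in parallel" idea.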


Of course. This is common with LLMs, which are too large to fit on any single GPU. I believe DeepSpeed implements what you're referring to.



