What's frustrating is that there's no real reason regular DDR5 can't reach 1TB/sec given a sufficient number of channels. The manufacturers are just holding it back so they can drip-feed that memory bandwidth over several generations. Except for Apple, which lets you have 800GB/sec now, and will let you have 1TB/sec+ in M5 Ultra next year. It's $$$$, but still - the true alternatives are much less cost-effective.
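A quick napkin-math sketch of what "sufficient number of channels" means, using published DDR5 transfer rates and the standard 64-bit channel width (the channel counts are my arithmetic, not any vendor's roadmap):

```python
# Rough estimate: how many standard 64-bit DDR5 channels are needed
# for ~1 TB/s of theoretical peak bandwidth. Numbers are back-of-the-
# envelope, ignoring efficiency losses.

def ddr5_channel_bw_gbs(mt_per_s: float, bus_bits: int = 64) -> float:
    """Peak bandwidth of one DDR5 channel in GB/s (MT/s * bytes per transfer)."""
    return mt_per_s * (bus_bits / 8) / 1000

for speed in (4800, 6400, 8000):
    per_ch = ddr5_channel_bw_gbs(speed)
    print(f"DDR5-{speed}: {per_ch:.1f} GB/s per channel, "
          f"~{1000 / per_ch:.0f} channels for 1 TB/s")
```

That works out to roughly 20 channels of DDR5-6400 or 16 of DDR5-8000 - wide, but not exotic; Apple's Ultra parts already use a comparably wide (LP)DDR5 bus to hit 800GB/sec.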
It can be relatively cheap too, under the constraints imposed by typical AI workloads, at least when it comes to getting to 1TB/s or so. All you need is high-spec DDR5 and _a ton_ of memory channels in your SoC. Transformer inference is largely bandwidth-bound, so those parallel, multichannel reads are easy to put to use. I get why you'd need HBM and several TB/s of memory bandwidth for extremely memory-intensive training workloads. But for inference, 1TB/s gives you a lot to work with (especially if your model is a MoE), and it doesn't have to be ultra expensive.
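To make "a lot to work with" concrete, here's the usual bandwidth-bound decode estimate: each generated token has to stream the active weights from memory once, so tokens/sec is roughly bandwidth divided by active-parameter bytes. The model sizes below are hypothetical examples of mine, not anything from the thread:

```python
# Napkin math for bandwidth-bound decode throughput.
# Assumes every token reads all *active* weights once; ignores KV cache,
# batching, and cache effects.

def tokens_per_sec(bandwidth_gbs: float, active_params_b: float,
                   bytes_per_param: float) -> float:
    return bandwidth_gbs / (active_params_b * bytes_per_param)

bw = 1000  # ~1 TB/s
print(f"70B dense @ 8-bit:       ~{tokens_per_sec(bw, 70, 1):.0f} tok/s")
print(f"MoE, ~20B active @ 8-bit: ~{tokens_per_sec(bw, 20, 1):.0f} tok/s")
```

So a dense 70B model lands in the mid-teens of tokens per second at 1TB/s, while a MoE that only activates a fraction of its weights per token gets several times that from the same bus - which is why MoE plus wide DDR5 is such a good fit for inference boxes.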