What's frustrating is that there's no real reason regular DDR5 can't reach 1TB/sec given a sufficient number of channels. The manufacturers are just holding it back so they can drip-feed that memory bandwidth over several generations. Except for Apple, which lets you have 800GB/sec now, and will let you have 1TB/sec+ in M5 Ultra next year. It's $$$$, but still - the true alternatives are much less cost-effective.
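A quick napkin-math sketch of what "sufficient number of channels" means, using published DDR5 transfer rates and the standard 64-bit channel width (the channel counts are my arithmetic, not any vendor's roadmap):

```python
# Rough estimate: how many standard 64-bit DDR5 channels are needed
# for ~1 TB/s of theoretical peak bandwidth. Numbers are back-of-the-
# envelope, ignoring efficiency losses.

def ddr5_channel_bw_gbs(mt_per_s: float, bus_bits: int = 64) -> float:
    """Peak bandwidth of one DDR5 channel in GB/s (MT/s * bytes per transfer)."""
    return mt_per_s * (bus_bits / 8) / 1000

for speed in (4800, 6400, 8000):
    per_ch = ddr5_channel_bw_gbs(speed)
    print(f"DDR5-{speed}: {per_ch:.1f} GB/s per channel, "
          f"~{1000 / per_ch:.0f} channels for 1 TB/s")
```

That works out to roughly 20 channels of DDR5-6400 or 16 of DDR5-8000 - wide, but not exotic; Apple's Ultra parts already use a comparably wide (LP)DDR5 bus to hit 800GB/sec.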
It can be relatively cheap too, under the constraints imposed by typical AI workloads, at least when it comes to getting to 1TB/s or so. All you need is high-spec DDR5 and _a ton_ of memory channels in your SoC. Transformer inference is largely bandwidth-bound, so those parallel, multichannel reads are easy to put to use. I get why you'd need HBM and several TB/s of memory bandwidth for extremely memory-intensive training workloads. But for inference, 1TB/s gives you a lot to work with (especially if your model is a MoE), and it doesn't have to be ultra expensive.
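To make "a lot to work with" concrete, here's the usual bandwidth-bound decode estimate: each generated token has to stream the active weights from memory once, so tokens/sec is roughly bandwidth divided by active-parameter bytes. The model sizes below are hypothetical examples of mine, not anything from the thread:

```python
# Napkin math for bandwidth-bound decode throughput.
# Assumes every token reads all *active* weights once; ignores KV cache,
# batching, and cache effects.

def tokens_per_sec(bandwidth_gbs: float, active_params_b: float,
                   bytes_per_param: float) -> float:
    return bandwidth_gbs / (active_params_b * bytes_per_param)

bw = 1000  # ~1 TB/s
print(f"70B dense @ 8-bit:       ~{tokens_per_sec(bw, 70, 1):.0f} tok/s")
print(f"MoE, ~20B active @ 8-bit: ~{tokens_per_sec(bw, 20, 1):.0f} tok/s")
```

So a dense 70B model lands in the mid-teens of tokens per second at 1TB/s, while a MoE that only activates a fraction of its weights per token gets several times that from the same bus - which is why MoE plus wide DDR5 is such a good fit for inference boxes.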