It'd be really nice to have good memory bandwidth usage metrics collected from a wide range of devices while doing inference.
For example, how close does it get to the peak, and what's the median bandwidth during inference? And is that bandwidth, rather than some other clever optimization elsewhere, actually what's driving the Mac's performance?
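For the memory-bound token-generation phase there's a well-known back-of-envelope check: each generated token has to stream roughly all of the model's weights from memory once, so achieved bandwidth is about model size times tokens per second. A minimal sketch (all numbers below are hypothetical examples, not measurements from any real device):

```python
# Back-of-envelope bandwidth estimate for memory-bound LLM decoding:
# each token read streams ~all weights once, so
#   effective bandwidth ≈ model_size_bytes * tokens_per_second

def effective_bandwidth_gbs(model_size_gb: float, tokens_per_sec: float) -> float:
    """Estimate achieved memory bandwidth in GB/s during token generation."""
    return model_size_gb * tokens_per_sec

def utilization(model_size_gb: float, tokens_per_sec: float, peak_gbs: float) -> float:
    """Fraction of the device's peak bandwidth actually achieved."""
    return effective_bandwidth_gbs(model_size_gb, tokens_per_sec) / peak_gbs

# Hypothetical example: a 4 GB quantized model decoding at 60 tok/s
# on hardware with 400 GB/s peak memory bandwidth.
bw = effective_bandwidth_gbs(4.0, 60.0)
print(f"effective ≈ {bw:.0f} GB/s, {utilization(4.0, 60.0, 400.0):.0%} of peak")
```

This ignores the KV cache and prefill compute, so it's only a rough lower bound on traffic, but it's enough to tell whether a given tok/s number is plausibly bandwidth-limited.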
Personally, I don't develop HPC stuff on a laptop - I am much more interested in what a modern PC with Intel or AMD and Nvidia can do, when maxed out. But it's certainly interesting to see that some of Apple's architecture decisions have worked out well for local LLMs.