CPU, yes, but more importantly memory bandwidth.

An RTX 3090 (as one example) has nearly 1 TB/s of memory bandwidth (936 GB/s of GDDR6X). You'd need at least 12 channels of the fastest proof-of-concept DDR5 on the planet to match that.
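To put rough numbers on that (a back-of-envelope sketch; I'm taking the 3090's 936 GB/s spec and reading "fastest proof-of-concept DDR5" as DDR5-10000, i.e. 10,000 MT/s over a 64-bit channel):

    # Each DDR5 channel moves 8 bytes per transfer.
    gpu_bw_gbs = 936                        # RTX 3090, GDDR6X
    ddr5_channel_gbs = 10_000 * 8 / 1000    # DDR5-10000 -> 80 GB/s per channel
    print(gpu_bw_gbs / ddr5_channel_gbs)    # ~11.7, so 12 channels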

If you have a discrete GPU, use an implementation that takes advantage of it; it's a completely different story.

Apple Silicon posts impressive numbers on LLM inference because it has a unified, high-bandwidth CPU+GPU memory architecture (400 GB/s on the M1 Max, IIRC).
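The reason bandwidth dominates: for single-stream token generation, essentially every weight has to be read from memory once per token, so memory bandwidth divided by model size in bytes puts a hard ceiling on tokens/sec. A rough sketch (the 7B model size, 4-bit quantization, and bandwidth figures are illustrative assumptions, not measurements):

    def max_tokens_per_sec(mem_bw_gbs, params_b, bytes_per_param):
        """Bandwidth-bound ceiling: all weights streamed once per token."""
        return mem_bw_gbs * 1e9 / (params_b * 1e9 * bytes_per_param)

    # 7B-parameter model, 4-bit quantized (~0.5 bytes/param):
    print(max_tokens_per_sec(400, 7, 0.5))  # unified memory: ~114 tok/s ceiling
    print(max_tokens_per_sec(102, 7, 0.5))  # 2ch DDR5-6400:  ~29 tok/s ceiling

Real throughput lands well below these ceilings, but the ratio between them is why the same model feels so much faster on high-bandwidth memory.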
