It's not 2 FMAs: with AVX-512 (and going with 32-bit words) that's 2×512/32 = 32 FMA lanes per core, 256 on an 8-core CPU. The unit counts for GPUs - depending on which number you look at - count these lanes separately.
CPUs also have much more complicated program flow control, more versatility, and AFAIK lower latency (⇒ lower flow-control cost) for individual instructions. GPUs, meanwhile, are optimized for raw calculation throughput.
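To make that concrete, here's a minimal sketch (assuming an AVX-512 capable core and gcc/clang with -mavx512f): a single 512-bit FMA instruction covers 16 FP32 lanes, and a core with two 512-bit FMA ports can retire two of these per cycle, i.e. the 32 lanes above.

```c
// One 512-bit FMA = 16 FP32 fused multiply-adds in a single instruction.
// Compile with e.g.: gcc -O2 -mavx512f fma_demo.c
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[16], b[16], c[16], out[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 2.0f; c[i] = 1.0f; }

    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    __m512 vc = _mm512_loadu_ps(c);

    // out = a*b + c, 16 lanes at once; two such ops per cycle per core
    // on a Zen 5 / recent Intel P-core gives the 32 FMA lanes per core.
    __m512 vr = _mm512_fmadd_ps(va, vb, vc);
    _mm512_storeu_ps(out, vr);

    printf("out[5] = %.1f (expect 5*2+1 = 11.0)\n", out[5]);
    return 0;
}
```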
Also note that modern GPUs and CPUs don't have a clear pricing relationship anymore, e.g. a desktop CPU is much cheaper than a high-end GPU, and large server CPUs are more expensive than either.
1x 512-bit FMA or 2x 256-bit FMAs or 4x 128-bit FMAs is irrelevant here - it's still a single physical unit in a CPU that consumes 512 bits of data bandwidth. The question is why the CPU budget allows for 2x 512-bit or 4x 256-bit while the H100, for example, has 14592 FP32 CUDA cores - in AVX terminology that would translate, if I am not mistaken, to 7296x 512-bit or 14592x 256-bit FMAs per clock cycle. Even considering the obvious differences between GPUs and CPUs, this is still a large difference. Since GPU cores operate at much lower frequencies than CPU cores, that is what made me believe this is where the biggest difference comes from.
AIUI an FP32 core is only 32 bits wide, but this is outside my area of expertise really. Also note that CPUs have additional ALUs that can't do FMAs; FMA is just the most capable one.
You're also repeating 2×512 / 4×256 — that's per core, you need to multiply by CPU core count.
[also, note e.g. an 8-core CPU is much cheaper than an H100 card ;) — if anything you'd be comparing the highest end server CPUs here. A 192-core Zen5c is 8.2~10.5k€ open retail, an H100 is 32~35k€…]
[reading through some random docs, a CPU core seems vaguely comparable to an SM; an SM might have 128 or 64 lanes (=FP32 cores) while a CPU core only has 16 with AVX-512, but there is indeed also a notable clock difference and far more flexibility otherwise in the CPU core (which consumes silicon area)]
Nvidia calls them cores to deliberately confuse people, and make it appear vastly more powerful than it really is. What they are in reality is SIMD lanes.
So the H100 (which costs vastly more than a Zen 5...) has 14592 32-bit SIMD lanes, not cores.
A Zen 5 has 16×4 (64) 32-bit SIMD lanes per core, so scale that by core count to get your answer. A higher-end desktop Zen 5 will have 16 cores, so 64×16 = 1024. The Zen 5 also clocks much higher than the GPU, so you can also scale it up by perhaps 1.5-2x (rough numbers are sketched at the end of this comment).
While this is obviously less than the H100, the Zen 5 chip costs $550 and the H100 costs $40k.
There is more to it than this: GPUs also have transcendental functions, texture sampling, and 16-bit ops (which are lacking in CPUs), while CPUs are much more flexible and have powerful byte and integer manipulation instructions, along with full-speed 64-bit integer/double support.
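To put rough numbers on the lane counting and clock scaling above, a back-of-the-envelope sketch; it counts only the FMA pipes (one FMA = 2 FLOPs) and the clock values are assumed round figures, not spec-sheet boost numbers:

```c
// Back-of-the-envelope peak FP32 throughput, counting an FMA as 2 FLOPs.
// The clocks below are assumptions, not measured/guaranteed boost values.
#include <stdio.h>

int main(void) {
    // 16-core Zen 5 desktop part: 2 FMA pipes x 16 FP32 lanes per core.
    double zen5_lanes  = 16 * 2 * 16;     // 512 FMA lanes
    double zen5_clock  = 5.0e9;           // ~5 GHz under all-core AVX-512 load (assumed)
    double zen5_tflops = zen5_lanes * 2 * zen5_clock / 1e12;

    // H100 PCIe: 14592 FP32 CUDA cores (SIMD lanes), 1 FMA per lane per cycle.
    double h100_lanes  = 14592;
    double h100_clock  = 1.75e9;          // ~1.75 GHz boost (assumed)
    double h100_tflops = h100_lanes * 2 * h100_clock / 1e12;

    printf("Zen 5 16-core: ~%.1f FP32 TFLOP/s\n", zen5_tflops);  // ~5.1
    printf("H100 PCIe:     ~%.1f FP32 TFLOP/s\n", h100_tflops);  // ~51
    return 0;
}
```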
Thanks for the clarification on the Nvidia naming, I didn't know that. What I also found is that Nvidia groups 32 SIMD lanes into what they call a warp. Then 4 warp-wide blocks of lanes are grouped into what they call a streaming multiprocessor (SM). And lastly the H100 has 114 SMs, so 4×32×114 = 14592 checks out.
> Zen 5 has 16×4 (64) 32-bit SIMD lanes per core
Right, 2× FMA and 2× FADD, so the highest-end Zen 5 die with 192 cores would total 12288 32-bit SIMD lanes, or half of that if we count only FMA ops. This is then indeed much closer to the 14592 32-bit SIMD lanes of the H100.
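A quick sanity check of the two totals being compared, using the breakdowns above:

```c
// Lane totals: H100 = SMs x partitions x lanes; 192-core Zen 5 = cores x pipes x lanes.
#include <stdio.h>

int main(void) {
    int h100_lanes  = 114 * 4 * 32;        // 14592 FP32 lanes
    int zen5c_lanes = 192 * (2 + 2) * 16;  // 2 FMA + 2 FADD pipes, 16 lanes each
    printf("H100:            %d\n", h100_lanes);
    printf("192-core Zen 5:  %d (FMA-only: %d)\n", zen5c_lanes, zen5c_lanes / 2);
    return 0;
}
```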
There are x86 extensions for fp16/bf16 ops; e.g. both Zen 4 and Zen 5 support AVX512_BF16, which has vdpbf16ps, i.e. a dot product of pairs of bf16 elements from two args; that is, it takes a total of 64 bf16 elts and accumulates into 16 fp32 elts. Zen 5 can run two such instrs per cycle.
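A minimal sketch of what that looks like with intrinsics (assuming a Zen 4/5 or other AVX512_BF16-capable CPU, compiled with -mavx512f -mavx512bf16):

```c
// vdpbf16ps via intrinsics: 64 bf16 inputs -> 16 fp32 accumulations.
// Compile with e.g.: gcc -O2 -mavx512f -mavx512bf16 bf16_demo.c
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float ones[16];
    for (int i = 0; i < 16; i++) ones[i] = 1.0f;

    // Pack two fp32 vectors (16 elements each) into one 32-element bf16 vector.
    __m512   f = _mm512_loadu_ps(ones);
    __m512bh a = _mm512_cvtne2ps_pbh(f, f);   // 32 bf16 elements, all 1.0
    __m512bh b = _mm512_cvtne2ps_pbh(f, f);

    // acc[i] += a[2i]*b[2i] + a[2i+1]*b[2i+1]
    __m512 acc = _mm512_setzero_ps();
    acc = _mm512_dpbf16_ps(acc, a, b);

    float out[16];
    _mm512_storeu_ps(out, acc);
    printf("out[0] = %.1f (expect 1*1 + 1*1 = 2.0)\n", out[0]);
    return 0;
}
```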
Like another poster already said, the power budget of a consumer CPU like the 9950X, which executes programs at about double the clock frequency of a GPU, allows for 16 cores × 2 FMA units × 16 lanes = 512 FP32 FMA per clock cycle. At double the clock that provides the same throughput as an iGPU doing 1024 FP32 FMA per clock cycle, like those in the best laptop CPUs, while consuming about 3 times less power than a datacenter GPU; so, scaled to the same power budget, its performance is like that of a datacenter GPU with 3072 FP32 FMA per clock cycle.
However, because of its high clock frequency a consumer CPU has high performance per dollar, but low performance per watt.
Server CPUs with many cores have much better energy efficiency, e.g. around 3 times higher than a desktop CPU and about the same as the most efficient laptop CPUs. For many generations of NVIDIA GPUs and Intel Xeon CPUs, until about 5-6 years ago, the ratio between their floating-point FMA throughput per watt was only about 3.
This factor of 3 is mainly due to the overhead of various tricks used by CPUs to extract instruction-level parallelism from programs that do not use enough concurrent threads or array operations, e.g. superscalar out-of-order execution, register renaming, etc.
In recent years, starting with NVIDIA Volta, followed later by AMD and Intel GPUs, the GPUs have made a jump in performance that has increased the gap between their throughput and that of CPUs, by supplementing the vector instructions with matrix instructions, i.e. what NVIDIA calls tensor instructions.
However this current greater gap in performance between CPUs and GPUs could easily be removed and the performance per watt ratio could be brought back to a factor unlikely to be greater than 3, by adding matrix instructions to the CPUs.
Intel has introduced the AMX instruction set, besides AVX, but for now it is supported only in expensive server CPUs and Intel has defined only instructions for low-precision operations used for AI/ML. If AMX were extended with FP32 and FP64 operations, then the performance would be much more competitive with GPUs.
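For reference, a minimal sketch of what using the existing AMX BF16 tile instructions looks like today; this assumes a Sapphire Rapids or newer Xeon and Linux 5.16+, where you first have to ask the kernel for permission to use the tile state:

```c
// One AMX-BF16 tile multiply-accumulate: C(16x16 fp32) += A(16x32 bf16) * B.
// Compile with e.g.: gcc -O2 -mamx-tile -mamx-bf16 amx_demo.c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA  18

// Palette-1 tile configuration layout (64 bytes).
struct tile_config {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   // bytes per row for each tile register
    uint8_t  rows[16];    // rows for each tile register
} __attribute__((aligned(64)));

int main(void) {
    // Ask the kernel for permission to use the AMX tile state (Linux 5.16+).
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
        puts("AMX not available"); return 1;
    }

    // tmm0 = 16x16 fp32 accumulator, tmm1/tmm2 = 16x32 bf16 inputs.
    struct tile_config cfg = {0};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 16 * 4;   // C: 16 rows x 64 bytes
    cfg.rows[1] = 16; cfg.colsb[1] = 32 * 2;   // A: 16 rows x 64 bytes
    cfg.rows[2] = 16; cfg.colsb[2] = 32 * 2;   // B: 16 rows x 64 bytes (bf16 pairs)
    _tile_loadconfig(&cfg);

    static uint16_t a[16][32], b[16][32];       // raw bf16 bit patterns
    static float    c[16][16];
    memset(c, 0, sizeof c);
    // 0x3f80 is the bf16 bit pattern for 1.0f, so every C element should end up 32.0f.
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 32; j++) { a[i][j] = 0x3f80; b[i][j] = 0x3f80; }

    _tile_loadd(0, c, 16 * 4);
    _tile_loadd(1, a, 32 * 2);
    _tile_loadd(2, b, 32 * 2);
    _tile_dpbf16ps(0, 1, 2);     // C += A * B, bf16 pairs accumulated into fp32
    _tile_stored(0, c, 16 * 4);
    _tile_release();

    printf("c[0][0] = %.1f (expect 32.0)\n", c[0][0]);
    return 0;
}
```

The FP32/FP64 tile operations suggested above would be the hypothetical part; the instructions sketched here only cover the low-precision types AMX defines today.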
ARM is more advanced in this direction, with SME (Scalable Matrix Extension) defined besides SVE (Scalable Vector Extension). SME is already available in recent Apple CPUs and it is expected to also be available in the new Arm cores that will be announced a few months from now, which should appear in the smartphones of 2026, and presumably also in future Arm-based CPUs for servers and laptops.
The current Apple CPUs do not have strong SME accelerators, because they also have an iGPU that can perform the operations for which latency is less important.
On the other hand, an Arm-based server CPU could have a much bigger SME accelerator, providing a performance much closer to a GPU.
I appreciate the response with a lot of interesting details, however, I don't believe it answers the question I had. My question was why the CPU design suffers from clock-frequency drops in AVX-512 workloads, whereas GPUs, which have much more compute power, do not.
I assumed it was because GPUs run at much lower clock frequencies and therefore have power budget to spare, but as I also discussed with another commenter above, this was probably a premature conclusion, since we don't have enough evidence showing that GPUs indeed do not suffer from the same type of issues. They likely do, but nobody has measured it yet?
The low clock frequency when executing AVX-512 workloads is a frequency where the CPU operates efficiently, with a low energy consumption per operation executed.
For such a workload that executes a very large number of operations per second, the CPU cannot afford to operate inefficiently because it will overheat.
When a CPU core has many execution units that are idle and thus not consuming power, e.g. when executing only scalar operations or only operations on narrow 128-bit vectors, it can afford to raise the clock frequency by e.g. 50%, even if that increases the energy consumption per operation by e.g. 3 times. Because it executes 4 or 8 times fewer operations per clock cycle, the total power consumption is still lower even at 3 times the energy per operation, so the CPU does not overheat. The desktop owner does not care that completing the same workload requires much more energy, because the owner likely cares more about the time to completion.
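To make that trade-off concrete with the illustrative numbers above (50% higher clock, 3x energy per operation, 8x fewer operations per cycle; round example figures, not measurements):

```c
// Relative package power ~ (ops per cycle) x (clock) x (energy per op).
// Illustrative round numbers only, following the example in the comment above.
#include <stdio.h>

int main(void) {
    // Baseline: all-core AVX-512 at the efficient (lower) clock.
    double avx512_power = 1.0 /*ops/cycle*/ * 1.0 /*clock*/ * 1.0 /*energy/op*/;

    // Scalar-ish load: 1/8 the ops per cycle, 1.5x the clock, 3x the energy per op.
    double scalar_power = (1.0 / 8.0) * 1.5 * 3.0;

    // ~0.56x, so the core can hold the higher clock without overheating.
    printf("scalar-at-boost vs AVX-512 power: %.2fx\n", scalar_power / avx512_power);
    return 0;
}
```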
The clock frequency of a GPU also varies continuously depending on the workload, in order to maintain the power consumption within the limits. However a GPU is not designed to be able to increase the clock frequency as much as a CPU. The fastest GPUs have clock frequencies under 3 GHz, while the fastest CPUs exceed 6 GHz.
The reason is that normally one never launches a GPU program that would use only a small fraction of the resources of a GPU allowing a higher clock frequency, so it makes no sense to design a GPU for this use case.
Designing a chip for a higher clock frequency greatly increases the size of the chip, as shown by the comparison between a normal Zen core designed for 5.7 GHz and a Zen compact core, designed e.g. for 3.3 GHz, a frequency not much higher than that of a GPU.
On Zen compact cores, and on normal Zen cores configured for server CPUs with a large number of cores, e.g. 128 cores (with a total of 4096 FP32 ALUs, like a low-to-mid-range desktop GPU, or like a top desktop GPU of 5 years ago; a Zen compact server CPU can have 6144 FP32 ALUs, more than an RTX 4070), the clock frequency variation range is small, very similar to the clock variation range of a GPU.
In conclusion, it is not the desktop/laptop CPUs which drop their clock frequency, but it is the GPUs which never raise their clock frequency much, the same as the server CPUs, because neither GPUs nor server CPUs are normally running programs that keep most of their execution units idle, to allow higher clock frequencies without overheating.