
CPU numbers are off: an FMA counts as 2 floating-point operations, and Zen 5 can issue 2 of them per cycle in addition to two adds, so that's 6 flops per lane per cycle, not 4. (GPU numbers are always quoted this way, so it's only fair to do the same for the CPU.)

Also, the 9950X shows 32 threads, but that's with SMT; it only has 16 physical cores, so the correct scaling factor is 16 cores * 16 SIMD lanes. Either way, the final number is 8.678 FP32 TFLOPS.
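Spelling out that estimate (the clock value here is my assumption, picked near the chip's spec boost so the result matches the quoted figure; it isn't stated in the comment):

```python
# Peak FP32 estimate for the 9950X under the counting above.
# clock_ghz is an assumed boost clock, not a measured or quoted value.
cores = 16              # physical cores, SMT ignored
simd_lanes = 16         # 512-bit AVX-512 register / 32-bit floats
flops_per_cycle = 6     # 2 FMAs (2 flops each) + 2 adds, per lane
clock_ghz = 5.65

tflops = cores * simd_lanes * flops_per_cycle * clock_ghz / 1000
print(f"{tflops:.3f} TFLOPS")  # 8.678 TFLOPS
```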

The RTX 4090 has 82.58 FP32 TFLOPS according to Nvidia, but it also costs far more than the 9950X ($1,600 vs. $650), so I find this comparison rather odd.

So it costs 2.46x as much and delivers 9.5x the perf.

If you normalize for cost, the perf advantage is about 3.8x, which is roughly the same number Intel reported years ago when they debunked the whole "GPUs are 100x faster" nonsense.
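The cost normalization is just a ratio of ratios, using the prices and TFLOPS figures quoted above:

```python
# Cost-normalized comparison from the figures in this thread.
cpu_tflops, cpu_price = 8.678, 650    # 9950X
gpu_tflops, gpu_price = 82.58, 1600   # RTX 4090

perf_ratio = gpu_tflops / cpu_tflops    # ~9.5x raw throughput
price_ratio = gpu_price / cpu_price     # ~2.46x the cost
print(f"{perf_ratio / price_ratio:.2f}x per dollar")  # ~3.87x
```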

Anyway, I really hate the CUDA terminology where they refer to SIMD lanes as "threads".

There are also a lot of other things to consider where either the CPU or the GPU has an advantage, such as:

GPU advantages:

Hardware sin/cos support (with Nvidia at least)

abs/saturate are often just modifiers

scaling by small powers of 2 is often free

16-bit floats are fully supported

CPU advantages:

doubles run at full speed, and you can interleave them with floats if you only need them for a few calculations

access to a wide variety of integer sizes and bit-manipulation instructions; the GPU has some of this, but nowhere near as broad

lower-level programming model




Decent points regarding relative strengths and weaknesses, but:

> lower-level programming model

Do you mean how SASS (and the AMD equivalent) is not properly documented and has no tooling, as opposed to the assembly languages of the various CPU architectures? Because otherwise, remember that one can write PTX code, and that is pretty low-level.



