
I always miss multithreaded benchmarks when SSE/AVX instructions are involved. AFAIK AVX processing units are oversubscribed: there are fewer of them than CPU cores.

I can imagine that running AVX is_sorted (or any other AVX procedure) in multiple threads could actually be slower than running the non-vectorized procedure.

Of course, that's my purely anecdotal opinion.
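
(For concreteness, here's a rough sketch of the kind of AVX2 is_sorted kernel I have in mind, assuming int32 input; the names and details are my own, not the author's. Compile with -mavx2:)

    #include <immintrin.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch: true iff a[0..n-1] is ascending.
       Compares a[i..i+7] against a[i+1..i+8], 8 lanes at a time. */
    static bool is_sorted_avx2(const int32_t *a, size_t n)
    {
        size_t i = 0;
        for (; i + 8 < n; i += 8) {
            __m256i cur  = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i next = _mm256_loadu_si256((const __m256i *)(a + i + 1));
            /* any lane with a[i] > a[i+1] breaks the ordering */
            if (_mm256_movemask_epi8(_mm256_cmpgt_epi32(cur, next)) != 0)
                return false;
        }
        for (; i + 1 < n; i++)  /* scalar tail */
            if (a[i] > a[i + 1])
                return false;
        return true;
    }

Running several copies of that in parallel against a plain scalar loop is exactly the benchmark I'd like to see.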




> AFAIK AVX processing units are oversubscribed, there are less of them than CPU cores.

Typically I see a throughput of about 2 SIMD instructions per cycle per core on Intel CPUs. SIMD execution units are not shared between cores in any way.

Clock throttling might happen, but SIMD is usually still a pretty huge net win.
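
For a rough sense of scale: two 256-bit FMA issues per cycle works out to 2 x 8 FP32 lanes x 2 ops (multiply + add) = 32 single-precision FLOPs per cycle per core, and every core gets that in full. (Back-of-the-envelope figures for a Haswell/Skylake-class core, assuming both FMA ports are kept busy.)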


Here is an experience report on how AVX-512 instructions impact CPU performance: https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...


Note that Skylake-SP and the Xeon-W/i7/i9 parts behave very differently in this regard. On Skylake-SP (e.g. the Xeon Silvers they're using) it's over a 50% clock-rate reduction when AVX-512 is in the pipe; on Xeon-W and the HEDT chips it's more like 10-20%.

https://twitter.com/InstLatX64/status/934093081514831872


On HEDT (and mainstream desktop) you can actually adjust the AVX offset manually. With a 0 offset and a 5 GHz clock, you can consume 500 W (in Prime95 AVX) :D


I think AVX units on Intel cores have always been separate (i.e. not shared).

AMD Bulldozer processors shared a floating-point unit and some early pipeline stages (like the instruction decoder) between each pair of cores. AMD Zen processors have since reverted to a more conventional design.


It's not that the execution units are shared, but that the frequency is throttled when AVX2 or AVX-512 instructions are encountered. In general, AVX-512 is not yet worthwhile when overall system throughput is at stake and you don't have very vector-heavy workloads; AVX2 is worthwhile most of the time.

Since Haswell, the upper 128-bit lane of the AVX2 units is powered down when not in use, and 256-bit instructions execute more slowly when first encountered: the core initially executes them by stitching together two SSE-width (128-bit) operations. But this doesn't necessarily make it slower than SSE; it just means the performance benefits might not materialize if there isn't enough AVX2 code being executed. I don't know if Skylake works that way as well.
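
(One practical consequence for micro-benchmarks: run some throwaway 256-bit instructions before the timed region so the upper lanes are already powered up. A hypothetical sketch, compile with -mavx:)

    #include <immintrin.h>

    volatile float g_sink;  /* defeats dead-code elimination */

    /* Hypothetical warm-up: keep the 256-bit units busy long enough
       that the timed region doesn't pay the power-up penalty. */
    static void avx_warmup(void)
    {
        __m256 v = _mm256_set1_ps(1.0f);
        for (int i = 0; i < 100000; i++)
            v = _mm256_add_ps(v, v);  /* loop-carried dependency;
                                         overflow to inf is harmless here */
        g_sink = _mm256_cvtss_f32(v);
    }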


That doesn't make sense to me. On Haswell/Skylake, ports 0 and 1 do most of the vector lifting. I don't think cores share any of the vector hardware.

Or are you trying to make some claim about hyperthreading?


I may simply be wrong :) There is another AVX-related thing [0]: the CPU's Turbo clock is lowered when AVX is used.

[0]: https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd...


I'm pretty sure each core has its own AVX units for whatever version it supports.

AFAIK the only difference in AVX performance (ignoring clock speed) is that Gold/Platinum (5000+) Xeons have two 512-bit FMA ports available, while everything else only runs 512-bit FMA on the fused port 0/1 pair and not on port 5. Stabbing a bit in the dark here; it's been a while since I last looked at this stuff.


That’s only correct for AVX-512 (a.k.a. AVX3) instructions, i.e. the 512-bit-wide vector unit.



