
I always miss multithreaded benchmarks when SSE/AVX instructions are involved. AFAIK AVX processing units are oversubscribed: there are fewer of them than CPU cores.

I can imagine that running AVX is_sorted (or any other AVX procedure) in multiple threads could actually be slower than running the non-vectorized procedure.

Of course, that's my purely anecdotal opinion.
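
(For concreteness, here's a rough sketch of the kind of AVX2 is_sorted kernel I have in mind, assuming int32 input; the names and details are my own, not the author's. Compile with -mavx2:)

    #include <immintrin.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch: true iff a[0..n-1] is ascending.
       Compares a[i..i+7] against a[i+1..i+8], 8 lanes at a time. */
    static bool is_sorted_avx2(const int32_t *a, size_t n)
    {
        size_t i = 0;
        for (; i + 8 < n; i += 8) {
            __m256i cur  = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i next = _mm256_loadu_si256((const __m256i *)(a + i + 1));
            /* any lane with a[i] > a[i+1] breaks the ordering */
            if (_mm256_movemask_epi8(_mm256_cmpgt_epi32(cur, next)) != 0)
                return false;
        }
        for (; i + 1 < n; i++)  /* scalar tail */
            if (a[i] > a[i + 1])
                return false;
        return true;
    }

Running several copies of that in parallel against a plain scalar loop is exactly the benchmark I'd like to see.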




> AFAIK AVX processing units are oversubscribed, there are less of them than CPU cores.

Typically I see a throughput of about 2 SIMD instructions per cycle per core on Intel CPUs. SIMD execution units are not shared between cores in any way.

Clock throttling might happen, but SIMD is usually still a pretty huge net win.
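
For a rough sense of scale: two 256-bit FMA issues per cycle works out to 2 x 8 FP32 lanes x 2 ops (multiply + add) = 32 single-precision FLOPs per cycle per core, and every core gets that in full. (Back-of-the-envelope figures for a Haswell/Skylake-class core, assuming both FMA ports are kept busy.)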


Here is an experience report on how AVX-512 instructions impact CPU performance: https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...


Note that Skylake-SP and the Xeon-W/i7/i9 parts behave very differently in this regard. On Skylake-SP (e.g. the Xeon Silvers they're using) it's over a 50% clock-rate reduction when AVX-512 is in the pipe; on Xeon-W and the HEDT chips it's more like 10-20%.

https://twitter.com/InstLatX64/status/934093081514831872


On HEDT (and mainstream desktop) you can actually adjust the AVX offset manually. With a 0 offset and a 5 GHz clock, you can consume 500 W (in Prime95 AVX) :D


I think AVX units on Intel cores have always been separate (i.e. not shared).

AMD Bulldozer processors shared a floating-point unit and some early pipeline stages (like the instruction decoder) between each pair of cores. AMD Zen processors have since reverted to a more conventional design.


It's not that the execution units are shared, but that the frequency is throttled when AVX2 or AVX-512 instructions are encountered. In general, AVX-512 is not yet worthwhile when overall system throughput is at stake and you don't have very vector-heavy workloads; AVX2 is worthwhile most of the time.

Since Haswell, the upper 128-bit lane of the AVX2 units is powered down when not in use, and 256-bit instructions execute more slowly when first encountered: the core initially executes them by stitching together two SSE-width (128-bit) operations. But this doesn't necessarily make it slower than SSE; it just means the performance benefits might not materialize if there isn't enough AVX2 code being executed. I don't know if Skylake works that way as well.
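
(One practical consequence for micro-benchmarks: run some throwaway 256-bit instructions before the timed region so the upper lanes are already powered up. A hypothetical sketch, compile with -mavx:)

    #include <immintrin.h>

    volatile float g_sink;  /* defeats dead-code elimination */

    /* Hypothetical warm-up: keep the 256-bit units busy long enough
       that the timed region doesn't pay the power-up penalty. */
    static void avx_warmup(void)
    {
        __m256 v = _mm256_set1_ps(1.0f);
        for (int i = 0; i < 100000; i++)
            v = _mm256_add_ps(v, v);  /* loop-carried dependency;
                                         overflow to inf is harmless here */
        g_sink = _mm256_cvtss_f32(v);
    }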


That doesn't make sense to me. On Haswell/Skylake, ports 0 and 1 do most of the vector lifting. I don't think cores share any of the vector hardware.

Or are you trying to make some claim about hyperthreading?


I may simply be wrong :) There is another AVX-related thing [0]: the CPU's Turbo clock is lowered when AVX is used.

[0]: https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd...


I'm pretty sure each core has its own AVX units for whatever version it supports.

AFAIK the only difference in AVX performance (ignoring clock speed) is that Gold/Platinum (5000+) Xeons have two 512-bit FMA ports available, while everything else only runs 512-bit FMA on the fused port 0/1 pair and not on port 5. Stabbing a bit in the dark here; it's been a while since I last looked at this stuff.


That’s only correct for AVX-512 (a.k.a. AVX3) instructions, i.e. the 512-bit-wide vector unit.



