I always miss multithreaded benchmarks when it comes to SSE/AVX instructions. AFAIK AVX processing units are oversubscribed: there are fewer of them than CPU cores.
I can imagine that running an AVX is_sorted (or any other AVX procedure) in multiple threads would actually be slower than running a non-vectorized procedure.
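For what it's worth, the kind of multithreaded comparison I have in mind could look roughly like the sketch below. It is only a harness: `run_threads`, `BenchResult` and `is_sorted_scalar` are names made up for illustration, the baseline is plain `std::is_sorted`, and a vectorized `is_sorted` with the same `(const int*, size_t)` signature would have to be plugged in separately. The timing includes thread startup and the per-thread copy of the input, so it is only meaningful for large iteration counts.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Outcome of one run: wall-clock seconds for all threads together,
// and how many calls in total reported "sorted" (a sanity check).
struct BenchResult {
    double seconds;
    std::size_t sorted_count;
};

// Scalar baseline; a vectorized variant with the same signature
// (hypothetical, not shown here) can be passed to run_threads instead.
inline bool is_sorted_scalar(const int* p, std::size_t n) {
    return std::is_sorted(p, p + n);
}

// Runs `check` `iterations` times in each of `num_threads` threads.
// Every thread works on its own copy of `data`, so the threads do not
// share input cache lines. The measured time covers thread creation,
// the copies, the checks and the joins.
template <typename Check>
BenchResult run_threads(Check check, const std::vector<int>& data,
                        unsigned num_threads, unsigned iterations) {
    std::vector<std::size_t> hits(num_threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            std::vector<int> local(data);  // private copy per thread
            std::size_t count = 0;
            for (unsigned i = 0; i < iterations; ++i) {
                if (check(local.data(), local.size())) ++count;
            }
            hits[t] = count;  // written once, after the loop
        });
    }
    for (auto& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const std::size_t total =
        std::accumulate(hits.begin(), hits.end(), std::size_t{0});
    return {seconds, total};
}
```

Calling `run_threads(is_sorted_scalar, data, n, iters)` for n = 1 up to `std::thread::hardware_concurrency()` and comparing against the vectorized version at each n would show whether the vectorized code really falls behind as more cores are loaded. An optimizer may hoist the check out of the loop since the input never changes, so a real benchmark would need to guard against that.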
Note that Skylake-SP and Xeon-W/i7/i9s behave very differently in this regard. On Skylake-SP (e.g. Xeon Silvers like the ones they're using) the clock rate drops by over 50% when AVX-512 is in the pipe; on Xeon-W and the HEDT chips it's more like 10-20%.
I think AVX units on Intel cores have always been separate (i.e. not shared).
AMD Bulldozer processors share a floating point unit and some early pipeline stages, like the instruction decoder, between each pair of cores. AMD Zen processors have since reverted to a more conventional design.
It's not that the execution units are shared, but that the frequency is throttled when AVX2 or AVX512 instructions are encountered. In general AVX512 is not yet worthwhile when overall system throughput is at stake and you don't have very vector-heavy workloads. AVX2 is worthwhile most of the time. As of Haswell, one lane of the AVX2 units is powered down when not in use, and instructions execute more slowly when first encountered. It basically executes them by stitching together two SSE operations. But this doesn't necessarily make it slower than SSE, just that the performance benefits might not materialize if there isn't enough AVX2 code being executed. I don't know if Skylake works that way as well.
I'm pretty sure each core has its own AVX units for the version they support.
AFAIK the only difference in AVX performance (ignoring clock speed) is that Gold/Platinum (5000+) Xeons have 2x512 FMA ports available, while everything else only supports 512-bit FMA on ports 0/1 and not on port 5. Stabbing a bit in the dark here, it's been a while since I was looking at this stuff.
Of course, that's my purely anecdotal opinion.