This is a nice benchmark, thanks for doing it, but the point I was making is somewhat orthogonal to this.
Of course there's a performance advantage from using 512-bit registers for a memcpy - but a memcpy is rarely a major performance bottleneck by itself and is usually surrounded by other code. Unless that code is also AVX-512, you may have just made it slower by optimizing the memcpy: executing 512-bit instructions can trigger a frequency-license downclock that penalizes everything else running on the core. My point was that a compiler usually can't decide whether the optimization is worth it in light of that broader context.
The other point was whether using AVX-512 instructions while sticking to xmm registers is faster than plain xmm SSE/AVX code. I don't have an AVX-512-capable machine at the moment - perhaps you'd like to check whether your 128-bit version is any faster than just compiling with "gcc -march=skylake -O2 -mprefer-vector-width=128" (thereby retaining the microarchitecture tuning, but sticking to 128-bit AVX2)?