The initial AVX-512 implementation brought a lot of issues with it. The biggest problem was that Intel used full 512-bit ALUs from the start, and I think that was just too much for the initial 14nm node - even AMD's Zen4 architecture, which came years after Skylake-X, uses 256-bit ALUs for most operations, with the exception of complex shuffles, which get a dedicated 512-bit unit to keep them competitive. And in my experience, AMD's Zen4 AVX-512 implementation is a very competitive one. I just wish it had faster gathers.
Our typical workload at Sneller uses most of the machine's computational power: we execute heavy AVX-512 workloads on all available cores and measure processing performance in GB/s per core. This is why we needed faster decompression: before Iguana, almost 50% of the computational power was spent in the zstd decompressor, which is scalar. The rest of the code is written in Go, but its cost is insignificant compared to the time we now spend executing AVX-512.
(I work for Sneller)