Edit, comparison:
    $ perf record target/release/gemm-benchmark -d 1024
    Threads: 1
    Iterations per thread: 1000
    Matrix shape: 1024 x 1024
    GFLOPS/s: 96.36
    $ perf report --stdio -q | head -n3
    97.18%  gemm-benchmark  gemm-benchmark  [.] mkl_blas_def_sgemm_kernel_0_zen
     1.94%  gemm-benchmark  gemm-benchmark  [.] mkl_blas_def_sgemm_scopy_down16_bdz
     0.78%  gemm-benchmark  gemm-benchmark  [.] mkl_blas_def_sgemm_scopy_right4_bdz
After disabling Intel CPU detection:

    $ perf record target/release/gemm-benchmark -d 1024
    Threads: 1
    Iterations per thread: 1000
    Matrix shape: 1024 x 1024
    GFLOPS/s: 129.12
    $ perf report --stdio -q | head -n3
    97.02%  gemm-benchmark  libmkl_avx2.so.1  [.] mkl_blas_avx2_sgemm_kernel_0
     1.77%  gemm-benchmark  libmkl_avx2.so.1  [.] mkl_blas_avx2_sgemm_scopy_down24_ea
     1.02%  gemm-benchmark  libmkl_avx2.so.1  [.] mkl_blas_avx2_sgemm_scopy_right4_ea

Benchmarked using https://github.com/danieldk/gemm-benchmark and oneMKL 2021.3.0.
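
A quick sketch for putting the two throughput figures side by side. The helper names are made up for illustration, and the 2·n³ FLOP count per n x n matrix multiplication is the usual convention, assumed here rather than taken from the gemm-benchmark source.

```python
# Compare the two reported throughput figures (values copied from the runs above).
# Assumption: one n x n SGEMM costs 2 * n**3 floating-point operations,
# the usual convention; gemm-benchmark may count differently.

def sgemm_flops(n: int) -> int:
    """FLOPs for one n x n by n x n matrix multiplication (assumed 2*n^3)."""
    return 2 * n ** 3

def speedup(baseline_gflops: float, improved_gflops: float) -> float:
    """Ratio of improved to baseline throughput."""
    return improved_gflops / baseline_gflops

def seconds_per_gemm(n: int, gflops: float) -> float:
    """Implied wall time of a single multiplication at the given throughput."""
    return sgemm_flops(n) / (gflops * 1e9)

default_gflops = 96.36   # mkl_blas_def_sgemm_kernel_0_zen
avx2_gflops = 129.12     # mkl_blas_avx2_sgemm_kernel_0

print(f"speedup: {speedup(default_gflops, avx2_gflops):.2f}x")
print(f"default kernel: {seconds_per_gemm(1024, default_gflops) * 1e3:.2f} ms per multiplication")
print(f"AVX2 kernel:    {seconds_per_gemm(1024, avx2_gflops) * 1e3:.2f} ms per multiplication")
```

On these numbers the AVX2 kernel comes out roughly 1.34x faster than the default code path for a single-threaded 1024 x 1024 SGEMM.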