You speak as though all low-level optimization is alignment, unrolling, pairing,...

You speak as though all low-level optimization is alignment, unrolling, pairing, etc.. These are basically micro optimizations that yield negligible gains unless applied over a large code base. That Pluto output is just a wall of code because it has been unrolled several times, it's not any sort of impressive optimization achievement.

A human would probably convert this to fixed point, convert the entire inner-most loop into a couple of address operations and a single fused multiply add, process the array 64 bytes per iteration without unrolling anything. At that point it's probably 10-20x more efficient than that slab of polyhedral bullshit and finally he'd come back to carefully pad here and there to avoid misalignment penalties.

As for optimizing for 4 cores - you take your shiny hand-polished assembly routine and spin it up on 4 threads, most likely in high level code since you're talking to the OS to get threads, synchronize them, etc. It's not wise to chase parallelism at a low level because that goes counter to minimizing the overhead costs in setting it up.

> But you also are often starting with the output of a good compiler.

No, nobody does this. I mean maybe if you're just learning assembly. Starting with the compiler-generated garbage does not help you other than maybe by giving you a benchmark to beat.