As someone who has the old dead-tree version of Intel’s x86 and x86-64 architecture instruction set reference (the fat blue books), and in general as someone who carefully reads data sheets and documentation and looks for guidance from the engineers and staff who wrote said data sheets, I always have reservations when I hear that “intuitively you would expect X, but Y happens.” There’s nothing intuitive about any of this except, maybe, a reasonable understanding of the semiconductive nature of the silicon and the various dopants in the process. Unless you’ve seen the die schematic and the traces, and you know the paths, there is little to no reason to expect that Thing A is faster than Thing B unless the engineering staff and data sheets explicitly tell you.
There are exceptions, but just my 2c. Especially with ARM.
"Intuitively" here should be taken to mean approximately the same as "naively" – as in, the intuition that most of us learn at first that CPUs work ("as if") by executing one instruction at a time, strictly mechanistically, exactly corresponding to the assembly code. The way a toy architecture on a first-year intro to microprocessors course – or indeed a 6502 or 8086 or 68000 – would do it. Which is to say, no pipelining, no superscalar, no prefetching, no out-of-order execution, no branch prediction, no speculative execution, and so on.
Respectfully, I disagree. CPU architecture optimization is in a continuous dance with compiler optimization, where the former tries to adapt to the patterns most commonly produced by the latter, and the latter tries to adjust its optimizations according to what performs fastest on the former.
Therefore, it is not unreasonable to make assumptions based on the premise of "does this code look like something that could reasonably be produced by GCC/LLVM?".
It is true that as cores get simpler and cheaper, they get more edge cases. Something really big like Firestorm (A14/M1) can afford very consistent and tight latencies for all of its SIMD instructions regardless of the element/lane size, and can even hide complex dependencies or alignment artifacts wherever possible. But compare that with the simpler and cheaper Neoverse N1 and it's a different story entirely, where trivial algorithm changes lead to significant slowdowns - ADDV Vn.16B is way slower than Vn.4H, so you have to work around it. This is only exacerbated on much smaller cores.
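For what it's worth, here's a minimal NEON intrinsics sketch of that kind of workaround (the function and loop shape are my own illustration, not from any particular codebase): rather than issuing a wide horizontal ADDV inside the loop, accumulate with the widening pairwise adds (UADDLP/UADALP) and pay for a single horizontal reduction once at the end.

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative byte-sum kernel: keep the hot loop free of horizontal
 * reductions; widen and accumulate per lane instead, then reduce once.
 * Safe for inputs up to a few tens of MB before the 32-bit lane
 * accumulators could overflow. */
uint64_t sum_bytes(const uint8_t *p, size_t n) {
    uint32x4_t acc = vdupq_n_u32(0);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(p + i);
        uint16x8_t w = vpaddlq_u8(v);   /* UADDLP: 16 x u8 -> 8 x u16 */
        acc = vpadalq_u16(acc, w);      /* UADALP: accumulate into 4 x u32 */
    }
    uint64_t total = vaddvq_u32(acc);   /* one horizontal ADDV, outside the loop */
    for (; i < n; i++)                  /* scalar tail */
        total += p[i];
    return total;
}
```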
LLVM and GCC deal with this by having relatively precise knowledge (via -mtune) of a CPU's fetch, reorder, and load/store queue/buffer depths, as well as the latencies and dependency penalties of the opcodes of the ISA it implements, and other details like loop alignment requirements and branch predictor limitations.
Generally, it's difficult to do better than such compilers in straight-line code with local data, assuming that whatever you are doing doesn't make concessions that a compiler is not allowed to make.
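As an example of such a concession (my own illustration, not from the comment above): manually splitting a floating-point sum across several accumulators reassociates the additions, which changes rounding, so GCC/LLVM won't do it at plain -O3 without -ffast-math / -fassociative-math - but a human is free to decide the precision trade-off is acceptable and break the loop-carried dependency by hand.

```c
#include <stddef.h>

/* Hand-reassociated float sum: four independent accumulators hide the
 * add latency, at the cost of a different rounding order than the
 * strict left-to-right sum the compiler must preserve by default. */
float sum_reassociated(const float *v, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i + 0];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)  /* scalar tail */
        s += v[i];
    return s;
}
```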
Nonetheless, the mindset for writing a performant algorithm implementation is going to be the same as long as you are targeting the same class of CPU cores - loop unrolling, using cmovs, scheduling operations in advance, and ensuring that, should spills happen, the load and store operations have matching arguments - all of that will be profitable on AMD's Zen 4, Intel's Golden Cove, Apple's Firestorm, or ARM's Neoverse V3.
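A small, hypothetical C sketch of that mindset (kernel and names are mine): a branchless, unrolled max-reduction where each select typically lowers to CMOV on x86-64 or CSEL on AArch64, and the four independent accumulators give an out-of-order core work to overlap - broadly profitable on any of the cores named above, with per-core differences in the exact gains.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless, 4-way unrolled max: the ternary selects avoid
 * unpredictable branches, and the independent accumulators expose
 * instruction-level parallelism for the scheduler. */
int32_t max_of(const int32_t *v, size_t n) {
    int32_t m0 = INT32_MIN, m1 = INT32_MIN, m2 = INT32_MIN, m3 = INT32_MIN;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        m0 = v[i + 0] > m0 ? v[i + 0] : m0;  /* usually CMOV/CSEL */
        m1 = v[i + 1] > m1 ? v[i + 1] : m1;
        m2 = v[i + 2] > m2 ? v[i + 2] : m2;
        m3 = v[i + 3] > m3 ? v[i + 3] : m3;
    }
    for (; i < n; i++)  /* scalar tail */
        m0 = v[i] > m0 ? v[i] : m0;
    int32_t a = m0 > m1 ? m0 : m1;
    int32_t b = m2 > m3 ? m2 : m3;
    return a > b ? a : b;
}
```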