
Because it can't actually see that return instruction until way too late.

The Apple M1 CPU loads and decodes 8 instructions per cycle, and the instruction-cache access plus decoding takes multiple cycles. I don't know exactly how many (more cycles allow for a bigger instruction cache, and Apple gave the M1 a massive 192KB instruction cache), but the absolute minimum is 3 cycles.

And until the instruction is decoded, the CPU has no idea that there's even a return instruction there.

3 cycles at 8 instructions per cycle is a full 24 instructions, so the M1 will be 24 instructions past the return before it even knows the return exists. The loop body in this article is only 7 instructions long, so overrunning by an extra 24 instructions on every iteration would be a massive performance hit.
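To make that cost concrete, here is a back-of-the-envelope sketch. The 3-cycle redirect latency and 8-wide fetch are the assumed figures from above, not measured M1 specs:

```python
# Cost of redirecting fetch only after decode, using the assumed figures
# from the comment above (not measured M1 numbers).

FETCH_WIDTH = 8        # instructions fetched/decoded per cycle
REDIRECT_LATENCY = 3   # cycles from starting a fetch to recognizing the return
LOOP_BODY = 7          # useful instructions per loop iteration (from the article)

# Instructions fetched down the wrong path before the return is recognized:
overrun = FETCH_WIDTH * REDIRECT_LATENCY
print(overrun)  # 24 wasted fetch slots per iteration

# Fraction of fetched instructions thrown away each iteration:
waste = overrun / (overrun + LOOP_BODY)
print(f"{waste:.0%}")  # 77% of fetch bandwidth wasted without prediction
```

Roughly three quarters of fetch bandwidth would go to instructions that get thrown away, which is why fetch can't wait for decode.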

What actually happens is that the branch predictor remembers the fact that there is a return at that address, and on the very next cycle after starting the icache fetch of the return instruction, it pops the return address off its return-address prediction stack and starts fetching code from there.

For short functions, the branch predictor might need to follow the call on one cycle, and the return on the next cycle. The call instruction hasn't even finished decoding yet, but the branch predictor has already executed both the call and the return.
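That mechanism can be sketched as a toy simulation. The class, addresses, and 4-byte instruction size here are illustrative assumptions, but the push-on-call / pop-on-return behavior is the standard return-address-stack design:

```python
# Toy return-address-stack (RAS) predictor: the fetch unit pushes the
# fall-through address when it predicts a call, and pops it when it
# predicts a return -- all before either instruction finishes decoding.

class ReturnAddressStack:
    def __init__(self):
        self.stack = []

    def on_call(self, call_addr, insn_size=4):
        # Predicted return target is the instruction after the call.
        self.stack.append(call_addr + insn_size)

    def on_return(self):
        # Pop the predicted target; empty stack means a mispredict fallback.
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()

# Cycle N: predictor sees a call at 0x1000; pushes fall-through 0x1004.
ras.on_call(0x1000)
# Cycle N+1: predictor sees the return in the short callee; pops the target.
print(hex(ras.on_return()))  # 0x1004 -- fetch redirects here immediately
```

For the short-function case above, those two operations happen on back-to-back cycles, long before the call itself has finished decoding.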



