Being able to manipulate the program counter directly plays hell with a superscalar and especially OoO processor where you want to be able to predict what the program counter does very accurately so the instruction fetch and decode can run far ahead of the execution.
There are four kinds of instructions that play hell with pipeline and OoO design:
- instructions that might cause traps, dependent on the values processed
- instructions where you don't know whether they will change the control flow
- instructions where you don't know where the control flow will go
- instructions where you don't know how long they will take to execute
RISC-V, for example, bans the first category entirely except for load/store, and carefully separates the other three so that any one instruction has at most one of those problems.
ARM load multiple has all of those problems. At least you can examine the register mask at instruction decode time, know whether it will change the PC or not, and tag the instruction in the pipeline as being a jump or not. Imagine if there were a version that took the bitmap from a register instead of it being hard-coded...
Load/store multiple doesn't increase performance much, if at all, on a CPU with an instruction cache and/or an instruction prefetch buffer. On an original 68000 or ARM without any cache, sure: a series of load or store instructions requires interleaving reading the opcodes with reading or writing the data, while load/store multiple eliminates the opcode reads. An instruction cache also eliminates them, leaving only the code size benefits. But load/store multiple is a perfect candidate for a simple runtime function instead, at least if you have lightweight function call/return as RISC designs usually do.
"An instruction cache also eliminates them, leaving only the code size benefits."
The performance of movem.l on the MC68000 comes not from the multiple loads but from the multiple stores, because main memory access incurred a tremendous, extremely punitive penalty. This has not changed, even decades later, in systems with the fastest memory chips available: writes to random access memory still incur tremendous penalties.