IACA is awesome for understanding the behaviour of your own hot loops. Sadly, it only works on Intel. Luckily, LLVM has recently merged llvm-mca [1] (a machine-code analyser), which hopefully will in time bring all the features of IACA and more to other platforms as well.
I am working on making these tools easily accessible from Julia [2] and one of the things missing from llvm-mca is the ability to mark a region of assembly code to be analysed instead of analysing the entire provided assembly.
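For reference, IACA's region marking looks roughly like this (the GNU C form of the IACA_START/IACA_END macros from Intel's iacaMarks.h, with an "ebx" clobber added here for safety; the analyser scans the binary for these byte sequences, and llvm-mca had no equivalent at the time):

    #include <stddef.h>

    #define IACA_START __asm__ volatile("movl $111, %%ebx\n\t" \
                                        ".byte 0x64, 0x67, 0x90" ::: "memory", "ebx")
    #define IACA_END   __asm__ volatile("movl $222, %%ebx\n\t" \
                                        ".byte 0x64, 0x67, 0x90" ::: "memory", "ebx")

    /* hypothetical kernel: only the region between the markers is analysed */
    void saxpy(float *a, const float *b, float k, size_t n) {
        IACA_START;
        for (size_t i = 0; i < n; i++)
            a[i] += k * b[i];
        IACA_END;
    }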
« According to Agner’s instruction_tables.pdf load instruction that I use has 2 cycles latency »
and indeed that's what Agner Fog's tables say (and they further say this figure is « the delay that the instruction generates in a dependency chain »).
But all other sources seem to agree that the L1 dcache latency on the main Intel processor family is 4 cycles (since Nehalem).
Does anyone know what the distinction is here? If I do a simple pointer chase through L1 dcache I wouldn't make progress at 2 cycles per step, would I?
Yes, Agner's doc is wrong, or perhaps misleading, here. He mentions that for loads and stores he tested a loop of a load and a store to the same location (essentially testing store-forwarding latency) to get the latency of that pair, and then "arbitrarily divided" the measured latency between the load and the store.
He did that because, to measure the latency of a single instruction, the "domain" (e.g. GP register, SIMD register, memory, etc.) of the input(s) has to be the same as that of the output.
Stores have a memory output but register/address inputs, so it isn't really possible to measure the latency of a store in isolation. That's why Agner paired up loads and stores to get a latency figure and did the arbitrary split.
All that out of the way, it is possible to measure the latency of a load, at least for the load-to-address path, like you mention, with a pointer-chasing load. That latency is 4 cycles minimum on all modern Intel chips; add an extra cycle for complex addressing modes or vector loads.
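Something like this minimal sketch (my own, not from any of the sources above; timing code omitted, wrap the chase loop with rdtsc or clock_gettime):

    #include <stdio.h>

    #define LEN 64             /* small enough to stay resident in L1 */
    #define STEPS (1L << 24)

    int main(void) {
        /* Build a ring of pointers: each element points to the next. */
        void *ring[LEN];
        for (int i = 0; i < LEN; i++)
            ring[i] = &ring[(i + 1) % LEN];

        /* Chase: each load's address is the previous load's result, so
           the loads serialize and total_time / STEPS approximates the
           L1 load-to-use latency (~4 cycles on Nehalem and later). */
        void **p = (void **)ring[0];
        for (long i = 0; i < STEPS; i++)
            p = (void **)*p;

        printf("%p\n", (void *)p);  /* keep the chain from being optimized out */
        return 0;
    }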
So yeah, a load latency of 2 is nonsense for modern Intel. In this case, even talking about latency is nonsense: all the loads are independent, so it's the throughput, not the latency, that matters.
IIRC some chips have had 2 cycle L1 accesses: some P4 designs and one of the semi-recent POWER chips.
He does say he's reporting the minimum latency; I was thinking that no case would be faster than reading from L1 cache, but it makes sense that store-forwarding can be faster.
Right, but store forwarding is definitely not 2 cycles. 4 and 5 cycles are common for store forwarding, although other values are possible. So I think Agner measured 4 cycles and divided it as 2 for the load and 2 for the store. For pure loads the latency is easy to measure and widely reported, and it's 4 cycles minimum for Intel.
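For concreteness, here is a sketch (mine, not Agner's actual harness) of the kind of round-trip loop his methodology implies:

    #include <stdint.h>

    #define N (1L << 24)

    volatile uint64_t slot;    /* volatile keeps the accesses in the loop */

    uint64_t roundtrip(void) {
        uint64_t x = 1;
        for (long i = 0; i < N; i++) {
            slot = x;          /* store ... */
            x = slot;          /* ... immediately reloaded, forwarded from
                                  the store buffer, so each iteration's
                                  load depends on the previous store */
        }
        return x;              /* cycles / N ~= the store->load round trip
                                  that Agner then split "arbitrarily" */
    }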
Out-of-order execution doesn't help if you saturate the functional units. Hyperthreading also doesn't help; in fact, it hurts even more, since now the other hyperthread(s) are competing for the contended resource.
Well, hyperthreading can help in the sense that the execution "slots" may not be wasted if the other hyperthread can use the ports that would otherwise go unused if only a single thread were running. That mostly only works if the workloads of the two threads have very different port usage.
Consider one workload which saturates p1, doing integer multiplication, and another which is load/store heavy. Those two workloads, run on sibling hyperthreads, could perhaps approach the best possible speedup of 2x, since they use different ports.
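As a rough illustration (my own sketch; thread pinning is omitted, and you'd want to compare against running the two loops back to back on one thread):

    #include <pthread.h>
    #include <stdint.h>

    #define ITERS 100000000L

    static volatile uint64_t sink;
    static uint64_t buf[1024];

    /* Three independent multiply chains: with a 3-cycle multiply latency
       and 1/cycle throughput this comes close to saturating p1. */
    static void *mul_heavy(void *arg) {
        uint64_t a = 1, b = 2, c = 3;
        for (long i = 0; i < ITERS; i++) {
            a *= 0x9E3779B97F4A7C15ULL;
            b *= 0x9E3779B97F4A7C15ULL;
            c *= 0x9E3779B97F4A7C15ULL;
        }
        sink = a ^ b ^ c;
        return NULL;
    }

    /* Independent loads feeding an add: bound mainly by the load ports. */
    static void *load_heavy(void *arg) {
        uint64_t s = 0;
        for (long i = 0; i < ITERS; i++)
            s += buf[i & 1023];
        sink = s;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        /* Pin t1 and t2 to sibling hyperthreads of one physical core
           (e.g. with pthread_setaffinity_np) before timing. */
        pthread_create(&t1, NULL, mul_heavy, NULL);
        pthread_create(&t2, NULL, load_heavy, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }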
Keep in mind there are two different types of contention, that I guess you could call "mandatory" and "non-mandatory" (I just made those terms up, suggestions welcome).
Mandatory contention is the contention you get because certain operations _must_ go to a port or combination of ports. The examples in the article were of this nature: if you have a dense series of loads, the load ports are fully occupied, so throughput is limited by the available load ports and the "port pressure" is 1.0. Similarly for the BSWAP example.
OoO can get around this type of contention only to the extent that the contention is transient and it can see "beyond" the contention within its scheduler and reorder buffer windows. For example, if you have 1000 independent multiplications (execute on only 1 port) followed by 1000 independent loads (execute on 2 ports), it is going to take roughly 1000 + 1000 / 2 = 1500 cycles to execute that, because you'll have contention in both phases and they are too long for OoO to do much[1].
On the other hand, if you have 10 multiplications followed by 10 loads, you'll probably see this run in 10 cycles (limited by the multiplications) because the instructions are interleaved in a more fine-grained window so OoO can keep both types of instructions in flight.
So yeah, OoO can hide port contention, but only if there is a relatively fine-grained mix of different ports in the dynamic instruction stream: it can't help much in a long steady-state involving contention.
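As a sketch of the two shapes (counts and constants are mine, scaled down; check the generated asm, since the compiler is free to rearrange this):

    #include <stdint.h>

    static uint64_t table[64];
    #define K 0x9E3779B97F4A7C15ULL

    /* Phased: a long multiply-only stretch, then a long load-only stretch.
       Four accumulators keep the multiplies independent, so the first loop
       is throughput-bound on the multiply port, the second on the load
       ports, and OoO can only overlap them near the boundary. */
    uint64_t phased(void) {
        uint64_t a = 1, b = 2, c = 3, d = 4, s = 0;
        for (int i = 0; i < 1000; i++) { a *= K; b *= K; c *= K; d *= K; }
        for (int i = 0; i < 1000; i++) s += table[i & 63];
        return a ^ b ^ c ^ d ^ s;
    }

    /* Interleaved: the same work mixed at fine granularity, so both kinds
       of port stay busy within every scheduler window and the loads hide
       under the multiplies. */
    uint64_t interleaved(void) {
        uint64_t a = 1, b = 2, c = 3, d = 4, s = 0;
        for (int i = 0; i < 1000; i++) {
            a *= K; b *= K; c *= K; d *= K;
            s += table[i & 63];
        }
        return a ^ b ^ c ^ d ^ s;
    }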
The other type of contention, non-mandatory, is caused by less-than-perfect scheduling. Consider a loop of two instructions that can ideally run at 1 iteration per cycle (e.g., because of a carried dependency chain). One instruction can run on port 1 only and the other on port 1 or port 2. In principle the mandatory contention is 1.0 on each port, so the loop can still hit its theoretical max of 1 iteration per cycle. However, this requires that the CPU schedule the second instruction always to port 2 and never to port 1. If it ever schedules the second instruction to port 1, you lose a cycle, since port 1 is needed every cycle by the first instruction.
So here the actual behavior depends on how smart the CPU is in binding ports to instructions. In practice, modern Intel chips don't do this perfectly, so in such cases you see worse-than-ideal throughput. The algorithm being used isn't clear, but it seems usually much better than just random assignment to the available ports. This case only arises when instruction-to-port assignments are _partially_ overlapping, as in the above example (which is common in practice).
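A concrete shape for this (my own sketch, with Skylake port assignments as I read them from Agner's tables: imul r64 is p1-only, bswap r32 can go to p1 or p5):

    #include <stdint.h>

    #define N 100000000L

    /* x is a loop-carried imul chain that can only execute on p1; the
       bswap can go to p1 or p5 and has no dependents, so an ideal
       scheduler always sends it to p5. Every time it lands on p1 instead,
       it can delay the chain. (imul's 3-cycle latency makes the numbers
       differ from the idealized 1-cycle example above, but the structure
       -- one instruction p1-only, one p1-or-p5 -- is the same.) */
    void demo(void) {
        uint64_t x = 3;
        uint32_t y = 0x12345678;
        for (long i = 0; i < N; i++)
            __asm__ volatile(
                "imul  $3, %[x], %[x]\n\t"
                "bswap %[y]\n\t"
                : [x] "+r"(x), [y] "+r"(y));
        __asm__ volatile("" : : "r"(x), "r"(y));  /* keep results live */
    }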
---
[1] This is an approximation: at the changeover point between the muls and the loads you'll see OoO help a bit.
Regarding your example with 1000 muls followed by 1000 loads... That's why in my experiments I interleaved loads and bswaps, because that's what (hopefully) every decent compiler will do.
« I interleaved loads and bswaps, because that's what (hopefully) every decent compiler will do »
Your optimism about compiler behavior is charming. I think you may be disappointed if you expect compilers to interleave loads just because this approach works better on modern processors. My experience has been that GCC goes out of its way to hoist all of your carefully interleaved loads to a big block at the top.
I'm guessing it does this because it's a simple heuristic that was often helpful before out-of-order processors became common. Normally, the performance impact of this is very small, but when it does exist, it's usually negative. If I recall, ICC does a better job of interleaving, or at least leaving things interleaved.
Well, yeah.
I don't know much about the current state of the art (because I don't touch the CodeGen on a daily basis), but I kind of look into the future with hope that compilers will better handle at least those "simple" cases. :)
Compilers are rarely going to transform the 1000/1000 example into the 10/10 one, even if they were smarter.
Often such a transformation is simply impossible: the effect of interleaving the instructions may be different from what the source dictates.
Also the 1000/1000 example probably doesn't arise as a long stream of explicit instructions: it is probably just a short loop with 1000 iterations! That makes it even less likely that the compiler will simply start interleaving various instructions following the loop into the loop body somehow.
So, if instead of issuing a second bswap instruction, you were to issue an instruction that could go to a different ALU port, you would get that one for (almost) free?
I believe what you describe are called pipes, not ports.
EDIT: I was commenting based on my day job, but what do I know :-) Reservation stations have ports, but the multiple execution units are commonly referred to as pipes.
Basically, in one of Intel's execution engines you have a reservation station which is fed µops and allocates resources to each µop. An RS port is what connects the ALUs, load/store units, vector units, etc. to the RS.
[1] https://llvm.org/docs/CommandGuide/llvm-mca.html
[2] https://github.com/vchuravy/IACA.jl