I reran the benchmark against loop-5 and loop-7 from the second post. Runtime is basically the same on my machine.
I would have expected yours to be faster, given that it executes fewer instructions per loop iteration.
Though maybe the CPU can issue `adc` on more execution ports than a load from memory?

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.00 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-7 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.01 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-5 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'