Why isn't this faster than the "standard" loop-based version? The disclaimer at the start says it's not, but I would have thought that four 64-bit arithmetic operations would be much faster than a loop with eight iterations, comparisons, and so on.
First and foremost: YES, I KNOW! This is completely senseless microbenchmarking and micro-optimization. I ONLY LOOK AT THIS FOR FUN.
That being said, the 64-bit multiplication is faster, and it also runs in constant time. I measured this with the timestamp counter, which doesn't necessarily count actual CPU clock cycles but possibly some integer multiple of them (for me it always seems to output numbers divisible by 8).
/* anyone have a big-endian machine at hand to test this? */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define MULT 0x8040201008040201ULL
#else
#define MULT 0x0102040810204080ULL
#endif
Even without such a machine, I believe that this big-endian implementation can't work. Hint: you can try it on a little-endian machine too; the resulting bits must end up in the right positions, and I believe they wouldn't. That's the beauty of magic constants: their construction isn't as easy as it appears to be.
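One way to check the big-endian constant without the hardware is to build, with shifts, the 64-bit word that each byte order would produce when loading the eight bools, and then compare both multiplications against a plain loop. The following is only a sketch under assumptions: it assumes the packed byte is taken from the top of the product (`>> 56`) and that `a[0]` is meant to end up in the most significant bit. The names `MULT_LE`, `MULT_BE`, `pack_loop`, `load_le`, `load_be` and `count_mismatches` are made up for this illustration.

```c
#include <assert.h>   /* for the assert() shown in the usage note */
#include <stdbool.h>
#include <stdint.h>

/* The two constants from the snippet above, under hypothetical names. */
#define MULT_LE 0x8040201008040201ULL
#define MULT_BE 0x0102040810204080ULL

/* Reference packing with a loop: a[0] becomes the most significant bit
   (an assumption about the intended bit order). */
static uint8_t pack_loop(const bool a[8]) {
    uint8_t r = 0;
    for (int i = 0; i < 8; ++i)
        r = (uint8_t)((r << 1) | (a[i] ? 1 : 0));
    return r;
}

/* The word a little-endian machine would see: a[0] in the lowest byte. */
static uint64_t load_le(const bool a[8]) {
    uint64_t t = 0;
    for (int i = 0; i < 8; ++i)
        t |= (uint64_t)(a[i] ? 1 : 0) << (8 * i);
    return t;
}

/* The word a big-endian machine would see: a[0] in the highest byte. */
static uint64_t load_be(const bool a[8]) {
    uint64_t t = 0;
    for (int i = 0; i < 8; ++i)
        t |= (uint64_t)(a[i] ? 1 : 0) << (8 * (7 - i));
    return t;
}

/* Try all 256 bool patterns with both constants and count how many
   results differ from the loop version. */
int count_mismatches(void) {
    int bad = 0;
    for (int p = 0; p < 256; ++p) {
        bool a[8];
        for (int i = 0; i < 8; ++i)
            a[i] = (p >> i) & 1;
        uint8_t want = pack_loop(a);
        if ((uint8_t)((load_le(a) * MULT_LE) >> 56) != want) ++bad;
        if ((uint8_t)((load_be(a) * MULT_BE) >> 56) != want) ++bad;
    }
    return bad;
}
```

Calling `assert(count_mismatches() == 0);` from a `main` runs the check. Working through the partial products by hand suggests that, for both constants, every set bit of the loaded word meets exactly one set bit of the constant inside the top byte and no two partial products share a bit position, so no carries occur and the count should be 0. That is worth confirming by actually running it, since it only simulates the big-endian load and says nothing about a real big-endian compiler or ABI.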