Why isn't this faster than the "standard" loop based version? The disclaimer at ...

cnvogel · on Nov 5, 2015

First and foremost: YES, I KNOW! This is completely senseless microbenchmarking and micro-optimization. I ONLY LOOK AT THIS FOR FUN.

That being said. The 64bit multiplication is faster, and also constant in time. I've measured this using the timestamp counter which doesn't necessarily count actual CPU clock cycles, but possibly an integer multiple of it (seems to output always numbers divisable by 8 for me).

    (...)
    100 loop:01100100(152) 64bit:01100100(48)
    101 loop:01100101(128) 64bit:01100101(48)
    102 loop:01100110(160) 64bit:01100110(48)
    103 loop:01100111(128) 64bit:01100111(48)
    104 loop:01101000(192) 64bit:01101000(48)
    105 loop:01101001(88) 64bit:01101001(48)
    106 loop:01101010(152) 64bit:01101010(48)
    107 loop:01101011(128) 64bit:01101011(48)
    108 loop:01101100(160) 64bit:01101100(48)
    (...)

That's on an Intel(R) Core(TM) i5 CPU M 450 @ 2.40GHz and the cycles include the rdtsc, a few movs, and the call to the respective function.

https://github.com/vogelchr/bitstring_via_64bit_mult

nocsaer1 · on Nov 5, 2015

I couldn't resist so I tried a table look-up (16 elements) version which was about twice as fast as the multiply version.

Intel Core i5-3210M @ 2.50GHz

renox · on Nov 6, 2015

Funny, I read about the first part of the sentence 'I tried a table look-up' and thought immediately WHAT ABOUT THE CACHE?

Then I continued '16 elements', oh, OK then, next time I'll try to read the whole sentence first.

acqq · on Nov 5, 2015

Thanks for measuring.

From your github source:

    /* anyone have a big-endian machine at hand to test this? */
    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    #define MULT  0x8040201008040201ULL
    #else
    #define MULT  0x0102040810204080ULL
    #endif

Even without the machine I believe that such a big endian implementation can't work. Hint: you can try it on the little endian machine too, the resulting bits must be appropriate, and I believe they wouldn't. That's the beauty of the magic constants, their construction isn't so easy as it appears to be.

cnvogel · on Nov 5, 2015

You are right, the carry bits will clobber the neighboring bytes, I think.

acqq · on Nov 5, 2015

> the carry bits will clobber

Actually, not the carry bits, the "content" bits. The 8-bit content must appear on more places there for the magic to work.

acqq · on Nov 5, 2015

It is surely faster on modern 64-bit CPU's, and surely not faster on the 8-bit ones.