Why isn't this faster than the "standard" loop-based version? The disclaimer at the start says it's not, but I would have thought that 4 64-bit math operations would be much faster than a loop with 8 steps and comparisons and so on.
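
For concreteness, by the loop-based version I mean roughly the following (my own sketch with an illustrative name, not necessarily the article's exact code): eight iterations, each with a test and a store.

    #include <stdint.h>

    /* Hypothetical loop version: walk the 8 bits MSB-first and emit '0'/'1'. */
    static void byte_to_bits_loop(uint8_t b, char out[9])
    {
        for (int i = 0; i < 8; i++)
            out[i] = (b & (0x80u >> i)) ? '1' : '0';
        out[8] = '\0';
    }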



First and foremost: YES, I KNOW! This is completely senseless microbenchmarking and micro-optimization. I ONLY LOOK AT THIS FOR FUN.

That being said: the 64-bit multiplication is faster, and also constant in time. I've measured this using the timestamp counter, which doesn't necessarily count actual CPU clock cycles but possibly an integer multiple of them (it always seems to output numbers divisible by 8 for me).

    (...)
    100 loop:01100100(152) 64bit:01100100(48)
    101 loop:01100101(128) 64bit:01100101(48)
    102 loop:01100110(160) 64bit:01100110(48)
    103 loop:01100111(128) 64bit:01100111(48)
    104 loop:01101000(192) 64bit:01101000(48)
    105 loop:01101001(88) 64bit:01101001(48)
    106 loop:01101010(152) 64bit:01101010(48)
    107 loop:01101011(128) 64bit:01101011(48)
    108 loop:01101100(160) 64bit:01101100(48)
    (...)

That's on an Intel(R) Core(TM) i5 CPU M 450 @ 2.40GHz, and the cycles include the rdtsc, a few movs, and the call to the respective function.

https://github.com/vogelchr/bitstring_via_64bit_mult
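
For reference, the measurement boils down to something like this (a simplified sketch with an illustrative name, not necessarily the exact harness in the repo):

    #include <stdint.h>
    #include <x86intrin.h>          /* __rdtsc() */

    /* Hypothetical harness: TSC ticks around a single call. The delta
       includes the rdtsc overhead itself and need not equal core clock
       cycles, since the TSC may tick at a fixed multiple of them. */
    static uint64_t time_one_call(void (*fn)(uint8_t, char *), uint8_t b, char *buf)
    {
        uint64_t t0 = __rdtsc();
        fn(b, buf);
        uint64_t t1 = __rdtsc();
        return t1 - t0;
    }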


I couldn't resist, so I tried a table look-up version (16 elements), which was about twice as fast as the multiply version.

Intel Core i5-3210M @ 2.50GHz
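
A 16-entry table suggests indexing by nibble; a minimal sketch of that approach (illustrative names, not necessarily the exact code benchmarked above). The whole table is only 80 bytes, so it fits in a couple of cache lines.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical nibble-table version: 16 entries of 4 ASCII digits each. */
    static const char nibble_bits[16][5] = {
        "0000", "0001", "0010", "0011", "0100", "0101", "0110", "0111",
        "1000", "1001", "1010", "1011", "1100", "1101", "1110", "1111"
    };

    static void byte_to_bits_lut(uint8_t b, char out[9])
    {
        memcpy(out,     nibble_bits[b >> 4],  4);   /* high nibble first */
        memcpy(out + 4, nibble_bits[b & 0xF], 4);   /* then low nibble   */
        out[8] = '\0';
    }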


Funny, I read the first part of the sentence, 'I tried a table look-up', and immediately thought: WHAT ABOUT THE CACHE?

Then I continued to '16 elements'. Oh, OK then; next time I'll try to read the whole sentence first.


Thanks for measuring.

From your GitHub source:

    /* anyone have a big-endian machine at hand to test this? */
    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    #define MULT  0x8040201008040201ULL
    #else
    #define MULT  0x0102040810204080ULL
    #endif
Even without the machine, I believe such a big-endian implementation can't work. Hint: you can try it on a little-endian machine too; the resulting bits would have to come out right, and I believe they wouldn't. That's the beauty of the magic constants: their construction isn't as easy as it appears to be.


You are right, the carry bits will clobber the neighboring bytes, I think.


> the carry bits will clobber

Actually, it's not the carry bits but the "content" bits that do the clobbering: for the magic to work, the 8-bit content has to appear in several places there.
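
To spell it out: 0x8040201008040201 is the sum of 2^(9k) for k = 0..7, so the product is eight copies of the input byte spaced 9 bits apart; they never overlap, no carries occur, and the byte-wise mask picks exactly one distinct input bit per output byte. The byte-reversed constant 0x0102040810204080 spaces the copies only 7 bits apart, so adjacent copies overlap and clobber each other regardless of the host's endianness. A sketch of the little-endian version (my reconstruction of the usual multiply-mask-shift trick, not necessarily the exact code in the repo):

    #include <stdint.h>
    #include <string.h>

    /* Sketch of the multiply trick on a little-endian host.
       b * 0x8040201008040201 = sum of (b << 9k) for k = 0..7: eight copies
       of b, 9 bits apart (the top copy is truncated, but only its lowest
       bit is needed). Byte k of the product then has bit (7-k) of b in its
       MSB; the mask isolates it and the shift turns each byte into 0 or 1. */
    static void byte_to_bits_mult(uint8_t b, char out[9])
    {
        uint64_t x = (uint64_t)b * 0x8040201008040201ULL;
        x &= 0x8080808080808080ULL;     /* keep one bit per byte lane      */
        x >>= 7;                        /* each byte is now 0 or 1         */
        x |= 0x3030303030303030ULL;     /* turn into ASCII '0' / '1'       */
        memcpy(out, &x, 8);             /* LE memory order: MSB of b first */
        out[8] = '\0';
    }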


It is surely faster on modern 64-bit CPUs, and surely not faster on 8-bit ones.



