Just a note about the name of the algorithm: Knuth's The Art of Computer Programming indeed uses letters for algorithms, but the intended convention is that, for example, this algorithm be called "Algorithm 4.3.1D" (that's the name used in "Appendix C — Index to Algorithms and Theorems"). From page 2 of Volume 1 of TAOCP:
> The format above illustrates the style in which all of the algorithms throughout this book will be presented. Each algorithm we consider has been given an identifying letter (E in the preceding example), and the steps of the algorithm are identified by this letter followed by a number (E1, E2, E3). The chapters are divided into numbered sections; within a section the algorithms are designated by letter only, but when algorithms are referred to in other sections, the appropriate section number is attached. For example, we are now in Section 1.1; within this section Euclid’s algorithm is called Algorithm E, while in later sections it is referred to as Algorithm 1.1E.
In the case of this particular "Algorithm D", the letter "D" was probably chosen because it's an algorithm for "Division of nonnegative integers" (it follows algorithms "A", "S" and "M" for you-can-guess-what); indeed the same volume also has another "Algorithm D (Factoring with sieves)" (= Algorithm 4.5.4D) and an "Algorithm D (Division of polynomials over a field)" (= Algorithm 4.6.1D). Since within its own section an algorithm is referred to simply as "Algorithm D" (for example), this convention is not very obvious unless one has looked at either the second page of the first volume or the appendix. Or, I guess, within a certain field a particular section of TAOCP is implicit, so it's clear from context.
The Möller & Granlund paper offers a modification to Knuth's algorithm D which avoids all hardware division by using an approximate reciprocal. The 3-by-2 division is particularly interesting, as it allows you to skip the quotient loop (D3). It's a nice little optimization that I've spent some time tinkering with in my own code. This technique is notably used in GMP and other arbitrary-precision integer libraries.
IIRC, I tried this method, but did not see any performance improvement. The method replaces a single `divq` instruction with a table lookup and several multiplications. On modern processors this is no longer a worthwhile trade-off.
The 2-by-1 and 3-by-2 division functions described in the paper result in a very measurable speedup in my code. I think you're confusing those with the reciprocal calculation itself (which can be computed with a lookup table). I agree that part doesn't really lend itself to any significant performance benefit and is probably better calculated with a single hardware division instead.
I feel it necessary to point out that the 3-by-2 division actually has multiple benefits which are easy to miss:
1. The quotient loop can be skipped as I mentioned.
2. The "Add back" step is less likely to be triggered.
3. Since a 2-word remainder is computed with the division, you can skip 2 iterations on the multiply+subtract step.
My reimplementation of GMP documents both the 2-by-1 and 3-by-2 divisions pretty thoroughly[1][2].
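To give a flavor of it, here is a rough sketch of the 3-by-2 step, following Algorithm 2 of the Möller & Granlund paper (the naming and layout are mine, and it assumes a compiler with `unsigned __int128` such as gcc or clang):

    #include <stdint.h>

    typedef unsigned __int128 u128;

    // Sketch of the 3-by-2 primitive after Möller & Granlund, Algorithm 2.
    // Divides <u2,u1,u0> by the normalized divisor d = <d1,d0> (top bit of
    // d1 set), given the precomputed reciprocal
    //     v = floor((2^192 - 1) / d) - 2^64.
    // Requires <u2,u1> < <d1,d0> so the quotient fits in one word. All
    // intermediate arithmetic wraps mod 2^128, as the algorithm intends.
    static inline uint64_t div_3by2(uint64_t u2, uint64_t u1, uint64_t u0,
                                    uint64_t d1, uint64_t d0, uint64_t v,
                                    u128 *rem)
    {
        u128 d = ((u128)d1 << 64) | d0;
        u128 q = (u128)v * u2 + (((u128)u2 << 64) | u1);
        uint64_t q1 = (uint64_t)(q >> 64);
        uint64_t q0 = (uint64_t)q;
        uint64_t r1 = u1 - q1 * d1;                 // mod 2^64
        u128 r = (((u128)r1 << 64) | u0) - (u128)q1 * d0 - d;
        q1++;
        if ((uint64_t)(r >> 64) >= q0) {            // adjust down
            q1--;
            r += d;
        }
        if (r >= d) {                               // rare: adjust up
            q1++;
            r -= d;
        }
        *rem = r;
        return q1;
    }

The two conditional fix-ups at the end are what replace step D3's trial-quotient loop, and the 2-word remainder falls out of the same computation, which is where points 1 and 3 above come from.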
I had not seen that GMP reimplementation before, but it looks very readable. Thanks!
I used the Möller & Granlund paper for a portable implementation of 2-by-1 division [1], and on my machine (which is admittedly not exactly new) it runs much faster than the DIV instruction.
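For anyone who wants the gist without reading the paper, here is a minimal sketch of the 2-by-1 scheme (my own naming, not the linked code), assuming `unsigned __int128` on gcc/clang; the divisor must be normalized (top bit set) and `u1 < d` so the quotient fits in one word:

    #include <stdint.h>

    typedef unsigned __int128 u128;

    // Precompute v = floor((2^128 - 1) / d) - 2^64 for a normalized d.
    // The truncating cast performs the 2^64 subtraction. This is the one
    // (hardware or table-driven) division per invariant divisor.
    static inline uint64_t reciprocal_2by1(uint64_t d) {
        return (uint64_t)(~(u128)0 / d);
    }

    // Divide <u1,u0> by d using the precomputed reciprocal v.
    static inline uint64_t div_2by1(uint64_t u1, uint64_t u0, uint64_t d,
                                    uint64_t v, uint64_t *rem) {
        u128 q = (u128)v * u1 + (((u128)u1 << 64) | u0);
        uint64_t q1 = (uint64_t)(q >> 64) + 1;  // candidate quotient word
        uint64_t q0 = (uint64_t)q;
        uint64_t r = u0 - q1 * d;               // candidate remainder, mod 2^64
        if (r > q0) {                           // one too high: adjust down
            q1--;
            r += d;
        }
        if (r >= d) {                           // rare: one too low, adjust up
            q1++;
            r -= d;
        }
        *rem = r;
        return q1;
    }

The wrap-around in the intermediate arithmetic is intentional; after the two multiplications everything is adds, compares and at most two corrections, which is why it can beat a hardware divide.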
I have implemented this algorithm twice, once in ARM assembly and once in C.
In some ways the assembly implementation was easier as you aren't forever fighting with the compiler to get it to generate the correct instructions.
The 32 x 32 to 64-bit multiply is one example. There isn't a way of expressing that directly in C: you write a 64 x 64 bit multiply of 32-bit variables cast to 64 bits and hope the compiler gets the hint that it should do a 32 x 32 to 64-bit multiply, not a full 64 x 64 bit multiply. Verifying this of course involves reading the assembly output.
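For what it's worth, the idiom in question amounts to a single cast; current gcc and clang do recognize it and emit one widening multiply on targets that have one (e.g. UMULL on 32-bit ARM), though as noted the only way to be sure is to read the output:

    #include <stdint.h>

    // Widening 32 x 32 -> 64 bit multiply in portable C. The cast promotes
    // the arithmetic to 64 bits; compilers see that both operands fit in
    // 32 bits and emit a single widening multiply instruction.
    static inline uint64_t mul_32x32_64(uint32_t a, uint32_t b) {
        return (uint64_t)a * b;
    }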
The other thing you'd like is addition with carry. It is possible to express this in C and have the compiler recognise what you are doing and generate the correct instructions but it is a pain! I noted the other day that zig has @addWithOverflow which is a step in the right direction.
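To illustrate, here are both portable spellings of add-with-carry (function names are mine): the classic compare-the-wrapped-sum idiom, and the gcc/clang builtin, which is roughly what Zig's @addWithOverflow wraps:

    #include <stdint.h>

    // (1) The classic idiom: an unsigned sum wrapped iff it came out
    // smaller than an operand (or equal, when a carry came in). Compilers
    // can turn chains of these into ADC/ADCS, but the matching is fragile.
    static inline uint64_t adc_idiom(uint64_t a, uint64_t b,
                                     unsigned cin, unsigned *cout) {
        uint64_t s = a + b + cin;
        *cout = (s < a) || (cin && s == a);
        return s;
    }

    // (2) The gcc/clang builtin makes the overflow check explicit.
    // At most one of the two additions can carry, hence the OR.
    static inline uint64_t adc_builtin(uint64_t a, uint64_t b,
                                       unsigned cin, unsigned *cout) {
        uint64_t t, s;
        unsigned c1 = __builtin_add_overflow(a, b, &t);
        unsigned c2 = __builtin_add_overflow(t, (uint64_t)cin, &s);
        *cout = c1 | c2;
        return s;
    }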
I can see why the author of the article found so many things to criticize in the algorithm studied, but I think they are mostly an artifact of the fact that C (like most high-level languages) is a poor choice for low-level arithmetic.
I agree a dedicated intrinsic like in Zig is even better, but it is nice that modern compilers make it reasonably straightforward to get the desired code even without dedicated intrinsics.
@addWithOverflow is easy to implement in C, but you still need to look at the assembler code to see if the compiler did the right thing.
I don't like that tension between C code you've arranged in just the right way and the hope that it continues to generate the assembler output it once did.
I guess a portable set of compiler intrinsics is what is really needed to solve this problem.
There is an x86 intrinsic for 32 x 32 -> 64 bit multiply nowadays.
Too bad you can’t put some sort of assert saying “this code should generate this assembly assuming x86-64” or something - at least you'd get a warning if a compiler upgrade changes it.
If the performance is important you can have performance regression tests. If not, then it probably doesn't matter too much whether the compiler always generates optimal assembly for this specific line of code.
What about "addWithCarry", ie. ADC? Getting the carry bit seems only half the battle. I'd not seen this compiler tool before so had a quick play around with it. It's very cool! But I couldn't get a compiler to recognise it could use ADC.
It seems rather petty to criticize the use of uint64_t in the 64-bit versions in multiple places when the original does the same:
32-bit version:
    const unsigned b = 65536;   // Number base (16 bits).
    unsigned un1, un0,          // Norm. dividend LSD's.
             vn1, vn0,          // Norm. divisor digits.
             q1, q0,            // Quotient digits.
             un32, un21, un10,  // Dividend digit pairs.
             rhat;              // A remainder.
So a properly adapted 64-bit version should use uint64_t and a 32-bit number base.
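For concreteness, such an adaptation would change only the types and the base, keeping Warren's variable names and comments:

    const uint64_t b = 4294967296; // Number base (32 bits).
    uint64_t un1, un0,             // Norm. dividend LSD's.
             vn1, vn0,             // Norm. divisor digits.
             q1, q0,               // Quotient digits.
             un32, un21, un10,     // Dividend digit pairs.
             rhat;                 // A remainder.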
I did not attempt to follow the author's arguments for why he advocates 32-bit variables, or why that deviation from the original would be correct. Using 64-bit variables does not seem to be any bottleneck at all here, so why bother.
But why? The Hacker's Delight author justifies it:
>Several of the variables below could be "short," but having them fullwords gives better code on gcc/Intel.
But does the same happen in the "adapted" code? Does forcing 64-bit variables on a 32-bit architecture give better code?
Wouldn't it have been better to just leave the compiler to choose the best appropriate size? (I guess you do that by just declaring the variable as unsigned? I'm no C guru...)
Also, the fact that all of them renamed the variable un32 to un64 tells you that they didn't have the slightest idea about what they were doing :D
Here is my optimized in-place Rust implementation [1].
It is a very tricky algorithm to get right. There are many edge cases that occur for only a ~2^-64 fraction of the input space, so they are hard to find even with fuzz-testing. The best strategy is to implement it for small limbs first and fuzz that well.
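To make the small-limb strategy concrete, here is a hypothetical C harness along those lines (all names are mine): the 2-by-1 reciprocal division rebuilt on 32-bit limbs, so that the machine's native 64-bit division can serve as the reference oracle:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    // 2-by-1 reciprocal division at 32-bit limb size: divide <u1,u0> by a
    // normalized d (top bit set, u1 < d), with v = floor((2^64-1)/d) - 2^32.
    static uint32_t div_2by1_32(uint32_t u1, uint32_t u0, uint32_t d,
                                uint32_t v, uint32_t *rem) {
        uint64_t q = (uint64_t)v * u1 + (((uint64_t)u1 << 32) | u0);
        uint32_t q1 = (uint32_t)(q >> 32) + 1;
        uint32_t q0 = (uint32_t)q;
        uint32_t r = u0 - q1 * d;
        if (r > q0) { q1--; r += d; }   // adjust down
        if (r >= d) { q1++; r -= d; }   // rare: adjust up
        *rem = r;
        return q1;
    }

    static uint32_t rand32(void) {      // rand() may yield only 15 bits
        return ((uint32_t)rand() << 17) ^ ((uint32_t)rand() << 7)
             ^ (uint32_t)rand();
    }

    int main(void) {
        for (long long i = 0; i < 100000000; i++) {
            uint32_t d  = rand32() | 0x80000000u;       // normalized divisor
            uint32_t u1 = rand32() % d;                 // ensure u1 < d
            uint32_t u0 = rand32();
            uint32_t v  = (uint32_t)(~(uint64_t)0 / d); // reciprocal
            uint64_t n  = ((uint64_t)u1 << 32) | u0;
            uint32_t r, q = div_2by1_32(u1, u0, d, v, &r);
            if (q != (uint32_t)(n / d) || r != (uint32_t)(n % d)) {
                printf("FAIL: n=%llu d=%u\n", (unsigned long long)n, d);
                return 1;
            }
        }
        puts("all ok");
        return 0;
    }

The same trick scales down further (8- or 16-bit limbs with a wider native type as oracle), which makes the rare adjustment paths far more likely to be hit.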
Very interesting exposition and comments. I think it shows one of the 'weaknesses' of TAOCP: it does not formalize, or at least make explicit, the relationship between the available operations and the final functionality of the algorithm. In that sense TAOCP remains a recipe book and not something that is 'executable', in the sense of being able to generate algorithms tailored to the available operations (and their performance), or at least to be used in a tool that could support the synthesis or verification of algorithms.
This is my favorite algorithm ever, and the first serious one that I tried (and failed) to implement in my life, when I was a teen. I remember dearly implementing the first three algorithms (A, S, M) in the nasm assembler on my Debian "hamm". Then I tried for many weeks to get D right, but never managed to be sure all the cases worked.
These algorithms are probably easier to implement in assembly than in a higher-level language.
So, Golang and V8 both use this algorithm when you choose to use an arbitrary-precision integer type. In Python (and Ruby?), all integers are arbitrary-precision - do those languages also use this algorithm for division?
Go in practice doesn't actually use the code cited.
Most of the math/bits package is recognized by the compiler and intrinsified down into 1 or a few CPU instructions (not function calls) on almost all architectures.
In a git checkout of Go, look at e.g. `git grep Div64 src/cmd/compile`.
> In Python (and Ruby?), all integers are arbitrary-precision
IIRC, both (well, CPython and MRI) internally use distinct “small integer” and “big integer” representations, but that is transparent to the user. I'd imagine that different low-level operations are involved depending on the internal type.
This is the same as Common Lisp implementations (it might be in the spec, not sure). It's completely transparent, but the nice thing in CL is there is a built-in way to force the compiler to stick to fixnums or bignums so you can avoid branching on overflows etc.
> IIRC, both (well, CPython and MRI) internally use distinct “small integer” and “big integer” representations, but that is transparent to the user.
This seems to be correct for MRI. Before Ruby 2.4, both `Fixnum` and `Bignum` were exposed to the programmer. In 2016 these were unified into `Integer` from the programmer's point of view, but with "no internal representation change" [0].
> "Normal" ints in Python are certainly native. You don't get bignums until you need them.
However, I don't think this is true for current versions of Python. As part of the 2.x => 3.0 transition, Python underwent a similar unification of integer types. Python 2.x exposes both `int` and `long`, while Python 3.x has only `int`; internally, 3.x's `int` is represented by `PyLongObject` and the old `PyIntObject` has been removed. Even the small integer cache (which holds the values -5 to 256) was changed from holding `PyIntObject`s in 2.x [1] to holding `PyLongObject`s [2].
So it looks like in Python 3.x, all integers are arbitrary-precision. I believe this was a significant contributor to early versions of Python 3.x being slower than 2.x.