Just a note about the name of the algorithm: Knuth's The Art of Computer Programming indeed uses letters for algorithms, but the intended convention is that, for example, this algorithm be called "Algorithm 4.3.1D" (that's the name used in "Appendix C — Index to Algorithms and Theorems"). From page 2 of Volume 1 of TAOCP:
> The format above illustrates the style in which all of the algorithms throughout this book will be presented. Each algorithm we consider has been given an identifying letter (E in the preceding example), and the steps of the algorithm are identified by this letter followed by a number (E1, E2, E3). The chapters are divided into numbered sections; within a section the algorithms are designated by letter only, but when algorithms are referred to in other sections, the appropriate section number is attached. For example, we are now in Section 1.1; within this section Euclid’s algorithm is called Algorithm E, while in later sections it is referred to as Algorithm 1.1E.
In the case of this particular "Algorithm D", the letter "D" was probably chosen because it's an algorithm for "Division of nonnegative integers" (it follows algorithms "A", "S" and "M" for you-can-guess-what); indeed the same volume also has another "Algorithm D (Factoring with sieves)" (= Algorithm 4.5.4D) and an "Algorithm D (Division of polynomials over a field)" (= Algorithm 4.6.1D). Since within its own section an algorithm is referred to simply as "Algorithm D" (for example), this convention is not very obvious unless one has looked at either the second page of the first volume or the appendix. Or, I guess, within a certain field a particular section of TAOCP is implicit, so it's clear from context.
The Möller & Granlund paper offers a modification to Knuth's algorithm D which avoids all hardware division by using an approximate reciprocal. The 3-by-2 division is particularly interesting, as it allows you to skip the quotient loop (D3). It's a nice little optimization that I've spent some time tinkering with in my own code. This technique is notably used in GMP and other arbitrary-precision integer libraries.
IIRC, I tried this method, but did not see any performance improvement. The method replaces a single `divq` instruction with a table lookup and several multiplications. On modern processors this is no longer a worthwhile trade-off.
The 2-by-1 and 3-by-2 division functions described in the paper result in a very measurable speedup in my code. I think you're confusing those with the reciprocal calculation itself (which can be computed with a lookup table). I agree that part doesn't really lend itself to any significant performance benefit and is probably better calculated with a single hardware division instead.
I feel it necessary to point out that the 3-by-2 division actually has multiple benefits which are easy to miss:
1. The quotient loop can be skipped as I mentioned.
2. The "Add back" step is less likely to be triggered.
3. Since a 2-word remainder is computed with the division, you can skip 2 iterations on the multiply+subtract step.
My reimplementation of GMP documents both the 2-by-1 and 3-by-2 divisions pretty thoroughly[1][2].
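To give a flavor of it, here is a rough sketch of the 3-by-2 step, following Algorithm 2 of the Möller & Granlund paper (the naming and layout are mine, and it assumes a compiler with `unsigned __int128` such as gcc or clang):

    #include <stdint.h>

    typedef unsigned __int128 u128;

    // Sketch of the 3-by-2 primitive after Möller & Granlund, Algorithm 2.
    // Divides <u2,u1,u0> by the normalized divisor d = <d1,d0> (top bit of
    // d1 set), given the precomputed reciprocal
    //     v = floor((2^192 - 1) / d) - 2^64.
    // Requires <u2,u1> < <d1,d0> so the quotient fits in one word. All
    // intermediate arithmetic wraps mod 2^128, as the algorithm intends.
    static inline uint64_t div_3by2(uint64_t u2, uint64_t u1, uint64_t u0,
                                    uint64_t d1, uint64_t d0, uint64_t v,
                                    u128 *rem)
    {
        u128 d = ((u128)d1 << 64) | d0;
        u128 q = (u128)v * u2 + (((u128)u2 << 64) | u1);
        uint64_t q1 = (uint64_t)(q >> 64);
        uint64_t q0 = (uint64_t)q;
        uint64_t r1 = u1 - q1 * d1;                 // mod 2^64
        u128 r = (((u128)r1 << 64) | u0) - (u128)q1 * d0 - d;
        q1++;
        if ((uint64_t)(r >> 64) >= q0) {            // adjust down
            q1--;
            r += d;
        }
        if (r >= d) {                               // rare: adjust up
            q1++;
            r -= d;
        }
        *rem = r;
        return q1;
    }

The two conditional fix-ups at the end are what replace step D3's trial-quotient loop, and the 2-word remainder falls out of the same computation, which is where points 1 and 3 above come from.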
I had not seen that GMP reimplementation before, but it looks very readable. Thanks!
I used the Möller & Granlund paper for a portable implementation of 2-by-1 division [1], and on my machine (which is admittedly not exactly new) it runs much faster than the DIV instruction.
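For anyone who wants the gist without reading the paper, here is a minimal sketch of the 2-by-1 scheme (my own naming, not the linked code), assuming `unsigned __int128` on gcc/clang; the divisor must be normalized (top bit set) and `u1 < d` so the quotient fits in one word:

    #include <stdint.h>

    typedef unsigned __int128 u128;

    // Precompute v = floor((2^128 - 1) / d) - 2^64 for a normalized d.
    // The truncating cast performs the 2^64 subtraction. This is the one
    // (hardware or table-driven) division per invariant divisor.
    static inline uint64_t reciprocal_2by1(uint64_t d) {
        return (uint64_t)(~(u128)0 / d);
    }

    // Divide <u1,u0> by d using the precomputed reciprocal v.
    static inline uint64_t div_2by1(uint64_t u1, uint64_t u0, uint64_t d,
                                    uint64_t v, uint64_t *rem) {
        u128 q = (u128)v * u1 + (((u128)u1 << 64) | u0);
        uint64_t q1 = (uint64_t)(q >> 64) + 1;  // candidate quotient word
        uint64_t q0 = (uint64_t)q;
        uint64_t r = u0 - q1 * d;               // candidate remainder, mod 2^64
        if (r > q0) {                           // one too high: adjust down
            q1--;
            r += d;
        }
        if (r >= d) {                           // rare: one too low, adjust up
            q1++;
            r -= d;
        }
        *rem = r;
        return q1;
    }

The wrap-around in the intermediate arithmetic is intentional; after the two multiplications everything is adds, compares and at most two corrections, which is why it can beat a hardware divide.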
I have implemented this algorithm twice, once in ARM assembly and once in C.
In some ways the assembly implementation was easier as you aren't forever fighting with the compiler to get it to generate the correct instructions.
The 32 x 32 to 64-bit multiply is one example. There isn't a way of expressing that directly in C: you write a 64 x 64 bit multiply of 32-bit variables cast to 64 bits and hope the compiler gets the hint that it should do a 32 x 32 to 64-bit multiply, not a full 64 x 64 bit multiply. Verifying this of course involves reading the assembly output.
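For what it's worth, the idiom in question amounts to a single cast; current gcc and clang do recognize it and emit one widening multiply on targets that have one (e.g. UMULL on 32-bit ARM), though as noted the only way to be sure is to read the output:

    #include <stdint.h>

    // Widening 32 x 32 -> 64 bit multiply in portable C. The cast promotes
    // the arithmetic to 64 bits; compilers see that both operands fit in
    // 32 bits and emit a single widening multiply instruction.
    static inline uint64_t mul_32x32_64(uint32_t a, uint32_t b) {
        return (uint64_t)a * b;
    }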
The other thing you'd like is addition with carry. It is possible to express this in C and have the compiler recognise what you are doing and generate the correct instructions but it is a pain! I noted the other day that zig has @addWithOverflow which is a step in the right direction.
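To illustrate, here are both portable spellings of add-with-carry (function names are mine): the classic compare-the-wrapped-sum idiom, and the gcc/clang builtin, which is roughly what Zig's @addWithOverflow wraps:

    #include <stdint.h>

    // (1) The classic idiom: an unsigned sum wrapped iff it came out
    // smaller than an operand (or equal, when a carry came in). Compilers
    // can turn chains of these into ADC/ADCS, but the matching is fragile.
    static inline uint64_t adc_idiom(uint64_t a, uint64_t b,
                                     unsigned cin, unsigned *cout) {
        uint64_t s = a + b + cin;
        *cout = (s < a) || (cin && s == a);
        return s;
    }

    // (2) The gcc/clang builtin makes the overflow check explicit.
    // At most one of the two additions can carry, hence the OR.
    static inline uint64_t adc_builtin(uint64_t a, uint64_t b,
                                       unsigned cin, unsigned *cout) {
        uint64_t t, s;
        unsigned c1 = __builtin_add_overflow(a, b, &t);
        unsigned c2 = __builtin_add_overflow(t, (uint64_t)cin, &s);
        *cout = c1 | c2;
        return s;
    }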
I can see why the author of the article found so many things to criticize in the algorithm studied, but I think they are mostly an artifact of the fact that C (like most high-level languages) is a poor choice for low-level arithmetic.
I agree a dedicated intrinsic like in Zig is even better, but it is nice that modern compilers make it reasonably straightforward to get the desired code even without dedicated intrinsics.
@addWithOverflow is easy to implement in C, but you still need to look at the assembler code to see if the compiler did the right thing.
I don't like that tension between C code you've arranged in just the right way and the hope that it continues to generate the assembler output it once did.
I guess a portable set of compiler intrinsics is what is really needed to solve this problem.
There is an x86 intrinsic for 32 x 32 -> 64 bit multiply nowadays.
Too bad you can’t put some sort of assert saying “this code should generate this assembly assuming x86-64” or something - at least you'd get a warning if a compiler upgrade changes it.
If the performance is important you can have performance regression tests. If not, then it probably doesn't matter too much whether the compiler always generates optimal assembly for this specific line of code.
What about "addWithCarry", ie. ADC? Getting the carry bit seems only half the battle. I'd not seen this compiler tool before so had a quick play around with it. It's very cool! But I couldn't get a compiler to recognise it could use ADC.
It seems rather petty to criticize the use of uint64_t in the 64-bit versions in multiple places when the original does the same:
32-bit version:
    const unsigned b = 65536;   // Number base (16 bits).
    unsigned un1, un0,          // Norm. dividend LSD's.
             vn1, vn0,          // Norm. divisor digits.
             q1, q0,            // Quotient digits.
             un32, un21, un10,  // Dividend digit pairs.
             rhat;              // A remainder.
So a properly adapted 64-bit version should use uint64_t and a 32-bit number base.
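For concreteness, such an adaptation would change only the types and the base, keeping Warren's variable names and comments:

    const uint64_t b = 4294967296; // Number base (32 bits).
    uint64_t un1, un0,             // Norm. dividend LSD's.
             vn1, vn0,             // Norm. divisor digits.
             q1, q0,               // Quotient digits.
             un32, un21, un10,     // Dividend digit pairs.
             rhat;                 // A remainder.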
I did not attempt to follow the author's arguments for why he advocates 32-bit variables, or why that deviation from the original would be correct. Using 64-bit variables does not seem to be any bottleneck at all here, so why bother.
But why? The Hacker's Delight author justifies it:
>Several of the variables below could be "short," but having them fullwords gives better code on gcc/Intel.
But does the same happen in the "adapted" code? Does forcing 64-bit variables on a 32-bit architecture give better code?
Wouldn't it have been better to just leave the compiler to choose the best appropriate size? (I guess you do that by just declaring the variable as unsigned? I'm no C guru...)
Also, the fact that all of them renamed the variable un32 to un64 tells you that they didn't have the slightest idea about what they were doing :D
Here is my optimized in-place Rust implementation [1].
It is a very tricky algorithm to get right. There are many edge cases that occur for only a ~2^-64 fraction of the input space, so they are hard to find even with fuzz-testing. The best strategy is to implement it for small limbs first and fuzz that well.
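To make the small-limb strategy concrete, here is a hypothetical C harness along those lines (all names are mine): the 2-by-1 reciprocal division rebuilt on 32-bit limbs, so that the machine's native 64-bit division can serve as the reference oracle:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    // 2-by-1 reciprocal division at 32-bit limb size: divide <u1,u0> by a
    // normalized d (top bit set, u1 < d), with v = floor((2^64-1)/d) - 2^32.
    static uint32_t div_2by1_32(uint32_t u1, uint32_t u0, uint32_t d,
                                uint32_t v, uint32_t *rem) {
        uint64_t q = (uint64_t)v * u1 + (((uint64_t)u1 << 32) | u0);
        uint32_t q1 = (uint32_t)(q >> 32) + 1;
        uint32_t q0 = (uint32_t)q;
        uint32_t r = u0 - q1 * d;
        if (r > q0) { q1--; r += d; }   // adjust down
        if (r >= d) { q1++; r -= d; }   // rare: adjust up
        *rem = r;
        return q1;
    }

    static uint32_t rand32(void) {      // rand() may yield only 15 bits
        return ((uint32_t)rand() << 17) ^ ((uint32_t)rand() << 7)
             ^ (uint32_t)rand();
    }

    int main(void) {
        for (long long i = 0; i < 100000000; i++) {
            uint32_t d  = rand32() | 0x80000000u;       // normalized divisor
            uint32_t u1 = rand32() % d;                 // ensure u1 < d
            uint32_t u0 = rand32();
            uint32_t v  = (uint32_t)(~(uint64_t)0 / d); // reciprocal
            uint64_t n  = ((uint64_t)u1 << 32) | u0;
            uint32_t r, q = div_2by1_32(u1, u0, d, v, &r);
            if (q != (uint32_t)(n / d) || r != (uint32_t)(n % d)) {
                printf("FAIL: n=%llu d=%u\n", (unsigned long long)n, d);
                return 1;
            }
        }
        puts("all ok");
        return 0;
    }

The same trick scales down further (8- or 16-bit limbs with a wider native type as oracle), which makes the rare adjustment paths far more likely to be hit.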
Very interesting exposition and comments. I think it shows one of the 'weaknesses' of TAOCP: it does not formalize, or at least make explicit, the relationship between the available operations and the final functionality of the algorithm. In that sense TAOCP remains a recipe book and not something that is 'executable', in the sense of being able to generate algorithms tailored to the available operations (and their performance), or at least to be used in a tool that could support the synthesis or verification of algorithms.
This is my favorite algorithm ever, and the first serious one that I tried (and failed) to implement in my life, when I was a teen. I remember dearly implementing the first three algorithms (A, S, M) in the nasm assembler on my Debian "hamm". Then I tried for many weeks to get D right, but never managed to be sure all the cases worked.
These algorithms are probably easier to implement in assembly than in a higher-level language.
So, Golang and V8 both use this algorithm when you choose to use an arbitrary-precision integer type. In Python (and Ruby?), all integers are arbitrary-precision - do those languages also use this algorithm for division?
Go in practice doesn't actually use the code cited.
Most of the math/bits package is recognized by the compiler and intrinsified down into 1 or a few CPU instructions (not function calls) on almost all architectures.
In a git checkout of Go, look at e.g. `git grep Div64 src/cmd/compile`.
> In Python (and Ruby?), all integers are arbitrary-precision
IIRC, both (well, CPython and MRI) internally use distinct “small integer” and “big integer” representations, but that is transparent to the user. I'd imagine that different low-level operations are involved depending on the internal type.
This is the same as Common Lisp implementations (it might be in the spec, not sure). It's completely transparent, but the nice thing in CL is there is a built-in way to force the compiler to stick to fixnums or bignums so you can avoid branching on overflows etc.
> IIRC, both (well, CPython and MRI) internally use distinct “small integer” and “big integer” representations, but that is transparent to the user.
This seems to be correct for MRI. Before Ruby 2.4, both `Fixnum` and `Bignum` were exposed to the programmer. In 2016 these were unified into `Integer` from the programmer's point of view, but with "no internal representation change" [0].
> "Normal" ints in Python are certainly native. You don't get bignums until you need them.
However, I don't think this is true for current versions of Python. As part of the 2.x => 3.0 transition, Python underwent a similar unification of integer types. Python 2.x exposes both `int` and `long`, while Python 3.x has only `int`; internally, 3.x's `int` is represented by `PyLongObject` and the old `PyIntObject` has been removed. Even the small integer cache (which holds the values -5 to 256) was changed from holding `PyIntObject`s in 2.x [1] to holding `PyLongObject`s [2].
So it looks like in Python 3.x, all integers are arbitrary-precision. I believe this was a significant contributor to early versions of Python 3.x being slower than 2.x.