Exactly what the title sounds like. At least one person has gone very far down this rabbit hole, possibly two people. A joy to read through, thank you for posting.
This is the sort of comparison that would be great to do for a compiler runtime implementation. Integer division comes to mind. Hard to justify the time.
Does what you ask for, but for 8-bit CPUs like the 6502 you would probably want to do additional work (I would at least read the output of the C compiler to see whether there's an 'easily' cuttable corner left).
The division routines generated by libdivide assume that fast integer multiplication is available. If it isn't, you may be better off with other approaches.
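To illustrate why fast multiplication matters here, this is a minimal sketch of the fixed-point-reciprocal trick that libraries like libdivide are built around: division by a constant becomes a wide multiply and a shift. The magic number below is derived for this example and is not libdivide's actual output.

```python
# Unsigned division by the constant 7 via multiply-by-reciprocal.
# For 32-bit n, n // 7 == (n * m) >> 35 with m = ceil(2**35 / 7);
# the rounding error stays small enough for all n < 2**32.

def magic_divide_by_7(n):
    m = (2**35 + 6) // 7          # ceil(2**35 / 7) = 4908534053
    return (n * m) >> 35

assert all(magic_divide_by_7(n) == n // 7 for n in range(100000))
assert magic_divide_by_7(2**32 - 1) == (2**32 - 1) // 7
```

Without a fast wide multiply, this transformation buys nothing, which is the point being made above.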
Although this isn't about the 6502, this reminds me of how I had to write x86_64 assembly for class, and was trying to use LEA to avoid a multiplication. After I wrote it, I wondered if the compiler did the same thing I did, but it didn't; the compiler's version used a register unnecessarily. Compare:
At first, I wondered why the compiler introduced a temporary.
However, I had just done the exercises from the book Structure and Interpretation of Computer Programs that ask you to write recursive and iterative routines that multiply integers through successive doubling and halving, taking advantage of the fact that Cx = x + (C - 1)x if C is odd, and Cx = 2 * ((C / 2)x) if C is even (although that's not exactly how the book explained it; I don't remember it perfectly and don't feel like looking it up, even though this wasn't that long ago).
I realized: the compiler must be using a variant of the Russian peasant method of multiplication (which was the name of the method, as the book had explained) as a form of strength reduction. I was impressed, but I also realized that this meant that it was suboptimal for minimizing temporary registers. However, I couldn't think of some algorithm that would multiply it the way I did when I wrote that assembly, since I hadn't really had an algorithm in mind when writing it.
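A minimal Python sketch of the doubling-and-halving scheme described above (the SICP-style exercise, not the compiler's actual transformation):

```python
# Multiply b * x using only doubling, halving, and addition:
# b*x = 2 * ((b // 2) * x)  if b is even,
# b*x = x + (b - 1) * x     if b is odd.

def fast_mult(b, x):
    if b == 0:
        return 0
    if b % 2 == 0:
        return 2 * fast_mult(b // 2, x)
    return x + fast_mult(b - 1, x)

assert fast_mult(41, 3) == 123
```

The even branch is what makes it logarithmic in b, which is why it serves as a strength reduction for constant multiplies.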
> I was impressed, but I also realized that this meant that it was suboptimal for minimizing temporary registers. However, I couldn't think of some algorithm that would multiply it the way I did when I wrote that assembly, since I hadn't really had an algorithm in mind when writing it.
If the compiler were minimising touched architectural registers, it could use the same multiplication method it already does, but output this code:
lea r8, [r9 + 4 * r9]
lea r8, [r8 + r8]
Or this:
lea r8, [r9 + 4 * r9]
add r8, r8
If it used your multiplication method, but with the register allocation decisions it appears to use, it could output this code:
lea rdi, [r9 + 8 * r9]
lea r8, [rdi + r9]
So the compiler's decision to use temporary RDI looks to me like a register allocation decision only, independent of multiplication method. (But maybe affected by an lea/add decision).
You're right. I don't know why, but I didn't realize that neither method of multiplication required you to use more than 2 registers here. The use of a temporary register was probably coincidental. Maybe it wouldn't have used an extra register if it had needed to spill to the stack in order to use one, and instead maybe it would have written one of the alternatives you wrote. Thanks for letting me know.
I don't miss those early days of programming. It's fun to optimize and it's amazing what you can do with minimal resources, but it took an incredible amount of effort to get a 70s-era computer to do elementary school arithmetic fast enough to be useful. People spent their lives working on analyses like this, only to find in a scant few years that the next chip has it built-in as a single instruction.
To be fair, those chips made multiply a single instruction, but it could still take tens or hundreds of clock cycles to complete. What it really saved you was precious working memory: you no longer needed to consume a significant chunk of it with a lookup table in order to multiply at a decent speed. If your platform has 16 KB of working memory, then a 2 KB lookup table is a huge ask. Even a small 512-byte lookup table is a serious chunk of your memory budget; you have to seriously consider whether spending hundreds of clock cycles per multiply is the better option.
I'm not even sure how to compare it fairly. For example, on the Super Nintendo you have to load 2 registers and then wait 8 cycles for the result. But of course those cycles could also be used for something different, if necessary.
I love these early days of programming, as I do it as a hobby. What's pretty amazing is how much you can do without any multiplication or division at all. With add, subtract, and shifts you can do a surprising amount on a computer.
I remember “discovering” shift and add as a high school student for doing multiply by 40 (a useful operation for screen mapping) as an improvement over the more naïve x×y = x+x+…(y times)…+x.
But seeing the table-of-squares thing, I see the genius of being able to take (x+y)² - (x-y)² = (x² + 2xy + y²) - (x² - 2xy + y²) = 4xy, and what a brilliant piece of math that is.
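The identity above is exactly the quarter-square multiply used in the linked routines. A small Python model of the lookup-table version:

```python
# Quarter-square multiply: x*y = f(x+y) - f(|x-y|), with f(t) = t*t // 4.
# The floor division loses nothing, because (x+y) and (x-y) always have
# the same parity, so the two discarded quarter-fractions cancel.

SQ4 = [t * t // 4 for t in range(512)]   # covers 8-bit x, y (x+y <= 510)

def qsmul(x, y):
    return SQ4[x + y] - SQ4[abs(x - y)]

assert qsmul(200, 250) == 50000
```

Two additions, two table lookups, and one subtraction replace the multiply, which is why this beats shift-and-add loops on the 6502 when you can spare the table.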
I do quite a bit of 6502 assembly (hobby only). And the only types of multiplications I’ve needed were either multiplying by a power of 2 (often if I’m using a data table where an entry is 2 bytes long and I need to calculate the index into the table), or multiplying by a known constant, which is solved by shift and add or a lookup table if I only ever need to multiply a small number of known values.
You can multiply by any number if you toss in addition. If you're multiplying by a constant, you can factor it out.
For example, to multiply by 40 (coincidentally the number of characters per line on common microcomputers), you could shift to multiply by 32, shift to multiply by 8, and add the two together. It took some additional storage but was faster, particularly for 16-bit values, than a general routine.
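That decomposition (40x = 32x + 8x) can be written as a two-shift, one-add sketch:

```python
# Multiply by 40 using shifts and one add: 40*x = (x << 5) + (x << 3).

def times40(x):
    return (x << 5) + (x << 3)

assert times40(25) == 1000   # e.g. byte offset of row 25 on a 40-column screen
```

On the 6502 the same idea becomes a handful of ASL/ADC instructions, far cheaper than a general multiply loop.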
"Doctor, my arm really hurts when I bend it this way, do you have any advice?" "Don't bend your arm that way"
You can honestly get around not needing arbitrary multiplication and division by just avoiding numbers that aren't powers of two. No oddly sized data structures, for example. It works well unless, of course, you're building a calculator.
I'm the exact opposite from you on this. One of the main reasons why I switched to extremely tiny microcontrollers for my hobby projects is because they bring a lot of those days back for me.
I quite enjoyed working within those tight constraints. It's very satisfying squeezing every last drop of blood from the system.
What I don't miss is the lack of reliability. Modern systems are so dependable compared to what we had to deal with back then. Also it's nice being able to download your tools for free and find documentation online.
It's funny you used the single-instruction block move as an example. The successor to the 6502, the 65C816, does have a single-instruction block move. Actually it has two: one for when you want to move from the top down, and another for when you want to move from the bottom up (for the case where the regions overlap).
Just for fun... I think this should be on the efficient frontier of 8 x 8 -> 16 bit multipliers, at an average cost of around 43 cycles and memory of 1330 bytes or so. Compare 45.5 cycles and 1580 bytes for mult66.a, the fastest in the linked collection.
lmul0 = $06 ; pointer into square table low
lmul1 = $08 ; pointer into square table high
labs0 = $0a ; pointer into table of abs(i-1)
prod_low = $0c
* = $0200
; Align tables to start of page
abstable
!for i, 0, 255 {
!byte <(abs(i-1))
}
squaretable1_lsb
!for i, 0, 511 {
!byte <((i*i)/4)
}
squaretable1_msb
!for i, 0, 511 {
!byte >((i*i)/4)
}
; multiply
;
; f(x) = x*x/4
;
; return f(X+Y) - f(|Y-X|)
;
; 8 bit x 8bit unsigned multiply, 16 bit result
;
; On Entry:
; A: multiplier
; Y: multiplicand
; On Exit:
; (prod_low, A): product
mult
sta lmul0
sta lmul1
eor #$ff
sta labs0
ldx (labs0),Y ;X = abs(Y-A)
lda (lmul0),Y
sec
sbc squaretable1_lsb,X
sta prod_low
lda (lmul1),Y
sbc squaretable1_msb,X
rts
;call this once to initialise high bytes of pointers to table
mult_init
lda #>abstable
sta labs0+1
lda #>squaretable1_lsb
; high byte (#2 in this instance)
sta lmul0+1
lda #>squaretable1_msb
; high byte (#4 in this instance)
sta lmul1+1
rts
Nice. A bonus is that it doesn't use self-modifying code, and another plus one for using ldx (labs0),Y. Looks like this one can also be optimized for repeated multiplication by the same factor by skipping the first four instructions.
Oh crap. The LDX (labs0),Y instruction doesn't actually exist! If you are cool with undocumented opcodes you can replace that with LAX (labs0),Y which does exist. Otherwise it would need another 2 cycles for TAX.
Yes, I wrote it based on mult65.a, credited to Nick Jameson. That is one of the better ones; I took a look and thought about how it could be sped up a little.
The customization section is interesting, though it is missing things such as multiply & conquer, table wrapping, and the mask & jump techniques; but then, you aren't going to find every custom technique covered in any one text. Sine, cosine, and atan usually used dual interweaving tables that wrapped, and for distance you could use a short table and a long table, doing the lookup via Manhattan distance depending on whether X > Y or Y > X.
Edit: I see a link to another github on that page to sqrt implementations. I suspect that will make for interesting reading too.
Yet again I am struck by how incredible it is that Braben and Bell, two undergraduate students, could produce something as ground-breaking as Elite—all the way down to writing their own multiplication routines.
My Apple II "Mode 7" demo relies heavily on a "table of squares" multiply algorithm, though it's signed (vs unsigned), which is even more of a pain on the 6502.
https://www.youtube.com/watch?v=o8427JsfGwk
So many 16-bit demo effects assume you have a decently fast multiply or divide instruction, which makes it really hard to do the same effects on the 6502 (as I found with the Second Reality remake).
He is missing the state-of-the-art result for 16 x 16 -> 32 bit. That would be Repose in this CSDB thread: https://csdb.dk/forums/?roomid=11&topicid=91766 at 190 cycles (depending on how exactly you count).
No, he improved it since the version TonyLobster tested. Quoting:
It's 2023 - is your 16x16=32 bit unsigned multiply under 190 cycles yet? What? No?!? Let me help you out buddy!
Much to my surprise, I have surpassed my previous best by 10.5 cycles, or 5%, when I thought it couldn't be optimized further. How is this possible, you ask?
The first trick had to do with how cycles are counted. As a shortcut, we usually average over branches and page-boundary crossings. However, some carries happen only 7% of the time, so optimizing for the average case actually slows you down. You should be optimizing for the no-carry case.
The other trick was playing with using registers to accumulate results. It wasn't as simple as it sounds; the order of multiplies had to be optimized to leave a target value in the A register.
The goal: multiply the two 16-bit numbers in x0, x1 and y0, y1 (low bytes and high bytes) and return the results in the fastest way possible (which may leave some results in registers, preferably in a useful way like the high bytes). I measure the timing in this convention to expose a timing which could be used in-line, as is often the case for demo effects. In this case, the 3rd byte is returned in A, which could be useful for a multiple accumulate where only the high 16-bits are kept.
The time is 188.1 cycles, averaged over all possible inputs. The (accurate) time of my version on codebase is 198.6
To be clear, z = x*y, or (z3, z2, z1, z0) = (x1, x0) * (y1, y0).
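As a reference model for the partial-product sum the routine below performs in self-modifying 6502 code, here is a plain Python sketch of the same decomposition:

```python
# 16x16 -> 32 multiply built from four 8x8 -> 16 sub-multiplies:
# (x1:x0) * (y1:y0) = x0*y0 + ((x0*y1 + x1*y0) << 8) + (x1*y1 << 16).

def mul16(x, y):
    x0, x1 = x & 0xff, x >> 8
    y0, y1 = y & 0xff, y >> 8
    return (x0 * y0) + ((x0 * y1 + x1 * y0) << 8) + ((x1 * y1) << 16)

assert mul16(0xbeef, 0xcafe) == 0xbeef * 0xcafe
```

The carry handling that dominates the cycle counts discussed above comes from adding the three middle columns of this sum in 8-bit pieces.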
;This section performs the 4 sub-multiplies to form
; the partials to be added later in self-mod code.
;Addresses like "x1y0l+1" refer to low(x1*y0) stored
; to the argument of a later "adc #{value}"
;mult8_set_mier_snippet {multiplier}
;mult8_snippet {multiplicand}, {low result}, {high result}
+mult8_set_mier_snippet x1 ;17
+mult8_snippet y0, x1y0l+1, x1y0h+1 ;35
+mult8_snippet y1, x1y1l+1, z3 ;32
+mult8_set_mier_snippet x0 ;17
+mult8_snippet "Y", x0y1l+1, "X" ;28
+mult8_snippet y0, z0, "A" ;28
;results in X=x0y1h and A=x0y0h
;multiply part total: 149-165, average 157
;z3 z2 z1
clc
x0y1l adc #0 ;x0y0h + x0y1l
tay ;6
txa
x1y0h adc #0 ;x0y1h + x1y0h
tax
bcc + ;9
inc z3
clc ;(+6) taken 7% of the time
+ tya
x1y0l adc #0 ;+ x1y0l
sta z1 ;7
txa
x1y1l adc #0 ;+ x1y1l
bcc done ;7
inc z3 ;(+4) taken 42% of the time
Column 1 takes 13 cycles, column 2 takes 16 cycles, and column 3 takes 2.1 cycles.
The total time to add the partial products is 29-39 with an average of 31.1.
The results are returned as a 32-bit result:
z3, A(=z2), z1, z0.
The total time is 156.96875+31.1=188.06875 (178-204).
I’m writing a pinball game in assembly for the original Gameboy, and recently had to hand-craft multiply, division and cosine routines to trade off accuracy for spare cycles. Something like this would have come in handy!!
The graphical error plots for the versions that approximate are a joy to behold. You can clearly see the ringing and how the error bars tend to grow larger with bigger inputs.