Exactly what the title sounds like. At least one person has gone very far down this rabbit hole, possibly two people. A joy to read through, thank you for posting.
This is the sort of comparison that would be great to do for a compiler runtime implementation. Integer division comes to mind. Hard to justify the time.
Does what you ask for, but for 8-bit CPUs like the 6502 you would probably want to do additional work (I would at least read the output of the C compiler to see whether there's an 'easily' cuttable corner left).
The division routines generated by libdivide assume that fast integer multiplication is available. If it isn't, you may be better off with other approaches.
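To illustrate why fast multiplication matters here, this is a minimal sketch of the fixed-point-reciprocal trick that libraries like libdivide are built around: division by a constant becomes a wide multiply and a shift. The magic number below is derived for this example and is not libdivide's actual output.

```python
# Unsigned division by the constant 7 via multiply-by-reciprocal.
# For 32-bit n, n // 7 == (n * m) >> 35 with m = ceil(2**35 / 7);
# the rounding error stays small enough for all n < 2**32.

def magic_divide_by_7(n):
    m = (2**35 + 6) // 7          # ceil(2**35 / 7) = 4908534053
    return (n * m) >> 35

assert all(magic_divide_by_7(n) == n // 7 for n in range(100000))
assert magic_divide_by_7(2**32 - 1) == (2**32 - 1) // 7
```

Without a fast wide multiply, this transformation buys nothing, which is the point being made above.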
Although this isn't about the 6502, this reminds me of how I had to write x86_64 assembly for class, and was trying to use LEA to avoid a multiplication. After I wrote it, I wondered if the compiler did the same thing I did, but it didn't; the compiler's version used a register unnecessarily. Compare:
At first, I wondered why the compiler introduced a temporary.
However, I had just done the exercises from the book Structure and Interpretation of Computer Programs that ask you to write recursive and iterative routines that multiply integers through successive doubling and halving, taking advantage of the fact that Cx = x + (C - 1)x if C is odd, and Cx = 2 * ((C / 2)x) if C is even (although that's not exactly how the book explained it; I don't remember it perfectly and don't feel like looking it up, even though this wasn't that long ago).
I realized: the compiler must be using a variant of the Russian peasant method of multiplication (which was the name of the method, as the book had explained) as a form of strength reduction. I was impressed, but I also realized that this meant that it was suboptimal for minimizing temporary registers. However, I couldn't think of some algorithm that would multiply it the way I did when I wrote that assembly, since I hadn't really had an algorithm in mind when writing it.
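A minimal Python sketch of the doubling-and-halving scheme described above (the SICP-style exercise, not the compiler's actual transformation):

```python
# Multiply b * x using only doubling, halving, and addition:
# b*x = 2 * ((b // 2) * x)  if b is even,
# b*x = x + (b - 1) * x     if b is odd.

def fast_mult(b, x):
    if b == 0:
        return 0
    if b % 2 == 0:
        return 2 * fast_mult(b // 2, x)
    return x + fast_mult(b - 1, x)

assert fast_mult(41, 3) == 123
```

The even branch is what makes it logarithmic in b, which is why it serves as a strength reduction for constant multiplies.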
> I was impressed, but I also realized that this meant that it was suboptimal for minimizing temporary registers. However, I couldn't think of some algorithm that would multiply it the way I did when I wrote that assembly, since I hadn't really had an algorithm in mind when writing it.
If the compiler were minimising touched architectural registers, it could use the same multiplication method it already does, but output this code:
lea r8, [r9 + 4 * r9]
lea r8, [r8 + r8]
Or this:
lea r8, [r9 + 4 * r9]
add r8, r8
If it used your multiplication method, but with the register allocation decisions it appears to use, it could output this code:
lea rdi, [r9 + 8 * r9]
lea r8, [rdi + r9]
So the compiler's decision to use temporary RDI looks to me like a register allocation decision only, independent of multiplication method. (But maybe affected by an lea/add decision).
You're right. I don't know why, but I didn't realize that neither method of multiplication required you to use more than 2 registers here. The use of a temporary register was probably coincidental. Maybe it wouldn't have used an extra register if it had needed to spill to the stack in order to use one, and instead maybe it would have written one of the alternatives you wrote. Thanks for letting me know.
I don't miss those early days of programming. It's fun to optimize and it's amazing what you can do with minimal resources, but it took an incredible amount of effort to get a 70s-era computer to do elementary school arithmetic fast enough to be useful. People spent their lives working on analyses like this, only to find in a scant few years that the next chip has it built-in as a single instruction.
To be fair, those chips made multiply a single instruction, but it could still take tens or hundreds of clock cycles to complete. What it really saved you was precious working memory: you no longer needed to consume a significant chunk of it with a lookup table in order to multiply at a decent speed. If your platform has 16 KB of working memory, then a 2 KB lookup table is a huge ask. Even a small 512-byte lookup table is a serious chunk of your memory budget; you have to seriously consider whether spending hundreds of clock cycles per multiply is the better option.
I'm not even sure how to compare it fairly. For example, on the Super Nintendo you have to load 2 registers and then wait 8 cycles for the result. But of course those cycles could also be used for something different, if necessary.
I love these early days of programming, as I do it as a hobby. What's pretty amazing is how much you can do without any multiplication or division at all. With add, subtract, and shifts you can do a surprising amount on a computer.
I remember “discovering” shift and add as a high school student for doing multiply by 40 (a useful operation for screen mapping) as an improvement over the more naïve x×y = x+x+…(y times)…+x.
But seeing the table-of-squares thing, I see the genius of being able to take (x+y)² - (x-y)² = (x² + 2xy + y²) - (x² - 2xy + y²) = 4xy, and what a brilliant piece of math that is.
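The identity above is exactly the quarter-square multiply used in the linked routines. A small Python model of the lookup-table version:

```python
# Quarter-square multiply: x*y = f(x+y) - f(|x-y|), with f(t) = t*t // 4.
# The floor division loses nothing, because (x+y) and (x-y) always have
# the same parity, so the two discarded quarter-fractions cancel.

SQ4 = [t * t // 4 for t in range(512)]   # covers 8-bit x, y (x+y <= 510)

def qsmul(x, y):
    return SQ4[x + y] - SQ4[abs(x - y)]

assert qsmul(200, 250) == 50000
```

Two additions, two table lookups, and one subtraction replace the multiply, which is why this beats shift-and-add loops on the 6502 when you can spare the table.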
I do quite a bit of 6502 assembly (hobby only). And the only types of multiplications I’ve needed were either multiplying by a power of 2 (often if I’m using a data table where an entry is 2 bytes long and I need to calculate the index into the table), or multiplying by a known constant, which is solved by shift and add or a lookup table if I only ever need to multiply a small number of known values.
You can multiply by any number if you toss in addition. If you're multiplying by a constant, you can factor it out.
For example, to multiply by 40 (coincidentally the number of characters per line on common microcomputers), you could shift to multiply by 32, shift to multiply by 8, and add the two together. It took some additional storage but was faster, particularly for 16-bit values, than a general routine.
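That decomposition (40x = 32x + 8x) can be written as a two-shift, one-add sketch:

```python
# Multiply by 40 using shifts and one add: 40*x = (x << 5) + (x << 3).

def times40(x):
    return (x << 5) + (x << 3)

assert times40(25) == 1000   # e.g. byte offset of row 25 on a 40-column screen
```

On the 6502 the same idea becomes a handful of ASL/ADC instructions, far cheaper than a general multiply loop.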
"Doctor, my arm really hurts when I bend it this way, do you have any advice?" "Don't bend your arm that way"
You can honestly get around not needing arbitrary multiplication and division by just avoiding numbers that aren't powers of two. No oddly sized data structures, for example. It works well unless, of course, you're building a calculator.
I'm the exact opposite from you on this. One of the main reasons why I switched to extremely tiny microcontrollers for my hobby projects is because they bring a lot of those days back for me.
I quite enjoyed working within those tight constraints. It's very satisfying squeezing every last drop of blood from the system.
What I don't miss is the lack of reliability. Modern systems are so dependable compared to what we had to deal with back then. Also it's nice being able to download your tools for free and find documentation online.
It's funny you used the single-instruction block move as an example. The successor to the 6502, the 65C816, does have a single-instruction block move. Actually it has two: one for when you want to move from the top down, and another for when you want to move from the bottom up (for the case where the regions overlap).
Just for fun... I think this should be on the efficient frontier of 8 x 8 -> 16 bit multipliers, at an average cost of around 43 cycles and memory of 1330 bytes or so. Compare 45.5 cycles and 1580 bytes for mult66.a, the fastest in the linked collection.
lmul0 = $06 ; pointer into square table low
lmul1 = $08 ; pointer into square table high
labs0 = $0a ; pointer into table of abs(i-1)
prod_low = $0c
* = $0200
; Align tables to start of page
abstable
!for i, 0, 255 {
!byte <(abs(i-1))
}
squaretable1_lsb
!for i, 0, 511 {
!byte <((i*i)/4)
}
squaretable1_msb
!for i, 0, 511 {
!byte >((i*i)/4)
}
; multiply
;
; f(x) = x*x/4
;
; return f(X+Y) - f(|Y-X|)
;
; 8 bit x 8bit unsigned multiply, 16 bit result
;
; On Entry:
; A: multiplier
; Y: multiplicand
; On Exit:
; (prod_low, A): product
mult
sta lmul0
sta lmul1
eor #$ff
sta labs0
ldx (labs0),Y ;X = abs(Y-A)
lda (lmul0),Y
sec
sbc squaretable1_lsb,X
sta prod_low
lda (lmul1),Y
sbc squaretable1_msb,X
rts
;call this once to initialise high bytes of pointers to table
mult_init
lda #>abstable
sta labs0+1
lda #>squaretable1_lsb
; high byte (#2 in this instance)
sta lmul0+1
lda #>squaretable1_msb
; high byte (#4 in this instance)
sta lmul1+1
rts
Nice. A bonus is that it doesn't use self-modifying code, and another plus one for using ldx (labs0),Y. Looks like this one can also be optimized for repeated multiplication by the same factor by skipping the first four instructions.
Oh crap. The LDX (labs0),Y instruction doesn't actually exist! If you are cool with undocumented opcodes you can replace that with LAX (labs0),Y which does exist. Otherwise it would need another 2 cycles for TAX.
Yes, I wrote it based on mult65.a, credited to Nick Jameson. That is one of the better ones; I took a look and thought about how it could be sped up a little.
The customization section is interesting, though it is missing things such as multiply & conquer, table wrapping, and the mask & jump techniques; but then, you aren't going to find every custom technique covered in any one text. Sine, cosine, and atan usually used dual interweaving tables that wrapped, and for distance you could use a short table and a long table, doing the lookup via Manhattan distance depending on whether X > Y or Y > X.
Edit: I see a link to another github on that page to sqrt implementations. I suspect that will make for interesting reading too.
Yet again I am struck by how incredible it is that Braben and Bell, two undergraduate students, could produce something as ground-breaking as Elite—all the way down to writing their own multiplication routines.
My Apple II "Mode 7" demo relies heavily on a "table of squares" multiply algorithm, though it's signed (vs unsigned), which is even more of a pain on the 6502.
https://www.youtube.com/watch?v=o8427JsfGwk
So many 16-bit demo effects assume you have a decently fast multiply or divide instruction, which makes it really hard to do the same effects on the 6502 (as I found with the Second Reality remake).
He is missing the state-of-the-art result for 16 x 16 -> 32 bit. That would be Repose in this CSDB thread: https://csdb.dk/forums/?roomid=11&topicid=91766 at 190 cycles (depending on how exactly you count).
No, he improved it since the version TonyLobster tested. Quoting:
It's 2023 - is your 16x16=32 bit unsigned multiply under 190 cycles yet? What? No?!? Let me help you out buddy!
Much to my surprise, I have surpassed my previous best by 10.5 cycles, or 5%, when I thought it couldn't be optimized further. How is this possible, you ask?
The first trick had to do with how cycles are counted. As a shortcut, we usually average over branches and page-boundary crossings. However, some carries happen only 7% of the time, so optimizing for the average case actually slows you down. You should be optimizing for the no-carry case.
The other trick was playing with using registers to accumulate results. It wasn't as simple as it sounds; the order of multiplies had to be optimized to leave a target value in the A register.
The goal: multiply the two 16-bit numbers in x0, x1 and y0, y1 (low bytes and high bytes) and return the results in the fastest way possible (which may leave some results in registers, preferably in a useful way like the high bytes). I measure the timing in this convention to expose a timing which could be used in-line, as is often the case for demo effects. In this case, the 3rd byte is returned in A, which could be useful for a multiple accumulate where only the high 16-bits are kept.
The time is 188.1 cycles, averaged over all possible inputs. The (accurate) time of my version on codebase is 198.6
To be clear, z = x*y, or (z3, z2, z1, z0) = (x1, x0) * (y1, y0).
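As a reference model for the partial-product sum the routine below performs in self-modifying 6502 code, here is a plain Python sketch of the same decomposition:

```python
# 16x16 -> 32 multiply built from four 8x8 -> 16 sub-multiplies:
# (x1:x0) * (y1:y0) = x0*y0 + ((x0*y1 + x1*y0) << 8) + (x1*y1 << 16).

def mul16(x, y):
    x0, x1 = x & 0xff, x >> 8
    y0, y1 = y & 0xff, y >> 8
    return (x0 * y0) + ((x0 * y1 + x1 * y0) << 8) + ((x1 * y1) << 16)

assert mul16(0xbeef, 0xcafe) == 0xbeef * 0xcafe
```

The carry handling that dominates the cycle counts discussed above comes from adding the three middle columns of this sum in 8-bit pieces.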
;This section performs the 4 sub-multiplies to form
; the partials to be added later in self-mod code.
;Addresses like "x1y0l+1" refer to low(x1*y0) stored
; to the argument of a later "adc #{value}"
;mult8_set_mier_snippet {multiplier}
;mult8_snippet {multiplicand}, {low result}, {high result}
+mult8_set_mier_snippet x1 ;17
+mult8_snippet y0, x1y0l+1, x1y0h+1 ;35
+mult8_snippet y1, x1y1l+1, z3 ;32
+mult8_set_mier_snippet x0 ;17
+mult8_snippet "Y", x0y1l+1, "X" ;28
+mult8_snippet y0, z0, "A" ;28
;results in X=x0y1h and A=x0y0h
;multiply part total: 149-165, average 157
;z3 z2 z1
clc
x0y1l adc #0 ;x0y0h + x0y1l
tay ;6
txa
x1y0h adc #0 ;x0y1h + x1y0h
tax
bcc + ;9
inc z3
clc ;(+6) taken 7% of the time
+ tya
x1y0l adc #0 ;+ x1y0l
sta z1 ;7
txa
x1y1l adc #0 ;+ x1y1l
bcc done ;7
inc z3 ;(+4) taken 42% of the time
Column 1 takes 13 cycles, column 2 takes 16 cycles, and column 3 takes 2.1 cycles.
The total time to add the partial products is 29-39 with an average of 31.1.
The results are returned as a 32-bit result:
z3, A(=z2), z1, z0.
The total time is 156.96875+31.1=188.06875 (178-204).
I’m writing a pinball game in assembly for the original Gameboy, and recently had to hand-craft multiply, division and cosine routines to trade off accuracy for spare cycles. Something like this would have come in handy!!
The graphical error plots for the versions that approximate are a joy to behold. You can clearly see the ringing and how the error bars tend to grow larger with bigger inputs.