More

mras0 · 2025-04-23T18:30:44 1745433044

Not really the ethos of C(++), though of course this particular bug would be easily caught by running a debug build (even 20 years ago). However, this being a game "true" debug builds were probably too slow to be usable. That was at least my experience doing gamedev in that timeframe. Then again code holding up for 20 years in that line of biz is more than sufficient anyway :)

mabster · 2025-04-23T22:19:23 1745446763

When I was doing gamedev about 5 years ago, we were still debugging with optimisation on. You get a class of bugs just from running in lower frame rates that don't happen in release.

mras0 · 2025-01-25T23:48:40 1737848920

Have an Amiga w/ 060, and that instruction doesn't seem to modify any A registers. (Only did a very quick test of those exact instruction words)

basementcat · 2025-01-26T00:36:48 1737851808

I appreciate that I can ask an esoteric question about the behavior of a 30 year old microprocessor and multiple people respond with test results on actual hardware within a few hours. Can y'all also post the mask revision (if known) and whether it is an EC or LC device? (In case it impacts behavior)

mras0 · 2025-01-26T01:00:58 1737853258

Rev5 "full" 060 (not EC/LC). Quick capture of crappy methodology: https://imgur.com/a/XwQ1Tnp (PCR with revision number is in d0)

ahefner · 2025-01-26T14:59:42 1737903582

Today by, way of your screenshot, I discover Asm-Pro. Just got into Amiga recently (by way of receiving one from my uncle's closet..) and have been meaning to backfill my shameful lack of 68k asm knowledge. Thanks!

mras0 · 2025-01-27T15:09:20 1737990560

Good luck! As the sibling comment mentions, Asm-Pro is IMO a good choice these though it requires KSv2+ (so you can't use it on an stock A500). I mostly use vasm for "larger" stuff though (http://sun.hasenbraten.de/vasm/) and cross-compile (though it can run on Amiga as well) / test in an emulator.

For a quick test/code loop nothing beats the Asm-* family - remember to save often and make backups though :)

snvzz · 2025-01-26T23:58:09 1737935889

Asm-Pro is a modern, maintained, feature-rich asmone derivative.

Another modern/maintained one, focusing on low end 68000/68010 machines, is asmtwo.

AsmOne was a very popular assembler at the time, and has seen a few derivatives. I remember an old one called trash'm'one.

mras0 · 2025-01-27T15:03:56 1737990236

Update: a1=-1 is a bad choice of test value (0 is much better), as it won't show the issue if present! However, it doesn't change the outcome of the test on 060, but I would have noticed that 020 is affected as well (as also mentioned in the comments to the article).

dougg3 · 2025-01-26T06:12:08 1737871928

My test of an 040 (no A1 change, D1 changed) was on a chip with the following markings:

XC68LC040RC25B

02E23G QEDP9348D MALAYSIA

mras0 · 2025-01-03T19:29:26 1735932566

I happened to look at this recently, and while I understand the argument (but not the math) of having to do fewer Miller-Rabain rounds, why would you do so in PRACTICAL settings? Unlike ECC you're likely only generating long term keys, so shorter key generation time seems like a bad tradeoff. Composite candidates are going to be rejected early, so you're (with high probability) not doing expensive calculations for most candidates. My reading of [BSI B.5.2](https://www.bsi.bund.de/SharedDocs/Downloads/EN/BSI/Publicat...) confirms this.

Of course random bit flips could interfere, but other measures should thwart this in high-stakes environments (at least to some degree).

FiloSottile · 2025-01-03T20:06:30 1735934790

The number of Miller-Rabin rounds has to be bounded, so if you're not going to base your bound on reaching a cryptographically negligible change of false positives, what are you going to base it on? Should we do 10? 15?

The problem with "x should be enough, but why not do more?" arguments is that they can be applied recursively, and never answer the question "ok so when should we stop?"

mras0 · 2025-01-03T20:35:10 1735936510

The BSI recommendation I linked says 60 times here (considering the worst case and the security level they're targeting). Just wondering why we'd want to do less for a (presumably rare) one-time operation.

maginx · 2025-01-03T22:11:51 1735942311

The performance of this can matter in some scenarios. In embedded systems, smart cards etc., generating the primes can take a significant amount of time (15-20 seconds typical) even with the 'low' number of iterations. Higher iterations means the user will have to wait even longer. Actually, in such systems it is not unusual to see occasional time-out errors and such when the smart card is unlucky with finding the primes. The timeout value may be a fixed, general value in a part of the stack difficult to change for the specific call.

Another case is short-term keys/certificates, where a fresh key pair is generated, and cert issues, for each and every signature. This setup makes revocation easier to handle (the certificate typically has a lifetime of a few minutes so it is 'revoked' almost immediately).

There are also scenarios where the keys are generated on a central system (HSM's etc.) to be injected into a production line or similar. There performance can matter as well. I worked on a system where the HSM's were used for this. They could typically generate an RSA key in 1-2 seconds, so 12 HSM's were needed to keep up with demand - and this was again with the reduced number of rounds.

mras0 · 2025-01-04T19:25:48 1736018748

Thanks for the reply (and it definitely answers my original question). For both short-lived keys, and where production time matters it makes perfect sense. I was mostly thinking about the scenario where you're generating a key for e.g. a long-lived certificate (maybe even a CA) or high-stakes PGP stuff. Just seems like you'd want spend a few more seconds in that case.

FiloSottile · 2025-01-03T21:31:10 1735939870

Why not do 120 then? We can show that the chance of false negative of 5 rounds is cryptographically negligible, so 5, 60, and 120 are all the same. If the only argument for 60 is that it's more and this is a really important rare operation, doesn't it also apply to 120?

I'm not trying to be glib, there is no rational stopping point if we reject the objective threshold.

mras0 · 2025-01-03T22:03:33 1735941813

I don't pretend to understand all the involved math, but what I'm trying to say is that as far as I understand T rounds gives 4^-T probability that we've chosen a "bad" prime (really composite) per normal Miller-Rabin in the worst case. Doing ~5 rounds has been shown to be enough to choose a good prime candidate, under most (very probable!) conditions when we get it a random, and thus the argument is that that ~5 rounds is fine. We agree so far?

I'm just asking, why not run the conservative 60 round test, rather than ~5 when you're doing a very rare, one time, key generation? I understand that it's very unlikely to reject any numbers, but at least BSI thinks it's worth it for important keys.

If I understand the recommendation right, you wouldn't do 60 for a 2048 bit key and then 120 for 4096, rather 61 rounds would be enough for 4096 if 120 is alright for 2048.

FiloSottile · 2025-01-03T22:27:32 1735943252

You're asking why not apply the formula for adversarially selected candidates even if we are randomly selecting candidates. There is simply no reason to, except "maybe we made a mistake" but then why would we not think we made a mistake also in calculating the 1/4 value, or in any other part of the code?

Phrased another way, do you have an argument for why run the conservative 60 round test, instead of asking for an argument for why not run it?

Again, you are "very unlikely" to win the Powerball jackpot. Rounds 6-60 have a cryptographically negligible chance of rejecting a composite. It's different, otherwise we'd have to worry about the "very unlikely" chance of the attacker guessing an AES-128 key on the first try.

(I don't follow you on the key sizes, if you apply the 1/4 probability, the candidate size is irrelevant.)

mras0 · 2025-01-04T19:36:34 1736019394

Thanks, understand what you mean. I probably botched the 1/4 probability thing, was thinking 4^-60 gives 2^-120 bit assurance (roughly, security margin in my mind), and an extra round would quadruple it, but doesn't work that way I realize.

maginx · 2025-01-03T22:15:02 1735942502

As I understand it, the number of rounds needed (for a given acceptable failure probability) goes down the larger the number is. Note (very importantly) that this is assuming you are testing a RANDOM integer for primality. If you are given an integer from a potential malicious source you need to do the full number of rounds for the given level.

mras0 · 2024-12-28T23:20:01 1735428001

Great article and analysis as always, thanks! Somewhat crazy to remember that a (as you argue) minor CPU erretum made world wide headlines. So many worse ones out there (like you mention from Intel) but others as well, that are completely forgotten.

For the Pentium, I'm curious about the FPU value stack (or whatever the correct term is) rework they did. It's been a long time, but didn't they do some kind of early "register renaming" thing that had you had to manually manage doing careful fxchg's?

Sesse__ · 2024-12-29T20:46:30 1735505190

Yes, internally fxch is a register rename—_and_ fxch can go in the V-pipe and takes only one cycle (Pentium has two pipes, U and V).

IIRC fadd and fmul were both 3/1 (three cycles latency, one cycle throughput), so you'd start an operation, use the free fxch to get something else to the top, and then do two other operations while you were waiting for the operation to finish. That way, you could get long strings of FPU operations at effectively 1 op/cycle if you planned things well.

IIRC, MSVC did a pretty good job of it, too. GCC didn't, really (and thus Pentium GCC was born).

ack_complete · 2024-12-29T21:08:10 1735506490

FMUL could only be issued every other cycle, which made scheduling even more annoying. Doing something like a matrix-vector multiplication was a messy game of FADD/FMUL/FXCH hot potato since for every operation one of the arguments had to be the top of the stack, so the TOS was constantly being replaced.

Compilers got pretty good at optimizing straight line math but were not as good at cases where variables needed to be kept in the stack during a loop, like a running sum. You had to get the order of exchanges just right to preserve stack order across loop iterations. The compilers at the time often had to spill to memory or use multiple FXCHs at the end of the loop.

Sesse__ · 2024-12-29T22:30:56 1735511456

> FMUL could only be issued every other cycle, which made scheduling even more annoying.

Huh, are you sure? Do you have any documentation that clarifies the rules for this? I was under the impression that something like `FMUL st, st(2) ; FXCH st(1), FMUL st, st(2)` would kick off two muls in two cycles, with no stall.

Tuna-Fish · 2024-12-29T22:52:59 1735512779

Agner Fog's manuals are clear on this. Only the last of FMUL's 3 cycles can overlap with another FMUL.

You can immediately overlap with a FADD.

lallysingh · 2024-12-29T18:57:35 1735498655

AFAIK, the FPU was a stack calculator. So you pushed things on and ran calculations on the stack. https://en.wikibooks.org/wiki/X86_Assembly/Floating_Point

Sesse__ · 2024-12-29T20:47:51 1735505271

It's only a stack machine in front, really. Behind-the-scenes, it's probably just eight registers (the stack is a fixed size, it doesn't spill to memory or anything).

lallysingh · 2024-12-30T03:48:24 1735530504

Definitely was 8 regs: https://intranetssn.github.io/www.ssn.net/twiki/pub/CseIntra... also where you'd see 'long double'

mras0 · 2024-07-12T16:13:28 1720800808

Thing is that isn't a single unified Amiga community with a clear goal. Roughly speaking you have people who (like this project) want to keep the "OS/UI experience" and move that forward to newer architectures and build new stuff for that (PPC was the main target for a long time), and what you call diehard retrocomputing people who are more closely tied to the hardware used back in the day - M68K and custom chipset (or at least the "period correct" HW extensions made possible by original Commodore made machines).

I'm probably mostly in the latter camp, but still appreciate efforts like this. It's not a rational thing, but more about "reliving the past" but with more skills/money/available software. Like classic cars or whatever, where you know you could have a better overall experience with a newer car, but you want to tinker on the old one (where maybe you now got the upgrade model you yearned for back in the day).

mras0 · on Feb 26, 2024

Simplified: Traditional SIM card is "smart" card with a chip (exactly like modern credit cards) from your network operator that gives access to mobile network(s). The chip has software + data that allows this.

eSIM: same (or similar) chip is soldered into your phone. Software is different, and allows storing the data part from multiple operators at the same time. You use your phone to switch between operators and add/remove potential operators (add part is over the internet via your phone, either via wifi or current active operator).

By way of analogy you can download new "SIM cards" to your esIM, switch between them, delete them etc. It's a feature of your phone, either it supports eSIM (has such a chip and supporting software on the phone) or it doesn't. You can't add it later.

mras0 · on Dec 19, 2023

Once the drive is spinning that doesn't matter though. Floppies are slow by modern standards mostly because you only get a new (decoded) bit around every 2nd (or for DD drives 4th) microsecond, and the drive takes some time stepping to the next cylinder (track).

mras0 · on May 12, 2023

The addressing modes do exist on 68020+ (https://www.nxp.com/files-static/archives/doc/ref_manual/M68... section 2.2), but I don't think any compiler will generate them (even with -mcpu=68020) as I think they're not actually faster than doing it the normal way (even if that costs a register), maybe slower, but I'd have to check the cycle counts.

EDIT: Yeah, so if I'm not misreading MC68020UM the memory indirect mode is slower

   move.l #3,([6,a0,d0.l*8],4) ; 9 cycles best case

vs.

    move.l 6(a0,d0.l*8),a0 ;4
    move.l #3,4(a0) ; 3

mras0 · on May 12, 2023

Only supported on 68020+ though, but yeah think it would be something like move.l #3,([6,a0,d0.l*8],4) (scale only goes up to 8 though)

mras0 · on March 3, 2023

One thing to consider is that the compiler can't simply replace a division by just a right shift for signed variables (it will round towards -inf for negative numbers), so even today there's a tiny bit of benefit of explicitly using shifts if you know that the number can't be negative (and the compiler can't prove it) or you don't care about the rounding (https://godbolt.org/z/vTzYYxqz9).

Of course that tiny bit of extra work is usually negligible, but might explain why the idiom has stuck around longer than you might otherwise expect.

benj111 · on March 4, 2023

If the number can't be negative surely you should be using unsigned ints?