This is a good list! Another fun quirk: because x86 is a register-memory architecture and allows all kinds of variants of reg/mem operand encodings, there are a handful of equivalent encodings with exactly the same lengths (and just slightly different ModR/M bytes). You can take advantage of this to do software fingerprinting or, in my case, steganography without changing an executable’s size or semantics[1].
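Concretely, a minimal example of one such pair, with byte values taken from the SDM's ADD /r entries (a disassembler should show both as the same add; roughly one hidden bit per eligible reg-reg instruction):

    # Two byte-for-byte different, semantically identical encodings of
    # "add %ebx, %eax" (an assembler will normally just pick one for you):
    .byte 0x01, 0xd8    # 01 /r  ADD r/m32, r32  -> ModRM 0xD8 (reg=ebx, rm=eax)
    .byte 0x03, 0xc3    # 03 /r  ADD r32, r/m32  -> ModRM 0xC3 (reg=eax, rm=ebx)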
The (in)famous A86/A386 assembler claimed to use this technique to identify code produced with it. As far as I know, it was just a simple inversion of reg-reg ModRMs depending on the register values being used, although I didn't test it too exhaustively.
Speaking of things with the same names, what happened to Linux a386 and what the heck was it even? I can't find a link nowadays, but it was something like User Mode Linux, but not really.
You won't find "eiz" in any Intel manual, and NASM doesn't recognize it either. x86 has no zero register¹.
This appears to be some GNU-specific syntax meaning "encode SIB byte with no index register". The only case where this would be required by the hardware is when using ESP/RSP as a base register, and every assembler should produce the correct encoding for that if you simply write [ESP].
So using "eiz" on GAS lets you control what is put into the (unused) scale field. One might call that a feature, but it is a meaningless encoding detail similar to which variant of "register to register" opcodes is emitted, something that I don't think any assembler gives you control over.
¹ except maybe on the microarchitectural level, but that isn't visible to the programmer
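A small sketch of what that looks like, if I remember the GAS syntax and SIB bytes right (32-bit code):

    # Same instruction, two SIB encodings; the scale bits are ignored because
    # the index is "none" (eiz). Both just load ESP into EAX.
    lea (%esp,%eiz,1), %eax    # 8d 04 24  (scale=1 in the unused field)
    lea (%esp,%eiz,4), %eax    # 8d 04 a4  (scale=4 in the unused field)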
It’s relatively feature complete, but there are some ideas for increasing the steganographic capacity listed in the README and issues! In particular, we could use the flexibility of the multi-byte NOP sequences to hide some more information.
An ADD instruction will update the carry flag but the INC instruction does not!
Neither does DEC, and I believe the original reason for this was multiple-precision arithmetic routines --- you need to propagate the carry flag, but also update loop counters and pointers.
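Something like this classic pattern (a hypothetical bigadd over little-endian 64-bit limbs, SysV calling convention assumed: rdi = dst, rsi = src):

    .globl bigadd
    bigadd:
        movl  $4, %ecx            # four 64-bit limbs
        clc                       # start with carry clear
    1:  movq  (%rsi), %rax
        adcq  %rax, (%rdi)        # add limb plus carry from the previous limb
        leaq  8(%rsi), %rsi       # advance pointers without touching CF
        leaq  8(%rdi), %rdi
        decl  %ecx                # updates ZF for the branch, leaves CF intact
        jnz   1b
        ret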
It seems that the actual behavior of the undefined flags is tied to the internal implementation of the shift operation, and differs between microarchitectures.
According to https://www.sandpile.org/x86/flags.htm (which unfortunately hasn't been updated in a while and isn't exhaustive), all of the P6 family behave the same, but differently from the P5 and the P4, and probably the earlier generations too. "Undefined" values are often used by anti-debugging/anti-emulation/VM-detection code to determine whether the CPU is real hardware or not, so it's actually quite important to emulate them correctly.
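i.e. a probe along these lines, run side by side on real hardware and under the emulator, then compare the two values (just a sketch):

    # AF after a shift, for example, is architecturally undefined.
        movl  $0x5a5a, %eax
        shll  $3, %eax
        pushfq
        popq  %rdi        # rdi = RFLAGS as observed on this implementation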
IIRC from our similar code, the LEA (load effective address) instruction is useful for doing adds / subtracts of arbitrary integers without updating the flags.
Flags turn out to be quite the annoyance for the kind of in-process virtualization needed by Time Travel Debug. You need to instrument code with minimal overhead, so on the one hand you don't want to save/restore flags all the time, and on the other hand it all still has to work when the flags do get used.
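As a tiny example of the LEA trick in that setting (the counter symbol and the choice of %r15 as a reserved scratch register are made up):

    # Hypothetical stub a JIT might splice in: bump an instrumentation counter
    # without saving/restoring the guest's flags.
        movq  counter(%rip), %r15
        leaq  1(%r15), %r15        # r15 += 1; LEA leaves EFLAGS untouched
        movq  %r15, counter(%rip)  # inc/add here would clobber guest flags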
This behaviour goes all the way back to (at least) the Z80 and i8080, and I bet it was intentional for exactly the reason you describe (updating loop counters in between a sequence of ADC/SBC instructions without destroying the carry flag from the previous ADC/SBC).
> "Undefined" values are often used by anti-debugging/anti-emulation/VM detection code to determine if the CPU is real hardware or not, so it's actually quite important to emulate them correctly.
That seems quite brittle, unless all real CPUs implement them the same way. If they do, one has to wonder whether it's really undefined or an undocumented part of the x86 spec instead.
I'd guess it's "undocumented", not "undefined". I don't know what the situation is on x86, but on the Z80 there were indeed some slight differences in undocumented behaviour between CPU vendors, though those were so obscure that they hardly affected any real-world code (they affected the undocumented flag bits 3 and 5, and their behaviour was only properly 'decoded' in the 2000s: https://github.com/floooh/emu-info/blob/master/z80/memptr_en...).
The only CPU I know of with actual 'undefined' behaviour is the 6502, where some of the undocumented/illegal opcodes can yield different results based on things like the current CPU temperature (see the ANE/XAA instruction description: https://www.masswerk.at/nowgobang/2021/6502-illegal-opcodes)
> You can add quite a few until you get to 15 bytes. This length is a hard limit on current x86-compatible CPUs. Any instruction longer than 15 bytes is considered invalid and will generate an exception.
The longest structurally valid x86 instruction is 26 bytes, from some research I did a few years ago[1]. But as others have noted, structurally valid does not mean that any x86 CPU will accept them: they’ll all produce #UD or similar, including these 16 byte ones.
Ostensibly, only one prefix from each group can matter. Although I did notice that XACQUIRE and LOCK are both group 1 prefixes, which makes that statement kind of a lie (but it's an intentional design to make XACQUIRE do nothing on processors that don't support it).
In any case, there's only a finite number of prefixes you can meaningfully stick onto an instruction, and repeating the same prefix will do absolutely nothing.
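For example (assuming I've counted the bytes correctly), this is a legal 15-byte instruction built by padding a long NOP with redundant operand-size prefixes; prepend one more prefix byte and it crosses the limit and faults with #UD:

    .byte 0x66,0x66,0x66,0x66,0x66,0x66,0x66        # seven redundant 66h prefixes
    .byte 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00   # nopw 0x0(%rax,%rax,1)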
Figure 6 has examples (in the context of instruction generation). But note that those instructions are “nonsense,” in the sense that they’re guaranteed to either #UD at 15 bytes or well before then.
Ah, thanks, I missed that! Kind of disappointing that it doesn't give the instructions' disassembly (so to speak -- obviously as you point out in reality they are nonsense), but I suppose I can sit down and figure that out myself. :)
Ha, I was just reading some of the discussion and thinking it sounded quite similar to the JIT we use in our own (Linuxy) Time Travel Debugger.
It's similarly an x86-on-x86 JIT / emulator, but (and I'm sure WinDbg's TTD is similar) most of what you want to do there is just code copying for any instructions that don't need special instrumentation.
And you want to run entirely in the cache of JITted code so you're close to native speed, rather than exit to C code and make decisions.
Funny enough, we tried going down the road of doing a JIT, but for our use cases pure emulation was often fast enough and it wasn't worth the extra complexity to do JIT. Part of that is probably due to the architecture of how TTD works, but part of it is how efficient we were able to get with the emulation mode. Later, a different emulator was created for running x86 code on ARM64, and that one uses an extremely efficient JIT. (I didn't get to work on the x86-on-arm64 emulator though. Would have been fun)
Working on the JIT in TTD was a lot of fun though, and it's a shame it didn't make sense to keep it around.
Very impressive you got such good performance out of straight emulation! And WinDbg TTD runs with truly parallel threaded execution within one process, which we do not (yet).
I'm surprised a JIT wasn't worth it, but from what I'm aware of I can see a few reasons why the trade-offs are different.
Very cool! Writing emulators for simple CPUs is challenging enough, because it's extremely easy to overlook a quirk or get an infrequently used flag calculation wrong and have it silently lurking like a mine, waiting for the rare piece of code that trips it. x86 is very complex. Instruction decoding alone looks like a nightmare. I guess the one saving grace here is that since most of us still run on x86, side-by-side validation is a lot easier.
Yes, that made it 100x easier, because I was able to write unit tests that literally did a single step over the instruction and compare the register context to the emulated register context. The single step approach didn't work for all instructions (and didn't work for undefined flags) but was extremely effective and let me brute force a whole lot of testing.
> Other instructions leave some of the flags in an undefined state, although in some cases they are conditionally undefined.
> I’ve heard that the Atom line of CPUs used a cheaper/slower way of doing bit shifts in the ALU, which results in the undefined flags having different values, although I haven’t tested this myself.
That seems... problematic. Do compilers know this and just ignore the flags if/when they use these instructions?
Yes. In general, compilers only emit flag-testing instructions after instructions that are explicitly defined as having flag-modifying semantics.
(This is a common feature of ISAs, and makes sense in terms of component reuse: it allows you to cheaply reuse the ALU, for example, without needing extra wires/control signals telling it not to update the flag state.)
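For instance, roughly the pattern you see in generated code (a sketch; is_zero is a made-up label):

    # Two-operand IMUL leaves SF/ZF/AF/PF undefined, so the condition
    # is re-derived explicitly instead of trusting the stale flags.
        imull %esi, %edi        # edi *= esi; only CF/OF are defined afterwards
        testl %edi, %edi        # re-test to get well-defined ZF/SF
        je    is_zero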
Nice article, but nothing really surprising when you follow the progression starting from 8080.
One correction though:
> EAX is called the “Accumulator register” is not just a convention, it actually makes a difference to the encoding (and potentially the performance, as a result)
No. The decoder doesn't decode byte by byte but whole strings of bytes together, and the registers are all renamed, meaning that %eax isn't really different from %ebx etc. The only difference is that it saves one byte of code space, which is an extremely minor density improvement, and you'd have to make a very contrived example to be able to measure the difference.
My point is the smaller encoding is what gives you the performance benefit. When done across an entire function/module, you can get a measurable increase in hit rate for the instruction cache. Not that the instruction itself is faster. Apologies if that wasn't clear.
No, I got that point, and it's vanishingly small. It _only_ affects miss rate (i.e. it does nothing if you are hitting in the cache), and the effect on miss rate is minuscule.
Has anyone used TTD, or rr for that matter in the Linux world, to debug a complex multi-threaded application? What is the main use case of these tools?
For example, the limitations of rr seem to suggest that it is of almost no use for multi-threaded programs, so I have never actually tried it. I don't know about TTD though.
> rr limitations
> ...
> emulates a single-core machine. So, parallel programs incur the slowdown of running on a single core. This is an inherent feature of the design.
It can still debug multi-threaded programs, but if your bug relies on multi-core interactions it won't appear under rr (you can still catch a lot of race conditions though, and it has a 'chaos mode' which generates irregular scheduling intended to increase the likelihood of such bugs appearing). It's been used to debug race conditions as well as other bugs in Firefox, for example.
It's still much better than the alternative, which is not having any kind of way to reproduce race conditions :-) QEMU's a pretty complex multithreaded program and my experience with chaos mode has been that it's quite good at tripping up the races (and you only need to catch the race once).
It's not limited to any era - it's a general-purpose tool that can perfectly reproduce bugs, even most race conditions. The lack of multi-threading will mostly just slow things down. Being able to perfectly reproduce a bug is immensely valuable.
I didn't say anything about the usefulness of rr other than in the context of concurrency bugs, so I don't quite understand your defensive attitude.
> limitations of rr seem to suggest that it is almost of no use for multi-threaded programs
Only if your main use case is debugging race conditions only possible with multiple cores, which is a tiny subset of all bugs.
I use rr all the time for a complex multi-threaded (although not extremely parallel) application and it works wonders. It frequently saves me hours of debugging. Practically none of the issues I use it for are race conditions (not because it doesn't work well for those, but because I rarely get any).
Even if I had to suffer an 8x slowdown from running it on a single core, it would still be worth it nearly every time.
As others have said here, in practice rr works very well for debugging a wide range of bugs in multithreaded programs, including race conditions.
Where it falls down:
* It can only use a single core, so highly parallel programs run very slowly when recorded by rr.
* Some race conditions may be difficult or impossible to reproduce under rr recording (but rr's chaos mode helps a lot with this).
* rr imposes sequential consistency on the recorded program, so bugs due to weak memory models do not show up under rr. Such bugs are pretty rare on x86 (because x86's memory model is pretty strong); this may be more of an issue on ARM.
TTD (the WinDbg one) works very well for complex multithreaded apps. (The caveat being the performance hit you get from emulation). It's one of the big advantages it has over rr.
The main use case I saw for TTD was debugging complex memory corruption issues. Certain types of issues like stack corruption became trivial to debug under TTD. It was also very useful for capturing a repro. If a customer complained about something and I couldn't immediately reproduce it or get a crash dump, I'd ask them to record a TTD trace. More than 75% of the time I'd say it was enough to root cause the bug, without spending tons of time figuring out the repro steps.
There are reference manuals with a lot of detail. Specifically you'd want to read the "Intel® 64 and IA-32 Architectures Software Developer’s Manual", Volume 2, which includes specifications of each instruction, including pseudo-code for how the processor executes it. You can freely access a PDF of the entire manual here: https://www.intel.com/content/www/us/en/developer/articles/t...
But there's a lot of details there — that volume alone spans about 4000 printed pages in the manual — so you're bound to mess things up.
I've written emulators for simpler instruction sets (RISC-V and the 6502) and found that the best way to ensure they work correctly is to do instruction-by-instruction comparisons against a reference. The good news is that if you've got a PC, you've got easy access to a reference machine! So you can do side-by-side comparisons where you set up an initial state on both a real machine and an emulated machine, then execute some code on both, and compare the final states. By using the trap flag you could also single-step through the instructions to extract the intermediate states after each instruction and ensure they match the emulated state.
So that's my guess about how he did his too. It's painstaking work, but also kind of fun when you get into it.
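For reference, the trap-flag side of that looks roughly like this (just a sketch of the mechanism, not necessarily how the author did it; you'd pair it with a SIGTRAP handler or a ptrace-ing parent that records the registers at every stop):

        pushfq
        orq   $0x100, (%rsp)    # bit 8 of RFLAGS is TF (single-step)
        popfq                   # the first trap fires after the *next* instruction
        addq  %rbx, %rax        # instruction under test -> SIGTRAP after this
        shlq  $3, %rax          # ... and after this, and so on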
Yes, that's exactly what I did. Most useful thing I did was setting up a unit test framework to test a single instruction, and then generating a huge number of variations of those instructions.
I talked a bit about the Intel SDM in the last post I wrote (linked at the top). The SDM is great for getting the specific details, but isn't always approachable for high level overview things.
Apologies. I'm just using a Hugo template and haven't spent a lot of time figuring out how to customize it. You're right though, the contrast isn't great.
[1]: https://github.com/woodruffw/steg86