> Something like that really deserves to be written in C so you don't need to in...

jart · 2024-07-06T09:16:41

It's trivial if you can compile the binary with `cc -static -r` and you don't need to extract DWARF data too. If relocations are stripped, you're fine so long as you can count on st_size being present (i.e. the binary doesn't contain handwritten assembly) and you're able to parse the machine code. If symbols are stripped, then it simply can't be solved; it's just reverse engineering at that point.
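
To make the st_size point concrete, here is a minimal Python sketch that slices a linked .text section into per-function byte ranges using st_value and st_size. The `Symbol` record and `carve_functions` helper are hypothetical illustrations, not part of any ELF library. Carving only recovers the bytes; any absolute addresses baked into them still need relocations before the code can be moved.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Symbol:
    name: str
    value: int  # st_value: virtual address of the symbol
    size: int   # st_size: length in bytes (often 0 for handwritten assembly)


def carve_functions(text: bytes, text_addr: int, symbols: List[Symbol]) -> Dict[str, bytes]:
    """Split a linked .text section into per-function byte ranges."""
    out: Dict[str, bytes] = {}
    for sym in symbols:
        if sym.size == 0:
            # Without st_size there is no reliable way to know where the function ends.
            raise ValueError(f"{sym.name}: st_size missing, cannot carve")
        start = sym.value - text_addr
        end = start + sym.size
        if start < 0 or end > len(text):
            raise ValueError(f"{sym.name}: lies outside .text")
        out[sym.name] = text[start:end]
    return out
```

A symbol with a zero st_size makes the sketch bail out, which mirrors the handwritten-assembly caveat above.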

boricj · 2024-07-06T13:01:00

If you reduce the problem statement to an ELF x86 program written in C, with a symbol table and a complete relocation table (not just the one you get when dynamically linking), then sure, it's trivial: you have almost all the information you need to make an object file (although issues like switch jump tables can still crop up in that case). If you don't have that relocation table from `cc -r`, however, you'll run into problems.

Without this relocation table on hand, you'll have to recreate it in order to make the section bytes relocatable again. This means analyzing code and data to identify relocation spots, like you've said. But take that `0x00400000` integer constant inside an instruction or a data word: is it referring to the function at that address, or is it the TOSTOP flag value? Who knows, but each time you guess wrong, you'll corrupt four bytes in that object file.
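
A naive version of that analysis, sketched in Python under the assumption of a little-endian image with 4-byte-aligned pointers (the function name and parameters are made up for illustration), flags every aligned word whose value falls inside the image's address range. It has no way to tell a pointer from an integer that merely shares the same value.

```python
import struct
from typing import List, Tuple


def candidate_pointers(data: bytes, data_addr: int,
                       image_lo: int, image_hi: int) -> List[Tuple[int, int]]:
    """Return (address, value) for each aligned 32-bit little-endian word whose
    value lies in [image_lo, image_hi). Every hit *might* need an absolute
    relocation, but nothing here can prove that it does."""
    hits = []
    for off in range(0, len(data) - 3, 4):
        (value,) = struct.unpack_from("<I", data, off)
        if image_lo <= value < image_hi:
            hits.append((data_addr + off, value))
    return hits
```

With an image mapped at 0x00400000, a word holding 0x00400000 is reported whether it's a pointer to the first function or a flag constant. Emitting a relocation for a flag corrupts it at link time, and skipping one for a real pointer leaves a stale address.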

I'm dealing with one rather gnarly scenario: a PlayStation video game without any source code, symbols [1] or linker map, just a bag of bytes in an a.out-like format. The MIPS architecture also happens to be an absolute nightmare to delink code from (one of the many pitfalls, for example, is the interaction between HI16/LO16 relocation pairs, branch delay slots and linkers with a peephole optimizer).

I've been at it for two years and only recently managed to pull it off on the entire game's code [2]. Writing out the object file when you have the program bytes, the symbol table and the relocation tables is the easy part. Writing an analyzer that recreates a missing relocation table for the 80% of easy cases isn't too difficult. Squashing the remaining 20% of edge cases is hard. All it takes is one mistake on a code path that actually gets executed for some very exotic undefined behavior to occur in the delinked code.

Delinking with a missing relocation table (and without manually annotating the relocation spots yourself) looks easy at first glance, but nailing all of the edge cases is deceptively hard. I'd gladly be proven wrong, but if you do have the full, original relocation table on hand, it's because you're cheating with `cc -r` on code you just built yourself. Almost no real-world artifact in the wild that one would care about is ever built with that flag.

[1] I did later end up recovering lots of data from a leftover debugging symbols file of an early build, but that's a story for another time.

[2] Note that I'm working on top of a Ghidra database that contains symbols, type definitions and references, so the bulk of the analysis is actually performed upstream of my tooling. Even then, the MIPS relocation synthesizer is a thousand lines of absolute eldritch horrors, although I do acknowledge that my x86 relocation synthesizer is quite tame in comparison.

jart · 2024-07-09T03:10:02

> the MIPS relocation synthesizer is a thousand lines of absolute eldritch horrors

Wow, I only really know amd64, arm64, and i8086. What is it about MIPS that makes it so evil?

boricj · 2024-07-09T07:21:36

That would warrant an entire blog post to describe all the pitfalls [1], but I'll condense it down to the highlights.

On MIPS, loading a pointer is classically done in two instructions, a LUI and an ADDIU, which form a HI16/LO16 relocation pair that I need to identify precisely in order to delink code. I'm using Ghidra's references in my analyzers, but these are attached to only one instruction operand, typically a register or an immediate constant.
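
For readers who haven't met the pair before, here is a small Python sketch of the standard split (helper names are mine, for illustration). Because ADDIU sign-extends its 16-bit immediate, the %hi half has to be rounded up whenever bit 15 of the address is set, so the LUI immediate doesn't always match the address's upper 16 bits.

```python
from typing import Tuple


def split_hi_lo(addr: int) -> Tuple[int, int]:
    """Split a 32-bit address into the %hi/%lo immediates of a LUI/ADDIU pair."""
    hi = ((addr + 0x8000) >> 16) & 0xFFFF  # rounded to compensate for the sign-extended %lo
    lo = addr & 0xFFFF
    return hi, lo


def sign_extend16(value: int) -> int:
    return value - 0x10000 if value & 0x8000 else value


def join_hi_lo(hi: int, lo: int) -> int:
    """What the CPU computes: LUI sets the upper half, ADDIU adds the sign-extended lo."""
    return ((hi << 16) + sign_extend16(lo)) & 0xFFFFFFFF


def encode_lui(rt: int, imm: int) -> int:
    return (0x0F << 26) | (rt << 16) | (imm & 0xFFFF)


def encode_addiu(rt: int, rs: int, imm: int) -> int:
    return (0x09 << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)
```

For example, `split_hi_lo(0x80018000)` gives `(0x8002, 0x8000)`, encoded as `lui $at, 0x8002` (0x3C018002) followed by `addiu $at, $at, -0x8000` (0x24218000). Neither instruction contains the bits 0x8001 on its own.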

So my MIPS analyzer has to traverse the graph of register dependencies for a reference within an instruction to find which two instructions form the relocation pair. It's trickier than it sounds. References can have an addend that's baked into the immediate constants, so we can't just search for the address bits inside the instructions. Complex memory access patterns inside large functions can also create large graphs (ADDU in particular generates two branches to analyze, one per source register). It's bad enough that one method inside my analyzer is recursive and takes six arguments, four of which rotate right one step at each iteration.
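
Here's a heavily simplified Python model of that backward register walk. The `Insn` tuple and `find_hi16_sources` are my own illustration, not the extension's code. It only handles straight-line code and treats any write other than LUI or ADDU as the end of a chain, so it ignores the branch-related problems described in the next paragraph.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Insn(NamedTuple):
    op: str                   # mnemonic, e.g. "lui", "addu", "sll", "lw"
    dst: Optional[int]        # destination register number, None if nothing is written
    srcs: Tuple[int, ...]     # source register numbers
    imm: Optional[int] = None


def find_hi16_sources(code: List[Insn], index: int, reg: int) -> Set[int]:
    """Indices of the LUI instructions that may supply `reg` as read by code[index].

    Walks backwards in program order through straight-line code only.
    """
    for i in range(index - 1, -1, -1):
        insn = code[i]
        if insn.dst != reg:
            continue
        if insn.op == "lui":
            return {i}
        if insn.op == "addu":
            # Either source register could carry the high half: follow both branches.
            found: Set[int] = set()
            for src in insn.srcs:
                found |= find_hi16_sources(code, i, src)
            return found
        # Any other write to `reg` (a shift, a load, ...) ends this chain.
        return set()
    return set()
```

A typical array access, using o32 register numbers ($v0 = 2, $v1 = 3, $a0 = 4):

```python
code = [
    Insn("lui", 2, (), 0x8002),    # lui  $v0, %hi(table)
    Insn("sll", 3, (4,), 2),       # sll  $v1, $a0, 2
    Insn("addu", 2, (2, 3)),       # addu $v0, $v0, $v1
    Insn("lw", 2, (2,), -0x8000),  # lw   $v0, %lo(table)($v0)
]
print(find_hi16_sources(code, 3, 2))  # {0}
```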

But that graph traversal can't simply be done in reverse program order, because some instruction patterns can terminate it too early given the right mix of branches, instruction sequencing and register reuse. I've had to integrate code flow analysis to figure out which parent code blocks actually have to be considered during the register graph traversal.

But the most evil horror is the branch delay slot. One particular peephole optimization consists of vacuuming up a branch target's instruction into the branch delay slot and shifting the branch target one instruction forward, which effectively shortens the execution flow by one instruction. It also duplicates the instruction, which is catastrophic if it had a HI16 relocation, because now we have LO16 relocations with multiple HI16 parents, which object file formats can't represent. I have to detect and undo that optimization on the fly by shifting the branch targets one instruction back, which I accomplish by adjusting the relocation addends for the branches.
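
A rough Python sketch of detecting that pattern follows. The heuristic and helper names are my own guess for illustration, not how the extension does it: if a branch's delay slot holds a copy of the instruction just before its target, assume the slot was filled from the target and move the target back one instruction. Two caveats the sketch doesn't handle: identical words can be a coincidence, and after retargeting the duplicated instruction runs twice on the taken path, which is only harmless when it's idempotent (as a LUI is).

```python
from typing import Dict

NOP = 0x00000000


def sign_extend16(value: int) -> int:
    return value - 0x10000 if value & 0x8000 else value


def branch_target(addr: int, word: int) -> int:
    """Target of a MIPS I-type branch (BEQ, BNE, ...) located at addr.
    The offset is counted in instructions from the delay slot."""
    return (addr + 4 + (sign_extend16(word & 0xFFFF) << 2)) & 0xFFFFFFFF


def undo_delay_slot_fill(words: Dict[int, int], branch_addr: int) -> int:
    """Return the branch target the delinked object file should use."""
    target = branch_target(branch_addr, words[branch_addr])
    delay_slot = words[branch_addr + 4]
    before_target = words.get(target - 4)
    if delay_slot != NOP and before_target == delay_slot:
        # The delay slot looks like a copy of the original target instruction,
        # so point the branch back at where it was before the optimizer shifted it.
        return target - 4
    return target
```

In a real delinker the corrected target would then be expressed through the branch's relocation (e.g. its addend) rather than returned as a plain address.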

I've only written relocation analyzers for x86 and MIPS so far. I don't know what other horrors are lurking inside other architectures, but I expect that all RISC architectures with split relocations will require some form of that register graph traversal and code flow analysis [2]. What I do know is that my MIPS relocation analyzer [3] is probably the most algorithmically complex piece of code I've written so far, one that I've rewritten half a dozen times over two years due to all the edge cases that kept popping up. I also had to create an extensive regression test suite to keep the analyzer from breaking in subtle ways every time I modify it. I expect that there are still edge cases in there that I haven't encountered yet.

[1] I've written about some of them here, but it's far from the whole story: https://boricj.net/tenchu1/2024/05/15/part-10.html

[2] That piece of code is split off in its own Java class: https://github.com/boricj/ghidra-delinker-extension/blob/mas...

[3] In case you're curious: https://github.com/boricj/ghidra-delinker-extension/blob/mas... (remember that the register graph and code flow bits are split off into another class)