The AMD Branch (Mis)Predictor Part 2: Where No CPU Has Gone Before

cwillu · on March 8, 2022

The final pithy quote about how a prominent linux developer claimed no cpu would ever do such a thing, loses some of its impact when you realize that a _more_ prominent linux developer replied saying “I think, when it comes to speculative execution, our general expectation that CPUs don't do idiotic things got somewhat weakened in the past year or so ...”

throwawaylinux · on March 9, 2022

> linux developer

This goes to show you that the software and digital logic design space is very different, and you should not assume any amount of experience and knowledge in one will prepare you for the realities of the other.

Sure it seems astounding that a CPU would speculate not-taken over a static, direct, unconditional branch. Surely no sane logic designer would ever do such a thing!

When you realize doing that might require more logic in a downstream stage after fetch and after the branch was decoded, and from there you need more control logic and wires going to other stages and logic to re-steer the pipeline in possibly a different manner or stage than would otherwise be required, and all that requires validation and verification (which can easily become a multiplicative effort), and if doing all of that did not bring you performance gains that outweighed the cost, and if you didn't understand the possible security implication or considered it to also not be worth the cost, then actually it's quite understandable for a reasonable implementation to behave this way.

Even the second quote smells a bit like dunning kruger. Everything has bugs or misfeatures, especially with hindsight. Everyone (including Linus and many other kernel developers) thought that the massive speculation capability and correspondingly astounding performance results on privilege ___domain crossing for Intel CPUs was the duck's nuts back before spectre/meltdown. And they certainly didn't foresee or discover the security issues with doing that.

(EDIT Disclaimer: Am linux developer and armchair microarchitect)

Someone · on March 9, 2022

Indeed. If unconditional branches are rare enough it may not be worth the cost.

Also, it’s not like software designers don’t waste cycles. In many languages, for example, arguments to log statements are evaluated even if logging is disabled. There sometimes is the hope that the runtime will figure that out and move that inside the if logging is enabled check, but that’s only a hope (COBOL had the ALTER statement for efficiently handling such things, but that’s being frowned upon :-))

Also: a conditional branch can be always taken, for example when the software does an if cpu supports instruction extension at startup, sets a flag accordingly, and later branches depending on that flag.

That, too, is a case where the CPU will speculatively execute what, to it, may look like garbage. So, you have to make sure speculative execution handles that.

throwawaylinux · on March 9, 2022

Yep. Unconditional branches require BTB entries anyway if they're to be predicted ahead of decode, and if AMD's BTB was large and accurate enough then the incremental performance advantage of detecting not-taken direct branches after decode might not have been worthwhile.

Security, sure arguably it could have been caught there. But Linux kernel development does not have the high ground when it comes to security problems.

gpderetta · on March 9, 2022

That belief is surprising, because I remember an at least 10 year old optimization advice from Intel to put ud2 after a ret or jmp to avoid speculation.

Also some amount of speculation even for untaken branches is unavoidable: the fetch stage need to know whether to fetch the next instruction or not before the unconditional jump is decoded, which is multiple pipeline stages downstream.

CorrectHorseBat · on March 8, 2022

Too bad Grsecurity isn't able to work together with upstream, their security knowledge seems to be legit.

staticassertion · on March 8, 2022

Alternatively phrased, too bad upstream isn't able to work with Grsecurity.

throwawaylinux · on March 9, 2022

grsecurity has a vested interest in having (or being seen to have) better security than upstream.

Maybe upstream is able to work with hundreds of different and competing companies and thousands of different people and clashing personalities and opinions, but they just happen to irrationally refuse to treat grsecurity fairly.

Or maybe it's the simpler explanation that has more obvious motives.

wahern · on March 9, 2022

As I recall, one aspect of the tension originated from Linus' express hostility to pure security mitigations. IOW--and I don't know if this is the case today, especially after Spectre--Linus and a large part of the community were resistant to mitigations and code refactors that didn't actually add functional behavior, and especially if they were visible to userspace, harming backward compatibility.

But those types of patches were precisely what grsecurity was pushing. So you basically had a fundamental clash of philosophies (independent of the clash of personalities) about whether the work grsecurity was doing was even valid from a software engineering standpoint. Arguably because of the relentless onslaught of exploits and growing enterprise influence, the Linux community slowly relented, but never with a mea culpa, AFAIK.

A similar dynamic played out with /dev/urandom, where the Linux community and subsystem maintainers were adamant that the existing blocking and entropy accounting behaviors were crucial to security and non-negotiable. But now look where things stand--both are gone after 10-15+ years of debate. Again, without any mea culpas.

Ditto for sandbox-friendly interfaces. Once upon a time any grievance that an interface which relied on /proc, /dev, or similar was suboptimal from a sandboxing standpoint (kernel surface area, headaches with chroots, etc) was met with derision. Eventually we got getrandom(2), and other recent application interfaces, like process descriptors, were revised so they were usable without /proc or similar.

You could see alot of this play out on Slashdot, HN, etc as commenters frequently parroted Linux developers' talking points and assertions.

There's plenty of blame to go around.

throwawaylinux · on March 9, 2022

Many security mitigations have been merged into the kernel for a long time, probably longer than grsecurity has been around, but certainly something like NX-bit you could use as an example which was around the same time as early grsec (early 2000s).

Sure Linus has been hostile or funny about such things at times, but that has not prevented him from being convinced or them from being merged.

staticassertion · on March 9, 2022

Right, that's why they gave away their work for free for decades?

throwawaylinux · on March 9, 2022

What's why?

staticassertion · on March 9, 2022

You're implying that Grsecurity's incentive to be safer than upstream means they won't cooperate. But Grsecurity made their patches freely available for the vast majority of their existence. I think it just shows ignorance of the situation to imply that they're only interested in making money.

throwawaylinux · on March 9, 2022

> You're implying that Grsecurity's incentive to be safer than upstream means they won't cooperate.

> But Grsecurity made their patches freely available for the vast majority of their existence.

I didn't say anything about whether they made their patches freely available or not. Not sure what that has to do with what I wrote.

> I think it just shows ignorance of the situation to imply that they're only interested in making money.

I didn't imply that. I said they have a vested interest in perception of being more secure than upstream. Which they do. Replying with vague snark and going off on some wild tangent imagining things that I never implied isn't helpful if you can't address what I wrote.

I didn't say anyone is to blame as such, nor does grsecurity have any requirement or moral duty to put effort into upstreaming their work. But if you look at the motivation there is a pretty reasonable explanation why they have not done so, in my opinion that's more reasonable and likely than the idea that it's upstream being particularly unfair or uncooperative to this one group.

nix23 · on March 9, 2022

Too Bad Linus once called them Clowns [3]...but hey there's also [1] and [2] ;)

1: https://hardenedbsd.org/

2: https://www.openbsd.org/

3: https://www.spinics.net/lists/kernel/msg2540934.html

2008 (Linus):

>>I think the OpenBSD crowd is a bunch of [self-stimulating] monkeys

2019 (Greg KH):

>>OpenBSD was right on disabling hyper-threading

loeg · on March 9, 2022

HardenedBSD are clowns. OpenBSD has some good ideas, some clownish ideas, but as a user you cannot separate the two and as a developer community, they aren't interested in reducing the clown factor. GRSecurity also has some great ideas and some silly.

nix23 · on March 10, 2022

>HardenedBSD are clowns.

Ahh that's why OPNsense uses it? But hey you write like a real professional in that field so i believe you, instead the people who actually work day in day out on a professional firewall product.

Maybe you have to acknowledge that there is just one clown here ;)

loeg · on March 10, 2022

I can't speak to OPNsense's judgment. I am a real professional in that field -- I've worked on FreeBSD in the past, and interacted with HBSD folks professionally.

userbinator · on March 9, 2022

To my knowledge none of these side-channels, even the original ones several years ago, have been exploited practically in the wild. IMHO gathering the amount of detail needed to attempt such an attack, as exemplified by the demos that have been given, would itself be prohibitively difficult. Thus the impact of yet another one remains negligible to a personal computer user, but of course the cloud providers would be super-paranoid about it.

It's worth noting that the memory protection scheme, introduced with the 286, was never intended to be a strong security barrier, but instead a means of isolating bugs and making them easier to debug.

joe_guy · on March 9, 2022

While not an in the wild big, I was a disbeliever that it could even be exploited from a browser, but just last year Google proved me wrong https://security.googleblog.com/2021/03/a-spectre-proof-of-c...

userbinator · on March 9, 2022

If you have a precise enough timing and the machine is quiet enough, then yes you could always do it given enough time, but the question then becomes how to interpret the data that you do manage to read. Keys look like random data, and passwords are probably recognisable with enough semiautomated effort; but even if you manage to isolate those "needles in the haystack", you still have next to no idea what "lock" the key is for.

Karliss · on March 9, 2022

> Key looks like random data

Not always. When using AES algorithm the key gets expanded and the expanded form can be distinguished from random data. Combination of RSA private key and public parameters should also be distinguishable from random data due to integer multiplication properties RSA uses. There are tools for searching encryption keys in program memory using such tricks to recognize them. Such tools would typically be used when a program like game or DRM media player runs on attacker controlled computer, but in theory would also help when fragments of memory can be leaked from a server due to side channel attacks or bugs like heart bleed.

Groxx · on March 9, 2022

I suppose this may be one unintended point in a crappy antivirus's favor: by slowing down everything and causing a ton of CPU noise, it might make these kinds of things harder to exploit.

Lascaille · on March 9, 2022

That's a great article.

I do worry sometimes that something is up with CPU development, that we're tending towards more and more complicated designs with workflows that are very hard to analyse and simulate even for the designers themselves, but the actual workload execution ability performance per core isn't shifting upwards all that much, and then weird mitigations have to be applied that reduce that execution ability in practice.

Something makes me think that perhaps a different design paradigm should prevail, with particular attention paid to segregation of workloads and of core partitioning, perhaps an abandoning of hyperthreading and even to the extent of having 100% physical separation of cores and their caches.

But I'm very much not an expert in the field.

A little birdie inside of me every now and then wakes up and whispers 'is it a coincidence that these design paradigms are yielding so many vulnerabilities?'

eternityforest · on March 9, 2022

Hyperthreading and the like seems to give a pretty big performance boost. And process separation is currently a bit irrelevant on desktop for most people who have single user systems and a browser that has their entire life stored in it.

It's a bit concerning that we are not really making amazing progress in single threaded performance.

I think we might not be going quite far enough with complex instructions, and maybe we need to be able to define out own with custom microcode or something.

Like, a CPU that had just a few instructions, enough to bootstrap and that's it, and a very fast FPGA that could switch configuration quickly to make your own instructions.

Right now it takes time to send stuff to the GPU. We need acceleration we can access in one instruction, so that normal compilers can generate it all behind the scenes.

Or, alternately, modular pluggable acceleration. You could fit like 8 little MicroSD sized special purpose accelerators on a laptop MB.

Maybe file serving can be hardware accelerated. Is there an actual reason you need a beefy server to serve a ton of media, or can CDN caching be optimized the way ethernet switches are, so you have a tiny credit card size device serving 10gbps on three watts?

Lascaille · on March 9, 2022

>Maybe file serving can be hardware accelerated

Encryption aside a lot of file serving is accelerated. In Linux, apache has just called a kernel function (I forget what it's called, something unsurprisingly along the lines of socket_sendfile) to get data out on the wire, and that function itself is hardware accelerated by the network hardware through the large send offload v1 & v2 functions offered by the network card driver.

gpderetta · on March 9, 2022

You are probably thinking of splice(2).

Lascaille · on March 9, 2022

I was thinking of sendfile() but I see now it's become a wrapper around splice(). My knowledge is a little dated to say the least.

rob74 · on March 9, 2022

If you apply your suggestion to existing x86 designs, you would get chips that are slower than current ones while needing more silicon for security features - so they would be slower and also more expensive just to mitigate some extremely-difficult-to-exploit vulnerabilities. I don't think anyone would actually buy those chips. Maybe they could be marketed for applications with higher security requirements, but that would make them a niche product and thus even more expensive.

(also not an expert however)

timschmidt · on March 9, 2022

https://millcomputing.com/ springs to mind.

colejohnson66 · on March 9, 2022

The Mill architecture is vaporware. It’s been what? A decade? And there’s not even an FPGA demo yet?

Not to mention that the whole idea of the Mill requires a “sufficiently smart compiler”. We tried VLIW (based on the same compiler ideals) with the Itanium and it was a major flop.

Lascaille · on March 9, 2022

Itanium had a decent service life and a reasonable install base. It didn't take off because of economies of scale and Intel not really wanting to introduce a big x86/x64 competitor.

The Itanium platform was totally unaffected by Spectre/Meltdown type attacks, you may note.

jcranmer · on March 9, 2022

> It didn't take off because of economies of scale and Intel not really wanting to introduce a big x86/x64 competitor.

Intel invested heavily into Itanium, and the 90s started seeing the deaths of competing processor architectures in part because of how hyped up the Intel Itanium was getting. It was intended to be the 64-bit version of x86, hence the abbreviation IA-64. AMD was the one who created what we now know as x86-64, and my understanding is that Microsoft more or less forced Intel to implement AMD's x86-64 specification.

In its later years (let's say by 2010, since that's the midpoint of Itanium's life as a shipping product, but I don't have any clear dates as to when the shift happens), it does seem to be that Intel was reluctant to continue supporting Itanium. But that was definitely not the case beforehand.

hajile · on March 9, 2022

Intel, HP, etc invested over 10 billion dollars into Itanium (closer to 14-17 billion dollars today). For perspective, AMD's market cap in 2003 was around 5 billion. They could have built 20% of the US carrier fleet of the time with that much money.

Ultimately, the compiler never materialized and I'm convinced they knew it wouldn't (but in the meantime, they almost killed all the RISC competition). The Halting problem means that at best you have heuristic optimizations that constantly fall through.

The only way to solve these problems with a high degree of success is to analyze the code as it is running which is what other CPUs actually do. This is so true that later versions of Itanium (latest being released in 2017) were just VLIW wrappers around a rather traditional, speculative core.

timschmidt · on March 9, 2022

Languages like Rust permit sufficient analysis at compile time to make architectures like mill and itanium shine. One of the reasons Rust is so exciting beyond the oft repeated memory and thread safety.

colejohnson66 · on March 9, 2022

Knowing who owns what data wasn't Itanium's issue. It was that the compiler was expected to manage the pipeline ahead of time using static analysis when dynamic analysis is required for best performance. Speculative execution (ignoring the issues of Spectre/Meltdown) is a much better model than what the Itanium and Mill require. It's impossible to know ahead of time all the possible states of an arbitrary program. Because of that, branch predictors are still king.

timschmidt · on March 9, 2022

I'm pretty sure that what I said about Rust in this context stands. The same math it uses to guarantee thread safety works also for speculative loading during execution. Or could with little effort. Rust requires enough information about the data to know for certain which routine will be handling it at compile time. This is sufficient for speculative loading.

You are correct in as much that achieving this level of optimization through static analysis alone without cooperation in the language would be a very challenging problem. The folks at https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/ are taking a stab at it nonetheless - their approach requires language extensions.

I'd prefer not to ignore the issues of Spectre and Meltdown - doing so is what got us here in the first place. I still run many older machines in roles for which they are well suited and would gladly trade performance for more predictable and secure behavior.

toast0 · on March 9, 2022

> It didn't take off because of economies of scale and Intel not really wanting to introduce a big x86/x64 competitor.

Intel didn't want to make 64-bit x86, but the market prefered amd64 over itanium.

throwawaylinux · on March 9, 2022

AMD is supposed (or does) have very good branch predictors. What's interesting to me is they don't do a re-steering of the wrong prediction well before later instructions are issued to the back end.

Maybe their BTB is that good that they didn't see it worth investing in the control logic for it.

RcouF1uZ4gsC · on March 8, 2022

SPS Theorem

Pick two:

Shared Physical Hardware

Performance

Security

mjevans · on March 8, 2022

Pick, __at most__, two

BuckRogers · on March 13, 2022

How does that relate to what Apple has with the M and A-series chips? They seem to have all 3.

B1FF_PSUVM · on March 9, 2022

Somehow the "Final remarks" section reminded me of the more paranoid "wheels within wheels" Frank Herbert fiction.

Not pleasantly, the FH fiction wasn't, either.

unixbane · on March 9, 2022

the funny part is that PCs are shown to be unusuable for security sensitive things as usual and all this research will be used to optimize video games

morpheuskafka · on March 9, 2022

What does "code gadget" mean in this context?

gpderetta · on March 9, 2022

A code gadget is any pre-existing string of binary code in an executable that by chance happens to have some specific properties that make it possible to be repurposed by an attacker, usually by jumping in the middle of it.

IIRC the term was introduced with the Return Oriented Programming style of attacks and reused for spectre-like speculation attacks.

rodgerd · on March 9, 2022

If you're very clever (unlike me) and you break into a modern running system, you'll find that there are a lot of protections to stop you just inserting your own code and running it.

Smart people will look at the current executable code in the system, and work out how to manipulate the existing executable code to build the attacker's programs (return oriented programming); the identified chunks are called gadgets.

divyekapoor · on March 9, 2022

Very well done.

jimmaswell · on March 9, 2022

I went AMD for the first time with my new laptop. Some 8 core Ryzen. Games crash all the time, not sure if it'd the architecture's fault, and I can't properly virtualize Windows 98 (which I like to just for fun/nostalgia) apparently due to the architecture. I feel like I'm sticking with Intel from now on like my old gut instinct said to.

jquery · on March 9, 2022

With respect to games crashing, it's almost certainly not the architecture's fault. Far more likely is a system configuration or driver issue. I had more issues with my previous Intel build than my current Zen2 Ryzen (64 core TR). I have almost no issues with virtualization or gaming now. That said, my hardware is rarely pushed to its limits so that might be a factor.

As an aside, given how Intel squandered its near-monopoly for years until AMD came up from behind and kicked 'em in the pants, I'll be sticking with AMD out of principle. I'm especially upset at some of the really dirty tricks Intel pulled in the past.

BuckRogers · on March 13, 2022

To share the other side of the coin. Your experience, while valid, is not actually the norm from what I've read and experienced. I had 4 Ryzen chips since the initial launch in 2017 and none of them were as stable or fast as my current 8-core Intel system. All subsystems tested faster as well, including storage performance. I'm someone who buys and has the latest and greatest side by side.

Random IOPS is off the charts faster on Intel my 1:1 tests between real-world system with the same drive. I had mediocre (but suitable) experiences with Athlon chips 22 years ago, with the same basic outcome: great CPUs, mediocre platform (chipset engineering etc).

I understand your emotions about AMD and Intel, you may not be "wrong". I have no strong feelings on that, just want what works. But for holistic engineering my 40 years of PC building experience is persuasive towards Intel products.

vardump · on March 9, 2022

"With respect to games crashing, it's almost certainly not the architecture's fault."

Indeed. Especially for "AAA" games designed to run on major AMD-based consoles such as Xbox and Playstation. Those games might actually be slightly more likely to be broken on Intel systems!