Put an io_uring on it: Exploiting the Linux kernel (graplsecurity.com)
282 points by blopeur on March 8, 2022 | 22 comments



This is one of the all-time great LPE writeups.

A summary:

1. io_uring includes a feature that asks the kernel to manage groups of buffers for SQEs (the objects userland submits to tell uring what to do). If you enable this feature, the kernel overloads a field normally used to track a userland pointer with a kernel pointer.

2. The special-case code that handles I/O operations for files-that-are-not-files, like in procfs, missed the check for this "overloaded pointer" hack, and so can be tricked into advancing a kernel pointer arbitrarily, because it thinks it's working with a userland pointer.

3. The pointer you manipulate thusly is eventually freed, which lets you free kernel objects within a range of possible pointers.

4. io_uring allows you to control the CPU affinity of the kernel threads it generates on your behalf, because of course it does, so you can get your userland process and all your related io_uring kthreads onto the same CPU, and thus into the same SLUB cache area, which gives you enough control to target specific kernel objects (of a size bounded I think by the SQE?) reliably.

5. There's a well-known LPE trick for exploiting UAFs: the setxattr(2) syscall copies arbitrary extended attributes for files from userland to kernel buffers (that's its job), and the userfaultfd(2) syscall lets you defer page faults to userland; you can chain setxattr and userfaultfd to allocate and populate a kernel buffer of arbitrary size and contents and then block, keeping the object in memory.

6. Since that's a popular exploit technique, there's a default-yes setting in most distros to require root to use userfaultfd(2) --- but you can do the same thing with FUSE, where deferring I/O operations to userland is kind of the whole premise of the interface.

7. setxattr/userfaultfd can be transformed from a UAF primitive to an arbitrary kernel leak: if you have an arbitrary-free vulnerability (see step 3), you can do the setxattr-then-block thing, then trigger the free from another thread and target the xattr buffer, so setxattr's buffer is reclaimed out from under it, then trigger the allocation of a kernel structure you want to leak that is of the same size, which setxattr will copy into (another UAF); now you have a kernel structure that the kernel is treating like a file's extended attributes, which you can read back with getxattr. Neat!

8. At this point you can go hunting for kernel structures to whack, because you can use the arbitrary leak primitive to leak structs that in turn embed the (secret) addresses of other kernel structures.

9. Find a pointer to a socket's BPF filter and use the UAF to inject a BPF filter directly, bypassing the verifier, then trigger the BPF filter and do whatever you want, I guess.

I'm sure I got a bunch of this wrong; corrections welcome. Again: really spectacular writeup: a good bug, some neat tricks, and a decent survey of Linux kernel LPE techniques.


Her eBPF talk is pretty cool too

https://www.youtube.com/watch?v=vADX3GtEJ0A


Whoa!

One frickin’ GIANT driver coherency setting, I/O Ring, that is.


Yes, unfortunately I figured this might happen. People have been warning about major security issues in its design for a while now. Paired with the fact it's not much faster in practice than epoll in a large majority of usecases, I really worry it's going to footgun some people.


I don't think we concluded that there were any fundamentally unsafe aspects of io_uring. We decided to look at it because it's an interface of great interest to us as a company, and we suspected that the combination of: new code, performance oriented code, concurrency oriented code, would be a great place to find some bugs.

Whenever we are interested in adopting new technologies we do a security review, so this naturally came out of that. We'll be posting more posts on other areas of interest for us.


io_uring relies on a pool of independent kernel threads performing operations on buffers (and other resources) provided by non-privileged userspace processes. While conceptually simple, it's a sharp departure from the standard userspace/kernel syscall and process model. It was inevitable that it would stumble over little nooks & crannies of the kernel that silently made risky assumptions dependent on the standard model.

> I don't think we concluded that there were any fundamentally unsafe aspects of io_uring.

Does io_uring still have the trap that a ring context initialized before a process drops privileges with setuid can still dispatch operations on root-privileged kernel worker threads? That's a nasty problem, partly related to the fact that Linux has no process-global UID--every thread's UID (effective and saved, plus GID, supplementary GIDs, etc) has to be managed separately per thread, which requires herculean hacks in libc to provide POSIX setuid semantics, which except for very specialized, Linux-specific software (e.g. runc) is all that most people care about or even consider. IIRC there was a related issue when passing a ring context to a different process altogether, but it was already fixed or at least mitigated.

There's some irony in io_uring both being so performant and becoming popular; it has microkernel written all over it, which is an approach that Linux (and Linus) notoriously ridiculed so many years ago as requiring interfaces that were both too slow and too complicated. Except, oddly, where a proper microkernel would preserve and sharpen privilege boundaries (including capability objects, VM isolation, etc.), io_uring hacks around them and reduces their effectiveness.


> io_uring relies on a pool of independent kernel threads performing operations on buffers (and other resources) provided by non-privileged userspace processes. While conceptually simple, it's a sharp departure from the standard userspace/kernel syscall and process model. It was inevitable that it would stumble over little nooks & crannies of the kernel that silently made risky assumptions dependent on the standard model.

The problem here has nothing to do with kernel threads reading user-space data asynchronously. The problem is that a user-provided struct field in a system call could be interpreted as a kernel-space address and operated on, and one of the kernel functions missed the check for that overload.

> Does io_uring still have the trap that a ring context initialized before a process drops privileges with setuid can still dispatch operations on root-privileged kernel worker threads? That's a nasty problem, partly related to the fact that Linux has no process-global UID--every thread's UID (effective and saved, plus GID, supplementary GIDs, etc) has to be managed separately per thread, which requires herculean hacks in libc to provide POSIX setuid semantics, which except for very specialized, Linux-specific software (e.g. runc) is all that most people care about or even consider. IIRC there was a related issue when passing a ring context to a different process altogether, but it was already fixed or at least mitigated.

Those permissions semantics are by design, even in POSIX. A file’s associated permissions are determined at the time of creation, not at the time of access. It is expected that one can open a root-owned resource as root, drop privileges, and still access the resource. And if you think about it in the context of io_uring, there isn’t really any other sensible way to do it - there’s no way for the kernel to determine which task submitted an SQE because it’s just a write in a memory address space that may be shared by any number of tasks.


> io_uring relies on a pool of independent kernel threads performing operations on buffers

Do note that this setup was rewritten around kernel version 5.13 or so, and the current model seems to be some sort of hybrid of kernel and userspace threads. From what I gather it was a huge improvement over how it was before.


Containers look an awful lot like user processes for microkernels as well.

At this point I’m wondering if Tanenbaum will live long enough to say “I told you so.”


> Paired with the fact it's not much faster in practice than epoll in a large majority of usecases, I really worry it's going to footgun some people.

"it's not faster than epoll" is somewhat dependent on your hardware and kernel. For one thing, Jens Axobe has been working on a lot of io-uring optimizations lately, but you probably won't see them unless you're using a kernel from the last few months. And by "a lot" I really mean 3x to 4x faster in the last year on the benchmarks he has been using.

So if all your comparisons are on an enterprisey Linux distro, you probably aren't getting a complete picture of epoll vs io-uring performance. epoll has been around a while; it's had more hours poured into optimization and probably regresses less frequently.


And here I am waiting around for a runtime to get the memo that the denominator on cost:benefit is going up, for that runtime to convince the libraries we use to stop watching and waiting and act on it, for the changes to stabilize, and then for my coworkers and me to get off our duffs and upgrade to that version of the code.

These pipelines can be pretty deep and fairly long if you're not writing systems code.


> Paired with the fact it's not much faster in practice than epoll in a large majority of usecases

It just takes significant development effort to take advantage of io_uring's unique ability to coalesce many SQEs into a single syscall. I'd argue a minority of use cases involve a single syscall per logical work slice, not the majority. But most programs aren't yet written to exploit it, not for lack of potential benefit.

There's similarity to SIMD extensions in this regard, but wide batching of syscalls is arguably far more generally applicable than SIMD instructions.

When you make a low-effort conversion of an epoll-utilizing program to io_uring, it's similar to plugging some SIMD intrinsics into your existing program without actually refactoring anything to profitably go wide and avoid continuously converting to/from the wide types. You'll find it's either no faster or even slower than before, but that doesn't mean SIMD can't make the application faster, it just takes some proper doing.

It took years for SIMD to become well utilized once C compilers added intrinsics exposing them to programmers. I expect a similar delay before userspace evolves to exploit io_uring, and we could really use better language-level constructs to make writing such async code more ergonomic.


For disk IO it’s faster, there are many benchmarks on the internet.

For network IO, it depends. Only two things make it theoretically faster than epoll: io_uring supports batching of requests, and you can save one syscall compared to epoll in an event loop. There are some other things that could make it faster, like SQPOLL, but those could also hurt performance.

Network IO discussion: https://github.com/axboe/liburing/issues/536


> Network IO discussion: https://github.com/axboe/liburing/issues/536

I see an issue with a narrative but zero discussion at that link.

Furthermore, your io_uring benchmark being utilized in that issue isn't even batching CQE consumption. I've submitted a quick and dirty untested PR adding rudimentary batching at [0]. Frankly, what seems to be a constant din of poorly-written low-effort benchmarks portraying io_uring in a negative light vs. epoll is getting rather old.

[0] https://github.com/frevib/io_uring-echo-server/pull/16


That seems uncharitable.

The linked issue barely mentions frevib's echo server, and in the one place it does, it's the fastest!

Further, they show that performance improves when using io_uring for readiness polling but standard read/write calls to actually do the I/O; that suggests io_uring_for_each_cqe does not explain the cases where epoll is faster.

> I've submitted a quick and dirty untested PR

That's not improving the situation much then - surely any performance fix should come with at least a rudimentary benchmark?


In my tests, for NVMe storage I/O I found io_uring was slower than a well-optimised userspace thread pool.

Perhaps the newer kernel is faster, or there is some subtlety in the io_uring queueing parameters that I need to tune better.


Maybe you're doing large I/Os whereas their benchmarks are doing small random I/Os (like 4K). Are you measuring IOPS or throughput?


Measuring IOPS of random, small reads (4kiB, O_DIRECT, single NVMe). (It's for optimising a database engine doing random lookups, but the benchmark is just random reads, no other logic.)

Just now I measured 1,133 kIOPS with threads and 598 kIOPS with io_uring. The SQE queue depth for io_uring and the max thread count for the thread test are both set to 512.

I'd like to think this is due to a particularly well-optimised thread pool :-)


I’m confused by this; isn’t one of the main points of io_uring that it’s faster?


With io_uring you can easily get 100K IOPS per-thread using NVMe flash. If you can push that many IOPS to one disk, using io_uring may simplify your problem.


Jens Axboe has been getting >12 million IOPS with an Optane drive.


> most distros sync on stable releases

[citation needed]



