"We developed an exploit that allows unprivileged local users to start a
root shell by abusing the above issue. That exploit was shared
privately with <security () kernel org> to assist with fix development.
Somebody from the Linux kernel team then emailed the proposed fix to
<linux-distros () vs openwall org> and that email also included a link to
download our description of exploitation techniques and our exploit
source code.
Therefore, according to the linux-distros list policy, the exploit must
be published within 7 days from this advisory. In order to comply with
that policy, I intend to publish both the description of exploitation
techniques and also the exploit source code on Monday 15th by email to
this list."
Interesting... they didn't write what conditions have to be met for it to be exploitable. Also interesting that someone screwed up and accidentally forwarded an email including the exploit to a broad mailing list...
Some of the nf modules are active if you have iptables, which you have if you run ufw (for example), so it would be a pretty broad exploit if that's all that's required, but the specific module in question in the patch, nf_tables, is not loaded on my Ubuntu 20.04 LTS 5.4.0 kernel running iptables/ufw, at least.
> but the specific module in question in the patch, nf_tables, is not loaded on my Ubuntu 20.04 LTS 5.4.0 kernel running iptables/ufw, at least
This doesn't matter since Linux has autoloading of most network modules, and you can cause the modules to be loaded on Ubuntu since it supports unprivileged user/net namespaces.
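A minimal sketch of the mechanism (hypothetical illustration, not from the advisory): an unprivileged process can unshare into fresh user and network namespaces, where it holds CAP_NET_ADMIN and can issue the nftables netlink requests that trigger on-demand module loading. Something like:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Enter fresh user + network namespaces as an unprivileged user.
       This fails with EPERM where unprivileged userns is disabled. */
    if (unshare(CLONE_NEWUSER | CLONE_NEWNET) != 0) {
        perror("unshare");
        return 1;
    }
    /* A real tool would also write /proc/self/uid_map to become root
       inside the namespace. With CAP_NET_ADMIN in here, nftables
       netlink requests can make the kernel autoload nf_tables when it
       is built as a module. */
    printf("inside new user+net namespace, euid=%d\n", (int)geteuid());
    return 0;
}

This is the same mechanism the Debian transcript below probes with unshare -U -m -n -r.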
For comparison, on my Debian Bookworm (aka "testing", but in hard freeze, and full freeze in a few days I think, with the stable release in June) here...
...$ lsmod|grep nf_table (tried without any grep too, just to make sure)
...$ unshare -U -m -n -r
unshare: unshare failed: Operation not permitted
...$ /sbin/nft add table inet filter
Error: Could not process rule: Operation not permitted
add table inet filter
^^^^^^^^^^^^^^^^^^^^^^
root # cat /proc/sys/kernel/unprivileged_userns_clone
0
Mostly, I think. Debian has a patch that lets it be disabled at runtime via sysctl. The reason is that most container or sandboxing techniques are root-only unless you mix them with user namespaces. So most container or sandbox software uses suid (firejail), a root daemon (docker), or user namespaces (podman and flatpak). Looking at the CVEs, user namespaces is probably the safer option.
Rewritten what? The container runtime will need the same access regardless of what it's written in, and rewriting all of Linux (the kernel) would be... ambitious, although it is adopting Rust incrementally.
Some of the issue, though, is that a monolithic kernel provides more access than necessary to many things. When they made the locks granular, those might be reasonable boundaries for permissions? At this point I'd rather figure out how to make Windows drivers work in Redox or something crazy like that.
> Somebody from the Linux kernel team then emailed the proposed fix to <linux-distros () vs openwall org> and that email also included a link to download our description of exploitation techniques and our exploit source code.
> Therefore, according to the linux-distros list policy, the exploit must be published within 7 days from this advisory. In order to comply with that policy, [...]
What? Someone publishes information about your vuln to a random mailing list, and this somehow creates an obligation on you to follow that mailing list's policies? I don't get it.
> Please note that the maximum acceptable embargo period for issues disclosed to these lists is 14 days. Please do not ask for a longer embargo. In fact, embargo periods shorter than 7 days are preferable.
Maybe linux-distros has a PoC-or-GTFO rule in place to keep the unchecked "I can get root on your box with this one weird trick but I won't tell you how" emails to a minimum. Just a guess though.
What’s actually reasonable here? I’m all for exploit code becoming public eventually, but I think it’s silly to drop it immediately after a fix has been released, or before, in almost all scenarios (unless 90+ days have passed or the issue is marked as wontfix).
Odds are that well-resourced attackers already have the exploit by now. Making it public lets users decide if this is important to them and come up with their own mitigations.
Once they issue the patch...it's only a matter of time till a good chunk of reasonably decent coders can develop the exploit. Once the premise is released...yeah the top exploit coders will have this in a few hours.
What a dumb policy. Why have the disclosure time be so soon? This thing will be in the wild before folks can upgrade if I'm understanding this correctly.
"I vaguely recall at least around 6-7 such holes, and a quick google
search seems to reveal that at least those would have been mitigated
by unprivileged user namespaces being disabled:
CVE-2019-18198
CVE-2020-14386
CVE-2022-0185
CVE-2022-24122
CVE-2022-25636
CVE-2022-1966 resp. CVE-2022-32250"
It won't make that big of a difference. If you exploit the networking layer you could intercept any local traffic, which will mostly be unencrypted, and communicate with outside attackers. You are probably owned by that point unless you treated localhost as untrusted.
It's like why it doesn't matter if you are running as root or not. The user account has access to what's important, like a database or keychain.
A microkernel does seem like the only sensible path forward. Even if the kernel is slowly rustified, we're going to be playing security whack-a-mole for a long time.
Back in the day when the micro-kernel/monolith flamewars were raging, the arguments for monolith were about improved performance and lower memory usage. I haven't seen much discussion on this topic for years, but at least those two arguments have not aged well.
Hm, if you're making the underlying hardware slower, don't you want the kernel to be even faster though?
VMs are much more than microkernels. It's about allowing the user to install whatever they want on their machine. Containers are just a userland abstraction. Not sure where the link to microkernels is there.
Why not? A type 1 hypervisor has less overhead, but it's still not quite the same as running directly on the box. I don't think microkernels would replace those anyway. To be honest, I don't even really see the connection between running most of the kernel in user space and allowing concurrent systems to run on the same hardware.
seL4 with its VMM is a better hypervisor architecture than, say, Xen.
Xen is unfortunately large, and the full hypervisor runs privileged.
With seL4, VM exceptions are forwarded to the VMM, which handles them.
From a security standpoint, a VM escape would only yield VMM privileges, which are no higher than those of the VM itself. This is much better than a compromise of Xen, which would compromise all VMs in the system.
Makatea[0] is an effort to build a Qubes-like system using seL4 and its virtualization support. It is currently funded by an NLnet grant.
Spectre and friends seem to have killed Liedtke’s fast synchronous IPC, unfortunately. Of course, there’s still asynchronous IPC, exokernels (perhaps the closest thing to today’s containers), and so on.
Right now it seems microVMs are the way. Build an extremely minimal tailored kernel+userland for network-facing components. If you don't have nf_tables built in (and it's not loadable because it's not present), this vulnerability isn't a problem. I mean, right now to use it one would have to chain it with an RCE on your userland app (or on the kernel, but then just skip the nf_tables step...). Then one would have to escape the VM, and then, if you're using Firecracker or crosvm, break seccomp. Still imaginable, but by then I guess the next kernel (or userland app) fix release is already available :-) and you're already rebooting your microVM.
If you can CI/CD a reduced kernel+app in minutes and reboot your network-facing thing (be it nginx or haproxy) in 100ms, you might just take the latest vanilla anyway...
For rack servers you could probably get away with a number of microkernel OSes today. The desktop has clear options in that regard, but you are giving up open source.
Alternatively, perhaps we should start thinking about whether it is a good idea to have multiple users of different privilege sharing the same hardware.
"User" in a modern Linux system is just a weird name for "security ___domain". Many programs run as their own user to limit their ability to attack the rest of the system if they get compromised; and limit the ability of a different compromised component from attacking them.
My desktop, on which I am the only person with an account, has 49 "users", of which 11 are actively running a process.
At work, every daemon we run has a dedicated user.
To elaborate, seL4 claims to be the fastest kernel around[0], a claim that remains unchallenged.
To put it into context, the difference in IPC speed is such that you'd need an order of magnitude more IPC for a multiserver system based on seL4 to actually be slower than Linux.
A multiserver design would imply increased IPC use, but not an order of magnitude.
Sorry I'm pretty naive to this space. I didn't immediately see any performance info on that page save for this paper [0] which shows seL4 competitive with NetBSD, but far from Linux. Is there something else I should look at?
No, it doesn’t. Here’s the full quote from their website:
> seL4 is the world’s fastest operating system kernel designed for security and safety
Linux is arguably not designed for security and safety but it blows seL4 out of the water when it comes to performance. There’s a reason it only gets used in contexts where security is critical; I would have expected that you would be aware of this considering you were the one who is promoting it.
No, you don’t get to define the benchmarks like that. People use an OS so they can run real-world programs on top of it, not spin it in a loop and see how fast it can do IPC. In a monolithic kernel there’s no need to switch contexts for many things; that’s the entire point of using one. I’m sure that seL4 has a perfectly fast implementation of those operations, but that’s because it sits and does those all day as part of its basic functionality. Optimizing overhead doesn’t win you extra points when you’re comparing against an OS that doesn’t have it at all.
seL4 is an order of magnitude faster at this "overhead" thing. We're talking nanoseconds vs microseconds difference.
The multiserver architecture does indeed imply an elevated use of IPC, but it in no way outweighs the difference in IPC cost.
In this model, data sharing, and the implied locking, is minimized, which as a consequence helps SMP scaling.
DragonFly, while not multiserver proper, took a different direction than FreeBSD and Linux by optimizing IPC and not implementing fine-grained locks, instead favoring concurrent lockless and lock-free servers.
As a consequence, DragonFly scales much better than FreeBSD, and in many benchmarks manages to outperform Linux.
This is despite the tiny development team, particularly so when considered relative to the amount of funding those two systems get.
I am sickened by the effort that's being wasted on a model that we know is bad and does not work. Linux will never be high assurance, secure or scale past a certain point.
Fortunately, no matter how long it'll take, the better technology will win; there's no "performance hack" that a bad system can pull to catch up with the better technology once it's there.
> To elaborate, seL4 claims to be the fastest kernel around[0], a claim that remains unchallenged.
Can I run Firefox or PostgreSQL on seL4? Or another real-world program of comparable complexity? And how does the performance of that compare to Linux or BSD?
That's really the only benchmark that matters; it's not hard to be fast if your kernel is simple, but simple is often also less useful. Terry Davis claimed TempleOS was faster than Linux, and in some ways he was right too. But TempleOS is also much more limited than Linux and, in the end, not all that useful – even Terry ran it inside a VM.
I've heard this sort of claim about seL4 before and tried to look up more detailed information, but I've never really found anything convincing on the topic beyond "TempleOS can do loads more context switches than Linux!" type stuff.
> delete an existing nft rule that uses an nft anonymous set. And an example of the latter operation is an attempt to delete an element from that nft anonymous set after the set gets deleted
I'd be very interested to hear how this can be done by an unprivileged user.
Try to race set add/removals, sure, but if it depends on the set itself getting deleted, that seems… harder.
Andy Lutomirski described some concerns of his own:
> I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN over /any/ network namespace and to thus access the network configuration API to be a huge risk. For example, unprivileged users can program iptables. I'll eat my hat if there are no privilege escalations in there.
Honest question: Why did they build an exploit that uses the bug? I always assumed that use-after-free is equivalent to "game over" (i.e. I assumed that local privilege escalation is a given) and it is clear that such a bug must be fixed.
By that I mean, it might be easy or hard to exploit a bug to achieve LPE, but it seems to be redundant to prove that it is possible.
Making a PoC is a great way to convince both yourself and the maintainers that the bug is actually exploitable in the wild and thus a big fucking deal. Alternatively, you might discover that there are some other things going on which turns out to make the bug unexploitable.
Let me rephrase my question: Is there actually such a thing as an "unexploitable use-after-free"? What would that look like? How would you reason that it is actually unexploitable?
Context: My experience with C programming is that practically every bug that is related to memory management tends to blow up right into your face, at the most inconvenient time possible.
struct foo *whatever = new_foo();
// use 'whatever'
free_foo(whatever);
if (whatever->did_something) {  // use-after-free read: 'whatever' is dangling here
    log_message("The whatever did something.");
}
// never use 'whatever' after this point
The 'whatever' variable is used after what it points to is freed, but it's not exploitable. Worst case, if new memory gets allocated in its place and an attacker controls the data at the offset of the 'did_something' field, the attacker can control whether we log a message or not, which isn't a security vulnerability.
What happens if the code gets pre-empted between free_foo(whatever) and the if-statement, memory allocation gets changed, and subsequently dereferencing the pointer to read whatever->did_something causes a page fault?
I am making assumptions here: that pre-emption is possible (at least some interrupts are enabled), that "whatever" points to virtual memory (some architectures have non-mappable physical memory pointers), and that a page fault at this point is actually harmful.
However, I do want to point out that the reasoning for why your example is not exploitable isn't as easy as it first seems.
No preemption is needed, the call to free might unmap the page the pointer points to. I was considering adding a paragraph about that but didn't bother. A page fault isn't a privilege escalation issue though, it's a pretty normal thing.
> What would that look like? How would you reason that it is actually unexploitable?
For a use-after-free to be exploitable, by definition an attacker must be able to put arbitrary content in the freed memory region. This is not always easy: it may require a certain [mis]configuration, data layout, and so on.
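By way of contrast, here is a hypothetical sketch (invented code, not the nf_tables bug) of the shape an exploitable use-after-free usually takes: the freed object holds something security-critical, like a function pointer, and the attacker can reallocate the slot with bytes they control before the stale pointer is used:

#include <stdlib.h>
#include <string.h>

struct handler {
    void (*on_event)(void);   /* security-critical: controls execution */
};

static void benign(void) { /* normal behavior */ }

int main(void)
{
    struct handler *h = malloc(sizeof(*h));
    h->on_event = benign;

    free(h);                   /* the object dies here... */

    /* An allocation of the same size may reuse the freed chunk; this
       memset stands in for attacker-controlled data landing there. */
    char *spray = malloc(sizeof(*h));
    memset(spray, 0x41, sizeof(*h));

    h->on_event();             /* ...but is called through a dangling
                                  pointer: control flow may go to the
                                  attacker-chosen 0x4141... */
    return 0;
}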
> practically every bug that is related to memory management tends to blow up right into your face, at the most inconvenient time possible.
I will not contest this claim, however there is a difference between "blow up" and "exploit". A malicious packet being able to segfault a server is one thing; a malicious packet resulting in RCE is quite another. This may be a lost-in-translation moment, since in colloquial use "exploit" does not include DoS.
In the next, not-yet-released version 0.8.0, there will be a new option to disable a specific namespace type per sandbox. For example, disabling the network namespace would prevent this exploit.
This is more flexible than globally disabling all user namespaces, as some programs use other, more harmless namespace types; Steam, for example, uses mount namespaces to set up runtime libraries.
For modern distros, the nft package includes an alternative binary that takes the place of /sbin/iptables and translates the input to an nft-compatible format. As far as the kernel is concerned, iptables is still iptables. The old iptables can be accessed by calling the iptables-legacy binary, which will auto-load the old iptables .ko.
Yes, AFAIU (not an expert), iptables and nftables are two command line tools and abstractions (chains vs. tables) for interacting with the same underlying netfilter API.
If it says (nf_tables), you are using the compatibility layer from the iptables-nft package.
It works quite well. Apps like Docker that insert rules using the legacy iptables syntax are oblivious to the fact that they are actually inserting nftables rules.
It also provides an easy migration path: insert your old rules using your iptables script, then list them in the new syntax using `nft list ruleset`.
The problem is that it works so well that it seems most users just stayed with the iptables syntax and did not bother migrating at all.
IMO, the problem is that the people who created nftables (and the "ip" tool) couldn't create a user interface that anyone but themselves would like to use. Linux traffic shaping functionality suffers from the same "obscure word soup" interface.
I agree about the "ip" tool (from iproute2)... I got used to it, but I still prefer the ifconfig output. It is somehow consistent and you can get used to it.
I somehow got accustomed to the nftables rules format. It is in fact objectively much better than the iptables format in many ways. The native JSON, easy bulk submit to the kernel, built-in sets and maps (the source of the currently discussed CVE though). It really does fix a lot of what was wrong with iptables.
But iptables was probably not broken enough for most users to warrant re-learning everything.
Now, the traffic shaping tool, oof... I still cannot grok any of it. I've been happy with the fireqos script so far to abstract everything out of the tc syntax.
I wouldn't generally expect a use-after-free to result from improper pointer arithmetic; that's the recipe for a buffer overflow. But Rust happens to also be well-known for helping manage object lifetimes, which seems to be what went wrong here.
So to me (someone who is not an expert in this code) it looks like the fix is checking if the set has the anonymous flag before changing the reference count. I could be mistaken, but I think your claim that this would be fixed by Rust object lifetime checking requires better evidence.
I think a Rust-influenced design would have shied away from the manual direct reference count management in the first place and resulted in a fairly different-looking API; but at a minimum I'd expect that the safe wrapper `nf_tables_activate_set` would probably have existed from the beginning, and may have been designed to transfer ownership of the `nft_set` rather than just capture a reference to it.
More generally: doing a line-by-line translation from C to Rust is never going to be the best way to make use of the capabilities Rust has that C lacks.
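For illustration, a simplified sketch (invented names and fields, not the actual kernel code) of that wrapper idea: the anonymous-set special case lives in one activate/deactivate pair instead of being repeated at every call site that touches the use count:

#include <stdbool.h>

struct set {
    unsigned int use;   /* how many rules currently reference this set */
    bool anonymous;     /* anonymous sets are bound to exactly one rule */
    bool bound;
};

/* Call sites go through these instead of doing set->use++ and the
   anonymous-set bookkeeping by hand, so the invariant cannot be
   forgotten at any single site. */
static void set_activate(struct set *s)
{
    if (s->anonymous)
        s->bound = true;    /* re-link the set when its rule comes back */
    s->use++;
}

static void set_deactivate(struct set *s)
{
    if (s->anonymous)
        s->bound = false;
    s->use--;
}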
One of the parts of Rust’s safety story is to always use smart pointers for reference counting rather than the type of ad-hoc manual reference count management seen in the code you quoted. Combined with lifetime checking, it makes it impossible for some random logic error to cause a use-after-free.
> to always use smart pointers for reference counting
Agree - and the Linux kernel is extremely fragile because it is full of ad-hoc manual code like that.
Unfortunately, Rust won't come to the rescue, because (in the foreseeable future) Rust will only be available in leaf code, due to the many hard problems of transitioning from fragile C APIs to something better. Writing drivers in Rust is useful, but it limits the scope of how Rust helps.
Many of Rust's advantages at a tiny fraction of the effort could be had easily with a smooth transition path by switching the compiler from C to C++ mode. The fruit hangs so low, it nearly touches the ground, but a silly Linus rejects C++ for the wrong reasons ("to keep the C++ programmers out", wtf).
Every time I work on the Linux kernel source, I'm horrified by how much pain the kernel developers inflict on themselves. Even with C, it would be possible to install a mandatory coding style that is less fragile.
For example, in the aftermath of the Dirty Pipe vulnerability last year, I submitted a patch to make the code less fragile, a coding style that would have prevented the vulnerability: https://lore.kernel.org/lkml/20220225185431.2617232-4-max.ke... - but my patch went nowhere.
We’ll see. As far as I know, the biggest blocker to using Rust outside of drivers is the fact that LLVM lacks support for some architectures Linux supports. And rustc_codegen_gcc seems on track to fix that eventually; even if it takes years more, that’s not much time on the scale of Linux’s development history.
That wouldn't solve the hard problems I meant. Rust portability is an easy problem - it's clear how to port Rust to more architectures, just nobody has done it. But doing interop between Rust and C in both directions, with complicated things like RCU in between - that is a hard and complex problem.
Correction: it is impossible in safe Rust that only ever calls safe Rust. The moment you're calling unsafe Rust, the possibility returns.
Not saying Rust isn't an improvement, it's a huge improvement over C, but there's no reason to oversell it. Rust is not going to make these errors magically go away, at least not in a kernel, even if you wrote the kernel from scratch, all in Rust. Unless you managed to write all of it in safe Rust which... good luck with that.