"We developed an exploit that allows unprivileged local users to start a
root shell by abusing the above issue. That exploit was shared
privately with <security () kernel org> to assist with fix development.
Somebody from the Linux kernel team then emailed the proposed fix to
<linux-distros () vs openwall org> and that email also included a link to
download our description of exploitation techniques and our exploit
source code.
Therefore, according to the linux-distros list policy, the exploit must
be published within 7 days from this advisory. In order to comply with
that policy, I intend to publish both the description of exploitation
techniques and also the exploit source code on Monday 15th by email to
this list."
Interesting... they didn't write what conditions have to be met for it to be exploitable. Also interesting that someone screwed up and accidentally forwarded an email including the exploit to a broad mailing list...
Some of the nf modules are active if you have iptables, which you have if you run ufw (for example), so it would be a pretty broad exploit if that's all that's required, but the specific module in question in the patch, nf_tables, is not loaded on my Ubuntu 20.04 LTS 5.4.0 kernel running iptables/ufw, at least.
> but the specific module in question in the patch, nf_tables, is not loaded on my Ubuntu 20.04 LTS 5.4.0 kernel running iptables/ufw, at least
This doesn't matter since Linux has autoloading of most network modules, and you can cause the modules to be loaded on Ubuntu since it supports unprivileged user/net namespaces.
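A minimal sketch of the mechanism (hypothetical illustration, not from the advisory): an unprivileged process can unshare into fresh user and network namespaces, where it holds CAP_NET_ADMIN and can issue the nftables netlink requests that trigger on-demand module loading. Something like:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Enter fresh user + network namespaces as an unprivileged user.
       This fails with EPERM where unprivileged userns is disabled. */
    if (unshare(CLONE_NEWUSER | CLONE_NEWNET) != 0) {
        perror("unshare");
        return 1;
    }
    /* A real tool would also write /proc/self/uid_map to become root
       inside the namespace. With CAP_NET_ADMIN in here, nftables
       netlink requests can make the kernel autoload nf_tables when it
       is built as a module. */
    printf("inside new user+net namespace, euid=%d\n", (int)geteuid());
    return 0;
}

This is the same mechanism the Debian transcript below probes with unshare -U -m -n -r.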
For comparison, on my Debian Bookworm (aka "testing", but in hard freeze, and full freeze in a few days I think, with the stable release in June) here...
...$ lsmod|grep nf_table (tried without any grep too, just to make sure)
...$ unshare -U -m -n -r
unshare: unshare failed: Operation not permitted
...$ /sbin/nft add table inet filter
Error: Could not process rule: Operation not permitted
add table inet filter
^^^^^^^^^^^^^^^^^^^^^^
root # cat /proc/sys/kernel/unprivileged_userns_clone
0
Mostly, I think. Debian has a patch that lets it be disabled at runtime via sysctl. The reason is that most container or sandboxing techniques are root-only unless you mix them with user namespaces. So most container or sandbox software uses suid (firejail), a root daemon (docker), or user namespaces (podman and flatpak). Looking at the CVEs, user namespaces is probably the safer option.
Rewritten what? The container runtime will need the same access regardless of what it's written in, and rewriting all of Linux (the kernel) would be... ambitious, although it is adopting Rust incrementally.
Some of the issue, though, is that a monolithic kernel provides more access than necessary to many things. When they made the locks granular, those might be reasonable boundaries for permissions? At this point I'd rather figure out how to make Windows drivers work in Redox or something crazy like that.
> Somebody from the Linux kernel team then emailed the proposed fix to <linux-distros () vs openwall org> and that email also included a link to download our description of exploitation techniques and our exploit source code.
> Therefore, according to the linux-distros list policy, the exploit must be published within 7 days from this advisory. In order to comply with that policy, [...]
What? Someone publishes information about your vuln to a random mailing list, and this somehow creates an obligation on you to follow that mailing list's policies? I don't get it.
> Please note that the maximum acceptable embargo period for issues disclosed to these lists is 14 days. Please do not ask for a longer embargo. In fact, embargo periods shorter than 7 days are preferable.
Maybe linux-distros has a PoC-or-GTFO rule in place to keep the unchecked "I can get root on your box with this one weird trick but I won't tell you how" emails to a minimum. Just a guess though.
What’s actually reasonable here? I’m all for exploit code becoming public eventually, but I think it’s silly to drop it immediately after a fix has been released, or before, in almost all scenarios (unless 90+ days have passed or the issue is marked as wontfix).
Odds are that well-resourced attackers already have the exploit by now. Making it public lets users decide if this is important to them and come up with their own mitigations.
Once they issue the patch...it's only a matter of time till a good chunk of reasonably decent coders can develop the exploit. Once the premise is released...yeah the top exploit coders will have this in a few hours.
What a dumb policy. Why have the disclosure time be so soon? This thing will be in the wild before folks can upgrade if I'm understanding this correctly.
"I vaguely recall at least around 6-7 such holes, and a quick google
search seems to reveal that at least those would have been mitigated
by unprivileged user namespaces being disabled:
CVE-2019-18198
CVE-2020-14386
CVE-2022-0185
CVE-2022-24122
CVE-2022-25636
CVE-2022-1966 resp. CVE-2022-32250"
It won't make that big of a difference. If you exploit the networking layer you could intercept any local traffic, which will mostly be unencrypted, and communicate with outside attackers. You are probably owned by that point unless you treated localhost as untrusted.
It's like why it doesn't matter if you are running as root or not. The user account has access to what's important, like a database or keychain.
A microkernel does seem like the only sensible path forward. Even if the kernel is slowly rustified, we're going to be playing security whack-a-mole for a long time.
Back in the day when the micro-kernel/monolith flamewars were raging, the arguments for monolith were about improved performance and lower memory usage. I haven't seen much discussion on this topic for years, but at least those two arguments have not aged well.
Hm, if you're making the underlying hardware slower, don't you want the kernel to be even faster though?
VMs are much more than microkernels. It's about allowing the user to install whatever they want on their machine. Containers are just a userland abstraction. Not sure where the link to microkernels is there.
Why not? A type 1 hypervisor has less overhead, but it's still not quite the same as running directly on the box. I don't think microkernels would replace those anyway. To be honest, I don't even really see the connection between running most of the kernel in user space and allowing concurrent systems to run on the same hardware.
seL4 with its VMM is a better hypervisor architecture than, say, Xen.
Xen is unfortunately large, and the full hypervisor runs privileged.
With seL4, VM exceptions are forwarded to the VMM, which handles them.
From a security standpoint, a VM escape would only yield VMM privileges, which are no higher than those of the VM itself. This is much better than a compromise of Xen, which would compromise all VMs in the system.
Makatea[0] is an effort to build a Qubes-like system using seL4 and its virtualization support. It is currently funded by an NLnet grant.
Spectre and friends seem to have killed Liedtke’s fast synchronous IPC, unfortunately. Of course, there’s still asynchronous IPC, exokernels (perhaps the closest thing to today’s containers), and so on.
Right now it seems microVMs are the way. Build an extremely minimal tailored kernel+userland for network-facing components. If you don't have nf_tables built in (and it's not loadable because it's not present), this vulnerability isn't a problem. I mean, right now to use it one would have to chain it with an RCE on your userland app (or on the kernel, but then just skip the nf_tables step...). Then one would have to escape the VM, and then, if you're using Firecracker or crosvm, break seccomp. Still imaginable, but by then I guess the next kernel (or userland app) fix release is already available :-) and you're already rebooting your microVM.
If you can CI/CD a reduced kernel+app in minutes and reboot your network-facing thing (be it nginx or haproxy) in 100ms, you might just take the latest vanilla anyway...
For rack servers you could probably get away with a number of microkernel OSes today. The desktop has clear options in that regard, but you are giving up open source.
Alternatively, perhaps we should start thinking about whether it is a good idea to have multiple users of different privilege sharing the same hardware.
"User" in a modern Linux system is just a weird name for "security ___domain". Many programs run as their own user to limit their ability to attack the rest of the system if they get compromised; and limit the ability of a different compromised component from attacking them.
My desktop, on which I am the only person with an account, has 49 "users", of which 11 are actively running a process.
At work, every daemon we run has a dedicated user.
To elaborate, seL4 claims to be the fastest kernel around[0], a claim that remains unchallenged.
To put it into context, the difference in IPC speed is such that you'd need an order of magnitude more IPC for a multiserver system based on seL4 to actually be slower than Linux.
A multiserver design would imply increased IPC use, but not an order of magnitude.
Sorry I'm pretty naive to this space. I didn't immediately see any performance info on that page save for this paper [0] which shows seL4 competitive with NetBSD, but far from Linux. Is there something else I should look at?
No, it doesn’t. Here’s the full quote from their website:
> seL4 is the world’s fastest operating system kernel designed for security and safety
Linux is arguably not designed for security and safety but it blows seL4 out of the water when it comes to performance. There’s a reason it only gets used in contexts where security is critical; I would have expected that you would be aware of this considering you were the one who is promoting it.
No, you don’t get to define the benchmarks like that. People use an OS so they can run real-world programs on top of it, not spin it in a loop and see how fast it can do IPC. In a monolithic kernel there’s no need to switch contexts for many things; that’s the entire point of using one. I’m sure that seL4 has a perfectly fast implementation of those operations, but that’s because it sits and does those all day as part of its basic functionality. Optimizing overhead doesn’t win you extra points when you’re comparing against an OS that doesn’t have it at all.
seL4 is an order of magnitude faster at this "overhead" thing. We're talking nanoseconds vs microseconds difference.
The multiserver architecture does indeed imply an elevated use of IPC, but it in no way outweighs the difference in IPC cost.
In this model, data sharing, and the implied locking, is minimized, which as a consequence helps SMP scaling.
DragonFly, while not multiserver proper, took a different direction than FreeBSD and Linux by optimizing IPC and not implementing fine-grained locks, instead favoring concurrent lockless and lock-free servers.
As a consequence, DragonFly scales much better than FreeBSD, and in many benchmarks manages to outperform Linux.
This is despite the tiny development team, particularly so when considered relative to the amount of funding those two systems get.
I am sickened by the effort that's being wasted on a model that we know is bad and does not work. Linux will never be high assurance, secure or scale past a certain point.
Fortunately, no matter how long it'll take, the better technology will win; there's no "performance hack" that a bad system can pull to catch up with the better technology once it's there.
> To elaborate, seL4 claims to be the fastest kernel around[0], a claim that remains unchallenged.
Can I run Firefox or PostgreSQL on seL4? Or another real-world program of comparable complexity? And how does the performance of that compare to Linux or BSD?
That's really the only benchmark that matters; it's not hard to be fast if your kernel is simple, but simple is often also less useful. Terry Davis claimed TempleOS was faster than Linux, and in some ways he was right too. But TempleOS is also much more limited than Linux and, in the end, not all that useful – even Terry ran it inside a VM.
I've heard this sort of claim about seL4 before and tried to look up more detailed information, but I've never really found anything convincing on the topic beyond "TempleOS can do loads more context switches than Linux!" type stuff.
> delete an existing nft rule that uses an nft anonymous set. And an example of the latter operation is an attempt to delete an element from that nft anonymous set after the set gets deleted
I'd be very interested to hear how this can be done by an unprivileged user.
Try to race set add/removals, sure, but if it depends on the set itself getting deleted, that seems… harder.
Andy Lutomirski described some concerns of his own:
> I consider the ability to use CLONE_NEWUSER to acquire CAP_NET_ADMIN over /any/ network namespace and to thus access the network configuration API to be a huge risk. For example, unprivileged users can program iptables. I'll eat my hat if there are no privilege escalations in there.
Honest question: Why did they build an exploit that uses the bug? I always assumed that use-after-free is equivalent to "game over" (i.e. I assumed that local privilege escalation is a given) and it is clear that such a bug must be fixed.
By that I mean, it might be easy or hard to exploit a bug to achieve LPE, but it seems to be redundant to prove that it is possible.
Making a PoC is a great way to convince both yourself and the maintainers that the bug is actually exploitable in the wild and thus a big fucking deal. Alternatively, you might discover that there are some other things going on which turns out to make the bug unexploitable.
Let me rephrase my question: Is there actually such a thing as an "unexploitable use-after-free"? What would that look like? How would you reason that it is actually unexploitable?
Context: My experience with C programming is that practically every bug that is related to memory management tends to blow up right into your face, at the most inconvenient time possible.
struct foo *whatever = new_foo();
// use 'whatever'
free_foo(whatever);
if (whatever->did_something) {  // use-after-free read: 'whatever' is dangling here
    log_message("The whatever did something.");
}
// never use 'whatever' after this point
The 'whatever' variable is used after what it points to is freed, but it's not exploitable. Worst case, if new memory gets allocated in its place and an attacker controls the data at the offset of the 'did_something' field, the attacker can control whether we log a message or not, which isn't a security vulnerability.
What happens if the code gets pre-empted between free_foo(whatever) and the if-statement, memory allocation gets changed, and subsequently dereferencing the pointer to read whatever->did_something causes a page fault?
I am making assumptions here: that pre-emption is possible (at least some interrupts are enabled), that "whatever" points to virtual memory (some architectures have non-mappable physical memory pointers), and that a page fault at this point is actually harmful.
However, I do want to point out that the reasoning for why your example is not exploitable isn't as easy as it first seems.
No preemption is needed, the call to free might unmap the page the pointer points to. I was considering adding a paragraph about that but didn't bother. A page fault isn't a privilege escalation issue though, it's a pretty normal thing.
> What would that look like? How would you reason that it is actually unexploitable?
For a use-after-free to be exploitable, by definition an attacker must be able to put arbitrary content in the freed memory region. This is not always easy: it may require a certain [mis]configuration, data layout, and so on.
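By way of contrast, here is a hypothetical sketch (invented code, not the nf_tables bug) of the shape an exploitable use-after-free usually takes: the freed object holds something security-critical, like a function pointer, and the attacker can reallocate the slot with bytes they control before the stale pointer is used:

#include <stdlib.h>
#include <string.h>

struct handler {
    void (*on_event)(void);   /* security-critical: controls execution */
};

static void benign(void) { /* normal behavior */ }

int main(void)
{
    struct handler *h = malloc(sizeof(*h));
    h->on_event = benign;

    free(h);                   /* the object dies here... */

    /* An allocation of the same size may reuse the freed chunk; this
       memset stands in for attacker-controlled data landing there. */
    char *spray = malloc(sizeof(*h));
    memset(spray, 0x41, sizeof(*h));

    h->on_event();             /* ...but is called through a dangling
                                  pointer: control flow may go to the
                                  attacker-chosen 0x4141... */
    return 0;
}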
> practically every bug that is related to memory management tends to blow up right into your face, at the most inconvenient time possible.
I will not contest this claim, however there is a difference between "blow up" and "exploit". A malicious packet being able to segfault a server is one thing; a malicious packet resulting in RCE is quite another. This may be a lost-in-translation moment, since in colloquial use "exploit" does not include DoS.
In the next, not-yet-released version 0.8.0, there will be a new option to disable a specific namespace type per sandbox. For example, disabling the network namespace would prevent this exploit.
This is more flexible than globally disabling all user namespaces, as some programs use other, more harmless namespace types; Steam, for example, uses mount namespaces to set up runtime libraries.
For modern distros, the nft package includes an alternative binary that takes the place of /sbin/iptables and translates the input to an nft-compatible format. As far as the kernel is concerned, iptables is still iptables. The old iptables can be accessed by calling the iptables-legacy binary, which will auto-load the old iptables .ko.
Yes, AFAIU (not an expert), iptables and nftables are two command line tools and abstractions (chains vs. tables) for interacting with the same underlying netfilter API.
If it says (nf_tables), you are using the compatibility layer from the iptables-nft package.
It works quite well. Apps like Docker that insert rules using the legacy iptables syntax are oblivious to the fact that they are actually inserting nftables rules.
It also provides an easy migration path: insert your old rules using your iptables script, then list them in the new syntax using `nft list ruleset`.
The problem is that it works so well that it seems most users just stayed with the iptables syntax and did not bother migrating at all.
IMO, the problem is that the people who created nftables (and the "ip" tool) couldn't create a user interface that anyone but themselves would like to use. Linux traffic shaping functionality suffers from the same "obscure word soup" interface.
I agree about the "ip" tool (from iproute2)... I got used to it, but I still prefer the ifconfig output. It is somehow consistent and you can get used to it.
I somehow got accustomed to the nftables rules format. It is in fact objectively much better than the iptables format in many ways. The native JSON, easy bulk submit to the kernel, built-in sets and maps (the source of the currently discussed CVE though). It really does fix a lot of what was wrong with iptables.
But iptables was probably not broken enough for most users to warrant re-learning everything.
Now, the traffic shaping tool, oof... I still cannot grok any of it. I've been happy with the fireqos script so far to abstract everything out of the tc syntax.
I wouldn't generally expect a use-after-free to result from improper pointer arithmetic; that's the recipe for a buffer overflow. But Rust happens to also be well-known for helping manage object lifetimes, which seems to be what went wrong here.
So to me (someone who is not an expert in this code) it looks like the fix is checking if the set has the anonymous flag before changing the reference count. I could be mistaken, but I think your claim that this would be fixed by Rust object lifetime checking requires better evidence.
I think a Rust-influenced design would have shied away from the manual direct reference count management in the first place and resulted in a fairly different-looking API; but at a minimum I'd expect that the safe wrapper `nf_tables_activate_set` would probably have existed from the beginning, and may have been designed to transfer ownership of the `nft_set` rather than just capture a reference to it.
More generally: doing a line-by-line translation from C to Rust is never going to be the best way to make use of the capabilities Rust has that C lacks.
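For illustration, a simplified sketch (invented names and fields, not the actual kernel code) of that wrapper idea: the anonymous-set special case lives in one activate/deactivate pair instead of being repeated at every call site that touches the use count:

#include <stdbool.h>

struct set {
    unsigned int use;   /* how many rules currently reference this set */
    bool anonymous;     /* anonymous sets are bound to exactly one rule */
    bool bound;
};

/* Call sites go through these instead of doing set->use++ and the
   anonymous-set bookkeeping by hand, so the invariant cannot be
   forgotten at any single site. */
static void set_activate(struct set *s)
{
    if (s->anonymous)
        s->bound = true;    /* re-link the set when its rule comes back */
    s->use++;
}

static void set_deactivate(struct set *s)
{
    if (s->anonymous)
        s->bound = false;
    s->use--;
}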
One of the parts of Rust’s safety story is to always use smart pointers for reference counting rather than the type of ad-hoc manual reference count management seen in the code you quoted. Combined with lifetime checking, it makes it impossible for some random logic error to cause a use-after-free.
> to always use smart pointers for reference counting
Agree - and the Linux kernel is extremely fragile because it is full of ad-hoc manual code like that.
Unfortunately, Rust won't come to the rescue, because (in the foreseeable future) Rust will only be available in leaf code, due to the many hard problems of transitioning from fragile C APIs to something better. Writing drivers in Rust is useful, but it limits the scope of how Rust helps.
Many of Rust's advantages at a tiny fraction of the effort could be had easily with a smooth transition path by switching the compiler from C to C++ mode. The fruit hangs so low, it nearly touches the ground, but a silly Linus rejects C++ for the wrong reasons ("to keep the C++ programmers out", wtf).
Every time I work on the Linux kernel source, I'm horrified by how much pain the kernel developers inflict on themselves. Even with C, it would be possible to install a mandatory coding style that is less fragile.
For example, in the aftermath of the Dirty Pipe vulnerability last year, I submitted a patch to make the code less fragile, a coding style that would have prevented the vulnerability: https://lore.kernel.org/lkml/20220225185431.2617232-4-max.ke... - but my patch went nowhere.
We’ll see. As far as I know, the biggest blocker to using Rust outside of drivers is the fact that LLVM lacks support for some architectures Linux supports. And rustc_codegen_gcc seems on track to fix that eventually; even if it takes years more, that’s not much time on the scale of Linux’s development history.
That wouldn't solve the hard problems I meant. Rust portability is an easy problem - it's clear how to port Rust to more architectures, just nobody has done it. But doing interop between Rust and C in both directions, with complicated things like RCU in between - that is a hard and complex problem.
Correction: it is impossible in safe Rust that only ever calls safe Rust. The moment you're calling unsafe Rust, the possibility returns.
Not saying Rust isn't an improvement, it's a huge improvement over C, but there's no reason to oversell it. Rust is not going to make these errors magically go away, at least not in a kernel, even if you wrote the kernel from scratch, all in Rust. Unless you managed to write all of it in safe Rust which... good luck with that.