The bug was not because a system call was involved. It was a multi threaded life...

saagarjha · 2024-12-31T17:38:54 1735666734

No. The kernel has no idea what your lifetimes are. There’s nothing stopping a buggy Rust implementation from handing out a pointer for the syscall (…an unsafe operation!) and then accidentally dropping the owner. To userspace there are no more references and this code is fine. The problem is the kernel doesn’t care what you think, and it has a blank check to write where it wants.

IshKebab · 2024-12-31T18:19:24 1735669164

That's no different to FFI with any C code. There's nothing unique to this being a kernel or a syscall. There are plenty of C libraries that behave in a similar way and can be safely wrapped with Rust by adding the lifetime requirements.

fc417fc802 · 2024-12-31T22:42:13 1735684933

> can be safely wrapped with Rust

They can't. Rust can't verify the safety of the called code once you cross the language boundary. Handing out the pointer is inherently unsafe.

In the user space FFI case at least you might be able to switch to an implementation written in the same (memory safe) language that you are already using. Not so for a syscall.

IshKebab · 2025-01-01T09:25:02 1735723502

Rust can't verify the correctness of the kernel code, but the problem here wasn't incorrect kernel code!

The problem was that the C API exposed by the kernel did not encode lifetime requirements, so they were accidentally violated. Rust APIs (including ones that wrap C interfaces) can encode lifetime requirements, so you get compile time errors if you screw it up.
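As a rough illustration of what "encoding lifetime requirements" in a wrapper can look like, here is a minimal sketch. Everything in it is hypothetical: `RawRequest` and `raw_complete` stand in for a C or syscall interface that keeps a raw pointer and writes through it later, and `PendingWrite` is an invented wrapper name, not a real library type.

```rust
use std::marker::PhantomData;

// Hypothetical stand-in for a C or syscall interface that keeps the raw
// pointer and writes through it at some later point. Not a real API.
struct RawRequest {
    ptr: *mut u8,
    len: usize,
}

// Simulates the foreign side completing the request by filling the buffer.
unsafe fn raw_complete(req: &RawRequest, value: u8) {
    // SAFETY: the caller guarantees `ptr` is valid for `len` bytes of writes.
    unsafe { std::ptr::write_bytes(req.ptr, value, req.len) };
}

/// Safe wrapper: the lifetime 'a ties the request to a mutable borrow of
/// the buffer, so the buffer cannot be dropped, moved, or read while the
/// request is outstanding.
pub struct PendingWrite<'a> {
    raw: RawRequest,
    _buf: PhantomData<&'a mut [u8]>,
}

impl<'a> PendingWrite<'a> {
    /// Hands the buffer's address to the (simulated) foreign side.
    pub fn submit(buf: &'a mut [u8]) -> Self {
        let len = buf.len();
        let ptr = buf.as_mut_ptr();
        PendingWrite {
            raw: RawRequest { ptr, len },
            _buf: PhantomData,
        }
    }

    /// Consumes the request; the borrow of the buffer ends only after this.
    pub fn wait(self, value: u8) {
        // SAFETY: 'a guarantees the buffer is alive and exclusively borrowed.
        unsafe { raw_complete(&self.raw, value) }
    }
}

pub fn demo() -> Vec<u8> {
    let mut buf = vec![0u8; 4];
    let pending = PendingWrite::submit(&mut buf);
    // drop(buf); // would not compile: `buf` is still borrowed by `pending`
    pending.wait(7);
    buf
}
```

One limitation of this sketch: `std::mem::forget(pending)` is safe Rust and would end the borrow without completing the request, so a real wrapper around an asynchronous kernel interface would need more than a lifetime parameter (for example, owning the buffer until completion).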

I don't think you can win this argument by saying "but you have to use `unsafe` to write the Rust wrapper". That's obviously unavoidable.

ryao · 2025-01-01T11:28:07 1735730887

There was no problem with lifetime requirements. The problem was that a pointer to a C++ function that could throw exceptions was passed to a C function. This is undefined behavior because C does not support stack unwinding. If the C function's stack frame has no special requirements for how it is deallocated, then simply deallocating the stack frame will work fine, despite this being undefined behavior. In this case, the C function had very special requirements for being deallocated, so the undefined behavior became stack corruption.

As others have mentioned, this same issue could happen in Rust until very recently. As of Rust 1.81.0, a panic that escapes an `extern "C"` function aborts the process instead of unwinding into C stack frames:

https://blog.rust-lang.org/2024/09/05/Rust-1.81.0.html#abort...

That avoids this issue in Rust. As for avoiding it in C++ code, I have filed bugs against both GCC and LLVM requesting warnings:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118263

https://github.com/llvm/llvm-project/issues/121427

Once the compilers begin emitting warnings, this should not be an issue anymore as long as developers heed the warnings.
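For the Rust side of this, a minimal sketch of a callback that never lets a panic reach foreign frames might look like the following. The names `call_from_c`, `checked_half`, and `half_callback` are hypothetical, and `call_from_c` is written in Rust here only so the example is self-contained; it stands in for a real C function that takes a callback. Using `i32::MIN` as the error code is likewise an assumption made for the example.

```rust
use std::panic;

// Callback type as a C API would declare it.
type Callback = extern "C" fn(i32) -> i32;

// Hypothetical stand-in for a C function that invokes a callback. Its
// stack frames are the ones that must never be unwound through.
fn call_from_c(cb: Callback, arg: i32) -> i32 {
    cb(arg)
}

// Ordinary Rust logic that may panic.
fn checked_half(x: i32) -> i32 {
    if x % 2 != 0 {
        panic!("odd input");
    }
    x / 2
}

// Boundary function: the panic is caught here and turned into an error
// code, so unwinding stops before it reaches the caller's frames.
// Without the catch, Rust 1.81+ aborts the process when the panic
// reaches the `extern "C"` boundary.
extern "C" fn half_callback(x: i32) -> i32 {
    match panic::catch_unwind(|| checked_half(x)) {
        Ok(v) => v,
        // Assumed sentinel: x / 2 can never equal i32::MIN.
        Err(_) => i32::MIN,
    }
}

pub fn demo_callback(arg: i32) -> i32 {
    call_from_c(half_callback, arg)
}
```

With the default `panic = "unwind"` setting, `demo_callback(8)` returns the halved value and `demo_callback(3)` returns the sentinel after the panic message is printed to stderr. If the crate is built with `panic = "abort"`, `catch_unwind` cannot catch anything and the process aborts at the panic.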