tych0 · on May 19, 2023

Author here (hi Sargun), it's not really about rediscovering killable vs. unkillable waits, and any confusion is probably a result of my poor writing.

The crux of it is that once you've called exit_signals() from do_exit(), signals will not get delivered. So if you subsequently use the kernel's completions or other wait code, you will not get the signal from zap_pid_ns_processes(), so you don't know to wake up and exit.

There's a test case here if people want to play around: https://github.com/tych0/kernel-utils/tree/master/fuse2

sargun · on May 19, 2023

Hi Tycho!

I'm glad you inherited this :).

Oh, I wasn't suggesting that it was about killable vs. unkillable.

Couple of things:

1. Should prepare_to_wait_event check whether the task is in PF_EXITING and, if so, refuse to wait unless a specific flag is provided? I'd be curious: if you just added a kprobe to prepare_to_wait_event that checks for PF_EXITING, how many of the hits would turn out to be valid?

2. Following this:

  zap_pid_ns_processes ->
     __fatal_signal_pending(task)
     group_send_sig_info
       do_send_sig_info
         send_signal_locked
           __send_signal_locked -> (jump to out_set)
             sigaddset // It has the pending signal here
             ....
             complete_signal

Shouldn't it wake up even if it's in PF_EXITING? That would trigger a reassessment of the condition, and then the `__fatal_signal_pending` check would make it return -ERESTARTSYS.

One note, in the post:

  # grep Pnd /proc/1544574/status
  SigPnd: 0000000000000000
  ShdPnd: 0000000000000100

> Viewing process status this way, you can see 0x100 (i.e. the 9th bit is set) under SigPnd, which is the signal number corresponding to SIGKILL.

Shouldn't it be "ShdPnd"?
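
For anyone who wants to decode those masks without counting hex digits, here is a minimal Python sketch. It assumes the usual /proc/<pid>/status layout, where bit (n - 1) of the hex mask stands for signal number n; the helper names are made up for illustration.

```python
def pending_signals(mask_hex):
    """Return the signal numbers set in a SigPnd/ShdPnd hex mask."""
    mask = int(mask_hex, 16)
    # Bit (n - 1) of the mask corresponds to signal number n.
    return [n for n in range(1, 65) if mask & (1 << (n - 1))]


def parse_pending(status_text):
    """Map 'SigPnd' and 'ShdPnd' to lists of pending signal numbers."""
    result = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("SigPnd", "ShdPnd"):
            result[key] = pending_signals(value.strip())
    return result


# The two lines from the post, pasted in as sample input.
sample = "SigPnd: 0000000000000000\nShdPnd: 0000000000000100\n"
pending = parse_pending(sample)
print(pending)  # signal 9 is SIGKILL on Linux
```

On the sample above, the only pending signal is 9, and it shows up under ShdPnd (the process-wide shared queue) rather than SigPnd (the per-thread queue).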

tych0 · on May 19, 2023

> Couple of things: 1. Should prepare_to_wait_event check if the task is in PF_EXITING, and if so, refuse to wait unless a specific flag is provided? I'd be curious if you just add a kprobe to prepare_to_wait_event that checks for PF_EXITING, how many cases are valid?

I would argue they're all invalid if PF_EXITING is present. Maybe I should send a patch to WARN() and see how much I get yelled at.

> Shouldn't it wake up even if it's in PF_EXITING? That would trigger a reassessment of the condition, and then the `__fatal_signal_pending` check would make it return -ERESTARTSYS.

No, because the signal doesn't get delivered by complete_signal(). wants_signal() returns false if PF_EXITING is set. (Another maybe-interesting thing would be to just delete that check.) Or am I misunderstanding you?
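
To make the ordering of that check concrete, here is a toy userspace model in Python. It is a sketch from my reading of kernel/signal.c, not kernel code: the Task class and its field names are hypothetical stand-ins for struct task_struct, and the last two checks are simplified.

```python
from dataclasses import dataclass, field

PF_EXITING = 0x00000004  # flag value as defined in include/linux/sched.h
SIGKILL = 9


@dataclass
class Task:
    """Hypothetical stand-in for the few task_struct fields that matter here."""
    flags: int = 0
    blocked: set = field(default_factory=set)
    stopped_or_traced: bool = False
    running_on_cpu: bool = False
    has_pending: bool = False


def wants_signal(sig, task):
    """Simplified model of the kernel's wants_signal() decision."""
    if sig in task.blocked:
        return False
    if task.flags & PF_EXITING:
        # Checked before the SIGKILL shortcut, so even SIGKILL is refused.
        return False
    if sig == SIGKILL:
        return True
    if task.stopped_or_traced:
        return False
    return task.running_on_cpu or not task.has_pending


live = Task()
exiting = Task(flags=PF_EXITING)
print(wants_signal(SIGKILL, live), wants_signal(SIGKILL, exiting))
```

In this model the exiting task never counts as a target, so complete_signal() has nobody to wake even though the signal is sitting in the shared pending set.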

> Shouldn't it be "ShdPnd"

derp, fixed, thanks.

tych0 · on May 19, 2023

> Or am I misunderstanding you?

Oh, I see, you're suggesting exactly,

> (Another maybe-interesting thing would be to just delete that check.)

I agree.

steelframe · on May 20, 2023

Hi Tycho. I was The Guy at LSS who tested positive for COVID about 12 hours after we sat next to each other at that Japanese restaurant in Vancouver the week before last. I really hope you didn't catch it. So far, to my knowledge, my "blast radius" is just me.

As somebody who has written a non-trivial amount of upstream Linux filesystem code and who is leading the containers team at my current employer, I've found your writing more interesting than perhaps most people on the face of the planet might. I'm also a bit surprised at how often companies write their own custom FUSE filesystems. A lot of them I only hear about as former employees from those companies join mine and then clue me in about their existence. It seems like every large-ish company these days has at least one now.

It looks like you were able to figure things out through some combination of /proc poking, code inspection, and LKML querying. Out of curiosity, would it be feasible for you to have tried enabling some of the kernel hacking options such as WQ_WATCHDOG or DETECT_HUNG_TASK? Do you think that would have sped up your investigation?

Also, my whole career I've been doing ps aux, but TIL about ps awwfux. Which I guess goes to show there's always some gap in one's basic knowledge of Linux foo!

tych0 · on May 21, 2023

> Hi Tycho. I was The Guy at LSS who tested positive for COVID about 12 hours after we sat next to each other at that Japanese restaurant in Vancouver the week before last. I really hope you didn't catch it. So far, to my knowledge, my "blast radius" is just me.

Hi Mike. So far so good for me.

> It looks like you were able to figure things out through some combination of /proc poking, code inspection, and LKML querying. Out of curiosity, would it be feasible for you to have tried enabling some of the kernel hacking options such as WQ_WATCHDOG or DETECT_HUNG_TASK? Do you think that would have sped up your investigation?

We do have both of these enabled, and have alerts to log them in the fleet. I have found them very useful for saying "there's a bug", but not generally helpful for debugging it. However, we wouldn't catch these things without user reports if we didn't have those tools.

Something that might (?) be useful is something like lockdep for when there are hung tasks. It wouldn't have helped in this case, since it was a bug in signal wakeups, but e.g. in the xfs case I cited at the bottom maybe it would.

loeg · on May 19, 2023

Should processes not be able to wait after exit_signals? That seems like a plausible invariant.

tych0 · on May 19, 2023

I think they definitely should not. I've considered sending a patch that adds a WARN() or some syzkaller test for it or something, especially now that I've seen it in other filesystems.

loeg · on May 19, 2023

Makes sense to me.

avianlyric · on May 19, 2023

I think that’s the point. Currently doing that will potentially result in a deadlock.

loeg · on May 19, 2023

Well, only if the wait is for userspace or a remote resource, right? Regular disks are sometimes considered infallible (or at least, the IO will time out eventually in the generic SCSI logic) and might be ok to wait on.

To generalize a bit, I think the problem is doing any sort of interruptible wait -- because we can no longer be interrupted. Uninterruptible waits aren't any different without signal delivery. I might be oversimplifying, though.

mjevans · on May 19, 2023

It sounds like exit_signals() is being called too early, and based on the test case linked this might be a library issue rather than a code or kernel issue?

Edit: Having read the article, it's clearer that this happens in the kernel's do_exit():

  do_exit() {
    ...
    exit_signals(tsk); /* sets PF_EXITING */
    ...
    exit_files(tsk);
    ...
  }

Would a better solution not be to call exit_signals(tsk) later in do_exit(), after all possible signal sources are exhausted?

cryptonector · on May 20, 2023

It doesn't matter. Filesystem waits are historically non-interruptible. The correct fix is indeed to allow the flushes to fail fast rather than wait forever.

loeg · on May 19, 2023

> It sounds like exit_signals() is being called too early

Or zap_pid_ns too late, yeah.

cryptonector · on May 20, 2023

Later would be better, no? Since it'd allow the FUSE process to outlive the init process, thus allowing the flushes to complete.

tych0 · on Feb 17, 2023

docker-the-company maintained https://github.com/linuxkit/linuxkit when I worked there. I have no idea who maintains it now, but it looks like it is still active (presumably still docker-the-company, since their adopters list [1] lists docker desktop).

[1]: https://github.com/linuxkit/linuxkit/blob/master/ADOPTERS.md

tych0 · on Nov 10, 2022

And at least in SBF/Alameda's case, they did, and you can google it. IIUC it was basic arbitrage, the hard part was figuring out how to interface with Japan's banks.

Maybe the guy is a bad dude, I don't have a horse in that race. But lots of trading strategies that have worked in the past are well known.

tych0 · on Sept 6, 2022

Use the freezer cgroup to freeze everything, then kill it all off however you like.
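
A rough Python sketch of that sequence, assuming the cgroup v1 freezer interface (freezer.state accepting FROZEN and THAWED, with cgroup.procs listing the PIDs). The function name is made up, the kill callable is injectable so the demo below never signals a real process, and the thaw at the end assumes frozen tasks only act on the queued SIGKILL once thawed.

```python
import os
import signal
import tempfile


def freeze_and_kill(cgroup_dir, kill=os.kill):
    """Freeze a v1 freezer cgroup, SIGKILL every PID in it, then thaw."""
    state = os.path.join(cgroup_dir, "freezer.state")
    with open(state, "w") as f:
        f.write("FROZEN")
    # Real code would poll freezer.state here until it reads back FROZEN.
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        pids = [int(line) for line in f if line.strip()]
    # Nothing can fork or exit while frozen, so this PID list is stable.
    for pid in pids:
        kill(pid, signal.SIGKILL)
    with open(state, "w") as f:
        f.write("THAWED")
    return pids


# Dry run against a fake cgroup directory with a recording kill function.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "cgroup.procs"), "w") as f:
    f.write("101\n202\n")
sent = []
killed = freeze_and_kill(demo_dir, kill=lambda pid, sig: sent.append((pid, sig)))
print(killed, sent)
```

The point of freezing first is that nothing in the group can fork between reading the PID list and sending the signals. On cgroup v2 the file names differ (cgroup.freeze instead of freezer.state), so treat the paths above as v1-only assumptions.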

tych0 · on July 19, 2022

> Could it be that some Oracle employed contributors are using a different email when submitting pull requests?

Yes, and if you contribute from an e-mail address that Greg K-H's scripts don't understand, you get an e-mail from him asking you to disclose your employer if you're willing.

Some companies mandate you contribute from your corporate address, and it is impossible to contribute from some corporate e-mail accounts, since they don't allow SMTP access for use with git send-email. For example, my understanding is that this is the reason for the linux.ibm.com subdomain, though someone at IBM can probably elaborate.

LWN's numbers are probably pretty close to accurate, as I think most people disclose. I haven't read TFA, but I guess Oracle is counting only commits to kernel/+fs/+net/ or something like that.

xbar · on July 19, 2022

kernel/+fs/+mm :

git log --pretty=format:"%<(60,mtrunc)%ae %h %s (%ar)" -i --no-merges v5.17..v5.18 -- fs mm kernel
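
If you want per-domain totals rather than a raw list, a small Python sketch like the one below can tally author e-mail domains. It assumes one address per line, e.g. from `git log --pretty=format:%ae`, the sample addresses are invented, and as noted upthread a domain is only a rough proxy for an employer.

```python
from collections import Counter


def count_domains(author_emails):
    """Tally commits by the domain part of each author e-mail address."""
    counts = Counter()
    for email in author_emails:
        email = email.strip().lower()
        if "@" not in email:
            continue  # skip blank or malformed lines
        counts[email.rsplit("@", 1)[1]] += 1
    return counts


# Invented sample input; real input would be one %ae value per commit.
sample = ["alice@example.com", "bob@corp.example.org", "Carol@Example.com", ""]
tally = count_domains(sample)
print(tally.most_common())
```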

tych0 · on July 14, 2022

> Note that seccomp has limited visibility into recvmsg / sendmsg args because bpf can't dereference syscall arg pointers.

I guess landlock can't help you here since it is still mostly about filesystem access right now, but maybe someday? It looks like "minimal network access control" is on the long term roadmap: https://landlock.io/

l0kod · on July 25, 2022

There is ongoing work to support network access control: https://lore.kernel.org/all/20220621082313.3330667-1-konstan...

tych0 · on May 10, 2022

Hi Matt, speaking of PyCon, thanks for your talk on Qtile many years ago. It remains one of the funniest lightning talks I've seen, and also influenced my talk style to some degree.

https://youtu.be/r_8om4dsEmw

__mharrison__ · on May 11, 2022

Ha ha. Thanks. I've had a comment similar to that a few times. That was probably the most nervous I've ever been for a talk even though it was the shortest talk I've ever given.

Any links for your talks?

tych0 · on April 15, 2022

I have news for you: drivers don't think laws apply to them either. I've nearly been killed several times because of it, and I know people who have been killed.

RIP Dan Spira, one of the faster guys in Denver at the time of his death: https://www.bicyclecolorado.org/join-us/donate/remembering-d...

Thanks to Bicycle Colorado's advocacy, perhaps I won't have drivers intentionally trying to side swipe me next time I ride through stop signs in a safe (and legal!) manner.

tych0 · on April 6, 2022

> Files that were previously in /usr/bin or in /bin can now be found in EITHER of these locations, since one symlinks the other. So no previous expectation was really broken.

I don't know, I just hit breakage the other day. I have /usr/bin before /bin in my path (which is the default on Ubuntu at least), and I have muscle memory to use dpkg -S `which $foo` to figure out which package a binary belongs to. That doesn't work if dpkg thinks the binary is in /bin (e.g. ping), since it'll ask dpkg who installed /usr/bin/ping, which is nobody.

It is small fiddly things like this all over people's packaging and personal scripts that break.
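
One way to paper over this in a personal script is to ask dpkg about both spellings of the path. A minimal Python sketch, where the function name and the list of merged directories are my own assumptions rather than anything dpkg provides:

```python
import os

# Assumed set of top-level directories that a merged /usr symlinks into /usr.
MERGED_DIRS = ("bin", "sbin", "lib")


def dpkg_candidates(path):
    """Return the path plus its merged-/usr alias, if it has one."""
    path = os.path.normpath(path)
    candidates = [path]
    parts = path.split("/")  # "/usr/bin/ping" -> ["", "usr", "bin", "ping"]
    if len(parts) > 3 and parts[1] == "usr" and parts[2] in MERGED_DIRS:
        candidates.append("/" + "/".join(parts[2:]))  # /usr/bin/x -> /bin/x
    elif len(parts) > 2 and parts[1] in MERGED_DIRS:
        candidates.append("/usr" + path)  # /bin/x -> /usr/bin/x
    return candidates


print(dpkg_candidates("/usr/bin/ping"))
```

The idea is to run dpkg -S on each candidate in turn and take the first one that matches, instead of only the path that which happens to print.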

ramses0 · on April 6, 2022

This is a _very_ clear P.O.V.:

Who installed '/foo/bar/baz' when '/foo' is a symlink to '/usr/bin'?

I'm 100% in favor of the DPKG maintainer's perspective of "do ugly symlink farms" and then "reap what you sow" (ie: if you don't like there being a symlink there, then fix the offending package).

tych0 · on March 2, 2022

Fastmail has great calendar (CalDAV), contacts (whatever the open protocol for this is) support in addition to their e-mail. I migrated from Google Calendar & friends several years ago and have had no issues.

nicholasjarnold · on March 2, 2022

Another vote for this option being a viable, and dare I say good, one. In addition to the CalDAV, contacts and email you can use davx5[0] on Android to help with syncing to your (root-not-required) Android device.

This setup has worked well for me for a little under a year at this point.

[0]: https://www.davx5.com/