Friday, February 7, 2020

Trying to be too (io)nice created a "killer" directory

Every now and then, someone finds a new way to get Linux processes stuck in unusual places. Six years ago, I tripped over the "killer /proc/pid/cmdline", which would hang ps, top, or really anything else that tried to touch those paths. "environ" and a few others in there would also hang when accessed, for similar reasons involving the process's memory.

A couple of years after that, someone got clever with FUSE and mmap() and those same paths, and that became CVE-2018-1120.

Today's story is also a couple of years old at this point, but I'm only now getting around to telling it. I haven't actually tried to see if it's still a problem, but considering how spread out and split up the Linux kernel world is these days, I'm sure it'll exist in devices for years to come. Put another way, it was quickly patched at the company where this happened, but you might still see it.

It basically goes like this: first, you put your machine on the CFQ I/O scheduler. Maybe it's the default on your distribution, maybe you have a reason for it, or maybe you're just a gearhead and like tuning things. Whatever. The point is, CFQ is now the thing deciding which disk I/O gets serviced, and when.
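(Quick aside: if you're not sure which scheduler a given disk is on, the kernel will tell you through sysfs - the one in brackets is active. "sda" here is just an example device name, and this is roughly what it looks like on a kernel old enough to still carry CFQ:)

    $ cat /sys/block/sda/queue/scheduler
    noop deadline [cfq]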

Next, you have a process that tries to be a good citizen by running with its I/O scheduling class set to "idle". It's doing some disk accesses, but it's fine with them finishing whenever the machine isn't busy doing something else. It ends up opening a directory to do some work in there - probably with opendir() from userspace.
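You can get the idle class from the shell with "ionice -c 3 ...", or the process can ask for it itself with the ioprio_set syscall. Here's a rough sketch of what that good citizen might look like - the ioprio constants are copied out of the kernel's ioprio.h, the directory path is made up, and this is just the shape of the thing, not the actual program from this story:

    /* good_citizen.c - drop to the idle I/O class, then go walk a directory. */
    #include <dirent.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Values from the kernel's ioprio.h. */
    #define IOPRIO_CLASS_IDLE   3
    #define IOPRIO_CLASS_SHIFT  13
    #define IOPRIO_WHO_PROCESS  1
    #define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))

    int main(void)
    {
        /* Same effect as launching under "ionice -c 3": only do disk I/O
         * when the device has nothing better to do. */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) == -1) {
            perror("ioprio_set");
            return 1;
        }

        /* Example path - opening and scanning it is where the kernel-side
         * lock from this story enters the picture. */
        DIR *d = opendir("/some/big/directory");
        if (!d) {
            perror("opendir");
            return 1;
        }

        struct dirent *ent;
        while ((ent = readdir(d)) != NULL) {
            /* ... low-priority housekeeping on each entry ... */
        }

        closedir(d);
        return 0;
    }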

Now, inside the kernel, opening that directory grabbed a lock. I assume it had something to do with read consistency. Much as you might lock a data structure in a multi-threaded process so that nobody changes it under you as you read it, something of the sort apparently happened here.

Then some other process came along that really wanted to make progress on the disk, and from that point on, the I/O was no longer idle. It eventually found its way down to that same directory and got stuck behind the lock. Of course, since it was actively waiting for the lock, it was "doing work", so things weren't idle, so "idle" class processes never ran... so the "good citizen" couldn't finish its work and never dropped the lock... and... you see where this is going?

Say hello to the priority inversion, here shown by the lowest possible priority task managing to block the highest priority task... forever!

In the moment, the workaround involved using ionice to crank the well-meaning process back up from idle, at which point it eventually got going again, finished its work, and the kernel "let go" of that directory, allowing the "real" work on the box to continue through there.
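If you ever need to do the same thing, it's a one-liner - class 2 is "best-effort", the normal default, and the pid is whatever your wedged process happens to be:

    $ sudo ionice -c 2 -p <pid of the stuck process>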

Longer-term, all of those machines needed a slight tweak to their kernel and subsequent rollout: distribution, quick drain, reboot, undrain, repeat. This was all handled automatically, but it did have to be approved by some humans.

Even longer-term than that would be somehow detecting this kind of thing the next time someone gets the wrong mix of locking strategies, prioritization strategies, mitigation strategies, or scheduling techniques in their system, and screaming out that something's wrong. Unfortunately, I think that's still limited to the domain of meatspace.

There's a reason some of us with the grey hair are still useful.

"Wait a minute, isn't that a priority inversion?"

Until we either stop making systems which will hit classical failure modes like this, or start making systems which will somehow grok the big picture, detect them, and then do things about them, we're going to need some amount of systems people in the mix to figure things out.

It's either that, or we're all going to wind up waiting on things that will never resolve themselves... forever.

I know which future I want.