Writing

Feed Software, technology, sysadmin war stories, and more.

Sunday, April 29, 2018

Hanging the Linux core dump pipe helper

The Linux kernel has a neat little feature which lets you set up a "helper" to handle core dumps. This is a trick you can use to grab core files from a pipe instead of having them land on your disk. If you have a lot of poorly-behaved programs, you can then do some magic to make a decision about whether to keep the core or whether you can just throw it in the bit bucket.

Most people would probably prefer to keep the core files around, but consider the situation where hundreds or thousands of identical processes are crashing in the same place. Do you really need all of those tens-of-GB files filling up your disk or SSD? I'm guessing you really need one or maybe two, and the rest are just making a mess for you.

Let's say you do this. You write the helper, and the helper talks to its pals on other machines, and they collectively decide whether a given incoming core is high-signal or not. It makes network calls and does other things that can block or otherwise take a long time.

Eventually, you realize that something strange is going on. Programs are crashing with something that should make them dump core, but they never actually go away. Instead, they just stick around. If you have stuff like systemd that supervises processes and only restarts them once they fail and exit, it won't do anything, since they haven't actually exited. They're still around... but they're stuck.

What happened? The core dump helper froze and stopped reading from the kernel. When that happened, the kernel stopped writing to the pipe. That in turn made the core dump process wedge, which meant the program shutdown got stuck, too. That keeps the progress in the process table effectively forever, sort of like a zombie, but even more interesting in that it looks like it's sleeping or maybe running, but it's not the classic "state Z".

I'm not making this up. Here's how you can duplicate it.

First, make a program that'll dump core. Something like this will do nicely.

$ cat bad.cc
#include <stdio.h>
 
int main() {
  char* c = 0;
  printf("%s\n", c);
  return 0;
}

Run it and notice that it dies quickly.

$ time ./bad
Segmentation fault
 
real	0m0.003s
user	0m0.001s
sys	0m0.002s

Easy enough. Now, create a dumb little script that'll play the part of the core dump helper, like this:

#!/bin/sh
/usr/bin/logger "core.dumper: sleeping 60 seconds"
sleep 60

Put that somewhere, like in /root. Make it executable, then point the kernel at it.

# echo "|/root/core.dumper" > /proc/sys/kernel/core_pattern

Now run the crashy program and watch what happens.

$ time ./bad

Notice, it just sits there. Try hitting ^C. It won't do anything.

Eventually, it finally returns to the prompt...

Segmentation fault (core dumped)
 
real	1m0.007s
user	0m0.002s
sys	0m0.001s
$ 

... one minute later, just like we planned. Look in the syslog and there's confirmation of the shenanigans:

Apr 29 20:07:43 edu kernel: [1996338.475296] bad[14289]: segfault at 0 ip 00007fc9d1853ff6 sp 00007ffeedf2a5d8 error 4 in libc-2.23.so[7fc9d17c9000+1c0000]

Apr 29 20:07:43 edu root: core.dumper: sleeping 60 seconds

What's it doing when it's trying to die? Let's start it again and look at it with ps.

$ ps -o pid,stat,wchan -C bad
  PID STAT WCHAN
14303 S+   pipe_wait

Okay, so there's our process, and it's asleep, more or less, and it's waiting for some pipe action, waiting in the kernel. What's it up to, exactly? Let's see the full stack.

$ cat /proc/14303/stack
[<ffffffff811c91b1>] pipe_wait+0x61/0x90
[<ffffffff811c9309>] pipe_write+0xb9/0x420
[<ffffffff811c10c7>] __vfs_write+0xa7/0xf0
[<ffffffff811c115f>] __kernel_write+0x4f/0xf0
[<ffffffff8121ea78>] dump_emit+0x68/0xb0
[<ffffffff8121eb64>] dump_skip+0xa4/0xb0
[<ffffffff81218fac>] elf_core_dump+0x12fc/0x1420
[<ffffffff8121f31a>] do_coredump+0x77a/0xe40
[<ffffffff8109b666>] get_signal+0x276/0x5d0
[<ffffffff81005458>] do_signal+0x28/0x6d0
[<ffffffff810021ee>] exit_to_usermode_loop+0x6e/0xb0
[<ffffffff81002b66>] prepare_exit_to_usermode+0x26/0x30
[<ffffffff81c6fddb>] retint_user+0x8/0xd
[<ffffffffffffffff>] 0xffffffffffffffff
$ 

Now it should all slot into place. We got a signal, started doing a core dumper, started writing it to the pipe, and ... got stuck.

Obviously, since our little test rig exits after 60 seconds, this eventually clears up, but imagine what would happen if it didn't. You'd just get more and more of these things waiting around, trying to push their core dump data at a pipe that will never be serviced.

How can this happen? Well, imagine if the core dumper helper relies on something else on the local host being up and running, like DNS. Maybe it's trying to resolve something via ::1 (or 127.0.0.1 for you legacy types) and the name server isn't running. Maybe dnsmasq or bind or tinydns or whatever WAS listening on there, but it crashed and started dumping a core. But, since the helper depends on DNS being up, it will never get going.

You get a nice little dependency loop and the system just kind of grinds to a halt as things die and pile up behind it.

The moral of the story here is to be careful when you're using this feature. Keep the pipe receiver as simple as possible to avoid any chances of having it get deadlocked. Give it a strong assurance that it'll die if it gets stuck somehow, even at the cost of losing the core data.

Could the kernel be changed to avoid blocking here? I suppose. What's the likelihood of that ever happening? I'd say slim to none. I can see the comments now: "it gets stuck when we stop reading from the pipe" ... "so don't do that, then".

Yes, of course. The standard reply. And nothing improves.

Business as usual. (Feel free to prove me wrong.)