Software, technology, sysadmin war stories, and more. Feed
Sunday, April 8, 2018

Troubleshooting kernel crashes under pressure

Many years ago, there was a popular service. It was growing like crazy. Then, one night, it had a very busy night, for the world was celebrating and user traffic increased accordingly. It was doing okay, but then it started losing machines. One went down, and then another, and another, and pretty soon it was a major problem.

It was New Year's Eve, and they had just hit a kernel bug.

For some reason, machines were hitting kernel panics and were rebooting. This caused that server to go down, and the load went to another machine. Higher load, for some reason, made it much more likely the next machine would die. As the load shuffled onto the remaining servers, the problem just spread more and more.

The first clue that anything was wrong was that the machines had rebooted, naturally. Then, it was clear that they had a kernel panic, and specifically, a "BUG:" was reported. The message was good and clear about where it was and what happened: a NULL pointer dereference (!) within the kernel itself...! What's more, the crash pointed straight at a function: tcp_estats_update_mss.

Now, this wasn't part of stock kernels, but it turned out there was some patch which had been applied to some of these machines. It's been many years, but I found the code online as part of the "web10g" repository, and it seems to sport the bug. Working from the patch, this is what it looks like.

The top of the function in question looks like this:

void tcp_estats_update_mss(struct tcp_sock *tp)
  struct tcp_estats *stats = tp->tcp_stats;
  struct tcp_estats_stack_table *stack_table = stats->tables.stack_table;

We're passing in a pointer (tp), dereferencing that, then chasing down another pointer (tcp_stats), dereferencing that, and then we go looking into "tables.stack_table" within that.

One of those pointers with the -> is probably the NULL, but which one? Back up another level to the call site, bearing in mind that the servers are still crashing, and the site is still dying while the world looks on. You find this:

  tp->mss_cache = mss_now;
  TCP_ESTATS_UPDATE(tp, tcp_estats_update_mss(tp));

So right here, we know that "tp" itself is fine, since otherwise it would die on the first line, and would never make it into tcp_estats_update_mss.

If it's not that, then what? It must be the dereference of tcp_stats.

Where does tcp_stats come from? Well, to answer that, we have to go chase down tcp_estats_create. In it, we find this:

  tp->tcp_stats = stats;

Back up from that... what's stats? It's estats_mem, which happens right above that, so we'll just do this as one big multi-line snippet:

  estats_mem = kzalloc(tcp_estats_get_allocation_size(sysctl), gfp_any());
  if (!estats_mem)
    return -ENOMEM;
  stats = estats_mem;

Okay! Now we're getting somewhere. The tp->tcp_stats that we try to use later (and which blows us up) ultimately calls from a call to kzalloc() right here. This may seem shocking to some and boring to others, but memory allocation attempts in the kernel can actually fail! Yep. You hopefully won't see it often, but it will happen eventually.

What else makes this happen more often? Being under significant load. Like, you know, if your buddies are falling over dead and you are taking up the slack. Yeah. That.

But clearly, something is checking for tcp_estats_create returning -ENOMEM and is disabling this feature for this connection, right?

Well, no, not so much. Here are the four call sites, extracted from various places in the code:

  tcp_estats_create(newsk, TCP_ESTATS_ADDRTYPE_IPV4, TCP_ESTATS_INACTIVE);
  tcp_estats_create(newsk, TCP_ESTATS_ADDRTYPE_IPV6, TCP_ESTATS_INACTIVE);

Notice one key thing: none of them check the return code from the function! It's returning -ENOMEM, but nobody sees it, and nobody finds out.

There are at least two chances to avoid badness here. One is to skip the fancy estats stuff if the 'create' call fails. It doesn't. Another would be to check for NULL before dereferencing tp->tcp_stats. It doesn't do that, either.

So when we found this, the next question was: can we turn this off? It turned out that yes, we could. There was a sysctl knob which gated the entire feature, and so we set it to 0 on all of the machines, and they settled down. Then it was just a matter of keeping the usual config management stuff from flipping it back on before we could get things cleaned up properly.

That was enough to get through the night and let the service recover. Later, some sort of patch was added to make the code not go down the bad path when allocations failed. It should have never happened again.

However, this code is still out there in the wild. Perhaps others have downloaded it and are using it. Maybe they hit a BUG in that function now and then and don't know why, except that it tends to come as a double-whammy when things are straining under memory load. If this is you and you found this page, maybe now you know why.

Happy hunting.