Writing

Feed Software, technology, sysadmin war stories, and more.

Sunday, November 11, 2012

Mercenary mode, or troubleshooting just about anything

Some problems land on my plate in weird ways. One time, I got a mail from a former boss which didn't have me anywhere in the To: or Cc: fields, but still managed to come to me. I figured he had sent it as a "bcc", but could only wonder why. We had both moved on to different things in the intervening years, but he still remembered me.

I pinged him to see what was up. It turned out his new team kept hitting this really strange and annoying problem. It was happening enough to where it was preventing them from getting actual work done. He remembered that I was pretty good at digging into this kind of stuff to find out whatever was actually broken and decided to fling a copy of the mail my way to see if I'd bite.

Well, obviously, I did. Real people having real problems? Sign me up.

Without making any promises, I decided to start sniffing around to see what was going on. I was actually able to reproduce the problem, and he said "we weren't joking about the 70% failure rate". That would certainly explain why they are having so much trouble doing stuff.

They were trying to use a small helper program which was installed on certain machines to ship files around, and it was failing most of the time. I decided to build a new version of that program with full debugging symbols present in order to learn more about where it was dying. I pushed that out to some random machine where I was able to twiddle things as root, and got it to die again. The difference is that this time, I had a stack trace with meaningful symbols, including function names. It was blowing up down in the RPC code, deep within some library which was used by the helper program.

So I started digging around in the RPC code to see what I could learn about it. After some poking and prodding and instrumenting a few things, I found even more about what was dying: it was a new[] which was generating a SIGABRT. Yes, it was one of those cases where new can actually fail, and since the policy was to not do exceptions in C++, it fell to the default which is to call abort().

This raised a new question: why can't it call new[]? These machines have tons of memory and it should always succeed. It's not asking for that much memory, is it? I looked around and found that this program normally ran in a relatively small space created by way of ulimit. It was gulping down memory, hitting the hard limit, failing to allocate more, and the rest came crashing down. Removing that ulimit was not an option, so this thing had to go on a diet somehow.

I tried turning off compression. It was good, but it wasn't enough. I think it saved something like 6 MB in overhead. Having reached the limits of things I could do quickly, I reached out to a friend who worked on the RPC code and asked him about what sort of memory overhead it would have when pushing certain quantities of data through it.

He said it had a blowup factor of at least X. I forget what X was, but it wasn't small. All of that outgoing data from the top-level code would get buffered and copied in the RPC layer, and that would ultimately hit the limits, at which point everything would explode. It wasn't pretty.

We wound up writing something which would snoop on the size of the RPC layer's output buffer. Then, the server process would not transmit anything else until that buffer had dropped down below some arbitrary size. The specifics are lost to the sands of time, but I think it looked like this:

while (bytes_in_buffer() > limit) {
  sleep(some_number);
}

Seriously. It was that simple and that evil at the same time.

This was enough to make it send the file without chewing up memory like crazy, and that got them going. I think this patch eventually wound up being accepted by the team which worked on that service, and ultimately rolled out in production. Finally, at that point, the original team could reliably use the file moving mechanism in that service to get their work done.

Total time elapsed from the weird bcc mail to having the "fix" (while + sleep)? Maybe two hours.

I have no idea how long they were fighting with it before it came to my attention. I get the impression it wasn't an interval measured in hours, though!

If you have something strange which just won't go away, you know what to do.