Software, technology, sysadmin war stories, and more. Feed
Sunday, February 21, 2016

A mystery with memory leaks and a magic number

Have you ever seen a number that looks out of place in its current context, or otherwise looks like it belongs somewhere else? Once in a great while, I get that feeling, and sometimes, the hunch turns out to be right. It's just a matter of how you look at it. Here's a situation where it happened to apply as part of a bigger group effort.

There was a system which was supposed to report in with telemetry data but it wasn't managing to pull it off most of the time. It seemed to work on other hosts, but on this one, it had real problems. Someone started looking at this and noticed that the telemetry collector daemon was getting restarted every few minutes. What's more, it was happening because this nanny process thought it was broken.

Looking more closely at the collector's logs showed that indeed, it was broken. One thing it tries to do is run subprocesses on your behalf (think 'netstat' type stuff, kinda-sorta), and then it grabs the output and sends it off to the big stateful storage service. Well, after a few minutes of being up, those subprocesses would fail to start. Specifically, it would fail to fork(). It logged something like ENOMEM -- out of memory.

(Aside: fortunately, this program is smart enough to know that fork can fail and doesn't try to treat "-1" as a pid.)

That got people looking at the memory size of this program, and sure enough, it was growing fairly consistently. Every minute or so, it would grab a bit more than a gigabyte of memory. After a few minutes of this, it ran up against the limits of what the kernel was willing to do, and so the fork (which was really a clone at the syscall level) would fail.

Now the question had changed to "why is this thing leaking memory". The first thing to do was to get the approximate times of when it would grow. A quick loop in a shell to print the size and time provided that, and that gave us log entries to look at. We'd see it grow between 2:30:15 and 2:30:16 and would then check out what got logged around then. A minute later, it would grow around 2:31:15 to 2:31:16 and we'd look again.

These log entries didn't show any smoking guns, but there were a few things to remember. While most of these telemetry collector jobs showed up a lot, a couple of them seemed to show up right around the time the program would start chewing memory. One of them was also complaining about being able to do some network I/O -- something about data being truncated.

At this point I decided to take a whack at it in a debugger. While I'm generally a "debug with printf" sort of person, in this case it was going to take a while to build a new binary, and we had debug symbols, so why not parallelize and try gdb at the same time? After a bit of fooling around in the UI, I got it going and added a breakpoint at the beginning of the common code used to do network-based status counter collections. Then I started it up inside the debugger.

This means the program would run along as usual, and then it would just return to the prompt and wait for input from me. Meanwhile, my other terminal was printing the current process size over and over. Then I could tell it to continue and could watch what happened. Sure thing, once in a while, one of the calls would be followed by the memory utilization jumping up.

After doing this for a bit I realized that these function calls had some parameters which could be used to tell them apart, like the port number in use. The one I suspected was going to port 8080, so I changed my breakpoint to only make it fire when it was called with that value. This worked nicely: the program would run without nearly as many interruptions, then it would break, and I could continue it, and that's when I'd see the growth.

Now this pretty much meant that this particular job was trouble, somehow, but what was going on? There was a particularly good suggestion to trap mmap to see which part of the code was actually calling for more memory. After some false starts, I decided to wait for the breakpoint to fire, and then would manually add a break on mmap on just that thread, because trapping all mmaps was just far too noisy. Then I continued the program and waited.

A few mmaps later, I had something plausible. There was a stack showing mmap, and a bunch of stuff going on leading to it. I left this up on my screen to ponder. Among the noise, one thing stuck out: the reason why it was growing by so much. It looked something like this.

malloc(1213486160) at some/path/to/foo.cc:1234

I don't know why, but that number just seemed... strange. It seemed to belong somewhere else. My next step was the Python interpreter, since it's great for looking at this kind of stuff.

That's when I saw this:

>>> hex(1213486160)

This tells us that 1213486160 expressed in hex is 48 54 54 50. Again, I don't know exactly why, but this tripped something in my head that made me think "printable ASCII, perhaps letters". My next stop was to figure out what those were. I used the "ascii" CLI tool at the time, but for this story I'll use Python to keep it all in one place:

>>> chr(0x48)
>>> chr(0x54)
>>> chr(0x54)
>>> chr(0x50)

Now it all made sense. The number of bytes it was trying to allocate was this crazy number, but that crazy number was actually just the letters "HTTP" taken as a 32 bit value on this particular platform!

Why was it getting back HTTP? That part was easy: it was trying to speak its custom protocol to something on port 8080, and there was some web server going "huh, what?" by returning "HTTP/1.0 400 Bad Request". Apparently the network client code assumes the first four bytes are a length, and tries to set up a buffer of that size and then goes to read that much into it.

Now, since there isn't ~1.2 GB of data waiting on the other end of the pipe, it fails to fill up the buffer, and ultimately throws an error about "not enough data" -- remember the truncation thing before? Exactly.

So what do you do about it? Well, right away, you turn off the thing which is trying to speak non-HTTP to a HTTP port. Then, you see about making the network client code a little less trusting, and maybe add some magic numbers or other sanity checks before it jumps in feet-first. Also, there's that whole memory leak issue when it bails out which needs to be fixed.

This puts the fire out, then keeps it from happening again in the first place to someone else who mixes up their ports or protocols. Then life goes on.

I guess this means 1213486160 is going on my list of magic numbers. Also, it means that 2008-06-14 23:29:20 UTC is, in a strange sense, "HTTP time". If you see either that number or that date (with possible adjustments for your local time zone) showing up in your life inexplicably, this might just be why.

October 7, 2016: This post has an update.