Software, technology, sysadmin war stories, and more. Feed
Sunday, March 2, 2014

Segfaulting atop and another trip down the rabbit hole

Let's say you're writing a program which is intended to take a snapshot of your system's status every minute or so. The idea is to grab whatever data you might have looked at directly had you been on the box at that moment. You might have the results of an all-encompassing call to "ps", "netstat", "lsof" or whatever else suits your fancy.

Odds are, after some iterations of this, you'd end up with a tool which did some kind of binary-level data logging so as to be more efficient. Even once a minute is (usually) 1440 samples per day, after all. So there you are, with your binary blobs and a growing file. Your logic for the writer probably looks something like this:

  1. Open file for writing and appending
  2. Write some stuff
  3. Write some more stuff
  4. Possibly write even more
  5. Close the file

Then you shut down or go to sleep or whatever until the next time you're supposed to take another snapshot. If you're the "sysstat" package, this probably means having one of your binaries run by cron in a minute or two. If you're "atop", then you're probably a long-lived daemon of some sort which just sleeps in the middle.

So now here's where things get interesting. What if you get interrupted in the middle of this? What happens now? The box might go down, or you might get OOM-killed, or some runaway admin might hit you with a SIGKILL by accident, or someone might decide it's time to upgrade your binary and down you go.

Now, when it comes to data integrity, there's the obvious angle of worrying about whether you use fsync() and if your filesystem is smart enough to put things back in a sensible state, but that's not the end of it. Remember the list earlier? You probably did multiple writes to the data file. What if the first one got in but you died before making any others? Chances are, you have the beginning of a record written to your file but nothing else.

This now brings up two questions: what happens when you go to read this file, and what happens when it's time to continue writing to this file? Trying to read this last (interrupted) record from the file will probably result in an EOF. If the program doesn't want to do anything with that, I can certainly understand. That makes sense.

The situation of having the end of the file contain a partial record will last just long enough until the writer starts up. Then, it'll probably run the list as above: open for writing and appending, spew out some blobs, close the file, and go to sleep again. Now you have a file which has a bunch of good records, one broken attempt at a record, and then probably more good records.

What happens when the reader comes along and hits this? If you're "atop", you pull in data, assume it's a record, and use it as such. Since things are completely mis-aligned in that file now, this means incredible and impossible numbers appear in fields because they are actually part of different fields from the next (complete) record.

In my experience, the first time anyone finds out about this is when they go to run atop after a system crash and it segfaults. Then they go to find out why it segfaulted and realize it's happening in mktime() or somesuch. It takes a while of backing up to find out that the segfault is caused by dereferencing a NULL. Where did that NULL come from? A call to localtime(), of course. Wait, what?

Did you catch that? localtime() almost always returns a pointer to a struct tm, except when there's some kind of error. It seems simple enough: you give it a time_t (seconds since 1970-01-01 00:00:00Z, more or less) and it gives you back a structure which has things broken out into a bunch of neat little ints.

It does... except when it doesn't, of course.

From the ctime(3) man page on my Linux box:

Each of these functions returns the value described, or NULL (-1 in case of mktime()) in case an error was detected.

Got that? NULL can and will happen. Still, we're talking about taking a number and slicing it up to make it into years, months, days, and so on. That's just a bunch of division and subtraction and stuff like that. How can you make that fail?

Easy. Give it a huge number.

$ cat bigtime.cc
#include <stdio.h>
#include <time.h>
void check(time_t t) {
  struct tm* tt = localtime(&t);
  printf("%ld = %p\n", t, tt);
int main() {
  return 0;
$ g++ -Wall -o bigtime bigtime.cc && ./bigtime
67768036191705599 = 0x7f890471fee0
67768036191705600 = (nil)

This is about the point where you start thinking "just what is so magical about 67768036191705600", and, well, I can give you a hint. Think about it in terms of years.

Here, look at just one second before that.

$ date -d @67768036191705599
Wed Dec 31 23:59:59 PST 2147485547

That "2147" is what got me paying attention. That magical "highest number you can pass to this version of localtime()" yields a year which is in the neighborhood of 2^31, aka 2147483648, aka 0x80000000. It's actually a little past it - by 1899. Remember that for later.

tm_year is documented as an int, not an unsigned int, so anything with that high bit set would actually be negative. That gives us an effective top end of 2147483647 or 0x7fffffff.

Given all of this, how are we rocking a year which is somehow above that value? For this, we must also remember what tm_year stands for. It's not exactly the year number. Oh no. Quoting from the man page yet again:

tm_year : The number of years since 1900.

Remember all of that software which went goofy about 14 years ago because people were doing things like this?

printf("The year is 19%02d.\n", t.tm_year);

That gave us such gems as "The year is 19100.", or "19114" as it would say today, assuming it's been unmaintained all of that time.

That's how we seemingly get past the (2^31)-1 year limit: it's offset.

So let's pop up a few levels here, and get back to talking about atop. atop is using data which is bad, and it blows up. Due to the nature of how it traverses a file, it's now impossible to seek past this bad record, since it always starts from 0 and reads along until it gets to the time offset you requested. Now what?

This is how it came to me: atop was blowing up on machines, seemingly randomly, but usually right after a machine went down. This is exactly when people wanted to use it to look for signs of insanity, and it would let them down unless they were super careful to only request times before the first crash. They couldn't see anything beyond it.

My investigation took me through all of the above and the inescapable conclusion that we were getting incomplete data. However, I suspected that there was good data in that file past the interruption, if only we could get to it. This got me looking for some kind of "beginning of record" marker. I found none. I looked for an "end of record" marker, and again found none. That would be less ideal since it would mean potentially skipping the first good record to be sure of a sync, but it wasn't there, either.

You can't just skip a fixed number of bytes, since each record is variable length. The first part of a record has a header which says "this many entries are coming", and then you get that many entries. If some or all of them fail to appear (due to the crash), then you're in trouble.

Ultimately, I resorted to brute force. There are a number of fields in the header which are always set to zero since they are "reserved for future use". There's also the time field which needs to be in a small range of values: basically a couple of years ago to a handful of years in the future. I can use them to figure out if I'm lined up with the record header.

So now here's what happens: I read a record and then do a sanity check on it. If it seems fine, then I exit to the normal display code and let it do whatever it wants. Otherwise I declare shenanigans and start single-stepping through the file until it seems to be reasonable again. If this succeeds, then we carry on from there. Otherwise, after some number of failed attempts or hitting EOF, we just give up.

That's about where things are now: patching the code means it usually can survive certain kinds of problems. It's a nasty little hack driven by the desire to change as little as possible about some externally-sourced code which may be a moving target.

What would be nicer is if the file format had some stronger assertions about exactly what you were looking at. If it had both a beginning and end marker for a record, then you could be reasonably sure that you had gotten a complete view. If those markers also included some sort of hopefully-unique identifier for each record, you could be pretty confident that you got a single record and didn't somehow pull in parts of two or three.

These markers should probably include a whole bunch of magic numbers, or at least magic bytes within the magic numbers so you can be sure of alignment. If you're using 64 bit types, there's a lot of room for leaving your favorite easter egg values around for the program to find later.

I'm not asking for every little sysadmin type tool to have its own log-backed, transaction-based storage system underneath. I just want them to behave better in the face of the sorts of common problems which send people to those tools in the first place. It doesn't take much to greatly improve the situation.