Writing

Software, technology, sysadmin war stories, and more.
Sunday, August 7, 2011

Funny stuff with uptime, or hey, did that just reboot?

One night while I was working as a web hosting support person, we stumbled across something interesting. A particular combination of obscure trivia and some custom monitoring configurations turned into a neat, if short, puzzle.

It came in as a monitoring alert. Some of the higher-end customers had configurations which went beyond the mere "it's up and answering a HEAD / HTTP/1.0" type tests. For them, we had something which would hit a custom URL and look for specific content. This one customer had rigged up something pretty neat. He had written something which would sanity check all of his systems and would print OK if nothing went out of spec. We just had to look for that OK. No OK meant something was up, and it's time to alert a human.
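For the curious, a content check like that can be sketched in a few lines of Python. This is not what our monitoring system actually ran. The function names, the timeout, and the choice to require that the page body be exactly OK (rather than merely contain it) are all assumptions made for illustration.

```python
from urllib.request import urlopen
from urllib.error import URLError


def body_is_ok(body):
    """Return True only if the page body is exactly "OK" (ignoring whitespace).

    Assumption: an exact match, so a page saying "NOT OK" does not pass.
    """
    return body.strip() == "OK"


def check_url(url, timeout=10):
    """Fetch a customer's status URL and return (healthy, detail).

    Hypothetical helper: any fetch error counts as unhealthy, and the
    detail string is what a human would want to see in the alert.
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (URLError, OSError) as exc:
        return False, f"fetch failed: {exc}"
    return body_is_ok(body), body.strip()
```

With something like this, a page reading "uptime less than 30 minutes" comes back unhealthy, and the text of the page rides along so whoever gets paged knows where to start.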

So, sure enough, this guy had a no-OK situation. I went to his custom page and saw what his scripts had detected: "uptime less than 30 minutes". I logged in, and yep, uptime said that. But... it didn't feel like a box which had only been up 30 minutes. I'm not sure how to describe it, but just logging in and looking at 'w' didn't give me that impression.

I looked in the process table. There was lots of stuff running. None of it had started exactly 30 minutes ago. Much of it had been running for days, weeks, or even months. I figured, aha, this sounds like jiffies rollover, where the kernel's tick counter wraps back around to zero, but how to prove it to the customer and other techs who won't just believe it?

Fortunately, Red Hat systems like this one tend to create /var/log/dmesg when they boot and then never touch it again. Sure enough, the modification time on that file was about 497 days in the past -- just enough for 2^32 jiffies to elapse at 100 ticks per second and roll the uptime clock. The machine hadn't actually rebooted. It just looked like it had, thanks to a counter with limited storage space.
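The 497-day figure is easy to check. A quick sketch of the arithmetic, assuming a 32-bit counter and a tick rate of 100 per second (the usual setting on older x86 kernels, though this box's exact configuration is not something I can show you):

```python
# Assumptions: a 32-bit jiffies counter and HZ = 100 ticks per second.
HZ = 100
COUNTER_BITS = 32
SECONDS_PER_DAY = 86_400


def rollover_days(hz=HZ, bits=COUNTER_BITS):
    """Days until a `bits`-bit counter incremented `hz` times a second wraps."""
    return (2 ** bits) / hz / SECONDS_PER_DAY


print(f"{rollover_days():.1f} days")  # prints "497.1 days"
```

A higher tick rate would shorten the period proportionally, so the number only applies to machines ticking at that rate.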

I told the customer what had happened and congratulated him on running such a tight ship that kept a machine up for a year and a half with no funny stuff. Then I said, hey, uh, actually, let's do some maintenance here. You're way behind on kernels, and there have been several privilege escalation vulnerabilities since (whatever he was running) was released. Let's make plans to upgrade your systems and reboot them at a time which works for you.

We figured out the timing, the kernels were upgraded, and life carried on. A few techs and one customer learned yet another little weird fact about life with Linux: uptimes can roll over if you wait long enough.

Epilogue: some months later, a tech asked me how you tell if uptime rolled. I told him the date on /var/log/dmesg would be about 497 days in the past. He looked. It wasn't. It was way way before that. I started thinking I had lost my mojo, but then I looked at it again. The date wasn't just 497 days in the past. It was 497 days in the past, and then another 497 days beyond that. This box had stayed up almost 1000 days without falling over, burning down, being rooted or whatever. Absolutely amazing.
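That same comparison can be turned into a rough estimate of how many times the counter has wrapped. What follows is only a sketch: it assumes the same 32-bit counter at 100 ticks per second, assumes /var/log/dmesg really was written once at boot and left alone, and reads the reported uptime from the first field of /proc/uptime. The function names are made up for illustration.

```python
import os
import time

# Assumption: 32-bit counter at 100 ticks per second, about 497.1 days.
WRAP_SECONDS = 2 ** 32 / 100


def estimate_wraps(real_age_seconds, reported_uptime_seconds,
                   wrap_seconds=WRAP_SECONDS):
    """Estimate how many times the uptime counter has wrapped.

    real_age_seconds: time since the machine actually booted.
    reported_uptime_seconds: what the kernel currently claims.
    """
    gap = real_age_seconds - reported_uptime_seconds
    return max(0, round(gap / wrap_seconds))


def check_host(dmesg_path="/var/log/dmesg", uptime_path="/proc/uptime"):
    """Compare the boot-time log's mtime with the reported uptime.

    Hypothetical helper: only meaningful on a system where the file at
    dmesg_path is created at boot and never modified afterwards.
    """
    real_age = time.time() - os.path.getmtime(dmesg_path)
    with open(uptime_path) as f:
        reported = float(f.read().split()[0])
    return estimate_wraps(real_age, reported)


# A box that really booted two wrap periods plus 30 minutes ago:
print(estimate_wraps(2 * WRAP_SECONDS + 1800, 1800))
```

A result of 0 means the reported uptime is believable; anything higher means the box has been up a lot longer than it admits.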