Writing

Software, technology, sysadmin war stories, and more.

Thursday, July 14, 2011

Server death by heat reveals human failure

One night, I was doing my thing, just doing whatever, while my systems in another state scrolled log entries at me. I used to use "biff", so my terminals would beep and show a few lines when someone or something mailed me. My idle bliss was interrupted by a bunch of notifications from rsync. Mirroring had failed, and was unable to restart by itself. Oh boy.

I jumped on the system in question and tried to traceroute out. It failed at our big core router. I checked on the big core router. It was fine, and its default gateway was intact, pointing at my firewall box. So I pinged that firewall box. Nothing.

This firewall was just an ordinary Linux box, but it had something called a PC Weasel in it, which redirected VGA console writes to a serial port. It basically turned PCs into Real Computers, and it was installed for just this kind of contingency. It didn't respond. This thing existed outside the realm of the operating system, so this was really bad news. Either the box had been powered off, or the serial cable had fallen out. Either one was possible but unlikely, considering it was midnight there, and nobody should have been in the office.

Still working with little data here, I poked around some more. One of my switches reported the link dropping off-line around the same time my rsync mails started. Okay, that's interesting. OS crashes usually don't kill the NIC. I thought, gee, I hope it's not on fire.

This got me thinking about temperatures. I was able to query my UPS and found it was running at 123.4F internally. I didn't know what its baseline was at first, but it seemed rather high. After some digging, I found a log file which showed it normally ran no higher than about 100F. This same log file showed something more disturbing: its internal temp had been climbing steadily since 4 PM without fail.
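The check I did by hand that night can be sketched in a few lines of Python. This is only an illustration, not what I actually ran: the function name, the 100F baseline default (taken from that log file), and the "three rises in a row" rule are all assumptions, and getting readings out of a real UPS is left out entirely.

```python
from typing import Sequence

# Assumption: baseline taken from the log, which showed the UPS
# normally ran no higher than about 100F.
BASELINE_MAX_F = 100.0


def should_alert(
    readings_f: Sequence[float],
    baseline_max_f: float = BASELINE_MAX_F,
    min_rising: int = 3,
) -> bool:
    """Return True if the newest reading is above the baseline, or if
    the last `min_rising` consecutive changes were all increases."""
    if not readings_f:
        return False

    # Case 1: already hotter than anything considered normal.
    if readings_f[-1] > baseline_max_f:
        return True

    # Case 2: still under the baseline, but climbing without a break.
    recent = readings_f[-(min_rising + 1):]
    if len(recent) < min_rising + 1:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

Fed one reading per polling interval, something like this would have complained within a few samples of 4 PM instead of waiting for a machine to die.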

I decided at this point that it was time to wake someone up. The boss was the closest person to the office, so I rang him at around 3 AM his time. Obviously, I woke him up by doing this. He said he'd go right in.

About 20 minutes later, the UPS temperature leveled off and started dropping. Something had changed, and for the better. I found out shortly thereafter that he had arrived to find it somewhere north of 100F in the office area, and he had just forced the doors open to let in the 32F air on that cold January morning.

30 minutes after that, we found out what had happened. Someone had been working on the air handling equipment and had set it to "manual". This made it ignore the thermostats and just keep pumping heat without stopping. The boss threw all of the controls over to "disabled" and left them that way until the morning, when the guilty party could be located and dragged in to fix it.

A couple of days after that, I heard back from one of the "network engineers". He asked me to call him the next time something like that happened. I was just a naive kid at the time, so I figured, okay, sure, whatever. Don't wake up the boss.

Now, with the benefit of many more years behind me, I think I know what happened. I was several states away and I detected a situation that probably would have destroyed all of the equipment in that office. It just happened that one of my machines was our canary, and that I monitored them. For all of the money and power that the so-called network engineers had, they didn't have any sort of monitoring for this stuff. Things would die and they would only notice when some human pinged them.
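Even the crudest monitoring would have beaten "wait for a human to ping us". Here is a minimal sketch of the idea, assuming something else already records when each host was last heard from. The host names, the five-minute threshold, and the function itself are made up for illustration.

```python
from typing import Dict, List


def silent_hosts(
    last_seen: Dict[str, float],
    now: float,
    max_silence_s: float = 300.0,
) -> List[str]:
    """Return the hosts (sorted by name) that have not been heard from
    in more than `max_silence_s` seconds.

    `last_seen` maps a host name to the time, in seconds, of its most
    recent successful check-in. `now` uses the same clock.
    """
    return sorted(
        host
        for host, seen_at in last_seen.items()
        if now - seen_at > max_silence_s
    )


# Hypothetical example: the firewall went quiet 1000 seconds ago,
# while the core router checked in 100 seconds ago.
stale = silent_hosts({"firewall": 0.0, "core-router": 900.0}, now=1000.0)
```

Anything that lands in that list is a reason to page someone, which is the whole point: the machine notices, not a person several states away.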

I bet what happened is that the boss called them on the carpet and demanded to know why some kid in a distant state woke him up in the middle of the night. Why didn't they notice? Why didn't they pick up on this?

You might think he was being hard on them, but not really. The summer before this happened, some kids got up on the roof and shut down the A/C to the data room. Not too long after that, things went nuts. That time, I was also the person who noticed and got the boss to drive in and fix things. Six months had passed and his network people still hadn't added any monitoring.

Now I know why I got that call: they got in trouble. If I had called the network engineer first, he could have gone in and squared it away without the boss ever finding out. It's easier than actually monitoring things, right?