Friday, July 6, 2012

Pull that breaker and see what dies

Let's say you have a multi-level power system for your business. There's your usual utility power, then there's a big UPS or two for your data room stuff, and then you also have a gigantic beast of a generator. The UPS is only really supposed to hold things until the generator can start up, even out, and then take over.

How often does anyone really test that sort of thing? This is my tale of what happened when I realized we hadn't done it at all and decided to change it.

I was riding back to the office with the boss. He had done me a favor and picked me up at the airport, which saved me either a cab ride or paying to park there for the whole week I was gone. We somehow got to talking about actually testing our backup systems, and I mentioned wanting to do it myself. I didn't want to give any warning, just so we could see how things would fare in the closest thing to a real failure scenario. He agreed, and figured I could proceed once we got back to our office.

So, when we rolled in, I took out my master padlock key, unlocked the massive exterior master breaker, and pulled it. Thunk. Inside, the lights went dark (as expected) and then behind me, I was treated to an aural assault as our ridiculously noisy generator started up. It did its thing for a minute or two and then took over. There's an audible dip in its noise when this happens, and then the lights come back on.

I went in to see what had happened and found a whole lot of stuff figuratively burning down. As it turned out, our "110" UPS didn't work at all. It was sitting there in bypass mode, so when I killed utility power, everything sitting on those circuits instantly died. That basically took out everything in our data room.

The only things which survived were the local phones, and that's only because they were hard-wired to our switch, and that switch ran on a separate backup system which had actually worked. I mention this bit about local phones, since we dropped every single remote call and connection when the CSU/DSU racks went down.

Oh, what fun that was.

Anyway, I went back outside and turned the power back on and replaced the lock. Then we started looking at the UPS. Okay, it said bypass. That's curious. It seems to be implemented as a circuit breaker that's obviously "popped", so let's see what happens when we reset it.

*Click* Off. *Click* On. Snap.

It didn't drop the load, but it wouldn't stay in normal mode, either. I did it again, and the same thing happened. It kept bypassing itself. Click, click, snap.

Ultimately, we found out from a repair tech that it had a bad fan and had been trying to tell us about it. Someone must have silenced the alarm or otherwise ignored the helpful hints it was trying to share with us. It wouldn't operate in any mode but bypass with that fan out. That's why it kept tripping the breaker, or so they said.

Despite all of this mayhem, none of my systems suffered too badly. All of the important ones were sitting on my desk (!) instead of living in the data room, and that meant they got to live on a separate UPS. This UPS was also responsible for keeping my workstation going, so it got lots of TLC from me. I made sure it was healthy and swapped the batteries when needed and all of that.

The only real problem my stuff had was that all of those still-running hosts dropped off the network for a few minutes since all of the upstream switches in the data room went down. They could still talk to each other across the switch on my desk (which was also on that same surviving UPS), but they couldn't get to anywhere else, including the outside world.

This was not the first time that keeping my systems out of that data room (the domain of the "network engineers" at that job) actually made them healthier. Besides the "kids on the roof" A/C failure, there was also that random idiot Lucent tech who decided to just unplug some stuff to "make an outlet for the T-Berd" one day. Oh, that was a fun one. I wound up digging up some old pictures to prove that my power strip had in fact been on outlet X and now, surprise surprise, somehow it wasn't.

Oh, and finally? That data room had a sink in it for its first few years of life, since it had been some kind of A/V repair work room (with a vent hood!) in a prior life. Yes, a sink like you'd find in your kitchen, as in "running water"... in a room with millions of dollars of equipment and a whole lot of electricity running around.

I am mildly annoyed that I don't seem to have a picture of this particular gem anywhere. Oh well. You'll just have to take my word for it.