
Sunday, September 4, 2011

Embarrassed by a beeping RAID card

Have you ever heard a RAID alarm? Some of them are these shrill beeper/buzzer things installed on a card which is then buried within the guts of a server. Somehow, you're expected to hear this when it's in a room where the fan noise is so loud that you're supposed to wear ear protection. That was the reality back then, and I bet it still is now.

Set the wayback machine for several years ago. I spent a week on-site at a data center as part of a training exercise. This was one of those things where "support gets to see what happens with the actual hardware". This included things like the shift change and data center handoff to the next crew.

One afternoon, I tagged along as one of the guys did the actual floor inspection. He was looking to make sure everything seemed okay: no broken pipes, lights are working, nothing's fallen over, that sort of high-level stuff. There was one other piece: he was supposed to report on RAID alarms.

To get some idea of what this was like, set the alarm on your microwave for 30 seconds, then run into the next room, close the door, and turn your blow dryer on "high". Then try to tell exactly when the microwave's beeper is sounding and when it's being silent. In our case, we had to contend with the noise and the fact that there were hundreds or thousands of machines which could have been generating it. We just had to walk the rows slowly, pausing between beeps, to zero in on it.

So on that day, we found one and checked it out. My data center escort said, oh, yeah, we know about that one, and pointed to a sticker. Apparently, when they find a new one, they put a sticker on it and then open a ticket so Support can contact the customer. Then downtime can be scheduled to fix whatever might be broken.

For some reason, I found this curious and didn't just walk past it. Instead, I took down the server number on a scrap of paper. Later, I plugged it into the ticketing system, and that's when it got ugly.

Oh, the customer had received a ticket, all right. Someone had opened it and informed them of the problem... and then the ticket had been closed. There was no way for them to update it. So you had a sticker saying "it's under control, since the ticket owns it" and at the same time, the ticket was in a terminal state where the customer couldn't possibly update it. What a mess.

I might have been working in the data center that week but I was still fundamentally a support person, so I could not let that stand. They didn't have headsets but they did have telephones, so I got busy. The guys crowded around to watch as I actually talked to a real live customer. They never did that -- Support was their firewall.

I think I wound up calling somewhere in Toronto. The customer was there and freaked out a little when I asked him about his server. He thought it had been resolved and hearing from me basically meant that we had dropped the ball. He was miffed but said, okay, go ahead and fix it right now. I reopened the dead ticket using my admin powers and we proceeded to do exactly that. His machine was pulled, a drive was swapped, it started rebuilding the mirror, and life went on.

It was at this point that I realized just how embarrassing this situation was. First, we had to detect these things by walking around and hoping to HEAR them. Then we had a sticker which told people whether they needed to care about it or not. Then there was a ticketing scheme in which the whole thing could get lost in the shuffle. All of these things would have come out in a post-mortem after some customer suffered data loss, but why wait?

This is the kind of stuff which bothers me. It's all avoidable.
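
For what it's worth, none of this needs anything exotic. The cards in that data center were hardware RAID with their own vendor tools, so take this as a rough sketch only: a little Python poller that watches Linux's /proc/mdstat (software RAID) for a degraded mirror and yells about it somewhere a human will actually see. A real version would shell out to the vendor's CLI for the card in question and open a ticket that stays open, but the shape of the thing is the same.

    #!/usr/bin/env python3
    # Rough sketch: poll /proc/mdstat for degraded arrays instead of
    # listening for a buzzer. The hardware in this story used vendor
    # RAID cards, so a real version would call that vendor's CLI and
    # open a ticket automatically; this just shows the general idea.

    import re
    import sys
    import time

    MDSTAT = "/proc/mdstat"
    POLL_SECONDS = 300

    def degraded_arrays():
        """Return md device names with a missing member.

        /proc/mdstat shows a status like [UU] for a healthy two-disk
        mirror and [U_] when one member has dropped out.
        """
        bad = []
        current = None
        with open(MDSTAT) as f:
            for line in f:
                m = re.match(r"^(md\d+)\s*:", line)
                if m:
                    current = m.group(1)
                elif current and "[" in line:
                    status = re.search(r"\[([U_]+)\]", line)
                    if status and "_" in status.group(1):
                        bad.append(current)
                    current = None
        return bad

    def main():
        while True:
            for dev in degraded_arrays():
                # This is where you'd open (and keep open!) a ticket,
                # page someone, or otherwise make noise that carries
                # farther than a beeper in a loud room.
                print(f"ALERT: {dev} is degraded", file=sys.stderr)
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        main()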