
Sunday, November 20, 2011

How not to treat a customer's disk array

How about another story starring our good friend "Z"? Last time, he electrocuted himself by cutting through a live wire instead of redoing a bundle the right way. This time, he butchered a customer's system and was shown the door.

The last story was related to me by a friend, but I witnessed this "Z" story myself. It started with a shift handoff from one of the guys on first shift to me, the second shifter in charge of handling this particular group of high-price, high-maintenance customers.

All of the bad stuff had happened during the morning, and the attempted cleanup and recovery was underway. I just needed to be informed, since the customer was seriously pissed off and would undoubtedly call in at least once during the night. Since he'd be dealing with me, it was only right to give me some context so I knew what was coming.

Here's what happened. This particular segment of the company had recently changed policy: they would no longer use a certain kind of RAID controller for their customers. Apparently those controllers had reliability issues. Okay, that's not too strange. Moreover, they were migrating customers off those controllers and onto another type. That's a little more intrusive, but, again, okay. It can be done.

This particular system was one with the old RAID controller, and it had come up on their audit. The customer approved the work, and that morning, our good friend "Z" pulled the box off the rack to do the swap.

The first thing he did was to remove the old card. That went fine. Then he put in the new card and hooked everything up. Strangely, nothing worked! So, he decided to "initialize the array", and then tried to reboot.

It didn't work.

Okay, thought Z, he'd just put the old card back in, re-rack the machine, and have the support people figure out another plan. So, he pulled the new card, swapped the old one back in, and hit the switch.

It didn't work. Again.

So now the machine was totally hosed. For the benefit of those who haven't spotted the problem yet, here's how it failed.

First of all, when it failed to come up post-swap, he should have stopped right there. That in itself was a genuine sign that something was very wrong. Proceeding past that was just ridiculously stupid.

So, when he went past that point, he then did about the worst thing you could do, which was to "initialize the array". This writes the new controller's own gunk across the disks and basically screws up whatever was already on them. That way, when he swapped back to the old controller, that one couldn't work, either!

Had he stopped at that first hurdle, someone might have realized what was actually going on here.

The system was running RAID-5.

The swaps were only supposed to be for RAID-1 (mirroring) customers.

So, in other words, this guy did a swap that could never work and then scribbled garbage all over the disks by doing an "init", thus ensuring there was no way back.
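
If you want to see why that combination is fatal rather than just take my word for it, here's a toy sketch in Python. It is not any real controller's on-disk format (the chunk size, parity rotation, and function names are all made up for illustration), but it shows the essential difference between mirroring and striping with parity:

from functools import reduce

def raid1_write(data, disks):
    # RAID-1: every disk gets a full, independent copy of the data.
    for d in disks:
        d.extend(data)

def raid5_write(data, disks, chunk=4):
    # RAID-5 (toy version): split the data into chunks, spread each
    # stripe's chunks across all but one disk, and put XOR parity on
    # the remaining disk. Which disk holds parity rotates per stripe.
    # Chunk size, rotation order, and disk ordering are exactly the
    # sort of thing only the original controller knows.
    n = len(disks)
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    for stripe_no, i in enumerate(range(0, len(chunks), n - 1)):
        stripe = [c.ljust(chunk, b'\0') for c in chunks[i:i + n - 1]]
        while len(stripe) < n - 1:
            stripe.append(b'\0' * chunk)
        parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), stripe)
        parity_disk = stripe_no % n
        data_disks = [d for k, d in enumerate(disks) if k != parity_disk]
        for d, c in zip(data_disks, stripe):
            d.extend(c)
        disks[parity_disk].extend(parity)

mirror = [bytearray(), bytearray()]
raid1_write(b"the customer's only copy of everything", mirror)
print(bytes(mirror[0]))   # either half alone is the complete data

array = [bytearray(), bytearray(), bytearray()]
raid5_write(b"the customer's only copy of everything", array)
print(bytes(array[0]))    # fragments and parity; useless on its own

With the mirror, either drive on its own holds a complete, readable copy, which is why a controller swap on a RAID-1 box is usually survivable. With the striped-plus-parity layout, the data only makes sense to something that knows the exact geometry, and "initializing" the array replaces that geometry with the new controller's idea of an empty one.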

To make matters even more interesting, this machine was the only one on that customer's account which wasn't being backed up. It had actually been scheduled to move to a part of the data center where it could be put on the backup network. But, eh, that day hadn't arrived yet.

Later that evening, I got the inevitable call from the customer. He was remarkably cool when dealing with me. I guess he realized I had nothing to do with it, and that throwing a fit at me would accomplish little. He had just one question for me:

What was the name of the tech who did this thing to my machine?

I had to tell him. There's really no way around it.

Not too long after that, "Z" was out of a job. Last I heard, he was hawking confections at a local mall.

There's a lot of stupidity here. "Z" should have never gone past the first failed boot. He probably should have caught the problem when he noticed the machine had more than two drives in it!

The support team should have never allowed this kind of hardware diddling to happen with no backups in place. They should have arranged for the machine to be moved, then backed up, then maybe have its RAID sorted out.

Obviously, the support team should have also been on top of the whole RAID-5 vs. RAID-1 issue and handled the controller migration properly.

But, naturally, only "Z" was fired over this. The day shift people who actually scheduled that maintenance and "set him up the bomb" (as we used to say) got away unscathed.