Writing

Software, technology, sysadmin war stories, and more.

Monday, July 23, 2012

"Kicking over" and "ghosting over" customer drives

Once you see the same problem happen a few times, you should start thinking of things that can be done to make it disappear. If there are a lot of problems, you might have to handle some later than others, but you don't just ignore them. Problems where people can make mistakes which affect customers are particularly bad. Technology should be used to help those people out.

I can't remember how many times a customer's drive was "kicked over". The fact that I have a name for it should tell you something, just like the whole "stabbed it" thing. "Kicking over" refers to the installation process, or "kickstart", running on a drive which already has data on it! Installations tended to be an automatic process beyond the point of choosing a flavor of Linux. They'd pick RHEL 2.1 or 3 or whatever and off it would go.

Here's where things got messy. Sometimes, a machine would become compromised at the root level and would need to be reinstalled by policy. The usual approach here was that a fresh disk would be mounted and set up with the same OS, and then the old disk would be mounted somewhere -- preferably nosuid, noexec, naturally. The customer would have two weeks to migrate their stuff across, at which point the old disk would be removed (or they could keep it and start paying for it as an additional drive).

Most of the time, this worked fine. Naturally, we never heard about the ones which worked properly. Being in support, we had to find out and then deal with the ones where it went horribly wrong. What would usually happen is that someone would forget to disconnect the existing drive (you know, the one with all of their data on it), and would start the kickstart process. It would then proceed to re-partition the drive and make fresh filesystems, thus totally destroying anything which had been on it before.

After seeing this happen a couple of times with the predictable and understandable outrage from flummoxed customers, I came up with something which could be done about it. It's automated software which is screwing up these disks, so let's teach it how to be smarter. All you have to do is write your installer so that it looks for a magic signature on the disk which says "I have been blessed for an install". This signature would be written somewhere that would be erased by any sort of "real use" - probably in the partition table space.

If someone happened to initiate a kickstart with a "used" drive in the system, the installer would detect the lack of the magic signature and would fail. This would give the human a chance to count their lucky stars and set things right before trying again.
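Here is a minimal sketch of that installer-side check, in Python. The signature bytes, the offset, and the function names are all made up for illustration. It assumes a classic MBR layout, where the partition table occupies bytes 446 through 509 of the first sector, so any real partitioning would overwrite the signature.

```python
# Hypothetical values: the signature and its location are assumptions,
# chosen so that writing a real partition table would erase the mark.
MAGIC = b"BLESSED-FOR-INSTALL-v1"
MAGIC_OFFSET = 446  # start of the partition table in a classic MBR sector


class NotBlessedError(Exception):
    """Raised when a target disk lacks the install signature."""


def is_blessed(device_path):
    """Return True if the signature is present at the expected offset."""
    with open(device_path, "rb") as dev:
        dev.seek(MAGIC_OFFSET)
        return dev.read(len(MAGIC)) == MAGIC


def require_blessed(device_path):
    """Call this before partitioning; it raises instead of destroying data."""
    if not is_blessed(device_path):
        raise NotBlessedError(
            f"{device_path}: no install signature found, refusing to partition"
        )
```

An installer would call something like `require_blessed()` on the target device before it touches the partition table, and stop with a loud error if the call raises.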

Obviously, then, there is the question of how you bless a disk. That's relatively simple. Just have a machine which does nothing but that and which lives in a special place. If you're touching that machine, you know you are deliberately blowing away data. Maybe that machine would live back in the Inventory cage where the data center guys typically never tread.

Assuming you've taken my earlier advice and instituted a disk testing regimen for any recycled hardware, then it's even easier. At the end of the testing, just write the magic signature to it and call it done. It wouldn't require any more work on the part of the people involved.
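The blessing step at the end of a disk test could look roughly like this. It is only a sketch: the signature, the offset, and the `run_tests` callback are hypothetical stand-ins for whatever the real disk tester does, and it assumes the tester has already wiped the disk by the time it gets here.

```python
import os

# Hypothetical values; they must match whatever the installer looks for.
MAGIC = b"BLESSED-FOR-INSTALL-v1"
MAGIC_OFFSET = 446  # partition table area of a classic MBR sector


def bless(device_path):
    """Write the install signature without truncating the device."""
    with open(device_path, "r+b") as dev:
        dev.seek(MAGIC_OFFSET)
        dev.write(MAGIC)
        dev.flush()
        os.fsync(dev.fileno())  # make sure the mark actually reaches the disk


def bless_if_healthy(device_path, run_tests):
    """Run the caller-supplied disk test and bless only on a pass.

    run_tests is a placeholder: any callable that takes the device path
    and returns True when the disk passed.
    """
    if not run_tests(device_path):
        return False
    bless(device_path)
    return True
```

A disk which fails its tests never gets the mark, so it can't be installed onto by accident either.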

You could actually do this without having any of the workers change their behavior. First, you make the disk tester start blessing disks when it's done with them. Then, later, you make your kickstart/installer look for that signature and throw an error if it isn't there. Then you just let the process run by itself.

If your people are perfect, you'll never hear anything. However, if they are in fact human, once in a while they'll trip it and will be very happy that someone (you) put in the effort.

I should mention there is another failure mode which I saw. Besides "kicking over", there was also "ghosting over". This is when, yes, you guessed it, someone used "Ghost" or a similar tool to copy disk A to B. Usually what happened is they would screw up the notion of which one was source and which one was the destination, and would overwrite the data disk with the (empty) contents of the new one. Oops.

For that situation, I have two things to say. First of all, copying an entire disk at the partition level is usually the wrong thing to do. It takes a long time, it needlessly touches parts of the disk which may be going bad, and it gets tripped up by variable disk sizes.

Second, instead of using a blunt tool which has no clue about how your business works, craft something new which does. Make it look for that magic signature before writing anything to a disk. That way, it becomes impossible to "ghost over" a good drive.
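As a rough illustration, a copy tool with that guard might be shaped like the following. The signature, the offset, and the names are hypothetical. It does a raw byte-for-byte copy only to keep the example short, not because that is the right way to move customer data.

```python
import shutil

# Hypothetical values; they must match what the blessing machine writes.
MAGIC = b"BLESSED-FOR-INSTALL-v1"
MAGIC_OFFSET = 446  # partition table area of a classic MBR sector


class UnblessedTargetError(Exception):
    """Raised when the destination has not been blessed for overwriting."""


def has_signature(path):
    """Return True if the install signature is present on the disk."""
    with open(path, "rb") as dev:
        dev.seek(MAGIC_OFFSET)
        return dev.read(len(MAGIC)) == MAGIC


def guarded_copy(src_path, dst_path):
    """Copy src onto dst, but only if dst carries the signature."""
    if not has_signature(dst_path):
        raise UnblessedTargetError(
            f"{dst_path}: destination is not blessed, refusing to write"
        )
    # "r+b" opens the destination for writing without truncating it first.
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        shutil.copyfileobj(src, dst)
```

If someone swaps source and destination, the customer's disk ends up as the target, has no signature, and the tool refuses. A successful copy also overwrites the signature on the new disk, so it can't be clobbered a second time.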

Once you make that change, you just have to enforce a policy of not using anything but the official tools for this kind of work. Otherwise, someone will just grab a copy of Ghost (or whatever) off the shelf and keep on making trouble.

In a world where data center techs destroy arrays by barreling on ahead instead of asking for help, you need every safeguard you can create. They're only human.