Software, technology, sysadmin war stories, and more.
Sunday, December 11, 2011

Cloning failing disks the right way

When you work technical support, you tend to see only the problem machines. This is selection bias in its most obvious form, since happy hardware tends to just sit there and run. From your perspective, all you ever see is broken stuff.

One situation that came up a fair amount was a hard drive starting to fail and someone needing to migrate the data to a new disk. For some reason, people have gotten the idea that 'dd' is the right way to go. I'm here to say that it's probably the worst thing you can do. It all comes down to how you choose to look at that device.

Let's say you have a customer with a 250 GB drive that starts failing. You get the data center to swap it over to the secondary position (hdc) and then install a new 250 GB disk at the primary (hda). Now you want to copy all of that data across. If your solution is 'dd', you're thinking of the disk as a linear amount of space instead of a bunch of files. Here's why that's wrong.

First of all, 'dd' is going to try to read the entire drive, including the parts which currently do not contain any useful data. This happens because you have probably told it to read /dev/hdc. That refers to the entire readable portion of the disk, including the partition table and the parts which are not associated with living files.

If you use 'dd', you are going to have to read 250 GB from hdc and then write 250 GB back to hda. This is going to take a long time. Worse, if you happen to hit a chunk of the disk where something is wrong, dd will stop at the first read error by default. Even if you tell it to continue after read errors ("noerror"), the unreadable blocks are either dropped or filled with zeros, so you've just silently corrupted something on the target disk. Way to go!
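To make the "silently corrupted" part concrete, here is a toy model in Python. It does not run dd or touch a real disk. The names `raw_copy` and `BLOCK` are made up for illustration, and a `None` entry stands in for a block that returns a read error. The padded case is meant to resemble what dd does with "noerror" plus "sync", and the unpadded case what happens with "noerror" alone.

```python
# Toy model of a block-by-block copy over a source with one unreadable block.
# The "disk" is just a list of blocks; None marks a simulated read error.

BLOCK = 4  # tiny block size so the result is easy to eyeball


def raw_copy(blocks, pad_on_error):
    """Copy blocks in order, carrying on past read errors like a forced raw copy."""
    out = bytearray()
    for block in blocks:
        if block is None:                # simulated read error
            if pad_on_error:
                out += b"\x00" * BLOCK   # offsets stay aligned, the data is gone
            continue                     # either way, nothing says which file was hit
        out += block
    return bytes(out)


source = [b"AAAA", None, b"CCCC"]

padded = raw_copy(source, pad_on_error=True)
unpadded = raw_copy(source, pad_on_error=False)

print(padded)    # b'AAAA\x00\x00\x00\x00CCCC'
print(unpadded)  # b'AAAACCCC'
```

In the padded case some file now contains a run of zeros and nothing tells you which one. In the unpadded case everything after the bad block lands at the wrong offset, which is even worse for a filesystem.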

Finally, there is the whole matter of geometry. Do both of the drives have the same configuration in terms of heads, cylinders, sectors, and all of that other arcane garbage? Will the partition table wind up in the right place? What about the boot sector data and secondary stuff like the locations where your boot loader might be looking?

What if the target drive is just a little smaller than the source? 250 GB is an approximation, after all, and they might not have the same number of accessible bytes. See the problem?
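If you want to check the size question before doing anything else, something like the following sketch would do it. It assumes a Linux-style system where seeking to the end of a block device reports its size, and that you have permission to open the devices. The helper names `size_in_bytes` and `fits` are hypothetical.

```python
import os


def size_in_bytes(path):
    """Return the size of a regular file or block device by seeking to its end."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)


def fits(source, target):
    """True if every byte of source would fit on target."""
    return size_in_bytes(source) <= size_in_bytes(target)


# Example call (needs read access to both devices, typically root):
# fits("/dev/hdc", "/dev/hda")
```

Two drives sold as "250 GB" can easily differ by a few megabytes, and a raw copy only works if this returns True.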

Instead, take a step back. You care about the files, so use a solution which thinks in terms of filesystems. Mount the source partition(s) some place as read-only and then use something sensible to copy them while preserving all of the metadata. I used to use a pair of tar processes connected with a pipe to do this. rsync would probably be an acceptable way to do it.
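For anyone who hasn't seen the tar trick, here is a rough sketch of the idea driven from Python. It assumes a `tar` binary on the PATH that accepts `-C`, `-c`, `-x`, `-p` and `-f -` (GNU tar and bsdtar both do), and the function name `tar_pipe_copy` is made up. The source would be the read-only mount of the old partition and the destination the freshly created filesystem.

```python
import subprocess


def tar_pipe_copy(src_dir, dst_dir):
    """Copy the contents of src_dir into dst_dir using two tar processes and a pipe."""
    # Writer: archive everything under src_dir to stdout.
    writer = subprocess.Popen(
        ["tar", "-C", src_dir, "-cf", "-", "."],
        stdout=subprocess.PIPE,
    )
    # Reader: unpack from stdin into dst_dir, keeping permissions (-p).
    reader = subprocess.Popen(
        ["tar", "-C", dst_dir, "-xpf", "-"],
        stdin=writer.stdout,
    )
    # Drop our copy of the pipe so the writer notices if the reader dies early.
    writer.stdout.close()
    reader_rc = reader.wait()
    writer_rc = writer.wait()
    return writer_rc, reader_rc


# Example call with hypothetical mount points:
# tar_pipe_copy("/mnt/old", "/mnt/new")
```

A non-zero exit code from the writer is the signal that something on the source could not be read, and tar's error output names the file.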

The benefits are numerous. First, you don't have to read the entire source drive. Maybe that 250 GB disk only has 50 GB of data on it. Great! You just eliminated 80% of the work. It'll be a lot faster.

Second, you're dealing with individual files. If your tar or rsync (or whatever) fails to read a file due to some disk issue, you'll know about it. Now you can know for sure that files X, Y, and Z might need to be recovered from a backup instead. If you had copied the raw partition with errors being skipped, you would have missed that entirely.
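The same idea in a few lines of Python, as a minimal sketch rather than a real migration tool: walk the source tree, copy file by file, and keep a list of whatever could not be read. The name `copy_tree_reporting` is hypothetical, and this version deliberately ignores ownership, hard links, symlinks, and device nodes, all of which tar or rsync would handle for you.

```python
import os
import shutil


def copy_tree_reporting(src_root, dst_root):
    """Copy files from src_root to dst_root and return a list of (path, error) failures."""
    failed = []
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            try:
                shutil.copy2(src, dst)   # file data plus mode and timestamps
            except OSError as exc:
                # A read error on a failing disk shows up here, tied to a file name.
                failed.append((src, str(exc)))
    return failed


# Example call with hypothetical mount points:
# for path, reason in copy_tree_reporting("/mnt/old", "/mnt/new"):
#     print("restore from backup:", path, reason)
```

The returned list is exactly the "files X, Y, and Z" you would go looking for in the backups.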

My only caution is to beware of crazy metadata. If your files have extended attributes or ACLs or anything of the sort, your copier tool needs to support them too, or you'll have a nice mess on your hands.

Obviously, this means you get to do the work of partitioning the new disk and creating the new filesystems and swap spaces. I consider that a small matter when compared to the mess that could happen when trying to blindly copy an entire disk byte by byte.

Back in the C-64 days, that sort of thing mattered when you really needed your warezed copy of some zero-day game to work. Now, it just doesn't make sense.