Wednesday, October 24, 2012

Cleaning up one mess actually created another

Here's another tale from my days in the trenches working web hosting tech support. This one is about how the same event can seem both amazing and evil depending on who's looking at it, and what they know about systems in the real world.

We had a customer who was running a "vhosting" business. This basically meant they ran a control panel on their box and then created customers inside it. Those customers had their own domain names and admin logins and could then add e-mail accounts and publish web sites. It was a common business model, and it scaled fairly well on the available hardware.

One particular quirk of this control panel, Plesk, was that it was built around qmail. Supporting any mail setup will eventually give you enough ammo to hate all of them (sendmail, Postfix, you name it), but qmail was particularly annoying. It had a tendency to fill its queues with useless messages which couldn't be bounced and would take forever to otherwise expire.

One evening, we (the support team) got a ticket from the managed backup folks telling us that this machine had stopped backing up because there were too many files on the filesystem. A quick look established that qmail's spool directories were full of garbage messages, and those needed to be removed.
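
A "quick look" here just means counting files per queue subdirectory. Below is a minimal illustration of that kind of check in Python, assuming the stock /var/qmail/queue location; the path and output format are my own, not anything from the actual ticket.

    #!/usr/bin/env python3
    """Count files under each top-level qmail queue subdirectory."""
    import os
    from collections import Counter

    QUEUE = "/var/qmail/queue"   # stock location; adjust for your install

    counts = Counter()
    for root, dirs, files in os.walk(QUEUE):
        # Attribute every file to its top-level subdirectory (mess, info, local, ...).
        top = os.path.relpath(root, QUEUE).split(os.sep)[0]
        counts[top] += len(files)

    # Print the biggest offenders first.
    for subdir, n in counts.most_common():
        print(f"{n:10d}  {subdir}")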

I wound up with this ticket and wrote something dumb to drive a tool called qmail-remove, which would knock things out of the queue safely by renaming them out of the way. It would do this for a few hundred or a thousand bogus mails at a time, then circle around and delete the actual files which had been moved out of the queue. I set this up in a loop and let it run.
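
For the curious, the loop looked something like the sketch below in spirit. This is a reconstruction, not the original script: the qmail-remove arguments, the holding directory, the batch pause, and the stop condition are all assumptions on my part.

    #!/usr/bin/env python3
    """Sketch of the batch cleanup: rename junk out of the queue, then delete it."""
    import os
    import subprocess
    import time

    # Hypothetical invocation; the real pattern/flags depend on the qmail-remove build.
    QMAIL_REMOVE_CMD = ["qmail-remove", "-p", "bogus-sender@example.com"]
    HOLDING_DIR = "/var/qmail/queue/removed"   # assumed rename target
    BATCH_PAUSE = 5                            # seconds between batches

    while True:
        # Step 1: have qmail-remove rename a batch of junk out of the live queue.
        subprocess.run(QMAIL_REMOVE_CMD, check=True)

        # Step 2: delete whatever was moved into the holding directory.
        moved = os.listdir(HOLDING_DIR)
        for name in moved:
            os.unlink(os.path.join(HOLDING_DIR, name))

        # Step 3: a pass that moved nothing means the queue is clean; otherwise
        # pause briefly so the box (and its poor disk) can breathe.
        if not moved:
            break
        time.sleep(BATCH_PAUSE)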

I kept an eye on it for a while, including well past the end of my shift, but it was getting late even for me, and it was time to go home. I put a note in the ticket explaining what was going on, and asked whoever grabbed it to make sure it got back to the backup folks to have them run a full backup once it finished. Then I handed it off and went home.

The next afternoon, I checked on the ticket. The cleanup had finished all by itself, and one of the day shift techs had thrown the ticket back to the backup team as I had requested. The backup guys did their thing, and they managed to get a full backup to run for the first time in quite a while for this box.

A couple of hours later, the disk in that server died.

I only wish I were making this up. Some people congratulated me on my foresight in cleaning all of this up and making sure it went back to the backup team instead of waiting for the usual scheduled backup. I accepted the praise, but in the back of my mind, something was nagging me.

I suspect I actually caused the disk crash. Think about it: this was a cheap consumer-grade IDE drive, and I beat it to death by moving and then deleting files all night long. As soon as that finished, we smacked it around again with a full backup, which pushed a lot of data through very quickly and visited every part of the disk.

The machine was reinstalled on a fresh disk and restored from that very backup, and things finally went back to normal for the customer, who only really lost a few hours of uptime and didn't suffer in any other tangible way.

The drive probably would have died sooner or later, but by "saving the day" with my queue cleanup, I basically pushed it right up to the edge, and the full backup did the rest. A couple of hours either way and it could have turned out very differently.

Sometimes, it seems like it's possible to do something really good and yet really bad at the same time, but it's only apparent in retrospect.