Software, technology, sysadmin war stories, and more.
Monday, November 21, 2011

Baby 1, mail server disk 0

I remember my first encounter with the dancing baby. It was years ago, back when that file started making the rounds.

I was running a POP mail server for about 1500 users. It was a little Pentium 90 box with a 4 GB disk just for mail. That particular drive usually sat at the 60% mark. 4 GB was considered a fairly big drive at the time, and things had been fine for the two years or so that we had been using it.

Then the baby landed. It's not so much that someone got it and forwarded it. That by itself wouldn't have been a huge deal. It was a 40 MB file, which was massive by most standards back then, but we could have handled it without too much trouble. The problem was one of multiplication.

We had a bunch of mailing lists set up for our users. Every site had a dozen or so with different groups of people: everyone, just teachers, just staff, just this, just that, and so on. You might have 100 or so people on any one list. We were running majordomo in a fairly loose mode, such that anyone could send anything to any list if they were an internal user.

So then it happened. Someone mailed that 40 MB dancing baby file to a list with 100 recipients. That's roughly 4 GB of base64 attachment gunk that hadn't been there moments earlier, aimed at a drive that usually had only about 1.6 GB free. sendmail did its job as expected and promptly filled up the mail drive.
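For the curious, here is the multiplication as a rough Python sketch. The 40 MB, 100 recipients, 4 GB and 60% figures come from the story; everything else is my assumption. In particular, I don't remember whether 40 MB was the size before or after base64 encoding, so the helper handles both cases, and it ignores mail headers and MIME line breaks. The function names are made up for illustration.

```python
# Back-of-the-envelope numbers for the dancing baby incident.
# Figures are from the story; the base64 handling is an assumption.
MB = 1024 * 1024
GB = 1024 * MB


def base64_size(raw_bytes: int) -> int:
    """Size after base64 encoding: every 3 input bytes become 4 output
    characters, with the last group padded. Line breaks are ignored."""
    return 4 * ((raw_bytes + 2) // 3)


def list_delivery_bytes(attachment_bytes: int, recipients: int,
                        already_encoded: bool = True) -> int:
    """Total bytes written when one message is copied into every
    recipient's mailbox (one full copy per mailbox is assumed)."""
    if already_encoded:
        per_copy = attachment_bytes
    else:
        per_copy = base64_size(attachment_bytes)
    return per_copy * recipients


disk_bytes = 4 * GB
free_bytes = disk_bytes - disk_bytes * 60 // 100   # drive usually sat at 60%

as_told = list_delivery_bytes(40 * MB, 100)
if_raw = list_delivery_bytes(40 * MB, 100, already_encoded=False)
copies_that_fit = free_bytes // (40 * MB)

print(f"free before the baby: {free_bytes / GB:.2f} GB")
print(f"100 copies at 40 MB each: {as_told / GB:.2f} GB")
print(f"100 copies if 40 MB was the raw size: {if_raw / GB:.2f} GB")
print(f"copies that fit before the drive fills: {copies_that_fit}")
```

Either way the total is several times the free space, so the drive was always going to fill long before the last mailbox got its copy.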

That's when everything hit the fan. Nothing new could be created on there since it was at 100% utilization. Incoming mail started being rejected with 4xx errors. Worse, the users couldn't fetch the actual message and lighten the load because the POP server needed to open a temp file of some kind, and it couldn't do that due to the space situation.
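The kind of guard that would have kept the POP side working looks something like the sketch below. This is not what sendmail or our POP server actually did; it is a hypothetical check I am writing after the fact, with invented function names and an arbitrary 64 MB reserve. The idea is to start turning mail away with a temporary failure while there is still room left for things like temp files. The 452 reply with enhanced status 4.3.1 is the standard SMTP way to say "out of storage, try again later".

```python
import shutil

# Hypothetical reserve kept free for temp files and other housekeeping.
DEFAULT_RESERVE = 64 * 1024 * 1024


def smtp_reply(free_bytes: int, message_bytes: int,
               reserve_bytes: int = DEFAULT_RESERVE) -> str:
    """Decide whether to accept a message of the given size.

    Returns a temporary failure if accepting the message would eat
    into the reserve, so the sender retries later instead of the
    drive reaching 100%.
    """
    if free_bytes - message_bytes < reserve_bytes:
        return "452 4.3.1 Insufficient system storage"
    return "250 2.0.0 OK"


def reply_for_spool(spool_dir: str, message_bytes: int,
                    reserve_bytes: int = DEFAULT_RESERVE) -> str:
    """Same decision, using the real free space on the filesystem
    that holds spool_dir."""
    free = shutil.disk_usage(spool_dir).free
    return smtp_reply(free, message_bytes, reserve_bytes)


if __name__ == "__main__":
    # "." stands in for the mail spool directory in this sketch.
    print(reply_for_spool(".", 40 * 1024 * 1024))
```

A check like this only helps if it counts every copy a list delivery is about to write, not just the one incoming message, which is exactly the multiplication that bit us.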

It was an embarrassing mess. I had to fix it by groveling around in raw mailboxes to excise the little dancing demon. Then we had a little chat with the user in question and let them know exactly how things went bad. Then we asked them to never do it again.

I'd like to say that we used this opportunity to build a better system, one that was resilient to this kind of thing, but that never happened. We just kept doing things the hard way.