Writing

Feed Software, technology, sysadmin war stories, and more.

Sunday, November 25, 2012

My days of POP server wrangling

I used to run a Unix box which did a bunch of things for lots of users at a school district. I had on the order of 1500 users at any given time, and this one box was responsible for dialups, DNS, mail relaying, POP mail retrieval, time syncing, and more. However, since most of my users eventually wound up on Windows machines with the Eudora mail client, POP3 access eventually came to dominate the workload.

At first, this meant running "qpopper". The Q was for Qualcomm, the same company which wrote Eudora. It suffered from a number of security problems over the years, and every time that happened, I'd have to scramble to patch my machine and hope nobody had broken in while my system was vulnerable. I got tired of this routine and finally switched to something new called popa3d. It was a POP server which had been designed with security in mind, and it was clearly different.

However, when I switched to it, there were a few things which happened that seemed to be a consequence of running it on an established system with a bunch of traffic. One of them was that popa3d defaulted to logging four lines per connection. This creates tons of traffic like this:

Nov 25 19:06:12 mail popa3d[1234]: connect from 172.16.1.1
Nov 25 19:06:12 mail popa3d[1234]: Authentication passed for rachel
Nov 25 19:06:12 mail popa3d[1234]: 0 messages (0 bytes) loaded
Nov 25 19:06:12 mail popa3d[1234]: 0 (0) deleted, 0 (0) left

This was bad on multiple levels. If you wanted to use POP logs to see which IP address was associated with a user (or vice-versa), you couldn't do a simple grep. Instead, you had to dig around to find the right connection, then identify the PID and approximate time and then look for a match. Doing this programmatically was error-prone and doing it by hand was annoying.

I wound up changing it to only log one message during normal polls:

Nov 25 19:06:12 mail popa3d[1234]: Stats: rachel 0 0 0 0 172.16.1.1

This one line format still provided all of the same information but it kept it all bundled together so that dealing with multi-line log cruft was unnecessary.

There were other log issues I had to fix, too. One problem was that it would occasionally yell about how it was "unable to open mailbox" but not say which one. I suspect this message made a little more sense in the multi-line logging regime I just described, but I had ripped that out. Now it would just say the message and not say which one, or what had happened.

I had to change it to make it leave some details. When someone's mail broke in some fashion, my phone would almost always start ringing shortly thereafter. Having useless logs only made my life more difficult. Once I had changed things around, I was able to detect problems before the users could moan about them, and sometimes I could launch a pre-emptive strike to reach out to one of the tech people at their location to explain what was going on. This would get the problem fixed without getting stuck on a lengthy "now click this, now do that" call.

There were some other fun issues which came up when switching from qpopper. The two servers used different schemes to generate the list of messages returned to a client. It was something to do with how they generated a checksum for each message, and that was how the clients kept track of what they had already seen. So, when I switched servers, all of the messages left in a user's mailbox became "new" even if they were months or years old. One user actually had to scarf down 1500 messages all at once which had accumulated at the bottom like scale in a water heater.

The thing is, back in those days, 1500 mails was a lot. People with multi-megabyte mailboxes really hurt the performance of the machine and made it slow for everyone. This is because you had to rescan their entire mailbox every time they checked for new messages. The entire system might have had on the order of 128 MB of memory, if you have 1500 users with even 2 MB each, that's just not going to fit. Back in the '90s, you just didn't find Intel boxes with gigs of RAM like you do now.

As a result, we had this policy that users were not to use the "leave mail on server" option. They were supposed to poll for mail, pull it to their machine, and then delete it. Most of them complied, but more than a few turned that on and then never deleted anything. This caused mail to back up and the system slowed down. As a result, this change to the mail server which forced them to fetch all of this stuff again was actually a very good thing.

Still, some users persisted, so I wound up writing some progressive penalty code which would still let them check mail but without beating my machine to death. Some of the worst offenders also had it set to check every minute instead of a more reasonable 10 or 15 minute period. This meant they'd beat the machine silly for a few tens of seconds, disconnect, and then immediately come right back to do it again.

My change introduced a series of delays and even called nice() to have the process drop its own priority. This caused the kernel to run other things first, such as the mail server instances for well-behaved users. It also meant that response time went higher and higher for the mail hogs, and it would actually stretch past one minute per poll. This had the effect of keeping them from reconnecting too quickly.

The best part was that most of these users never even noticed. They just figured "it does that" and let it be. They never paid attention to my requests to clean up their messes, so this was about all I could do to keep things usable for everyone else.

On the rare occasion that someone did notice the slowness, I only had to say "it will speed up when you clean up your mailbox", and then they do that at their leisure. Since it was done entirely in code, I didn't have to lay a finger on it to make it automatically speed back up. Some users got the message, cleaned it up, and then enjoyed zippy polling once again.

Not long after I switched mail servers and the great disgorging of old mails occurred, something else happened. Three users just stopped getting mail entirely. Specifically, they were still receiving it on the server, but their clients had stopped fetching it. I had to dissect their mailboxes to get to the bottom of it, but it eventually made sense. There was a message with a specific combination of slightly strange headers which would make Eudora choke. Once it hit that message in a mailbox, it would give up.

Trouble is, when this happened, Eudora didn't throw an error. It just said "you have no new mail". Brilliant. Everyone thought it was yet another change I had inflicted on them, but it really was just some kind of bug in their mail client. I manually fixed their mailboxes, and then also generated a proof-of-concept "Eudora killer" which would spew the bad mail into a mailbox of your choosing. I did this to prove that it was in fact the client and not anything I had done. This got the boss type people off my case.

That's a taste of what running a mail server was like for me back in the '90s. I had a bunch of users pushing around lots of data, and not much in the way of hardware resources. As a result, I had to try to work around it in software to get the best possible behavior for as many people as possible. Today, it would be a totally different story.