Writing

Feed Software, technology, sysadmin war stories, and more.

Tuesday, January 3, 2012

How my homebrew spam control system was written

Quite some time ago, I wrote a system which delayed incoming mail if it looked "new". It would throw out a 4xx response and remember the details, and then if you called back, it would let it in. This was also independently invented by others and one of them named it greylisting.

Given that I've already posted about what it did at the SMTP level in terms of 4xx responses, it seems appropriate to get into how I wrote it and how it evolved. The code changed quite a bit over the years, and all of it was in response to some real need.

The first thing I needed was some way to get "inside" sendmail during the mail transaction. I was not about to rig up some dumb program which accepted the mail and then attempted to bounce it later. Down that path lies backscatter and madness. Nope. I needed to temp-fail these messages before they got to the DATA stage where the body is transmitted.

Fortunately, sendmail had recently added something called "milter" -- a mail filter (get it?) scheme where your code would wake up during different parts of the transaction. After a couple of days of having my systems molested by these dictionary attack idiots, I finally sat down and figured out how to capture the incoming IP address of the relaying host, the HELO data, and the from/to addresses. Then I just had it connect to a predefined server via TCP and asked it to "decide".

For the other side of the connection, I just ran netcat in listen mode and manually typed in "OK" or "REJECT" to test things. This worked just fine, and proved that I could do what I wanted in sendmail with the milter interface. Next, I needed a real server.

The first instance of "mailserv" was just a bunch of recycled TCP server code from my weather station project. It would accept connections, pretend to read the four parts of a connection (ip, helo, from, to), and then it would always say "OK". With this in place, I could now leave sendmail running with the milter hooked up and it would not mess up my normal mail flow.

Now it was time to start figuring out the whole delay scheme. I decided to go for something really simple: internal data storage with no backing. I basically had it build a linked list of pointers to structs, where each one contained pointers to the IP, helo, from and to data (hereafter, a "quad"). If it didn't find you in the list, it added a new entry and tempfailed you. If it did find you and it was old enough, then it let you through.

This was enough to start testing it on my tertiary mail exchanger:

Nov 23 18:53:39 mx3 mailserv: Starting timer: [192.168.121.234] [192.168.203.226] [<etytdk3s@example.net>] [<example@example.org>]

Nov 23 18:53:39 mx3 mf: response: [TEMPFAIL 451 Resources temporarily unavailable.]

Nov 23 18:53:39 mx3 sm-mta[9445]: gAO1rXo7009445: Milter: to=<example@example.org>, reject=451 Resources temporarily unavailable.

I've flipped the domains and IPs around for use in this post, naturally. Assume my domain is example.org and the evil spammers are example.net. (Yes, this is far from ideal, but it would be irresponsible of me to use any other domains for example purposes. You should see the kind of random traffic the guy behind "domain.com" used to get.)

The dates and times are right, though: that's November 23, 2002.

That initial test worked. That was spam. They never tried back.

The next day, I started spotting more things which were obviously wrong:

Nov 24 10:53:22 mx3 mailserv: Starting timer: [192.168.36.232] [example.com] [<122724@example.com>] [<myuser@example.com>]

This idiot had the gall to forge both an (invalid) from address at one of my domain names, and even claimed to be my domain in the HELO! This brought up a new feature idea: any host which does such shenenigans with HELO/EHLO is clearly not running a real mail server. They should be quarantined forever -- basically, tempfail them until the end of time.

Obviously, my own machine(s) were allowed to HELO as us. People running Eudora did this all the time. That's why I added a shortcut system to my milter. If you connected from a handful of CIDR blocks, I didn't even bother connecting to mailserv. I just did a local ALLOW and stayed out of the way.

After a week or two, I added some code to my server to dump that internal data to disk. When it had been idle for a little while, it would flush all of that storage to a local file, assuming it had changed since the last pass. Then, when you restarted it, it would read it all back in. This combined with some magic in the milter to "fail open" when mailserv was gone allowed me to iterate quickly without worrying about screwing up my mail delivery.

About a month after that, I noticed that a huge spam run had nailed a bunch of my never-valid accounts. These are addresses which had been corrupted on some spammer's list once upon a time. Let's say the user's actual address was example@example.com. I'd see attempts to xample@example.com, exampl@example.com, and even xampl@example.com! I have no idea what kind of stupidity was going on, but the pattern became obvious.

There were some places which would hit the same three or four never-valid accounts every time they did a volley. This was in addition to several valid accounts they would also hit at the same time. I added another feature to my to-do list: spam trap addresses. Mailing them would quarantine the sending host. Even if they retried after the prescribed delay, they wouldn't get through.

Later on, something else came up: some places always used the same domain name or full u@h to spew spam at us, but would abuse different open proxies every time. I decided that having a list of never-valid senders would be good, and any host relaying mail for them needed to be quarantined. Basically, if you're sending mail for them, I don't want to hear from you.

I eventually added the spam trap recipient feature. It worked so well that I then noticed more patterns and expanded it to work for regexes. That way, I could list "bait.*@example.com" and the whole namespace would be dead forever. It was great.

At this point, while "quad" storage was stateful by virtue of being flushed to disk and reloaded at restart, quarantine storage was not. This way, if I marked a host as bad for some reason, it would go away once the server process restarted. Rapid development meant that happened every couple of days.

Still, I needed to keep tweaking my config files to add more spam traps or whatever, so I added SIGHUP support: hit it with that and it'll reload the config without starting over. This made it possible to keep that quarantine list around longer. I eventually got tired of looking up the process ID in ps, so I had it dump a pid file and then added a mode where you could do "mailserv -c reload" and the new copy would read that file to poke the old version of itself with a signal. It's standard Unix daemon stuff, and it's very useful.

Four months into this experiment, my linked list scheme was not aging well. Doing linear scans of that thing every time a quad came through was just ridiculous. By this point, there were almost 7000 entries in there, and it was in danger of taking longer than the delay which sendmail would allow before giving up on the milter. I had to do something about it - all of that strcmp() action isn't cheap OR quick.

Some time passed, and I thought about it. The list grew. I decided to start with a hash table. Given the IP address, I could now turn that into a hash bucket. In each bucket, I then had a linked list. Instead of having to traverse the 10,000 or so entries (it grew!), it would only have to jump out to a bucket and deal with a far smaller list. It wasn't perfect but it would buy me some more time.

I did some analysis on my new scheme. There were only about 3400 unique IP addresses, and with a hash array of 34031 buckets, I had about a 5% collision rate. This then gave a worst-case scenario of about 170 comparisons for a single IP address. Due to the collision situation where two IPs might share a hash bucket, the worst-case was closer to about 200. It was still much better than it had been.

I thought about making it a series of trees where it would pivot off each of the four keys in the quad, but this was good enough for now. At this point, I had bigger problems: dumping all of this stuff out to the disk was taking too long. It would occasionally get caught up in a dump-to-disk call and would miss incoming requests from the milter. Oops. I had to fix that.

It had been taking nearly 9 seconds to flush the ~10K entries to disk. I profiled my code and found out that most of that was spent in the encoding stuff which was taking the in-memory binary format and was turning it into something which could be stored on disk in a human readable ASCII scheme. Yep, I was doing that particular evil thing.

What was really stupid here was that none of this data changed. A quad was a quad, and it was always the same. You might get more of them, but a quad read from disk at startup was the same quad flushed back later. I decided to swap memory for CPU time and had it save the unparsed string from that file when it read it in. Then, later on, I'd just write that same string right back.

This meant I had to store both the "raw" form of the quad and the nicely-split version in memory, but the speed difference was worth it. Just changing that took it from around 9 seconds to about 250 ms. I also did some changes to my encoder scheme which helped this number.

Around this time, I decided to add the HELO trapping. It was awesome.

Feb 24 16:00:28 mx3 mailserv[16059]: Bad HELO: [192.168.75.234] [serv.example.com] [<delivery@serv.example.com>] [<myuser@example.com>]

Yep, they had the gall to say HELO as one of my hosts and were quarantined for it.

This process went on and on from here, but this post is long enough. Check back for the next update to see what happened when mailserv outgrew in-memory storage.


January 4, 2012: This post has an update.