A terrible schema from a clueless programmer
The first time you encounter something, you're probably going to make some mistakes. There's a post going around tonight about how someone forgot to put an index on some database thing and wound up doing full table scans (or something like that). The rub is that instead of just being slow, it also cost a fair amount of money because this crazy vendor system charged by the row or somesuch. So, by scanning the whole table, they touched all of those rows, and oh hey, massive amounts of money just set ablaze!
The usual venues are discussing it, and I get the impression some people have the wrong approach to this. I want to describe a truly bad database schema I encountered, and then tell you a little about what it did to the system performance.
This system was something that was intended to slow down or stop a lot of the "open proxy" spam that was going around at the time - late 2002. It was apparent that these spammer scum would grab an open proxy and then would connect outward to your mail exchanger and would try to spew a spam at you. Blocking them did little since they forged everything - the HELO, the MAIL FROM, you name it. They also didn't use that same open proxy twice, so blocking it also didn't help much.
There were lists of these open proxy hosts, but plenty of "legit" organizations which were mailing us were sufficiently clueless to have landed on those lists, and the boss had decreed there would be no use of those lists. We needed something else to do the job.
The observation was that we could probably store the IP address, HELO string, FROM address and TO address in a table, and send back a 4xx "temporary failure" error the first time we saw that particular tuple (or "quad"). A real mail server which did SMTP properly would retry at some point, typically 15 minutes to an hour later. If it did retry and enough time had elapsed, we would allow it through.
Simple enough, right? Well, the database used for this thing was absolutely terrible. It was bog-standard MySQL but that's not the bad part. The bad part is the schema. The rows in the table looked like this:
id | ip | helo | m_from | m_to | time | ...
"Big deal", you might think. Isn't it supposed to be those four things plus a time? Yes, it is, but the problem was HOW it was stored. Those things were STRINGS (you know, varchars in database-speak). So, every damn row in the single "quads" table looked like this:
1 | ip1 | blahblah | email@example.com | firstname.lastname@example.org | ... 2 | ip2 | foobar | email@example.com | firstname.lastname@example.org | ... 3 | ip3 | MAILSERV | legit@person | email@example.com | ... 4 | ip4 | foobar | firstname.lastname@example.org | email@example.com | ...
Every row was extremely heavyweight: it had the actual raw string in it. Now think about what the database had to do to check for a match. We'd say something like this:
SELECT whatever FROM quads WHERE ip='ip1' AND HELO='thing_they_sent_us' AND m_from='whatever' AND m_to='some_user'
The poor database had to go through and start doing string comparisons all the way down the table! This meant it had to read every single row in the table, and then it had to start comparing characters from the query to what it found in that row. It probably aborted a row once one of the columns didn't match, but still, it was slow and nasty as hell.
You know what it should have been? It should have been normalized. *Someone* should have heard about things like "third normal form" and all of that stuff. Nobody had heard of that, and so it went, bumbling along, chewing memory and CPU and disk space, and being oh-so-slow.
The person who designed and programmed it really screwed up that initial implementation, and it kind of went like this for a while - maybe a couple of months? I'd have to dig through the logs to see how long the system was groaning under the weight of this thing.
Still, eventually, things changed. The whole system was reworked to have a table for each set of values. There was now a table of IP addresses, a table of HELO strings, a table of FROM addresses, and a table of TO addresses. Each table was simple enough, with a schema like this:
id | value
Each table was setup such that "id" was the primary key, naturally, and "value" was unique, so it got an index. This meant you could ask the database to do something involving finding the id given the value, and it would efficiently find it for you without reading the whole thing.
Then the "quads" table was changed to use those numeric values for each row, where any given entry was one of the ids from the other tables (ip, helo, from, to). It's a foreign key relation, but the programmer didn't know that, either.
In any case, the first four rows in the table might've now looked like this:
1 | 10 | 11 | 15 | 2 | ... 2 | 84 | 99 | 44 | 3 | ... 3 | 91 | 71 | 16 | 4 | ... 4 | 89 | 99 | 44 | 2 | ...
Then, a multi-column unique constraint was set up using the (ip, helo, from, to) as the inputs, and so that too added an index. Now, you could ask the database to look up (10, 11, 15, 2) or whatever, and it would do it relatively efficiently thanks to that index.
We came out the other side with a system that quickly and efficiently processed this data and which did its intended purpose of slowing down or stopping most of the open proxy abuses that had been headed our way.
Now, what do you suppose happened to that clueless programmer who put in the schemas with raw strings (yes, varchars) galore and didn't know anything about foreign key relationships?
Well, that's easy. She just wrote this post for you. That's right, I was that clueless newbie who came up with a completely ridiculous abuse of a SQL database that was slow, bloated, and obviously wrong at a glance to anyone who had a clue.
My point is: EVERYONE goes through this, particularly if operating in a vacuum with no mentorship, guidance, or reference points. Considering that we as an industry tend to chase off anyone who makes it to the age of 35, is it any surprise that we have a giant flock of people roaming around trying anything that'll work?
It's a massive problem, and we're all partly responsible. I'm trying to take a bite out of it now by writing stuff like this. What about you?
November 7, 2021: This post has an update.