Writing

Feed Software, technology, sysadmin war stories, and more.

Thursday, January 5, 2012

Learning about normalization the hard way

In my last post, I talked about moving from a bunch of hand-hacked data structures in my process to storing stuff in MySQL. I had just come up with a really simple and ridiculous schema because I did not know any better.

The table stored the IP, HELO, FROM and TO as character data. That gave me a bunch of rows like this:

192.168.1.1 | mx1 | user1@example.com | user2@example.org

This meant that my database had to do text comparisons against everything in the table every time I did a lookup. This didn't last long. I decided to turn it around and changed all of those strings into numbers, sort of like a tokenizing pass. The actual strings would live in their own tables: one for each of the four types.

My program was pretty simple: first, it would ask the database for the ID of the IP address from that table with a SELECT. Then it would send another to get the ID for the HELO, a third for the FROM, and a fourth for the TO. After that, it would build a query using those four numbers and would fire it off in the fifth request to the database. It looked like this:

mysql> select * from quads where mipid=32 and mheloid=32 and mfromid=38 and mtoid=39;

(snipped hugely wide output)

1 row in set (1.87 sec)

This was bad news. Granted, the system was still having to grovel around through a huge table which still had the old varchar columns in place, but still? Two seconds for a query where the numbers are known?

A bit later, I figured out how to add an index with all four columns. Things got MUCH better.

mysql> select id,ctime,passed from quads where mipid=32 and mheloid=32 and mfromid=38 and mtoid=39;

+-------+---------------------+--------+

| id | ctime | passed |

+-------+---------------------+--------+

| 46801 | 2003-08-01 01:40:10 | 0 |

+-------+---------------------+--------+

1 row in set (0.00 sec)

Over on the SMTP side of the house, things were great. Previously, I could use a stopwatch and get about 2.5 seconds from sending a "RCPT TO" to getting the 4xx or 2xx response. Now, it was happening so quickly, I couldn't even measure it reliably with a stopwatch.

If you know anything about SQL, this whole escapade should have you either cringing, or rolling on the floor laughing... or maybe both. I had come up with normalization the hard way, but still had no idea what a JOIN was.

Much much later, I finally had a solid reason to learn about joining tables while dealing with the ticketing system's reporting database. At that point, I realized just how silly my earlier design had been. Sure, it had worked, but it had four whole round trips to MySQL which should have never happened.

This is the kind of stuff which happens when you don't know any better and don't have reasonable code to use as an example. I plead ignorance because I was operating in a vacuum. It would have been different if someone else had been working with me on this, had flagged it as being insane, and I had ignored them.

So now, years later, when I say to someone that I've seen something done before, and perhaps they might consider another way, this is why. I've been on the other side, and know just how disgusting the code can be if there's nobody there to act as a guide. If someone offers, you should at least hear them out and consider what they have to say before rejecting it.

The alternative is reinventing the wheel every five to ten years.