Writing

Software, technology, sysadmin war stories, and more. Feed
Thursday, March 8, 2012

Multi-line data and uniq and a tale of ugly prototypes

As much as I hate dealing with things which are in a text log format, there are times when it's just unavoidable. Sometimes, that's all you have, and somehow you have to make sense of it and turn it into something that's useful. This is what happened one of those times when I was roped into helping someone fix up a database server.

It was a common situation at a web hosting company. Someone is running a database-heavy web site, and it's starting to drag. It was the usual stack of Linux, Apache, MySQL and some programming language which doesn't change the story but probably started with a P. We had relatively good fortune on this day because someone had turned on the "slow query log" feature at some point in the past.

The problem here is that the log had captured far too much stuff. Either the threshold was too low, or this guy's site was seriously screwed up, or maybe both. We had to find the hottest of the hot spots, but how? Normally, finding the most frequent X is a simple Unix text processing problem. You get your list of stuff, send it through sort and then uniq -c, then sort -nr and look at the head. Easy!

That didn't work here. For whatever reason, the queries were split across multiple lines. This is where the other tech fetched me. He needed some way to do the "uniq" type trick but on multi-line queries. I thought about it for a few moments and decided that it was time to get down and dirty with shell pipelines and forget about trying to write a "real program" to deal with it. Wrangling all of that stuff with any of the languages I usually used for problem solving would have been even worse than using shell tools.

I decided that I would still use the sort | uniq | sort | head trick, but I needed to do it on some indirect representation of the queries. Instead of operating on the actual data which spanned multiple lines and was thus painful to digest, I'd just translate it into a hash. Then I'd look for whatever hash occurred the most and go from there.

Given this idea, the next thing to do was to split up the log file. grep removed the per-query stuff like time and date, and csplit with an appropriate regex did the rest, giving me one file for each query from the original file. Now that I had tons of files like this, I just ran md5sum against all of those files and kept that output in a temp file. Using sort+uniq as described above on the actual sums, I got the top offenders.

Now, granted, all I knew at this point was that some query with a given hash was the top offender, but had no idea what it actually was. It was easy enough to just go back to the temp file, look for any match for that sum, then grab the filename column and display its contents.

It wasn't particularly quick, and it wasn't pretty, but it did work. The best part is that I never had to use it again. If there had been any chance of this problem coming up again, I would have turned around and built something proper to handle the long-term case while everyone else used the short-term stopgap mess of scripts and shell commands.

For me, this is a reasonable way to work. You can put together a prototype to see if a given idea can even function, and then see what happens. If it gives results, you can turn your users loose with it, and they will totally understand that it's wonky and weird. After all, they probably just watched you whip it up from a blank slate, and they can put up with some strangeness when it's been delivered that quickly.

This buys you time to go off and do something with the things you learned about the problem space while coming up with that quick hack. Then you can use that knowledge to put it together in something which can stand up to repeated use and won't fall apart when slightly weird input comes along. At this stage, the basic technique is already known. Most of the time should be consumed by error handling and worrying about correctness matters like safely creating temp files.

Of course, there are plenty of times when someone will write something in that first style and then never carry on to the second part. This usually leads to a lovely situation where your users put up with it for a long time, and then someone decides to start "maintaining" it. Before long, you have a monster on your hands.

Sometimes, you just have to recognize that you are in fact working on a beast which was never meant to live this long and treat it accordingly. It's not often the right answer, but when it is, the pain and suffering should make it crystal clear.