Software, technology, sysadmin war stories, and more. Feed
Sunday, April 1, 2012

Incremental revisions might not scale forever

Let's say you have a web server and it's generating the usual log files. You're curious about where people are coming from, so you focus your attention on the referrer data. You write a quick shell script to extract that field from your logs (however that happens is not important), and pretty soon you have the raw data.

At first, your web server isn't getting much traffic, so there's not a whole lot to see. You can just run the script and look at the output. Maybe you use a little of your scrollback buffer or you pipe it into "less" in order to paginate things. Either way, it's relatively small.

Over time, traffic starts picking up, and now you have a lot of stuff coming in. You can't just look at the raw list. Now you need to use sort and uniq to squish duplicates down to a string and a count. This works pretty well, and soon, getting a bunch of visitors all from the same place doesn't give you tons of entries. You just get one entry with an increasing number and life is good.

After that, your site gets even more popular, and now bunches of different sites are linking to you. Your "sort|uniq" squishing magic is no longer enough to make it a meaningful list since there are just too many entries. Now you also have to sort it *again*, this time by hit count, and then use "head" to just pluck out the top hundred or so. The others just disappear as noise.

Sooner or later, this stops working as well. You get tired of seeing the same old entries in the list, since you know they link to you, and it's no longer very interesting to see it again. So okay, let's feed this thing through grep -v to strip out a few. This works well enough, and you add some more. Eventually, you move on to having a whole file of patterns you don't care about, and you have grep use that as its input instead of the command line.

Now you have a result which is so meaningful and useful that you want to watch it change all by itself. So, you take this huge mass of commands and run it in "watch". That just runs it every couple of seconds and puts the latest results on your terminal. You leave it running and now you can see what's going on as it happens.

Strange things start happening, though. It starts taking longer and longer for your page to refresh when something happens. You'll make a change to your "grep -v" file and it won't just take effect any more. Even though you have a polling interval of 5 seconds, it might take 20 or 30 seconds for it to show up.

Also, other things on the server seem to be slowing down, and there are these strange spikes in your CPU utilization as tracked by things like "sar". One day, you catch it: grep is sitting there with 99% of the CPU. When you dig deeper, you find out it's not just any grep, but the grep you're running on your referrer data.

You are now deep in a hole of your own creation. Now what?

Obviously, you have to understand what's going on to even try to fix it. Basically, you have a scheme where you re-process the same data every couple of seconds. At first, it was able to finish within the relatively long windows you gave it when you ran it by hand, and then it even kept up during the early days of using "watch", but eventually it caught up with you.

All of that log extraction, squishing down to unique entries, counting, sorting by those counts, removal of uninteresting entries and display on the screen happened every time your script was restarted by watch. It didn't matter that you had processed nearly all of that data just seconds before. It was all tossed out just as soon as those tools finished.

So, instead of having to process the latest little bit of data, you now have to process all of it every time, and "the little gets more and more" as the log gets bigger. You didn't intend for this to happen, but here you are, and now something must be done about it.

One solution might be to arrange things so that you don't throw out the results of those calculations and instead keep them around. This might work by saving a snapshot at the very end and reloading from it. That might be okay for something which might run on the order of once or twice an hour, but it seems rather wasteful for this kind of task.

Instead, it looks like having a stateful process hanging around might be a better idea. It would just start up, read through the log, do whatever processing you need, and then start polling the log for new data just like "tail -f". Every so often, it would render the latest report to your designated output device.

This approach has a burst of CPU activity when it starts up, but then it drops back to only that which is required to keep up with the changes. If you also have it write out an occasional snapshot which is cheaper to load in than the raw unprocessed data, then you might even be able to make restarts less expensive and faster overall. That way, you get to restart from somewhere within the log instead of having to start from 0 every time.

It seems unlikely that someone would manage to evolve a collection of text handling utilities and shell scripts into a stateful process. While they may have the same high-level processing logic happening, the whole way they manage data and persistence is different.

This is why sometimes you just have to stop and take a few steps back to see what you're really trying to do. I mean, I'm not trying to run a shell script as quickly as possible. Who cares about that? I want to see referrer data, and it just happened to be implemented with a shell script. That script started giving me troubles, so now I'll change to some other method which will give the same output.

It's sad just how many things come down to this. Leave it to Scrooge: work smarter, not harder.