Writing

Feed Software, technology, sysadmin war stories, and more.

Wednesday, February 8, 2012

ASCII logs are inappropriate for inter-program comms

Want to feel really, really dirty? Try writing something to analyze the Common Log Format used by things like Apache without cheating. I've been looking at some of the software which exists to do things like this already, and far too many of them do some awful hacks to figure out what they're looking at.

For reference purposes, here's a sample request and the log entry it generates here on my system:

$ printf "GET / HTTP/1.0\nHost: rachelbythebay.com\nUser-Agent: \"Ooh, this is evil\"\n\n" | nc localhost 80

127.0.0.1 - - [08/Feb/2012:12:49:10 -0800] "GET / HTTP/1.0" 200 991 "-" "\"Ooh, this is evil\"" "rachelbythebay.com" "-"

Left to right, it's the host, username from identd (!!!), username from HTTP auth, date and time, offset from UTC, the main request as received from the client, the HTTP response code and size, whatever I supplied for a Referrer (nothing) and User-Agent, the virtual host name, and finally any cookies (none, in this case). This is essentially the "combined" LogFormat you see in Apache with two fields I've added on for my own debugging purposes.

I've seen parsers which go out and look for "GET" or "POST" or "HEAD" in the line and then jump around from there. Some of them also look for "get" or "post" or "head". These probably work most of the time but miss anything remotely interesting which is also valid in that spot: OPTIONS, TRACE, anything pertaining to WebDAV, and so on. They also fail on the basic premise of being able to deal with new things which come along without having to update the parser itself.

I saw one which looks for "(" and then works back towards the beginning of the line until it finds a double-quote character. What's it doing? Well, according to the source, it's trying to figure out the User-Agent string. This is all well and good for most cases where you get "Foobar/1.0 (WeaselOS 31.33.7 + Q extensions)", but it will completely fail if you get "wget/1.23" or "fred feed fetcher" or anything else of the sort.

Then there is the matter of those double quotes as field containers. Just by looking at my example, you should be able to see that there's the possibility for craziness. Clients are not to be trusted, and they can send along a " or even a \ embedded in the stream. Apache's logger will escape them by adding a \ in front, thus yielding \" or \\, but the log parsing stuff has to know about that!

I've seen a bunch of these things which basically split on double quotes and go on with life. They have no concept of dealing with an escaped double quote within a field or the escape character itself being escaped. If that kind of line comes along, they almost certainly get it wrong.

The one approach I have not yet seen is something which has a simple state machine and walks from start to finish, plucking off characters and dealing with them one at a time. It would have to know about normal collecting (where a space means it's done), escaped collecting, collecting characters inside a "quoted field" (where a space is part of the field), and so on. Once once all of that had been split up safely would it be possible to start reading meaning into the positions of this data.

So, now that I've said all of that, let me just say that it's all crap. Even the state machine idea at the bottom is a horrible idea in the bigger scheme of things.

None of this should even matter.

We're talking about data which is already in a nice ordered format somewhere inside the web server, which then needs to make it into a nice ordered format in another program. The only reason this mess exists is that we're bouncing that data through a human-readable format with arcane rules and potential ambiguities if those rules are ignored.

Stuff like this is why binary logging exists. These programs should be able to hand data to each other without all of this mayhem.

That being said, try finding a log module which will give you a reasonable binary stream in a file, on a pipe, or through a socket instead of trying to push it straight into a database.

Yeah, good luck with that.


February 10, 2012: This post has an update.