Apache logging to binary protocol buffer files
Oh my. I knew it would be controversial. Wednesday's post about parsing things which largely resemble the Common Log Format used by Apache (and others) generated a bunch of responses. I got to see a bunch of sample code, including a few good attempts at figuring it out.
I didn't actually intend to send people into a coding frenzy, but what's done is done. I do appreciate all of the feedback, so don't be afraid to write in if you want to say something.
One notion was to look at (f)lex, yacc, or bison to build a proper parser in C, or boost::spirit in C++. This is certainly reasonable if you absolutely must consume an ASCII format file, but let's hope that doesn't come up.
I did learn something rather interesting. Apparently some schools have data structures and algorithms classes which use parsing of Apache's logs as assignments! That tells me these problems are pretty evil if someone went as far as to use them to torture students. The whole "deceptively simple" aspect makes them perfect for that kind of use.
Well, good news, everyone. It doesn't have to be like this. My other post from Wednesday had a hint at the bottom: binary logging would avoid all of this. We'd just need to get the data out of Apache and into a nice machine-readable format. It wouldn't need to be consumable by humans!
The problem was that it looked like nobody had ever done that.
I'm happy to report that this evening has been rather productive. I wrote an Apache module which will grab a whole bunch of interesting things regarding a just-completed request and then serialize it in a practical binary format: Google's Protocol Buffers.
For those who haven't experienced these things yet, they are one of the nicest parts of Google 1.0, and fortunately they escaped into the world of free software/open source. They are a simple and efficient way to describe a bunch of structured data, serialize it, then read it back somewhere else and use it sanely.
The original implementation is for C++, Java, and Python, and it's since shown up in a bunch of other places. The most important port done by other people is the one in C, via the protobuf-c project. It's particularly important for this project since Apache modules are normally built as C code with gcc, not g++, so the C++ library isn't a practical option.
The solution is simple enough: Apache calls me, I create a protocol buffer structure, serialize it to a string, then write it out. There's just one little part where it gets interesting, and that's dealing with a bunch of these records stored in series within a single file. Just writing a bunch of protobufs back to back is no good: a serialized message doesn't carry its own length, so you can't tell where one ends and the next begins.
In some parts of the world, they have a little library which will deal with this by turning these things into records which can be used for I/O. Cough. Out here, there is no such thing, so I had to create it.
As a result, the actual binary log entries are formatted and stored as netstrings. In short, it's a bunch of ASCII digits representing the length, a literal ":", then that many bytes, and finally a ",". There's no escaping or other funny stuff, and it's about as low-tech as you can get.
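To make the framing concrete, here is a minimal sketch of netstring encoding and decoding in Python. It is not the module's actual code (that is written in C); the function names encode_netstring and read_netstrings are made up for illustration, and the payloads are arbitrary bytes standing in for serialized protobuf records.

```python
import io


def encode_netstring(payload: bytes) -> bytes:
    """Frame one record as <decimal length>:<payload>, (hypothetical helper)."""
    return str(len(payload)).encode("ascii") + b":" + payload + b","


def read_netstrings(stream):
    """Yield each payload from a binary stream of back-to-back netstrings."""
    while True:
        # Collect the ASCII digits of the length, up to the ':' separator.
        digits = b""
        while True:
            ch = stream.read(1)
            if ch == b"":
                if digits:
                    raise ValueError("stream ended inside a length prefix")
                return  # clean end of file between records
            if ch == b":":
                break
            if not ch.isdigit():
                raise ValueError("unexpected byte in length prefix")
            digits += ch
        if not digits:
            raise ValueError("missing length prefix")

        # The length says exactly how many bytes to take, so the payload
        # may contain ':' or ',' or anything else without escaping.
        length = int(digits)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("stream ended inside a payload")
        if stream.read(1) != b",":
            raise ValueError("missing trailing comma")
        yield payload
```

A writer would append encode_netstring(record) to the log for each request, and a reader would loop over read_netstrings(open(path, "rb")) and hand each payload to the protobuf parser. The trailing comma carries no information the length doesn't already provide, but it works as a cheap sanity check that the reader is still aligned with the record boundaries.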
My days of groveling through ASCII web server log garbage are over at last. My only question now is: who's with me? Do I open Pandora's Box and release the source and .proto definition?
Not bad for a night's work if I do say so myself.
February 15, 2012: This post has an update.