The making of my Apache binary logging project
One of my friends pointed out that all of this referrer business had come together quickly. I forgot mentioning my desire to clean up some of the parsing garbage in my life some days before, and had no idea she had any idea of how long I had been working at it. Really, it's not that difficult. Here's how something like this comes together.
Back on Monday, I realized that my web server's referrer logs are becoming more and more interesting. As people share links to various posts from Twitter or Hacker News or whatever, I see new things pop up. It's fun to follow them back and check out those conversations and occasionally contribute. If nothing else, it frequently gives me a place to go with a future post.
I've been doing this for a little while with a horrible hack. I had been running a little shell script which would use cut to get the fourth field using " as a delimiter. Then I did the usual piping through sort and uniq to get a list. That was then trimmed to what would fit on one screen without scrolling, and I kept it running in "watch".
Obviously, once I've seen a referrer and have investigated, it's not very interesting. So, I had to build a list of patterns and feed that to grep to remove them from my results. It works, but it's annoying. I wanted something a little more pointy-clicky, in other words.
There was just one catch: getting the data into that server meant groveling through the Apache logs, and that's a big bag of hurt.
On Tuesday, I had a look at some existing logging modules for Apache. The amount of random cruft in there scared me. It wasn't clear just where the web server ended and where the add-on code began. They were clearly written in a closely-integrated fashion. I decided to shelve that for the time being and started looking into parsing.
This is about the point where I ran into programs which look for literal "GET" or "POST" in a string and try to intuit everything else from that. This left me sufficiently cynical about the whole thing and I decided to try poking at it with a regex. Yes, this meant I now had two problems. I got a good chunk of the way through it but still felt really, really dirty. I wasn't anywhere near done but I could see all sorts of corner cases which still had to be handled. It was pointless busy work.
Finally, I closed the evening by writing a dumb state machine which would handle the basic rules: split on a space unless you're inside a "double-quoted element", \" is really ", and \\ is really \. That would give me a STL vector of strings, and then I could go to work on those, making sense of stuff like the "[12/Feb/2012:16:33:55 -0800]" date stuff split across two strings. Turning that back into a time_t was not fun.
The next day, I tested it and found that while it worked, it was ridiculously slow. Having to run strptime() and mktime() created tons of churning as it fetched tzinfo data for every single log entry. It took 2.5 seconds to go through 11,000 log entries. That's horrible.
I decided to take another look at the Apache module system. First, I took apart one of the existing log modules just to figure out where it got all of its data. I found the remote host, remote address, local address, and so many more things hiding in different structures. Some of them were just bare pointers to char* arrays while others needed to be accessed through helper functions. It didn't look too bad.
On Thursday, I managed to create something which would hook into the "log transaction" process in Apache and would shovel test data at the syslog. I didn't want to deal with actually opening files and all of that in there, since it's spooky multiprocess stuff inside their API. This little proof-of-concept module worked, but now I had a new problem. You build these things using a tool called "apxs" which uses the same compile settings as the actual web server. This way you get compatible binaries. That's all well and good, but it's C and I wanted to use C++, not the least of which was for the "real" protobuf library.
So now I had another problem: how do I make a C module and still use C++ code? My "solution" was unbelievably evil. My C module would be a tiny little shim which was just enough to talk to Apache and grab whatever data it needed. It would then dlopen() a C++ shared module that I also wrote, use dlsym() to find the entry point, and jump blindly into it to deliver the payload. Brilliant, right?
No, it's a horrible mess. Just because it's technically possible and makes people gape in awe doesn't mean I should inflict it upon the world.
Instead, I dug around and found a port of the protocol buffers library to C. It didn't work anything like the "official" C++ / Java / Python versions from you know who, but it did generate compatible output.
After doing a bunch of unrolling of existing modules, I had some idea of the flow: where things need to be placed, what calls I need to make, and when to actually grab the log data. By the end of the night, I had something which would wake up for each web hit, build a protobuf, wrap it in a Netstring, and dump it into a log file as a binary blob.
Finally, on Friday, I wrote a decoder for my Netstring file. Due to the nature of that encoding, it had to be a very small state machine to look for digits, then a colon, then that many bytes, and finally a terminating comma. Anything wacky causes it to just ignore that chunk and go into resynchronization mode. It doesn't just punt on the whole log, in other words.
Once it gets a full message, it kicks it over at the "real" (C++) protobuf library and out pops a fully-populated message. Now I have all of the data which was gathered for that web hit available to me in my own program. It's all socked away in nice variables with no ridiculous escaping required. Even if it has wacky control characters or evil attempts to screw up parsers, my program doesn't care, since it doesn't look at it that way. It's just a bunch of data now.
Now it was time to run a comparison. The new program jumps through a log file with 5200 entries in about 180 ms, which is about 35 microseconds each. Meanwhile, my "split into chunks and then interpret" scheme from Tuesday night goes through about 18120 entries in 4400 ms, or about 242 microseconds per entry. That's about 7 times slower.
My new scheme also gets way more data about a hit than the ordinary ASCII stuff. I've put in a whole bunch of fields just to see which ones turn out to be useful. This is what one example protobuf message looks like when expressed in debug form:
user_agent: "fred feed fetcher"
That time is just a time_t with 6 extra places of precision for fractional seconds. Apache has way more precision internally than you might realize, so I just carry it through as a 64 bit number.
This kind of stuff is easy when you take the time to make it right.
Does this sound interesting? Do you want to try it too? If so, send me a note through my contact page.
February 15, 2012: This post has an update.