protolog: binary logging for Apache using protocol buffers

If you're reading this, odds are good that you are looking for a way to log hits from your Apache web server without generating ASCII spew that is hard to parse later. If so, you are in the right place. Here are some notes about making it work.

Limitations

This software definitely does not attempt to solve every problem. If it works for you, great! If not, prepare to get your hands dirty.

Log path

The ProtoLog directive works both at the global level and inside VirtualHost containers. If a VirtualHost does not contain a ProtoLog entry, it will use the global log(s).
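As a rough illustration of that fallback, a config might look like the fragment below. The ProtoLog directive and the global log path come from these notes; the host names and the per-host log path are made up for the example.

```apache
# Global log: used by any VirtualHost that has no ProtoLog of its own.
ProtoLog /var/log/httpd/proto.log

<VirtualHost *:80>
    # Hypothetical host with its own log file.
    ServerName www.example.com
    ProtoLog /var/log/httpd/example-proto.log
</VirtualHost>

<VirtualHost *:80>
    # Hypothetical host with no ProtoLog entry, so hits go to the global log.
    ServerName other.example.com
</VirtualHost>
```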

Dependencies

protobuf-c

You'll need protobuf-c installed to make this work. Some distributions package this and others do not. It's not hard to build from source, though. Once "protoc-c" is in your path, it should be usable to build protolog.

Apache development stuff

You also need a working installation of Apache with the dev stuff included. This may mean installing extra packages on your system. If you have "apxs" in your path, you're probably set.

Apache versions

I wrote this against Apache 2.2, and it's still working with Apache 2.4. It might also work on other versions. Your mileage may vary.

Build

You can use the provided Makefile, or just crib from it to add to your own build system.

Installation

After compiling the module, drop the protolog.so found in .libs into some path like /etc/httpd/modules, then add this to your Apache config somewhere:

LoadModule protolog_module modules/protolog.so

Then you need to tell it where to log. Use the ProtoLog directive. If you install this module and nothing happens, odds are you forgot to add this directive somewhere!

ProtoLog /var/log/httpd/proto.log

Adjust these paths to suit your system, naturally.

After that, restart your server and check the log file. It should pop into existence and then start growing automatically. Don't forget to teach your log rotation stuff about it if needed.

Encoding

The log file will contain a series of "netstrings", each containing a protocol buffer message which has been serialized to a stream of bytes.

A netstring is just this:

<length>:<data>,

Or, in other words,

5:hello,

That's it - there are no newlines, carriage returns or anything else of the sort to separate these messages. It is a fundamentally binary format and must be treated as such. In particular, *do not* attempt to use C str* functions on this data, as it will contain copious NUL bytes and you will have a very bad time.

Dig around on Wikipedia or cr.yp.to if you want to know more about this format, but it's really just that simple.

More on protocol buffers can be found at their GitHub repo.

Fields

The fields in apachelog.proto are primarily the things I found interesting while first writing this. It's trivial to add more, and it's also trivial to not use any of them due to the wonder of "optional".

The only thing you really do not want to do is change the number of a field after it has been used. If you do that, you will unleash the demons of uncertainty upon your logs. Don't do this!

Obviously, the potential for much badness exists here if people fork this thing like crazy, add a bunch of overlapping fields, and then expect compatibility down the road. All I can say about that is to repeat myself from before: don't do it. Work together to coordinate these things, and analysis tool compatibility will be your reward.

Decoding

Once you have a simple state machine to extract the data from inside each netstring, you can just pass the bytes directly to protobuf.

In C++, it looks like this:

  LogEntry le;
  if (le.ParseFromString(bytes_from_netstring)) {
    DoSomething(le.fieldname());
  }

Java and Python will be slightly different, but the same general principle applies.
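The state machine mentioned above could be sketched roughly like this. ExtractNetstrings is a made-up name, and this version assumes the whole log (or a whole chunk ending on a record boundary) is already in memory. Treat it as a starting point rather than a hardened parser.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a buffer of back-to-back netstrings into payloads.
// Returns false on malformed framing or if the buffer ends mid-record;
// payloads extracted before the problem are left in *out.
// Absurdly long length prefixes are not guarded against overflow here.
bool ExtractNetstrings(const std::string& buf, std::vector<std::string>* out) {
  enum class State { kLength, kData, kComma };
  State state = State::kLength;
  std::size_t length = 0;   // payload size announced by the current prefix
  bool saw_digit = false;   // a prefix needs at least one digit
  std::size_t pos = 0;

  while (pos < buf.size()) {
    switch (state) {
      case State::kLength: {
        const char c = buf[pos++];
        if (c >= '0' && c <= '9') {
          length = length * 10 + static_cast<std::size_t>(c - '0');
          saw_digit = true;
        } else if (c == ':' && saw_digit) {
          state = State::kData;
        } else {
          return false;  // junk where a length prefix should be
        }
        break;
      }
      case State::kData:
        if (buf.size() - pos < length) return false;  // truncated record
        out->emplace_back(buf, pos, length);  // byte-exact copy, NULs included
        pos += length;
        state = State::kComma;
        break;
      case State::kComma:
        if (buf[pos++] != ',') return false;  // missing terminator
        length = 0;
        saw_digit = false;
        state = State::kLength;
        break;
    }
  }
  // A clean end of input means we are waiting for the next record's prefix.
  return state == State::kLength && !saw_digit;
}
```

Each string that lands in the vector is one serialized message, ready to hand to ParseFromString as in the snippet above.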

Download

Bugs

Having this much data trivially available without parsing ASCII may make you realize how much of a scam so-called client-side "web analytics" scripting actually is. Oh wait, that's actually a feature.

Contact

You can send comments, questions, or whatever via my contact form.