Writing

Software, technology, sysadmin war stories, and more.

Monday, March 12, 2012

Revenge of the ASCII log file

Remember my parsing challenge from last month? It turns out that particular rabbit hole goes even deeper.

Who out there thought they could come up with a perfect Apache log parser that would handle all of the escaping and other shenanigans that come with it? Were you happy when you handled both \\ and \"? Well, I have another fun one for you.

aa.bb.cc.dd - - [12/Mar/2012:01:46:31 -0700] "GET /xyz/ Result: \xf4\xee\xf0\xf3\xec \xed\xe5 \xed\xe0\xe9\xe4\xe5\xed / \xed\xe5 \xf3\xe4\xe0\xeb\xee\xf1\xfc \xee\xef\xf0\xe5\xe4\xe5\xeb\xe8\xf2\xfc IP HTTP/1.0" 200 9903 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

That seems to be a half-witted attempt at forum spam, but it only managed to spit up a bunch of Windows-1251 code page garbage. It apparently translates to "forum not found / could not determine IP", for what it's worth.
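If you want to see that for yourself, here is a rough sketch of undoing the log escaping and then decoding the result. This is not my actual parser; the function name `unescape_apache` and the table of short escapes are assumptions made up for this example, and it only covers `\xHH`, the backslash-letter escapes, `\\` and `\"`.

```python
import re

# Matches \xHH, or a backslash followed by any single character.
_ESCAPE = re.compile(r'\\(?:x([0-9A-Fa-f]{2})|(.))')

# Short escapes assumed for this sketch.
_SIMPLE = {
    'n': b'\n', 't': b'\t', 'r': b'\r', 'b': b'\b',
    'f': b'\f', 'v': b'\v', '\\': b'\\', '"': b'"',
}

def unescape_apache(field):
    """Turn an escaped log field back into raw bytes (hypothetical helper)."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(field):
        # Copy the literal text between escapes unchanged.
        out += field[pos:m.start()].encode('latin-1')
        if m.group(1) is not None:
            out.append(int(m.group(1), 16))       # \xHH -> one byte
        else:
            ch = m.group(2)
            # Unknown escapes are passed through untouched.
            out += _SIMPLE.get(ch, ('\\' + ch).encode('latin-1'))
        pos = m.end()
    out += field[pos:].encode('latin-1')
    return bytes(out)

sample = r"\xf4\xee\xf0\xf3\xec \xed\xe5 \xed\xe0\xe9\xe4\xe5\xed"
raw = unescape_apache(sample)
print(raw.decode("cp1251"))  # the first three words of the request above
```

Note that the function returns bytes, not text. Picking cp1251 here was a guess based on what the bytes looked like; nothing in the log line says which encoding the client used.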

So, did your parser handle that contingency too?

Full disclosure: my own protobuf-based system was mildly annoyed by this because it turns out that "string" in protobuf-ese means "valid UTF-8 data". Obviously, CP1251 gunk is not UTF-8, so it was unhappy with that stuff at first.
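To make the complaint concrete, here is a tiny sketch of the same kind of check done in plain Python. It does not touch protobuf at all; `is_valid_utf8` is a hypothetical helper, and the byte string is just the first word from the request above.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

cp1251_word = b"\xf4\xee\xf0\xf3\xec"            # first word of the request, as logged
utf8_word = cp1251_word.decode("cp1251").encode("utf-8")

print(is_valid_utf8(cp1251_word))  # False: 0xf4 opens a multi-byte sequence, 0xee cannot continue it
print(is_valid_utf8(utf8_word))    # True: same text, re-encoded as UTF-8
```

Anything that insists on UTF-8 will trip over the first of those two, which is all a log file needs to ruin your day.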

I've flipped the spec over to "bytes" instead, since there is no way to guarantee valid UTF-8 data, and pushed a new version of protolog. It doesn't break binary compatibility with the log files, happily enough.
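The reason the old files still load is that protobuf puts "string" and "bytes" on the wire the same way: a tag with wire type 2 (length-delimited), a varint length, then the payload. Here is a hand-rolled sketch of that layout. It is not protolog code, and field number 1 is a made-up value for illustration, not the real schema.

```python
LENGTH_DELIMITED = 2  # wire type shared by string and bytes fields

def encode_varint(n):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_len_field(field_number, payload):
    """Encode one length-delimited field: tag, length, then raw payload."""
    tag = (field_number << 3) | LENGTH_DELIMITED
    return encode_varint(tag) + encode_varint(len(payload)) + payload

# Hypothetical field number 1 carrying the CP1251 bytes from the log.
encoded = encode_len_field(1, b"\xf4\xee\xf0\xf3\xec")
print(encoded.hex())  # 0a05f4eef0f3ec
```

Nothing in those bytes records whether the schema said "string" or "bytes". That distinction only exists in the generated code, which is where the UTF-8 check lives.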