Writing

Feed Software, technology, sysadmin war stories, and more.

Sunday, August 21, 2011

Log your production changes, but know your users, too

I was once on a team which was responsible for keeping a globally distributed and replicated database happy. It had dozens of parts replicated hundreds or thousands of times spread across the world, and quite a few people poking at it. It also had no way to answer exactly when certain administrative changes had happened. I decided to do something about it.

As I was moving onto this project, it became clear that there were a number of gaping holes on the administrative side. One of them was the lack of an overview for versions, configuration details, and so on that I handled with colored stripes from MD5 sums. While working on things, I found myself creating a bunch of dumb little tools to do repetitive tasks. I also found myself wondering exactly when I had performed some operation in the past, and decided to add some instrumentation to those tools.

At first, my tools would just drop a line into a flat file which was living in my home directory. I made it world-writable in case anyone else happened to use those same tools. This got things going without anyone really realizing what had happened. It wasn't until something came up and I was able to answer it by grepping my file that someone said, hey, what's all this?

I decided it had outgrown my ridiculous world-writable flat file, and decided to turn it into a rudimentary service with a real RPC mechanism. This is when the bikeshed people came out of the woodwork. They wanted me to use this thing, or do it that way, or whatever, even though it was completely off their radar before that point.

One of the solutions they wanted me to use was completely wrong. It wasn't accessible from large parts of the network due to some ridiculous design decisions. It also only spoke a network protocol which was only really supported in one programming language, even though the company had its own network protocol which worked on everything you would ever run in production. Finally, and I wish I was making this up, it was named after a porno movie.

I wound up rejecting this completely, and instead wrote a small, stupid program to just sit there, receive incoming RPCs using the standard company protocol, and write them to permanent storage. This kept the behavior of "anyone can append to the log" without the world-writable specter of "anyone can edit the log" which had previously existed.

But, the bikeshedders were not done. They wanted a schema for this stuff. Yep, we hadn't even really started using it yet, and they wanted to categorize things. I've always found this to be ridiculous. It's like the people who go out and make a huge hierarchy on their web site for "all of the stuff they are going to do", even though there's no content. It makes no sense. You haven't created it yet, so how can you possibly know what the hierarchy will be?

My position was firm: I'm not doing any of that crap. My logging was, and will continue to be, a line of text. The idea is to be dead simple so that people might actually use it. See, I knew my audience. I knew there were a bunch of macho cowboy ramrod types who would be working on these things, and they would chafe at the idea of "you must log your stuff". So, they'd do just the minimum. With that in mind, what do you think they are more likely to actually do?

Is it this...

$ util log started frobulator on sweden bravo 5 to dodge maint in alpha 1

... or is it this ...

$ util log -l sweden -s bravo -n 5 -i frobulator -A start

The second one turns into a nice happy schema, sure. You get a column for "location" (sweden), another for "building" (bravo), a third for the cluster (5), a fourth for the server name (frobulator) and finally a fifth column for the action taken (start).

You also probably never get that data at all, because who's going to memorize all of those switches? Odds are, the people who already weren't logging their changes aren't going to start doing something this fiddly.

On the other hand, the first one gets the point across. It's not automatically chopped up into those columns, but who needs it? The actual use case for this log was when something went wrong. You'd go back and look at it by time and see what might have happened during a certain interval. As long as you can parse what it said there, it's useful.

Obviously, the automated logging done by programs which already know all of these fields could be laid out in some way that it could be automatically extracted later. That's not the same thing. This was about logging what humans did in the course of running a system, and forcing schema onto a bunch of fundamentally lazy creatures is just a bad idea.

Finally, I played a trick on you. The second example completely misses why that human made a change. They started bravo 5 because alpha 1 was going offline. The second log format captures what they did succinctly but didn't have any of the "why". The first one gets all of it.

It's just like commenting a program. "Adds 1 to i" is stupid and useless. "Adjusts for fencepost error in data from stupid vendor" is better. One is the "what". The other is the "why". Guess which one is more useful.