Writing

Software, technology, sysadmin war stories, and more.

Sunday, May 20, 2012

Production change watcher

What do you do when you have a large distributed service which can run on computers all over the world and you need to keep track of changes? There are some more complicating factors here, like having groups of computers come and go over time. You want to know about instances of your code running out there without having to explicitly add or delete those locations from yet another config file.

Worse still, imagine there are about a dozen people who have security access to actually make changes to this global system, and they're not always good about logging what's happening. This is going on in parallel with your own on-call duties when you are expected to deal with problems right away.

My solution to this was to rig up a tattler of sorts. It would grovel around in a few places to find every possible cluster where jobs could run, and then it would poll each one for status. Any jobs it found which were owned by my service would then be noted. If it had seen them on a previous pass, then it would compare their current status to that of the last pass. Any changes would then be logged as having happened in that specific interval, along with the user listed as having made them.
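The compare-against-the-last-pass step might look something like the sketch below. This is not the original tool, just a minimal illustration: the names (JobState, Change, diff_snapshots) and the snapshot layout are all made up for the example, and it assumes each poll yields a dict keyed by (cluster, job).

```python
from typing import Dict, List, NamedTuple, Tuple


class JobState(NamedTuple):
    # Hypothetical shape of what a cluster reports for one job.
    status: str
    changed_at: int   # "last changed at" timestamp
    changed_by: str   # "last changed by" userid


class Change(NamedTuple):
    cluster: str
    job: str
    who: str
    window: Tuple[int, int]  # (previous poll time, this poll time)


Snapshot = Dict[Tuple[str, str], JobState]  # keyed by (cluster, job)


def diff_snapshots(prev: Snapshot, curr: Snapshot,
                   prev_time: int, curr_time: int) -> List[Change]:
    """Return the changes that happened between two polling passes."""
    changes = []
    for (cluster, job), state in curr.items():
        old = prev.get((cluster, job))
        if old is None:
            # First sighting: note it, but there is nothing to compare yet.
            continue
        if state != old:
            # Only the most recent changer is visible, so that is who
            # gets recorded for this interval.
            changes.append(
                Change(cluster, job, state.changed_by, (prev_time, curr_time)))
    return changes
```

Calling diff_snapshots with the previous pass and the current pass gives a list of Change records, each pinned to the interval between the two polls. Newly discovered clusters simply show up in the next snapshot, so nothing has to be added to a config file.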

Here's the catch: at the time, there was no audit log. All you could see was the current state of a given job, when it had last been changed, and who changed it. That meant you had to keep checking to catch a change by someone before a later change overwrote it.

It wasn't perfect, but it did work. I turned it into a simple web table where jobs were rows and clusters were columns, and each cell would change its contents (color and character) based on what had been going on lately.
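As a rough idea of how such a table could be built, here is a plain-text sketch. The real thing was a web page with colors; the thresholds, the cell characters, and the function names here are all assumptions picked for the example.

```python
from typing import Dict, List, Optional, Tuple


def cell_char(age_seconds: Optional[int]) -> str:
    """Pick a character for a cell based on how recent the last change was."""
    if age_seconds is None:
        return "."          # no change seen for this job in this cluster
    if age_seconds < 600:
        return "#"          # "hot": changed in the last 10 minutes
    if age_seconds < 3600:
        return "+"          # warm: changed in the last hour
    return "."              # old enough to ignore


def render_grid(jobs: List[str], clusters: List[str],
                last_change: Dict[Tuple[str, str], int],
                now: int) -> List[str]:
    """One output line per job (row), one cell per cluster (column)."""
    lines = []
    for job in jobs:
        cells = []
        for cluster in clusters:
            ts = last_change.get((cluster, job))
            age = None if ts is None else now - ts
            cells.append(cell_char(age))
        lines.append(f"{job:<8}" + " ".join(cells))
    return lines
```

Here last_change maps (cluster, job) to the time the watcher last logged a change for that pair, and now is the current time in the same units.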

This created some interesting patterns. Multiple changes applied to a single location would show up as a series of "hot" cells stacked in the same column. If, however, someone had just changed a given job in a bunch of places, that same "hot" pattern would stretch out horizontally in its row.

Being able to match this up with pages was great. If something in that table happened to match the jobs which started throwing alerts, then I could find out who did it and ask them what they broke. Normally, this sort of thing would slip away unless someone happened to notice the same userid in the "last changed by" field and similar timing in the "last changed at" fields.
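I did this matching by eye, but the same lookup is easy to sketch in code. Everything below is hypothetical: the LoggedChange record, the suspects function, and the 30-minute lookback are just stand-ins to show the idea of lining up an alert with recently logged changes.

```python
from typing import List, NamedTuple


class LoggedChange(NamedTuple):
    # Hypothetical record written by the watcher when it spots a change.
    cluster: str
    job: str
    who: str
    seen_at: int  # time of the polling pass that noticed the change


def suspects(changes: List[LoggedChange], alert_cluster: str,
             alert_job: str, alert_time: int,
             lookback: int = 1800) -> List[LoggedChange]:
    """Changes to the alerting job in that cluster shortly before the alert."""
    return [
        c for c in changes
        if c.cluster == alert_cluster
        and c.job == alert_job
        and 0 <= alert_time - c.seen_at <= lookback
    ]
```

Anything this returns is a person worth asking, not proof that their change caused the page.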

Sure, it was an evil hack caused by being unable to get a reasonable feed for change data any other way, but it was definitely useful.