Writing

Software, technology, sysadmin war stories, and more.
Sunday, September 25, 2011

Config file idiocy and poor production hygiene

Beware of sysadmin busy work. Beware of individuals who seek out sysadmin busy work. They might be doing it as a form of avoidance, or they might be doing it to get a meaningless bullet point on their internal resume.

A few years ago, there was this big system with a bunch of moving parts. It had a set of matching config files which all compiled down to a common low-level language. The specifics aren't published anywhere, but for the sake of this story, think of it like how sendmail uses M4. You have one file that gets mechanically translated to another.

Someone got the bright idea to start changing all of this stuff to another high-level language. There's no good analog of this in the sendmail world, but think of it as something else which also translated to the same CF format in the end. I said no to it, but nobody else would go with me on it, so it happened anyway.

The individual in question was supposed to make it a no-op change. That is, the output from the original config file and the output from the second language file should have been identical. That meant none of the moving parts would know that anything had changed.
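
For what it's worth, a no-op claim like this is cheap to check mechanically. Here's a minimal sketch, assuming you already have the compiled output from both the old and new source files as strings. The function names and the labels in the diff are made up for illustration; the real system's tooling isn't something I can show.

```python
import difflib
from typing import List


def is_noop(old_output: str, new_output: str) -> bool:
    """A conversion is a no-op only if the compiled outputs match exactly."""
    return old_output == new_output


def config_diff(old_output: str, new_output: str) -> List[str]:
    """Return unified-diff lines between the two compiled outputs.

    The list is empty when the outputs have the same lines. The
    fromfile/tofile labels are hypothetical names for readability.
    """
    return list(
        difflib.unified_diff(
            old_output.splitlines(),
            new_output.splitlines(),
            fromfile="compiled-from-old-language",
            tofile="compiled-from-new-language",
            lineterm="",
        )
    )
```

If `is_noop` comes back false, `config_diff` shows exactly which compiled lines moved, and the conversion should not ship until that diff is empty or every line in it has been explained.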

That was the deal. He didn't hold up his end of it.

Nope. Instead, one fine Saturday morning, I found myself getting paged mercilessly. The first four or so times it happened, I just rolled out of bed and poked at it long enough to shut it up. The fifth time, I declared it impossible to get to sleep and went on the warpath, seeking to fix what had actually broken.

I started looking at the location which was going bad. It had just been converted to the new config language. I didn't realize it at first, but large numbers of changes had been applied at the same time. As I started digging into whatever had happened, I started noticing all of these deviations from our standard configuration.

Part X was only supposed to get N% of a machine, not 100%. Part Y was not supposed to co-exist with part X if hardware item Z was not present. All sorts of random crap like this started turning up. Now I had a real problem: were all of the pages coming from the avalanche of stupid changes which had been sneaked in with the config change, or was it something else? I could not tell.
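
Rules like those two are the kind of thing a machine can check before a config ever reaches production. What follows is only a sketch under made-up assumptions: an instance is described by a mapping from part name to its percentage of the machine, plus a set of hardware items that are present. The names `find_violations` and `max_share_x` are hypothetical, and the actual value of N is not something I'm giving here.

```python
from typing import Dict, List, Set


def find_violations(
    parts: Dict[str, float],
    hardware: Set[str],
    max_share_x: float,
) -> List[str]:
    """Check one instance against the two rules from the story.

    parts maps a part name to its share of the machine, in percent.
    hardware is the set of hardware items present on the machine.
    max_share_x stands in for the unspecified N.
    """
    violations = []

    # Rule 1: part X may only get N% of a machine.
    share_x = parts.get("X")
    if share_x is not None and share_x > max_share_x:
        violations.append(
            f"X has {share_x}% of the machine, limit is {max_share_x}%"
        )

    # Rule 2: X and Y must not co-exist unless hardware item Z is present.
    if "X" in parts and "Y" in parts and "Z" not in hardware:
        violations.append("X and Y co-exist but hardware item Z is missing")

    return violations
```

Run against every instance as a pre-push step, an empty list means the config at least respects the known rules. It would not have caught everything that was sneaked in, but it would have flagged both of these.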

Given that it was Saturday morning and I was the one being annoyed by this as the on-call pager holder, I used that authority to declare martial law on that particular instance of the service. I ripped it down and put it back up with the known good config from the original config language. This eliminated the obvious X/Y/Z co-existence problems, but the original paging problem remained.

At least now I knew that it wasn't this guy's change which caused things to page me. With that out of the way, I proceeded to dig more deeply into the remaining parts of the system, and wound up discovering a new version of an external dependency had been pushed and it was breaking us. That wound up turning into a horrible hack to keep things alive while the team in question fixed their bug.

After that, even though I still did not agree with it, I put back the new-style config, warts and all, and opened issues (think trouble tickets) with the originator to get him to fix the mess he made.

Epilogue: about two years later, I found out some relatively new person on the team was working on a project... to convert BACK to the original config file format. Way to go, guys!

At least there is one good part to all of this: the person tasked with converting back (which should have never been necessary) had the good sense to quit. Clearly, he saw the writing on the wall.