Writing

Software, technology, sysadmin war stories, and more.

Sunday, May 27, 2012

Inspect a project carefully before you leap onto it

One time, I was becoming bored at my day job and went looking for other things to do which still involved programming. I figured maybe I could find something which could use my volunteer help as a part-time sort of deal. I knew about this other group of people who were developing the infrastructure stuff for a service I had previously used. They seemed like a pretty solid team from my view of them as a customer, so what better place to spend some time, right?

I got over there and had a meeting with the guy in charge. He said they actually did have a project which could use some help. I agreed to take a look. It was something intended to cut down on requests for support by end-users. It essentially ran a series of checks for best practices against an installation to see how things were doing. It would basically give users things to try before resorting to looping in people.

I thought of it as a "linter" for the infrastructure service, and figured that was a good idea. It also lined up with ideas I had in the past regarding better tools for happier users, so I started poking at it.

It turned out that I made a big mistake.

What I didn't realize was how they had built this tool. It was just a huge Python wrapper around a bunch of command-line tools which are normally run by end-users. When they turned "best practices" into a tool, they literally just mapped the same commands across into things run by subprocess (or popen12345, or whatever they call it today) and called them instead. Then it groveled through the human-readable output and parsed it. Yes, I'm serious.
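To make that pattern concrete, here is a rough sketch of what such a check tends to look like. This is not the actual tool's code: the "status tool" is a made-up stand-in (a Python one-liner that prints a report), and the output format and the name `check_replicas` are assumptions for illustration only.

```python
import subprocess
import sys

# Hypothetical stand-in for a real end-user CLI tool: a tiny Python
# one-liner that prints a human-readable status report.
FAKE_STATUS_TOOL = [
    sys.executable, "-c",
    "print('Service: directory'); print('Status:  OK'); "
    "print('Replicas: 3 of 3 healthy')",
]


def check_replicas(command=FAKE_STATUS_TOOL):
    """Run the tool and scrape its human-oriented output for replica health."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    for line in result.stdout.splitlines():
        if line.startswith("Replicas:"):
            # Fragile: depends on the exact wording "Replicas: N of M healthy".
            words = line.split()
            healthy, total = int(words[1]), int(words[3])
            return healthy == total
    # The wording changed or the line is missing, so the check cannot tell.
    return None
```

Note how the check only knows what the tool chose to print. If the report is reworded, or never mentioned the thing you care about in the first place, the wrapper has nothing to work with.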

I was freaked out by this. There are things which you just can't see through those existing command-line tools. They are built for ordinary situations, and really can't be extended to pick up the sorts of crazy things which can and did show up as misconfigurations.

For example, let's say you have a directory server (think LDAP) colocated with each cluster of machines. Chances are, you want things running on that cluster to speak to the local instance of that directory server by default. Otherwise, you're going out over the unpredictable Internet to get to a location far away.

Pointing a service at the directory server in some other cluster adds latency, and it also adds another failure mode, since now "take down the whole cluster" maintenance on either cluster will take out your service. It's a particularly nasty misconfiguration, since ordinary tools will not show it.

Now, with that said, the actual library code which runs underneath these command-line tools can get to all of this stuff, and thus anything which uses those libraries can find it. The problem is that the existing CLI tools did not, and thus, their wrapper which builds on top of that couldn't possibly find it.
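A minimal sketch of the difference, under invented assumptions: `ServiceConfig` stands in for whatever record the underlying library might hand back, and `cli_summary` stands in for what a command-line tool chooses to print. None of these names come from the real system.

```python
from dataclasses import dataclass


@dataclass
class ServiceConfig:
    """Hypothetical record a library call might return for one service."""
    name: str
    cluster: str            # cluster the service runs in
    directory_cluster: str  # cluster of the directory server it talks to
    healthy: bool


def cli_summary(cfg):
    """What a hypothetical CLI prints: health only, no topology."""
    return f"{cfg.name}: {'OK' if cfg.healthy else 'FAILED'}"


def find_remote_directory_users(configs):
    """Library-level check: flag services whose directory server lives elsewhere."""
    return [c.name for c in configs if c.cluster != c.directory_cluster]


configs = [
    ServiceConfig("frontend", "east", "east", True),
    ServiceConfig("billing", "east", "west", True),
]
```

In this sketch, both services look identical through `cli_summary`, while `find_remote_directory_users(configs)` picks out `billing` because it can see fields the printed output never exposes.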

I was faced with retooling their entire helper program to use direct library calls for everything in order to add one dumb little helper. This is the sort of thing which should have been designed that way from the start, and it definitely was not what I was looking to do. Instead of adding something helpful, I had walked into yet another project with fundamental design issues that would need to be addressed before useful work could proceed.

I went back to their boss, apologized for taking up his team's time, and left the project. Fortunately, nobody gave me any grief for it.

Experiences like this have forced me to demand even more gory details about projects before committing to them. If it's already gotten to the point where "you can't get there from here", at least, not without tearing the whole thing up, I want to know about it in advance. Finding out later only leads to misery.

I share this so that others might benefit from my mistakes.