Writing

Software, technology, sysadmin war stories, and more.

Monday, January 30, 2012

Doing the right thing when someone else's code breaks

Here's a scenario which is bound to arise in any large organization where software is a nontrivial part of the business. You will be the user of tools which are created by another team. They are responsible for dealing with features and bug fixes and periodically drop a release with a version number for users like you. It's not your source code, but you wind up indirectly using it all the same.

I saw a situation once where my team was relying on the releases of another. We had to keep a big service running which had many moving parts. In addition to the actual "always on" stuff, there were a dozen or so utilities which were also provided by the developers. These utilities allowed you to do various maintenance tasks on your instances of the service. Maybe you'd freeze or thaw something, run a backup, migrate some data, or run a restore. It's the standard data storage and manipulation stuff.

One morning, I came to work and discovered that there had been a restore request of some sort. It was fairly urgent, and the on-call person had handled it over the weekend. I found out that he had run into a problem, though: some aspect of the restore program did not behave like its siblings which did other tasks.

Apparently the developers had neglected to add the command line parsing code which let you supply hex values like 0xff. Instead, the restore program would only accept decimal values. This was surprising because all of their other tools already supported hex input. From all appearances it was just an oversight, since it made sense to support hex for restores too.
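To make the gap concrete, here is a minimal Python sketch of an argument parser that accepts both forms. It is not the actual tool's code: the story doesn't say what language those tools were written in, and the names `int_any_base`, `restore-tool`, and `--shard-id` are made up for illustration.

```python
import argparse


def int_any_base(text):
    """Parse "255" as decimal or "0xff" as hex (hypothetical helper)."""
    try:
        # Base 0 tells int() to infer the base from the prefix:
        # 0x for hex, 0o for octal, 0b for binary, plain digits for decimal.
        return int(text, 0)
    except ValueError:
        raise argparse.ArgumentTypeError(f"not a valid integer: {text!r}")


def build_parser():
    # "restore-tool" and "--shard-id" are invented names for this sketch.
    parser = argparse.ArgumentParser(prog="restore-tool")
    parser.add_argument("--shard-id", type=int_any_base, required=True)
    return parser
```

With something like this shared across all of the utilities, `--shard-id 0x1234` and `--shard-id 4660` would mean the same thing everywhere. A tool that uses plain `type=int` instead would reject the hex form, which is the sort of inconsistency described above.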

I was pleasantly surprised when I heard what had happened. Not only had the on-call person managed to get the restore to work, but he had figured out what the problem was with the restore program. He had checked out a copy of the source, sifted through it to discover the missing parser, and come up with a workaround to get his restore done.

The next part made me very happy. I was proud when I heard it.

After starting his restore, he figured out what was missing, found a copy elsewhere, and used it to fix the restore program. Then he sent a patch to the development team so that it might find its way into the next official upstream release.

That brightened my whole day. A lesser engineer might have punted on the whole thing way back when that restore refused to take the 0x1234-style hex value which was needed. They might have just reported failure and possibly filed a bug with the developers, then gone off and done something else.

This person figured it was simple, and gave it a look. When it turned out to be a trivial fix, he did it himself and later got it committed to the tree. This is what's supposed to happen every single time. You might not always come up with a fix, but you should at least give it a look instead of just giving up because "it's not my code".

The only sad part about this story is that it was unusual. I had witnessed far too many instances of someone punting without even trying to troubleshoot what was going on, and it had soured me on the whole environment.

As for the events of this story? They should be the norm. It's a joy to work with people who believe in really digging into problems.