Writing

Software, technology, sysadmin war stories, and more.
Saturday, August 10, 2013

The year-long bug

The age of a problem, bug report, or ticket can be a signal of complexity, but it must not be taken alone. There are other things which are important, and one of them is the individual who winds up working on it.

I've shown up in situations where there has been some "it does that" problem with a system that's been around for a year. Somehow, it gets assigned to me, and I start sniffing around, trying to figure it out. One thing I try to do is reproduce the problem. Does it even still exist a year later? Let's say it does. Yep, that weird thing definitely happens when you do this other thing. Guess I should try to fix it.

What happens next depends on what sort of system it is. Is it something common which exists everywhere? Is it something like Linux, or one of the BSDs? Apache or MySQL? sendmail, postfix, or qmail? Or, is it some proprietary system for which no analog exists on the outside? Is there any possibility I could have worked on this thing before, or is it entirely new to me?

Let's say it's an Apache system which is doing something weird. Odds are, I've played with something like it in the past. Even if this specific anomaly is new to me, at least the general neighborhood is familiar. There isn't a whole stack of knowledge required just to find things. That's when "domain experts" can be useful.

On the other hand, what if it's the big bag of proprietary stuff? It could be something which evolved organically to solve some problem, with its own rules, layers, protocols, user interface guidelines, precedent-setting decisions, and yes, bugs. Actually fixing that is going to take far more work. There's a whole body of knowledge which must be acquired to properly understand the context of the problem, and only then can a real fix be created.

Otherwise, you're liable to just slap yet another patch onto a system which might already be nothing but patches. Which one will be the one which finally brings the whole thing down? Or, you might do something which immediately conflicts with a decision made somewhere else, because they already decided they liked it this way. Also, if it's been open for a year, assigned to people who have already been working on this stuff for at least that long, the fact they never solved it in that time does not bode well. In theory, they're the ones who already have that body of knowledge, so whatever it is must not be a simple fix as far as any of them know, or they hopefully would have patched it already.

This might be the sort of thing you assign to a person for "trial by fire" purposes, where you give it to them because you want them to learn the entire stack and become a new member of that team. The path through solving this problem will establish the necessary bits of data in this new person's head which will make them productive later on.

I would not characterize such a task as the sort of thing you hand to a "tourist" -- someone who's just stopping by to help tidy things up, and is not expected to know the entire system. You might also call them mercenaries: you bring them in for a quick fix, but not a total system rewrite.

In such situations, it's the duty of the "tourist" to speak up and say something, and then take decisive action. Either they drop the tourist designation and commit to becoming part of that team, or they drop the task as inappropriate.

Anything else is just a recipe for sickness and stress.