Writing

Software, technology, sysadmin war stories, and more.

Tuesday, June 12, 2012

Troubleshooting and laziness

People who read my posts sometimes ask me if I know why some people on one of my old support teams used to resent me. These are the kinds of coworkers who starred in my anti-pattern theater post from last December, for those who are keeping track.

I get questions like this: do you have any idea what was going on? What did you do to make them hate you so much? Well, by going back to some of my old chat logs and personal diary entries from those days, I can get a pretty good handle on it. Since things were logged mere minutes or hours after an event happened, the details tend to be far sharper than anything I could recall from memory today.

One event in particular happened at one of the weekly meetings where the developers and pager monkeys (like me) would sync up. I had added something to the agenda which needed to be discussed: "Troubleshooting: what is, what isn't", and below that, "supposed pipe #2 badness".

Now, I obviously can't tell you exactly what happened, so here's an analogy. We were plumbing water around using different types of pipes. Each pipe had different guarantees about how much water it could move on behalf of all of its users, and once exceeded, it would "lose" the water -- it wouldn't make it to the other end.

The pipes were rigged such that #1 was the top priority, #2 was next, then #3, and finally #4 was the lowest of the numbered pipes, with this swampy unnumbered mess under that. All of them, even the "swamp", moved water, but if push came to shove, a higher-priority user would win out.
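If it helps to see the priority scheme as code, here is a minimal sketch. It is only a toy model of the analogy, not the real system: the function name, the shared capacity figure, and the demand numbers are all made up, and it simplifies things by treating capacity as one shared pool handed out in strict priority order.

```python
# Toy model of the pipe analogy. All names and numbers are hypothetical.
PRIORITY_ORDER = ["#1", "#2", "#3", "#4", "swamp"]  # highest priority first


def allocate(total_capacity, demand):
    """Hand out shared capacity in strict priority order.

    demand maps a pipe name to how much water its users want to move.
    Returns (delivered, lost), two dicts keyed by pipe name.
    """
    delivered, lost = {}, {}
    remaining = total_capacity
    for pipe in PRIORITY_ORDER:
        wanted = demand.get(pipe, 0)
        moved = min(wanted, remaining)  # higher priorities are served first
        delivered[pipe] = moved
        lost[pipe] = wanted - moved  # whatever doesn't fit never arrives
        remaining -= moved
    return delivered, lost


delivered, lost = allocate(100, {"#2": 60, "#4": 30, "swamp": 40})
```

In this made-up run, #2 and #4 get everything they asked for, and the "swamp" is the one that loses water once the capacity runs out.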

Got it? Okay then. Moving right along.

So, there had been this thing going around. It was almost a meme. Basically, people had been telling each other that there were leaks in the #2 piping, so they weren't investigating weird delays and other anomalies in our systems. It was a problem for the pipe people to maintain, and we weren't them, so we could just sit back and wait for it to be fixed, right?

When it came time to discuss this item, I started by saying I had made a mistake. I had actually screwed up and had us configured to use the #4 pipes. This mistake was possible because we were actually connected to *all five* pipes at the same time, and had our own plumbing to pick which one to use for various tasks.

Anyway, as a result, none of the meme-swapping about the #2 pipes being leaky mattered worth a damn. We weren't using them at all!

I continued and said that I had been asked to fix it, so I set my mistake right and configured us to use the #2 pipes as originally intended. However, I didn't stop there. I started looking at the actual problem instead of just saying "oh, it's lossy". It turned out there was a bug! No matter which pipe you told the system to use, it would actually use the "swamp" to get stuff done.
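The general shape of that kind of check is simple: compare what you asked for with what is actually happening, instead of trusting the config. Here is a rough sketch under invented assumptions. The task names and both dictionaries are hypothetical stand-ins, and how you would actually observe which pipe is in use depends entirely on the system in question.

```python
def find_mismatches(requested, observed):
    """Return tasks whose observed pipe differs from the requested one.

    requested and observed both map a task name to a pipe name.
    A task missing from observed counts as a mismatch (observed as None).
    """
    return {
        task: (wanted, observed.get(task))
        for task, wanted in requested.items()
        if observed.get(task) != wanted
    }


# Hypothetical data: we asked for #2 everywhere, but one task ended up elsewhere.
requested = {"backup": "#2", "sync": "#2"}
observed = {"backup": "swamp", "sync": "#2"}
mismatches = find_mismatches(requested, observed)
```

With this made-up data, `mismatches` contains only the "backup" task, paired with what was requested and what was seen.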

I checked with the creators of the system to see if I had misinterpreted something, and if it was supposed to do that. They said, no, it should use the level that you asked for. They took a look at our setup and agreed something was wrong, then went to work on a fix. Not long after that, we had a nice patch which took care of things.

I said "that is what troubleshooting is".

My fellow monkeys were not happy with me when I said that. I had more to say, and continued.

"I looked at the alerts, down to the fixtures, to the pipes, to the fittings, to the washers, and down deep into the original plans. That's what we're supposed to do!"

So another one of my fellow pager monkeys chimed in.

"Given that I was on call last week, I take this personally. I'm not about to go chasing every single water leak all the way down to the fittings and washers. I was told we were using the #2 pipes, and I found leaks in the #2 system."

Tough life, huh? All you need to do is find the shortest path to a resting state. Just have someone come up with a plausible excuse that blames an outside party or component. Then, go and find a metric which shows it as ongoing. After that, when something comes up, point at the metric which shows something happening, and lean back and relax.

I'll say it again. Refuse to be mediocre.