Writing

Software, technology, sysadmin war stories, and more.
Monday, January 9, 2012

Prove the badness of coworkers at your own risk

This is a tale from my cache of things to never do unless you are absolutely sure your boss is on your side. Having been through the sysadmin wringer in a few places, I have plenty of stories like this.

We had this cluster of machines which were responsible for tracking various bits of data for us. Some of them stored metadata for the cluster, and they were relatively special. There were rules about how they were provisioned. For instance, all of our metadata servers were supposed to run on identical hardware. This was intended to avoid anomalies which would jump around depending on which machine was running the show (the "primary master") that day. Having one that was slower or faster than the others could have complicated things.

There was another common rule: every cluster would have exactly n machines running as master. That value of n was static across our entire fleet. It might have been 4, 5, 10, or 20 -- it's not important to the story. The point is that all of them should have had that same value no matter what.

One day, I noticed that one of our cells had n+1 masters running. This would sometimes happen for a brief period if someone was purposely swapping out a machine. In this case, you'd add one, let it sync up, then drop the one you wanted to remove. If someone happened to look at it during that period of an hour or two, they'd see one too many.
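As a rough illustration of that swap order (add first, sync, then remove), here is a minimal sketch. The function name `swap_master` and the `wait_for_sync` callback are hypothetical and stand in for whatever tooling did the real work; nothing here reflects the actual system.

```python
def swap_master(masters, old, new, wait_for_sync):
    """Replace `old` with `new`, adding the new machine before removing the old.

    `masters` is a set of machine names. `wait_for_sync` is a hypothetical
    callback that blocks until the named machine has caught up.
    """
    if old not in masters:
        raise ValueError(f"{old} is not a current master")
    if new in masters:
        raise ValueError(f"{new} is already a master")

    grown = masters | {new}   # the cluster temporarily has n+1 masters
    wait_for_sync(new)        # this is the window where an observer sees one too many
    return grown - {old}      # back down to n masters
```

The n+1 state only lasts as long as the sync does, which is why seeing it for an hour or two was normal.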

This was no big deal. What I found unusual was that it stayed like this for multiple hours. Then multiple hours turned into days. That's about the time I got an idea: instead of cleaning it up again like I always did, I was going to leave it alone.

Moreover (and this is where I made my mistake), I told my boss about my experiment. I told him during a chat one afternoon that we had a misconfigured cluster, and while it wouldn't harm anything, it looked a little weird. I wanted to see who else had the awareness to pick up on it and the good sense to actually do something about it.

He basically nodded and made a mental note of it.

Weeks passed. I think we got about two months into this misconfigured situation when a mail finally came to us from some of the people who did housekeeping in that part of the world. They had been receiving an alert from one of their systems which looked for general anomalies, and hey, one of our clusters was stuck at n+1! They asked us to do something about it.
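A check like the one the housekeepers ran might look something like the sketch below. It is an assumption-laden illustration, not their actual system: the function name, the shape of `observations`, and the grace period are all made up for the example.

```python
from datetime import datetime, timedelta


def stuck_clusters(observations, expected, grace, now):
    """Return the names of clusters that have had too many masters for too long.

    `observations` maps a cluster name to (master_count, since), where `since`
    is when the cluster was first seen at that count (hypothetical data shape).
    A surplus shorter than `grace` is treated as a normal machine swap.
    """
    stuck = []
    for name, (count, since) in observations.items():
        if count > expected and now - since > grace:
            stuck.append(name)
    return sorted(stuck)


now = datetime(2012, 1, 9, 12, 0)
observations = {
    "swap-in-progress": (6, now - timedelta(hours=1)),
    "forgotten": (6, now - timedelta(days=60)),
    "healthy": (5, now - timedelta(days=90)),
}
flagged = stuck_clusters(observations, expected=5, grace=timedelta(hours=6), now=now)
```

With a grace period of a few hours, a swap in progress stays quiet and only the cluster that has been sitting at n+1 for two months gets flagged.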

Well, obviously, at this point, I could no longer just let it rot there as an experiment because now it was annoying real people. I ran a few commands and the extra master vanished a few minutes later. I replied to the housekeepers, apologized for the anomaly, and let them know it had been resolved.

Later, I raised this with one of my teammates. He said something about not knowing how to do this kind of thing. I mentioned that there was an automated system which did 90% of the work -- you just had to say "hey you, remove machine X for me", and it would do the rest.

To that, he just said "the documentation sucks". My response to that was "uh, well, I figured it out, so how bad could it be?" He didn't like that.

Next, I asked an honest question: even then, why was it automatically up to me to get these things to work? There was no division of duties on the team. Everyone was responsible for the system as a whole. Even when you weren't on call, there were things to check on and adjust from time to time. This was one of them.

Basically, I asked why he didn't take care of it. His response floored me.

"Oh, well, you always take care of it."

I wish I were making this up, but nope. Like all of my other stories about ramrods at this gig, this one is true. This really happened, and nobody stood up for me.

The final insult came some weeks after that. I got my performance review and my boss had decided to use this little event against me. He actually said something along the lines of "Rachel should not leave little things unresolved just to see if the rest of the team will fix them".

Got that yet? He took what I had set up as a little pseudo-managerial experiment to see just how lazy these people were and turned it against me. Meanwhile, nothing happened to the actual people who were lazy!

My advice is simple: unless you are the boss, or unless you are absolutely sure your boss won't turn around and do something hateful, conduct this sort of experiment in secret. You might talk about it with a trusted friend several levels removed from your team, but don't let the actual team find out.

Sure, they're lazy and they're in the wrong, but if you're working at a company where that's not a problem, nothing will ever happen to them. Meanwhile, if your company looks down on people who expose such situations, you will be stuck with the blame.

There is one tiny positive which came from this. I managed to establish that this particular manager was not to be included in certain things. He couldn't be trusted to do the right thing with that knowledge.