Writing

Software, technology, sysadmin war stories, and more.

Saturday, July 30, 2011

Computer scientist does not imply competent sysadmin

There are some people who have no business being sysadmins. For whatever reason, they are missing the kind of systems knowledge it takes to be useful. Unfortunately, far too many seem to think that just because they are some kind of genius with a Ph.D., they can run around and manage anything. This is then reinforced by hiring policies which say "you look like some kind of amazing computer scientist, so let's give you a pager!"...

I started discovering this the hard way by watching tasks get assigned and then either mismanaged or ignored entirely. I wasn't the boss, so I didn't have any particular leverage to make this situation suck less, either. I think it first came to light one week when we started running into the inevitable march of technology.

We had these systems which acted as proxies between us and the live systems. They were running some older version of Red Hat or some similar distribution which only supported 32-bit programs. This had been working fine, but eventually that limitation caught up with us. Some program in our tool chain from an outside source stopped being released in a 32-bit flavor, and now we needed 64-bit compatibility if we were going to stay current.

As a starter project, one of the newer people on my team was assigned a simple task: make it so that proxy machine #1 can run 64-bit stuff. I mentioned that he would probably need to shovel everything off to proxy machine #0, then rework #1, then bring it all back, then do the same for the other machine. It's sort of like how you do road construction: remove the load on the area you want to work on, work on it, then take over all of the load from the other side, work on that, then open it up for real. It's just obvious.
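For anyone who doesn't find it obvious, here is a rough sketch of that ordering in Python. None of this is what we actually ran: the function name, the `assignments` dict, and the `rework` callback are all made up for illustration, and the real "rework" was a full reinstall done by hand. The only point is the sequence: a machine gets touched only after the other one is carrying everything.

```python
def rolling_rework(assignments, first, second, rework):
    """Rework two hosts one at a time without dropping any service.

    assignments: dict mapping host name -> list of service names (hypothetical).
    rework: callable taking a host name; stands in for the reinstall.
    Returns the list of steps taken, in order.
    """
    original = {host: list(services) for host, services in assignments.items()}
    everything = original[first] + original[second]
    steps = []

    for target, standby in ((first, second), (second, first)):
        # Shift the whole load onto the standby host before touching the target.
        assignments[standby] = list(everything)
        assignments[target] = []
        steps.append(f"drain {target} -> {standby}")

        # The target carries nothing now, so it is safe to take down.
        rework(target)
        steps.append(f"rework {target}")

    # Both hosts are done: go back to the original split.
    assignments[first] = original[first]
    assignments[second] = original[second]
    steps.append("rebalance")
    return steps


if __name__ == "__main__":
    load = {"proxy0": ["svc-a", "svc-b"], "proxy1": ["svc-c", "svc-d"]}
    for step in rolling_rework(load, "proxy1", "proxy0", lambda host: None):
        print(step)
```

The callback only ever sees a host with an empty service list, which is the whole trick.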

A couple of days went by, then a week. I checked in on the bug and asked how we were doing. He said, okay, it works! I was suspicious immediately. Why did he respond so quickly after being asked? If it was really done, why didn't he say so back when it happened, rather than when I asked?

I logged into the box and tried to run the new binary. Not surprisingly, it did not work. The system also looked ... familiar. Too familiar. It looked like it hadn't even been reinstalled. I figured, okay, that's a neat trick. I doubted he had found a way to upgrade that crusty old Red Hat-ish install in place, since I knew that wasn't our approved corporate method for getting to 64-bit builds.

Finally, I ran uptime. I figured, okay, if you really did work on this box, you would have rebooted it at least once.

Uptime said the machine had been up about 250 days.

This guy had done nothing. He had taken the bug, flailed around a bunch, and then declared victory when someone finally came to ask about it. He didn't even try running the program we needed to run on that machine, which was the very reason for the request. This was obvious because he would have seen the same thing I did: that it did not run at all!

I pasted the uptime result into the bug, along with the proof that this program did not run, and basically said "this (still) doesn't work". I stopped short of accusing him of not actually doing anything to the system and just sitting there for a week.
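The uptime check is simple enough to write down. This is a minimal sketch, not a tool we had: the function names are invented here, and reading `/proc/uptime` assumes a Linux box, where the first field of that file is the number of seconds since boot. If the boot time is older than the day the work supposedly started, nobody rebooted the machine for that work.

```python
from datetime import datetime, timedelta


def read_uptime_seconds(path="/proc/uptime"):
    """Return seconds since boot. Linux only: first field of /proc/uptime."""
    with open(path) as handle:
        return float(handle.read().split()[0])


def rebooted_since(uptime_seconds, work_started, now):
    """True if the machine booted at or after the time the work began."""
    boot_time = now - timedelta(seconds=uptime_seconds)
    return boot_time >= work_started


if __name__ == "__main__":
    now = datetime.now()
    a_week_ago = now - timedelta(days=7)  # hypothetical start of the task
    print(rebooted_since(read_uptime_seconds(), a_week_ago, now))
```

With about 250 days of uptime and a task that was handed out a week earlier, this comes back false.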

Somehow, he eventually found out that yes, you have to reinstall the machine as a totally different distribution to get to 64 bits. That was our corporate strategy. The old RH flavor was to be replaced with something else. Thirty seconds of reading the official internal docs would have revealed that. If finding that was too hard, 60 seconds of just asking someone for help would have resolved it. This guy did neither.

I'll spare the details of the craziness which came from moving all of the stuff to proxy #0 and then to #1 and then re-splitting it to balance across both as they were reinstalled. It's sufficient to say that it was at least as painful and awkward as everything already described here.