Writing

Software, technology, sysadmin war stories, and more.
Friday, November 23, 2012

The bogus system administrator hiring ability curve

Thinking back, I realize that system administration is treated as a lesser skill in some places, especially when compared to the algorithm-slinging, math-wielding software engineers those places normally hire. I found out that one such place was doing something weird in how hiring worked, and it resulted in a lot of unnecessary pain.

The hiring process was explained to me as a two-dimensional chart. Imagine the Y axis is a candidate's ability to be a system administrator, and the X axis is their ability to be a software engineer. This is all relative to the company's (probably flawed) metrics used for assessments, but go with me here. If you take those two abilities and plot them on a graph, you end up somewhere in space.

[Figure: hiring graph]

If you plot a curve from one axis to the other, you get a cutoff which can be used for hire/no-hire decisions, and this is effectively what they did. I've provided a sample chart which attempts to duplicate what I saw when this was explained to me.

Their assertion was that you might think they were looking for candidates at position 4: high SA skills and high SWE skills both. They probably are, but they don't really find people like that. Something to do with not enough people in the world, perhaps, but I digress. While looking for people with both skills, they find a bunch who are at position 2: medium SA and SWE skills. Those people fall inside the cutoff and are rejected.

That leaves candidates at positions 1 and 3. Position 1 is heavy on sysadmin skills but weak on software engineering skills as far as they know, and position 3 is the other way around. But, because of this scheme they've devised, they say "you're so good at this one thing that it excuses your lack of ability in the other". In other words, since you're out past their imaginary line, you're good to become a combo sysadmin+programmer sort who wears a pager and maintains high-reliability services.
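To make the scheme concrete, here is a minimal sketch of the cutoff as code. Everything numeric in it is an assumption for illustration: the 0-to-1 scores, the coordinates picked for the four positions, and the choice of a quarter circle of radius 0.8 as the curve. The post only says there was a curve running from one axis to the other.

```python
import math

# Hypothetical cutoff radius on a 0..1 scale; the real chart had no numbers.
CUTOFF_RADIUS = 0.8


def outside_curve(swe: float, sa: float, radius: float = CUTOFF_RADIUS) -> bool:
    """Return True if the (swe, sa) point lies beyond a quarter-circle cutoff."""
    return math.hypot(swe, sa) > radius


# Rough stand-ins for the four chart positions (x = SWE score, y = SA score).
POSITIONS = {
    1: (0.2, 0.9),  # high SA, low SWE
    2: (0.5, 0.5),  # medium at both
    3: (0.9, 0.2),  # high SWE, low SA
    4: (0.9, 0.9),  # high at both
}


def decisions() -> dict:
    """Map each position number to a hire/no-hire decision."""
    return {
        number: "hire" if outside_curve(swe, sa) else "no-hire"
        for number, (swe, sa) in POSITIONS.items()
    }


for number, decision in decisions().items():
    print(number, decision)
```

With these made-up numbers, only position 2 lands inside the curve. The score behaves like a distance from the origin, so a big enough value on one axis carries a candidate past the line no matter how small the other value is.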

How this works in practice is that any time someone shows up in the hiring pipeline at position 3 (high SWE, low SA), they are shunted to the pager monkey side of things. That side of the company gets first crack at them, and can "steal" them away from the usual pipeline which is there to hire ordinary programmers. When these people get their job offer, it says they are to become a site reliability engineer blahblah, and they don't know any better, so they accept.

They show up a few weeks later and are essentially handed a pager and the keys to a global service with many moving parts. This is when things get interesting.

More than a few of these people have no idea how to act as the administrator of one Linux box, never mind tens of thousands. I never would have suspected this at first given the supposed thorough vetting the hiring process gave them, but it happened. I only figured it out well after the fact. Here are the sorts of things which happened.

One guy was brought on in this fashion and was probably caught off guard by this. After all, ordinary programmer types at this company don't get on-call duties with a pager unless they're on the cusp of going live and haven't earned their own separate pager monkey team yet. This guy shows up, is handed the beeper, and probably starts wondering what's going on. He might have been a good programmer, but he had no business trying to run the actual boxes themselves.

For his starter project, he was asked to take a matched pair of machines running some crusty Linux distribution and see to it that they were upgraded or replaced as appropriate so that they could run 64-bit binaries. Upstream processes had started shipping us 64-bit versions of certain tools, and we could no longer use our non-x86_64 machines to get things done.

This should be a simple affair. If the existing machine supports it, then you move services off to somewhere else, reinstall it with the new 64-bit flavor of the OS, and move services back. Then you verify it's all stable and happy, and then do the same thing with its twin. If for some reason you can't upgrade them (old hardware), then you stand up a new machine, migrate things, swap them around, and kick out the old box. Then you do the same thing for its twin.

He was handed this project and just basically sat on it for a good week or two with no apparent work being performed. When I pinged him to ask about it, he pasted the output of "uname -a", pointed out something which said "64" in that line, and said he was done. Yes, that's right, he just found a "64" in a convenient place and declared victory. He didn't do anything else on the machine.

I logged in and tried to run one of the newer 64-bit binaries. It failed just like it had before, with a complaint about a bad interpreter or unknown binary format or something like that. This is expected, given that nothing was done to the machine. I pasted in the result of that and threw the task back to him.
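The real test is the one above: run the binary and see what happens. As a rough companion to it, here is a small sketch that reports what the kernel says about the machine and whether a given file is a 32-bit or 64-bit ELF object. The helper names are made up for this example. The only facts it relies on are the ELF magic bytes and the class byte that follows them.

```python
import platform
import sys

ELF_MAGIC = b"\x7fELF"


def elf_class(header: bytes) -> str:
    """Classify the first bytes of a file as '32-bit', '64-bit', or 'not ELF'."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return "not ELF"
    # Byte 4 (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit objects.
    return {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown class")


def describe(path: str) -> str:
    """Read just enough of the file at path to classify it."""
    with open(path, "rb") as handle:
        return elf_class(handle.read(5))


if __name__ == "__main__":
    # platform.machine() reports what the running kernel claims, much like
    # `uname -m`. It says nothing about the installed userland.
    print("kernel reports machine:", platform.machine())
    for path in sys.argv[1:]:
        print(path, "->", describe(path))
```

Pointing it at one of the new tools and at something already installed on the box, such as /bin/ls, would show whether the two disagree. A matching header still does not prove that the loader and libraries are in place, so it does not replace actually running the thing.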

Months later, I had another discovery about strangeness with this guy. Something was going on and he needed to do something. I basically said "just ssh as user foo to machine bar and you should be able to get in". We had this system of role accounts set up so you didn't actually log in as yourself. You might log in as a production user, or debug, or admin, or something like that. We also used SSH keys instead of passwords.

The way it worked was that you created your own SSH key pair and then pasted the public key into a web service. This would then propagate out to the ~/.ssh/authorized_keys files for all of the role accounts on all of the machines you should be able to reach. For the most part it got the job done.

Well, on this morning, he said something about how his logins weren't working. I had a closer look and saw that they definitely weren't, so I jumped into one of the machines and checked the authorized_keys file. It looked good, and it had the same public key he had published on the web service some months before. I asked him to confirm his key, and after some digging, it turned out the public key he had locally didn't match the one in the web service.
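The comparison described above can be sketched in a few lines. This is an illustration and not the tooling we actually had: the function names are hypothetical, and the parsing is deliberately simple, so an options field containing whitespace followed by something that looks like a key type would confuse it.

```python
import base64
import hashlib

KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-", "sk-")


def key_blob(line: str) -> str:
    """Return the base64 key field from a public key or authorized_keys line."""
    fields = line.split()
    for index, field in enumerate(fields[:-1]):
        if field.startswith(KEY_TYPE_PREFIXES):
            return fields[index + 1]
    raise ValueError("no key found in line")


def fingerprint(line: str) -> str:
    """SHA256 fingerprint in the style printed by `ssh-keygen -l`."""
    digest = hashlib.sha256(base64.b64decode(key_blob(line))).digest()
    return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")


def is_authorized(local_pub: str, authorized_keys: str) -> bool:
    """True if the local public key appears in the authorized_keys text."""
    wanted = key_blob(local_pub)
    for line in authorized_keys.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        try:
            if key_blob(line) == wanted:
                return True
        except ValueError:
            continue  # not a key line; ignore it
    return False
```

Feeding it the contents of the local .pub file and the role account's authorized_keys file answers the question in one step. The trailing comment on a key line is ignored, so only the key material itself has to match.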

I don't know how it happened, but somehow he had managed to break or change his SSH key pair. This by itself is not too unusual, since those tools aren't the friendliest things in the world. What I find more interesting is that this had happened months earlier and he never noticed. This guy had been on-call for weeks at a time during this period and the whole time would have been unable to ssh into a single machine in production.

Either the service was extraordinarily stable throughout that time (it wasn't) or he wasn't troubleshooting anything past the data he could get by looking at debug pages in his web browser... bingo! Forget about jumping on a bad machine to see if there are crazy things in the logs, or running strace against a flailing process to look for signs of insanity. Without a good key pair on file, you could do none of it, and that's where he was.

Somehow, this guy had managed to fake his way through a couple of months of being an on-call system administrator with no way to SSH to machines. That's pretty hardcore faking right there.

Some months after that, he packed up and transferred to another team where he could be just a coder and didn't have to do sysadmin/pager duties. Not long after that, another individual who was having similar issues performing such tasks also bailed out in the same fashion.

I wondered what was going on, and one day I got lucky and found the answer. I ran into one of the rare individuals who was actually closer to position 4 on my chart than either 1 or 3. He didn't even realize he was as good as he was at sysadmin work, but it happened to be that way in his case. He said he hadn't applied to be a pager monkey, and just wound up in that role. He had managed to adapt and obviously the other guys hadn't.

This led me to go digging, and I found the whole "curve" explanation. Clearly, these first two guys must have been seen as some kind of software engineering geniuses and were plucked from the incoming stream to become pager monkeys. They had no business doing that kind of work and both bailed out relatively quickly. Only this third person actually managed to make it work, and work well at that.

Note that as far as I can tell, the opposite "shunting" did not exist. That is, if someone was at position 1 (high SA, low SWE), they did not get yanked out of the sysadmin hiring stream for a life as a programmer. After all, how can a dirty, dirty system administrator be a good programmer, even if they're amazing at being an admin?

This is what happens when you pull a bait-and-switch at hiring time. You get people who might be lousy at the new job and who then transfer out as soon as possible. They mess things up while they're in the job, and they block a slot for someone who might actually be competent. What a disaster.