Writing

Software, technology, sysadmin war stories, and more.
Monday, May 20, 2013

Sysadmin work teaches you the value of stacks

Before I knew to call it "yak shaving", I had a way to refer to all of the dumb things which would pile up while trying to accomplish something with my computers. I called it a "stack", as in the data structure, or like a pile of trays at a cafeteria. Every time you hit a new speed bump, you have to push something else on the stack and go work on that. Only once that's done can you "pop" it off and go back to the previous thing.

Of course, there's no guarantee that any given item will be the last one. It's entirely possible for them to have their own problems. By the time you get to what you set out to accomplish, you may have done quite a few other annoying things which got in the way.
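For anyone who would rather see that as code, here is a minimal sketch of the idea in Python. It's an illustration, not anything I actually ran back then: the function name `shave_yaks`, the `blockers` mapping, and the example tasks are all made up for the purpose. Each time the task on top of the stack turns out to have an unhandled obstacle, that obstacle gets pushed. A task is only popped once nothing is left in its way.

```python
def shave_yaks(goal, blockers):
    """Return tasks in the order they actually get finished.

    `blockers` is a hypothetical mapping from a task to the obstacles
    discovered while attempting it. Tasks with no entry just work.
    """
    stack = [goal]      # the pile of cafeteria trays
    seen = {goal}       # never push the same problem twice
    finished = []

    while stack:
        current = stack[-1]  # peek at the top of the stack
        pending = [b for b in blockers.get(current, []) if b not in seen]
        if pending:
            # Hit a speed bump: push it and go work on that instead.
            seen.add(pending[0])
            stack.append(pending[0])
        else:
            # Nothing left in the way: pop and return to the previous thing.
            finished.append(stack.pop())

    return finished


# Made-up example: each task hides another problem underneath it.
example_blockers = {
    "deploy the app": ["renew the certificate"],
    "renew the certificate": ["fix DNS"],
}
order = shave_yaks("deploy the app", example_blockers)
print(order)
```

The thing you set out to do always comes out last, and the deeper the stack gets, the longer it waits.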

About 10 years ago, I was on a business trip to a customer site. I had a bunch of boxes there which mostly ran by themselves remotely, but about once a year I would go there to do certain things in person. It was a good way to remind them of why they kept me around, among other things.

I would save the "riskier" bits of sysadmin work for these trips. These were the system upgrades where a machine would be taken from one version of the base OS install to another, for instance. While I had a way to do this without even so much as a reboot much of the time, that only applied to the less-important machines which could be replaced by another one on the same network.

Those systems which had special hardware or otherwise had a special network position would be treated as risky. It was one such upgrade which got me into a serious bout of yak-shaving one morning, or as I called it then, "a very deep stack".

This box needed to go to Slackware 8 from whatever version it had been running at the time. I think it was far enough back that it was still based on libc5 - meaning version 4 or earlier. Normally this wouldn't be a big deal. For efficiency, I decided to do it by feeding the machine a CD since it actually had a drive in it.

Unfortunately, the kernel on the stock Slackware 8 CD supported neither the SCSI adapter in that machine nor its network card. If only one of the two drivers had been missing, I could have used the other device to grab a module and go from there. Instead, the only solution was to burn the "extra" add-on disc image, since that had a kernel with all sorts of bells and whistles.

So I had to go download that image and then copy it to a machine which had a CD burner, and that's when the next problem cropped up. While trying to copy the file across, I noticed the transfer was stupidly slow. Now I had to start digging around to figure out why.

It didn't take too long to realize that something was dreadfully wrong with my switch. The day before, we had installed new switches on my networks there, and one of them was clearly sick. It would reboot any time I tried to push a lot of data across the network. I'm not sure why it hadn't come up before, but it was definitely a problem now.

I figured it was a sleepy Saturday morning and nobody else was really using this network right then (thanks to my planned maintenance), so I might as well try to upgrade the switch. There was a new flash release which seemed promising. Of course, then I had to get that file and go through their tftp shenanigans to get the upgrade working, and that's when another problem cropped up. The act of trying to upgrade the switch seemed to make it die, too. It would reboot during the upgrade and that attempt would fail.

I finally had to just dig around in the office to find another new switch as a replacement. Then I had to reconfigure it to behave like the one which was misbehaving: port security, acceptable MAC addresses, SNMP community strings, and the passwords. Finally I had to swap it into the rack without messing up the mapping of patch cables to ports.

Once that was done, I finally had a stable network and could push the ISO to my CD burner machine. Then I had to wait for that to give me a usable disc, and only then could I boot the target machine into the right environment. After all of this stupidity, actually upgrading the OS was trivial. It was mostly a matter of upgrading a bunch of packages and checking over my config files to make sure nothing important had appeared in new versions.

Later, I put the machine back online and swapped things around so it started receiving traffic again. I still had a mostly-dead switch to label so nobody else would grab it and get into the same mess. It needed to go through the RMA process, and that's a chore I'd rather not handle myself. I left it to the people on site, since they were driving the hardware upgrades anyway.

Systems administration may not involve a whole lot of computer science, but I bet anyone who's done it long enough has learned about stacks the hard way! If not, they've probably dropped a whole bunch of stuff on the floor... into a heap.