
Sunday, April 1, 2018

Everything isn't broken, but a few things might be looping

Years ago, I worked good old-fashioned tech support, answering tickets and taking phone calls. It was primarily Linux troubleshooting, but a few of the more interesting customers had FreeBSD or Solaris. There was an entirely separate group of teams for Windows customers, but when things got busy, if their phones filled up, a Linux person's phone might ring. You'd have to deal with it as best you could, and hopefully without creating more backlog.

As a result, while people tended to specialize in one or the other (Unix type machines vs. Windows type machines), most folks tried to be moderately useful for the most common requests on both sides. If someone called up and asked for their machine to be rebooted, you probably didn't need to raise a tech on the other side, in other words.

After I had been doing this for a little while, late one night, I openly lamented the apparent situation that "everything seems to be broken". One of the other people on the floor that night had been doing it much longer than me, and said something that didn't really "land" with me at first: we only hear about the stuff that's broken. The other stuff which is working just works and nobody hears about it.

It took a while for that to sink in, but over the months and years that followed, I realized it was pretty close to the truth. There might be hundreds or thousands of a given flavor of hardware or software installed, but they didn't all generate support requests. Most of them were specced out by sales, put online by the provisioning system and datacenter folks, and just sat there in the rack for years until one day the customer moved on. The most attention they might have gotten was during a mass-patching event, when the tech support folks proactively logged in to push something important to every box they could access.

It made me feel a lot better. We weren't truly surrounded by stuff that was completely unreliable. There was just enough to keep us busy, and sometimes a confluence of events would lead to a serious influx of tickets and/or phone calls and would make things hellish for us on the floor. Working late night hours and having only a handful of people around tends to magnify such things.

It's selection bias in a nutshell.

That said, there are at least a couple of exceptions to this which are worth keeping in mind. One of them has to do with the way hardware can move around in circles if your provisioning, commissioning, and decommissioning flows aren't quite airtight.

Imagine you have a random IDE hard drive in a customer's machine. It starts going wonky, but doesn't die completely. The customer requests that it be replaced. The datacenter techs pull the machine off the rack, throw another disk in it, stick it on the KVM for remote access, and throw the ticket back to the support team to actually schlep everything across. Once that's done, the datacenter team pulls the old disk out, switches things around, makes sure it boots up, and then puts it back in the rack.

The old disk, meanwhile, is returned to the inventory cage.

Time passes. Then, one day, something happens and that same drive is called up for some reason. Maybe it gets chosen as a replacement for yet another failing drive. Maybe it's pulled to build a new machine. Maybe it's used as a new secondary disk for someone who's adding storage. In any case, it gets pulled back out of inventory, is stuck in a machine, and goes live.

Not too long after that, it starts triggering problems in its new host system, and then that box goes under repair, has the data migrated off to still another drive, and then our original disk goes back to the inventory cage.

Meanwhile, all of the good normal drives from the inventory cage are being used in systems, and then those drives end up sitting there in production for years and years. They have no complaints and generate no problems, and so they exit the loop.

The result is a swirling loop of broken hardware. Good hardware enters the loop and immediately exits. Bad hardware returns again and again. Before long, it's not too hard to imagine a loop that's almost completely full of sketchy hardware.
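
The shape of this is easy to convince yourself of with a toy simulation. Here's a rough Python sketch; every rate and count in it is invented purely for illustration. Good parts that get installed stay in production and leave the loop, flaky parts keep bouncing back, and the cage drifts toward flaky.

    import random

    random.seed(0)

    # Toy model of the spare-parts loop. Every number here is made up.
    # Good parts that get installed stay in production and exit the loop;
    # flaky parts act up in their new host and come back to the cage.

    def new_part(flaky_rate=0.05):
        """A part fresh from the vendor; a small fraction is flaky."""
        return "flaky" if random.random() < flaky_rate else "good"

    cage = [new_part() for _ in range(50)]      # initial shelf stock

    for week in range(1, 53):
        # Ten builds or repairs per week pull parts off the shelf.
        for _ in range(10):
            if not cage:
                cage.append(new_part())         # order fresh stock if the shelf is bare
            part = cage.pop(random.randrange(len(cage)))
            if part == "flaky":
                cage.append(part)               # fails in its new host, comes right back
            # a good part just works; it stays in the rack and never returns

        # A few parts pulled out of misbehaving customer machines land in
        # the cage too, and those skew heavily toward the flaky side.
        cage.extend("flaky" if random.random() < 0.6 else "good" for _ in range(3))

        if week % 13 == 0:
            share = cage.count("flaky") / len(cage)
            print(f"week {week:2d}: {len(cage)} parts on the shelf, {share:.0%} flaky")

Run it a few times and the exact numbers move around, but the shelf always ends up mostly flaky, because coming back to the cage is precisely what separates a bad part from a good one.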

If all you do to track problems is look at a given customer's or server's history, you will miss this every time. Some part of your provisioning process has to keep tabs on individual components. I've used a hard drive as an example, but the motherboard, CPU, memory, or any other replaceable part of a server can do the same thing. The stricter the company is about recycling parts, the more likely you are to run into this.
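
Here's what keeping tabs on individual components might look like in the simplest possible form, again as a Python sketch; the serial numbers, threshold, and function name are all invented for the example. The idea is just to key the repair history on the part's serial number instead of the server or the customer, and flag anything that keeps coming back.

    from collections import Counter

    # Hypothetical component-level tracking: count how many times each
    # serial number has been pulled out of a machine and returned to the
    # cage, regardless of which server or customer it was attached to.

    RETURN_THRESHOLD = 3        # made-up cutoff: three returns and it's suspect

    returns = Counter()         # serial number -> times returned to inventory

    def record_return(serial):
        """Log a part coming back to the cage; True means pull it from circulation."""
        returns[serial] += 1
        return returns[serial] >= RETURN_THRESHOLD

    # The same drive keeps coming back under different tickets and hosts.
    for serial in ["WD-0001", "ST-0042", "WD-0001", "WD-0001"]:
        if record_return(serial):
            print(f"{serial} has come back {returns[serial]} times; scrap or RMA it")

A per-server or per-customer history never sees this pattern, because the serial number is the only identity that survives all the moves.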

Imagine how long it would take to notice this if the only thing you have working in your favor is a particularly attentive inventory tech who spots some unique mark on the bad disk and sees it coming through week after week. You could go years before hiring that person, getting them into the right position, and having them finally notice it.

When that happens, don't discount that person for noticing it. Take it seriously. There's probably something to it.

Also, there's nothing inherently limiting this kind of "good sticks around, bad cycles again and again" behavior to computer hardware. I have to imagine this happens all over the place.