Software, technology, sysadmin war stories, and more.
Thursday, December 27, 2018

Circular dependencies for provisioning systems

There's an interesting chasm which eventually seems to get crossed once a company has enough physical servers running their business. These are frequently, but not exclusively, boxes running Linux. There's no reason this story couldn't also apply to a BSD OS, or Solaris, or even Windows.

That chasm has to do with how you install the systems. How, exactly, is a fresh box turned from an empty pile of parts to a vibrant member of the "family"? There are different stages of this that companies tend to go through. I'm going to miss a few here, but I'll try.

Everyone probably starts with one person who goes around installing boxes by hand. Depending on how long ago we're talking, they might be doing it with tapes, floppies, CD-ROMs, DVDs, USB sticks, or who knows what. It probably involves a lot of manual intervention at the console. And yes, I mean the console: odds are, they need a keyboard and/or mouse and a monitor hooked up to do this.

This might progress to the point where they have some kind of "autoinstall" media: put it in the box, let it boot, and it does all of the work for you. No more manual partitioning of drives, creating filesystems, installing packages, and so on. You just let it run and wait for it to come back.
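As a concrete (if simplified) illustration of what "autoinstall" media carries, Red Hat-style systems do this with a Kickstart file: a canned answer sheet for every question the installer would otherwise ask at the console. The values below are made up for illustration, not recommendations.

```ini
# Illustrative Kickstart fragment: unattended answers for the installer.
lang en_US.UTF-8
keyboard us
timezone UTC
rootpw --lock                    # lock root here; real installs set a hash
clearpart --all --initlabel      # wipe the disks
autopart                         # let the installer partition them
%packages
@core
%end
```

Point the installer at a file like this and it runs start to finish with nobody standing at the console.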

Sooner or later, you probably arrive at a point where your servers boot off the network, at least initially, and install that way too. That is, when they power up and don't have an OS on the local storage (hard drive, SSD, whatever), they do DHCP and then PXE over their local Ethernet and try to find some install action that way.
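The network side of this can be surprisingly small. Here's a sketch of a dnsmasq configuration that answers DHCP and hands PXE clients a bootloader over TFTP; the interface, address range, and filename are invented examples, not anyone's production setup.

```ini
# dnsmasq acting as DHCP + TFTP + PXE server (illustrative values throughout)
interface=eth1
dhcp-range=10.0.0.50,10.0.0.150,12h

# Serve the PXE bootloader from the built-in TFTP server
enable-tftp
tftp-root=/srv/tftp
dhcp-boot=pxelinux.0
```

A box with no OS powers up, broadcasts DHCP, gets an address plus the name of that boot file, fetches it, and the install kicks off from there.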

Now, think about how this works. For box A to be able to actually boot off the network and get the instructions for how to install itself, at least one other system has to exist and be operable. Companies tend to create these breathtaking "provisioning systems" which do nothing but (re-)install machines all day long.

These systems represent a large amount of investment in handling all of the special cases which have cropped up over the years: different architectures, different CPU/memory configurations, this kind of storage, that kind of storage, IPv4-only vs. dual-stacked vs. IPv6-only networks, and so on.

But, at the end of the day, it's highly likely that the provisioning system which gets machines installed is itself running on the same sort of general-purpose hardware. That is, auto-installing a machine today works because some other machine was auto-installed in the past (and then became the install host). THAT machine was in turn installed by some other system... and so on, back to the days when it used to be done by hand by some human.

The chasm is crossed when the company goes all-in on their automated install processes, and now the only way to set up a new system is by having another pre-existing one do it. It is now a true circular dependency: you can install machine N+1 because machine N exists and is working sanely.

This is all well and good... until all of the machines die at the same time. By die, I don't mean reboot. I mean "are wiped and have nothing stored on their local drives". 30 years earlier, they'd be sitting there saying "NO ROM BASIC - SYSTEM HALTED". Today, they're probably in a reboot loop, trying and failing to get DHCP and PXE going off the network from other hosts that themselves are also very dead.

How would this happen? One disgruntled employee (or one determined attacker) ought to do it. Or, much more likely, it'll come from something well-intentioned that failed big-time due to the ridiculous, unknowable complexity of our systems. How about a "rm -rf $FOO/" in a shell script where $FOO isn't defined, nobody did "set -u", and rm's "--preserve-root" doesn't exist or isn't functioning for whatever reason? "cp foo /dev/sda"? There are so many ways.
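That particular footgun is easy to demonstrate safely. In this minimal sketch, $FOO is deliberately never set, and echo stands in for the actual rm so nothing gets deleted; it shows how "$FOO/" silently collapses to "/", and how "set -u" kills the script before any rm would run.

```shell
#!/bin/sh
# $FOO is deliberately unset in both subshells below.

# Without set -u: the unset variable expands to nothing, so the
# would-be "rm -rf $FOO/" target silently becomes "/".
unsafe_target=$(sh -c 'unset FOO; printf "%s" "$FOO/"')
echo "without set -u, the target is: $unsafe_target"

# With set -u: referencing the unset variable is a fatal error,
# so a real rm on that path would never even start.
if sh -uc 'unset FOO; printf "%s" "$FOO/"' 2>/dev/null; then
  echo "this line is never reached"
else
  echo "set -u aborted the script before any rm ran"
fi
```

Three characters of defensive shell versus every disk in the fleet; the trade seems worth it.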

You can wipe a surprising number of machines in a short amount of time. I bet your entire multi-million dollar server investment can be reduced to the modern equivalent of a VCR flashing "12:00" really quickly. Has your company created a tool to let you run a command as root on a bunch of boxes in a massively parallel fashion? I bet it has one somewhere...
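Such a tool doesn't have to be fancy, either. Stripped of auth, batching, and output collection, the core of a "run everywhere" tool is just a fan-out loop. This sketch uses an invented three-host inventory and substitutes echo for the real ssh call so it runs harmlessly anywhere.

```shell
#!/bin/sh
# Skeleton of a massively parallel "run this as root on every box" tool.
# HOSTS is a made-up inventory; a real tool would pull thousands of
# names from some machine database and actually ssh to each one.
HOSTS="web01 web02 db01"

run_everywhere() {
  for h in $HOSTS; do
    # Real version would be: ssh "root@$h" "$1" &
    echo "[$h] $1" &    # harmless stand-in for the remote command
  done
  wait                  # every host runs concurrently; one wait at the end
}

run_everywhere "some-command" | sort
```

With the ssh line swapped back in, the wrong argument to that function is exactly the "one command, five minutes" scenario.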

My guess? One command, five minutes. Poof. Back to the stone age.

Now what do you do? Does anyone at the company know how to do enough of an installation to get the provisioning system back online, so you can then use it to bring back the rest of the world? Do you just start over? Or do you start working on your resume and hope to get out in front of everyone else who is also going to be out of a job?

Incidentally, for anyone thinking "we use the cloud and therefore we are immune": you're right and you're wrong. Sure, probably nobody at your company will have access to the low-level provisioning system at the "cloud company" that is selling you service, so they won't be able to screw it up that way. It does change the situation a bit.

But, what about the cloud company itself? They have employees too. They make mistakes. They do things that "should just work" and skip important safety checks because of hubris. They break things. Do they have a way to recover from this?

You'd better hope they have already considered this scenario.