Writing

Software, technology, sysadmin war stories, and more.
Monday, October 26, 2020

Type in the exact number of machines to proceed

I've worked at a few places that had a large number of Linux boxes. I'm talking about well over a million. When you have that many cats which need herding, sometimes you have to do things to big groups of them at once. Once in a while, you even have to touch all of them at once.

It's been my experience that companies which possess such massive fleets tend to create tooling which will let them do exactly that. These tools have different names, but the gist of it is about the same: ssh in as root, run some command, and maybe return the exit code and/or output.

For certain situations, this is exactly what is needed to put out a fire, and that's when you're thankful it exists.

This post, however, is not about that. This post is about the other side of the coin, where someone uses one of these tools and creates a problem. Maybe they decide to roll out a "flag flip" this way instead of using best practices (tests, canaries, percentage rollouts, that kind of thing). Perhaps they push a new binary to every machine at once, so they all drop out at the same time, leaving no capacity to run the actual site.

There's something I've asked people to put into their tools to prevent certain kinds of disasters. It's intended to address the specific situation where someone runs the command and targets far too many machines. Maybe they wanted to touch a rack of test hosts (40 or so), but accidentally selected every machine in the fleet.

Once you have tooling like this, errors like this will happen too.

My request is simple enough: if you're going to generate a confirmation prompt as a sanity check, *don't* make it a Y/N type of thing. Instead, ask them to read a number and plug it back in.

It'll look like this:

Blah blah blah 123456 machines will be affected by this.  Proceed?
 
Enter number of machines to confirm: 

Your options are then to type in exactly "123456" to let it go, or anything else to abort.
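
Here's a minimal sketch of that prompt in Python. The function name `confirm_machine_count` and the `read_line` parameter are made up for illustration (the latter just makes it easy to test without a terminal), and ignoring stray whitespace around the answer is my own assumption about what "exactly" should tolerate.

```python
def confirm_machine_count(count, read_line=input):
    """Return True only if the operator types the machine count back.

    `read_line` is a hypothetical hook: it defaults to input(), but a
    test can pass in any function that takes a prompt and returns a string.
    """
    print(f"Blah blah blah {count} machines will be affected by this.  Proceed?")
    print()
    answer = read_line("Enter number of machines to confirm: ")

    # Assumption: surrounding whitespace is ignored, everything else must
    # match digit for digit. "y", "yes", and an empty line all abort.
    return answer.strip() == str(count)
```

A caller would bail out on anything but `True`, so "y", "yes", or just leaning on the Enter key all land on the safe side.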

The idea is to force you to take in that number through your usual input devices (I'd say eyes, but some people are using text-to-speech stuff or similar, and they count too), chew on it with your wetware, and then feed it back into the computer somehow. Adding a few extra steps like this will hopefully activate enough of your brain to make you stop short before blowing off your entire leg with a giant foot-gun.

Of course, if people run into this a lot and actually intend to hit that many machines, someone might start cutting and pasting the number. In that case, I would say the tool is being used far too often, and it's worth taking a look at changing the way things are done to avoid having to rely on it this much.

Now, reality being what it is, "stop using it" might not be easily done in a given company. If that's what's going on, then it might be interesting to split up the number a bit so it can't just be pasted in and has to make a round-trip through the human doing the work.

This might be as simple as printing the number with your locale's version of numerical separators, like "123,456" or "123.456" or "123 456" or whatever else you might use where you are. The trick is then to NOT accept that as input, but instead demand that they remove the separator and jam it in as just digits.

Blah blah blah 123,456 machines will be affected.  Proceed?
 
Enter number of machines to confirm: 123456
OK!  Continuing.
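
A rough sketch of the separator variant, again in Python. The name `confirm_with_separators`, the `read_line` hook, and the "Aborting." message are all invented for this example. It also cheats by always using a comma via the `,` format specifier instead of asking the locale, so treat that part as a stand-in.

```python
def confirm_with_separators(count, read_line=input):
    """Show the count with separators, but accept only bare digits back.

    `read_line` is a hypothetical hook that defaults to input() so the
    function can be exercised without a terminal.
    """
    # Assumption: a comma is used as the separator regardless of locale.
    shown = f"{count:,}"
    print(f"Blah blah blah {shown} machines will be affected.  Proceed?")
    print()
    answer = read_line("Enter number of machines to confirm: ")

    # Pasting "123,456" straight from the screen fails this comparison,
    # because str(count) contains digits only.
    if answer.strip() == str(count):
        print("OK!  Continuing.")
        return True

    print("Aborting.")  # hypothetical wording for the abort path
    return False
```

Note that counts below 1,000 have no separator to remove, so the anti-paste trick only kicks in for the big numbers, which is where it matters anyway.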

I've seen this technique save people on multiple occasions, and I'm sharing it here in the hope that it helps others. If you're designing something rather powerful, consider making it safe this way.

Just think: the numbers come off the screen, leap into the person, bounce around the inside of their head, and go back into the computer. Nothing but net.