
Thursday, August 1, 2013

Your simple script is someone else's bad day

Let's say you're a programmer and you have to automate a sequence of tasks. The end result isn't important, but if it helps, think of it as something like baking a cake, changing the oil in your car, or whatever. What matters is that there are quite a few steps in this sequence, they have to happen in the right order, and they all have to succeed.

How do you handle this? Assuming a Unix-type environment here, if it was originally just a bunch of commands people ran by hand, it might have started life as a cheat sheet and perhaps became a "runbook" entry at some point. Then, I suppose, you might turn it into a shell script.

#!/bin/sh
do_thing_one
thing_two
task_three --blah --foo=false
# ...
do_thing_twelve --path=/over/there

If that's how you handle it, you're walking a thin line. If all of those steps actually succeed, then sure, okay, you win, and it's probably an improvement over the old manual processes. But no, I'm not going to let you get off that easy.

Pick one of the steps in this file. Now imagine it failing. What happens next? All of the subsequent steps will still be started, since nothing up there checks for success. They'll probably fail too, since they presumably depend on the prior steps actually working. Adding a "set -e" so the script stops at the first failure would be a good start (plain /bin/sh supports it; you don't even need bash for that). Even just appending "|| exit 1" to some of those calls would be better than nothing.
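To make that concrete, here's a minimal sketch of the same script with failure handling added. The command names are still the made-up placeholders from above; the point is just that every step either succeeds or stops the whole thing.

#!/bin/sh
# Stop at the first command that exits non-zero.
set -e

do_thing_one
thing_two

# For steps where you want a specific message, an explicit check also works.
task_three --blah --foo=false || { echo "task_three failed" >&2; exit 1; }

# ...
do_thing_twelve --path=/over/there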

Without those checks, what happens if the subsequent steps run anyway and manage to get into some weird state because they ran when they shouldn't have? That might even leave things unable to run again later without manual intervention, since the script would no longer be starting from a clean slate.

I find that sometimes when such processes exist, the best way to make them work is to start from scratch every time. In other words, if it doesn't run all of the steps all the way through successfully, running the script a second time probably won't help and might even make things worse. It may be necessary to wipe out all of the work and do it from the beginning to get a usable result.
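One way to get that "fresh every time" behavior is to do all of the work in a scratch directory that gets wiped at the start of every run. A rough sketch, with a made-up path and the same placeholder steps as before:

#!/bin/sh
set -e

# Made-up scratch location for illustration.
WORKDIR=/var/tmp/my-build-area

# Throw away whatever a previous (possibly failed) run left behind,
# so every run starts from a clean slate.
rm -rf "$WORKDIR"
mkdir -p "$WORKDIR"
cd "$WORKDIR"

do_thing_one
thing_two
# ... and so on through do_thing_twelve ...

That's heavy-handed, but it trades wasted work for predictability: a re-run never has to reason about what state the last attempt left behind.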

To call back to the example of changing the oil in your car, imagine a simple-minded automaton like a robot. It takes things literally, no matter what sort of hilarity and craziness might result. Now let's say someone writes up a list of instructions for changing the oil in the car, and one of the steps is literally "add 5 quarts of oil". If everything works properly, you're fine. It's the corner cases where it gets interesting.

An early version of the script would probably not catch errors, and so would continue running the later steps even when it should have bailed out. This means even if your dumb oil-changing robot fails at draining the old oil, it still tries to add new oil! This creates a mess and potentially a hazmat situation. Way to go.

So you add error checking, and now every single step is tested to be sure it didn't return an error code. If an error occurs, it bails out and lets someone step in and take a look. They fix whatever's wrong and run it again. This probably works out most of the time, too.

Of course, this eventually fails far enough along that the new oil is already in the car, and re-running it causes the robot to needlessly drain that new oil as it follows the instructions. Hopefully, the programmer comes along and adds some checks so it won't do that. Or maybe they add some clever checkpointing, so a failed run restarts at the step that broke instead of re-running the early ones.
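In shell terms, that kind of checkpointing often ends up as per-step stamp files: a step only gets a marker once it has finished, and marked steps are skipped on the next run. A rough sketch, with imaginary robot commands standing in for the real steps:

#!/bin/sh
set -e

# Made-up location for the checkpoint markers.
# drain_old_oil, replace_filter, and add_oil are imaginary robot commands.
STAMPS=/var/tmp/oil-change-stamps
mkdir -p "$STAMPS"

run_step() {
    name=$1; shift
    if [ -e "$STAMPS/$name" ]; then
        echo "skipping $name (already done)"
        return 0
    fi
    "$@"                     # if this fails, set -e stops the whole script
    touch "$STAMPS/$name"    # only mark the step done after it succeeds
}

run_step drain  drain_old_oil
run_step filter replace_filter
run_step fill   add_oil --quarts=5

Note that a stamp only records that a step finished; it says nothing useful about a step that died partway through, which is exactly where the next problem shows up.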

That works for a while, then it fails during the "add oil" stage. It's trying to follow the "add 5 quarts of oil" instruction and runs out of oil about halfway through the fill. It errors out. Someone gives the robot more oil to use and restarts it, and it picks up again at that step.

It again tries to "add 5 quarts of oil". Of course, the car already had about 2.5 quarts in it, given that this step ran about halfway through before, so the robot overfills the crankcase, oil spills out, and yes, there's another mess and possibly a call to the fire department's hazmat team as it heads for the storm drain. It's embarrassing.

It's around this point that someone figures out that the command needs to be something more like "fill with oil until there are 5 quarts in the car", but of course that also fails spectacularly the first time a car with an oil leak comes through. The robot keeps filling and filling and filling and never hits the 5 quart mark. Meanwhile, the fire department has been called once again.

Someday, someone might manage to implement it with the proper amount of paranoia: "Add new oil to the car, using up to 5 quarts, until the car has 5 quarts in it. Also, look for leaks and stop immediately if anything escapes from the car."
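A sketch of what that might look like, again with imaginary robot primitives: check_oil_level prints the current level in whole quarts, add_oil pours a fixed amount, and check_for_leaks exits non-zero if oil is escaping.

#!/bin/sh
# check_oil_level, add_oil, and check_for_leaks are imaginary robot commands.
set -e

TARGET=5    # quarts the car should end up with
BUDGET=5    # quarts we're willing to pour in total
poured=0

while [ "$(check_oil_level)" -lt "$TARGET" ]; do
    if [ "$poured" -ge "$BUDGET" ]; then
        echo "poured $BUDGET quarts and still not at $TARGET, giving up" >&2
        exit 1
    fi
    check_for_leaks || { echo "leak detected, stopping" >&2; exit 1; }
    add_oil --quarts=1
    poured=$((poured + 1))
done

The step is now driven by the state it's supposed to reach rather than by the work it's supposed to perform, so re-running it after any interruption is safe.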

It takes a non-trivial amount of work to automate a process so that it can notice failures, be restarted safely, and not create messes when interrupted at arbitrary points. In terms of dev time, it's probably faster and definitely easier to just list a bunch of steps and do no checking, but that just shifts the load onto someone else when it fails.

I guess the question is: are you feeling lucky? How about your users?

...

Wikipedia has a few things to say about some of the fundamentals here.