
Saturday, February 2, 2013

Replacing "scripter mentality" code

I once had to replace a crufty system which needed to do a bunch of setup tasks. These were things which had normally been done by people, and it just wasn't scaling. Someone had tried to write it as a script, but it didn't work out too well. It was relatively slow and basically didn't handle multiple machines properly.

The specific tasks aren't important here, so I will use some equivalent tasks to explain how it needed to work. They still convey the relative amount of complexity which had to be tackled in order to finish the job.

I also describe the original "scripter" approach later on.

...

Imagine you have a fleet of dedicated servers spread across several different hosting companies and facilities. You need them to be up and running and serving your content. For the sake of this example, we'll say they are all serving the same content, and are being used for redundancy. There is other stuff going on to balance load to them which is out of scope for this discussion.

These machines come and go. Parts break. Datacenters are taken down for maintenance, retooling, or just "defragging". Drives die, and sometimes the servers have to be reinstalled. Somehow, you have to keep all of these up and running.

After doing this by hand for the 50th time, you get tired of this and try to automate it. In so doing, you identify a bunch of steps which always have to happen in order to turn a random "bare metal" box into a functioning member of your serving platform. For our purposes, those steps are:

  1. Machine is marked online at the hosting company
  2. Machine has a base install
  3. SSH keys for other users are installed
  4. Web data is installed
  5. Web server is running
  6. Web server is actually delivering content

As you can see, a lower-numbered step effectively trumps a higher-numbered one. There's no point in trying to worry about whether the system is delivering content if it's marked "under repair" at the hosting company. The same prioritization applies to all of the other steps.
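
For the sake of the pseudocode below, imagine each machine has a little status record with one flag per step, something like this. The field names simply mirror the flags used in the later snippets:

struct machine_status_entry {
  // One flag per setup step, in priority order.  A false flag early
  // in the list makes everything after it irrelevant for now.
  bool machine_is_online;
  bool base_install_ok;
  bool ssh_keys_ok;
  bool web_data_install_ok;
  bool web_server_ok;
  bool web_content_test_ok;

  // Only set once everything above has checked out.
  bool machine_is_usable;
};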

Imagine if your job was to write some code which would look at a machine to see if it was healthy. It would either set a flag that says the machine is usable or it would kick off an action to fix things, assuming one exists. That means installing missing SSH keys or starting the web server, for example.

Assume that you'll be called on a regular basis until you finally set that flag. You only need to do one "pass" in your code. One way to handle this would be to have a long chain of conditionals. I don't recommend this, but I've seen it happen, and it looks something like this:

if (machine_is_online) {
  if (base_install_ok) {
    if (ssh_keys_ok) {
      if (web_data_install_ok) {
        if (web_server_ok) {
          if (web_content_test_ok) {
            machine_is_usable = true;
          }
        }
      }
    }
  }
}

That's pretty nasty stuff, and we haven't even started dealing with the remediation steps needed to move something along! Now it starts getting really horrible, and it might grow into something like this:

if (machine_is_online) {
  if (base_install_ok) {
    if (ssh_keys_ok) {
      if (web_data_install_ok) {
        if (web_server_ok) {
          if (web_content_test_ok) {
            machine_is_usable = true;
          } else {
            log("web server not serving up correct content");
          }
        } else {
          start_web_server();
        }
      } else {
        start_web_data_install();
      }
    } else {
      start_ssh_key_install();
    }
  } else {
    start_os_install();
  }
} else {
  log("machine is offline at hosting company");
}

Are your eyes bleeding yet? Assuming we haven't made any mistakes in juggling all of that indentation and all of those blocks, then all of the tests happen and all of the actions are started appropriately. If there's a problem in this, it's going to take a lot of staring to figure out where it went wrong. I imagine it could involve printing out the code and drawing little vertical bars from an if to the matching } below to see the associations. I've dealt with this, and it's frustrating.

I prefer to avoid such things, and instead sometimes go for something more like this where "bail out early" is the name of the game:

if (!machine_is_online) {
  log("machine is offline at hosting company");
  return;
}
 
if (!base_install_ok) {
  start_os_install();
  return;
}
 
if (!ssh_keys_ok) {
  start_ssh_key_install();
  return;
}
 
if (!web_data_install_ok) {
  start_web_data_install();
  return;
}
 
if (!web_server_ok) {
  start_web_server();
  return;
}
 
if (!web_content_test_ok) {
  log("web server not serving up correct content");
  return;
}
 
machine_is_usable = true;

This version takes up a lot of room vertically, but I find it much easier to read. There's no need to scan up and down to match distant blocks based on indentation. The test is right next to the action, if any.

So now here's where real life gets interesting. There's a lot going on in this bigger process. The program as a whole is running with a list of several machines, and it calls your checker function with each of their names periodically. You need to do your tests and kick off your actions but not wait around for the replies. If you block waiting for a response, then the whole program will bog down.
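
Boiled down, the outer loop amounted to something like this. The names here are made up for illustration, and check_machine() stands in for the per-machine checker described above:

while (true) {
  for (machine in machine_list) {
    // Must come back quickly: it only looks at flags and kicks off
    // actions, and never waits around for the results.
    check_machine(machine);
  }

  sleep(delay);
}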

I handled it with asynchronous requests. Every action my program could take resulted in a remote procedure call being generated and handed off to the RPC library. It took my request and a callback pointer and then returned control to me. The RPC library then executed the request in the background. This let my code run straight through relatively quickly and return to the caller without getting stuck in a blocking call.

It changed the above series of tests to look like this:

if (!base_install_ok) {
  start_rpc(machine, "install_base_os", base_os_done);
  return;
}
 
if (!ssh_keys_ok) {
  start_rpc(machine, "install_ssh_key", key_install_done);
  return;
}

This also meant I now had several more functions to handle the callbacks from those RPCs. They were responsible for looking at the status of the earlier request. Some of them were pretty simple. If the RPC succeeded, that means the machine managed to complete the request, so it just needed to update the local state for that particular task:

void base_os_done(rpc_status, target_machine, response) {
  if (rpc_status != RPC_SUCCESS) {
    log("rpc failed: ...");
    machine_status[target_machine].base_install_ok = false;
    return;
  }
 
  machine_status[target_machine].base_install_ok = true;
}  

Others were a bit more complicated and required looking inside the response to an RPC.

void key_install_done(rpc_status, target_machine, response) {
  if (rpc_status != RPC_SUCCESS) {
    log("rpc failed: ...");
    machine_status[target_machine].ssh_keys_ok = false;
    return;
  }

  if (response.ssh_key_hash != magic_constant_value) {
    log("key install finished, but hash is still wrong");
    machine_status[target_machine].ssh_keys_ok = false;
    return;
  }

  machine_status[target_machine].ssh_keys_ok = true;
}

In this case, it had to find a certain value in there or it wasn't actually successful. Notice that I'm just dropping straight out of these functions when things fail. I know that any of these things will be called over and over until things succeed, so bailing out here doesn't hurt anything in the long run.

...

I liked to visualize this entire process as one of those telethons where they get people to call in and donate money to their TV station. You usually see a bunch of people sitting there with telephones, and sometimes they have a counter or something at each station to see who's really making things happen. They receive calls in parallel and update those counters asynchronously. (If they don't do this where you live, go watch "UHF". They have a scene like this in there.)

Once in a while, the coordinator looks at all of these counters and status boards and makes a decision about what to do next. The person running the show knows that whatever is up on those boards is the absolute newest data they could possibly have. There might be more results "in flight", but they haven't arrived yet.

Eventually, all of the counters will reach some magic value and the telethon will end. They will have reached their goal. My program worked a lot like this. All of those blah_blah_ok flags were set by the RPC callback handlers. I just looked at those flags and kicked off RPCs to make things happen knowing those actions would eventually populate those flags with good data.

It's also important to note that this approach can be extended to make the system immune to the effects of a machine which "goes retrograde" during this process. For instance, let's say it gets past the ssh key install step. Then, while a later step is running, someone jumps on the machine and manually twiddles the ssh keys. They are no longer the right combination, and that machine should not go online.

This requires some reworking so that requests for the current state are sent regularly. If something changes, the callback will flip the appropriate is_blah_ok flag back to false, the test will fail on the next pass, and that will kick off an RPC to fix things.
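
Continuing with the same sort of pseudocode, a state probe for the ssh key step might look something like this. The "check_ssh_key" method and the callback name are made up for illustration. The probe is fired on every pass, even when the flag currently says things are fine, and its callback is allowed to flip the flag back to false:

start_rpc(machine, "check_ssh_key", ssh_key_check_done);

void ssh_key_check_done(rpc_status, target_machine, response) {
  if (rpc_status != RPC_SUCCESS) {
    log("rpc failed: ...");
    machine_status[target_machine].ssh_keys_ok = false;
    return;
  }

  // Someone twiddled the keys behind our back: the flag goes back to
  // false, and the next pass kicks off a fresh key install.
  machine_status[target_machine].ssh_keys_ok =
      (response.ssh_key_hash == magic_constant_value);
}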

...

This assumes a relatively simple setup task where earlier failed steps make later steps irrelevant and there's only one way through the list. There are far more complicated scenarios out there where you need to branch this way and that in order to handle different environments. I haven't even tried to cover that here.

...

Finally, I wanted to mention the wrong sort of approach which existed before I was drafted to replace it. It was a glorified shell script which did things serially, blocked in many places, and amounted to infinite loops in some situations. Again with the pseudocode:

while (!machine_is_online) {
  log("machine is offline at hosting company");
  sleep(delay);
}
 
if (!base_install_ok) {
  install_base_os();
}
 
if (!ssh_keys_ok) {
  install_ssh_keys();
}
 
if (!web_data_install_ok) {
  install_web_data();
}
 
if (!web_server_ok) {
  start_web_server();
}
 
if (!web_content_test_ok) {
  log("wrong content returned");
  die();
}
 
machine_is_ok = true;

This code seems innocent enough. It runs all of the tests and starts corrective actions. If everything is okay on the machine, it'll drop straight through and set the flag. Otherwise, it'll fix it right then and there and then keep on going, right?

Well, no, actually. This code assumes far too many things, like requests will never fail. What happens if you ask a machine to install SSH keys and the network goes down right then? It comes back up 30 seconds later, but your RPC times out and fails. This code as written won't even notice and will move on to the web data check. You might wind up with a host which has a running web server and no way to log in since it didn't get any SSH keys! Oops.

Someone might notice this and try to fix things by adding while loops.

while (!ssh_keys_ok) {
  if (!install_ssh_keys()) {
    log("failed to install ssh keys, trying again");
    sleep(delay);
    continue;
  }
}

Or, they might flip the logic around and test for success instead:

while (!ssh_keys_ok) {
  if (install_ssh_keys()) {
    break;
  }
 
  log("failed to install ssh keys, trying again");
  sleep(delay);
}

Let's say someone goes through and puts every one of these tests into a loop. Now, if any given RPC fails, it will just sleep and try again.

This is when I throw yet another monkey wrench in the works. Remember the scenario where an earlier step regresses for whatever reason? This can never catch it, since once you pass that step, you never try it again. Even if you somehow found out that something was wrong later on, you couldn't go back. "goto" is not an option here!

There's no way this nightmare could have ever been fixed to handle the harsh reality of production use with serious reliability demands. That's why it had to go.


February 3, 2013: This post has an update.