Writing

Software, technology, sysadmin war stories, and more.
Thursday, September 1, 2011

Complicated situations sometimes call for state machines

Have you ever seen a system which was obviously thrown together as a series of hacks, each intended to address whatever thing had cropped up right then and there? You can tell the difference between that and something which has obviously been lovingly curated and put together with an awareness of the bigger picture.

This story is about one such example, which involved individual machines being added to a big management system. This was happening in the scope of a testing framework, and those machines would get erased, reinstalled, fitted with different testing materials, and so on. Once they had been reworked, they had to be re-added to the management system.

When I found it, it was basically a shell script which always did this: "delete machine from system", then "add machine to system". It made those requests even if they weren't necessary. It had no error handling, and it had no idea whether it had succeeded or not. It just ran those commands every time, like the stupid script that it was.

Once I got over my disgust with this code, I resolved to do something about it. This looked like something which needed a state machine. There were quite a few inputs which could be 0 or 1, but only a couple of actions which could be taken. The key was mapping them all in a sane manner, and that was something this shell script would never do.

I laid this out on a legal pad: my inputs were the columns, and a final column at far right was "what to do". Each row was just another number. Obviously, the first row was "0 0 0 0 0", then it was "0 0 0 0 1", and then "0 0 0 1 0", and so on, in binary fashion. With this expanded in concrete terms, it was easy to figure out the rest.
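For anyone who would rather skip the legal pad, here is a minimal Python sketch of the same enumeration. The input names are placeholders: the post only describes one input (the machine being under on-site repair), so everything else here is an assumption made for illustration.

```python
from itertools import product

# Hypothetical input names. Only the first one is described in the post;
# the other four are placeholders for whatever the real system reported.
INPUTS = ["in_repair", "input_b", "input_c", "input_d", "input_e"]


def all_rows(n_inputs=len(INPUTS)):
    """Return every combination of 0/1 inputs, counting up in binary."""
    return list(product((0, 1), repeat=n_inputs))


def format_row(row, action="?"):
    """Render one table row, with the action column still to be filled in."""
    return " ".join(str(bit) for bit in row) + " - " + action


if __name__ == "__main__":
    for row in all_rows():
        print(format_row(row))
```

Printing the rows gives a blank table to fill in by hand, which is about all the legal pad was doing.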

Looking at a row, you could say "oh, if the machine is being repaired by the people on-site, then no other inputs matter, and we just ignore it". That actually batched up a whole bunch of rows, and turned into this:

1 X X X X - Ignore

Even though there were 5 inputs, there were not 32 distinct actions. Due to the grouping power of the "don't cares" (X in the cell, as shown above), it was actually far fewer.
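One way to express "don't care" rows in code is a list of patterns checked in order, sketched below in Python. Only the first rule comes from the post. The other two rules, and the action names "add" and "nothing", are made up purely so the example covers every row; the real table is not given.

```python
from itertools import product

X = None  # marks a "don't care" cell

# (pattern, action) pairs, checked top to bottom.
RULES = [
    ((1, X, X, X, X), "ignore"),   # from the post: under repair, skip it
    ((0, 1, X, X, X), "add"),      # hypothetical rule, for illustration only
    ((0, 0, X, X, X), "nothing"),  # hypothetical rule, for illustration only
]


def matches(pattern, row):
    """True if every non-X cell in the pattern equals the row's value."""
    return all(p is X or p == v for p, v in zip(pattern, row))


def decide(row, rules=RULES):
    """Return the action of the first rule that matches the row."""
    for pattern, action in rules:
        if matches(pattern, row):
            return action
    raise ValueError(f"no rule covers {row}")


def rows_covered(pattern, n_inputs=5):
    """Count how many fully expanded rows a single pattern stands for."""
    return sum(matches(pattern, r) for r in product((0, 1), repeat=n_inputs))
```

With five inputs, the single "1 X X X X" pattern stands in for 16 of the 32 rows, which is where the savings come from. Raising an error on an uncovered row is a deliberate choice in this sketch: a hole in the table should be loud rather than silently doing nothing.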

After this, it was just a matter of writing code to gather the five data points and make the decision according to my state table. Then I wrote code to actually take those actions, and made it all loop until enough machines were happy.
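The overall shape of that loop might look something like the Python sketch below. This is not the original code. The function names, the "ready" and "add" actions, the pass limit, and the two-input toy fleet are all assumptions chosen to keep the example small; the real thing gathered five inputs from a live management system.

```python
def reconcile(machines, gather, decide, act, needed, max_passes=10):
    """Gather inputs, decide, act, and repeat until enough machines are ready.

    Returns True once at least `needed` machines report "ready", or False
    if that has not happened after `max_passes` passes.
    """
    for _ in range(max_passes):
        ready = 0
        for machine in machines:
            action = decide(gather(machine))
            if action == "ready":
                ready += 1
            elif action != "ignore":
                act(machine, action)
        if ready >= needed:
            return True
    return False


# Toy in-memory stand-in for the management system (hypothetical).
fleet = {
    "m1": {"in_repair": 0, "registered": 0},
    "m2": {"in_repair": 1, "registered": 0},
    "m3": {"in_repair": 0, "registered": 1},
}


def gather(name):
    """Collect the inputs for one machine as a tuple of 0/1 values."""
    state = fleet[name]
    return (state["in_repair"], state["registered"])


def decide(row):
    """Tiny two-input decision table: repair wins, then registration."""
    in_repair, registered = row
    if in_repair:
        return "ignore"
    return "ready" if registered else "add"


def act(name, action):
    """Apply an action to the toy fleet."""
    if action == "add":
        fleet[name]["registered"] = 1
```

Calling `reconcile(list(fleet), gather, decide, act, needed=2)` on this toy fleet adds m1 on the first pass, leaves m2 alone because it is under repair, and succeeds on the second pass. The pass limit is there so a machine that never recovers cannot hang the test forever.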

Did it work? Of course it worked. It removed yet another source of unreliability and general wonkiness during testing, and it became a standard part of every test which touched that system from that point on.

Given how simple this is, you'd think everyone would use it. I guess it's more of a macho thing to just sling a bunch of code in a script and call it done.