Reaching a quorum of machines
Yesterday, I described a system which needed to run a series of setup tasks on a fleet of computers. These tasks were being handled by a glorified shell script up to the point when I encountered it, and it was frequently the cause of process failures. Because it would fail, the larger system would be unable to use those computers to accomplish useful work.
I described the handler at the core of things in that post, and some of the techniques used to make sure things didn't block. The basic premise was that all requests happened in the background and asynchronously updated a collection of state data about each of the computers. The "foreground" thread just checked that data and used it to figure out if a machine was healthy or not. If it found a problem, it would try to start a request to do something about it.
I briefly mentioned that all of this was happening in the scope of a bigger process but deliberately omitted the specifics from that post. For this post, I'll describe more of what was happening in the layers above that inner core.
The topmost interface to all of this was simple enough. You'd hand it a list of computer names and details about what sort of content you wanted loaded on there. You'd also tell it a few extra things about your request. One of them was a "mincount", where you would say that it was okay to proceed even if some of the computers were unhealthy. This way, you could provide a list of (say) 100 candidates, but also set your mincount to 80.
This was important since computers tend to come and go from the network based on all sorts of crazy things in real life. Expecting 100% of them to be up at all times when you have enough of them is just not realistic. For the purposes of this system, it didn't really matter which 80 machines it got, just that it got at least that many.
I did a few other things, too. First, if it reached the mincount, it didn't just continue right away. There might be some other machines still going through the setup process which will ultimately succeed. You want to capture as many of them as possible in case they fail out for other reasons later. There were other stages where machines could be rejected, and exiting as soon as you found #80 would just move the "all or nothing" scenario down the line.
By waiting a minute or two you could frequently grab several more systems which just needed a few more things done on them. This would allow things to proceed with 85 or even 90 of the machines on average. That gave a nice cushion against later failures.
There was also a maximum timeout on this process. If it didn't get to the quorum (80 or more) within some specified interval (like 15 minutes), it would assume something was very wrong and would give up. This would fail the process, but it was better than sitting there forever. A failure could be reported to the people who watch over the system and they could figure out what's going on.
Also, there were other uses of these same machines which might have only needed 40 or 60 of them to be up and running. Some of them might be able to start up and run to completion even if quite a few of the 100 are missing. The alternative is blocking the entire run queue, and that just didn't make any sense.
I added one final optimization to all of this. If you supplied a list of 100 machines and all 100 of them passed the tests on a given pass, it would drop straight through instead of doing the "mincount wait" described earlier. This way, if things really were completely healthy, you didn't have to wait around for something that will never happen: arrival of more machines.
So, to review, there were several ways through this system on every pass:
It could test all of the machines and find that all of them were healthy. It would drop straight through with a successful result.
It could test all of the machines and find it had a quorum (mincount), but didn't have all of them. If it had been this way for long enough (clock time), then it would give a successful result.
It could test all of the machines and keep discovering it didn't have a quorum at any point. Eventually, it would hit the maximum limit on how much time could be spent attempting to set up the machines. When that happened, it would give a unsuccessful result.
One thing I enjoyed about this particular system was wiring it up to show its internal state when it was running directly for a user. That way, if you needed a bunch of machines set up a certain way, you could run this CLI tool which was just a thin wrapper around the inner library. That tool would print updates about where it was and what it was doing to the various machines to get them running.
This let me do quick demos of why this sort of implementation trounced the old stuff. I could just run the tool with a large set of machines on the command line and then push back from my desk and let it run. It would turn that list into a set of healthy systems a few seconds or minutes later all without any input from me. The thing it replaced would get stuck or break more often than not and could never stand up to such hands-free treatment.
I invested a lot of energy in trying to remove failure points from that system. Upon removing one weak link, the next weakest link would make itself known. It was thankless work, but it did have quantifiable results: fewer failures and shorter setup times.
I don't know if it was ever appreciated by the users, though. Sigh.
February 5, 2013: This post has an update.