Writing

Software, technology, sysadmin war stories, and more.
Tuesday, December 11, 2012

Canary as a verb

When I mention canaries, I expect most people think about cute little birds. However, I would hope that people who run large computer systems which are intended to be reliable also know a second meaning for the word. Instead of the usual noun for the bird, they should be thinking about the verb form which has emerged, as in "to canary" something.

If you're already familiar with this, great! This will seem like old news. If not, please stick around. This sort of thing matters when you're trying to make systems which don't fall down and bug people.

As the story goes, miners used to keep canaries around to find out when poisonous gases were being released. If the bird got sick (or worse), they'd get out of there before they were also overcome by whatever had gotten into their air. This was supposed to work because the poor bird was far more sensitive to bad things and would be affected before it became life-threatening for people. So, in that sense, a "canary" is something which will show signs of badness before it affects too many people.

Bringing this concept over to the world of computer systems isn't much of a stretch, then. Perhaps you have 100 servers which normally run identical code and together handle the load for an enormous web site. You have monitoring which notices every error thrown by your site, so by looking at how many errors occurred in some span of time, you can arrive at a rate. If that rate is sufficiently low, then you're happy. Further, if it's about the same for all of your servers, there's nothing unusual going on.
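Here's a rough sketch of that kind of check. It assumes each server reports an error count and a request count for the same window of time. The host names, the numbers, and the `tolerance` threshold are all made up for illustration, and your monitoring will have its own idea of what "about the same" means.

```python
from statistics import median


def error_rate(errors: int, requests: int) -> float:
    """Errors per request over some window; 0.0 if the server saw no traffic."""
    return errors / requests if requests else 0.0


def unusual_servers(counts: dict[str, tuple[int, int]], tolerance: float = 0.01) -> list[str]:
    """Return hosts whose error rate is more than `tolerance` from the fleet median.

    `counts` maps a host name to (errors, requests). The default threshold
    is an arbitrary assumption, not a recommendation.
    """
    rates = {host: error_rate(e, r) for host, (e, r) in counts.items()}
    typical = median(rates.values())
    return sorted(h for h, rate in rates.items() if abs(rate - typical) > tolerance)


# Hypothetical counts for one window: (errors, requests)
window = {
    "web001": (12, 10_000),
    "web002": (9, 10_000),
    "web003": (450, 10_000),  # this one is clearly not like the others
}
print(unusual_servers(window))
```

Comparing against the median instead of the mean keeps one badly broken machine from dragging the "normal" value toward itself.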

So now it's time to push out a new feature for your product, which happens by enabling something in your server config file. You have this new "awesome push server" thing which allows you to make a change, and it wakes up all of the listeners, who then immediately pull a new copy of the data and apply it. Your commands look something like this:

$ vi webserver.config  # you add "new_feature = on" to the file
$ cp webserver.config /net/pushserver/config/webserver

Moments later, all of your web servers "snap to" and start running the new config. Five minutes after that, you realize this new feature causes the JVM to chew far more memory than it did before once it's under production load. Your machines start running out of physical memory and start swapping, and shortly after that, the OOM killer starts assassinating your Java runtimes. Your site goes down in flames.

The problem here is that all of your servers switched over at once, so all of them started flapping like crazy as they hit the memory problem. You didn't have any left over running the old config to support the site in case something went dreadfully wrong. Let's turn back the hands of time and try something different.

This time, take one of the web servers and point it at another config file which has the same contents as the other 99. Then, when it's time to do your first release of this feature, only switch it on in that config file for that one server. It should start running the new code and should start chewing memory. It'll probably keel over and fall out of the load balancing pool just the same, but all of its buddies will be there ready to take on its users. It stinks but it's not the end of the world.

Now you can just roll back that one config file and try it again later. Only the users who happened to have a session which landed on the test machine were affected. Some of them might have seen something new for a few minutes before things stopped responding. That's life.
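The one-server split might look something like this on the listener side. This is only a sketch: the first path is the one from the push commands earlier, while the `-canary` path, the `CANARY_HOSTS` set, and the host names are invented for the example.

```python
# Hypothetical: the one machine which gets new settings first.
CANARY_HOSTS = {"web001"}

# The first path appears in the push commands earlier in this post.
# The "-canary" path is an invented name for the second config file.
NORMAL_CONFIG = "/net/pushserver/config/webserver"
CANARY_CONFIG = "/net/pushserver/config/webserver-canary"


def config_path_for(host: str) -> str:
    """Pick which config file a given web server should watch."""
    return CANARY_CONFIG if host in CANARY_HOSTS else NORMAL_CONFIG


for host in ("web001", "web002"):
    print(host, "->", config_path_for(host))
```

Both files start out with identical contents. A release means editing only the canary file, and a rollback means putting the old contents back into that one file.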

You can get much more complicated with this sort of canarying scheme. Maybe you push to 1%, wait a day, then push to 10%, then go to 50% a day after that, and 100% by the end of the week. 1% is where you're looking for problems which will crop up in the new code itself and only affect the local server. It's the 10% and 50% stages where you start getting to see what happens with a new service running around on the network. If you've added a whole bunch of new load on some backend service, it might start feeling the pain around this point. By 100%, you'd better have all of that ironed out.
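One way to pick which machines belong to each stage is sketched below. It is not the only way to do it. The idea is to sort the fleet by a hash of the host name and take a prefix of that list, so every machine in the 1% stage is still included at 10%, 50%, and 100%. The `hosts_for_stage` function and the `web001`-style host names are assumptions made for this example.

```python
import hashlib
import math

STAGES = [1, 10, 50, 100]  # percent of the fleet, matching the schedule above


def stable_order(fleet: list[str]) -> list[str]:
    """Sort hosts by a hash of their name, giving a fixed but arbitrary order."""
    return sorted(fleet, key=lambda h: hashlib.sha256(h.encode("utf-8")).hexdigest())


def hosts_for_stage(fleet: list[str], percent: int) -> list[str]:
    """Return the hosts which should run the new config at this stage.

    Always picks at least one host from a non-empty fleet, so a "1%" stage
    on a small fleet still has a canary.
    """
    count = max(1, math.ceil(len(fleet) * percent / 100))
    return stable_order(fleet)[:count]


fleet = [f"web{n:03d}" for n in range(1, 101)]  # 100 hypothetical servers
for percent in STAGES:
    print(f"{percent:>3}%: {len(hosts_for_stage(fleet, percent))} hosts")
```

Because each stage is a prefix of the same ordering, going from one stage to the next only adds machines and never moves the new config off a host which already has it.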

There's more to this, naturally. You aren't really testing anything if you don't have a way to associate problems with the kind of system they came from. If all you have is a global error count which isn't broken down any further, you can't be sure if the errors are coming from the new config or something else entirely. Even just a simple split between canary and non-canary machines would give you some chance of identifying the source.
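A minimal sketch of that canary versus non-canary split follows. It assumes you can get at one record per request, holding the host which served it and whether it failed. The record shape, the host names, and the numbers are hypothetical.

```python
from collections import defaultdict


def rates_by_group(events, canaries):
    """Compute an error rate for canary and non-canary machines separately.

    `events` is an iterable of (host, had_error) pairs, one per request.
    `canaries` is the set of host names running the new config.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [errors, requests]
    for host, had_error in events:
        group = "canary" if host in canaries else "non-canary"
        totals[group][0] += int(had_error)
        totals[group][1] += 1
    return {group: errs / reqs for group, (errs, reqs) in totals.items()}


# Made-up traffic: web001 is the canary and is failing 30% of its requests.
events = (
    [("web001", True)] * 30 + [("web001", False)] * 70
    + [("web002", True)] * 1 + [("web002", False)] * 99
    + [("web003", False)] * 100
)
print(rates_by_group(events, canaries={"web001"}))
```

In this made-up data, the global error rate is about 10%, which tells you something is wrong but not where. Split by group, it's obvious that the canary is the one in trouble.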

If this is old hat, then you're ahead of most places. Way to go!

Then again, if all you're doing is serving cat pictures, does it matter?