Writing

Software, technology, sysadmin war stories, and more.
Monday, October 10, 2011

Primary, backup, manual failover? You've already lost.

There's a line by Steve Jobs about mobile devices which basically says "if you see a stylus, they blew it". It's the kind of observation which gives other people a rule of thumb to try using the next time they encounter a similar device. You don't have to take it at face value, but you can use the opportunity to judge it for yourself and see if there's any merit to his claim.

In that spirit, I would like to provide an observation for a service which is supposed to have high availability: "if you have a primary and a backup, you have already failed". Here's why I say that.

I noticed a pattern at some point. You'd have these people running a service and they'd know about maintenance or some other scheduled situation wherever their "primary" instance happened to live. They'd have to bring the team together, apply a whole bunch of changes, and do all kinds of crazy stuff to make it "flip".

One technique I saw was setting the whole system to read-only, waiting for it to settle down, and then demoting the database's read-write master to a read-only slave. Then they'd take the "backup" node which had been a read-only slave and make it the master. This would usually work, but sometimes bad things would happen.

In any case, if it actually worked, then they would put the whole system back to its usual read-write mode and continue with life. They thought this was normal and saw nothing wrong with it. All I could do was shake my head and just wonder whatever gave them that idea.
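To make the cost of that flip concrete, here is a minimal sketch of the sequence as a toy model. Everything in it is an assumption for illustration: the `Site` class, the node names `db1` and `db2`, and the rule that a write needs read-write mode plus exactly one master. It leaves out the "wait for it to settle down" step and is not anyone's real tooling. It only shows that writes are refused at every step until the very last one.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Toy model: two database nodes and a site-wide read-only switch."""
    read_only: bool = False
    roles: dict = field(default_factory=lambda: {"db1": "master", "db2": "slave"})

    def try_write(self) -> bool:
        # Assumption: a write needs read-write mode and exactly one master.
        masters = [name for name, role in self.roles.items() if role == "master"]
        return not self.read_only and len(masters) == 1


def manual_flip(site: Site, old: str, new: str) -> list:
    """Run the flip step by step, probing write availability after each step."""
    log = []

    def step(name, action):
        action()
        log.append((name, site.try_write()))

    step("set site read-only", lambda: setattr(site, "read_only", True))
    step("demote old master", lambda: site.roles.update({old: "slave"}))
    step("promote old slave", lambda: site.roles.update({new: "master"}))
    step("set site read-write", lambda: setattr(site, "read_only", False))
    return log


site = Site()
for name, writable in manual_flip(site, "db1", "db2"):
    print(f"{name:22} writes accepted: {writable}")
```

In this model the first three steps all report that writes are refused, and that is the best case where nothing goes wrong and nobody has to stop and consult the playbook.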

There are multiple reasons why I think this is bad. First, it creates a user-visible outage. When you flip the whole site over to being read-only, the users will notice. Also, if you make this a regular thing, then the smarter ones will figure out what you're doing: manually flipping service back and forth to dodge external dependencies. At that point they will see what a rinky-dink operation you have. If you're selling service on the basis of reliability and "cloud this" and "cloud that", that doesn't work.

Second, it puts a human in the critical path for avoiding outages. This is just mean. It means some poor person has to sit there with a playbook and monkey their way through a process every time it comes up. What if that playbook gets out of date? Do they get to wake other people up at 3 AM if they can't figure it out?

Aside: I have a whole rant about having playbooks for a production service in the first place, but that is a topic for another time. Stay tuned, I guess.

Third, it does not cover those situations where your primary will die for unplanned reasons. Imagine that! What do you do when a squirrel decides to warm itself up by shorting across your service entry? Last time I checked, the squirrel union does not file notices when one of its members wants to go visit Elvis. They just do it.

Start thinking about multiple masters and a world of no primary. Once you get into that groove, then you can start planning for capacity. That opens the door to things like N+1, N+2, worrying about "shared fate", and so on.
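The capacity arithmetic behind N+1 and N+2 is small enough to sketch. This is only an illustration under made-up numbers: the peak load of 900 requests per second, the 250 requests per second each instance can handle, the rack names, and all of the function names are assumptions, not figures from any real service. The idea is that N is what you need to carry peak load, the "+k" is how many instances you can lose and still carry it, and "shared fate" asks what happens when several instances go away together because they sit behind the same rack, power feed, or switch.

```python
import math


def instances_needed(peak_load: float, per_instance: float, spares: int = 1) -> int:
    """N is the count needed to carry peak load; N+spares is what you deploy."""
    n = math.ceil(peak_load / per_instance)
    return n + spares


def survives_failures(total: int, failed: int, peak_load: float, per_instance: float) -> bool:
    """Can the remaining instances still carry peak load?"""
    return (total - failed) * per_instance >= peak_load


def survives_domain_loss(placement: dict, peak_load: float, per_instance: float) -> bool:
    """Shared fate: lose the largest failure domain (e.g. a rack) all at once."""
    remaining = sum(placement.values()) - max(placement.values())
    return remaining * per_instance >= peak_load


PEAK, PER = 900, 250  # hypothetical requests per second

print(instances_needed(PEAK, PER, spares=1))  # N+1
print(instances_needed(PEAK, PER, spares=2))  # N+2
print(survives_failures(5, 1, PEAK, PER))     # N+1 with one instance down
print(survives_failures(5, 2, PEAK, PER))     # N+1 with two instances down
print(survives_domain_loss({"rack1": 4, "rack2": 2}, PEAK, PER))
print(survives_domain_loss({"rack1": 2, "rack2": 2, "rack3": 2}, PEAK, PER))
```

With these numbers, six instances are enough for N+2 on paper, but putting four of them in one rack means a single rack failure takes out more than the two spares you planned for. Spreading the same six across three racks keeps the loss of any one rack inside the budget.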

Or, you know, you can just buy a lot of bananas and settle in for a long winter of manual failover monkey work. It's up to you.