Global load balancing configs and a really bad day
I have some words of advice for anyone who might try to set up a widely distributed service using "cloud" systems. Specifically, I want to share a situation which can happen and will cause you and your users to have a very bad day. First, I need to describe the infrastructure so that it will make sense to those not already doing this.
Imagine that you have a bunch of daemons listening on TCP ports on machines which are scattered all over the world. They tend to bounce around from one machine to the next. To make life easier on yourself, you also let them grab port numbers dynamically. Yes, I said easier. By not trying to reuse a port number, you can restart your binary and not have to worry about things like choosing between setting SO_REUSEADDR and letting lingering connections keep you from starting up.
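In case that sounds abstract, the "grab a port dynamically" part is about as small as it gets. Here's a rough sketch in Python; the register_endpoint() call at the end is a made-up stand-in for whatever announces the daemon's location to the rest of the system.

```
import socket

# Bind to port 0 and let the kernel hand out any free ephemeral port.
# No SO_REUSEADDR decisions, no waiting for old connections to drain.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("0.0.0.0", 0))
sock.listen(128)

_, port = sock.getsockname()
print(f"listening on port {port}")

# The daemon then has to announce where it landed, e.g.:
# register_endpoint(service="catpics-backend", port=port)   # hypothetical hook
```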
Of course, if both the IP addresses and ports of your processes keep changing, you need some way to keep track of them. Otherwise, how will your frontend load balancers route traffic to them? Let's further assume that you are (ab)using DNS to do this via SRV records.
For those not familiar with that kind of record, it lets you publish a bunch of useful things about a service. Instead of being limited to just IP addresses (which is what you get when you use A records), you can set the target name, priority, weight, and even port number. A sufficiently advanced client could do a query for an agreed-upon name and get back all of the instances you have running. With a short enough TTL, it would be able to keep up with changes, albeit at the expense of having to do a lot of lookup traffic.
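If you've never looked at one, an SRV record in a zone file looks something like this. The names, TTL, and port here are all invented; the shape is what matters.

```
; _service._proto.name               TTL class type priority weight port  target
_catpics._tcp.backends.example.com.  30  IN    SRV  10       60     31873 be1.us-west.example.com.
```

Priority picks the preferred set of targets, weight spreads load within that set, and the port field is what lets your backends land on whatever port the kernel handed them.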
This isn't a perfect analogy, but let's just go with DNS for the sake of having something not completely esoteric here. The point is, your load balancers can then see an incoming request, do a quick lookup, find all of the possible "end points" to handle that request, and then try one and see what happens.
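The frontend side of that lookup-and-pick step can be a few lines. This sketch assumes something like the dnspython resolver, and the service name is again made up.

```
import random
import dns.resolver  # dnspython

def pick_backend(name="_catpics._tcp.backends.example.com"):
    """Resolve the SRV name and pick one (host, port) endpoint to try."""
    answers = dns.resolver.resolve(name, "SRV")
    best = min(r.priority for r in answers)           # lowest priority wins
    candidates = [r for r in answers if r.priority == best]
    weights = [r.weight or 1 for r in candidates]     # spread load by weight
    pick = random.choices(candidates, weights=weights, k=1)[0]
    return str(pick.target).rstrip("."), pick.port
```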
So imagine you have this going, and life's pretty good. Your frontends are even smart enough to remember when a backend goes bad and will stop sending traffic to it for a while. They'll keep trying to reconnect in the background while sending traffic to the remaining replicas. You always keep more than enough backend processes running, so your site continues to handle the load as it ebbs and flows during the day.
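That "remember when a backend goes bad" behavior doesn't have to be fancy, either. A rough sketch of the bookkeeping, with a made-up cooldown:

```
import time

class BackendTracker:
    """Skip endpoints that failed recently; let them back in after a cooldown."""

    def __init__(self, cooldown_seconds=60):
        self.cooldown = cooldown_seconds
        self.bad_until = {}   # (host, port) -> monotonic time when we may retry

    def mark_bad(self, endpoint):
        self.bad_until[endpoint] = time.monotonic() + self.cooldown

    def usable(self, endpoints):
        now = time.monotonic()
        live = [e for e in endpoints if self.bad_until.get(e, 0) <= now]
        # If everything is marked bad, try them all anyway rather than serve nothing.
        return live or list(endpoints)
```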
Time passes, and you start using a helper script to keep things going. It's responsible for listing all of the possible sites where you might be running replicas. Maybe you have some running at Amazon site A, Rackspace site B, and Peer1 site C. Every time it runs, it does a search at all known locations, finds the IP+port pairs, writes a new DNS zone, and makes it live.
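That script probably starts out shaped something like this. Everything here is invented, and find_replicas() is a stand-in for however you actually enumerate the live IP+port pairs at a site.

```
SITES = ["amazon-a", "rackspace-b", "peer1-c"]
SKIP_SITES = []   # drop a site name in here to "drain" it

def build_zone(find_replicas, ttl=30):
    """Emit SRV records for every live replica at every non-skipped site.

    find_replicas(site) should yield (hostname, port) pairs.
    """
    lines = []
    for site in SITES:
        if site in SKIP_SITES:
            continue
        for host, port in find_replicas(site):
            lines.append(
                f"_catpics._tcp.backends.example.com. {ttl} IN SRV 10 10 {port} {host}."
            )
    return "\n".join(lines) + "\n"
```

That SKIP_SITES list is the hook for the "drain" trick described next.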
This gives you the opportunity to "drain" a region for maintenance. Let's say there's yet another super storm headed for the east coast and you expect bad things to happen to your colocated machines out there. You can just drop into the script and comment out site C. Then, the next time it runs, it'll only populate the list with replicas at A and B, and your frontends won't try to send traffic to the dead site.
This scheme will probably work well. Things will grow, and you will pick up many more sites. Instead of A, B, and C, maybe now you'll have 26 sites, one for each letter of the alphabet.
At some point it gets too messy to populate one list with all 26 possibilities, particularly since some of them make for suboptimal pairings of frontends and backends. You don't want a west coast US frontend to connect to a backend in eastern Europe if you can really help it. You'd much rather have it hit something also on the west coast of the US if at all possible. You can't argue with the speed of light, after all.
Now your script morphs some more. Instead of having one DNS name with tons of SRV records attached, you now have a bunch of DNS names, and your frontends use a different one based on whatever region they may be living in. You apply a whole bunch of "weights" to the different frontend and backend regions, so the script will go through and try to populate each list with the nearest two or three backend locations.
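By this point the script is probably carrying around a table that maps each frontend region to its preferred backend sites, nearest first. A sketch, reusing the made-up find_replicas() stand-in from before; the region names are also invented.

```
# Hypothetical preference table: for each frontend region, the backend
# sites to publish, nearest first.
REGION_PREFS = {
    "us-west": ["us-west-1", "us-west-2", "us-central-1"],
    "eu-east": ["eu-east-1", "eu-central-1", "eu-west-1"],
}

def build_regional_zones(find_replicas, max_sites=3, ttl=30):
    """One SRV name per frontend region, fed by its nearest backend sites."""
    zones = {}
    for region, preferred in REGION_PREFS.items():
        name = f"_catpics._tcp.{region}.backends.example.com."
        records = []
        for site in preferred[:max_sites]:
            for host, port in find_replicas(site):
                records.append(f"{name} {ttl} IN SRV 10 10 {port} {host}.")
        zones[name] = records
    return zones
```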
This, too, works fairly well. You continue to either comment things out or (more likely) edit the "skip these cells" list in the (Python) script and have it regenerate the zones. Things rewrite themselves to avoid outages and life goes on.
Then, one day, something bad happens. A service which has a hard requirement on your availability goes down. Pagers are going off. Managers are running around demanding to know what's happening. The press picks up on it and starts talking about a massive failure at your company, since nobody can log into their cat picture sharing page today.
You point the finger at the database service which holds all of your data. It must be them. After all, everything is fine with us, so it has to be them. Yep. That must be it. You page them for good measure.
They wake up and start running around to figure it out, but nothing seems wrong. The database service people are getting tons of traffic from you, but they're actually holding up under the elevated load. They want to know why you've started beating on them so much all of a sudden. It's not affecting other users, but it's not a good sign, either.
This goes around and around for an hour or two. Then, somehow, someone finds it. Someone had pushed out a new config file for the DNS zones without adequately verifying that it was sane. Either they had edited it manually and screwed up entries that were then ignored, or they ran the script and it barfed because it wasn't able to find adequate backend replicas.
Then, despite the fact that the zone file was broken, they pushed it into production. All of their frontends started noticing this, and started shoving ALL of their global traffic to just a meager handful of backend instances. Those instances naturally could not handle the load of the entire Internet's thirst for cat pictures, and started failing too.
Finally, someone realizes what needs to be done and pushes another DNS zone which contains an adequate spread for the traffic, and the frontends slowly shift their traffic over to those locations. Queries start going through again, and logins to the cat picture service start working again.
All of this happened because a script had been taken too far and had never been given adequate sanity checks. It should have refused to write any output when it couldn't find enough backends. Instead, it just used the one or two which still existed (as hard-coded 'last resort' fallbacks) for the whole world, and everything which followed was inevitable.
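The missing check is embarrassingly small. Something like this, bolted onto a zone builder like the sketches above, would have stopped the whole mess; the threshold is made up, and yours should come from real capacity numbers.

```
MIN_BACKENDS_PER_REGION = 3   # invented floor; set it from what your load actually needs

def refuse_if_too_thin(zones):
    """Refuse to publish any region that would funnel its traffic onto a sliver."""
    for name, records in zones.items():
        if len(records) < MIN_BACKENDS_PER_REGION:
            raise RuntimeError(
                f"{name}: only {len(records)} backends found; refusing to publish"
            )
```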
If your load balancing system involves scripts and humans pushing the output of those scripts without running a "diff" to see what's actually changed, you should worry. Also, if this sort of thing happens outside of the realm of a source control mechanism which has audit history and the ability to roll back to old versions, you should really worry.
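Even a bare-bones "show me the diff and make me type y" gate would help. A sketch, assuming the currently live zone sits in a file you can read back:

```
import difflib
import pathlib
import sys

def confirm_push(live_path, new_text):
    """Show a unified diff of live vs. proposed zone and make a human approve it."""
    old = pathlib.Path(live_path).read_text().splitlines(keepends=True)
    new = new_text.splitlines(keepends=True)
    diff = list(difflib.unified_diff(old, new, fromfile="live", tofile="proposed"))
    sys.stdout.writelines(diff)
    if diff and input("push this? [y/N] ").strip().lower() != "y":
        raise SystemExit("push aborted")
```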
Another piece of the puzzle here is that the frontends all tended to switch to the new data in unison. Instead of having a staged rollout, it all happened globally within the span of a couple of seconds. When you stage things, you can see if something's wrong because the staging site will start behaving oddly. The others will still be unchanged and will give those users normal behavior. This way you don't mess up all of your users at once.
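Staging doesn't have to be elaborate, either. The shape of it is roughly this, where push_zone_to() and looks_healthy() are stand-ins for whatever push mechanism and health check you already have:

```
import time

def staged_push(zone_text, waves, push_zone_to, looks_healthy, bake_seconds=600):
    """Roll the new zone out one wave of sites at a time instead of all at once."""
    for wave in waves:
        for site in wave:
            push_zone_to(site, zone_text)
        time.sleep(bake_seconds)             # let the wave bake before moving on
        if not looks_healthy(wave):
            raise RuntimeError(f"wave {wave} looks unhealthy; stopping the rollout")
```

Call it with a single canary site as the first wave and everything else behind it, and a bad config ruins one site's afternoon instead of everyone's.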
Does any of this sound familiar? It's probably happened more than once already.