Consistency vs. speed tradeoffs in replicated services
Sooner or later, anyone who works with databases beyond a certain size will have to deal with the consistency vs. speed tradeoff. This is the basic problem where you have to decide which one is more important to your application. On the surface, it seems simple enough: you either do a write and the equivalent of an fsync() to make sure it gets pushed to all of your far-flung nodes, or you just throw it out there and hope it works out.
I've seen some designs which seem to freak out about replication lag and so they get this death grip on making sure everything is always up to date. Not surprisingly, they trade away application responsiveness to keep replication lag down. What always confused me about this is why it was viewed as if you had to use a single technique for every aspect of your service when a hybrid approach might work better.
Here's a scenario. Let's say you have some flavor of database which has five replicas (I've shown three of the five in the diagram as sites A-C, and site D is an example site which doesn't have a replica). You have one on either side of the US, one in western Europe, one in Asia, and another in South America. They are set up in a multi-master arrangement, which means any of them will accept writes. Those writes are visible immediately if you query that same server. However, there are varying degrees of lag to the other locations ranging from seconds to under 10 minutes.
Then you have your application server. It's responsible for taking user requests and modifying state in the target item. Let's say it's a board game, and it's multiplayer. Anyone who's part of a given game can make a move ("join piece 1337 with piece 314"), and this server is responsible for updating the "game grid" accordingly.
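Just to make the rest of this concrete, here's a rough Go sketch of the kind of state that app server holds. Every name here is made up for illustration; the real thing could look completely different.

```go
package game

import "time"

// Move is one player action: "join piece 1337 with piece 314".
type Move struct {
	Player    string
	PieceA    int
	PieceB    int
	AppliedAt time.Time
}

// Game is the in-memory state one app server owns at a time.
type Game struct {
	ID    string
	Grid  map[int]int // piece -> piece it is joined to (illustrative)
	Moves []Move      // history, flushed to the backing store
}

// NewGame sets up an empty game owned by this app server.
func NewGame(id string) *Game {
	return &Game{ID: id, Grid: map[int]int{}}
}

// Apply updates the grid and records the move.
func (g *Game) Apply(m Move) {
	g.Grid[m.PieceA] = m.PieceB
	g.Grid[m.PieceB] = m.PieceA
	g.Moves = append(g.Moves, m)
}
```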
Let's also say that you have computing clusters in the same five locations as your database replicas, so you can run your application server "close" to your data storage. This gives you really quick reads and writes since you don't have to cross the Internet to get to a storage node.
Finally, you have your web frontends. These take incoming HTTP requests from your users, turn them into RPC requests for the game, and send them to an application server. Then they take the response from the server, format it appropriately (HTML, JSON, that sort of thing) and return it to the user. These web frontends are located all over the world. Also, due to the fun nature of routing on the Internet, your users can and will "appear" at any of them given enough time.
How do you tackle this? Well, let's look at the extremes first and say that you're using your database for all state and are using the "eventual consistency" model of replication. Your users hit a web frontend, which hits an app server, which adjusts the "grid" as it can see it in its local copy of the database, then it returns a result.
Trouble is, user A is hitting frontend A with one version of the data, and user B is hitting frontend B with another. Eventually, you have some conflicting movements, and your game dies. Okay, so that won't work.
Let's go to the other extreme. You still use the database for all state storage, but now you have the lock-step behavior of forced synchronization of all reads and writes globally. Someone makes a move, and now you have to wait for all of the storage nodes to acknowledge and save it. If one of them is slow or dead, what happens? Does the app queue up the write? Does it just fail the operation and give a frowny face to the user?
Even if all of your storage nodes are reachable and behaving properly, there's a lot of overhead to worry about. You might kick all of those RPCs off in parallel, but they're all going to have to finish successfully before you can acknowledge the user's move. Odds are, they're going to notice, since the world is a big place and light is relatively slow.
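Here's a small sketch of what that lock-step write path looks like, assuming a hypothetical Replica interface. The point is that every single node has to ack before the user hears anything, so the slowest (or deadest) one sets the pace.

```go
package store

import (
	"context"
	"fmt"
)

// Replica is a hypothetical handle to one far-flung storage node.
type Replica interface {
	Write(ctx context.Context, key string, value []byte) error
}

// SyncWrite fans a write out to every replica in parallel and only
// returns once all of them have acknowledged it. One slow or dead
// node stalls (or fails) the whole user-visible operation.
func SyncWrite(ctx context.Context, replicas []Replica, key string, value []byte) error {
	errs := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(r Replica) {
			errs <- r.Write(ctx, key, value)
		}(r)
	}
	for range replicas {
		if err := <-errs; err != nil {
			return fmt.Errorf("write not globally committed: %w", err)
		}
	}
	return nil
}
```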
Now I will propose a hybrid approach. Instead of using your database for all state management of a game, you instead only use it as a backing store, and you use the eventual consistency scheme for it. You also have a small directory server which does do forced synchronization, but you use it sparingly. Actually making changes in a game happens in memory on a single application server, and only one application server may "own" that particular game at any given moment. All reads and writes for that game are thus directed to that app server.
In this scenario, the first user into a game hits a web frontend, which then hits the closest available app server, and that creates a game. This can do a one-time write to a forced-sync service which says "I now own game #12345". There will be a little lag as it syncs, but it's a small write, and it only has to happen once. Then it starts allowing moves by that user, and as it updates its local storage, it commits those changes to the database, which then updates the other nodes when it can.
Now let's say a second user connects to this existing game. They hit a local web frontend which looks up that game in its directory service. It finds that an app server somewhere is running it already, so it directs all further communications there. As other users connect, the same process repeats.
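The directory's job really is that small. Here's a sketch, assuming some hypothetical strongly-consistent key/value store underneath it (the compare-and-swap is the "forced sync" part, and it only gets paid once per game):

```go
package directory

import (
	"context"
	"errors"
)

// SyncKV is a hypothetical strongly-consistent store: CompareAndSwap
// does not return until the value is durable at every site.
type SyncKV interface {
	CompareAndSwap(ctx context.Context, key, old, new string) (bool, error)
	Get(ctx context.Context, key string) (string, error)
}

var ErrAlreadyOwned = errors.New("game already owned by another app server")

// Claim records "appServer now owns gameID". It pays the global sync
// cost once, at game creation time.
func Claim(ctx context.Context, kv SyncKV, gameID, appServer string) error {
	ok, err := kv.CompareAndSwap(ctx, "owner/"+gameID, "", appServer)
	if err != nil {
		return err
	}
	if !ok {
		return ErrAlreadyOwned
	}
	return nil
}

// Lookup tells a web frontend which app server should get a game's traffic.
func Lookup(ctx context.Context, kv SyncKV, gameID string) (string, error) {
	return kv.Get(ctx, "owner/"+gameID)
}
```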
Assuming all of that works properly, you're done. Their game "lives" in an app server somewhere, and user traffic flows accordingly.
Okay, so let's throw a monkey wrench (spanner!) or two into the works. Let's say the database DA (database server, site A) underneath the app server AA (application server, site A) dies. The app server still has all of the state for the game, so it can just switch to the next-closest database server DB (database server, site B) and start writing there. Some old writes from server DA may eventually get to server DB, but you can use timestamps to just ignore them.
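The "use timestamps to just ignore them" part can be plain last-write-wins per key. A sketch, with made-up types:

```go
package store

import "time"

// ReplicatedWrite is one change arriving via replication from another site.
type ReplicatedWrite struct {
	Key       string
	Value     []byte
	Timestamp time.Time // assigned by the app server that made the move
}

// LWWStore tracks the newest timestamp accepted for each key.
type LWWStore struct {
	values  map[string][]byte
	applied map[string]time.Time
}

func NewLWWStore() *LWWStore {
	return &LWWStore{values: map[string][]byte{}, applied: map[string]time.Time{}}
}

// Apply takes a replicated write, but drops it if something newer has
// already been written for that key (e.g. old writes from DA finally
// trickling in after the app server failed over to DB).
func (s *LWWStore) Apply(w ReplicatedWrite) bool {
	if last, ok := s.applied[w.Key]; ok && !w.Timestamp.After(last) {
		return false // stale, ignore
	}
	s.values[w.Key] = w.Value
	s.applied[w.Key] = w.Timestamp
	return true
}
```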
Ideally, clients would not notice the app server failing over to another storage backend. There might be a slight increase in latency as it waits for writes to complete to the backend since it is now relatively far away. It can always keep checking for DA to come back up and then go back to that.
That one was relatively simple. This next one is harder. Something bad happens to the instance of your app server which was hosting the game. Maybe it crashes, or maybe it gets booted off the machine where it's running. Let's assume for the sake of simplicity that any state it had internally did get committed to the local database DA.
The next user request will try to make it to an app server for that game and will fail; in other words, it will find a stale lease in the directory service. At this point it will have to reach out to another instance of the app server in the same location (A) where the game was last hosted. That app server then has to acquire the lease for that game in the directory service (necessitating a sync delay), load up the state, and then pick up from where the other one left off.
This will probably happen in parallel with multiple app servers trying to acquire the lease. Naturally, "there can be only one", and the others will lose. They either need to forward the request which got them trying in the first place, or fail the request and tell the client to try again, since it should find the winner in the directory now.
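That race can ride on the same compare-and-swap the directory already does, with an expiry on the lease so a dead owner eventually counts as stale. A hedged sketch, since the real directory could work any number of ways:

```go
package directory

import (
	"context"
	"time"
)

// Lease is the directory record for one game (hypothetical layout).
type Lease struct {
	Owner   string
	Expires time.Time
}

// LeaseKV is a hypothetical forced-sync store holding leases.
type LeaseKV interface {
	Get(ctx context.Context, gameID string) (Lease, error)
	CompareAndSwap(ctx context.Context, gameID string, old, new Lease) (bool, error)
}

// TryTakeover is what each app server at site A runs when a request for
// an orphaned game shows up. Exactly one CAS wins; the losers should
// forward the request to the winner (or tell the client to retry).
func TryTakeover(ctx context.Context, kv LeaseKV, gameID, me string, ttl time.Duration) (winner string, won bool, err error) {
	cur, err := kv.Get(ctx, gameID)
	if err != nil {
		return "", false, err
	}
	if time.Now().Before(cur.Expires) {
		return cur.Owner, false, nil // lease still live, not actually orphaned
	}
	next := Lease{Owner: me, Expires: time.Now().Add(ttl)}
	ok, err := kv.CompareAndSwap(ctx, gameID, cur, next)
	if err != nil {
		return "", false, err
	}
	if !ok {
		cur, err = kv.Get(ctx, gameID) // someone else won; find out who
		if err != nil {
			return "", false, err
		}
		return cur.Owner, false, nil
	}
	return me, true, nil
}
```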
In this scenario, aside from some user-visible lag, you'll probably survive since the state is picked up from exactly where it was before by virtue of reading back from the same database node.
Now I'm going to throw a double fault at you. Let's say all of site A goes down, which means you lose both the application server AA and the database server DA. The web frontends can no longer reach any application server in the same location where the game was hosted, so one in another location gets the lease. We'll call it AC (application server, site C).
AC's first order of business should be to reach out to the original database server which was "home" for that game, DA. This will fail. Now we have a conundrum. There could have been changes which were committed to the database and acknowledged to clients, but which now reside only on DA because they had not yet propagated out.
You could try to pick it up from whatever version of the data you've managed to get at DC (database server, site C), but there's no guarantee it's the latest. Even if you scan all of the databases to see who has the best "high water mark" (most recent write), even that isn't necessarily the last one which was committed back at A.
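A scan like that is easy enough to write (the interface below is hypothetical), and the comment in it is the whole problem: the best answer you find among the survivors is only a lower bound on what site A had acknowledged.

```go
package recovery

import "context"

// ReplicaScan is a hypothetical read-only view of one database replica.
type ReplicaScan interface {
	HighWaterMark(ctx context.Context, gameID string) (int64, error)
}

// BestKnownHWM asks every reachable replica how far it got for a game
// and returns the highest answer. Caveat: this is only a lower bound.
// A write acknowledged at the failed site A may not have replicated
// anywhere yet, and no amount of scanning survivors will reveal it.
func BestKnownHWM(ctx context.Context, replicas []ReplicaScan, gameID string) int64 {
	var best int64 = -1
	for _, r := range replicas {
		hwm, err := r.HighWaterMark(ctx, gameID)
		if err != nil {
			continue // unreachable replica, skip it
		}
		if hwm > best {
			best = hwm
		}
	}
	return best
}
```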
One thing you could do is go into read-only mode for the game until the A platform comes back up. Then you'd have access to any of those writes which might be stuck in there. It's a pity that you don't know if this is actually necessary, though! Maybe everything is actually current globally, and you just can't be sure, so you're playing safe instead.
A possibility at this point would be to introduce a globally-synced version number/high water mark storage system. This would make it so that every write to your database gets an incrementing number and then that number is synced globally before you let the client know it succeeded. It's smaller than pushing a full update out, but it still introduces some amount of latency as you wait for it to be committed. What do you do?
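Sketched out, it might look like this. The interfaces are made up, but the shape is: the full write still goes to the local (eventually consistent) database, while the tiny sequence number gets synced globally before the client hears "ok".

```go
package game

import "context"

// SeqStore is a hypothetical globally-synced counter: Advance does not
// return until every site agrees the game's high water mark is seq.
type SeqStore interface {
	Advance(ctx context.Context, gameID string, seq int64) error
}

// LocalDB is the eventually-consistent backing store at this site.
type LocalDB interface {
	WriteMove(ctx context.Context, gameID string, seq int64, move []byte) error
}

// CommitMove writes the move locally, then pays a small global round
// trip to publish just the sequence number before acknowledging the
// client. Recovery can then ask "what was the last globally-known seq?"
func CommitMove(ctx context.Context, db LocalDB, seqs SeqStore, gameID string, seq int64, move []byte) error {
	if err := db.WriteMove(ctx, gameID, seq, move); err != nil {
		return err
	}
	return seqs.Advance(ctx, gameID, seq) // small, but still adds latency
}
```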
I have one thought. What we really need to know here is just how far the game got before replication died. Who knew that? Well, the app server did, but it's dead.
Did anyone else know? Of course someone else did. The clients did!
Assuming your clients are actually small Javascript programs running in various web browsers, they can have some smarts. They could be given an opaque blob which says what the high water mark is every time they get an update. Then they return that same blob when they make a change.
The poor app server at C which is trying to make sense of this whole affair while hanging out in read-only mode will be receiving these blobs from waiting clients. It can use them to determine how far along the game got. They should converge on a small number of values, and the highest one is the most recent update any of your current clients ever saw. If you have data up through that point in your database, then you're already caught up! There's no need to stay read-only. After a short interval to make sure you've gathered blobs from everyone, you can keep going.
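The server side of that convergence is nearly trivial. A sketch, where the "blob" is just the last sequence number the client saw (the client never has to interpret it):

```go
package recovery

// ClientBlob is the opaque token clients echo back with each request.
// Here it's just the last sequence number they saw; the layout is a
// made-up example.
type ClientBlob struct {
	GameID  string
	LastSeq int64
}

// CaughtUp decides whether the read-only recovery period can end: if
// the highest sequence number any waiting client has seen is already
// present in our local copy of the data, nothing visible was lost and
// the game can resume.
func CaughtUp(blobs []ClientBlob, localSeq int64) bool {
	var newest int64 = -1
	for _, b := range blobs {
		if b.LastSeq > newest {
			newest = b.LastSeq
		}
	}
	return newest <= localSeq
}
```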
Sure, there's a way for this to fail, too. If you have a triple fault, which is the above scenario plus a client who received an update and then never checks back in, the game will resume from the prior move as if that update never happened. Of course, this is a little like wondering if a tree falling in a forest makes a noise if nobody's around to hear it. If there's no observer, does it matter?
You can avoid some of the pain caused by losing an entire platform by being smart about shutting things down. If you shut down the storage systems first, the app servers can start directing writes elsewhere. Then if you do a soft shutdown of your app servers, they can do an active handoff to another one which isn't "doomed" and keep the game going. This can be as simple as poking them and saying "you have 300 seconds to live" instead of just doing a kill -9 when it's time to bring down a system.
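That "you have 300 seconds to live" poke can be as boring as a signal handler that drains before exiting. A sketch, where handoffAll is a stand-in for whatever "flush state, update the directory, redirect clients" looks like in your world:

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// handoffAll is assumed to move every locally-owned game to a healthy
// app server elsewhere: flush state, update the directory, redirect
// clients. It is only a stub here.
func handoffAll(ctx context.Context) {
	log.Print("handing off all games (stub for illustration)")
}

func main() {
	// SIGTERM is the "you have N seconds to live" poke; a kill -9
	// would never get here, which is exactly the point.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	<-sigs
	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Second)
	defer cancel()

	handoffAll(ctx) // drain sanely instead of dying mid-game
	log.Print("drained, exiting")
}
```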
Bad things will happen, and you just do your best when they do. When you know in advance that maintenance is coming, take the effort to have your systems drain traffic away sanely and methodically. Your users will appreciate it.