Software, technology, sysadmin war stories, and more.
Monday, March 11, 2013

Synchronous config updates and thundering herds

I picked up the term "thundering herd" some years back when the Linux vs. NT wars were heating up. People were starting to run benchmarks comparing the two systems in terms of performance when serving web pages, and there were things which went poorly on the Linux side. Not all of it was strictly the fault of Linux itself (which is just a kernel), but that didn't matter to outside observers.

I first heard about the "herd" in the context of having lots of server processes all trying to handle incoming connections in parallel. The situation might go like this. First, you open a socket (stream, not datagram), bind it to a port number, then set it to listen for incoming connections. Then you fork off a bunch of children who all inherit that file descriptor and proceed to wait for incoming connections.

Apparently, when a single connection would come in, all of the children would wake up, but obviously only one would ever succeed in the subsequent accept() call to actually answer the connection. The others would find nothing to do and would loop back and go to sleep. This unnecessary waking and sleeping of processes which had no real work to do added overhead to the systems and slowed them down somewhat.

This particular memory resurfaced recently upon hearing about a failure mode in a relatively large distributed system. It seems that it would have an enormous latency spike every day at 9 AM local time (for the reporting user in Brussels). At that point in the year, 9 AM there was midnight here on the west coast of the US -- DST hadn't kicked in here yet. It's surprisingly common to see tech companies using US/Pacific as their time base, and so having something run at midnight isn't much of a stretch.

It made me wonder if they had some kind of scheduled process like a cron job which did a config push at midnight. Of course, for that to have the kind of impact this user was reporting, there would need to be a synchronous waking of all of the instances coupled with some kind of nontrivial work which blocked out user requests for a noticeable interval. It's not exactly the "thundering herd" from the performance wars of the late '90s, but it's a similar concept.

I started thinking about what you'd have to do to design a system to make it hang up like this. One way would be to have a dispatching service which handles all of your config files. All of your clients connect to it and register callbacks for receiving updates. That is, when your code initializes the library for this service, it also says "run my 'NewConfig' function whenever this changes". Later on, when a fresh config file comes out, the updates fan out to all of those clients, who all receive the blob of data and wake up in their NewConfig() functions.

Usually you don't want configuration details to change while a request is being serviced. This would be a problem with a multi-threaded program, for instance. You get around this by having a mutex which blocks reads from the config while it is being updated. A request handler (which can be run at any time via the threading model) might look like this pseudocode:

HandleFeature(request) {
  ReaderMutexLock l(&config_mutex_);
  if (!config_.feature_enabled) {
    // return error
  }
  // actually do something and send reply
}

During normal operations, these handlers wake up, grab a "reader" lock, and that's no big deal, since all of the other handlers are doing the same thing. There can be any number of reader locks open in parallel since they won't conflict.

However, when it's time to apply a change, the NewConfig function has to take a full read/write mutex which will block everyone else:

NewConfig(new_data) {
  temp_config_ = ParseConfig(new_data);
  if (!temp_config_valid) {
    // return error
  }
  MutexLock l(&config_mutex_);  // blocks everyone else
  config_ = temp_config_;
}  // exiting function unlocks mutex

That's the lightweight version, where it does all of the processing (ParseConfig) before taking the lock. It does the absolute minimum amount of work with that lock held: it copies the config in, and then it returns (which, incidentally, runs the MutexLock destructor, which calls config_mutex_.Unlock(), and that unblocks everyone else, in case you were wondering).

Here's another way to handle the same situation:

NewConfig(new_data) {
  MutexLock l(&config_mutex_);  // blocks everyone else
  config_ = ParseConfig(new_data);
  // do stuff with config_
  // run a bunch of update functions
  // generate statistics
  // write many counters to the log file
  // push a status report to disk
}  // exiting function unlocks mutex

This one does the parsing and a bunch of other things with that mutex held, so as long as it's "thinking", anything else which might need to obtain a lock on that same mutex is blocked. This means your request handlers might be unresponsive.

Now, this in and of itself does not have to be the end of the world. A process which takes a break when it gets a new configuration file doesn't have to bring down your entire service. You can have other instances of that same process running, and your load balancers can merely steer the requests to those which are still available.

The problem comes when you mix both this "lots of work with a mutex held" and the "thundering herd" in the same system. Now you have a situation where all of them wake up simultaneously, receive the update, take their respective mutexes, and start applying the update in parallel. While this is going on, your poor load balancers can't send the requests anywhere, since all of your replicas are out to lunch at the same time!

This is the kind of stuff which leads to user-visible delays.

What can you do about it? That all depends on what your system is supposed to do. If you have a situation where you really, truly, need all of your instances to run in lock-step in terms of configuration, you don't have much choice in the matter, now do you? You just have to suck it up and have all of them update in parallel. You can try to minimize the pain by making the updates run as quickly as possible, but there will still be some period of time in which your availability is technically zero.

It would be better if your system could handle a mixed configuration, even if only for a relatively short interval, but that sort of thing typically has to be included early in the design of a project. When you have that kind of flexibility, you can just update your instances gradually and let them run in small batches (or even one at a time). Even if your load balancer isn't smart enough to notice this and redirect hanging requests to other instances, it'll still only affect some small percentage of users. The others will be hitting replicas which have already been updated or haven't yet been updated and won't have any lag. The ones who do see lag will only see it briefly, since it's likely their next request will be served by another replica, or in the case of "sticky sessions", their server will finish and will go back to normal.

If you tell all of your receptionists to go to lunch at the same time, don't be surprised when nobody answers the phone for a while.