Monday, February 16, 2015

The load-balanced capture effect

I'm going to describe a common frontend/backend split design. Try to visualize it and see if it resembles anything you may have run, built, or relied on over the years. Then see if you've encountered the "gotcha" that I describe later on.

Let's say we have a service which has any number of clients, some small number of load balancers, and a few dozen or a hundred servers. While this could be a web site with an HTTP proxy frontend and a bunch of Apache-ish backends, that's not the only thing this can apply to. Maybe you've written a system which flings RPCs over the network using your company's secret sauce. This applies here too.

Initially, you'll probably design a load balancing scheme where every host gets fed the same amount of traffic. It might be a round-robin thing, where backend #1 gets request #1, then backend #2 gets request #2, and so on. Maybe you'll do "least recently used" for the same basic effect. Eventually, you'll find out that requests are not created equal, and some are more costly than others. Also, you'll realize that the backend machines occasionally have other things going on, and will be unevenly loaded for other reasons.
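Just to make it concrete, here's a tiny Python sketch of that naive scheme. None of the names come from any real system; it's purely illustrative.

    import itertools

    class RoundRobinBalancer:
        """Naive balancer: every backend gets the same share of traffic,
        no matter how busy it actually is."""

        def __init__(self, backends):
            self._cycle = itertools.cycle(backends)

        def pick(self):
            # Backend #1 gets request #1, backend #2 gets request #2, and so on.
            return next(self._cycle)

    balancer = RoundRobinBalancer(["backend-001", "backend-002", "backend-003"])
    for _ in range(6):
        print(balancer.pick())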

This will lead to a system where the load balancers or even the clients can learn about the status of the backend machines. Maybe you export the load average, the number of requests being serviced, the depth of the queue, or anything of that sort. Then you can see who's actually busy and who's idle, and bias your decisions accordingly. With this running, traffic ebbs and flows and finds the right place to be. Great!
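The load-aware version might look something like this rough sketch (again, hypothetical names and numbers): each backend reports how deep its queue is, and the balancer hands the next request to whoever looks least busy.

    class LoadAwareBalancer:
        """Pick the backend that currently reports the shallowest queue."""

        def __init__(self):
            # Updated from whatever the backends export: load average,
            # requests in flight, queue depth, and so on.
            self.queue_depth = {}

        def report(self, backend, depth):
            self.queue_depth[backend] = depth

        def pick(self):
            # Whoever looks idlest gets the next request.
            return min(self.queue_depth, key=self.queue_depth.get)

    balancer = LoadAwareBalancer()
    balancer.report("backend-001", 4)
    balancer.report("backend-002", 9)
    balancer.report("backend-003", 1)
    print(balancer.pick())    # backend-003 looks the least busy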

So now let's test it by injecting a fault. Maybe someone logs in as root to one of your 100 backend machines and does something goofy like "grep something-or-other /var/crashlogs/*", intending to only search the stack traces, but unfortunately also hitting tens of GB of core dumps. This makes the machine very busy with 100% disk utilization, and it starts queuing requests instead of servicing them in a timely fashion.

The load balancers will notice this and will steer traffic away from the wayward machine, and onto its buddies. This is what you want. This will probably work really well most of the time! But, like so many of my posts, this isn't about when it works correctly. Oh no, this one is about when it all goes wrong.

Now let's inject another fault: this time, one of the machines is going to have a screw loose. Maybe it's a cosmic ray which flipped the wrong bit, or maybe one of your developers is running a test build on a production machine to "get some real traffic for testing". Whatever. The point is, this one is not right in the head, and it starts doing funny stuff.

When this broken machine receives a request, it immediately fails that request. That is, it doesn't attempt to do any work, and instead just throws back an HTTP 5xx error, or an RPC failure, or whatever applies in that context. The request dies quickly and nastily.

For example, imagine a misconfigured web server which has no idea of your virtual hosts, so it 404s everything instead of running your Python or Perl or PHP site code. It finds a very quick exit instead of doing actual work.

Do you see the problem yet? Keep reading... it gets worse.

Since the failed request has cleared out, the load balancers notice and send another request. This request is also failed quickly, and is cleared out. This opens the way for yet another request to be sent to this bad machine, and it also fails.

Let's say 99 of your 100 machines are taking 750 msec to handle a request (and actually do the work), but this single "bad boy" is taking a mere 15 msec to grab it and kill it. Is it any surprise that it's going to wind up getting the majority of incoming requests? Every time the load balancers check their list of servers, they'll see this one machine with nothing on its queue and a wonderfully low load value.

It's like this machine has some future alien technology which lets it run 50 times faster than its buddies... but of course, it doesn't. It's just punting on all of the work.

In this system, a single bad machine will capture an unreasonable amount of traffic to your environment. A few requests will manage to reach real machines and will still succeed, but the rest will fail.
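If you don't believe the math, a toy simulation will show it. This one uses the made-up numbers from above: 99 honest backends at 750 msec per request, one broken backend at 15 msec, and a balancer which always hands the next request to whichever machine frees up first (a stand-in for "whoever reports the lowest load"). The bad machine ends up with vastly more than its fair 1% share.

    import heapq

    HONEST_MS, BROKEN_MS = 750.0, 15.0
    NUM_BACKENDS, NUM_REQUESTS = 100, 10000

    # (time this backend becomes free, backend id); backend 0 is the broken one.
    free_at = [(0.0, i) for i in range(NUM_BACKENDS)]
    heapq.heapify(free_at)
    handled = [0] * NUM_BACKENDS

    for _ in range(NUM_REQUESTS):
        t, backend = heapq.heappop(free_at)   # the machine that looks idle first
        service = BROKEN_MS if backend == 0 else HONEST_MS
        handled[backend] += 1
        heapq.heappush(free_at, (t + service, backend))

    print("bad machine handled", handled[0], "of", NUM_REQUESTS, "requests")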

How do you catch this? Well, it involves taking a few steps back, so you can see the forest for the trees.

Let's say you have 100 backend servers, and they tend to handle requests in 750 msec. Some of them might be faster, and others might be slower, but maybe 99% of requests will happen in a fairly tight band between 500 and 1000 msec. That's 250 msec of variance in either direction.

Given that, a 15 msec response is going to seem completely ridiculous, isn't it? Any machine creating enough of them is worthy of scrutiny.

There's more, too. You can look at each machine in the pool and say that any given request normally has an n% chance of failing. Maybe it's really low, like a quarter of a percent: 0.25%. Again, maybe some machines are better and others are worse, but they'll probably cluster around that same value.

With that in mind, a machine which fails 100% of requests is definitely broken in the head. Even if it's only 1%, that's still completely crazy compared to the quarter of a percent seen everywhere else.
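A check like that can be as simple as comparing each machine's failure rate against the fleet-wide baseline. Something like this sketch, with an arbitrary "three times the baseline" trip wire:

    BASELINE_FAILURE_RATE = 0.0025    # the ~0.25% you see fleet-wide

    def failing_too_often(failures, requests, factor=3):
        """Flag a backend whose failure rate is several times the baseline."""
        if requests == 0:
            return False
        return (failures / requests) > factor * BASELINE_FAILURE_RATE

    print(failing_too_often(3, 1000))       # 0.3%: within the noise
    print(failing_too_often(10, 1000))      # 1%: four times the baseline, flag it
    print(failing_too_often(1000, 1000))    # 100%: definitely broken in the head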

Believe it or not, at some point you may have to break out that statistics book from school and figure out a mean, a median, and a standard deviation, and then start looking for outliers. Those which seem to be off in the weeds should be quarantined and then analyzed.
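If you want something a little more principled than an eyeballed threshold, a few lines of statistics will do it. Here's one sketch (hypothetical hostnames and numbers) which quarantines latency outliers by how many standard deviations they sit from the fleet mean; the same trick works on failure rates.

    import statistics

    def outliers(latency_by_backend, z_cutoff=3.0):
        """Return backends whose typical latency sits more than z_cutoff
        standard deviations from the fleet mean: quarantine candidates."""
        values = list(latency_by_backend.values())
        mean = statistics.mean(values)
        stdev = statistics.stdev(values)
        return [host for host, v in latency_by_backend.items()
                if stdev and abs(v - mean) / stdev > z_cutoff]

    # 99 healthy machines around 750 msec, plus the "future alien technology" box.
    fleet = {f"backend-{i:03d}": 700 + (i % 7) * 15 for i in range(1, 100)}
    fleet["backend-100"] = 15
    print(outliers(fleet))    # ['backend-100']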

This is another one of those cases where a reasonable response ("let's track backend load and route requests accordingly") to a reasonable problem ("requests aren't equal and servers don't all behave identically") can lead to unwanted consequences ("one bad machine pulls in far too many requests like a giant sucking vortex").

Think back to systems you may have used or even built. Do they have one of these monsters lurking within? Can you prove it with data?

Pleasant dreams, sysadmins.