
Saturday, January 16, 2021

Failing to make progress under excess request load

There are some interesting ways for systems to fail.

One time I heard about a couple of people who rushed out some code. They somehow managed to go from "hey, I got an idea" to "it's running everywhere" in something like 15 or 20 minutes. This was almost impressive, except it meant bypassing any notion of testing, soaking, canarying, or whatever else you might want to call it. It might have been acceptable if something had already been on fire, but in this case, nothing was.

Oh, did I mention this was a production service with actual people in the field relying on it to get things done? Yeah, that's kind of important. There were actual stakes in this game. It wasn't the sort of thing where nobody would notice. People noticed.

Not long after their push finished, sure enough, everything did catch on fire, and the whole service went down. The people responsible eventually localized it to this one service which was running out of memory in some of that brand new code. It was either being killed by the orchestration system or just plain crashing (that is, killing itself). Every time one died, the load went up on the others.

Unsurprisingly, they reverted the code to the previous version.

The revert happened... but things didn't come back up. Why not? Well, while they had been down, traffic had been piling up. All of those clients still needed to get in and do whatever this service did for them. Instead of just the steady-state load that moment would normally bring, they had all of that *plus* everything that hadn't been handled during the outage. Also, someone had decided that having the internal systems generate retries was a good thing, so the load was actually being multiplied: every request from the outside world might turn into three to five on the inside.
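
To put a shape on that multiplier, here's a minimal sketch (in Go, with invented names and numbers, nothing from the actual system) of the kind of internal retry wrapper that does it. It's harmless when the backend is healthy and brutal when it isn't:

    package internalclient

    import (
        "context"
        "time"
    )

    const maxAttempts = 4 // one try plus three retries: roughly that 3x-5x multiplier

    // callBackend is a hypothetical internal client wrapper. When the backend is
    // healthy, the first attempt usually succeeds and the retries never happen.
    // During an outage, every attempt fails, so each request from the outside
    // world turns into maxAttempts requests landing on the overloaded service.
    func callBackend(ctx context.Context, do func(context.Context) error) error {
        var err error
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            if err = do(ctx); err == nil {
                return nil
            }
            time.Sleep(time.Duration(attempt) * 100 * time.Millisecond) // crude backoff
        }
        return err
    }

A retry budget or a circuit breaker would cap that multiplier; this loop has neither, which is the point.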

This was a problem because their service had no way to keep working on some requests while sloughing off the rest. Every incoming connection was treated as being just as valuable as the ones already in flight. They accept()ed them, proceeded to read from them, and just got busier, and busier, and busier. That overhead made it impossible to finish the earlier connections which had arrived before the load shot up.

If the service had just refused to accept new requests past a certain point, it still would have looked like an outage for some users, but not all of them. Assuming they could in fact clear out the request backlog faster than new ones would arrive, then they would eventually catch up organically. That's a system which fundamentally returns to a stable state all by itself: it doesn't fall over while handling the existing work, and it gracefully sheds the rest without suffering.
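
As a rough illustration of that property, here's a minimal sketch in Go, with made-up names and a made-up cap: a plain TCP accept loop that simply stops accepting once it has as many connections in flight as it can serve, instead of greedily accept()ing everything the way the real service did. The excess waits in the kernel's listen backlog, and once that fills, new connections get turned away out there rather than in here.

    package main

    import (
        "log"
        "net"
    )

    const maxInFlight = 512 // made-up cap: how much work we'll take on at once

    func main() {
        ln, err := net.Listen("tcp", ":8080")
        if err != nil {
            log.Fatal(err)
        }

        slots := make(chan struct{}, maxInFlight) // counting semaphore

        for {
            slots <- struct{}{} // blocks at the cap: we just stop accepting for a while
            conn, err := ln.Accept()
            if err != nil {
                <-slots
                continue
            }
            go func(c net.Conn) {
                defer func() { <-slots }() // free the slot when this one is done
                defer c.Close()
                handle(c)
            }(conn)
        }
    }

    // handle stands in for whatever the real service actually does per connection.
    func handle(c net.Conn) {}

The exact mechanism matters less than the property: the work already accepted keeps moving, and the rest gets pushed back toward the clients instead of being absorbed until the process falls over.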

Since the service couldn't do any of that on its own, they had to resort to terrible manual hacks to drop some percentage of the requests somewhere upstream from the actual service. Somewhere between the clients and the service, something was configured to just throw away a surprisingly large number of requests.
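
If you had to improvise that kind of hack in code rather than in some proxy's config, it might look roughly like this hypothetical HTTP middleware, which throws away a fixed fraction of requests before they ever reach the struggling service. The fraction and the names are invented for illustration; the real thing was whatever knob happened to exist upstream.

    package main

    import (
        "log"
        "math/rand"
        "net/http"
    )

    const dropFraction = 0.7 // the crude knob: what share of requests to throw away

    // shed rejects a fixed fraction of requests up front so the service behind it
    // only sees load it can actually handle.
    func shed(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if rand.Float64() < dropFraction {
                w.Header().Set("Retry-After", "30") // tell clients when to come back
                http.Error(w, "shedding load", http.StatusServiceUnavailable)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok\n")) // stand-in for the real service
        })
        log.Fatal(http.ListenAndServe(":8080", shed(backend)))
    }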

This finally reduced the actual load on the service to something it could handle by itself, and that let it start making progress on the backlog. Eventually, it caught up, and they were able to remove the upstream limiter.

If this picture isn't particularly clear, think of it another way. Maybe you're a worker somewhere, and your boss has you looking at a "to-do" list - bugs, tickets, JIRA pain and suffering, whatever. Normally, you pick one off, do it, and move on: open it in a tab, do whatever, then close the tab.

Let's say you're actually a little more capable than that and can do more than one at a time. So you might open several tabs, work on them in short pieces, and then close each tab when that bit of work is done.

If your boss suddenly adds 3000 items to your list, are you going to open 3000 tabs in your browser? I sure hope not. A reasonable person opens only as many as they can actually handle at once.
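
In code terms, the reasonable worker is just a fixed-size worker pool: no matter how long the list gets, only a handful of items are ever open at once. A minimal sketch, with the pool size and the item names made up:

    package main

    import (
        "fmt"
        "sync"
    )

    const maxOpenTabs = 4 // made-up number: how many items we work on at once

    func main() {
        todo := make(chan string)
        var wg sync.WaitGroup

        // A fixed number of workers: the "tabs" never multiply past the cap.
        for i := 0; i < maxOpenTabs; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for item := range todo {
                    fmt.Println("working on", item) // stand-in for the actual work
                }
            }()
        }

        // The boss dumps 3000 items on the list; they wait their turn here
        // instead of all being opened at once.
        for i := 0; i < 3000; i++ {
            todo <- fmt.Sprintf("ticket-%d", i)
        }
        close(todo)
        wg.Wait()
    }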

An unreasonable worker, meanwhile, would in fact pop open all 3000 tabs and promptly run their machine out of resources. They would get no work done, since the browser would lag, or crash, or start killing tabs, or who knows what. Instead of still getting through their usual number of items per unit of time, their throughput would collapse to zero.

That's what happened here: the system couldn't slough off load without being dragged down by that same load, so it never made progress. It never recovered until someone else came to the rescue.

Some weeks later, while reviewing the outage, someone asked a question: doesn't this mean the system is fundamentally unstable, if too many requests can knock it over and it can't recover by itself? There was no good answer.

I assume it could still happen today.