
Thursday, November 14, 2019

Creating the notion of preview and runout modes for a service

Once a company gets to the point where it has hundreds of thousands of machines and complicated turn-up and turn-down processes for collections of those machines, it's easy to get into dependency hell. I've seen this happen, and I came up with some concepts that seemed to make things easier for everyone.

Turn-up and turn-down involve a bunch of different people. You have hardware ops or siteops or similar actually installing racks. You have network people dropping in net devices. There are cabling teams connecting everything up. There are TPMs (technical project managers) trying to juggle priorities, deal with dependencies, and keep everything moving. Then there are the operators on the software side with their runbooks who need to configure everything for the new cluster, building, region, or whatever. All of this and more has to happen before actual end users can "move in" and start making use of all of this infrastructure.

Then, a few years later, after depreciation has done its thing, it all has to happen in reverse order to throw out those machines and recycle the physical space, power, and cooling for the next generation. Somehow, you have to do all of this without blocking other projects, and without taking down the site/service/business.

The team I worked with for a time was responsible for running a bunch of servers which did the Raft/Paxos type thing to maintain quorum while holding onto client data. The service tended to have several instances per region, and its clients had relatively strict latency requirements: they needed instances close by.

But, there was a problem. When a brand new region is being created, there probably aren't enough buildings, clusters, suites, racks, or rows to give a high degree of "spread" for such a service. You see, the usual problem is that you need to have at least three up at all times to maintain quorum, and to do that, you need to avoid "shared fate". You don't put more than one in the same rack. You don't put more than one in the same cluster (and/or pod, if your network is arranged that way). You spread them across power distribution nodes. You try to get them on different legs of the 3-phase power. It just goes on and on like this.
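If you squint, that whole pile of rules reduces to one invariant: no failure domain holds more than one instance. Here's a minimal sketch of that check in Python, with made-up domain names (a real inventory system tracks far more dimensions than this):

    # A minimal sketch of the shared-fate check, with hypothetical
    # failure domains. Real systems track many more dimensions.
    from collections import Counter

    FAILURE_DOMAINS = ("rack", "cluster", "power_node", "power_phase")

    def shared_fate(placements):
        """Return the failure domains where two or more instances collide."""
        bad = []
        for domain in FAILURE_DOMAINS:
            counts = Counter(p[domain] for p in placements)
            if any(n > 1 for n in counts.values()):
                bad.append(domain)
        return bad

    instances = [
        {"rack": "r1", "cluster": "c1", "power_node": "pdu1", "power_phase": "A"},
        {"rack": "r2", "cluster": "c1", "power_node": "pdu2", "power_phase": "B"},
        {"rack": "r3", "cluster": "c2", "power_node": "pdu3", "power_phase": "C"},
    ]
    print(shared_fate(instances))  # ['cluster'] -- two instances share c1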

When your company has just moved into some brand new country and is bringing its first hundred machines online, there is simply no way to solve for that. Then you find out that they need the $new_region instances of your service up so they can start configuring all of the other stuff that depends on it.

What to do, what to do.

For a while, the procedure was just to cram everything into a handful of racks and hope nothing went bad. Of course, nobody usually remembered to go back and spread things out later, so this "temporary" (ha!) situation stayed around WAY too long, and of course caused all kinds of mayhem later. A year or two down the road, you'd say "why are all of these things in the same building when there are three buildings now", and the answer would be "because nobody came back to it".

I decided this needed to change, and pitched the notion of a new level of service we called "preview" or "preview mode". In that mode, we ran instances of the service for $new_region, but we didn't run them from that region. Instead, we "reached in" over the backbone from the next-nearest site.

By calling it a "preview", we could also relax our latency guarantees. We'd do our best, but since we weren't physically in-region yet, there was only so much we could do thanks to the laws of physics. Customers didn't mind. In fact, the whole arrangement made them very happy. The TPMs were overjoyed, and it's not hard to see why.

By disconnecting the appearance of $new_region services from the presence of a working network with actual machines in it, we took a mighty chunk out of their dependency charts. We could create the coverage for a new region in an afternoon, months ahead of the actual turn-up. They no longer needed to worry about us. When the next dependencies were ready to come up, we'd be there for them.
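To show roughly how cheap that was, here's a sketch of the idea as config. Every name and field here is invented for illustration; the point is that a preview region is nothing but a config entry that resolves to someone else's hosts:

    # Hypothetical config sketch: a "preview" region exists on paper
    # before any machines exist in it, resolving to hosts in the
    # next-nearest site, reached over the backbone.
    HOSTS = {
        "us-east": ["use1-a", "use1-b", "use1-c"],
        "eu-new":  [],  # still being built: no machines yet
    }

    REGIONS = {
        "us-east": {"mode": "live",    "serve_from": "us-east"},
        "eu-new":  {"mode": "preview", "serve_from": "us-east"},
    }

    def endpoints_for(region):
        """Clients ask for their region; a preview region quietly
        resolves to out-of-region hosts."""
        return HOSTS[REGIONS[region]["serve_from"]]

    print(endpoints_for("eu-new"))  # ['use1-a', 'use1-b', 'use1-c']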

There was still the danger of forgetting about this and leaving things in this state for too long, naturally. I expected that the last line of defense would be customers asking why it was slower than every other region, at which point someone would realize they were still technically in preview mode, serving from out-of-region, and would rebalance things. I guess we probably could have added some kind of "throw an alarm if this is still like this by YYYY-MM-DD", and maybe they did, but I moved off the project before that could happen.
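For what it's worth, that date-based alarm could have been nearly trivial. Here's a sketch, with hypothetical fields, of the kind of check that would run from cron or a monitoring system:

    # Sketch of a "still in preview past its deadline?" check.
    # The region names and dates are made up.
    from datetime import date

    PREVIEW_DEADLINES = {"eu-new": date(2020, 3, 1)}  # region -> planned turn-up

    def stale_previews(today=None):
        today = today or date.today()
        return [r for r, d in PREVIEW_DEADLINES.items() if today > d]

    # Page someone if this ever returns a non-empty list.
    print(stale_previews(today=date(2020, 6, 1)))  # ['eu-new']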

So what about the other side of things, when it's time to move OUT of a region? It happens now and then. Companies get a better tax break somewhere else, and the next thing you know, you no longer exist in a certain part of the world. For that, we came up with "runout" mode. In this state, while the region was still up, we'd start replacing in-region hosts with out-of-region hosts. This would naturally raise the latency again, just like in preview mode, and customers were warned about this.
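In the same spirit, here's a hypothetical sketch of a single runout step: swap members one at a time, so the ensemble size (and therefore the quorum math) never changes mid-migration:

    # Rough sketch of one runout step: replace a single in-region
    # member with an out-of-region host, keeping the ensemble the
    # same size throughout. Not a real API; names are invented.
    def runout_step(ensemble, in_region, spare_pool):
        victim = next(h for h in ensemble if h in in_region)
        replacement = spare_pool.pop()
        ensemble[ensemble.index(victim)] = replacement
        return victim, replacement

    ensemble = ["eu-1", "eu-2", "eu-3", "eu-4", "eu-5"]
    in_region = {"eu-1", "eu-2", "eu-3", "eu-4", "eu-5"}
    spares = ["use1-d", "use1-e", "use1-f", "use1-g", "use1-h"]
    while any(h in in_region for h in ensemble):
        runout_step(ensemble, in_region, spares)
    print(ensemble)  # all out-of-region; the eu hosts can be released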

Once we were serving everything from outside the region, we'd release our hosts and would officially be considered "clear". Again, the TPMs rejoiced, since we were no longer in the critical path. They could take care of everyone else without worrying about keeping us around to the bitter end.

Before we made this change, the number of outages caused by turndowns was ridiculous. It seemed like every month or so, someone would run a turndown script, have it fail, "--force" it, and cause an outage. Or they'd decom more than 2 machines, the quorum would die, and data would be lost. Those and worse outages occurred.
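The guard those turndown scripts were missing is almost a one-liner. Assuming five-way ensembles (which "more than 2" implies), a sketch:

    # Sketch of the check a turndown script should refuse to skip:
    # don't take a host down if doing so would leave fewer than a
    # majority of the ensemble alive.
    def safe_to_decom(ensemble_size, already_down, removing=1):
        majority = ensemble_size // 2 + 1
        return ensemble_size - already_down - removing >= majority

    print(safe_to_decom(5, already_down=0))  # True: 4 of 5 left, majority is 3
    print(safe_to_decom(5, already_down=2))  # False: would leave 2 of 5 alive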

I should mention that "runout mode" also has the risk of leaving things around too long. It's possible to end up with service for a region that's long gone, with no more customers left, that's never coming back. Now you're wasting machines providing a service that nobody wants.

That's the sort of thing you should already be monitoring, though. I mean, you don't honestly run a service and then somehow miss that nobody's talking to it any more, right?