Software, technology, sysadmin war stories, and more.
Thursday, April 23, 2020

Releasing software to the fleet far too quickly broke stuff

There were some people at a big company, and they had many computers -- like more computers than you'd normally think of anyone having. They had them in vast fleets spread all over the world.

Some of those people built software for those computers. This was not the OS software like Red Hat or Ubuntu at the bottom of the stack. No, this software sat just above that, and added a bunch of features that the OS could not or would not provide. These were things like configuration management and service discovery at the kind of scale this place needed.

It was this middle stuff which made it possible to run the actual services on top. The services are what provided utility to the customers. That's where the recipes, or cat pictures, or videos, or food orders, or gig worker matching or whatever else would live.

This middle layer of infrastructure tended to live in packages a lot like the ones used for the base OS since it was simpler to distribute and update that way. There was a config management regime on these computers, and they had a scheme which was pretty simple when it came to these infra packages. It went like this:

Wake up once in a while. Look at what version of the package is on the box. Look at what version of the package is available in the "package repo". If that one is newer than the one I have, bring it in and install it.

Every device in the fleet did this every 15 minutes or so. As a result, once you managed to commit a new package to the repo, a very large number of machines would pick up that change within 15 minutes. If you screwed something up, it would trash basically everything in the time it took you to notice and start trying to react. Better still, there was no "panic button" or anything else that would reliably stop the updates from driving every system off a cliff. Once it was committed, you were basically going to watch it happen in slow motion whether you liked it or not.
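To make that concrete, here is a minimal sketch of the "wake up and take whatever is newest" check. It is not the real agent: the function names, the dotted version format, and the three-host fleet are all made up for illustration.

```python
def parse_version(text):
    """Turn a dotted version string like '1.10.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def check_once(installed, available):
    """Return the version a box ends up on after one wakeup.

    If the repo has something newer, take it. There is no staging and
    no way to say "not yet": newer always wins.
    """
    if parse_version(available) > parse_version(installed):
        return available  # a real agent would download and install here
    return installed


# Hypothetical fleet: every box runs the same check on its own timer.
fleet = {"web001": "1.4.0", "web002": "1.4.0", "db001": "1.3.9"}
repo_version = "1.5.0"  # someone just committed this

# One 15-minute cycle later, every box has had its wakeup.
fleet = {host: check_once(ver, repo_version) for host, ver in fleet.items()}
print(fleet)
```

Notice that nothing in the loop looks at which host it is running on. That's the whole problem: the only input is "is there something newer", so a commit is a fleet-wide release.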

Unsurprisingly, this was eventually deemed a problem. Something needed to change. Work commenced on a new way to manage these infrastructure packages, such that they wouldn't all just wake up and hop to the latest version every time one appeared.

Here's how it worked: there would now be a pointer file for every package controlled by this system. That pointer file could contain two versions: there was the old one and then the new one. The pointer file also contained a number which specified which "phase" was active. This was some added magic which let you select groups of machines for an incremental rollout instead of going from 0 to 100% in one step.

Each numbered phase referred to a conditional test in an array. You could use a bunch of techniques to select a subset of the fleet. For example, phase 0 was frequently just a handful of systems, like the personal dev boxes used by the people working on the project. It might actually be a literal thing, like "hostname is one of (a, b, c, d, e)".

If your hostname was one of those 5, it would return true, and if they were at phase 0, it would activate and trigger loading the new version instead of the old version.

Subsequent phases would add on larger and larger groups, like "all developer machines across the company", or "all of the west coast", or "this entire datacenter", or "50% of everything", leading up to the final not-a-numbered-phase "global" which meant "every single machine gets this".
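Here's a rough sketch of how a host might decide between the old and new version from a pointer like that. The field names, the example tests, and the hash-based "50%" rule are my assumptions for illustration, not the actual format. I'm also assuming phases are cumulative, so a host matched by an earlier phase stays on the new version as the phase number climbs.

```python
import zlib

# Hypothetical phase tests, in rollout order. Each one takes a hostname
# and returns True if that host belongs to the phase.
PHASES = [
    lambda host: host in {"a", "b", "c", "d", "e"},     # phase 0: a few dev boxes
    lambda host: host.startswith("dev"),                # phase 1: all dev machines
    lambda host: zlib.crc32(host.encode()) % 100 < 50,  # phase 2: about half of everything
]


def pick_version(pointer, host):
    """Return the version this host should run, given the pointer contents."""
    phase = pointer["phase"]
    if phase == "global":
        return pointer["new"]  # not a numbered phase: everyone gets it
    # Assumed cumulative: match any test up to and including the active phase.
    if any(test(host) for test in PHASES[: phase + 1]):
        return pointer["new"]
    return pointer["old"]


pointer = {"old": "1.4.0", "new": "1.5.0", "phase": 0}
print(pick_version(pointer, "a"))       # 1.5.0
print(pick_version(pointer, "web001"))  # 1.4.0
```

Bumping the phase number is then the only thing you have to touch to widen the rollout, and dropping it back is a cheap way to stop the bleeding.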

Some teams were very happy with this. They set up a bunch of phases and rolled out their releases in steps. These teams tended to catch things before they made it to 100% distribution. They weren't the problem.

The problem was that some teams set this up and then promptly ignored it. It was totally possible to leave the pointer at "phase global", and then just flip the version string from one thing to the next. This would make every single machine wake up and take the new version the next time it ran a config check (every 15 minutes or so).
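To see how big the difference is, here's a toy comparison of a staged pointer and one parked at "global". Everything in it (the 1000-host fleet, the hostnames, the two phase tests, the pointer fields) is invented for the example.

```python
def hosts_upgrading(fleet, pointer, phase_tests):
    """Count the hosts that would take pointer['new'] on their next check."""
    def wants_new(host):
        if pointer["phase"] == "global":
            return True
        # Assumed cumulative phases: any test up to the active one counts.
        return any(test(host) for test in phase_tests[: pointer["phase"] + 1])

    return sum(1 for host in fleet if wants_new(host))


# Hypothetical fleet of 1000 hosts: host0000 .. host0999.
fleet = [f"host{n:04d}" for n in range(1000)]

phase_tests = [
    lambda h: h in {"host0000", "host0001"},  # phase 0: two canaries
    lambda h: h.endswith("0"),                # phase 1: about a tenth of the fleet
]

staged = {"old": "1.4.0", "new": "1.5.0", "phase": 0}
parked = {"old": "1.4.0", "new": "1.5.0", "phase": "global"}

print(hosts_upgrading(fleet, staged, phase_tests))  # 2
print(hosts_upgrading(fleet, parked, phase_tests))  # 1000
```

Same version flip, same commit, and the only thing standing between "two canaries" and "the entire fleet in one check interval" is a field nobody was forced to think about.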

At least one team did exactly this... and went largely unnoticed until one day they somehow came up with a multi-gigabyte package and dropped it on the entire fleet at the same time. THUD.

Every single server had to download this thing (using network bandwidth), write it to disk (consuming disk bandwidth), burn CPU time and memory to decompress it, burn more disk bandwidth to write it out uncompressed, then fling all of the contents to their final destinations (even more disk bandwidth) and update the various databases to say "yes, this package is now here".

Since this happened everywhere more or less at the same time, it meant the entire company's fleet of services all slowed down at once, and a bunch of things sagged under the load. People definitely noticed. It caused an outage. A SEV was opened. It came to review. They had to admit what had happened.

At that point, it was decided that something easier had to be done. The barrier to entry for safe and sane rollouts had to be lowered even more to bring more teams on board. The alternative of letting it happen again and again was unacceptable.

This was one of those times when we needed to build a system that made it stupidly easy to do the right thing, so that it would actually be MORE work to do the wrong thing ... like going 0 to 100 in a single step. Some people obviously couldn't be bothered to step their release along over the course of a couple of days, and we had to find some way to protect the production environment from them.

We wound up building something which did exactly that. I'll tell the story of how that came to be another time.

April 25, 2020: This post has an update.