Rolling out the "push train"
This is another part of my continuing story about doing safe releases of infra "middleware" (above the OS, below the actual user-facing services) on a very large fleet of machines. It started with a tale of people going way too quickly and breaking things. It continued with the early days of using a single-percent-at-once manual process. That manual process worked, but it needed to be automated. This is how we did it.
As mentioned before, I had taken a single package through 100 phases over the course of two weeks. It amounted to running this command 100 times in a loop and sleeping before going again:
software_roll --package=team1-thing1 --phase=n
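Spelled out as a script, it was conceptually something like this; a rough sketch only, in Python for readability, with the sleep interval as a made-up placeholder:

import subprocess
import time

# Step one package through all 100 phases, pausing between bumps.
for phase in range(1, 101):
    subprocess.run(
        ["software_roll", "--package=team1-thing1", f"--phase={phase}"],
        check=True,
    )
    time.sleep(45 * 60)   # placeholder pacing; the real spacing varied by day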
This worked well enough, and I wanted to push more things at the same time with it. I had my own "run everywhere" things which had to go out to the fleet, and having a single offset for everything that was changing was helpful. This way, if anyone saw an anomaly in anything that was being pushed, we could tell right away if it was on a box that had just received the push or not.
Somewhere around here, the notion of it being a train pulling into a station with a bunch of stuff on board came around, and we started calling this the "push train". My script grew, and soon it looked like this:
software_roll --package=team1-thing1 --phase=n
software_roll --package=team2-thing2 --phase=n
software_roll --package=team2-thing2b --phase=n
software_roll --package=team3-thing3 --phase=n
software_roll --package=team4-thing4 --phase=n
software_roll --package=team5-thing5 --phase=n
I should mention that every one of these commands actually did a commit to a source code repository behind the scenes. It amounted to "check it out, flip the number forward, and check it back in". It wasn't particularly quick in the first case, and now I had multiplied it by six.
Considering all of these packages were intended to move in lock-step, we needed to come up with a better approach. The "separate pointers for separate packages" thing just didn't track with the reality of what we wanted out of a release process. The solution we came up with was dubbed the "catalog".
In the above world, every package had a separate file under source control. Recall from last time that each of those files looked like this:
phase: 50
old: 2.0
new: 3.0
The catalog scheme bundled all of that up into a single file, like this:
global_phase: 50
package <
  name: team1-thing1
  old: 2.0
  new: 3.0
>
package <
  name: team2-thing2
  old: 20200401-123456
  new: 20200415-120133
>
[...]
We still had the notion of an "old" and a "new" version, and the "phase" decided which version any given machine would follow. But now, it all came from a single file, and so we could do a single commit and save a whole lot of grief.
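To make the "phase decides which version" bit concrete, here's a minimal sketch, assuming each machine hashes its hostname into one of 100 shards and follows "new" once the global phase passes its shard number (the real shard assignment could easily have worked differently):

import hashlib

def shard_for(hostname: str) -> int:
    # Map a hostname to a stable shard in [0, 100).
    return int(hashlib.sha256(hostname.encode()).hexdigest(), 16) % 100

def version_for(hostname: str, global_phase: int, old: str, new: str) -> str:
    # A machine follows "new" once the global phase has passed its shard.
    return new if shard_for(hostname) < global_phase else old

# At global_phase 50, roughly half the fleet would pick "3.0":
# version_for("web1234.example.com", 50, "2.0", "3.0")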
My script became a lot simpler once this was written. It went back to being a single command, albeit somewhat different:
catalog_roll --catalog=push_train --global_phase=n
Pretty cool, right? We could now single-step our entire "train" the same way we single-stepped individual packages, and they'd all obey.
This had proved pretty useful, and it was time for it to graduate from being a script that one of us would run. We wrote a small program that would sit there and listen for requests on the network using the company's usual internal RPC secret-sauce stuff. If an authorized person sent the right sort of command, it would go and do the phase-twiddling for you. It actually had enough guts to check out the file, flip the number around, and re-commit it for you.
All you had to do was poke it with a very small RPC client. My script changed again.
trainstation_client set_global_phase n
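Under the hood, the phase-twiddling was the same "check it out, flip the number, check it back in" dance as before, just done by the service. A rough sketch, with a made-up "vcs" command and catalog path standing in for the real source control tooling:

import re
import subprocess

CATALOG_PATH = "configs/push_train/catalog"   # illustrative path only

def set_global_phase(new_phase: int, author: str) -> None:
    # Check out the latest copy of the catalog.
    subprocess.run(["vcs", "update", CATALOG_PATH], check=True)
    with open(CATALOG_PATH) as f:
        text = f.read()
    # Flip the number forward...
    text = re.sub(r"^global_phase: \d+", f"global_phase: {new_phase}",
                  text, count=1, flags=re.M)
    with open(CATALOG_PATH, "w") as f:
        f.write(text)
    # ... and check it back in.
    subprocess.run(
        ["vcs", "commit", "-m",
         f"push train: global_phase -> {new_phase} (requested by {author})",
         CATALOG_PATH],
        check=True,
    )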
So what, you think. You went from one command to one command, only you added a service. While this is true, it's more important for what came next. We knew this was just one small step along the way to a better life, and were not ashamed to take it incrementally in order to solve the actual problems we'd hit instead of the ones we somehow guessed we would encounter.
The next step was to write another small tool which would do all of the math for you. You'd say "tool, take us to 25% today", and it would try to do that. It would look at the day of the week and time of day, and would try to come up with a plan that fit within the business rules. And yes, there were plenty of rules intended to keep people from making a mistake and screwing up the fleet.
Among other things, the program had...
A "no pushes before this hour" setting. The idea was that we wanted pushes to happen when the majority of engineers were at work, and that meant not starting before 9 AM on the west coast of the US/Canada. We knew this wouldn't last forever, but it was a good place to start.
A "no pushes after this hour" setting. Like the above, we assumed that most of the eng part of the company would be available until 6 PM on the west coast, and then it would get more and more sketchy as people went home, had dinner, or generally checked out mentally.
A "don't bump the global phase/% of push faster than this rate" setting. I mentioned in another post that the actual machines looking at these configs would run their checks once every 15 minutes. If we flipped at that exact rate, then one shard (roughly 1% of the fleet) would always be doing something. The minimum setting was around 10-15 minutes.
A "don't bump the global phase/% of push slower than this rate" setting. Some days, we had very little to do. We might have a whole nine hour window (9 AM to 6 PM) and only 10 bumps to do. The naive implementation would say "okay, we're going to sleep 54 minutes". This setting capped it at whatever we wanted. We typically used 45 minutes.
A "don't push on Friday" setting. We had plenty of time using Monday through Thursday given that it was a two week push cycle. There was no reason to use Friday by default. The schedule had been created with lots of slop time. Normally, we'd take Tue/Wed/Thu of week 1 and Mon/Tue/Wed of week 2. This actually left us two whole regular days to "catch up" if needed: the Thursday of week 2 and the Monday of week 3!
We'd use those days if we had skipped one due to a company holiday or something else which prevented a push from getting far enough. We could still go on Friday, but by default we didn't need it.
A "don't push on Saturday or Sunday" setting. This was like the Friday thing, only one more level of paranoia.
All of these were just defaults, and you could change any of them. This new tool, the "pacer", would let you set them to whatever you wanted. It would even do you a solid and tell you which flag to flip if you hit one of the limits.
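To give a feel for it, the defaults might be sketched as a config object like this; the names are invented, but the values match the rules above:

import datetime
from dataclasses import dataclass

@dataclass
class PacerLimits:
    earliest_start: datetime.time = datetime.time(9, 0)    # no pushes before 9 AM Pacific
    latest_finish: datetime.time = datetime.time(18, 0)    # no pushes after 6 PM Pacific
    min_minutes_per_bump: int = 10                         # don't bump faster than this
    max_minutes_per_bump: int = 45                         # ... or dawdle longer than this
    push_days: tuple = (0, 1, 2, 3)                        # Mon-Thu (Python weekday numbers)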
This worked by doing all of the calculations up front. If you told the pacer that you wanted to jump forward 40% today, but you only had 3 hours left in the push day, it would fail fast. Let's look at the math to find out why.
40 bumps to do. 3 hours of usable time is 180 minutes. 180/40 is one bump every 4.5 minutes. 4.5 minutes is WAY faster than the minimum delay setting, which was usually around 10 minutes.
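That projection is simple enough to sketch (illustrative only, not the pacer's actual code):

def check_plan(bumps_needed: int, minutes_left: int,
               min_delay_min: int = 10, max_delay_min: int = 45) -> str:
    # Project the whole day forward before touching anything, and fail fast.
    if bumps_needed <= 0:
        return "nothing to do"
    minutes_per_bump = minutes_left / bumps_needed
    if minutes_per_bump < min_delay_min:
        return ("this won't work: %.1f min/bump is under the %d-minute minimum; "
                "pick a lower goal, or explicitly lower the minimum delay"
                % (minutes_per_bump, min_delay_min))
    return "ok: one bump every %.0f minutes" % min(minutes_per_bump, max_delay_min)

# check_plan(40, 180) -> "this won't work: 4.5 min/bump is under the 10-minute minimum; ..."
# check_plan(10, 180) -> "ok: one bump every 18 minutes"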
If you asked it to do that, it would immediately project forward and say "this won't work". It told you that you could do several things to work around it. You could pick a lower goal and do fewer bumps, and then pick up the next day to do some more.
10 bumps in 180 minutes would work fine, since that's a pace of 18 minutes per bump. You could push it to the wall and do 18 bumps in 180 minutes, but you'd run the risk of things slipping a little and then running out of time.
You could also lower the minimum push delay below 10 minutes. It wasn't recommended, but the program told you how to do it if you really wanted to. Unsurprisingly, most people opted to just scale back their goal for that day in order to stay within a reasonable pace.
Similarly, if you asked it to go on a Friday, Saturday or Sunday, it would also say "this won't work because..." and would tell you what flags to flip if you really wanted to go. You could totally make it do something that wasn't so great for reliability if the situation called for it, but you had to be explicit about it. It kept you from accidentally bumbling into a bad spot.
This started as a simple program that you'd just call in a loop, and it would bump the global phase when enough time had passed, assuming everything else was okay. Eventually, that loop moved into the program itself, and it became a thing you'd start at the beginning of the day and just let do its thing while you got on with the rest of your job.
Unsurprisingly, at that point, it became a perfect candidate to also become a service, and so it left the realm of being run on our workstations and became yet another thing running on some shared machine somewhere. It got a proper name, and now you could shoot RPCs at it to ask it to do things, or ask it how it was doing. There was a small CLI tool written to make these requests for you.
Of course, this opened the door to a bunch of new situations. For one thing, there was now no way to "just ^C it" to stop a bad push. But the good thing was that you no longer had to track down whoever was running the script to actually DO the ^C. (Or find their machine and hop into it as root. Whatever.)
We exposed this as a "panic button" on the brand new status web page. The way it worked was simple enough: the status page showed the global phase (push %), all of the packages on board, and what versions they were using for "old" and "new". Then there was a giant [STOP] button up top. If you hit it, it gave you a popup that asked for a reason (though it wouldn't insist on one) and then gave you the final [submit] to kick it off.
When you hit that button, it would immediately tell the pacer to freeze wherever it was. It would still wake up once a minute to see if it was time to work, but it would then give up when it saw the stop bit was set.
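In sketch form, the pacer's main loop amounted to something like this; the state object and its methods are invented for illustration:

import time

def pacer_loop(state) -> None:
    # Wake up once a minute; stay frozen as long as the stop bit is set.
    while state.current_phase < state.target_phase:
        time.sleep(60)
        if state.stop_requested:        # set by the [STOP] button
            continue
        if state.due_for_next_bump():   # pacing and business rules all pass
            state.bump_global_phase()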
That button also dispatched a page to whoever was on call, because, well, they were going to need to get involved eventually anyway. While anyone (and I mean anyone) could hit that endpoint to stop the "train", only one of the people on the admin list could set it to "run" again.
We taught people that it was okay to hit the button if they suspected a problem. If you even thought the push was causing trouble, hit it, and we'll figure it out while it's paused. We'd much rather have a false alarm than the usual problem where people are afraid to stop a push while more and more things are slowly catching fire. (After you've seen it happen a few times, you'd design your system this way too.)
People took us seriously. They pushed that button and we stopped. Many times, problems with a release were caught at a low percentage of the fleet. Things that previously would have gotten everywhere and poisoned a whole subsystem were now constrained to a subset and the rest of the machines were able to keep delivering results. This is exactly what we wanted.
If you're wondering what happened when someone had a bad release on the push train, there's one more part I haven't described yet: exceptions. We had added an extra optional field to the catalog design which allowed packages to have their own "override" phase. This way, we could control them independently of the global setting for exceptional circumstances.
Here's how it worked. Let's say it was day three of the push, where we start the day at 10% and end it at 25%. We get to 12% and someone goes STOP!!! and so we do. While we're stopped, they verify that their new release is in fact causing a problem (by rolling a new machine back, or rolling an old machine forward), and need to figure out what to do next. They have different options.
Meanwhile, the whole train is stopped, and the other customers are twiddling their thumbs and tapping their feet. They want to know when it'll get going again, since THEIR code isn't the problem.
This is when we'd "freeze" the troubled package in place by adding an override. We'd set them to the same number that was in the global setting.
global_phase: 12
package <
  name: foobar
  old: 2.0
  new: 3.0
  override_phase: 12
>
Then, we'd resume the push. They'd stay at 12%, and the rest of things would progress on to 13%, then 14%, and so on up the line as you would expect. This took the immediate pressure off the team which owned the bad package and let them figure out what would happen next.
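Resolving the phase for any given package then comes down to one rule; a tiny sketch using the field names from the catalog above:

def effective_phase(global_phase: int, override_phase=None) -> int:
    # A package with an override ignores the global phase entirely.
    return override_phase if override_phase is not None else global_phase

# With foobar frozen at 12 while the train moves on to 14:
# effective_phase(14, 12) == 12, while everyone else gets effective_phase(14) == 14.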
The usual case was "please back us out". They wanted to abandon the new ("3.0") deployment and wanted to go back to having everyone on the old ("2.0") version. Initially, this involved us running another series of step commands by hand with delays in between. After we had done this a few times, it turned into an "exception pacer" tool which would do it for us.
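The back-out itself was just the push in reverse; a sketch of the idea, with the catalog-editing callback standing in for whatever actually rewrote the file:

import time
from typing import Callable

def roll_back(package: str, start_phase: int,
              set_override_phase: Callable[[str, int], None],
              delay_minutes: int = 15) -> None:
    # Step the troubled package's override down one notch at a time until it hits 0,
    # pausing between steps just like the forward pacer would.
    for phase in range(start_phase - 1, -1, -1):
        set_override_phase(package, phase)
        time.sleep(delay_minutes * 60)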
The "epacer" had similar limits on how fast it would do things, and it would safely bring them from 12% down to 0% while the rest of the push continued. When it finally reached 0, we'd flip the config around to remove 3.0 from the catalog completely.
That is, it would go from this...
package <
  name: foobar
  old: 2.0
  new: 3.0
  override_phase: 0
>
... to this ...
package <
  name: foobar
  old: 2.0
  new: 2.0
  override_phase: 0
>
... to this:
package <
  name: foobar
  old: 2.0
  new: 2.0
>
This was a deliberate two-update process intended to avoid the kind of "glitches" you learn about in the world of electrical engineering. Even though the updates should have been atomic and we should have been able to flip two fields at a time, this was easier to reason about and safer in the long run. Also, it was work the computers did, not us, so it didn't cause us any particular drama beyond the work required to write the code the first time.
Usually, teams would sit out the rest of the push and would try again when the cycle began again. We would ask them to go through the usual "open a SEV and figure out what went wrong" process. The idea was that while we'd help you out and keep everyone stable and happy, you couldn't just keep having failed releases. Any team that kept needing to roll back releases would eventually bring the "Eye of Sauron" upon it in the form of senior engineers and upper management looking in to find out why they couldn't roll out halfway decent code. This didn't happen often, but the implication that it could happen tended to keep people honest and caring about quality.
Naturally there was more to this system, but the details are of diminishing interest to anyone reading about it now. We had changed the way people looked at how releases were done for infra-level plumbing code, and we stopped a great many nasty things before they could take the whole site down.
The best part was that this was done by a bunch of engineers who wanted to solve a problem. There was no manager leaning on us to make it happen. Nobody was screaming at teams to make them get on board. Indeed, other teams tended to hear through the grapevine that there was something that would take over their push and would deliver them to 100% of the fleet safely and sanely and with tons of oversight. All they had to do was hand over a release candidate version string before the push started, and it would "board the train" and go out shortly thereafter.
Was two weeks a long time? Of course it was. But, this was middleware plumbing stuff, not the actual services and not the OS. Two weeks was actually really fast for some of these teams. The teams who were already operating at faster rates didn't need us. We were all about picking up the very bottom tiers to get them to a place where they were part of the solution and not part of the problem.
In that sense, we succeeded. Had we kept at it, we intended to add a faster cadence which would deliver to 100% in just a week. We also intended to have a fully continuous push which would step its way up from 0% to 100%, sleep a little bit, and then grab a fresh set of packages and go again, all without human involvement.