Moving fleet software releases to a massively sharded scheme
This is a continuation of a story about releasing software from a couple of days ago. There, we set the stage for people first moving from the notion of "push everywhere at once" to "push in phases". Still, that wasn't enough to bring some people on board, and so more stuff had to be built.
I should mention there were both extremes here. The teams I mentioned previously would go from zero to 100% in a single shot. But, there were other teams which were scared silly of every single release, and tended to wait as long as possible between them. We're talking seven or eight months here -- complete and absolute insanity. Then, when they would finally push, they'd usually break something because far too much had changed in between.
As you might guess, the tendency for a release to break things made them release far less often, and the long delay between releases meant more could break, and usually did. It was a vicious cycle of pain and suffering.
Something had to be done. I decided to become the lightning rod for all badness for their next release, and came up with a new way to do things. We were going to take the "phase" system from before and expand it greatly. Whereas this group used to go "our devs, all devs, all frontend boxes (!!!), everything", I threw that out and changed the game. Now, we would have 100 phases... made up of 100 shards.
What's a shard? Well, in this case, it was an attempt at spreading the very large number of machines into kinda-sorta-balanced groups. It worked in a consistent way by being based on the hostname: take it, hash it, squeeze it down to a number in the range 0-99, and there you go.
(Side note: I did not invent this, but rather found it there and decided to use it this way. Yes, this will never have exactly 1% of all machines at any given time, since things are constantly changing. In reality, any given "1%" shard had between 0.8% and 1.2% of the total fleet. This was acceptable for our purposes. Incidentally, if you are THE ONE and can come up with a totally consistent sharding system that also keeps things balanced, what are you doing here? Go get a patent for that thing and go wield it in court like a good little vulture.)
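That hostname-to-shard mapping can be sketched in a few lines. This is illustrative only: the post doesn't name the actual hash function, so `cksum` stands in here for whatever stable hash was really used.

```shell
# Hash a hostname down to a shard number in the range 0-99.
# Any stable hash works; the only requirement is that a given
# hostname always lands in the same shard.
shard_for_host() {
  printf '%s' "$1" | cksum | awk '{ print $1 % 100 }'
}
```

Run it twice with the same hostname and you get the same shard back, which is what makes a machine's phase membership stable for its whole life.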
So, yeah, now we had 100 phases with 100 shards. It worked like this: during the lifetime of a machine, it would only ever hash into a single shard, and that mapped straight across to that same phase. That is, the list of phases was basically this:
[ machine.shard == 0, machine.shard == 1, machine.shard == 2, machine.shard == 3, ... ]
... you get the idea.
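In code form, a sketch of the check each machine effectively performed. The exact semantics are assumed here: at phase N, shards 0 through N-1 are enabled, so phase 0 enables nobody and phase 100 enables everybody.

```shell
# should_upgrade PHASE SHARD -> exits 0 (yes) if this machine's shard
# falls under the current phase, nonzero (no) otherwise.
should_upgrade() {
  [ "$2" -lt "$1" ]
}
```

So at phase 25, a machine in shard 10 upgrades, while one in shard 25 keeps waiting until the phase hits 26.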
I said, team, I will do your release. I will watch for breakage. If something goes wrong, I will stop at whatever point it got to and we can decide what to do from there. If it breaks stuff, people can yell at me instead of you. I will be your air cover.
And then, I started turning the knob. It was a Tuesday, and we went to 1%... and sat there. I left it at 1% for 24 hours, just to see what would happen. It was trivial to find out which shard a machine was in, so if we started getting trouble reports, we could easily tell if it was related to the new infra middleware stuff or not.
The next day, I took it to 10%... one percent at a time. Since a typical work day might be eight to ten hours long, there was lots of time to do this. I just manually ran a command to set the phase to 2 and waited. Then I came back and set it to 3, and waited some more. This kept on going the whole day.
Was this a lot of manual garbage? Of course it was. But, it was working. Small parts of the fleet were upgrading on command, and we didn't have everything dropping out at the same time to pick up our new packages. Better still, any new behavior, bad or good, was constrained to an easily-identified cohort of the fleet.
We could even take our usual metrics and constrain them by shard, so we could compare the "still on old version" set to the "just got the new version" set. We could see if the numbers were changing in a good way or bad way, or if they were just doing the same thing.
Day three of the push went about the same. This time, I went to 25% (from the 10% reached on Wednesday). Since this meant slightly more "bumps", it had to happen somewhat faster -- the delay between changes wasn't as long. I think it was around this time that I switched to a small dumb shell script to do the work for me.
Essentially, at the beginning of the day, I figured out how much time was left during my notion of "business hours", divided it by the number of steps, and got a "pace". This was simply how many minutes to wait between commands. That turned into a "sleep" call in the script, and the script was told to just iterate from 11 to 25, sleeping that long in between.
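That pacing calculation can be sketched like this. The loop in the comments is illustrative only, and `bump_phase` is a hypothetical stand-in for whatever command actually set the phase.

```shell
# Minutes to wait between phase bumps, given the first and last phase
# to hit today and the number of business hours remaining.
pace_minutes() {
  start=$1 end=$2 hours=$3
  echo $(( hours * 60 / (end - start + 1) ))
}

# The day-three loop, roughly: 15 steps spread over 8 hours works out
# to a bump every 32 minutes.
#   pace=$(pace_minutes 11 25 8)
#   for p in $(seq 11 25); do
#     bump_phase "$p"          # hypothetical command
#     sleep $(( pace * 60 ))
#   done
```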
This went just fine, and Thursday ended at 25%.
The next day was Friday, and so, nothing happened. This was set up to be a two-week push cycle, and there was absolutely no reason to change things on a Friday, never mind the weekend itself. There would be plenty of time to finish it using the rest of the week when everyone was at work and not thinking of the weekend.
That weekend was uneventful. One in four machines in production had the new code on them, and things happened to be fine. Even if something bad had happened, fully 75% of the fleet was still on the old code, and so it probably would have survived.
Indeed, on Monday morning, I started it again, this time going from 25% to 50%. This was even faster than the 10-to-25 step from Thursday, so the delay was even shorter. My shell script did the work as before, and it was thoroughly boring.
Tuesday morning marked the beginning of the second week of this push cycle, and with it, the day we went the absolute fastest. On this day, I took the push from 50% to 90%. 40 steps in roughly 10 hours equals a "bump" in phase around every 15 minutes. Recall from last time that the machines ran their checks to see "should I upgrade things" every 15 minutes. By bumping the phase at the same rate, this meant that we might actually have multiple phases of machines "in flight" at once.
That is, starting at 50%, going to 51%, then 52%, and then 53% could all happen in about thirty minutes: we'd go to 51% at 9 AM, "dwell" 15 minutes, go to 52% at 9:15 AM, wait another 15 minutes, then go to 53% at 9:30 AM. Depending on how the hosts in those shards were aligned with the clock, we might have had some from both 51 and 52 moving at the same time, or 52 and 53.
This was intentional. By this point, the code had run for a week. It was already on half of a very large fleet. Having closer to 2% in flight should not have been a problem, and indeed, it wasn't.
The second Tuesday ended with 90% of the fleet sitting on the new version, with 10% of the final "holdouts" still running the old one.
The second Wednesday is when I came for that last 10%. This was a nice slow push since the phase only had to be bumped ten times. It was around this point that I actually put a cap on the sleep times so it wouldn't take all day just because it could. After all, people were waiting around for this thing to be done so they could exhale again.
Once it got to 100%, it was time to flip the versions around. During this push, there had been an "old" version and a "new" version, and they were different. We'll say that "old" was set to 2.0 and "new" was set to 3.0.
Since we were at 100%, everyone was looking at the "new" field, and it was now safe to change the "old" field. I called this a "promotion", and it's just a single operation to copy the new version over top of the old.
That is, in one commit, we went from this:
phase: 100
old: 2.0
new: 3.0
... to this:
phase: 100
old: 3.0
new: 3.0
With this done, the "phase" now had no meaning. ANY value of it would yield the same results: you'd either get "old" ... and get 3.0, or you'd get "new" ... and get 3.0. This meant it was now safe to reset the phase back to 0 to prepare for the next push.
So, in a second commit, we went from this:
phase: 100
old: 3.0
new: 3.0
... to this:
phase: 0
old: 3.0
new: 3.0
That's where it sat until it was time to start the next push. While sitting at phase 0, it's safe to change the "new" field since nobody is looking at it.
This was key to the whole operation: you never changed a version field while someone was referencing it. If you did, everyone looking at it would immediately jump to the new version. That would then violate the intent of the new push design, which was "we move the pointer one percent at a time".
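To make that invariant concrete, here's a small sketch. The selection rule is assumed (a machine whose shard is below the current phase reads "new"; everyone else reads "old"), and it shows why the reset commit is safe: once old equals new, the phase value is irrelevant.

```shell
# Assumed selection rule: shard below the current phase gets "new",
# everyone else gets "old".
version_for() {  # version_for PHASE SHARD OLD NEW
  if [ "$2" -lt "$1" ]; then echo "$4"; else echo "$3"; fi
}

# After the promotion commit (old: 3.0, new: 3.0), every shard gets
# 3.0 no matter what the phase is, so resetting the phase to 0 is a
# no-op from the machines' point of view.  Prints "3.0" every time.
for phase in 0 50 100; do
  for shard in 0 42 99; do
    version_for "$phase" "$shard" 3.0 3.0
  done
done
```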
In two weeks, we had managed to safely push out something that had been scaring the team for quite a while. Nobody yelled at them. Indeed, nobody on the team had to remember to "bump it to the next phase". That was all on me.
The process had proven itself solid. Now it was time to start automating it. After all, there was no way I was going to sit there running commands by hand all day every day for two weeks.
I'll continue the story of how this evolved into an actual service another time.
April 29, 2020: This post has an update.