Software, technology, sysadmin war stories, and more.
Saturday, May 2, 2015

The dangers of resetting everything at once

I want to describe a scenario I've seen too many times before, and I know I will see again.

Let's say you have a bunch of systems running approximately the same software. For the sake of this example, say they are all running Linux. We'll also assume they have a bug which makes the machine lock up after 300 days of uptime. Nobody knows about this yet, but the bug exists.

(Incidentally, Linux boxes have had at least one bug where things went stupid after a certain amount of time, like 208 days, but this isn't about that. This also isn't about Win95 and 49.7 days of uptime. Stay with me here. Focus.)

One day, there's a big announcement about the problem. It's all over the news in big headlines: "Linux boxes die after exactly 300 days uptime". People see it and react to it in different ways. One of those reactions leads to the bad scenario.

There's one sort of person who will see this and will take it upon themselves to reboot all of the machines right away. This way, by their logic, they "just bought the company 300 days of trouble-free operation". That's somewhat true: any host which was approaching 300 is now reset to 0 and won't trip this bug any time soon, but that's the end of the good news.

The bad news is that this person bought the company exactly 300 days of operation, and it's all synced up across the fleet. Whether this becomes an actual problem depends on what happens next.

If all of the systems are then patched and have the root cause fixed, then everything is fine. If the systems wind up getting put on some kind of "scheduled reboot" list, then it's goofy, but it winds up okay.

No, the problem is when that's the only thing that happens, and then nothing else changes after that. 300 days will elapse, and then all of the hosts will come down at the same time. So much for redundancy!

Even if most of the hosts are fixed before then, if even one group of them is left out, then whatever service is run by that group will go down.

So here's the trick: any time you see an announcement on date X of something bad that happens after item Y has been up for more than Z days, calculate what X + Z is and make a note in your calendar. That's the first possible date you should see a cluster of events beginning. It'll actually drag on for a few days past that point since not everyone gets the news at the same time, and those who do get the news don't necessarily do the "reboot the fleet" thing right away, either. It might take a bit.
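The calendar math is easy to script if you'd rather not count days by hand. Here's a minimal Python sketch. The function name and the January 1 announcement date are made up for illustration, and the 300-day limit is the hypothetical bug from earlier.

```python
from datetime import date, timedelta


def first_cluster_date(announced: date, uptime_limit_days: int) -> date:
    """Earliest date a host rebooted on announcement day would hit the limit."""
    return announced + timedelta(days=uptime_limit_days)


if __name__ == "__main__":
    # Hypothetical: the 300-day bug is announced on January 1, 2015.
    print(first_cluster_date(date(2015, 1, 1), 300))  # 2015-10-28
```

Remember that this only gives the start of the window. Hosts rebooted a few days after the announcement will fall over a few days after that date.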

There's a variant on this, too. Let's say there's an existing system which always gets rebooted or reset at a given interval. SSL certificates are typically issued for terms of a year or two. Well, a little over a year ago, Heartbleed was the latest bit of online buffoonery, and everyone who was affected had to do two things: patch their OpenSSL code (or chuck it entirely...) and then re-issue their certs. This reset a bunch of certificates to having April expiration dates.

Sure enough, April just rolled around again, and a bunch of sites all had certs expire and had outages stemming from that. It's interesting to see that you can sometimes tell how quickly someone heard about and acted on the Heartbleed news based on the order in which their certs expired.

Now, not everyone was affected by this clustering effect. Some folks probably managed to keep their old expiration date and just replaced the cert. Others probably saw this coming and have taken steps to spread out their expiration dates to get rid of the "April hotspot".
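For what it's worth, here is one way the "spread them out" idea might look, as a rough Python sketch. Everything in it is an assumption for illustration: the function name, the example hostnames, the April 8 issue date, and the 365-day base term with a 90-day spread. Whether you can actually pick a validity period like this depends on whoever issues your certs. If you can't, the same offsets could be used to schedule early renewals instead.

```python
from datetime import date, timedelta


def staggered_expirations(names, issued, base_days=365, spread_days=90):
    """Map each name to an expiration date, spaced evenly across spread_days."""
    count = len(names)
    plan = {}
    for index, name in enumerate(sorted(names)):
        # Integer offsets from 0 up to (but not including) spread_days.
        offset = (index * spread_days) // count
        plan[name] = issued + timedelta(days=base_days + offset)
    return plan


if __name__ == "__main__":
    # Hypothetical hosts, all re-issued on the same day.
    hosts = ["www.example.com", "mail.example.com", "api.example.com"]
    for name, expires in staggered_expirations(hosts, date(2014, 4, 8)).items():
        print(name, expires)
```

The point is just that no two names share an expiration date, so a missed renewal takes out one thing instead of everything.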

Still, I would expect to see ripples of this every April for quite some time to come.

Likewise, what's May 1, 2015 + 248 days? It looks like January 4, 2016.

Be careful out there.