Infinite loops and doomed machines
Remember that cascading failure post from a couple of weeks ago? I have another story about that same pseudo-database making life interesting. It has to do with the ability for nodes to have "parents" and "children".
It seems that one fine afternoon, someone was messing around with this system, and they were adding nodes, deleting other ones, and generally changing the relationships between them. Somehow, they managed to violate a certain invariant, which is that you shall not introduce a loop into the graph.
Yep, that's right. Somehow, at least one node in the system became its own ancestor. I don't think it was a direct "I am a parent of myself"/"I am my own child" relationship, but rather something that happened through some number of intermediate links.
Exactly how it happened isn't super important, because once it did, things got rather exciting. There was this helper program which was supposed to pay attention to changes like this and then keep a copy of the updated graph around for anyone who might want to know about it. It tended to use a fair bit of memory to keep this stuff "hot".
When the infinite loop was introduced, it apparently went off the deep end and started allocating memory like it was going out of style. Since the loop never ended, the chain of allocations never ended, either. The program got bigger and bigger, and eventually brought the box down.
Of course, since the same program was running with the same bad data on a whole bunch of identical boxes, it brought all of them down, too. Commands were sent to reboot them, and some of those commands actually worked, but the systems would just come back up and restart the helper. The helper would load its saved state and get itself right back into the same bad situation as before.
I should mention that rebooting these machines was a rather interesting situation, since they actually had reboot-on-LAN using magic packets (!!!), but did not have consoles. So, assuming their NICs got set up properly, you could command them to reboot, but that's it. You couldn't grab one and do some "mount / and clear the cache" shenanigans to keep things from happening again. The machines were doomed to eat themselves over and over until the bad cache/state data went away.
An idea was offered up: why not just reinstall all of the machines? That would get rid of the bad state file, and the machines are otherwise expendable. People seemed to agree this was generally a good idea - they're all down anyway, so give them 15-20 minutes to reinstall and that'll be that.
Only, well, it wasn't 15-20 minutes. It turned out to take the better part of a week to get these groups of systems back online. What happened? Lots of things. First, the reboot-on-LAN wasn't always working so well. It didn't always get enabled on the machines, and the "relay agent" host which had to be out there to put the actual magic packet on the wire was frequently dead or missing entirely. When any of that failed, humans had to go unseat and reseat the actual boards to make these machines restart, for they had no reset buttons or power switches, either.
Or, when the reboot DID work, the provisioning system did not. DHCP leases wouldn't get handed out. The stuff which serves up images to the PXE loader would fail. The actual OS ramdisk wouldn't load. The install script would try to contact some backend service that was completely overloaded due to having thousands of machines trying to reinstall at once. rsync would dribble data out slowly because, again, thousands of machines were all trying to do the same thing at the same time.
Looking back on this from years in the future, it seems there was only one possible somewhat-fast way out of the mess, and it would have still required a lot of physical reboots of server boards. It goes like this:
- Get a list of the broken machines.
- Configure the netboot stuff to hand them a ramdisk image that brings up the network and sshd and then pokes a URL to say it's alive.
- Trigger reboot-on-LAN reboots.
- Dispatch the humans for the rest.
- Have some evil script ssh into the ramdisk-booted machines, remount the filesystem read-write, and deal with the bad file. (There's a rough sketch of that script after this list.)
- Remove each machine from the netboot/ramdisk redirect once they pass the last step, then reboot it again.
- Order some pizza and wait while it runs.
- Send some hard liquor up to the thankless folks at the datacenter.
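For the curious, the "evil script" step doesn't have to be anything fancy. Here's a rough Python sketch of the idea, assuming a plain ssh client on the admin box, root access to the rescue ramdisk, and made-up hostnames, mount points, and state file paths (I have no idea what the real ones were):

```python
#!/usr/bin/env python3
"""Hypothetical cleanup runner: ssh into each ramdisk-booted machine,
remount the real root read-write, and remove the bad state file.
All hostnames and paths below are invented for illustration."""

import subprocess

# Assumed list of broken machines and a made-up path to the bad state file,
# as seen from where the rescue ramdisk mounts the real root filesystem.
HOSTS = ["node001.example.com", "node002.example.com"]
BAD_STATE = "/mnt/root/var/cache/helper/graph.state"

# Commands run on the machine once it's up in the rescue ramdisk.
REMOTE_CMDS = (
    "mount -o remount,rw /mnt/root && "   # assumes the ramdisk mounted the real root read-only here
    f"rm -f {BAD_STATE} && "
    "sync"
)

def clean(host: str) -> bool:
    """Return True if the bad state file was removed on this host."""
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=10", f"root@{host}", REMOTE_CMDS],
        capture_output=True, text=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for host in HOSTS:
        status = "cleaned" if clean(host) else "FAILED (send a human)"
        print(f"{host}: {status}")
```

Pulling each cleaned machine back out of the netboot redirect would be another call against whatever drives the provisioning setup, and that part is left out here.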
What about making it not happen in the first place? We should talk about that too. (Except for THE ONE, who knows all of this already.)
The system should not allow a loop to be introduced. Obviously. But, it happened, and it'll happen again given enough time, so what else?
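I obviously don't know what the real system's write path looked like, but the check itself is small. Here's a hedged Python sketch with a made-up `parents` map: before accepting a new parent link, walk up from the proposed parent, and refuse if you ever run into the child.

```python
# Sketch only: `parents` maps each node to the set of its parent nodes.
# Before adding "parent -> child", make sure child isn't already an
# ancestor of parent, since that edge would close a loop.

def would_create_loop(parents: dict[str, set[str]], parent: str, child: str) -> bool:
    seen = set()
    stack = [parent]
    while stack:
        node = stack.pop()
        if node == child:
            return True          # child already sits above parent: loop
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
    return False

def add_edge(parents: dict[str, set[str]], parent: str, child: str) -> None:
    if parent == child or would_create_loop(parents, parent, child):
        raise ValueError(f"refusing to make {child} its own ancestor")
    parents.setdefault(child, set()).add(parent)
```

Walking upward from the parent is the cheap direction here: the only new cycle that edge can create is one that passes through the child.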
The system should not chase any sequence too far. So, even if you have an infinite loop, at least it'll eventually give up and say "your graph is too deep and I quit".
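Continuing the same sketch (and the same made-up `parents` map), the "give up eventually" rule is just a cap on how many levels any ancestor walk will chase before it bails out. The limit below is a placeholder.

```python
MAX_DEPTH = 100   # arbitrary placeholder; pick something sane for your data

def ancestors(parents: dict[str, set[str]], node: str) -> set[str]:
    """Collect the ancestors of `node`, refusing to chase the chain forever."""
    seen: set[str] = set()
    frontier = {node}
    for _ in range(MAX_DEPTH):
        # Step up one level, ignoring anything we've already visited.
        frontier = {p for n in frontier for p in parents.get(n, ())} - seen
        if not frontier:
            return seen
        seen |= frontier
    raise RuntimeError("your graph is too deep and I quit")
```

The `seen` set alone would stop a loop from spinning forever; the depth cap is the extra insurance that even a legitimately huge (or quietly broken) graph only costs a bounded amount of work.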
Memory allocation inside the process shouldn't be allowed to go on forever. I don't mean rlimit (since it doesn't cover RSS anyway), but rather doing your own bookkeeping and checks to make sure it doesn't run out of control.
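As a sketch of what that bookkeeping might look like (the real helper almost certainly wasn't Python, and the budget number here is invented), think of a cache that keeps a running byte estimate and refuses to grow past it:

```python
import sys

class BoundedCache:
    """Toy sketch: track a rough byte count and refuse to grow past a budget,
    instead of trusting the OS (or rlimit) to stop a runaway allocation."""

    def __init__(self, budget_bytes: int = 512 * 1024 * 1024):   # made-up cap
        self.budget = budget_bytes
        self.used = 0
        self.items: dict[str, object] = {}

    def put(self, key: str, value: object) -> None:
        size = sys.getsizeof(value)   # crude estimate, but it's *something*
        if self.used + size > self.budget:
            raise MemoryError("cache budget exceeded; refusing to keep growing")
        self.items[key] = value
        self.used += size
```

The point isn't the exact accounting; it's that the process notices and fails loudly long before the box does.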
The process should be in some kind of constrained space where it can't eat all of the memory on the machine. Stick it in a memcg, or something. Just watch out for those fun kernel issues if you're on certain old versions and play games with who gets to kill who.
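On a box with cgroup v2 mounted at /sys/fs/cgroup, boxing the process in amounts to a couple of file writes before it starts doing real work. This is just a sketch: the group name and the 1 GiB limit are made up, and it assumes you have the privileges to create the group and that the memory controller is enabled for that subtree.

```python
import os

# Sketch: put the current process into a memory-limited cgroup (v2) so a
# runaway allocation gets the helper killed instead of taking the box down.
CGROUP = "/sys/fs/cgroup/helper"        # assumed cgroup v2 mount point + made-up group name
LIMIT = str(1 * 1024 * 1024 * 1024)     # 1 GiB, placeholder number

os.makedirs(CGROUP, exist_ok=True)
with open(os.path.join(CGROUP, "memory.max"), "w") as f:
    f.write(LIMIT)
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))           # move ourselves into the group
```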
The machines should be rebootable remotely with something that's more reliable than the machine itself. That probably means NOT using Yet Another Tiny Linux Box as the remote access controller, thankyouverymuch. Likewise, the console needs to exist and be accessible through the same sort of more-reliable provider.
Oh, and all of those provisioning systems? They need to Just Work, and they need to stand up to load, like an entire cluster, building, region, or whatever needing to get reinstalled at the same time. Otherwise, what'll happen when your company does the equivalent of releasing a controversial movie and all of the machines get wiped? Will they ever be able to get reinstalled?
At some point it might just be easier to go out of business.