Sunday, January 20, 2019

The night of a cascading failure

It's interesting to see random people out in the world trying to make sense of someone else's outage using just the tools they have available. Some of them write up these essays as a way to pitch their own product. They end up essentially saying things like this: "if you had us, you would have known they were down, because we noticed all of the traceroutes stopped short all at once". That's actually true, but then they go on to say things like "the engineering team likely shut off traffic". That can be true without being meaningful, so saying it doesn't really help matters.

"The engineering team likely shut off traffic", when phrased exactly like that, could mean anything from "someone deliberately turned off the site because something crazy was going on" to "someone triggered an epic bug and that then sent all traffic into a black hole". I mean, they're both the engineering team, and they both shut off traffic, but there is a world of difference in meaning behind them. Do you really think traceroute is going to tell you which scenario is going on? I'm guessing not.

Let's talk about epic bugs, instead.

So you have an engineer, and they're working on a request. They've been asked to do some token assignment to add a feature to a bit of plumbing. This is a big pseudo-database used as a lookup service. It has lots of nodes in it. The nodes have both "parent" and "child" relationships. They can actually have multiple parents and can inherit traits from them if desired.

Anyway, it's kind of important, since it's used for programs to find other instances of themselves in this vast sea of computers. It's also used for client programs to find instances of servers they want to talk to. It's a bit like DNS: you ask it for someone who's providing X, and it gives you a list of candidates: hosts and port numbers, that sort of thing. (Yes, DNS can do port numbers. Check out SRV records.)

Since it's important, the engineer doesn't mess with any of the existing nodes. Instead, they just spawn a child of the existing node and give it a name like "mytest2". The idea is that they'll be able to work on it safely. It's a good idea, but the code has other plans. They type the command and hit enter. The change propagates outward... and every single production service goes down at the same time, HARD, and doesn't come back up. Every. Single. Production. Service. Nothing survives.

It's a bit of an understatement to say "people noticed". Oh yeah, they noticed. Lots of people noticed. There was much wailing and gnashing of teeth. Police may have been called in various jurisdictions to say that they couldn't access their favorite whatever-it-is. Really.

People tried to fix this. It was the middle of the night, so nearly everyone was remote. They had to log in to their bastion machines in order to fiddle with production. This was standard operating procedure: ssh to your bastion, then ssh into production and get cracking.

Only on this night, it wasn't working. Their ssh would fail.

It would become clear much later that ssh was failing because the login process on the machines needed to look up users in LDAP (a directory service itself) and couldn't. That LDAP lookup failed because of the original outage: it needed to cross a production service, and since all of those were dead, it was dead too.

No more ssh to those bastions.

Now, granted, there were a few people still logged in, but since it was the middle of the night, trying to FIND all of them and bring them together in the same place to start acting coherently was interesting. Let's just say that eventually, enough existing connections and backdoors were glued together to where a few people could finally access production to see what was happening.

That's when they found a curious pattern: every single service had registered an update to the "mytest2" node, and had then thrown a std::out_of_range error. Then after that, nothing happened. Silence.

Eventually, someone figured out what had happened, and realized that "mytest2" needed to go. It was tripping some kind of bug. So, using some dark ninja magic to go around the now-broken systems, they reached into the low-level storage system and deleted just that node with the greatest of care. It propagated outward, and the exceptions stopped being thrown. Things came back to life by themselves for the most part.

The bug came down to some string handling and signed/unsigned number shenanigans.

Let's say you are building a quasi-database, and you have a certain naming scheme for some of the nodes in it. That is, you can name them anything, but you decide that any of them which start with "pink.bunny.basis" (PBB) are special. The actual PBB node is the parent of the whole special set, and all of the others have names which continue on from it. So, you have "pink.bunny.basis.1234" and "pink.bunny.basis.2345" and "pink.bunny.basis.9876", and so on.

This technique let them import data from a "legacy system" that had used nothing but numbers as identifiers. The entire ecosystem could then switch to looking up "PBB.number" in the new pseudo-database and get the same results.

To actually do this, there was some magic going on. Any time someone looked one of these up, the code scanned through the children of the root PBB node and applied some string operations to each child's name.

First, they said, the root is "PBB", and all of the children are "PBB", a dot, and then a number. The number can be variable length since it might be "1", or "100", or "1000000". So let's find out how long the meaningful part of the string is by subtracting the length of the parent from the length of the child.

size_t new_len = child.length() - parent.length();

If child is "PBB.12345" and the parent is "PBB.", then the equation is "9 - 4" and new_len is 5.

Next up, they figure they should do a sanity check, because checks are good, right?

if (new_len <= 0) { yell(); return; }

Now, confident they had a good number thanks to their check, they proceeded to use it in a string comparison operation. It was something like this:

if (child.compare(new_len, some_num, other_thing) == 0) { do_stuff(); }

Then it went off and did whatever it had to do. This code had worked just fine until that night, when it finally did not.

So just what happened? Underflow.

"mytest2" (length 7) is SHORTER than "pink.bunny.basis" (length 16). What does that do to the new_len computation?

size_t new_len = 7 - 16;

7 - 16 is -9, right? Well, it is and it isn't. Here, it isn't, because it's a size_t, and it's unsigned. It can't represent negative numbers (that would be ssize_t, naturally). So, instead of giving you a negative 9, it wraps around and gives you a number that's incredibly big. It's basically the biggest number you can fit into a size_t on your machine... less eight. If that's a 64 bit system, it's probably this: 18446744073709551607.

The fun comes a few lines later where the "compare" runs. It's asked to compare a string starting at position 18446744073709551607. This is for a string which only has 7 characters in it. Obviously, it can't do that. As per the spec, when you give string::compare a "pos" number which is greater than the length of the string, it will throw "out_of_range". That's exactly what happened.
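
Here's a small self-contained sketch that reproduces the failure. The standalone main() and the variable names are mine; only "pink.bunny.basis", "mytest2", and the general shape of the checks come from the story above.

#include <iostream>
#include <stdexcept>
#include <string>

int main() {
    const std::string parent = "pink.bunny.basis";
    const std::string child = "mytest2";  // the new test node

    // 7 - 16 wraps around: size_t is unsigned, so this is 2^64 - 9,
    // i.e. 18446744073709551607 on a 64 bit system, not -9.
    size_t new_len = child.length() - parent.length();

    // The "sanity check": an unsigned value is never negative, so this
    // only catches the case where the lengths are exactly equal.
    if (new_len <= 0) {
        std::cerr << "bad length\n";
        return 1;
    }

    try {
        // pos is far past the end of a 7 character string, so per the
        // standard this throws std::out_of_range.
        if (child.compare(new_len, 5, "12345") == 0) {
            std::cout << "matched\n";
        }
    } catch (const std::out_of_range& e) {
        std::cout << "out_of_range: " << e.what() << "\n";
    }

    return 0;
}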

There are a lot of things which could have gone better with this, and hopefully other people can look for similarities in their own software stacks and avoid the same kind of outage.

(Except, you know, if you are THE ONE, at which point you already know all of this. This is not for you.)

I'd say first off, you want to rig up your coding environment so that a well-intentioned but useless comparison throws a warning. Remember the check to see if new_len is less than or equal to zero? Since new_len is unsigned (it's a size_t), it'll never be less than zero. It might be CRAZY BIG, but it'll never be negative. The compiler probably let this one through because it CAN be zero, and so it "seemed legit". If you have a compiler flag to enable or a static analysis checker which can find it, use it!
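
Tooling aside, the guard can also be made meaningful by rearranging the logic so the unsigned subtraction can never wrap in the first place. This is my own sketch, not the original code:

#include <string>

// Only subtract once we know the child is actually longer than the
// parent; the difference is then guaranteed to be positive.
bool suffix_length(const std::string& child, const std::string& parent,
                   size_t* out_len) {
    if (child.length() <= parent.length()) {
        return false;  // "mytest2" falls out right here
    }
    *out_len = child.length() - parent.length();
    return true;
}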

Next, I'd wish that we could just treat basically all strings as evil black boxes of goo which are not to be messed with in terms of addressing individual elements. This thing with compare throwing out_of_range basically amounts to requesting that it do a [] access on a member which does not exist. I've found way too many crashes and other bugs in C++ code from bad [] accesses. This one is just more subtle than most.

A nice person at a not-so-nice company once advised me on how to handle this kind of stuff. Instead of slicing and dicing the string with substring operations like it was 1985 and I was back on my VIC-20, I should use a regex and be exquisitely precise about exactly what I wanted to get from the string.

In this case, they almost certainly wanted to know that they had a string that started with "pink.bunny.basis." and then a series of digits. You can imagine what that looks like as a regex, and then if you ran it, it would either match or not, at which point you could look at the all-digit group to see what the value was.
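
Here's roughly what that might look like with std::regex. The pattern and the function name are my guesses at what they wanted, not the actual code:

#include <optional>
#include <regex>
#include <string>

// Match "pink.bunny.basis." followed by one or more digits, and hand
// back just the digit group. Anything else (including "mytest2") is
// simply "no match" instead of a trip through size_t arithmetic.
std::optional<std::string> pbb_number(const std::string& name) {
    static const std::regex re(R"(^pink\.bunny\.basis\.(\d+)$)");
    std::smatch m;
    if (std::regex_match(name, m, re)) {
        return m[1].str();
    }
    return std::nullopt;
}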

Granted, regexes are their own special circle of hell ("you now have two problems", per jwz), but it lets you avoid poking at individual characters inside a string with relatively low-level accessors.

Another thing I can wish for would be to have actual separate fields for the elements of the names. That way you don't find yourself having to concatenate strings to build the names, and you also don't have to "split on ." to break them back out.

That is, instead of having "A.B.C" as a string, imagine "A" pointing to "B" pointing to "C" as the name. You can still do your fancypants "find values of C within group A.B", but now it happens as distinct units and you don't have to do string slicing.
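
One possible shape for that, as a purely hypothetical sketch:

#include <algorithm>
#include <string>
#include <vector>

// Keep the name as distinct components instead of one dotted string.
// "Find values of C within group A.B" becomes a comparison of leading
// components, with no slicing or joining of strings anywhere.
struct NodeName {
    std::vector<std::string> components;  // e.g. {"pink", "bunny", "basis", "1234"}

    bool within(const NodeName& group) const {
        if (components.size() <= group.components.size()) return false;
        return std::equal(group.components.begin(), group.components.end(),
                          components.begin());
    }
};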

Third? Well, having the whole world die because that lookup service went down is certainly a bad thing. It basically meant that there was a hard runtime requirement on the lookup service being up and being sane. I can understand if you're unable to start up "cold" with it down, since you have no idea where things are. But having it die while you're still alive is another thing entirely. If it stops giving you results, you should probably keep using the last ones you got from it. Then you should scream bloody murder because something is VERY WRONG, and you're doomed if you should get restarted somehow. This opens the door to cache coherency and general management issues, so you really do not want to run in this mode very long.
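
A sketch of the "keep using the last answer" idea. The class and the fetch callback are invented for illustration; they're not the actual client library:

#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical lookup client that remembers the last good answer per
// query. On failure it serves the stale copy and keeps going; a cold
// start with the service down still fails, as described above.
class CachingLookup {
 public:
    explicit CachingLookup(
        std::function<std::vector<std::string>(const std::string&)> fetch)
        : fetch_(std::move(fetch)) {}

    std::optional<std::vector<std::string>> resolve(const std::string& service) {
        try {
            auto result = fetch_(service);  // the real (remote) lookup
            cache_[service] = result;       // remember the last good answer
            return result;
        } catch (const std::exception&) {
            // Something is VERY WRONG: this is where you scream bloody
            // murder, then fall back to the stale copy if there is one.
            auto it = cache_.find(service);
            if (it != cache_.end()) return it->second;
            return std::nullopt;            // cold start with the service down
        }
    }

 private:
    std::function<std::vector<std::string>(const std::string&)> fetch_;
    std::map<std::string, std::vector<std::string>> cache_;
};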

Next, I'd say that only being able to fix production from bastion boxes is an interesting concept when those bastions have world-ending dependencies on the very production systems you're trying to fix. You can get around a lot of broken stuff, but if you can't so much as start a shell on the machine, you're kind of hosed.

Options include having LDAP clients somehow do the same "last known good" schtick mentioned earlier while heeding the usual cache warnings. You could have local accounts on the box (which don't hit LDAP) for emergencies. You could have a separate set of hosts which don't use LDAP at all for logins.

(Obviously, if you're not using LDAP itself, substitute "Active Directory" or "Netware Bindery" or whatever applies to your situation and/or decade you happen to be in. If you're on Netware, say hi to Doc Brown for me.)

Finally, think about the possibility of failure injection. Let's say someone codes up something nifty which is claimed to "survive the next outage" of this lookup service. How do you know they're right? How can you be sure? Do you wait for the next outage and go "yep, my thing stayed up"? I guess you could, but that's just gross.

How about you find a way to make the thing throw an exception again right there in the same place? Put in some kind of knob and see what happens when it just starts failing all requests like it did on that cold night so long ago. Does the fix actually work? Are there side-effects? How long can it survive like this? (You want to know this so you know how long you have to fix it in the future: can it wait until morning, or do you need to wake up the whole team NOW?)
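
One shape the knob could take, as a sketch (every name here is hypothetical):

#include <atomic>
#include <stdexcept>
#include <string>
#include <vector>

// A switch that makes every lookup fail the same way it failed that
// night, so "we survive the outage now" can be tested on demand or on
// a schedule instead of waiting for the real thing.
std::atomic<bool> inject_lookup_failure{false};

std::vector<std::string> lookup(const std::string& service) {
    if (inject_lookup_failure.load()) {
        throw std::out_of_range("injected failure: lookup service is 'down'");
    }
    // ... normal lookup path goes here ...
    return {};
}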

Then, how do you make sure it keeps working? Can you trigger this knob on demand? Can you make it happen periodically, preferably without a human having to remember to drive it?

Can you then spread this kind of thinking into the rest of the infrastructure? Can you get other people to worry about this kind of failure mode in their own services, and check their own tests, add their own injection modes, and come up with ways to generally stay afloat?

Remember that as you read through this story, you knew about the intrepid engineer and their "mytest2" change because that's how I laid it out. The people who had to fix the problem initially had no idea it traced back to that, and not just because there are thousands of changes going on at any given moment. They also couldn't log in to even SEE the list of changes, and then when they could, the list *broke* because of the bug. They also couldn't even page each other initially, because, well, again, everything was broken.

It takes time to get a handle on things and start establishing what's wrong, particularly if your tools and analysis techniques suddenly fail you. You have to back down the technology chain until you locate something which is still working, and then bootstrap back up from there to understand what happened. Systemic measurement errors are a real pain.

And hey, when it happens at your company, be kind to your coworkers. They didn't write the bug. They just tripped it.

...

Programming note: the above was my past. In a few days, I start doing it again. When my posting cadence drops off, don't be too surprised or alarmed. Know that my cycles will be put to good use.