Patching around a C++ crash with a little bit of Lua
Sometimes, inside a company, you find someone who's just so good at what they do and who has the fire burning inside them to always do the right thing. It's easy to fall into a spot where you just go to them first instead of chasing down the actual person holding the pager that week. At some point you have to tell yourself to at least *try* "going through channels" to give them a break.
But still, you wind up with some tremendous stories from when they came through and saved the day. I love collecting these stories and I periodically share them here. This is another one of those times.
Back up about a decade. There was something new happening where people were starting to get serious about compressing their HTTP responses to cut down on bandwidth and latency both: fewer packets = fewer ACKs = less waiting in general. You get the idea.
A new version of the app for one particular flavor of mobile device had just been built which could handle this particular flavor of compression. It was going out to alpha and then beta testers, so it wasn't full-scale yet. When it made a request, it included an HTTP header that said "hey, web server, I can handle the new stuff, so please use it when you talk back to me".
On my side of the world, we didn't know this right away. All we knew was that our web servers had started dying. It was one here, then one there, then some more over in this other spot, and a few more back in the first place, and it was slowly creeping up. This wasn't great.
We eventually figured out that it was crashing in this new compression code. It had been added to the web server's binary code at some point before, and it obviously had a problem, but I don't think we had a good way to turn it off from our side. So, every time one of these new clients showed up with a request, their header switched on the new code for that response, and when it ran, the whole thing blew up.
When the web server hit the bad code, it not only killed the request from the alpha/beta app, but it also took down every other one that same machine was serving at that moment. Given that these systems could easily be doing dozens of requests simultaneously, this was no small thing! Lots of people started noticing.
That's when one of those amazing people I mentioned earlier stepped in. He knew how to wrangle the proxies which sat between the outside world and our web servers. They had a scripting language (the Lua of the title) which could be used to apply certain transforms to the data passing through them without going through a whole recompile & redeploy process for the proxies themselves.
What he did was quick and decisive: he added a rule to drop the "turn on the new compression" header from incoming HTTP requests. With that header stripped, the web server wouldn't go down the branch into the new (bad) code, and wouldn't explode. We stopped losing web servers, and we were now in a situation where the pressure was off and we could work on the actual crash problem.
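I don't have his actual rule, but here's a minimal sketch in plain Lua of what "drop that header before forwarding" amounts to. The header name is invented for illustration, and the real proxy exposed requests through its own API rather than a bare table.

    -- Sketch only: strip a request header before the request is forwarded
    -- to the web server. The header name here is hypothetical; the real one
    -- was whatever the client used to advertise the new compression support.
    local BAD_HEADER = "x-accept-new-compression"

    -- Remove a header (case-insensitively) from a table of request headers.
    local function strip_header(headers, name)
      name = name:lower()
      for key in pairs(headers) do
        if key:lower() == name then
          headers[key] = nil
        end
      end
    end

    -- Example: what a proxy hook might do with an incoming request.
    local request_headers = {
      ["Host"] = "www.example.com",
      ["X-Accept-New-Compression"] = "1",
    }
    strip_header(request_headers, BAD_HEADER)

    -- The forwarded request now carries no compression hint, so the web
    -- server never takes the branch into the crashing code path.
    for k, v in pairs(request_headers) do print(k, v) end  --> Host  www.example.com

The nice part of doing it at the proxy layer is that it's one rule in one place, instead of touching every web server or every client.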
I should mention that we were unable to just switch off the new feature in the clients. The way that clients found out what features to run in the first place was by talking to the web servers. They'd get an updated list of what to enable or disable, and would proceed that way. But, if the web server crashed every time they talked to it, they would never get an update.
That's why this little hack was so effective. It broke the cycle and let us regain control of the situation. Otherwise, as the app shipped out to more and more people, we would have had a very bad day as every query killed the web servers.
And yes, we do refer to such an anomaly as a "query of death". They tend to be insidious: when one shows up, it takes down a whole multitenant node and every other request it was serving, too. Then it inevitably gets retried, finds another node, and nukes that one as well. Pretty soon, you have no servers left.
To those who were there even when they weren't on call, thank you.