Saturday, February 22, 2025

Answering reader feedback: war rooms vs. deep investigations

A reader asked me to follow up on something I mentioned quite a while back. Way back in 2018, I wrote that maybe I'd write about "war rooms" and why they're bad for getting thoughtful analysis work done.

Here's my best take on it, in feedback style, where things are sort of off-the-cuff and I don't have a prepared statement ahead of time. Ready? Here we go.

My canonical example for a hellish war room is what happened on Friday, August 1, 2014. For people in the know (like anyone who took my class for new technical employees), that's the date of "Call the Cops" - an epic outage that took down all of Facebook for several hours. It was so named internally because the sheriff of LA County tweeted something like "we know it's down, it's not a law enforcement emergency, so stop calling us about it".

I've mentioned this any number of times in talks and posts, and other people have too. It's no secret. The site tanked bigtime that day.

It actually broke while I was waiting for my bus to arrive, so I ended up getting online from the sidewalk using my cell phone's tethering function, then rode in doing my very best to follow things while bouncing around as the bus shot up 101 to Menlo Park.

Then, since it was Friday and we were going to do the usual weekly "big show" of reviewing recent outages (SEVs), the Muppet-named room used for those events was repurposed as a "war room" instead. It was a bunch of engineer-type people thinking very hard and sweating a lot... in a relatively small room with not-great ventilation, door open or not.

It got smelly. Okay? I'm not going to sugar-coat it.

There's more, though. I've said a few times that I'm really not "fully functional" when I'm going through the Mac UI. For that reason, I ran VMware Fusion on my company laptop with Ubuntu running in a virtual machine, and used that for all of my work.

Still, actually doing work on the laptop proper (a 13-inch MacBook Air) was janky as hell because the key layout isn't quite right (want to alt-tab in Linux? better use option-tab!), and besides, the screen is tiny! It's a great little machine for browsing the web and reading e-mail, but it's no place to try to crack open a bunch of xterms and do Real Work.

Could I run my terminals in there? Yes. Did I? Yes, for a while. Was I effective? Not really. I missed my desk, my normal chair, my big Thunderbolt monitor, my full-size (and yet entirely boring) keyboard, and a relatively odor-free environment.

Fortunately, by that point in my life I had gotten old enough to where I was willing to take steps to bolster my own sanity and effectiveness, and said I'd get back on from my desk. I crossed back to our building, posted up at my desk, and proceeded to start cranking on things from there.

There were any number of things going on, and I hopped around trying to be useful. It took a while to finally settle on one thing in particular: why had the machines effectively dropped off the network? Why was sshd dead? What was going on there?

My whole goal was to find out WHY the machines had seemingly nuked everything when they ran out of memory during the "push" that morning. Was it some pathological kernel "OOM killer" thing? Was it something else running amok and shutting down the wrong jobs?

We *had* to know, or we couldn't be sure it wouldn't happen again. Lots of other people were hacking away, trying to figure out how to reproduce what had happened that morning, and others were simultaneously cranking away trying to reduce the memory bloat on the web servers.

But, without finding out what the hell had happened that reduced the machines to init and this "fbagent" thing, we were underneath this Sword of Damocles situation where it could drop on us at any point and take us down *yet again*.

People figured out that yes, they had run the machines out of memory, specifically with the push - the distribution of new bytecode to the web servers. Other people started taking steps to beat back some of the bloat that had been creeping in that summer, so the memory situation wouldn't be so bad. I suspect some others also dialed back the number of threads (simultaneous requests) on the smaller web servers to keep them from running quite as "hot".

I still had my assignment to root-cause the damn "nuke the world" thing. There was no way it was going to happen in that little conference room without my usual fine-tuned environment, and that's why I bailed.

It took me a couple of weeks to really make any sense of it. I mean, really, I'm not exaggerating here. The outage was August 1st, and I finally figured out the sequence required to nuke all the processes on the box on the afternoon of August 19th. I know this because there was a screenshot floating around where I talked my way through it on IRC as it shook itself out... and that made it into one of my public talks.

That's over 18 days of not knowing why, and worrying what could have happened pretty much the whole time.

In those days, the machines were running Upstart for init (pid 1), so the only things that would come back up were those set to "respawn" in inittab (remember that?) - meaning we'd get a getty on the console and little more. That's how I was able to use the out-of-band access to jump in and go "yep, machine is up, but everything else (including sshd) is down".

[ Side note: If you did that today, I assume that systemd would end up restarting most of the things on the box, and you might actually recover from it. I'm not about to try just for a post, though! ]

Why did fork fail? Easy: the box ran out of memory. But I had to reproduce that to know for sure. How did I do that? It took much longer, and came after chasing many dead ends based on rumors about "kernel OOM killers" and stuff like that. Were we deadlocking during the OOM kill? There was some scary stuff going on where the hosts would get really squirrelly while the messages spewed into the printk ring buffer. That consumed a bunch of time right there, and it also turned out not to be what actually caused the outage.

Finally reproducing it involved shrinking my test system's swap size from what had been multiple gigabytes down to just 64 MB. Then I also ran some "memeater" things I had coded up: they would malloc() some space and dirty the pages by writing to them so they actually got physical memory handed to them. Then they just sat and waited around. After putting enough memory pressure on the box, it finally borked.
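
In case you're wondering what that looks like, here's a rough sketch of that kind of "memeater" - not the actual tool from back then, and the chunk size and default count are made up for illustration. The idea is the same: grab memory, write to every page so the kernel has to hand over real frames, then sit there holding onto it.

    /* memeater.c - illustrative sketch, not the original tool.
     * Allocate memory in chunks, dirty every page so it gets backed by
     * physical memory, then hang around keeping the pressure on.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK_BYTES (64UL * 1024 * 1024)   /* 64 MB per allocation */

    int main(int argc, char **argv) {
        long chunks = (argc > 1) ? atol(argv[1]) : 16;

        for (long i = 0; i < chunks; i++) {
            char *p = malloc(CHUNK_BYTES);
            if (p == NULL) {
                fprintf(stderr, "malloc failed after %ld chunks\n", i);
                break;
            }
            /* Touch every byte so the pages actually get physical memory
             * handed to them, not just address space. */
            memset(p, 0xA5, CHUNK_BYTES);
            printf("holding %ld MB\n", (i + 1) * 64);
        }

        /* Now just sit and wait, holding everything we grabbed. */
        pause();
        return 0;
    }

Run a few of those at once (or just ask for enough chunks) on a box with almost no swap and you get the same sort of memory pressure I was trying to recreate.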

Even then, I thought it was the task scheduler thing the company had written for its own prod environment, because, again, everyone assumed it was guilty, and that was the undercurrent. But no, it wasn't. A few minutes later, I found the smoking gun: fbagent had logged something about "starting kill of child -1" at exactly the time everything died.

Years of nerding out on Linux boxes had taught me what killing "pid -1" would actually do, and this finally explained why this fbagent thing hadn't died in the slaughter of every process on the box. It was the thing doing the murdering!

This fbagent process ran as root, ran a bunch of subprocesses, called fork(), didn't handle a -1 return code, and then later went to kill that "wayward child". Sending a signal (SIGKILL in this case) to "pid -1" on Linux sends it to everything but init and yourself. If you're root (yep) and not running in some kind of PID namespace (yep to that too), that's pretty much the whole world.
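
To make that failure mode concrete, here's a tiny sketch of the pattern. To be clear, this is not fbagent's actual source - just the general shape of the bug as described above, with made-up details: a fork() that fails under memory pressure, a -1 that gets remembered as a child pid, and a later kill() that does exactly what it was told.

    /* Illustrative only - not the real fbagent code. If fork() fails and
     * the -1 return value sticks around as a "child pid", a later
     * kill(pid, SIGKILL) becomes kill(-1, SIGKILL). As root, outside any
     * PID namespace, that signals every process except init and yourself.
     */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void) {
        pid_t child = fork();  /* under heavy memory pressure, this can return -1 */

        if (child == 0) {
            /* Child: stands in for some subprocess doing work. */
            execlp("sleep", "sleep", "60", (char *)NULL);
            _exit(127);
        }

        /* The bug: nothing here notices child == -1 before it gets
         * stashed away as the pid of a "wayward child". */

        /* ... later, the parent decides that child needs to go ... */
        if (kill(child, SIGKILL) != 0) {
            perror("kill");
        }

        return 0;
    }

If fork() succeeds, this just kills a harmless sleep. If it fails and you happen to be root with no PID namespace, that one kill() call is the whole "nuke the world" event - which is why you really don't want to test it anywhere you care about.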

Of course, I was looking at the checked-in source code and couldn't figure out just where the hell this -1 was coming from. The call to fork() *did* check for a -1 and handled it as an error and bailed out. So how was it somehow surviving all the way down to where kill() was called?

That was another rathole, and the answer was also a thing to behold: I couldn't see it in the checked-in source code because it had been fixed. Some other engineer on a completely unrelated project had tripped over it, figured it out, and sent a fix to the team which owned that program. They had committed it, so the source code looked fine.

[ Another side note: this person who fixed a bug in some code that wasn't their actual "job" was the kind of excellent behavior that used to be lionized there - "nothing at FB is someone else's problem". That credo died a long time ago. ]

Unfortunately, production (many, many machines) was running the last release, which had been cut WELL before that point. It had the bug in it: run machine out of memory, make fork fail, kill the world, all die, oh the embarrassment.

I can't imagine doing that kind of multi-window parallel investigation stuff on a teeny little laptop screen with people right next to me on either side, while the whole room flop-sweats about going out of business or whatever if they don't get it fixed.

I guess a "war room" might work out if you have a bunch of stuff that has to happen to deal with a possible "crisis" and then it's just a matter of coordinating it. You don't have people doing "heads-down hack" stuff nearly as much in a case like that.

I have actually seen such a gathering work out nicely, and I'll leave that as a tale for yet another time.