Friday, March 30, 2018

Non-uniform memory access meets the OOM killer

Do you have machines with NUMA? That's non-uniform memory access, meaning that depending on which CPU your code is running on, it might be faster to some bits of your RAM and slower to others. Then, if it gets rescheduled onto another CPU later, it might be slower to those first bits of memory and faster to the others.
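
If you want to see that from a program's point of view, here's a minimal sketch using libnuma (nothing from the original story, just an illustration) that asks which CPU it's on and which node owns that CPU. Build with -lnuma.

    #define _GNU_SOURCE
    #include <numa.h>       /* numa_available, numa_node_of_cpu, numa_max_node */
    #include <sched.h>      /* sched_getcpu */
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support here\n");
            return 1;
        }

        int cpu  = sched_getcpu();          /* the CPU we're running on right now */
        int node = numa_node_of_cpu(cpu);   /* the node that CPU belongs to */

        printf("cpu %d lives on node %d (highest node: %d)\n",
               cpu, node, numa_max_node());
        return 0;
    }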

Sometimes you see this with really big systems with hundreds of gigs of memory. Such a system might be built for the purpose of being a glorified online cache. The rest of the box is secondary to its station in life, which is to sit on the network, receive bits, stick them in RAM, and (probably) fling them back at others when requested.

If you're running such a beast on a NUMA system, odds are you're playing weird games with allocation policies, messing with interleaving, or doing one of those other NUMA things.
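
The interleaving trick, for instance, looks roughly like this: a sketch, again assuming libnuma, that spreads a big allocation round-robin across every node so no single node fills up first. The 1 GB size is just for illustration.

    #include <numa.h>       /* numa_alloc_interleaved, numa_free */
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        size_t len = 1UL << 30;                  /* 1 GB, purely illustrative */
        void *buf = numa_alloc_interleaved(len); /* pages round-robin across nodes */
        if (!buf)
            return 1;

        memset(buf, 0, len);   /* policies apply at fault time: touch the pages */
        puts("spread across all nodes");
        numa_free(buf, len);
        return 0;
    }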

Imagine your surprise, then, when someone ships a new version of some low-level support program, and it goes to all of your machines at once, and they start out-of-memory-killing your memory storage stuff. In a matter of minutes, your entire service is dead and people start running around trying to figure it out.

In the course of troubleshooting the problem, someone notices the new version (which went everywhere at once), and people start looking at that. What could possibly be the problem there? Oh, it's asking for 1 GB of memory at startup.

But, they say, these machines have hundreds of GB of memory! Even though 1 GB from this dinky little utility is ridiculous, it shouldn't be a problem. What the heck happened? And why did it not kill every single box?

Back when this happened, those were the questions. The answers turned out to be kind of amazing.

Did you know you can force your memory allocation request(s) to only be served up by a given NUMA node? Did you then know that if that node happens to be nearly full, the kernel's OOM killer will wake up and start looking for something to assassinate? Well, you do now...
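
Here's a sketch of what that looks like in code, assuming libnuma again (the node number and the 1 GB size are mine, not from the actual tool): with the bind policy set to strict, these pages must come from that one node, no matter how much memory the rest of the box has free.

    #include <numa.h>       /* numa_set_bind_policy, numa_alloc_onnode */
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        /* Strict mode: allocations below become MPOL_BIND, meaning they
         * MUST be satisfied from the named node -- no falling back. */
        numa_set_bind_policy(1);

        size_t len = 1UL << 30;                  /* the 1 GB startup spike */
        void *buf = numa_alloc_onnode(len, 0);   /* node 0 only (illustrative) */
        if (!buf)
            return 1;

        /* Touching the pages is what actually creates the demand on node 0.
         * If node 0 can't cough up a gig, the OOM killer wakes up and goes
         * looking for a victim with memory on that node. */
        memset(buf, 0, len);

        numa_free(buf, len);
        return 0;
    }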

What had happened started falling into place. A new version of this tool had shipped out to 100% of the fleet for some reason. It had a quirk where it temporarily puffed up to 1 GB of memory at startup. This new version also had a wacky little "feature" where it tried to bind itself to a single NUMA node.
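
I never saw the tool's source, but the "bind itself to a single node" part presumably looked something like this sketch (the node string "0" is purely illustrative):

    #include <numa.h>       /* numa_parse_nodestring, numa_bind */
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0)
            return 1;

        struct bitmask *nodes = numa_parse_nodestring("0");  /* illustrative */
        if (!nodes)
            return 1;

        /* Pins execution to node 0's CPUs *and* restricts allocations to
         * node 0's memory.  From here on, everything this process asks
         * for has exactly one place it's allowed to live. */
        numa_bind(nodes);
        numa_bitmask_free(nodes);

        puts("now confined to node 0");
        /* ... startup continues, including that temporary 1 GB puff-up ... */
        return 0;
    }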

Guess what happened when it started up bound to the NUMA node where the actual service was already taking up 99% of the memory? There wasn't a free gig on that node, so the kernel tried to satisfy the request by killing the biggest thing it could find... which was the memory storage service, of course.

Boom, down goes the service.
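
(Incidentally, the usual seatbelt for a big must-not-die service is to tell the OOM killer to look elsewhere via the kernel's oom_score_adj knob. A sketch; note that lowering the value needs root or CAP_SYS_RESOURCE, and making your biggest process unkillable has its own failure modes.)

    #include <stdio.h>

    int main(void)
    {
        /* -1000 means "never pick me"; values toward +1000 make a
         * process a juicier target for the OOM killer. */
        FILE *f = fopen("/proc/self/oom_score_adj", "w");
        if (!f) {
            perror("oom_score_adj");
            return 1;
        }
        fprintf(f, "-1000\n");
        fclose(f);

        /* ... the big memory storage service would carry on from here ... */
        return 0;
    }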

The worst part of a story like this is that the utility in question doesn't even need to be the thing that started fiddling with NUMA performance tricks. It might be pulling in a library where someone else has started doing that stuff. Its latest build might have just scooped that change up, and the next thing you know, you have a binary of death.
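
One cheap way to spot that kind of stowaway is to ask the kernel what memory policy your thread is actually running under, using get_mempolicy(2). A sketch, via libnuma's <numaif.h> wrapper (link with -lnuma):

    #include <numaif.h>     /* get_mempolicy, MPOL_* constants */
    #include <stdio.h>

    int main(void)
    {
        int mode = -1;

        /* No address, no flags: report the calling thread's default
         * allocation policy. */
        if (get_mempolicy(&mode, NULL, 0, NULL, 0) != 0) {
            perror("get_mempolicy");
            return 1;
        }

        switch (mode) {
        case MPOL_DEFAULT:    puts("policy: default"); break;
        case MPOL_PREFERRED:  puts("policy: preferred node"); break;
        case MPOL_INTERLEAVE: puts("policy: interleaved"); break;
        case MPOL_BIND:       puts("policy: BIND -- allocations can't leave the nodemask!"); break;
        default:              printf("policy: %d\n", mode);
        }
        return 0;
    }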

"You can OOM a single NUMA node" thus entered my list of things to worry about when a box seems to have plenty of memory but still goes off and slaughters innocent (but big) processes.