Writing

Software, technology, sysadmin war stories, and more.
Wednesday, September 22, 2021

A different take on the NUMA OOM killer story

I was digging through some notes on old outages tonight and found something potentially useful for other people. It's something I have mentioned before, but it seems like maybe that post didn't have enough specifics to make it really "land" when someone else does a web search.

So, in the hopes of spreading some knowledge, here is a little story about a crashy service.

On a Wednesday night not so long ago, someone opened a SEV (that is, a notification of something going wrong) and said the individual tasks for their service were "crash looping". These things ran in an environment where they were supervised by the job runner, so if they died, they'd be restarted, but you never wanted them stuck in that kind of state.

A bot announced the SEV's creation in the company chat channel where production stuff was discussed, and that got the attention of at least one engineer who happened to be hanging around there. The engineer who opened the SEV was also in that channel, as best practice dictated, and they started debugging things together.

It looked like this had started when some update had rolled out. That is, prior to the update, their service was fine. After they changed something, it started crashing. Things were so unstable that some of the tasks never "became healthy" - they would go from "starting up" to "dead" and back again - they never managed to reach a steady "ready/online/good/healthy" state. This meant they weren't serving requests, and the service was tanking.

The job supervisory service was noticing this, and it logged the fact that the binaries were exiting due to signal 9, that is, SIGKILL. Given that the job supervisor wasn't sending it, the processes weren't sending it to themselves, and nobody else was on these boxes poking at things, what else could it be?

It was the kernel. Specifically, it was the OOM (out of memory) killer.

date time host kernel: [nnn.nnn] Killed process nnn (name) ...

This line in the syslog explained that it was getting smacked down for using too much memory. However, the odd thing was that the amount of actual physical memory in use - the resident set size (RSS) - was merely 6 GB. This was on a 16 GB machine. It should not have been dying there, particularly since the machine was supposed to be dedicated to this work. Nothing else had "reserved" the rest of it.
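If you want to spot this kind of thing yourself, the kill shows up in the kernel log. Here's a minimal sketch of fishing those lines out - the sample log line below is made up for illustration, not from the real incident:

```shell
# Write an illustrative kernel log line to a scratch file
# (real lines come from dmesg or /var/log/syslog).
cat > /tmp/sample_syslog <<'EOF'
Sep 22 01:23:45 host kernel: [12345.678] Killed process 4242 (myservice) total-vm:9000000kB, anon-rss:6291456kB
EOF

# Pull out which PID and process name the OOM killer took down.
grep -o 'Killed process [0-9]* ([^)]*)' /tmp/sample_syslog
```

On a real box you'd point the same grep at `dmesg` output or the syslog itself rather than a scratch file.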

The question then became: if it's a 16 GB box, why does it die at 6?

One of the engineers noticed the job was using "numactl" to start things up. This is one of those tools which people sometimes use to do special high-performance gunk on their machines. Non-uniform memory access is something you get on larger systems where not all of the CPUs have access to all of the memory equally well - hence, the "non-uniform" part of the name.
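If you're wondering what the node layout on a given box looks like, numactl can tell you. A guarded sketch (guarded because numactl may not be installed; the per-node sizes in the comment are hypothetical):

```shell
# Show NUMA topology: node count, which CPUs belong to which node,
# and how much memory each node has.
if command -v numactl >/dev/null 2>&1; then
  numactl --hardware
  # prints lines like:
  #   node 0 size: 8192 MB
  #   node 1 size: 8192 MB
  # (hypothetical two-node split of a 16 GB box)
else
  echo "numactl not installed"
fi
```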

Because of an outage a few years before, there was a nugget of "tribal knowledge" floating around: if you are doing funny stuff with NUMA and request an allocation from just a particular node, you can totally "OOM the node". That is, the system will have plenty of memory, but that particular node (grouping of CPU + memory that's close by) might not! If you get into that situation, Linux will happily OOM-kill you.
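That's also where the "dies at 6 on a 16 GB box" number stops being mysterious. A back-of-envelope sketch, assuming a two-node machine and an illustrative ~2 GB of kernel and cache pages already sitting on the bound node (both numbers are assumptions, not measurements from the incident):

```shell
# 16 GB box split evenly across two NUMA nodes (assumed layout).
total_mb=16384
nodes=2
per_node_mb=$((total_mb / nodes))        # 8192 MB visible to a membind'd process

# Assume ~2 GB of that node is already taken by kernel/page cache/etc.
already_used_mb=2048
effective_mb=$((per_node_mb - already_used_mb))

# With --membind, this is roughly your ceiling, not the box's 16 GB.
echo "effective membind ceiling: ~${effective_mb} MB"   # ~6 GB
```

Which lines up with the process getting killed at a 6 GB RSS: it wasn't fighting over 16 GB, it was fighting over one node's slice of it.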

They dug around in their config, and sure enough, they had numactl specified with "membind". The man page for that tool hints at what can happen but doesn't quite convey the potential danger in my opinion:

--membind=nodes, -m nodes
Only allocate memory from nodes. Allocation will fail when there is not enough memory available on these nodes. nodes may be specified as noted above.

It's not just that it'll fail - you'll probably *be killed* if you happen to be the biggest thing on there. If not, you might live, but you'll cause something far bigger (and probably much more important) to die, all because of your piddling memory request!

The engineers pulled that out of the config, pushed again, and things went back to normal. It's not clear if they later came back and did "interleave" or otherwise twiddled things to get a clean "fit" into each of the available nodes on their systems.
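For anyone staring at a similar config, here's a sketch of the knobs involved. The service name and node number are made up, and the commands are printed rather than executed, since numactl may not even be present:

```shell
# The risky policy from the story, and two gentler alternatives.
# "myservice" and node 0 are stand-ins for illustration.
echo 'risky: numactl --membind=0 ./myservice      # hard-fails (OOM kill) once node 0 fills'
echo 'safer: numactl --interleave=all ./myservice # round-robin pages across all nodes'
echo 'safer: numactl --preferred=0 ./myservice    # prefer node 0, spill to others if full'
```

The last one is often the pragmatic middle ground: you keep most of the locality benefit without turning one node's limit into a death sentence.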

If you ever switched on membind with numactl and started a mass slaughter of processes on your Linux boxes, this might be why. I hope this helps someone.