
Saturday, March 31, 2018

ulimit is not a silver bullet for memory problems

I've written about the Linux OOM killer on several occasions. In some stories (like the one from yesterday), it's been the villain, brought to life by some other misbehaved program. In other stories, it was fingered as the cause of everything on the box disappearing, only to turn out that something else did it.

Everyone seems to have a quick solution to these problems. Not all of them actually work. Yesterday's post generated at least one response which amounted to "just use ulimit". It's not that simple. It won't save you from something with runaway RSS. Not on Linux.

First off, some basics. People tend to call it ulimit, as in the command you've probably run from inside bash (or zsh, or dash, or a couple of other shells), but that's really just a shell builtin. Try that same thing in tcsh or tclsh and you'll quickly discover it's not a standalone program. What we're really talking about is setrlimit(), and specifically on Linux. I'm not covering any other kernel here today.
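
For the sake of being concrete, here's roughly what that builtin is doing on your behalf (the resource choice and the numbers below are just for illustration): it calls getrlimit() and setrlimit() on the shell itself, and whatever it sets is inherited by anything the shell runs afterward. The two memory-flavored knobs, "ulimit -m" and "ulimit -v", map to the RLIMIT_RSS and RLIMIT_AS resources that come up next.

    // Minimal sketch: "ulimit" is the shell calling getrlimit()/setrlimit() on
    // itself; children inherit the limits across fork()/exec().
    // bash's "ulimit -m" maps to RLIMIT_RSS, and "ulimit -v" to RLIMIT_AS.
    #include <sys/resource.h>
    #include <cstdio>

    int main() {
        struct rlimit rl;
        if (getrlimit(RLIMIT_AS, &rl) == 0) {          // what "ulimit -v" reads
            std::printf("RLIMIT_AS soft=%llu hard=%llu\n",
                        (unsigned long long)rl.rlim_cur,
                        (unsigned long long)rl.rlim_max);
        }
        rl.rlim_cur = rl.rlim_max = 1UL << 30;         // like "ulimit -v 1048576"
        if (setrlimit(RLIMIT_AS, &rl) != 0) {          // (ulimit -v counts in KiB)
            perror("setrlimit");
        }
        return 0;
    }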

Anyway, setrlimit, right? You might read the man page for it and discover a bunch of interesting-looking resources. Given that you're trying to keep the program from scarfing down RSS, RLIMIT_RSS seems mighty tasty. You might even try to use it as a sigil to keep your process from doing too much evil.

Unfortunately, if you just start using it without really reading the docs, you will eventually find out the hard way that it's not doing anything for you. Once you turn back to the man page, you will find this:

Specifies the limit (in pages) of the process's resident set (the number of virtual pages resident in RAM). This limit has effect only in Linux 2.4.x, x < 30, and there affects only calls to madvise(2) specifying MADV_WILLNEED.

Are you on Linux 2.4.29 or below, and are you calling madvise() that way? Given that 2.4.29 apparently dates to January of 2005, I'm guessing not. As a result, RLIMIT_RSS is doing nothing for you.
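
If you want to convince yourself, a throwaway test program will do it (the numbers here mean nothing in particular): set RLIMIT_RSS to something tiny, then go touch a pile of memory anyway. On any modern kernel, nothing stops you.

    // Throwaway sketch: "limit" the RSS, then blow right past it.
    #include <sys/resource.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main() {
        struct rlimit rl = {4096, 4096};      // tiny, whether you read it as pages or bytes
        if (setrlimit(RLIMIT_RSS, &rl) != 0) {
            perror("setrlimit(RLIMIT_RSS)");
        }

        const size_t kSize = 256UL << 20;     // 256 MiB, far beyond the "limit"
        char* p = static_cast<char*>(malloc(kSize));
        if (!p) {
            perror("malloc");
            return 1;
        }
        memset(p, 0xff, kSize);               // actually fault the pages in
        puts("still alive: RLIMIT_RSS did nothing");
        free(p);
        return 0;
    }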

Maybe you go back to the man page again, and this time you spot RLIMIT_AS. That talks about address space, virtual memory size, and stuff like brk() and mmap(). Those have something to do with how much memory you're chewing, right?

So, maybe you turn that on, and maybe nothing bad happens. You will, however, eventually discover that VSZ and RSS are two very different things. You can have a process which legitimately has a very big address space without ever using much physical memory. If this is the case with your program, and you have this limit set, you will smash into it without getting anywhere near actually running the box out of RAM.
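
Here's a sketch of that failure mode (sizes invented for the example): the mapping below never touches a single page, so it costs almost nothing in RSS, but it still counts against RLIMIT_AS, and so it fails with ENOMEM while the machine is nowhere near any real memory trouble.

    // Sketch: a big, untouched mapping trips RLIMIT_AS without using any RAM.
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <cstdio>

    int main() {
        struct rlimit rl = {512UL << 20, 512UL << 20};   // cap the address space at 512 MiB
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit(RLIMIT_AS)");
            return 1;
        }

        // Reserve 1 GiB of address space without faulting in a single page.
        void* p = mmap(nullptr, 1UL << 30, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");    // ENOMEM: VSZ hit the limit, RSS never entered into it
            return 1;
        }
        puts("mapping succeeded");
        return 0;
    }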

This tends to show up in C++ programs that die with uncaught std::bad_alloc exceptions. It's because nobody expects 'new' and friends to fail, but they certainly can, and they definitely will if they hit their heads on this particular limiter.
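
In code, that looks something like this (the limit and the allocation size are invented): operator new can't get any more address space, so it throws std::bad_alloc, and unless something on the call stack happens to catch it, the process simply terminates.

    // Sketch: RLIMIT_AS showing up as bad_alloc inside a C++ program.
    #include <sys/resource.h>
    #include <cstdio>
    #include <new>
    #include <vector>

    int main() {
        struct rlimit rl = {64UL << 20, 64UL << 20};   // 64 MiB of address space
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit(RLIMIT_AS)");
        }

        try {
            std::vector<char> big(256UL << 20);        // asks for 256 MiB
            std::printf("got %zu bytes\n", big.size());
        } catch (const std::bad_alloc&) {
            // Most real code has no handler here, so this usually appears as
            // "terminate called after throwing an instance of 'std::bad_alloc'".
            std::puts("new failed: that's RLIMIT_AS, not actual RAM exhaustion");
        }
        return 0;
    }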

This limiter works, but it's not what you want. It will give you many false positives and will cause outages all by itself, while also not catching the actual RSS consumption problem that you care about.

What to do, what to do.

Some people handle it by having a background thread in their program which wakes up every so often and takes a peek to see how big the process has gotten. Then, depending on the implementation, it might send a hint to the actual "real world" stuff to back off and start ignoring requests until things stop growing so much. Or, it might just wait for the process to cross a "line of death", at which point it kills it from the inside.
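
A bare-bones version of that watchdog might look like the sketch below. The thresholds, the polling interval, and the "back off" reaction are all invented here; a real server would wire that signal into its actual request path.

    // Sketch: a thread that polls our own RSS and pulls the plug from inside.
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <thread>

    static long rss_bytes() {
        // /proc/self/statm: total and resident sizes, in pages.
        long total = 0, resident = 0;
        FILE* f = std::fopen("/proc/self/statm", "r");
        if (!f) return -1;
        if (std::fscanf(f, "%ld %ld", &total, &resident) != 2) resident = -1;
        std::fclose(f);
        return resident < 0 ? -1 : resident * sysconf(_SC_PAGESIZE);
    }

    static void rss_watchdog(long soft_limit, long hard_limit) {
        for (;;) {
            std::this_thread::sleep_for(std::chrono::seconds(5));
            long rss = rss_bytes();
            if (rss < 0) continue;
            if (rss > hard_limit) {
                std::fprintf(stderr, "rss %ld crossed the line of death, bailing\n", rss);
                std::abort();              // die here, on our own terms
            }
            if (rss > soft_limit) {
                // A real server would flip a flag here that makes the request
                // path start shedding load until things shrink again.
                std::fprintf(stderr, "rss %ld over soft limit, backing off\n", rss);
            }
        }
    }

    int main() {
        std::thread(rss_watchdog, 512L << 20, 1L << 30).detach();  // 512 MiB / 1 GiB
        // ... the real work of the program goes here ...
        std::this_thread::sleep_for(std::chrono::seconds(60));
        return 0;
    }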

Other people resort to cgroups. This is something on Linux where you can basically get the OOM killer to fire based on a smaller-than-machine-size limit while only considering the specific tasks inside that "container" (term used loosely here).
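
With cgroup v2, the mechanics are mostly just writing a few files. The sketch below assumes the v2 hierarchy is mounted at /sys/fs/cgroup, that you have enough privilege to create a group there, and that the memory controller is enabled; the group name and the limit are made up.

    // Sketch: make a memory-limited cgroup (v2) and move ourselves into it.
    #include <sys/stat.h>
    #include <unistd.h>
    #include <fstream>
    #include <string>

    static bool write_file(const std::string& path, const std::string& value) {
        std::ofstream f(path);
        f << value << std::flush;
        return static_cast<bool>(f);
    }

    int main() {
        const std::string cg = "/sys/fs/cgroup/demo";       // hypothetical group name
        mkdir(cg.c_str(), 0755);                            // ignore EEXIST for the sketch
        if (!write_file(cg + "/memory.max", "536870912"))   // 512 MiB cap
            return 1;
        if (!write_file(cg + "/cgroup.procs",               // move this process in
                        std::to_string(getpid())))
            return 1;
        // From here on, this process (and its children) is accounted against the
        // 512 MiB, and blowing through it gets an OOM kill scoped to the group.
        return 0;
    }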

It wouldn't surprise me if still other folks try co-opting malloc and friends in their low-level libraries, forcing them to do all of the accounting and having them start failing requests past a certain point. It's unclear exactly what kind of overhead this would give you, particularly if you have lots of hot paths which are constantly asking for or releasing bits of memory.
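
As a very rough illustration of the idea, and definitely not anyone's production allocator, you can replace the global operator new/delete in a C++ program and do the bookkeeping there. The budget below is invented, and the atomic counter on every single allocation is exactly the kind of hot-path overhead in question.

    // Sketch: account for every allocation and fail requests past a budget.
    #include <atomic>
    #include <cstddef>
    #include <cstdlib>
    #include <new>

    static std::atomic<std::size_t> g_in_use{0};
    static constexpr std::size_t kBudget = 512UL << 20;               // invented 512 MiB budget
    static constexpr std::size_t kHeader = alignof(std::max_align_t); // keeps alignment intact

    void* operator new(std::size_t n) {
        if (g_in_use.fetch_add(n, std::memory_order_relaxed) + n > kBudget) {
            g_in_use.fetch_sub(n, std::memory_order_relaxed);
            throw std::bad_alloc();           // fail the request, not the whole box
        }
        // Stash the size in front of the block so operator delete can subtract it.
        void* base = std::malloc(n + kHeader);
        if (base == nullptr) {
            g_in_use.fetch_sub(n, std::memory_order_relaxed);
            throw std::bad_alloc();
        }
        *static_cast<std::size_t*>(base) = n;
        return static_cast<char*>(base) + kHeader;
    }

    void operator delete(void* p) noexcept {
        if (p == nullptr) return;
        char* base = static_cast<char*>(p) - kHeader;
        g_in_use.fetch_sub(*reinterpret_cast<std::size_t*>(base),
                           std::memory_order_relaxed);
        std::free(base);
    }

    int main() {
        char* ok = new char[1 << 20];         // 1 MiB: fits in the budget
        delete[] ok;
        try {
            new char[1UL << 30];              // 1 GiB: over budget, throws
        } catch (const std::bad_alloc&) { }
        return 0;
    }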

Still other folks resort to more than one of these techniques, and that's pretty much my preference: have the app try to be aware of itself and throttle back if at all possible. Then have it kill itself if it runs away. Then if that doesn't work, have some kind of "container" limit, and finally, if that fails, the whole system has a global limit.

It's all terrible, naturally, but this is what you get when you build on top of systems that assume every single resource will always be available. It doesn't matter whether it's memory, disk, network, or something else entirely. Given enough time and permutations, something will disappear out from under you, and if you have no way to notice it and handle it cleanly (perhaps due to terrible layering schemes), the only possible outcome is a mess.

In retrospect, it's amazing some of this stuff works at all.