There are sysadmins who aren't systems programmers?
I've received some feedback from a few people thanking me for talking about what life is like as a sysadmin. They apparently have been using my posts as proof that their style of handling things is not crazy and that other people actually do the same thing.
What I found surprising is that apparently some people work in environments where they do not think that a sysadmin should know anything about systems programming. I guess to them, folks with those jobs should just install shrinkwrapped software and keep it running by turning knobs. I wonder how they ever got that idea.
If you're running any sufficiently complicated system and have never had to puzzle out the details of compiling, linking, libraries, kernels and/or kernel modules, you're far luckier than I have been. It seems you'd need a charmed life where everything is always compatible with everything else and all of the documentation is 100% accurate.
Then again, some of these people must exist, since there are folks who really do not understand what's going on with memory allocation and things like that. I once had a customer call up at my web hosting support gig who insisted that something was very wrong with the free/used memory counts on his machine.
Finally, I had to show him something like this:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
  char* x;

  x = malloc(1048576 * 1024);
  printf("Sleeping...\n");
  sleep(30);

  return 0;
}
I called it "eatmem.c" and fired it up. Then I showed him the result in ps. It had a VSZ of 1052628 and an RSS of 320 (ps reports both in KB). It had asked for 1 GB of memory and his machine only had 512 MB, and yet it clearly was running without being OOM-killed or whacking the entire machine by chewing swap. He had to see it to start believing the nature of overcommitting allocation strategies.
Next, I changed eatmem.c a little to actually use some of that memory.
int i;
char* x;

x = malloc(1048576 * 1024);

for (i = 0; i < 1048576 * 1024; i += 1048576) {
  x[i] = 1;
}
The VSZ stayed the same, but the RSS jumped up to 4536. Making it dirty twice as many regions by only incrementing i by 524288 made the RSS change to 8496. I kept fiddling with the numbers to show how it would grab more physical memory as it actually started doing things with that space.
He made the connection and signed off, satisfied with this discovery. There are people who really need to see and play around with stuff in order to "get it". I know I won't be confident in describing something until I've worked out my own internal model for how it behaves, and that only happens after a lot of experimentation.
You can't figure out this sort of thing if all you ever do is fiddle with app settings. They're just situated too high up in the stack to give anything more than a blurry, laggy, and incomplete view of your system. If no part of your software ever keels over or starts acting strangely, you might just get away with it.
It's when reality sets in that you either start throwing people and money at the problem, or you sit down, figure it out, and deal with the root cause. In terms of scaling, one of these things is not like the other.