Troubleshooting shell scripts that grew slower with age
Sometimes, I start writing posts, and then shelve them because they aren't quite ready, or I don't feel like putting them out, or maybe it's just not clicking into place, or who knows what. Some of them hang out "on ice" for quite some time and may never surface.
Here's one which was written about four years ago and yet was never released. I did mention it in passing early in 2017, but that post deliberately skipped the full technical dive into what was happening.
So now here we are, over four years later, and bash has been fixed, so this probably doesn't affect you any more. But, for a while, it did, whether you realized it or not. I've decided to post it anyway since it still seems like an amusing troubleshooting romp. I've punched it up to fix some of the parts which fell flat originally (which is probably why I never posted it before).
To be clear: this problem was reported and fixed a long time ago. So, unless you have a truly ancient bash on your machine, this does not affect you.
Enjoy.
Lots of systems run processes which are started and are supposed to never end. A bunch of them are just overgrown shell scripts. Maybe you've seen one, or even written one. It might look something like this:
#!/bin/bash

while true
do
  thing1 | thing2
  other_thing
  sleep 1
done
This shouldn't come as any surprise, but people write this kind of glue constantly. What might be eye-opening is that it can make you waste a stupid amount of CPU time in bash itself. This isn't some kind of "interpreted things like shell scripts are slow" rant. No, this is more like "bash is using a linked list to try to store all of the child pids which it has created, and linked lists get slow after a while".
Our dumb little example script up there runs "thing1", "thing2" and "other_thing" once a second. That means that we get at least three new PIDs which now have to be tracked. Have you ever noticed that you can refer to processes as "%1" and it'll know which one that was? Or, have you ever done a "wait" in a script and noticed it knew which children to wait for? It has to track those someplace, and it happens in a C struct called "bgpids":
struct bgpids {
  struct pidstat *list;
  struct pidstat *end;
  int npid;
};
Okay, technically, that's not the list, but it points to it. It points to the head and the end, specifically. It also has a counter. The actual list happens in the "pidstat" struct:
struct pidstat {
  struct pidstat *next;
  pid_t pid;
  int status;
};
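Every new child gets one of those hung off the end of the list. Here's a rough sketch of the append -- this is my paraphrase to show the shape of it, not bash's actual code (the real thing, over in bash's jobs.c, is more careful about reuse and limits):

/* Paraphrased sketch of appending a child to that list, reusing
 * the two structs above -- not bash's actual code. */
#include <stdlib.h>

static void
sketch_bgp_add (pid_t pid, int status)
{
  struct pidstat *ps;

  ps = (struct pidstat *) malloc (sizeof (struct pidstat));
  ps->next = NULL;
  ps->pid = pid;
  ps->status = status;

  if (bgpids.list == NULL)
    bgpids.list = ps;         /* very first entry becomes the head */
  else
    bgpids.end->next = ps;    /* everything after that goes on the tail */

  bgpids.end = ps;
  bgpids.npid++;              /* the count only ever climbs */
}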
What would you think if I told you the linked list just gets bigger and bigger as your script runs longer and longer? It has a maximum size, sure, but that limit is probably high enough to consume a whole bunch of CPU time once your script has been running for long enough. Why does this happen? Let's dig into the details.
The ultimate size of that list is influenced by the number of unique PIDs that bash can create while running your script. This might sound a little weird, so let's take a step back for a moment to explain.
Let's say you have a fairly boring Linux box. Your kernel.pid_max (look in sysctl or /proc) is probably set to 32768. That means your system starts with 1 (init, probably from systemd these days) and goes up from there. Once it uses 32768, it wraps back around. Naturally, it skips those in use, so 1 won't get used again, but eventually, you will get "low" pid numbers showing up for brand new processes again.
Assuming the script runs long enough, the child processes will probably end up with every single process ID in that range, minus the ones where things started and never stopped -- init, sure, and things like crond, syslogd, sshd, or whatever. It'll probably be MOST of the space between 1 and pid_max!
Back to our list. Every time bash spawns something, it goes and looks in that list to see if perhaps it's already seen that one before. If not, it adds another entry. The list gets longer and longer, and the number of checks keeps going up. Comp Sci people should be thinking "big O notation" right about now, if you aren't already.
I'd guess that you'll probably end up with 32000 entries in that linked list eventually. What this means in practice is that bash will flip through all of them over and over again. It's using CPU time to do all of this pointless housekeeping, and it's taking longer and longer to come back around and start your stuff up.
Here's a chunk of code from inside bash's bgp_delete():
for (prev = p = bgpids.list; p; prev = p, p = p->next)
  if (p->pid == pid)
    {
      prev->next = p->next;    /* remove from list */
      break;
    }
This happens any time something new starts. I grabbed the source and jammed a little bit of instrumentation into it (in my usual "debug via printf" fashion) to show exactly what's going on. It just increments a counter inside the loop and prints it once we drop out:
fprintf(stderr, "XXX ran %d comparisons\n", comparisons);
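Putting the pieces together, the hacked-up loop looked roughly like this. It's paraphrased from memory, and the real bgp_delete() does a bit more bookkeeping after the loop, but this is the gist:

int comparisons = 0;

for (prev = p = bgpids.list; p; prev = p, p = p->next)
  {
    comparisons++;                 /* count every entry we look at */
    if (p->pid == pid)
      {
        prev->next = p->next;      /* remove from list */
        break;
      }
  }

fprintf(stderr, "XXX ran %d comparisons\n", comparisons);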
With that in place, running the new binary shows how quickly it adds up:
$ ./bash
$ touch /tmp/foo
XXX ran 0 comparisons
$ touch /tmp/foo
XXX ran 1 comparisons
$ touch /tmp/foo
XXX ran 2 comparisons
$ touch /tmp/foo
XXX ran 3 comparisons
$ touch /tmp/foo
XXX ran 4 comparisons
$ touch /tmp/foo
XXX ran 5 comparisons
$ touch /tmp/foo
XXX ran 6 comparisons
$ touch /tmp/foo
XXX ran 7 comparisons
Surprise! It happens for interactive shells, too. I originally started writing this to talk about what happens when you have a long-lived script, but it turns out to also happen in long-lived interactive sessions. The more you start, the more gunk there is hanging out in there.
Now that we know what we're dealing with, let's torture the shell a bit and see just how bad it is. I'll start from a clean bash-4.3.30 from upstream and run the following stupid script which will expose the problem without making me wait forever.
#!/bin/bash

while true
do
  find /dev/null > /dev/null
done
Note: this is a deliberate pathological example since it doesn't sleep. That just means it shows the problem sooner rather than later. Given enough time, even a script with a sleep in it will slow down to this degree.
I called it "loopy" since that's what it does, and what it is. Now it's time to run my demented little script. Here, I demonstrate that the kernel's limits are boringly low (32K) and my session has a fairly low limit on how many processes it can create, and then start it up in the just-built bash:
$ /sbin/sysctl kernel.pid_max
kernel.pid_max = 32768
$ ulimit -u
15940
$ ./bash ../loopy
The loop prints nothing, so that's all that shows up on the terminal. After waiting a bit to let it get nice and saturated, I hopped over to another terminal to look at the CPU time spent by this thing:
$ ps -o pid,pcpu,etime,time $(pgrep -f loopy)
  PID %CPU     ELAPSED     TIME
27778 26.6       02:12 00:00:35
There we go! After two minutes, we've been on the CPU about 35 seconds, or about 26% of the time. That's the time spent by the shell itself, NOT the work done by "find"! That's a ton of overhead just to run the same stupid do-nothing process forever, and that's how I caught this originally.
Okay, so just how bad does that structure look now? Let's just peer inside with the help of the debugger:
$ gdb -batch -ex "print bgpids.npid" /proc/27778/exe 27778
warning: Could not load shared library symbols for linux-vdso.so.1.
Do you need "set solib-search-path" or "set sysroot"?
0x00007f592dd193ec in waitpid () from /lib64/libc.so.6
$1 = 15940
So, notice, 15940 is the very same number as the max limit on user processes for that session. I can raise it, but unfortunately, the shell picks that value up at startup when it does "sysconf(_SC_CHILD_MAX)", and that affects how big the list can get. That means it's too late to change it for that shell (pid 27778), and that one has to go.
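If you want to see what bash would pick up in your own session, a trivial standalone program will show it -- this is just plain sysconf(), nothing bash-specific, and run from the same session it should report the same 15940 that gdb dug out of the running shell:

/* Quick standalone check of the value bash samples at startup.
 * Build with something like: cc -o childmax childmax.c */
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  long child_max = sysconf (_SC_CHILD_MAX);

  if (child_max == -1)
    printf ("CHILD_MAX: indeterminate (no fixed limit)\n");
  else
    printf ("CHILD_MAX: %ld\n", child_max);

  return 0;
}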
Recall that I'm running this fresh build of bash-4.3.30 from inside my usual shell, so I can just identify that, raise the limit on it, and THEN start up a new child bash-4.3.30 running "loopy", and it'll pick up the new "maxchild" value, which will let the list grow larger.
So let's just abort that thing... and find out what this parent shell is:
$ ./bash ../loopy
^C
$ echo $$
7030
Right, so the parent is 7030. From a root shell elsewhere, I'll just bonk it with prlimit to give it some magic powers:
# prlimit --pid 7030 --nproc=65000:65000
Back in 7030, the new limit is now visible, so let's get going again.
$ ulimit -u 65000
$ ./bash ../loopy
After letting that soak for a while, here's what it looks like in ps:
$ ps -o pid,pcpu,etime,time $(pgrep -f loopy)
  PID %CPU     ELAPSED     TIME
 1542 21.2       02:14 00:00:28
Strangely, it actually goes down. It turns out to be really expensive at first, but eventually it almost sorts itself out. The pids eventually wrap around, and it manages to pull off a cleaning run.
After jamming way more debug printf cruft into the shell, it now looks like this:
$ ./bash ../loopy
XXX bgp_delete: pid=7261 (npid=0)
XXX ran 0 comparisons
XXX bgp_delete: pid=7262 (npid=1)
XXX ran 1 comparisons

... skipping thousands of lines ...

XXX bgp_delete: pid=7258 (npid=32429)
XXX ran 32429 comparisons
XXX bgp_delete: pid=7259 (npid=32430)
XXX ran 32430 comparisons
XXX bgp_delete: pid=7261 (npid=32431)
XXX *** removed ***
XXX ran 1 comparisons
XXX bgp_delete: pid=7262 (npid=32431)
XXX *** removed ***
XXX ran 1 comparisons
XXX bgp_delete: pid=7263 (npid=32431)
XXX *** removed ***
XXX ran 1 comparisons
There it is: eventually it wraps all the way around, scores a hit, and deletes it. npid is still massive, but now it's starting to find the IDs at the beginning of the list, so it can bail out of the loop quickly instead of hanging around to burn CPU time.
Of course, there's a catch. This was happening on a really quiet box with almost nothing else going on. I only powered it up to do the digging for this post and very little else runs on it. Let me show you what happens when there are other processes coming and going, so the PID numbers aren't all in a neat line, and it won't manage to get lucky by finding them "up front" after it wraps.
This is what happens with a bit of process churn happening:
XXX bgp_delete: pid=18736 (npid=23404)
XXX *** removed ***
XXX ran 7632 comparisons
XXX bgp_delete: pid=18738 (npid=23404)
XXX *** removed ***
XXX ran 7632 comparisons
XXX bgp_delete: pid=18740 (npid=23404)
XXX ran 23404 comparisons
XXX bgp_delete: pid=18743 (npid=23405)
XXX *** removed ***
XXX ran 855 comparisons
XXX bgp_delete: pid=18745 (npid=23405)
XXX *** removed ***
XXX ran 855 comparisons
XXX bgp_delete: pid=18746 (npid=23405)
XXX *** removed ***
XXX ran 7633 comparisons
XXX bgp_delete: pid=18748 (npid=23405)
XXX *** removed ***
XXX ran 7633 comparisons
XXX bgp_delete: pid=18751 (npid=23405)
XXX *** removed ***
XXX ran 857 comparisons
XXX bgp_delete: pid=18752 (npid=23405)
XXX *** removed ***
Uh oh. Now things are all over the place and we're running bunches of comparisons before finding our match. Since other things are running on the box, we usually don't end up with n, then n+1, then n+2 because something else gets those instead. The number we get isn't at the beginning of the list, so the shell has to dive deep to find it.
Also, notably, this happens well before maxing out "npid":
XXX bgp_delete: pid=30379 (npid=21034)
XXX ran 21034 comparisons
XXX bgp_delete: pid=30381 (npid=21035)
XXX ran 21035 comparisons
In other words, on a normal machine, you will hit this much sooner, and it probably won't ever drop back down to relatively little CPU utilization. Also, if you run multiple long-lived looping scripts like this, they will all interact with each other, and will collectively make themselves much slower!
Now let's throw a nice wrench into the mix here and make things even more interesting. For a lot of places running large machines with many hundreds of processes, each with hundreds of threads, you quickly realize that even 32768 pids just isn't enough.
Incidentally, when you DO run out of pids, fork fails. Check those return codes, folks.
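In case it's not obvious what "check those return codes" means in practice, here's the minimal version: fork() hands back -1 and sets errno (typically EAGAIN when you're out of processes or pids), and it's entirely on you to notice.

/* Minimal fork-with-actual-error-checking, since so many programs skip it. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int
main (void)
{
  pid_t pid = fork ();

  if (pid == -1)
    {
      /* No child was created: out of pids, over RLIMIT_NPROC, out of memory... */
      fprintf (stderr, "fork failed: %s\n", strerror (errno));
      exit (1);
    }

  if (pid == 0)
    {
      /* child */
      _exit (0);
    }

  /* parent: reap the child so it doesn't linger as a zombie */
  waitpid (pid, NULL, 0);
  return 0;
}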
It's time to raise the limit from 15 bits (32768) to 22 bits (4194304) and run it again with the box in a relatively quiet state:
$ /sbin/sysctl kernel.pid_max
kernel.pid_max = 4194304
$ ulimit -u
65000
$ ./bash ../loopy

... wait a bit ...

XXX bgp_delete: pid=77782 (npid=65000)
XXX ran 65000 comparisons
Check that out. Now we won't even have a chance of finding a match (and thus ending the loop sooner) until that PID number wraps. Given that it's now something like 4.2 million before that happens, that'll be a while. Even then, we're still subject to luck and how busy the machine may be. At this point, the odds are good that we'll be doing 65000 comparisons every single time.
As mentioned before, this was fixed a long time ago.
I wonder just how much global CPU time was clawed back by that one fix.