Writing

Software, technology, sysadmin war stories, and more.

Saturday, August 16, 2014

Multithreaded forking and environment access locks

Back in 2011, I wrote that you shouldn't mix forks and threads. That particular story was about dealing with Python, but don't think that you're immune just because you use C++ or even "nice simple C". You aren't. If you use certain parts of your C library after "fork" but before "exec", you run the risk of getting stuck forever.

Code speaks volumes, so let's start with a little something in C:

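#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/wait.h>

/* Background thread: hammer on setenv() so that the environment lock
   is held as often as possible. */
static void* worker(void* arg) {
  (void) arg;
  pthread_detach(pthread_self());

  for (;;) {
    setenv("foo", "bar", 1);
    usleep(100);
  }

  return NULL;
}

/* Only reached if the child is still alive a second after alarm(1). */
static void sigalrm(int sig) {
  char a = 'a';
  (void) sig;
  write(STDERR_FILENO, &a, 1);
}

int main(void) {
  pthread_t setenv_thread;
  pthread_create(&setenv_thread, NULL, worker, 0);

  for (;;) {
    pid_t pid = fork();

    if (pid == 0) {
      /* Child: arm the one-second alarm, then touch the environment. */
      signal(SIGALRM, sigalrm);
      alarm(1);

      unsetenv("bar");  /* blocks forever if the lock was copied held */
      exit(0);
    }

    /* Parent: reap a finished child if there is one, then go again. */
    wait3(NULL, WNOHANG, NULL);
    usleep(2500);
  }

  return 0;
}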

This dumb little program does just enough to demonstrate my case. It starts up a background worker thread that calls setenv() every 100 microseconds. It's purposely doing this a lot to make sure you can see the problem right away.

The main thread of this program carries on and forks off children, each of which tries to call unsetenv. Before it does that, each child sets up an alarm handler which will print an "a" after 1 second. In other words, if a full second elapses between that alarm() call and the exit(), the child prints "a". Every time a child gets stuck, you get another "a".

So, okay, let's run it.

$ gcc -Wall -o forkenv forkenv.c -lpthread
$ ./forkenv 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa^C
$ 

Check that out! Lots of alarms are firing off. We're getting stuck!

What's going on? Let's catch one of these stuck children in gdb and find out.

0x00007ff3bf0ea20e in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ff3bf0ea20e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007ff3bf02a6d3 in _L_lock_718 () from /lib64/libc.so.6
#2  0x00007ff3bf02a4aa in unsetenv () from /lib64/libc.so.6
#3  0x00000000004009fc in main ()

It's trying to get a lock in glibc within unsetenv(). This never succeeds, since we're in the child and no thread exists on this side of things to release that lock. We copied the lock in the "set" state, and there it will stay forever.
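
To see the "copied in the set state" part without depending on timing, here's a minimal sketch that swaps glibc's internal environment lock for an ordinary pthread mutex. The names demo_lock, holder and child_sees_held_lock are made up for this illustration. The child uses pthread_mutex_trylock so it can report what it found instead of hanging. Strictly speaking, that call isn't one POSIX promises will work after fork in a multithreaded process; it behaves on glibc, but treat that as an assumption.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the private lock glibc takes in setenv(). */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t lock_is_held;  /* posted once holder() owns demo_lock */
static sem_t may_release;   /* posted when holder() may let go again */

/* Takes the lock and keeps holding it until the parent says otherwise. */
static void *holder(void *arg) {
  (void) arg;
  pthread_mutex_lock(&demo_lock);
  sem_post(&lock_is_held);
  sem_wait(&may_release);
  pthread_mutex_unlock(&demo_lock);
  return NULL;
}

/* Forks while another thread holds demo_lock.
   Returns 1 if the child found the lock held, 0 if it was free,
   and -1 if something went wrong along the way. */
static int child_sees_held_lock(void) {
  pthread_t t;
  int status;
  int result = -1;

  if (sem_init(&lock_is_held, 0, 0) != 0) return -1;
  if (sem_init(&may_release, 0, 0) != 0) return -1;
  if (pthread_create(&t, NULL, holder, NULL) != 0) return -1;
  sem_wait(&lock_is_held);  /* the other thread owns the lock now */

  pid_t pid = fork();
  if (pid == 0) {
    /* Only the forking thread exists in the child, so nobody here
       can ever unlock the copy. A plain lock call would hang. */
    _exit(pthread_mutex_trylock(&demo_lock) == EBUSY ? 1 : 0);
  }

  if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
    result = WEXITSTATUS(status);

  /* Let the holder thread finish and clean up in the parent. */
  sem_post(&may_release);
  pthread_join(t, NULL);
  sem_destroy(&lock_is_held);
  sem_destroy(&may_release);
  return result;
}
```

Call child_sees_held_lock() from a main() of your own (build with -pthread) and it should come back with 1: the child's copy of the mutex is locked, and the thread that locked it doesn't exist over there.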

All you have to do to trigger the hang is make a copy of your process (with fork) while setenv or unsetenv is running in another thread. If you then try to use one of those functions in your child process, it will hang.

You can actually get away with this for a very long time if you get lucky. But, sooner or later, it will happen. If you see your process blocked in "__lll_lock_wait_private" with unsetenv or setenv listed earlier in the stack, you've probably done this, and it just decided now was the right time to pop up and make trouble for you.

Play with the values for those usleep calls to see what I mean. Lower numbers mean less time spent sleeping and more time spent with a lock potentially held. Lower them both to 1 and you'll have a monster on your hands. Try it and see.
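
If all the child wants is a different environment for whatever it's about to exec, one way to stay out of this mess is to never touch the environment between fork and exec at all. This is only a sketch under that assumption, and run_with_env is a made-up helper name: the caller builds the environment array up front, and the child does nothing but execve and _exit, both of which POSIX lists as async-signal-safe.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with a caller-supplied environment.
   Returns the child's exit status, or -1 if it could not be obtained. */
static int run_with_env(char *const argv[], char *const envp[]) {
  int status;
  pid_t pid = fork();

  if (pid < 0) return -1;

  if (pid == 0) {
    /* No setenv/unsetenv here, so no libc environment lock is taken. */
    execve(argv[0], argv, envp);
    _exit(127);  /* only reached if execve itself failed */
  }

  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Handing it an argv of {"/bin/sh", "-c", "test \"$foo\" = bar", NULL} and an envp of {"foo=bar", NULL} should give you the shell's exit status for that test. It won't help a child that really has to run more code before exec, but it covers the common "tweak a variable, then exec" case.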


January 30, 2017: This post has an update.