Tuesday, July 12, 2011

Investigating gcov crashes after fork() on OS X

I've been working on adding more test coverage to some code. One of the newer libraries calls fork() and execv() to run some external programs. Imagine my surprise when I tried to run it in coverage mode and it crashed with "Abort trap". I did a lot of digging to figure out just what was going on. This is my tale.

My original program had a lot of stuff going on. It had a whole bunch of test cases and other crazy things happening. Any of those bits of my code or the third-party testing framework library could have been responsible. They all had to go. I reduced it down to a single .cc file containing one function that would fork() and execv() something. This reproduced the problem nicely, and it meant all of that testing stuff was not to blame.
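For illustration, a reduced file of that shape could look something like the sketch below. This is not the original code: the helper name run_external() and its return-value conventions are made up for this example, and it is written as plain C rather than C++. Per the story above, a build like this with --coverage is where the "Abort trap" showed up.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from the original program): run an external
 * program and return its exit status, or -1 if it could not be forked,
 * could not be waited for, or did not exit normally. */
int run_external(const char *path, char *const argv[]) {
  assert(path != NULL && argv != NULL);

  pid_t pid = fork();
  if (pid < 0) {
    return -1;           /* fork() itself failed */
  }
  if (pid == 0) {
    execv(path, argv);   /* only returns if the exec failed */
    _exit(127);          /* conventional "could not exec" status */
  }

  /* Parent: wait for the child and report how it exited. */
  int status = 0;
  if (waitpid(pid, &status, 0) < 0) {
    return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass something like "/bin/sh" plus a NULL-terminated argument array and check the returned status.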

After a bunch of runs through valgrind, gdb, and dtruss, I realized that it was just fork() which was blowing up. I could throw away all of that execv() gunk. Great! My reproduction case shrank again. I kept banging on it. Finally, I got it down to this:

$ echo "int main() { return fork(); }" > fork.c
$ gcc --coverage -o fork fork.c
$ ./fork
Abort trap
$ gcc -o fork fork.c
$ ./fork
$ 

Yeah, now we're talking. One syscall and it all goes down in flames. Now I knew exactly what to blame: the intersection of the libgcov code and fork(). It wasn't anything else. The exact call trace implicated something Apple added in Snow Leopard for faster shutdowns: there was a "_vproc_transaction_end" right before the call to abort().

I went further and found the source code for libvproc.c online. It lists a bunch of functions which are called by stuff all over the system, including Apple's version of libgcov. It also showed me where things were crashing. I decided to add a call to _vproc_transaction_count() in my code both before and after the fork. It didn't look good.

$ cat fork2.c
#include <stdio.h>
#include <unistd.h>
#include <vproc.h>
 
int main() { 
  printf("pre-fork count: %d\n", _vproc_transaction_count());
 
  fork();
 
  printf("post-fork count: %d\n", _vproc_transaction_count());
 
  return 0;
}
$ gcc --coverage -o fork2 fork2.c
$ ./fork2
pre-fork count: 1
post-fork count: 0
post-fork count: 0
Abort trap

So not only is the child winding up in some uninitialized state, but the parent is too...? That's messed up. I decided to throw caution to the wind and call vproc_transaction_begin() myself, the way gcov does, just to see what happened.

$ cat fork3.c
#include <stdio.h>
#include <unistd.h>
#include <vproc.h>
 
int main() { 
  printf("pre-fork count: %d\n", _vproc_transaction_count());
 
  fork();
  vproc_transaction_begin(0);
 
  printf("post-fork count: %d\n", _vproc_transaction_count());
 
  return 0;
}
$ gcc --coverage -o fork3 fork3.c
$ ./fork3
pre-fork count: 1
post-fork count: 1
post-fork count: 1
$ 

No crash! This is probably far from ideal, but I'll take it. It's enough for me to add a quick preprocessor hack to my code which makes that same call when running tests on Apple machines.

I've opened a bug with Apple. It's #9759049, but I don't think other people can see it, so that's probably of little use to anyone but me. For everyone else, enjoy the workaround.

#if defined(__APPLE__)
  if (testing_mode_) {
    vproc_transaction_begin(0);
  }
#endif

September 29, 2011: This post has an update.