Software, technology, sysadmin war stories, and more. Feed
Tuesday, June 7, 2011

Don't mix threads and forks

Not too long ago, I worked on a project with a substantial Python code base which was used for running automated tests. It had quite a few foibles, oddities, and things which were just plain wrong. This is but one of them, something fairly deep and fundamental which created all sorts of fun when things got hairy. I came in fairly late in the game and had to work backwards to figure it out.

The code in question used ssh to do things on distant systems. It's clear that it had initially been designed to do little more than "shelling out to" the actual OpenSSH /usr/bin/ssh to do its dirty work over and over again, whenever something had to happen on the far end machine. It was a glorified shell script engine.

Well, at some point, a decision was made to do even more with ssh instead of using a locally-available RPC mechanism. This lead to demands like being able to multiplex commands over those ssh connections and recycling existing connections, all things that running that ssh binary through popen (or whatever) just would not do. The code was switched to use a SSH client library instead.

Nobody realized it right away, but this introduced a new failure mode where the processes would get stuck and would never make any more progress. Apparently some weak correlation between this new library and stuck processes was made at some point, and there may have even been an attempt at fixing it, but it persisted anyway.

This is about where I enter the picture. At some point, it became obvious to me that my test runs were failing due to something getting stuck. I'd start something up and it would get to a point where it's supposed to fork and just keep going and ... it would hang. The difference is that I dug into it.

What I found was troubling: my stuck python process was sitting in a futex() syscall which would never return. A bit of reading taught me this was just its way of waiting for a lock to clear, but there was nothing which could clear it. It had just forked, it was just one thread, and nobody else was accessing that memory. It would be there forever.

So I added logging to start figuring out what else was going on during all of this. Perversely, it actually got worse. The highly-instrumented code would freeze its children far more than the untouched stuff. This was a clue, however frustrating it was.

I wound up learning how to use gdb on a Python process -- yep, you can actually get meaningful stack trace data -- and found out that my newly forked process was getting stuck deep in some Python library logging call. That call acquired a mutex, which in this case was implemented with futex(). It would get stuck because someone else already had it.

But wait, how could someone else have the mutex when we just came from a fork() moments ago? This was a mystery briefly, until I started doing traces on the parent process with gdb and friends. That's what finally cracked the puzzle.

Back in the parent process, there were many threads running around doing things. Even though the actual testing suite itself never used threads, it had called upon that SSH library which did quite prolifically. I already knew threading and fork was a no-no, but had to prove it to these people some other way, since they obviously didn't know this principle already and would never take me at face value.

I wound up gathering instrumentation which showed one of the SSH library threads calling the Python log call while another thread called fork(). That fork did the usual copy-on-write thing, and so the child woke up with all of the parent's data structures, including whatever bit that said "someone has this lock". But, since there was no one to finish using it and release it on the child side, the first call to logging there would block forever.

This should have made a solid case for dumping SSH and that library altogether, but it didn't quite work out like that. Last I heard, they were still working on a way to safely allow fork() in a threaded process by running around and cleaning up mutexes in some kind of "at fork" handler.

I wish I was making that up.

August 16, 2014: This post has an update.