Software, technology, sysadmin war stories, and more.
Sunday, January 13, 2013

Unintended consequences of daemonizing C++ code

I created an interesting problem for myself the other day while working on some C++ code for a client. It was a direct result of unintended consequences after completing a feature request, and it also had a touch of "heisenbug" type behavior for extra fun.

As part of this project, I've written something which talks to a third-party persistent storage backend. The backend has a client library which is just an ordinary C library and header file. I have a class of my own which links against that in order to read and store things.

One of the feature requests which came down the pipe was to make this program turn itself into a daemon when it was run. This isn't a big deal, and I've done it plenty of times. I just wrote the usual "call fork, parent exits, child closes stdin/out/err and becomes a session leader before carrying on" code. That part worked just fine.

I also added a testing switch to make it run in the foreground during development work. It's a lot easier to test things on the real binary when you don't have to chase down something which has escaped into the background. It also means you have stderr right there for logging purposes.

This was all okay, but then something happened when I went to stand up a test instance of this server. It started up and went into the background just fine, but then things started misbehaving. I'd poke this server with a request which needed to hit the backend storage, and it would fail. The third-party library said something like "server closed the connection". That made no sense.

This library used either a Unix domain or TCP socket to talk to the storage server. I started up strace and watched it send the request along that socket and get back EPIPE, aka "broken pipe". Something was going on which made that socket go bad, but what?

Further confounding matters was the fact that running the same code and skipping the call to drop into the background would let it work just fine. I could watch it with strace in test mode and it would send the same data over the socket and it would work just fine. WTF?

I got the idea to strace the program from the beginning so I could follow the actions of both parent and child as it entered the background. Something weird was happening which I couldn't see when I attached to the already-running background process, and watching from the beginning was the only way to see it.

That's when I noticed that the parent was explicitly closing its file descriptor to the server. It happened right after fork() returned. I thought "close-on-exec" for a moment, but that didn't make any sense since I wasn't calling exec. That line of reasoning wasn't directly useful, but it did get me thinking generally about "stuff that runs when things shut down", and that's when it hit me.


As previously mentioned, I have a class which wraps this third-party C library. It has a simple little constructor which does no real work, an Init() function which calls the library's "open connection" function, and a destructor which calls the library's "finish connection" function.

What was happening was now obvious: when the parent returns from fork() and calls exit(), the destructors for global and static objects run, my wrapper's included, and that calls the "finish connection" code in that client library.
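Here's a minimal sketch of that effect, not the real code. FakeConnection, g_conn and goodbye_sent() are made-up names, and a pipe stands in for the socket to the storage server. To make it observable from one program, the roles are flipped: the forked process is the one that leaves, via exit() or _exit(), and the original process watches for the goodbye byte.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>

// Hypothetical stand-in for the wrapper class: its destructor "finishes the
// connection" by writing one goodbye byte to its descriptor.
struct FakeConnection {
    int fd = -1;
    ~FakeConnection() {
        if (fd >= 0) {
            const char bye = 'Q';
            ssize_t ignored = write(fd, &bye, 1);
            (void)ignored;
        }
    }
};

// Static storage duration: exit() runs this destructor, _exit() does not.
static FakeConnection g_conn;

// Forks a process which leaves via exit() or _exit(), then reports whether
// the destructor's goodbye byte arrived.
// Returns 1 if it was seen, 0 if not, -1 on error.
int goodbye_sent(bool use_exit) {
    int p[2];
    if (pipe(p) != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[0]);
        g_conn.fd = p[1];           // "open the connection"
        if (use_exit) std::exit(0); // runs ~FakeConnection on the way out
        _exit(0);                   // skips destructors and atexit handlers
    }

    close(p[1]);
    char c = 0;
    ssize_t n = read(p[0], &c, 1);  // 0 means EOF: nothing was ever written
    close(p[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return n == 1 ? 1 : (n == 0 ? 0 : -1);
}
```

Under those assumptions, goodbye_sent(true) sees the byte and goodbye_sent(false) does not.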

Now, granted, the child process has its own address space courtesy of copy-on-write behavior, but there was only ever one connection open to that storage server. The parent's and child's file descriptors both referred to the same socket. When the parent shut down, it wrote something down that fd which said "I'm done here", and the server disconnected it. The child had no idea it was now sitting on top of a useless file descriptor where the far end had disconnected. As a result, when it did its next write(), that failed with the "broken pipe" error.
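The shared-socket part can be sketched in isolation too. This assumes Linux (it uses MSG_NOSIGNAL to get an error back instead of a SIGPIPE), and everything in it is invented for illustration: a socketpair plays the connection, the "QUIT" message plays whatever the real library sends, and write_after_goodbye() is a made-up name. Again the forked process plays the one that exits, so the surviving process can report what it saw.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>

// Returns the errno from the survivor's send() after the other process has
// said goodbye on the shared socket, 0 if the send worked, -1 on setup error.
int write_after_goodbye() {
    int sv[2];  // sv[0]: client end, shared across fork(); sv[1]: fake server
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        // The process that exits: it says goodbye on the shared descriptor,
        // just like a destructor calling "finish connection" would.
        close(sv[1]);
        ssize_t n = send(sv[0], "QUIT", 4, MSG_NOSIGNAL);
        _exit(n == 4 ? 0 : 1);
    }

    int status = 0;
    waitpid(pid, &status, 0);

    // Fake server: read the goodbye, then hang up.
    char buf[4];
    ssize_t got = recv(sv[1], buf, sizeof buf, 0);
    close(sv[1]);

    // The survivor still holds a perfectly valid-looking descriptor, but the
    // far end is gone, so the next request fails.
    int err = 0;
    if (send(sv[0], "GET", 3, MSG_NOSIGNAL) < 0) err = errno;
    close(sv[0]);
    return got == 4 ? err : -1;
}
```

On Linux this should come back as EPIPE, the same "broken pipe" strace showed.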

I did a couple of things about this. First, I decided that calling destructors when the parent exits is inappropriate for this particular program and switched over to using _exit() instead. Next, I did some digging around in that library's API and found a way to detect failed connections and a way to request a retry.
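For reference, the daemonizing step with that change looks roughly like this. It's a sketch rather than the actual code: daemonize() and its stay_in_foreground flag are names made up here (the flag stands in for the testing switch), and it points stdin/stdout/stderr at /dev/null instead of plainly closing them, which is a common variation and not necessarily what the original does.

```cpp
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Returns 0 in the process which should carry on doing the work, or -1 if
// fork() or setsid() fails. When it actually backgrounds, the original
// parent never returns from this function.
int daemonize(bool stay_in_foreground) {
    if (stay_in_foreground) return 0;  // test mode: keep the terminal and stderr

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid > 0) {
        // Parent: leave without running destructors or atexit handlers, so
        // nothing says goodbye on connections the child still needs.
        _exit(0);
    }

    // Child: become a session leader, detached from the controlling terminal.
    if (setsid() < 0) return -1;

    // Redirect the standard descriptors so stray reads and writes are harmless.
    int devnull = open("/dev/null", O_RDWR);
    if (devnull >= 0) {
        dup2(devnull, STDIN_FILENO);
        dup2(devnull, STDOUT_FILENO);
        dup2(devnull, STDERR_FILENO);
        if (devnull > STDERR_FILENO) close(devnull);
    }
    return 0;
}
```

One thing to keep in mind with _exit(): it also skips flushing stdio buffers, so anything the parent must print needs an explicit flush before the fork.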

With those changes, I don't break my server connection, and I'm also resilient to other things which might cause it to go down. If that happens for whatever reason, it will bring the link back up.
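The retry side depends entirely on what the library offers, so this is only the general shape, with hypothetical names throughout. with_reconnect(), op and reconnect are inventions for this sketch: op stands for "do one request and report whether it worked", and reconnect stands for whatever the library's real retry call is.

```cpp
#include <functional>

// Hypothetical helper: try the operation, and if it fails, ask for a
// reconnect and try again, up to max_attempts tries in total.
// Returns true as soon as one attempt succeeds.
bool with_reconnect(const std::function<bool()>& op,
                    const std::function<bool()>& reconnect,
                    int max_attempts = 2) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (op()) return true;
        // Only bother reconnecting if another attempt is still coming.
        if (attempt + 1 < max_attempts && !reconnect()) return false;
    }
    return false;
}
```

Capping the attempts matters here: if the backend is really down, the caller gets a failure back instead of spinning forever.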

If you have a C++ program which works fine in the foreground but always fails its first external communication when it's allowed to background itself, something like this might just be happening in there.