Writing

Feed Software, technology, sysadmin war stories, and more.

Friday, November 4, 2011

Abandon all hope, all ye who nest closures and callbacks

Have you ever heard of closures in a programming context? They seem neat, and like so many tools, they can fill a useful purpose. They serve as a way to bundle up a call to a function with some arguments. Think of it as a function pointer which can carry some or all of the arguments needed to run that target function.

I've seen them used to handle a few useful cases, but the thing I want to talk about here is what happens when it goes very, very wrong. What happens when someone builds an entire system out of this entire scheme? I lived with this for a while, and it was anything but fun.

For those who haven't encountered them yet, here are the basics. Let's say you're working in C++. It's possible to rig a closure system by using some templating magic. It basically scoops up a pointer to the instance of your object (if any) and then makes copies of your arguments. It might look like this:

void MyClass::StartSomething(int foo) {
  Closure* done = NewCallback(this, &MyClass::SomethingDone, foo);
 
  DoSomething(foo, done);
}
 
void MyClass::DoSomething(int foo, Closure* done) {
  done->Run(foo, lengthy_calculation() == foo);
}
 
void MyClass::SomethingDone(int foo, bool ok) {
  if (!ok) {
    fprintf(stderr, "It didn't work.\n");
    return;
  }
 
  fprintf(stderr, "It worked.\n");
  global_ok = true;
}

So let's look at this in a linear fashion. Somewhere, you create an instance of MyClass and then call the StartSomething() function with an int argument. Control passes to StartSomething(), then into DoSomething(). That then calls lengthy_calculation(), yielding a bool result depending on the return from that function.

Finally, DoSomething runs the Closure it was handed, and control jumps into SomethingDone() where it prints something to stderr depending on the value of that bool. SomethingDone() then returns to DoSomething() which returns to StartSomething(), and that returns to you.

OK, you think, that's not so bad. There's a little bit of misdirection and general loopiness, but you can follow it. Okay. Well, this was just the basics to establish how closures work in this model. Now I'll add in some of the details which were present in the evil system in question.

Imagine that lengthy_calculation() is, in fact, lengthy. It takes a while to do whatever it does, and maybe you don't want to sit there waiting on it to happen. What happens next is that someone whips out the threading library and starts making things run in parallel. Now the code looks more like this:

void MyClass::StartSomething(int foo) {
  Closure* done = NewCallback(this, &MyClass::SomethingDone, foo);
 
  RunInNewThread(DoSomething(foo, done));
}
 
void MyClass::DoSomething(int foo, Closure* done) {
  done->Run(foo, lengthy_calculation() == foo);
}
 
void MyClass::SomethingDone(int foo, bool ok) {
  if (!ok) {
    fprintf(stderr, "It didn't work.\n");
    return;
  }
 
  fprintf(stderr, "It worked.\n"); 
  global_ok = true;
}

I'm purposely blackboxing the "RunInNewThread" function here. Just know that it will run whatever it's given, and it happens in some other thread, so your call to it returns automatically and without blocking. Now let's look at the timeline.

You still call StartSomething() with an int. Then it calls RunInNewThread, and that returns right away, and then StartSomething returns to you. That's it.

Meanwhile, off in another thread, DoSomething is invoked. It calls lengthy_calculation() as before, gets that bool result, then calls SomethingDone which prints a message. SomethingDone returns to DoSomething, DoSomething returns to the new thread runner, and then the thread runner exits.

The parts of those last two paragraphs execute in parallel. You don't know exactly how they will work out due to CPU scheduling craziness, the phase of the moon, just how lengthy that calculation is, and so on. For all you know, SomethingDone might manage to run before control returns from StartSomething!

Making any assumptions about the ordering of events like this is just asking for trouble. If you somehow need one side to wait on results from the other, now you have to get into this whole synchronization mess which adds even more wonkiness to the picture.

Some of the unit tests for this software were written by people who either didn't know about this last requirement or flat out didn't care. They got things working for them and were happy with it. It passed code review and was checked in, and that was the last time they touched it. Naturally, years later, when the timing changed or some other thing started behaving differently, I tripped over it.

Their unit tests looked approximately like this:

  MyClass myc;
  myc.StartSomething(1234);
  usleep(10000);                   // Yes, really.
  
  if (!global_ok) {
    log("test failed");
  } else {
    log("test succeeded");
  }

That's right, they were actually starting it up, waiting an arbitrary amount of time, and then checking for the results. Any time that we got out of the usleep() before SomethingDone() finished, it would bomb.

This is what you call a flaky unit test. Any time you have sleeps and asynchronous behavior, nothing good can come of it.

There's more fun to this kind of design. Let's say you want to troubleshoot something. Whether you use a debugger or well-placed printfs, you probably debug things by building up the call stack in your head. A calls B calls C calls D, and it bombs just before it calls E. Then you work backwards to see where the bad value was introduced, right?

Okay, so let's look at what happens here. Remember that DoSomething, lengthy_calculation() and SomethingDone() all ran in another thread. Their entire call stack begins at RunInNewThread(). You can't tell which function called it. If you try to backtrace it, you get something like this:

#0   0x0123 in SomethingDone ()
#1   0x4567 in DoSomething ()
#2   0x89ab in thread_runner ()
#3   0xcdef in start_thread () from /lib64/libpthread.so.0
#4   0x1337 in clone () from /lib64/libc.so.6

Got that? You can only trace back as far as where the thread started somewhere in your system's C libraries. gdb's "bt" is of limited use to you in this world.

"But Rachel", you say, "I know that DoSomething() always gets called as a result of StartSomething() being called earlier, so if I see the one, I know the other one happened somewhere too". Well, maybe, maybe not.

What if there have been *two* calls to StartSomething() in the time you're trying to unwind? Which one of those calls lead to the particular instance of this "bottom half" thread which then blew up? What if it's 10, or 100? How do you tell them apart?

Short of plumbing some kind of request identifier all the way through your system and making sure it surfaces in places like log entries and gdb dumps, you can't tell. There are just too many moving pieces to keep track.

And wait, it gets better. Nothing says that StartSomething() is the only thing which can call DoSomething(). Any other part of your code which can see DoSomething() can create a Closure pointing to it and then hand it off to be run in another thread.

Maybe some clever maintenance programmer comes along later and creates something like this:

void MyClass::StartSomethingElse() {
  Closure* done = NewCallback(this, &MyClass::SomethingDone, 0xabcd);
 
  RunInNewThread(DoSomething(0xbcde, done));
}

Now you have two routes into DoSomething and SomethingDone! How do you know which one was used to get in there? Again, given just this, you don't know. This crazy new function actually went and hard-coded the values instead of using a variable, and they don't even match! Imagine your surprise when you find out that they're not the same when called this way.

The final horror I will tell in this tale is about nested Closures. Recall that you can create a closure and pass it arguments which then wind up going to the target function. Well, a closure is just a pointer, and a pointer can be an argument, so why not this?

void Func1(Closure* cb) {
  Closure* done = NewCallback(&Func3, cb);
  RunInNewThread(Func2(done));
}
 
void Func2(Closure* cb) {
  Closure* done = NewCallback(&Func5, cb);
 
  lengthy_calculation();
 
  RunInNewThread(Func4(done)):
}
 
void Func3(Closure* cb) {
  cb->Run(wrap_up_result());
}
 
void Func4(Closure* cb) {
  another_calculation() {
  cb->Run();
}
 
void Func5(Closure* cb) {
  did_lengthy_calc_ = true;
  cb->Run();
}

Are you dizzy yet? Quick, in which order do things run? Which ones run serially and which ones run in parallel? How many threads are used here? How many entry points are there to this cascade of calls and callbacks? What was this guy smoking when he designed this?

All of this and worse is currently running in production somewhere, babysitting services that you probably rely on.

This is what passes for "the finest in software engineering" today. Give me a break.

I'll pick this one up in a future post and talk about what you can do about it.