Writing

Software, technology, sysadmin war stories, and more.

Monday, November 14, 2011

My failure to see the forest for the trees while bug hunting

I've heard the maxim that premature optimization is the root of all evil. The general idea is that you get some weird notion in your head about what might actually be going on and start trying to correct for it. Meanwhile, it's just a tiny little speck compared to the tornado of badness going on somewhere else. You would have noticed that tornado if only you didn't have your nose pressed to the speck.

I now think this can apply to bugs and troubleshooting. For the longest time, my scanning software has had issues related to staying up and running. I had it running in a hacked-up shell script just so it could loop back and start over again. It was just enough to get things working, or so I rationalized it to myself.

One of these hacks was having it call alarm() every couple of seconds. This gave a really horrible watchdog effect so that if something got wedged, it would just die and the shell script would restart it. Some of the docs for the libraries I was using suggested that "the PLL just locks up from time to time" and sure enough, that was what I was blaming.
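To make the trick concrete, here is a minimal Python sketch of an alarm()-style watchdog. It is not the original code: the post doesn't say what language the scanner was written in, and the 5-second timeout and the names pet_watchdog, cancel_watchdog, and run are made up for illustration. It relies on the Unix-only signal.alarm() and on the default SIGALRM action, which terminates the process.

```python
import signal

# Hypothetical timeout; the post only says "every couple of seconds".
WATCHDOG_SECONDS = 5


def pet_watchdog(seconds=WATCHDOG_SECONDS):
    """(Re)arm the alarm. Returns the seconds left on the previous alarm, or 0."""
    return signal.alarm(seconds)


def cancel_watchdog():
    """Disarm the alarm. Returns the seconds that were left on it, or 0."""
    return signal.alarm(0)


def run(work_step, iterations, seconds=WATCHDOG_SECONDS):
    """Call work_step repeatedly, re-arming the alarm before each call.

    No SIGALRM handler is installed, so if work_step wedges for longer than
    `seconds`, the default action kills the process and an outer restart loop
    (the shell script, in the post) can start it again.
    """
    try:
        for _ in range(iterations):
            pet_watchdog(seconds)
            work_step()
    finally:
        cancel_watchdog()
```

As long as each step finishes in time, the alarm keeps getting pushed back and never fires. That is also why it hides the real bug: the process dies and comes back, but nothing records why it got stuck.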

I did all kinds of stupid stuff trying to figure out what could be wrong with that PLL. Finally, after far too much fruitless poking and prodding at the code, I decided to start over. It turned out that there's a fundamental problem with the way GNU Radio uses Boost's locking and mutex support that goes pretty deep into Boost itself. Any time you reconfigure a flow graph, you run a pretty good chance of tripping it.

I finally decided to redesign my program to never reconfigure a flow graph once it gets up and going. Instead, I just shut down the idle parts of the graph with knobs which don't force it to lock and unlock. This sucks a little more CPU power and isn't exactly pretty to look at, but it does work.
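The shape of that redesign can be sketched in plain Python. This is a toy model, not GNU Radio code: GatedBranch, FixedGraph, and set_enabled are hypothetical names, and no real GNU Radio blocks or calls are used. The point is only that the wiring is built once and each branch has a flag that can be flipped at runtime.

```python
class GatedBranch:
    """One branch of a fixed pipeline (hypothetical stand-in, not a GNU Radio block)."""

    def __init__(self, name, process):
        self.name = name
        self.process = process
        self.enabled = True  # the "knob": flipped at runtime, no rewiring needed

    def set_enabled(self, enabled):
        self.enabled = bool(enabled)

    def work(self, samples):
        # An idle branch is still called, which is where the extra CPU goes,
        # but it produces no output.
        if not self.enabled:
            return []
        return [self.process(s) for s in samples]


class FixedGraph:
    """Branches are wired up once in the constructor and never reconnected."""

    def __init__(self, branches):
        self._branches = {b.name: b for b in branches}

    def set_enabled(self, name, enabled):
        self._branches[name].set_enabled(enabled)

    def run(self, samples):
        # Every branch sees every batch; the flags decide what comes out.
        return {name: b.work(samples) for name, b in self._branches.items()}
```

For example, with branches "a" (doubling) and "b" (adding one), calling set_enabled("b", False) makes run([1, 2]) return {"a": [2, 4], "b": []} without touching the wiring. The tradeoff is the one described above: idle branches still burn some cycles.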

Now, the amazing thing is what happened to the reliability. Where it used to lock up after 20-30 minutes, it now keeps running for as long as I leave it alone. So far that has only been a day or two at a stretch, because pushing new code forces a restart.

Sure, there's still some fundamental issue deep down in the system, and some day I may get around to figuring it out, but for now, I have a stable program again. The PLL had nothing to do with it. I shouldn't have put that much trust in the documentation of these libraries.