Software, technology, sysadmin war stories, and more. Feed
Sunday, October 16, 2011

If killall is the answer, you're asking the wrong question

Have you ever noticed that when you see one thing that's cosmically bad in a program, it's usually not alone? It's the broken window theory: once someone smashes through and nobody else cleans it up, it's open season for anyone else to do the same. This is a story about one of many such cases I found, and this one involves the "killall" command.

So there was a daemon which ran on a bunch of machines. Its job was to act as a software installing (or removing) agent. It just sat there and kept track of what it had installed and what its "boss" server thought it should have installed. When there was a difference, it started applying changes and kept going until there was nothing left.

There was a catch, though: this thing was slow. It was intentionally made that way so that it wouldn't perturb the machine much, or at least, that was the official explanation. The problem is, of course, if you wanted to load something on and get going right away, slow is not what you want.

Someone decided to provide a way to pull out the limiters and make it sync 100% right then and there. That by itself was uninteresting. What made it bizarre is how they implemented it.

Remember this thing was a daemon. It just sat there waiting for commands and other stuff on a network socket. So, they just added an endpoint where you could poke it and it would run the fast syncer, right?


Oh no. You had to log in via ssh and then run a command. Let's call it "sync-now". That particular command would then start up and kill the existing daemon. It would run "killall sync-daemon", then it would pause to let it die.

Oh, it gets better. Next, it would restart the "sync-daemon" process with different flags. These flags basically took the limiters out of the main loop and let it run as fast as it wanted to run. Meanwhile, "sync-now" would start polling the "sync-daemon" via its network connection to see what it was doing. It would keep doing this until the daemon went idle again.

Once it got to this point, it would then use killall on the daemon yet again, wait for it to die, then it would restart it one more time with "normal" (rate-limited) flags. Then, finally, "sync-now" would die.

So that's pretty ridiculous already, but the rabbit hole goes deeper. If you got curious and went looking, it turned out that "sync-now" was just a symlink to "sync-daemon". Yep, it was the same binary for both jobs. It actually looked at argv[0] when starting to figure out which "personality" to become.

Putting this all together, it means you had a single program which looked at its own name and decided to behave differently. It then went out and shot the instance of its other personality in the head and stood up yet another instance of that personality, but with its own special values. Then it let that one do some dirty work, then it killed that mercilessly and swapped the original configuration back in.

This isn't software. This is like a bad psychological thriller novel.

There's a final bit of craziness here. Let's say you had some automated process which wanted to set up a bunch of machines and then run some tests on them. You might write a simple loop which connected to the machines and ran "sync-now".

Odds are your loop would come back around before it finished syncing. Let's say you had a naive implementation which would then connect back to the boxes and say "hey you, run sync-now again". That's when your life got really interesting.

Remember that "sync-now" works by shooting the "sync-daemon" in the head with killall. It'll do that to any instance of it, including the one which gets started up temporarily to do a full-speed sync. So, when you logged in again and ran it again, you broke the first one which was still going. It would just sit there and flail around while the second one started doing its thing.

This would just keep happening over and over. If the syncer didn't make any progress in the interval between calls, you would never finish. It would keep pulling the rug out from under itself over and over and over.

I won't even bother describing how evil the code looked. All you need to know is that it looked at argv[0] and went this way or that way in a big if-then-else type construct. Test coverage was low or nil.

Naturally, you know someone got a nice promotion for writing that thing.