Software, technology, sysadmin war stories, and more. Feed
Friday, July 8, 2011

Treat your children (processes) well

I've seen some pretty messed up excuses for code. These can usually be explained by inexperience or ignorance, but sometimes it seems to be willfully malicious. Other times, there's just plain genuine cluelessness at work, and the purpose of a code review is to keep that stuff out of the tree so that nobody else gets stuck dealing with it.

One time, there was a program which was being used to run some kind of benchmark. That is, it was a fairly large compiled chunk of C++ and other stuff and was being kicked off by a shell script, and that shell script was being kicked off by a Python program. So, granted, at this point, we're already pretty deep into the realm of all things evil, but someone actually found a way to make it worse.

I had noticed a problem with our testing machine. It seems it had been running out of disk space. Also, it was getting bogged down, and was starting to thrash a good bit, since something was tying up memory and CPU time. A glance showed my troublemaker. It was that benchmarking program. For some reason, it was getting wedged, and it kept a whole bunch of huge log files open, so our usual log rotator thing wasn't able to reclaim that space.

This particular test had an owner, so I filed a bug with him: please make this thing not do this. He didn't seem to know how. I said, look, just make sure you shut down anything you've started when your wrapper script exits. Leave things the way you found them. He didn't know how to do that, either. He wound up booking an appointment with me in person to talk about it.

So, we chatted. I told him that anything you start via fork() or similar on these systems is going to give you a process ID. You can just hang onto it and kill it later. If your original process doesn't do anything weird like forking again (and becoming a daemon), odds are that will do what we want. If it does, then you might be able to get something going with process groups. Yes, I had to explain fork, kill, setpgrp, setsid, and all of that other stuff to someone who supposedly was employed as a systems tester, but I digress.

I got an update a couple of days later. He had come up with a design. It amounted to "list all of the processes on the system, filter it to look for things named benchmark_test, then kill them". In other words, in command-line speak, he added a script to do this:

ps aux | grep benchmark_test | grep -v grep | xargs kill -9

Now would be a good time for me to mention that this machine frequently ran multiple instances of the same test. It also ran a whole bunch of stuff as that same user account, with no telling what it was called. At best, this script would kill itself. The first failure mode is that it would kill any other instance of itself, including a completely unrelated test running on other target systems. The second failure mode is that it would also hit anything else which happened to have "benchmark_test" somewhere in its name.

Despite being shown how to do this stuff, he chose to ignore me and went back to his old tried and true scripting monkey approach. I said that was unacceptable. He went off and did other things. Months passed. Really, it was something like four months later when this came back up. In the mean time, we still had to occasionally clean up our system by hand to get rid of this cruft.

When this code review resurfaced, it happened rather strangely. He just posted a comment to the thread asking if it was okay now. I figured, okay, awesome, he must have finally invested the effort required to understand this and do it correctly. I looked at the code. It hadn't changed. Not only had it not changed in the past week or two, it was still the exact code from four months before when he first sent me that horrible brute-force process killing approach.

Given that his comment was "is this okay now", my answer was just: "no". Having already invested quite a bit of time trying to teach this person how to do his own job, I was no longer interested in elaborating. He didn't like this response at all. Within a couple of minutes, I had mails from both my boss and his boss saying how I need to be more constructive and all of this.

Oh. You did not do that. At this point, it was game time. I broke out all of our old communications, including the parts where I wrote a bunch of small programs to show him how things like fork() and kill() really work. I mentioned all of the back and forth and the time I took to explain all of this in person, and backed it up with the example programs. The managers in question backed off.

A couple of weeks later, he disappeared for some other part of the company. I guess he found a place where those kinds of shenanigans are welcomed with open arms.

Side note: it would have been far better if the program never got wedged, and we didn't have all of this wrapper stupidity going on, but suggesting a fix of the root cause to him would have been like showing a dog a card trick. I knew that much would fail. I just never thought something so trivial could go so wrong.