Welcome to 2014, in which RPM still gets stuck in futex
I get questions now and then: what do I do when I'm not writing? These days, it could be any number of things. Some of them can be described in fairly general terms and others I prefer to leave alone for now.
One technical issue was that of wrangling RPM as described back in December. I got through all of that and got my tools working to clean up whatever messes might come along. That much is basically done.
Along the way, I've run into yet another class of problem: rpm and yum processes which do some rather broken things in how they handle futexes. It always starts as the same thing: someone has a box which is supposed to install or upgrade a package by itself, and it hasn't. They take a look and notice that yum or rpm or whatever can't get any work done. Sooner or later, it comes to me.
A pattern became evident with these systems: they all had one or more RPM-related processes sitting in "futex_wait_queue_me". Those which weren't in that particular function were sitting in a call to fcntl() or similar, trying to get a lock. Naturally, I turned to strace to see just what they were up to, and that's when things got weird.
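For anyone who wants to spot the same pattern, here is a rough sketch of one way to look for it. It is not the tool I actually run. It assumes a Linux /proc where each process exposes "comm" and "wchan" files, and it assumes that matching on the command names "rpm" and "yum" is a good enough stand-in for "RPM-related". The function names and the proc_root parameter are made up for illustration, and other kernels may report a different symbol than the one shown here.

```python
import os

# Kernel function named in the post; other kernel versions may call it
# something else, so treat this constant as an assumption.
STUCK_WCHAN = "futex_wait_queue_me"


def read_proc_file(pid, name, proc_root="/proc"):
    """Return the stripped contents of <proc_root>/<pid>/<name>, or None."""
    try:
        with open(os.path.join(proc_root, str(pid), name)) as f:
            return f.read().strip()
    except OSError:
        # The process may have exited, or we may lack permission.
        return None


def read_comm(pid, proc_root="/proc"):
    """Short command name of the process (hypothetical helper)."""
    return read_proc_file(pid, "comm", proc_root)


def read_wchan(pid, proc_root="/proc"):
    """Kernel wait channel the process is sleeping in (hypothetical helper)."""
    return read_proc_file(pid, "wchan", proc_root)


def find_futex_waiters(names=("rpm", "yum"), proc_root="/proc"):
    """List pids whose command is in `names` and which sleep in STUCK_WCHAN.

    Matching on command name is an assumption; the post only says the
    processes were RPM-related.
    """
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        pid = int(entry)
        if read_comm(pid, proc_root) not in names:
            continue
        if read_wchan(pid, proc_root) == STUCK_WCHAN:
            stuck.append(pid)
    return sorted(stuck)
```

Calling find_futex_waiters() with no arguments scans the real /proc. The proc_root parameter only exists so the logic can be tried against a fake directory tree.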
Attaching to the fcntl-waiters didn't do anything, but attaching to the futex-waiters did, more often than not. They'd manage to finish the futex call and would get going on their work. Then they'd exit. This would then release the locks, and everything else would wake up and go to work, too. A few seconds later, it was all back to normal as if nothing had ever broken.
What a nice heisenbug! Go to investigate it, and it changes... and disappears, too. How fun.
Like the prior situation, I had two major ways to pursue this problem: I could go digging into RPM and the Berkeley DB stuff and maybe even glibc and beyond yet again, or I could come up with a fix first and then maybe come back to the problem. That's about where I am now: I wrote a fix. I might go back and dig into this crusty old RHEL-variant version some day, but the value of that is already low and continues to drop.
So, what's the workaround? It's pretty ugly, but it does let the machines keep doing their jobs without sending humans in. It's also a "light touch" and does as little damage as possible.
First, look for processes which are touching the RPM database and are stuck in this state. Then attach to them just like strace would -- yep, that means ptrace(). Then detach, and give them a poke if necessary to get them running again. Wait a few milliseconds and look again. The stuck process will probably have disappeared all by itself, having run to completion. If another one is there, then poke it, too. Run a few passes just to be sure.
If this fails, then okay, go nuclear and SIGKILL the stuck ones.
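The passes and the fallback can be sketched roughly like this. Again, this is an illustration under assumptions, not my actual tool. It calls ptrace() through ctypes using the Linux request numbers for PTRACE_ATTACH and PTRACE_DETACH, it needs enough privilege to trace the target, and it guesses that the "poke" is a SIGCONT. The names ptrace_poke and unstick, and the passes and pause defaults, are invented for this example. The find_stuck argument is any callable that returns the currently stuck pids.

```python
import ctypes
import os
import signal
import time

# Linux ptrace request numbers (assumed; check sys/ptrace.h on your system).
PTRACE_ATTACH = 16
PTRACE_DETACH = 17


def ptrace_poke(pid):
    """Attach to pid the way strace would, then detach right away."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.ptrace.restype = ctypes.c_long
    libc.ptrace.argtypes = [ctypes.c_long, ctypes.c_long,
                            ctypes.c_void_p, ctypes.c_void_p]
    if libc.ptrace(PTRACE_ATTACH, pid, None, None) != 0:
        return False  # no permission, or the process is already gone
    try:
        os.waitpid(pid, 0)  # wait for the attach-induced stop
    except OSError:
        pass
    libc.ptrace(PTRACE_DETACH, pid, None, None)
    try:
        os.kill(pid, signal.SIGCONT)  # the "poke"; SIGCONT is a guess
    except OSError:
        pass
    return True


def unstick(find_stuck, poke=ptrace_poke, kill=os.kill, passes=3, pause=0.05):
    """Poke stuck processes for a few passes, then SIGKILL any leftovers.

    Returns (poked, killed): every pid that was poked, in order, and the
    pids that had to be killed at the end.
    """
    poked = []
    for _ in range(passes):
        stuck = find_stuck()
        if not stuck:
            return poked, []  # everything ran to completion on its own
        for pid in stuck:
            poke(pid)
            poked.append(pid)
        time.sleep(pause)  # give them a moment to finish and exit
    leftovers = list(find_stuck())
    for pid in leftovers:
        kill(pid, signal.SIGKILL)  # last resort only
    return poked, leftovers
```

The poke and kill steps are passed in as parameters so the loop can be exercised with fakes before it is ever pointed at a real machine.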
"Look for stuck workers and SIGKILL 'em" is what everyone has done up to this point. It has the potential to make things worse, since a process killed partway through its work doesn't get a chance to clean up after itself.
I found a happy middle ground. It's a hack, but it does work.
Pragmatism is weird that way.