Sunday, December 15, 2013

Helping RPM stay afloat on big fleets of machines

I've been doing some work lately which touches a lot of Red Hat-derived machines. This is not a new pattern in my life -- they always tend to turn up in one place or another. I first encountered their "enterprise" flavor during my days as a web support monkey, and came to like it. When it comes to having a dedicated server somewhere which needs to basically keep working and not have massive changes sprung on you, they have that down cold.

This is not to say that it's perfect. There are things which tend to go wrong, particularly when you start talking about lots of people using lots of machines. Beyond a certain point, it becomes impossible to care for each and every one on an individual basis, and you have to start rigging up checkers and fixers.

These situations are not always straightforward. Consider that machines can and will reboot at any time. This could be due to the kernel flipping out, a wayward command run on the wrong machine, hardware errors, power problems, or anything else you can imagine. Now consider that this can happen during an update of the oh-so-important RPM database -- the thing which tracks all of your system's packages -- the OS itself, if you will.

Most of the time, nothing bad happens. You get lucky and it manages to finish up in a consistent state, or at least a trivially-recoverable one, and all of your stuff keeps working. All of those automated processes which check on the machine, install, remove, and upgrade your packages continue to work. If you have a small number of machines, this is probably your life: it just never happens.

So then, let's crank up the number of machines and suddenly those tiny little percentages of failures start yielding actual results. Your RPM database decides it's run out of lock slots and won't run. This stops your automatic upgrades, patches, and everything else involving packages. It might even start breaking other things depending on how your system management stuff works.

Let's say someone notices the "failing RPM" situation and they decide to do something about it. I imagine the result will be a shell script with the best of intentions. First, it'll try to do something with RPM to see if it works or not. If it gets an error, then it'll do an "rpm --rebuilddb" and exit.

It'll resemble this:

  rpm -q foo || rpm --rebuilddb

I imagine that will automatically resolve a few situations. It might even do its job without making life too terrible for other "tenants" of the machine. It'll go into cron, and it'll run regularly.
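
In cron form, the whole thing probably looks something like this -- the schedule and the choice of package to query are just placeholders, not anything canonical:

  # /etc/cron.d/rpm-checker (hypothetical)
  0 * * * *  root  rpm -q rpm >/dev/null 2>&1 || rpm --rebuilddb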

Then, one day, it'll stop helping and it'll start hurting. The RPM database will get into a state where db4 (the base library under librpm itself) decides to enter an infinite loop. This one is fun, since it's not making syscalls and it's not making library calls. That means both strace and ltrace show you absolutely nothing, and it'll just sit there burning CPU. Start another process, and it, too, will chew another core on your machine.

Every time that cron job starts up, it gets another core. Give it a couple of days, and soon you're in a world of hurt thanks to the loop from hell which will never end. You're bleeding machines, and something has to be done about it. It's time to start troubleshooting.

It takes gdb to show you that it's __env_alloc deep inside db4, complete with nested levels of C preprocessor gunk.
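
Getting there is nothing fancy: attach to one of the spinning processes and ask for a backtrace. Something like this does it, where the pid is whatever your stuck rpm happens to be:

  gdb -batch -p 12345 -ex 'bt'

Chase the top of that backtrace into the db4 source and you land here: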

for (elp = NULL;; ++q) {
	SH_TAILQ_FOREACH(elp_tmp, q, sizeq, __alloc_element) {
		STAT((++st_search));

There's the beginning of the loop which is burning CPU. What's a SH_TAILQ_FOREACH? Excellent question. It turns out to be this:

#define SH_TAILQ_FOREACH(var, head, field, type)       \
        for ((var) = SH_TAILQ_FIRST((head), type);     \
            (var) != NULL;                             \
            (var) = SH_TAILQ_NEXT((var), field, type))

Hey, more all-caps words! That means more macros!

#define SH_TAILQ_FIRST(head, type)                                  \
        (SH_TAILQ_EMPTY(head) ? NULL : SH_TAILQ_FIRSTP(head, type))

Oh look, two more. Let's chase the first one.

#define SH_TAILQ_EMPTY(head)       \
        ((head)->stqh_first == -1)

It's testing to see if the member of a struct (not a class -- this is plain old C) is -1. Now back up to SH_TAILQ_FIRST and notice another macro hanging out. What's that one?

#define SH_TAILQ_FIRSTP(head, type)                                 \
        ((struct type *)((u_int8_t *)(head) + (head)->stqh_first))

This one is taking the head pointer and adding something to it, then bouncing the result through two casts. One of those casts (u_int8_t *) is there so the addition works in plain bytes; the other comes from the macro call and turns the result back into a pointer to whatever struct type was named.

So jump back to SH_TAILQ_FIRST. Now mentally try to wedge both of those macros into that one, and don't forget about the ternary stuff going on -- see that question mark and colon? Yeah.

You're testing something's stqh_first to see if it's equal to -1, and if so, yielding NULL, otherwise, you're yielding the result of this doubly-cast addition on ... something else.
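
If you plug in the actual arguments from that loop (q for the head, __alloc_element for the type) and expand the whole mess by hand, the "first element" lookup comes out to roughly this:

	elp_tmp = ((q)->stqh_first == -1
	    ? NULL
	    : (struct __alloc_element *)((u_int8_t *)(q) + (q)->stqh_first));

In other words, stqh_first is a byte offset from the queue head, with -1 standing in for "empty" -- presumably so the same structure works no matter where the region gets mapped. That's why the pointer math has to bounce through u_int8_t * first.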

It just goes on and on like this. I won't spam you with the rest.

I realized I could keep going down the rabbit hole of how db4 worked, or I could start figuring out what to do about it. The "analysis" side looked like a bottomless pit and I wanted to start delivering results. As it turned out, the "recovery" side was simple enough: db4's own "db_recover" will put things back together without an exhaustive rebuild.
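
On a stock Red Hat box the database environment lives in the usual place, so the recovery itself is a one-liner (adjust the path if your rpmdb lives somewhere else):

  db_recover -h /var/lib/rpm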

Of course, this recovery needs to be kicked off somehow. Attempting an RPM database operation and seeing if it hangs is good enough, but how do you do that without getting stuck yourself? Well, you do, and you don't. That is, you have to fork off a child, and let that child make the attempt while sending updates back to the parent over a pipe. If it stops phoning home, you know you have a problem.

Then you just call db_recover and set about killing off the other stuck processes -- you did want to get back that CPU time, right?
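
Here's a rough sketch of that shape -- a fork/pipe watchdog, not the actual tool, with the path, the timeout, and the choice of test query all being assumptions you'd tune for your own fleet:

/* Minimal sketch of the watchdog described above.  A child attempts a
 * harmless RPM query and phones home over a pipe; if the parent hears
 * nothing before the timeout, it assumes the db4 environment is wedged,
 * kills the child, and runs db_recover. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RPMDB_HOME   "/var/lib/rpm"   /* stock location on Red Hat boxes */
#define TIMEOUT_SECS 60               /* arbitrary; tune for your fleet  */

int main(void)
{
	int pfd[2];
	pid_t child;
	fd_set rfds;
	struct timeval tv;
	char buf;

	if (pipe(pfd) != 0 || (child = fork()) < 0) {
		perror("pipe/fork");
		return 1;
	}

	if (child == 0) {
		/* Child: make the attempt.  If the db4 loop bites, this never
		 * returns and the "ok" byte never shows up on the pipe. */
		close(pfd[0]);
		if (system("rpm -q rpm >/dev/null 2>&1") == 0)
			(void)write(pfd[1], "k", 1);	/* phone home */
		_exit(0);
	}

	/* Parent: wait for the heartbeat, but only for so long. */
	close(pfd[1]);
	FD_ZERO(&rfds);
	FD_SET(pfd[0], &rfds);
	tv.tv_sec = TIMEOUT_SECS;
	tv.tv_usec = 0;

	if (select(pfd[0] + 1, &rfds, NULL, NULL, &tv) == 1 &&
	    read(pfd[0], &buf, 1) == 1) {
		waitpid(child, NULL, 0);
		return 0;	/* healthy: nothing to do */
	}

	/* No heartbeat: kill the stuck child, then recover the environment.
	 * Hunting down the other spinning processes is left as an exercise. */
	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return system("db_recover -h " RPMDB_HOME) == 0 ? 0 : 1;
}

The important part is that the parent never touches the RPM database itself. Only the disposable child does, so the watchdog survives no matter how wedged the environment gets.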

This is just the beginning, of course. There are other failure modes, including those where you can open the RPM database but then it refuses to let you query anything. There's one where you can query some, but not all of the packages. Get far enough down the list and it'll get stuck trying to acquire a lock of some sort.

Pursue this long enough and you will discover the fun of running "fuser" on a machine which uses NFS and has at least one dead mount. It'll start digging through /proc, it'll find a reference to the dead filesystem, and will enter "state D" forever. Joy! Soon, you will be in the business of monitoring and maintaining NFS mounts. Say hello to the forced umount and the inevitable tradeoffs between staying frozen and possibly losing data by killing a stuck write in another process. (If this sounds familiar, it's because I did it with smbfs back in the '90s. The names change, but the problems remain.)
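
If you do end up in that business, the dance usually comes down to some flavor of forced or lazy unmount -- the mount point here is made up, -f forces the unmount, and -l just detaches it from the tree so new callers can't wander in, while anything already stuck stays stuck:

  umount -f /mnt/dead || umount -l /mnt/dead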

Get through this, and the next problem to emerge won't be NFS or RPM. It'll be yum instead. Yep, you can have a system which runs RPM commands and looks healthy, but then fails a "yum check". Those are all sorts of fun, too.
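
For reference, "yum check" with no arguments goes looking for broken dependencies, duplicate package entries, and the like -- duplicates left behind by an interrupted transaction are the classic case. The yum-utils package ships package-cleanup for poking at those (the second flag actually removes things, so mind where you point it):

  yum check
  package-cleanup --dupes
  package-cleanup --cleandupes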

This is pretty much where things are now: I'm chasing a variety of loosely-related problems which can screw up fleets of machines. The biggest concern is not getting stuck doing nothing but this. Just like meddling in the affairs of other countries with military force, this sort of thing needs a clear exit strategy from the outset.

Get in, get it done, and get back out.