Tuesday, December 20, 2011

How to sanely reboot a bunch of machines to test kernels

Here's a testing scenario I've seen quite a few times. All you have to do is take a given Linux kernel image and get it to start on a bunch of machines and then run some tests on those same machines. This sounds easy enough, but there are a bunch of corner cases that people seem to always miss.

First off, you have a few dozen machines. You need to access all of them to load the kernel. You could do it with ssh, but now you have two problems. Instead, think in terms of servers. You're going to need something persistent on the machine to help you run things anyway, so write something which sits there and runs RPCs for you.

One of those RPCs accepts a kernel image, installs it on the machine, and does whatever magic is needed with LILO or GRUB to make it know about the new kernel. This RPC either works and returns a success message or fails and returns a failure message.
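
If it helps to picture it, the request and reply for that first RPC might look something like this. The field names are made up; use whatever shapes your RPC layer gives you.

    #include <stdint.h>

    /* Hypothetical message shapes for the "install this kernel" RPC.
     * The field names are mine, not from any particular framework. */
    struct install_kernel_request {
        char     version[128];   /* version string we expect uname to report later */
        uint64_t image_len;      /* size of the kernel image that follows */
        /* ...image bytes follow the header... */
    };

    struct install_kernel_reply {
        int  ok;                 /* 1 = installed and added to the bootloader */
        char error[256];         /* human-readable reason when ok == 0 */
    };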

Another RPC reboots the machine. Call /sbin/reboot, poke init via a pipe, or whatever applies for your userspace. I would advise against just calling the kernel's reboot() syscall, in other words.
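
A minimal sketch of that handler, assuming plain old /sbin/reboot is the right thing for your userspace:

    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    /* Sketch: fork and exec /sbin/reboot so init gets a chance to shut
     * things down, instead of yanking the rug out with reboot(2). */
    static int request_reboot(void)
    {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            execl("/sbin/reboot", "reboot", (char *) NULL);
            _exit(127);     /* only reached if exec failed */
        }
        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
    }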

A third RPC reports on whatever kernel version happens to be running. Think along the lines of "uname -a", but don't take it literally! Get that scripter-think out of your head and go look at the man page for uname(2) instead of uname(1). The former is a library call, while the latter is a userspace program.

The library call is easy: you call it with a pointer, and you get back a value and possibly data in your structure. Calling the userspace program is annoying. You have to create a pipe, then fork, then do dup2 craziness, and then exec. Then you have to parse all of that garbage!

Even if you cheat and use popen() or system(), you still have to parse it. Meanwhile, those of us who went straight to the library are relaxing somewhere because we have things already split up: sysname, nodename, release, and so on. Yay.
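
For comparison, here's the whole thing with uname(2). No pipes, no forking, no parsing:

    #include <stdio.h>
    #include <sys/utsname.h>

    int main(void)
    {
        struct utsname u;

        if (uname(&u) != 0) {
            perror("uname");
            return 1;
        }

        /* Already split up for us by the kernel. */
        printf("sysname:  %s\n", u.sysname);
        printf("nodename: %s\n", u.nodename);
        printf("release:  %s\n", u.release);
        printf("version:  %s\n", u.version);
        printf("machine:  %s\n", u.machine);
        return 0;
    }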

Now that you have a clueful helper process on your test systems, you need something which will use them. Okay, so you write some code that will act as a client. It opens RPC connections to all of your target machines and figures out where they are, then acts accordingly. It's a simple state machine thing.

If the machine is in the first state, we haven't done anything to it yet. Tell it to load the kernel and proceed to the next state. When you hear back about the kernel load progress, you either move to the next state (success), or you go back and try again (failure). The next step is to tell it to reboot, and again, you have to wait for it to come back to know what happens next.
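
The set of states doesn't have to be fancy. Something like this is enough to make "forward" and "back" concrete (the names are made up):

    /* One of these per target machine. The names are mine; the point is
     * that a failure can send one machine backward without derailing the rest. */
    enum machine_state {
        ST_IDLE,        /* haven't touched it yet */
        ST_LOADING,     /* sent the install-kernel RPC, waiting for the reply */
        ST_REBOOTING,   /* told it to reboot, waiting for evidence of a new boot */
        ST_VERIFYING,   /* it came back; check that it's running our kernel */
        ST_READY,       /* running the right kernel; counts toward the quorum */
        ST_FAILED       /* gave up on this one */
    };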

How do you know when the machine has actually rebooted? Think about this. You're basically trying to tell that the machine was up, then down, then up again. Right now, some people are probably thinking about calling out to run "uptime". Ugh! No!

First, calling "uptime", the userspace tool, would be evil for all of the reasons I detailed above regarding uname-the-tool.

Second, that's not really telling you what you actually need to know. You're trying to infer action based on some magic numbers. I assume you'll say "oh, the uptime number WENT DOWN, so we must have rebooted". Uh, great, yeah. Now you're comparing samples of a counter and hoping you polled at the right moments, instead of reading something that directly identifies the boot.

Instead, how about looking at something which is unique for every boot and will change when you reboot? Now you're talking. Go look at the value of /proc/sys/kernel/random/boot_id. It's been in the Linux kernel for a very long time.
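
Reading it is about as simple as it gets:

    #include <stdio.h>
    #include <string.h>

    /* Read the UUID the kernel generates once per boot.
     * Sketch-level error handling only. */
    static int read_boot_id(char *buf, size_t len)
    {
        FILE *f = fopen("/proc/sys/kernel/random/boot_id", "r");
        if (!f)
            return -1;
        if (!fgets(buf, (int) len, f)) {
            fclose(f);
            return -1;
        }
        fclose(f);
        buf[strcspn(buf, "\n")] = '\0';   /* drop the trailing newline */
        return 0;
    }

Grab it before you ask for the reboot, keep polling afterward, and when the string changes, the machine is on a new boot.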

Once you can tell that the machine has moved on to a new boot, that state should ask for the kernel version. That's when you compare it to the one you loaded and make sure it started. Maybe it ran fsck and then rebooted afterward, and that made it start your default/stock kernel again. You have to be able to discover this and recover from it.

Also, how do you make sure it actually ran your kernel from right now and not one with an identical version/uname string? Again, you just have to think ahead. This one is a little messier than the rest, but you can put some magic string in the kernel's command line when you add it to LILO or GRUB.

Make the magic part of your kernel command line something random and change it every time. Then make sure it's in /proc/cmdline after you reboot. If it's there, your kernel must be running. If there's some file in /proc which will give you the running kernel's signature or similar, that would be far better, but I am not aware of one at this time.
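
The check on the target side is just a substring match against /proc/cmdline. The token itself, something you generated and appended at install time, is whatever you want:

    #include <stdio.h>
    #include <string.h>

    /* Return 1 if our magic token is in /proc/cmdline, 0 if not,
     * -1 if the file couldn't be read. */
    static int cmdline_has_token(const char *token)
    {
        char buf[4096];
        FILE *f = fopen("/proc/cmdline", "r");
        if (!f)
            return -1;
        size_t n = fread(buf, 1, sizeof(buf) - 1, f);
        fclose(f);
        buf[n] = '\0';
        return strstr(buf, token) != NULL;
    }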

If you make it this far and find the right string, the kernel loading and rebooting part is done ... for that machine. You didn't forget about the others, right? You should have been taking all of them through these same state changes in parallel. Doing it serially would take forever.

There's more, though. Let's say you have a pool of 100 machines. I hate to tell you this, but you won't always have all 100 available. Machines break. They go off in a corner and sulk. Maybe your test only really needs 85 machines to run. That's great! But can your code handle it?

I've seen far too many systems which will just sit there looking stupid because they can't get that 100th machine, even though it could run just fine with the 99 it managed to scrape together. If you plan for this, then you start thinking about having a quorum, and a limited delay to collect as many machines as possible.

This, too, is easy. Every machine that makes it all the way through the load-reboot-version check state machine contributes to the count. If your count equals your target (100), you're done. Go start your test!

If you're up at the minimum level (85), then start a countdown. As long as you stay at or above that level, keep the clock running. If the other 15 show up and you hit 100, go start your test!

Also, if the clock runs out but you still have at least 85 machines, then ignore the rest and go start your test!

If you can't get to even the minimum level in some amount of time, give up. Something is very wrong.
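
Put together, the decision for the whole pool is a tiny function. Everything here is a placeholder: the names are mine and your numbers will vary.

    #include <time.h>

    enum decision { KEEP_WAITING, START_TEST, GIVE_UP };

    /* ready             = machines that finished the load-reboot-verify dance
     * target, minimum   = e.g. 100 and 85
     * quorum_reached_at = when "ready" first hit "minimum"
     * started_at        = when this whole run began */
    static enum decision decide(int ready, int target, int minimum,
                                time_t now, time_t quorum_reached_at,
                                time_t started_at,
                                int quorum_wait_secs, int overall_wait_secs)
    {
        if (ready >= target)
            return START_TEST;                  /* everyone showed up */
        if (ready >= minimum) {
            /* Quorum: give the stragglers a bounded amount of time. */
            if (now - quorum_reached_at >= quorum_wait_secs)
                return START_TEST;
            return KEEP_WAITING;
        }
        /* Below quorum: something may be very wrong, but don't decide
         * that until the overall deadline passes. */
        if (now - started_at >= overall_wait_secs)
            return GIVE_UP;
        return KEEP_WAITING;
    }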

This should not be that difficult. Still, I've seen people get this wrong so many different ways. The big one was dealing with the machines which would reboot into the "wrong kernel". It's like they had never thought about the probability of fsck wanting to run and then forcing a reboot. The more machines, the more likely any given test will trip it!

The problem is that they had built their system as a linear "this, then this, then this" scheme. If you had an oddity which needed to "go back", you couldn't. It would just give up and fall over. FAIL.

Compare that to my scheme: a simultaneous state machine for each target system, with well-defined "forward" and "back" progressions.

The winner should be obvious.