Software, technology, sysadmin war stories, and more.
Wednesday, March 21, 2012

Wake-on-LAN is great, but what about reboot-on-LAN?

I used to do a bunch of invasive tests on distant servers. By invasive, I'm talking about pushing experimental new kernels or other bits of the operating system and then rebooting. It had a high potential for failure.

When things worked correctly, everything was fine. The machine would come back up on the network and I could talk to it again. That assumes the kernel and OS decided to play nicely with the hardware. But what happened when things weren't so good?

When a machine failed to come back up, there were surprisingly few options for recovery. The first thing to try was to just wait. Some systems have software or hardware watchdogs and will reboot themselves if the OS fails to check in. In that case, my boot loader would bring up the default kernel instead of an experimental one, and everything should work again.
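For anyone who hasn't met one, here is a rough sketch of the "check in" side of a watchdog. It assumes a Linux-style watchdog device, where any write counts as a keepalive and the timer reboots the box if the writes stop. The function name `pet_watchdog` and its parameters are made up for illustration, and it takes any file-like object so you can try it without touching real hardware.

```python
import io
import time


def pet_watchdog(dev, beats, interval_s=0.0):
    """Write one keepalive byte per beat and return the number of beats sent.

    `dev` is any writable binary file-like object. On a real Linux box it
    would typically be /dev/watchdog opened unbuffered (an assumption about
    your setup, not something every machine has).
    """
    for _ in range(beats):
        dev.write(b"\0")   # any write resets the countdown
        dev.flush()
        time.sleep(interval_s)
    return beats


# Dry run against an in-memory buffer instead of the real device.
fake_device = io.BytesIO()
sent = pet_watchdog(fake_device, beats=3)
print(sent, fake_device.getvalue())
```

On a real machine you would open the device once and keep writing at an interval shorter than the timeout. If a bad kernel hangs, the writes stop, and the watchdog does the reboot for you.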

Unfortunately, not all of my machines had these watchdogs. They also usually didn't have serial consoles or other out-of-band management systems, and there were no switched PDUs. What happened next was decidedly low-tech.

After a reasonable delay to see if it worked itself out, I would have to file a request for a hands-on reboot. Someone would go out to the machine, unplug it, count to 10, and then plug it back in. This was slow and annoying. It also made me feel guilty for wasting some person's time with my silly requests.

It seemed like the sort of problem you could solve with a little embedded firmware magic. Half would live in the switch and half would live in your NIC. Unlike wake-on-LAN, this would be out of band, so no other hosts could generate valid requests. In other words, I don't trust the whole "magic packet" affair for a machine which is already up.
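To show why the in-band approach makes me nervous, here is a minimal sketch of the standard wake-on-LAN magic packet: six 0xFF bytes followed by the target's MAC address repeated 16 times. The function names, the broadcast address, and UDP port 9 are my choices for illustration. The MAC in the example is a placeholder.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a wake-on-LAN payload: 6 bytes of 0xFF, then the MAC 16 times."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("MAC address must be exactly 6 bytes")
    return b"\xff" * 6 + raw * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is a common convention)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))


# Building the payload needs nothing but the MAC address.
packet = magic_packet("00:11:22:33:44:55")
print(len(packet))
```

Nothing in there is secret. Any host that can reach the segment and knows the MAC can produce the same bytes, which is fine for waking a sleeping box and not so fine for resetting a running one.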

My logic is simple enough: always leave enough of the NIC running so it can independently look for the magic signal on its port. Maybe it's some weird agreed-upon sequence of voltages which would normally never happen. However it works, when it gets that signal, it just needs to frob the motherboard's reset line. That's it.

Obviously you would need support for this in your switches and a proper authentication and authorization system to control access. This also means trusting your switches to not attack your machines with reboots, but let's face it: a compromised switch could already turn off the ports for the same effect, so this isn't adding much in the way of new vulnerabilities.

For the sake of completeness, I will mention that merely being able to reboot the machine isn't useful unless you've done your homework. First of all, when you mess with a new kernel, you need to use LILO's "only boot this image once" scheme or whatever the equivalent may be for GRUB. Second, if you are doing scary things to the rest of the stack, you might need to have some extremely low-dependency fallback route just in case.
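The "boot this once" step might look something like the sketch below. It only builds the command line and leaves running it to you. It assumes `lilo -R`, which sets the command line for the next boot only, and GRUB 2's `grub-reboot`, which generally needs the default entry configured as "saved" to have any effect. The helper name and the entry labels are hypothetical.

```python
def boot_once_command(loader: str, entry: str) -> list:
    """Return the argv that asks the boot loader to use `entry` for one boot.

    After that single boot, the loader falls back to its normal default,
    so a hung experimental kernel is gone after the next reset.
    """
    if loader == "lilo":
        return ["lilo", "-R", entry]
    if loader == "grub2":
        return ["grub-reboot", entry]
    raise ValueError("unknown boot loader: " + loader)


# "experimental" is a made-up image label; use whatever your config defines.
print(boot_once_command("lilo", "experimental"))
print(boot_once_command("grub2", "experimental"))
```

You would hand the result to something like `subprocess.run(cmd, check=True)` as root just before the reboot. The key property is that the known-good kernel stays the permanent default the whole time.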

I'd say "test this stuff before you roll it out", but the whole point of these systems was to do the testing before it got to production machines. There comes a point when you just have to suck it up and try it on a real device somewhere.

If the system truly gets messed up, your only hope may be to bounce it remotely and then catch it with the PXE boot environment as it comes up. Whether it goes into a manual rescue, an automated rescue, or an automated reinstaller is entirely up to you.