Writing

Software, technology, sysadmin war stories, and more.

Sunday, December 25, 2022

Unintentionally BREAKing a serial console

I heard about a neat bug once that was caused by the interaction of some hardware that was missing some electronics and some software which was just doing what it was told. It had to do with the "access of last resort" you'd use on a machine that was otherwise dead to the world: the console.

Imagine a datacenter with tens of thousands of Linux boxes running. Sometimes, they break and fall off the network. Fortunately, they have a "mini-me" type thing attached which then allows you access to a serial console. It's not quite the same as being there with a monitor and keyboard plugged into the box, but it's frequently enough to dig out of a real mess without getting in a car (or worse).

It seemed that people had been trying to fire up the console on their systems and weren't getting the expected results. What's supposed to happen is that they connect, hit ENTER once or twice, and it should pop up something like this in reply:

Linux x.y.z (something-arcane.foo.bar.company.example)

login:

They'd hit ENTER and at best, nothing would happen. Sometimes, it would just be a jumbled mess. Obviously, if the machine was unreachable over the network, we couldn't debug it, so it took a bit to find one that had a broken console but which was still available over the network.

What we found was interesting. The thing that actually puts up that login prompt is a process called getty (or some variant, like "agetty"). Its job is just to sit there and handle the serial line, read your login name, and get you connected to a login process to carry on from there.

For this to work, agetty and the serial port on the host have to agree with the serial port on the client in terms of baud rates (and other things too, but let's keep this story simple). If you get one out of sync, the other end will have no idea what you're talking about.
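To make the "no idea what you're talking about" part concrete, here's a toy model of a serial line in Python. It's only a sketch: it assumes plain 8N1 framing (one start bit, eight data bits, one stop bit), and `encode`, `decode`, and `samples_per_bit` are names I made up for illustration. A real UART does this in hardware and checks more than this does.

```python
def encode(data: bytes, samples_per_bit: int) -> list[int]:
    """Turn bytes into line samples using 8N1 framing (an idle line is 1)."""
    samples = []
    for byte in data:
        bits = [0]                                   # start bit
        bits += [(byte >> i) & 1 for i in range(8)]  # data bits, LSB first
        bits += [1]                                  # stop bit
        for bit in bits:
            samples += [bit] * samples_per_bit
    return samples + [1] * (samples_per_bit * 10)    # trailing idle time


def decode(samples: list[int], samples_per_bit: int) -> bytes:
    """Recover bytes by sampling the middle of each bit cell."""
    out = bytearray()
    i = 0
    while i < len(samples):
        if samples[i] == 1:          # line is idle, keep looking for a start bit
            i += 1
            continue
        byte = 0
        for n in range(8):
            pos = i + samples_per_bit * (n + 1) + samples_per_bit // 2
            if pos >= len(samples):
                return bytes(out)
            byte |= samples[pos] << n
        out.append(byte)
        i += samples_per_bit * 10    # skip the whole frame, stop bit included
    return bytes(out)


line = encode(b"login: ", samples_per_bit=8)
print(decode(line, samples_per_bit=8))   # both ends agree on the rate
print(decode(line, samples_per_bit=4))   # receiver running twice as fast
```

When both ends use the same `samples_per_bit`, the bytes come back intact. When the receiver runs at the wrong speed, it samples the wrong spots and you get the "jumbled mess" instead.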

If you've never crossed paths with this before, imagine you're a dog that can only hear whistles at a specific set of pitches: one high, one low. Someone who uses the wrong set of frequencies won't make much sense to you. Baud rates are a little like that.

Somehow, the getty on these machines had gotten into a state where it wasn't running at the same baud rate as the system that provided remote access to the serial console. We knew getty had a feature that would watch for a "serial break" (imagine a really long low whistle in the dog analogy) and, on seeing one, rotate through a list of baud rates.

This feature was probably intended to avoid a chicken-and-egg situation where you plug a terminal into a serial port on some Unix box and can't talk to it because it's at some rate that you can't change to. So, you keep poking it with BREAKs until it comes around to something that you can reach, and then you proceed from there.
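The rotation itself is simple enough to sketch. This is a toy model, not getty's actual code: `RateCycler` and `breaks_needed` are hypothetical names, and the list of rates is just an example.

```python
class RateCycler:
    """Toy model of a getty that steps to the next baud rate on each BREAK."""

    def __init__(self, rates):
        if not rates:
            raise ValueError("need at least one baud rate")
        self.rates = list(rates)
        self.index = 0

    @property
    def current(self):
        return self.rates[self.index]

    def on_break(self):
        # Treat the list as circular: after the last rate, wrap to the first.
        self.index = (self.index + 1) % len(self.rates)
        return self.current


def breaks_needed(cycler, wanted):
    """Count the BREAKs it takes to reach the rate our terminal is stuck at."""
    if wanted not in cycler.rates:
        raise ValueError("rate not in list")
    count = 0
    while cycler.current != wanted:
        cycler.on_break()
        count += 1
    return count


getty = RateCycler([115200, 38400, 9600])
print(breaks_needed(getty, 9600))
```

The flip side is that a single BREAK nobody meant to send is enough to move a working console off the rate everyone expected.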

We didn't have the ability to jam a BREAK down the line from the remote console client system, so what gives? These were two subsystems that were part of a much larger rackmounted beast, so it's not like there was an old-school serial cable running between them. They were probably just traces on a board somewhere. Something didn't add up.

This is when I heard some really neat troubleshooting from someone who actually understood this stuff (i.e., not me): they knew that on other hardware, they had installed buffers between the two systems to keep the electrical low state at boot time from triggering the BREAK behavior.

Unfortunately, they hadn't done this same thing on this particular type of hardware. It was missing the necessary electronics magic (they called it a "pullup") to keep things from getting out of hand when the controller restarted. Oops.
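As I understand it, a serial line idles high, and a BREAK is just the line sitting low for longer than a whole character could account for. Here's a rough sketch of why a line sagging low during a restart is indistinguishable from a deliberate BREAK. It assumes 10-bit frames (8N1), and `count_breaks` is a made-up helper. Real UARTs flag this in hardware.

```python
def count_breaks(samples, samples_per_bit, frame_bits=10):
    """Count runs where the line stays low for at least one whole frame."""
    threshold = samples_per_bit * frame_bits
    breaks = 0
    run = 0
    for level in samples:
        if level == 0:
            run += 1
            if run == threshold:     # count each long low run exactly once
                breaks += 1
        else:
            run = 0
    return breaks


idle = [1] * 40                      # line held high, as a pullup would do
zero_byte = [0] * 72 + [1] * 8       # a NUL byte: 9 low bit times, then stop bit
restart = [0] * 200                  # line sitting low while the controller reboots

print(count_breaks(idle + zero_byte + idle, samples_per_bit=8))
print(count_breaks(idle + restart + idle, samples_per_bit=8))
```

Even an all-zeroes byte ends with a high stop bit, so it never trips the detector. The long low stretch does, and the receiving end has no way to know nobody meant it.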

Their solution was to disable that behavior in their getty config. Since the server was hardwired to the only client it would ever have, there was no reason for it to honor a BREAK to do that sort of thing.
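I don't know exactly which getty or which setting they used, so take this as one plausible shape for the fix and not what they actually did. It assumes util-linux agetty, whose man page describes a comma-separated baud rate list that it advances through on each BREAK. Give it one rate and there's nothing to advance to. The port name and rates here are placeholders, and `agetty_args` is a hypothetical helper that only builds the argument list.

```python
def agetty_args(port, rates):
    """Build an agetty argument list: port first, then comma-separated rates."""
    if not rates:
        raise ValueError("need at least one baud rate")
    return ["agetty", port, ",".join(str(rate) for rate in rates)]


# Several rates: each BREAK advances to the next one in the list.
cycling = agetty_args("ttyS0", [115200, 38400, 9600])

# One rate: the list has nowhere else to go, so a BREAK can't change the speed.
fixed = agetty_args("ttyS0", [115200])

print(" ".join(cycling))
print(" ".join(fixed))
```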

That was it. The machines probably still have the same electrical situation to this day and send all kinds of wild crap down the line when their controllers reboot, but at least their gettys won't care.

If you're someone who's never done serial stuff on a vaguely Unixy box and you're bored over the holidays, maybe this is your time to check it out. Find a box with a serial port (good luck!), plop a getty on it, then wire it up to another box with a serial port (more luck to you!) and see if you can get them talking to each other.

Failing that, check out the magic of someone who already did that and then some. Enjoy!