Four interacting decisions break ssh access
Have you ever really stopped to appreciate how one small decision in isolation is no big deal, but when it interacts with other small decisions, sometimes the outcome can be something really broken? I ran across a case of this that spanned software and hardware realms, authentication, authorization, and time synchronization.
Decision #1 happened in hardware. Someone decided that this particular type of machine did not need the usual battery-backed hardware clock that has otherwise been common on PC type boxes for the past 30 years. If you remember when MS-DOS stopped asking you to set the date and time when you powered on your bitty box, that's how long ago this happened.
Well, someone apparently figured it would save money, so this hardware had no hardware clock. When the system booted up, its clock would start from the same fixed date every time. It wasn't quite 1980, but it was still pretty far in the past.
Decision #2 is software. It has to do with the way time sync works on a typical Linux type box. Typical init sequences at boot time might try to run ntpdate once to set the clock off the network, and then they try to start up ntpd which sticks around and disciplines the clock. If ntpdate fails, it is not re-attempted. If ntpd fails, it might be restarted eventually.
Decision #3 is about how ntpd works. If you start it up and it polls for time and finds that upstream is way too far off from the local system clock, it just bails out without doing anything. This is the default behavior, and I can't really argue with it. If your machine is that far out of whack, you probably want to investigate before proceeding.
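To make that concrete, here's a rough model of the decision. It is not ntpd's actual code, just a sketch. The 1000 second figure is ntpd's documented default panic threshold, and the allow_big_first_step flag stands in for running ntpd with -g, which permits one big jump. The function name and the "upstream" date are made up for illustration.

```python
from datetime import datetime, timedelta

# ntpd's documented default panic threshold is 1000 seconds.
PANIC_THRESHOLD = timedelta(seconds=1000)


def ntpd_reaction(local: datetime, upstream: datetime,
                  allow_big_first_step: bool = False) -> str:
    """Simplified model of the first comparison against upstream.

    allow_big_first_step stands in for starting ntpd with -g.
    """
    offset = abs(upstream - local)
    if offset <= PANIC_THRESHOLD:
        return "discipline"   # close enough: adjust the clock as usual
    if allow_big_first_step:
        return "step"         # too far off, but a one-time jump is allowed
    return "exit"             # too far off: give up, leave the clock alone


boot_clock = datetime(2008, 1, 1)   # where the clockless box comes up
real_time = datetime(2018, 3, 1)    # hypothetical "actual" time upstream

print(ntpd_reaction(boot_clock, real_time))                             # exit
print(ntpd_reaction(boot_clock, real_time, allow_big_first_step=True))  # step
```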
Decision #4 is about the choice to use certificates for ssh access to the machine. These certs are a lot like the ones used for SSL/TLS on the web, and as such have both an expiration date and a "not valid before" date. Outside of that span of time, the cert does not work.
Take all four of these decisions, throw them into a single pot, and pull out a device that's subject to all of them. What happens?
Well, most of the time, nothing bad happens. Everything is fine.
But, one day, the machine gets rebooted during a network outage of some sort. Maybe it's a power failure that also reboots its upstream switch, and that switch hasn't come back up yet. At any rate, the machine boots up back in 2008 or whatever, and it goes to run ntpdate. Since the network is down at that point, ntpdate fails.
Next, it tries to run ntpd. ntpd also fails to reach the network and so does nothing. The init sequence carries on and eventually the machine starts trying to start up the usual workload it normally runs. Keep in mind it's still set to January 1, 2008.
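Time passes, and the network comes back up. Even if something restarts ntpd at this point, it finds upstream ten years away from the local clock and bails out (decision #3), so the clock stays wrong. The machine starts trying to do stuff, and since it's ten years (!) in the past, it fails miserably. The data points coming in from it are all screwed up. Alarms start firing. People eventually notice.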
That's when someone tries to log in with ssh to debug things... and fails. They are stunned. They have access to every other machine like this one, so why not this one?
Have you figured it out yet?
When this happened, someone had to get the root password and log in through the out-of-band console to start debugging. They eventually ran 'date', saw the wildly out-of-spec time, and started putting the pieces together. Then when they saw something in the logs about "certificate not valid", it started falling into place.
The person trying to ssh in had a cert that was valid from (say) 2017-01-01 to 2018-12-31. However, the machine thought the current date was 2008-01-01. That cert was not yet valid, and wouldn't be for another nine years. It rejected the login.
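The check that bit us boils down to a date comparison. Here's a minimal sketch using the example dates above. It models the window with plain calendar dates and treats both ends as inclusive, which is a simplification of what sshd really does with timestamps, and cert_status is a made-up name.

```python
from datetime import date

# Validity window from the example above.
NOT_BEFORE = date(2017, 1, 1)
NOT_AFTER = date(2018, 12, 31)


def cert_status(today: date) -> str:
    """Classify the cert against the verifier's idea of today's date."""
    if today < NOT_BEFORE:
        return "not yet valid"
    if today > NOT_AFTER:
        return "expired"
    return "valid"


machine_clock = date(2008, 1, 1)    # what the rebooted box believes

print(cert_status(machine_clock))            # not yet valid
print((NOT_BEFORE - machine_clock).days)     # 3288 days until it would be
print(cert_status(date(2018, 3, 1)))         # valid, once the clock is right
```

Note that the cert itself is fine. The only thing that's wrong is the clock on the machine doing the checking.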
If the machine had a hardware clock, it would have come up with time pretty close to when it had gone offline, and this wouldn't have happened.
If a failure to run ntpdate and/or ntpd had been treated as a show-stopper, the init sequence would have kept restarting things (possibly by rebooting the box) until it finally came up on a sane network and got synced up.
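The "keep trying until it works" idea might look something like this. It's only a sketch: try_sync is a hypothetical callable standing in for "run a one-shot time sync and report whether it worked", and the 30 second delay is an arbitrary pick. A real init system would express this with its own restart settings instead.

```python
import time
from typing import Callable


def wait_for_clock_sync(try_sync: Callable[[], bool],
                        delay_seconds: float = 30.0,
                        sleep: Callable[[float], None] = time.sleep) -> int:
    """Call try_sync until it reports success; return the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        if try_sync():
            return attempts
        sleep(delay_seconds)   # network may still be down: wait, then retry


# Hypothetical stand-in for the real sync: the network is down for the
# first three attempts, then comes back.
results = iter([False, False, False, True])
attempts = wait_for_clock_sync(lambda: next(results), sleep=lambda s: None)
print(attempts)  # 4
```

The point is just that the rest of the boot doesn't proceed until this returns, so nothing starts up while the box still thinks it's 2008.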
If ntpd allowed wild jumps in time sync, it would have eventually fixed things itself, assuming the init system on the box (like systemd) was configured to keep restarting it. Remember that it would have to be an on-box "babysitter", since nothing else would be able to ssh in to fix things.
Finally, if ssh wasn't using certs, or didn't honor the "not before" date, or if certs were issued with a VERY OLD "not before" date, the login would have been accepted.
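The "very old not-before" option is easy to see with the same kind of comparison. In this sketch, the 2000-01-01 start date is a hypothetical pick, and the only requirement is that it lands before whatever date the box boots up with. The accepted function is a made-up name and models the window with inclusive calendar dates.

```python
from datetime import date


def accepted(not_before: date, not_after: date, today: date) -> bool:
    """True if today falls inside the cert's validity window."""
    return not_before <= today <= not_after


machine_clock = date(2008, 1, 1)    # the stuck boot-time date

# Cert as issued in the story: window starts in 2017.
as_issued = accepted(date(2017, 1, 1), date(2018, 12, 31), machine_clock)

# Same expiry, but a backdated (hypothetical) start of the window.
backdated = accepted(date(2000, 1, 1), date(2018, 12, 31), machine_clock)

print(as_issued)   # False
print(backdated)   # True
```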
Flip any of those bits and things work out just fine. It's the combination of those four decisions that added up to a really annoying situation.
So, if you have an embedded machine in your life that fails ssh when it gets rebooted off-net and yet allows it when it reboots on-net, maybe this is why. Take a good look at how you're logging in, the specifics of the clock, and see if this is happening to you.
Tick tick tick. Time is hard.
March 22, 2018: This post has an update.