
Wednesday, April 10, 2024

Going in circles without a real-time clock

I have a story about paper cuts when using a little Linux box.

One of my sites has an older Raspberry Pi installed in a spot that takes some effort to access. A couple of weeks ago, it freaked out and stopped allowing remote logins. My own simple management stuff was still running and was reporting that something was wrong, but it wasn't nearly enough detail to find out exactly what happened.

I had to get a console connected to it in order to find out that it was freaking out about its filesystem because something stupid had apparently happened to the SD card. I don't know exactly why it wouldn't let me log in. Back in the old days, you could still get into a machine with a totally dead disk as long as enough stuff was still in the cache - inetd + telnetd + login + your shell, or sshd + your shell and (naturally) all of the libraries those things rely on. I guess something happened and some part of the equation was missing. There are a LOT more moving parts these days, as we've been learning with the whole xz thing. Whatever.

So I rebooted it, and went about my business, and it wasn't until a while later that I noticed the thing's clock was over a day off. chrony was running, so WTF, right? chrony actually said that it had no sources, so it was just sitting there looking sad.

This made little sense to me, given that chrony is one of the more clueful programs which will keep trying to resolve sources until it gets enough to feel happy about using them for synchronization. In the case of my stock install, that meant it was trying to use 2.debian.pool.ntp.org.

I tried to resolve it myself on the box. It didn't work. I queried another resolver (on another box) and it worked fine. So now what, on top of chrony not working, unbound wasn't working either?

A little context here: this box was reconfigured at some point to run its own recursive caching resolver for the local network due to some other (*cough* TP-Link *cough*) problems I had last year. It was also configured to *only* use that local unbound for DNS resolution.

This started connecting some of the dots. chrony wasn't setting the clock because it couldn't resolve hosts in the NTP pool. It couldn't resolve hosts because unbound wasn't working. But, okay, why wasn't unbound working?

Well, here's the problem - it *mostly* was. I could resolve several other domains just fine. It's just that ntp.org stuff wasn't happening.

(This is where you start pointing at the screen if this has happened to you before.)

So, what would make only some domains not resolve... but not all of them... on a box... with a clock that's over a day behind?

Yeah, that's about when it fit together. I figured they must be running DNSSEC on that zone (or some part of it), and it must have a "not-before" constraint on some aspect of it. I've been down this road before with SSH certificates, so why not DNS?
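If that guess is right, the check that bites here is simple. DNSSEC signatures (RRSIG records) carry an inception time and an expiration time, and a validating resolver is supposed to reject a signature when its own clock falls outside that window. Here is a minimal Python sketch of just that comparison. The inception and expiration values are made up for illustration, not taken from any real zone, and a real validator does a good deal more than this.

```python
from datetime import datetime, timedelta, timezone

# RRSIG timestamps are commonly written as YYYYMMDDHHmmSS in UTC.
RRSIG_TIME_FORMAT = "%Y%m%d%H%M%S"


def parse_rrsig_time(text):
    """Turn an RRSIG-style timestamp string into an aware UTC datetime."""
    return datetime.strptime(text, RRSIG_TIME_FORMAT).replace(tzinfo=timezone.utc)


def signature_usable(inception, expiration, now):
    """True only if 'now' sits inside the signature's validity window."""
    return parse_rrsig_time(inception) <= now <= parse_rrsig_time(expiration)


# Hypothetical values: a signature that was generated fairly recently.
inception = "20240409120000"
expiration = "20240423120000"

real_now = datetime(2024, 4, 10, 12, 0, tzinfo=timezone.utc)
stale_now = real_now - timedelta(days=1, hours=12)  # a clock 36 hours behind

print(signature_usable(inception, expiration, real_now))   # True
print(signature_usable(inception, expiration, stale_now))  # False
```

With a clock that far behind, any signature generated within the last day and a half looks like it comes from the future, so zones that re-sign often would fail while unsigned zones (or ones with older signatures) keep resolving.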

I added another resolver to resolv.conf, then chrony started working, and that brought the time forward, and then unbound started resolving the pool, and everything else returned to normal.

By "everything else", I also mean WireGuard. Did you know that if your machine gets far enough out of sync, that'll stop working, too? I had no idea that it apparently includes time in its crypto stuff, but what other explanation is there?
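One plausible explanation: the WireGuard protocol description says the handshake initiation message carries a timestamp, and the responder remembers the newest one it has accepted from each peer and drops anything that isn't newer, as protection against replayed packets. The Python sketch below models only that rule. The `Responder` class, the peer name, and the plain Unix-seconds timestamps are all stand-ins for illustration (the real protocol uses an encrypted TAI64N value), and I haven't confirmed this is what actually broke the tunnel here.

```python
class Responder:
    """Tracks the newest handshake timestamp accepted from each peer."""

    def __init__(self):
        self.latest = {}  # peer name -> newest timestamp accepted so far

    def accept_initiation(self, peer, timestamp):
        """Accept a handshake only if its timestamp is strictly newer."""
        newest = self.latest.get(peer)
        if newest is not None and timestamp <= newest:
            return False  # looks like a replay, so drop it
        self.latest[peer] = timestamp
        return True


hub = Responder()
before_crash = 1_712_000_000.0           # hypothetical Unix time
after_reboot = before_crash - 36 * 3600  # clock came back about 36 hours behind

print(hub.accept_initiation("pi", before_crash))      # True
print(hub.accept_initiation("pi", after_reboot))      # False
print(hub.accept_initiation("pi", before_crash + 1))  # True
```

Under that model, a box whose clock jumps backwards can't complete a handshake until its clock passes the last timestamp the other end saw, which matches the tunnel coming back once the time was fixed.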

Backing up, let's talk about what happened, because most of this is on me.

I have an old Pi running from an SD card. It freaked out. It took me about a day and a half to get to where it was so I could start working on fixing it.

This particular Pi doesn't have a real-time clock. The very newest ones (5B) *do*, but you have to actually buy a battery and connect it. By default, they are in the same boat. This means when they come up, they use some nonsense time for a while. I'm not sure exactly what that is offhand, because...

systemd does something of late where it will try to put the clock back to somewhere closer to "now" when it detects a value that's too far in the past. I suspect it just digs around in the journal, grabs the last timestamp from that, and runs with it. This is usually pretty good, since if you're just doing a commanded reboot, the difference is a few seconds, and your time sync stuff fixes the rest not long thereafter.

But, recall that the machine sat there unable to write to its "disk" (SD card) for well over a day, so that's the timestamp it used. If I had gotten there sooner, I guess it wouldn't have been so far off, but that wasn't an option.
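To put rough numbers on it, here is a small Python sketch of the general idea of "never come up earlier than the newest timestamp on disk". It is not systemd's actual code, the dates are hypothetical, and the 1970 boot clock is just a placeholder for whatever nonsense value the Pi really starts with.

```python
from datetime import datetime, timedelta, timezone


def restore_clock(boot_clock, last_persisted):
    """Pick the later of the boot-time clock and the newest timestamp on disk."""
    return max(boot_clock, last_persisted)


# Hypothetical: the last moment anything was successfully written to the card.
last_write = datetime(2024, 3, 25, 8, 0, tzinfo=timezone.utc)

# The box then sat there for about a day and a half before being rebooted.
true_now = last_write + timedelta(hours=36)

# Placeholder for the nonsense time a clockless board starts with.
boot_clock = datetime(1970, 1, 1, tzinfo=timezone.utc)

restored = restore_clock(boot_clock, last_write)
print(restored)             # 2024-03-25 08:00:00+00:00
print(true_now - restored)  # 1 day, 12:00:00
```

After a clean reboot the last write is seconds old, so the error is tiny. After a long stretch with a dead card, the error is however long the card was dead.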

Coming up with time that far off made unbound unable to resolve the ntp.org pool servers, and that made chrony unable to update the clock... which made unbound unable to resolve the pool servers... which...

My own configuration choice which pointed DNS resolution only at localhost did the rest.

So, what now? Well, first of all, I gave it secondary and tertiary resolvers so that particular DNS anomaly won't be repeated. Then I explicitly gave chrony a "peer" source of a nearby host (another Pi, unfortunately) which might be able to help it out in a pinch even if the link to the outside isn't up for whatever reason.
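Something like the following could catch the "only pointed at localhost" setup before it matters again. It's a rough Python sketch that reads resolv.conf-style text and flags the case where every nameserver is a loopback address. The function names are my own for this example, and the non-loopback addresses are from the documentation ranges, not real resolvers.

```python
import ipaddress


def nameservers(resolv_conf_text):
    """Return the addresses from 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", ";")):
            continue  # blank line or comment
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def only_loopback(servers):
    """True if there is at least one nameserver and all of them are loopback."""
    if not servers:
        return False
    # Drop any IPv6 zone suffix ("%eth0") before parsing the address.
    return all(ipaddress.ip_address(s.split("%")[0]).is_loopback for s in servers)


before = "nameserver 127.0.0.1\n"
after = (
    "nameserver 127.0.0.1\n"
    "nameserver 192.0.2.53\n"      # placeholder secondary
    "nameserver 198.51.100.53\n"   # placeholder tertiary
)

print(only_loopback(nameservers(before)))  # True
print(only_loopback(nameservers(after)))   # False
```

On a real box you would feed it the contents of /etc/resolv.conf and complain through whatever monitoring is already in place when it returns True.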

There's a certain problem with thinking of these little boxes as cheap. They are... until they aren't. To mangle a line from jwz, a Raspberry Pi is only cheap if your time has no value.

As usual, this post is not a request for THE ONE to show up. If you are THE ONE, you don't make mistakes. We know. Shut up and go away.