Writing

Software, technology, sysadmin war stories, and more.

Monday, September 7, 2015

One checkbox equals non-UTC fun

GPS time is not quite the same as NTP time. They are actually somewhat different things, and if you use one of them to feed the other verbatim, you might just have a bad day. What's the big deal? Yet again, it comes down to leap seconds.

GPS time doesn't have leap seconds. It's just been incrementing steadily for a little more than 35 years. UTC, on the other hand, still does have leap seconds, and they make life interesting for people every couple of years when the accumulated offset calls for an adjustment. We just went through one of these at the end of June.

Why am I bothering to write about this? Well, it's actually not that difficult to shoot yourself in the foot and mix up these time scales with a commonly used commercial GPS-based NTP appliance. There's a nice checkbox which says, quite simply:

Ignore UTC Corrections from GPS Reference

It looks tempting, right? The default is unchecked, which gives you the double-negative effect, resulting in honoring UTC corrections from GPS... which is probably what you want, whether you understand it or not.

If you check that and restart ntpd on the box, it will start giving you time that's (currently) 17 seconds fast of what most people would expect. This number will grow as more leap seconds are inserted in the future.
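To make the arithmetic concrete, here's a minimal Python sketch of what that checkbox changes. The names are mine, not the vendor's, and the 17-second offset is hard-coded to the value in effect as I write this; treat it as an illustration, not as how the appliance works internally.

```python
from datetime import datetime, timedelta, timezone

# Assumption: GPS is 17 seconds ahead of UTC, which is the value in effect
# after the June 30, 2015 leap second. It grows with each new leap second.
GPS_UTC_OFFSET_SECONDS = 17


def gps_scale_to_utc(reported, offset_seconds=GPS_UTC_OFFSET_SECONDS):
    """Remove the GPS-UTC offset from a timestamp served on the GPS scale."""
    return reported - timedelta(seconds=offset_seconds)


# What a client would see from a box with the checkbox ticked. The client
# assumes it's UTC, so it's labelled as UTC here.
reported = datetime(2015, 9, 7, 12, 0, 17, tzinfo=timezone.utc)

# What everyone else on the Internet would call that same instant.
actual = gps_scale_to_utc(reported)
print(actual.isoformat())  # 2015-09-07T12:00:00+00:00
```

The appliance normally does this subtraction for you. Checking the box amounts to skipping it.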

So let's say you have a bunch of these things in your organization spread throughout your network. Maybe one of them has it checked and the others don't. Odds are, ntpd will declare shenanigans on the one outlier and ignore it. But, what if no other sources are available? Or, what if multiple sources get this checked somehow?

If enough sources claim the time is 17 seconds fast, ntpd will eventually conclude that the local machine is the crazy one, and will step the time to compensate. The machine will skip over those 17 seconds and will be running 17 seconds ahead of UTC from then on.
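The "who's the crazy one" decision can be sketched as a toy majority vote. To be clear, this is not ntpd's real selection algorithm, which is considerably more involved. The function name and the tolerance value are made up for illustration.

```python
def pick_consensus(offsets, tolerance=0.128):
    """Return the largest group of offsets (in seconds) that agree with
    each other to within `tolerance`. Toy model only, not ntpd's logic."""
    best = []
    for candidate in offsets:
        group = [o for o in offsets if abs(o - candidate) <= tolerance]
        if len(group) > len(best):
            best = group
    return best


# Three sane sources and one with the checkbox ticked: the outlier loses.
healthy = pick_consensus([0.002, -0.001, 17.001, 0.003])

# Two ticked boxes and one sane source: now the sane one is the outlier.
poisoned = pick_consensus([17.000, 17.002, 0.001])

print(healthy)   # [0.002, -0.001, 0.003]
print(poisoned)  # [17.0, 17.002]
```

Once the majority flips, the only thing left that "disagrees" is the local clock, and that's the one that gets moved.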

Now let's say at some point after that, the local consensus is that time is actually back where it should be. At first, ntpd will act like all of its sources are insane, with offsets approaching -17000 milliseconds and huge jitter values to match. However, as time goes by, the jitter will drop because the sources are consistently offset by the same amount. Eventually, ntpd will again realize the local clock is the crazy one, and it will step the clock just like it did before, but now, it'll go backwards.

When this happens, any program on the machine which is looking at the real-time clock is going to start rocking and rolling as it repeats the last 17 seconds. If you're using functions like time(), gettimeofday(), or even some flavors of clock_gettime() (the CLOCK_REALTIME ones) in your code, you're going for a ride!
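Here's a small Python sketch of what that ride looks like for code that measures elapsed time. The `measure` helper and the fake clock are hypothetical, just enough to simulate a backward step without actually touching the system clock. Python's time.monotonic() stands in for the non-REALTIME flavor of clock_gettime().

```python
import time


def measure(work, clock=time.monotonic):
    """Return how long work() took according to the given clock."""
    start = clock()
    work()
    return clock() - start


# Simulate a wall clock that gets stepped back 17 seconds while a
# 2-second job is running: it reads 1000.0 before and 985.0 after.
readings = iter([1000.0, 985.0])
stepped_wall_clock = lambda: next(readings)

bad = measure(lambda: None, clock=stepped_wall_clock)
good = measure(lambda: None)  # monotonic clock, immune to steps

print(bad)        # -15.0
print(good >= 0)  # True
```

A negative duration is the friendly failure mode. Timeouts, rate limiters and anything sorting by timestamp tend to fail in more creative ways.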

There's a nice warning about this in the vendor's documentation:

CAUTION: NTP time is based on the UTC time scale. Distributing GPS time over NTP is non-standard and can have serious consequences for systems that are synchronized to UTC. This action should only be performed by a person who is knowledgeable and authorized to do so.

It's kind of neat in an evil way. Flip that switch and you will start publishing time over NTP that isn't equivalent to what you'll usually find when you speak NTP to arbitrary hosts on the Internet. Again, the vendor warns users to distribute such time only on "private or closed networks", to avoid doing it in public, and even to lock it down to keep people away from it.

Somehow, I think a mere checkbox + daemon restart is a little thin in terms of protecting people from themselves. I think there's an added bonus in that you can apparently check the box and not restart the daemon right away. That leaves a wonderful timebomb just waiting for the next time something causes it to restart: a power outage, reconfiguring the network, changing ntpd's peers, or just clicking the UI's restart button.

You could be good for years, and then one day...