Writing

Software, technology, sysadmin war stories, and more.

Wednesday, September 27, 2017

ntpd won't save you from one particular rogue bit

I found a new way to screw up time on Unix boxes a couple of months ago. Someone reported a box that looked very very strange, and I took a look. Sure enough, it was in the year 2153. Nothing was fixing it. Normally, ntpd will fall over if it's really far away from the current time. Then we'd notice that ntpd was not running. But this? It was totally cool with this.

How did it happen? Near as anyone can tell, when the system came up, it read a value from the real time clock with a spurious extra bit set. So, instead of only the lower 32 bits being in use, the one just above them was set too.

That bit was number 32. As in 2^32, or 4294967296. That's about 136 years' worth of seconds (yeah yeah, leap seconds, leap years, I know, go away, that doesn't matter here).

So instead of the time being, say, 1506560587, as it is as I write this tonight on September 27, 2017 (local time), it'll be 5801527883, or November 4, 2153. Slight difference, right?

ntpd... didn't care. ntpdate... also didn't care. Why should they? The lower 32 bits of the time are all they care about, and as far as they can tell, those are just fine, so they think nothing is wrong.

If you don't believe me, here, check it out. I came up with something small and stupid which will mess up your clock in a controlled fashion. It will take you to the future or back again, and the ntp tools will keep on going like nothing ever happened.

First, let me show you what running ntpdate looks like normally using my handy testing box:

epenguin:/# date; ntpdate pool.ntp.org; date
Wed Sep 27 18:06:30 PDT 2017
27 Sep 18:06:36 ntpdate[765]: adjust time server 69.89.207.99 offset 0.000582 sec
Wed Sep 27 18:06:36 PDT 2017
epenguin:/#

It takes about six seconds to do its thing, and you can see that my system clock was pretty close to the time server which it happened to pick from the pool.

Now here's what happens when your clock is really screwed up. Notice the "offset" when it fixes things.

epenguin:/# date -s "Dec 25 2017"; ntpdate pool.ntp.org; date
Mon Dec 25 00:00:00 PST 2017
27 Sep 18:08:12 ntpdate[769]: step time server 173.71.73.207 offset -7627914.525158 sec
Wed Sep 27 18:08:12 PDT 2017
epenguin:/#

Huge offset, right? That's because it isn't Christmas yet.

Given this, you'd assume that a clock which is more than a century ahead would generate a ridiculous offset as ntpdate dragged it backwards. Well, you'd usually be right. The exception is this special situation where all of the "extra time" hides behind that one bit.

Here, I run my time-shifting program. Watch what happens.

epenguin:~# date; ./timeshift; date
Wed Sep 27 18:09:54 PDT 2017
Your clock is in the present and is boring. Warping time!
old tv_sec: 1506560994
new tv_sec: 5801528290
Sun Nov 4 00:38:10 PDT 2153
epenguin:~#

Uh oh! Save us, ntpdate!

epenguin:~# date; ntpdate pool.ntp.org; date
Sun Nov 4 00:38:31 PDT 2153
4 Nov 00:38:38 ntpdate[776]: adjust time server 204.9.54.119 offset 0.002158 sec
Sun Nov 4 00:38:38 PDT 2153
epenguin:~#

Yep, not gonna happen. You are stuck in the future.

I'm using ntpdate for simplicity here, but trust me, ntpd will accept it too. It'll happily come up, pick a source, and will declare itself synchronized.

Don't believe me? Try it for yourself. Here's the dumb little tool which will flip 2^32 in your system clock.

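#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

/* Bit 32 of tv_sec.  This only makes sense with a 64-bit time_t. */
#define ROGUE_BIT (1LL << 32)

int main(void) {
  struct timeval tv;

  /* The timezone argument is obsolete, so pass NULL. */
  if (gettimeofday(&tv, NULL) != 0) {
    fprintf(stderr, "gettimeofday failed: %s\n", strerror(errno));
    return 1;
  }

  if (tv.tv_sec & ROGUE_BIT) {
    printf("Your clock is in the future.  Pulling it back.\n");
  } else {
    printf("Your clock is in the present and is boring.  Warping time!\n");
  }

  printf("old tv_sec: %lld\n", (long long) tv.tv_sec);

  /* Flip only bit 32.  tv_usec is untouched, so the fraction survives. */
  tv.tv_sec ^= (time_t) ROGUE_BIT;

  printf("new tv_sec: %lld\n", (long long) tv.tv_sec);

  /* Needs root.  Pass NULL for the timezone here too: some newer C
     libraries reject a call that sets both the time and the timezone. */
  if (settimeofday(&tv, NULL) != 0) {
    fprintf(stderr, "settimeofday failed: %s\n", strerror(errno));
    return 1;
  }

  return 0;
}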

Incidentally, if you try to reproduce this by using a bunch of shell magic to run 'date +%s', OR it with 2**32, then 'date -s' that value, it'll take some luck to reproduce this with ntpd. Part of the problem is that 'date -s' will shear off the fractional bits of the current time. If the resulting time that ntpd sees is more than a few milliseconds off, it'll step the clock, and that will clear out the future time. That's why I wrote this program: testing it with a shell script involved far too many failed attempts.

Interested in reading more on the topic? Check out "How is Time encoded in NTP?".

Be sure to pay attention to the whole thing about the NTP era.

Incidentally, when this happens, 'hwclock --systohc' will probably fail. It generates some neat error messages. This means you can't actually persist the bad time to your RTC. In all likelihood, it will vanish after a reboot, and will not be seen again until you happen to have a single bit error in the clock at boot.

Of course, if you have enough machines and enough reboots, you'll probably see this eventually. You may have already seen it and just didn't know why it happened, or why it persisted. Hopefully this helps.

Time is hard.