Thursday, April 12, 2018

How did this machine jump into the future?

Not all time synchronization problems come from a hardware issue. This is a story about something else entirely with a healthy dose of unintended consequences.

I don't remember exactly how this came on my radar, but someone had a very big problem. One of their many machines was apparently operating several days in the future and was making a mess of things. Depending on which server the load balancer picked, a request would either work just fine or things would get very interesting.

To give you some idea of how this can make things strange, think about what happens when you use your server time in order to "stamp" events as they are received. Now imagine it's a chat system. Users fire off messages at the chat, and those HTTP requests wind up at one of any number of servers. Maybe one time in 50, the message gets stamped in the future, and it really screws up the timeline.
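To make that concrete, here is a small sketch of the chat scenario. The messages, the stamps, and the function names are all made up for illustration; the only real assumption is that something downstream rebuilds the conversation by sorting on the server-side stamp.

```shell
# Hypothetical chat log in arrival order. Each line is "<epoch stamp> <message>",
# stamped by whichever server received it. The second message landed on the
# server whose clock was days ahead.
chat_log() {
    printf '%s\n' \
        '1523560000 alice: anyone around?' \
        '1524580601 bob: yes, what is up?' \
        '1523560020 alice: never mind, fixed it'
}

# Rebuild the timeline the way a client might: numeric sort on the stamp.
timeline() {
    chat_log | sort -n -k1,1
}

timeline
```

Bob answered second, but his reply sorts to the very end of the conversation, after "never mind, fixed it", and it stays stuck there until the real clock catches up with the stamp.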

They eventually found the machine with the "bad clock" and wanted to know what had happened, and why ntpd wouldn't fix it. I could answer the second question first: ntpd will give up and exit if the local clock is too far out of sync with what it thinks the upstream time is (by default, more than 1000 seconds). You can override this with certain flags, such as -g, but those come with their own caveats.

ntpd was actually starting over and over and over on this box. It would come up, check the time, notice it was too far out of whack, and would shut itself down. Then something else would come along, bonk it on the head, and try to start it again. At no point did it actually succeed.

While this was going on, the server processes on this machine were still running, and so any time they got traffic from the outside world, they would apply the ridiculous time to those events.

We put the clock back to about where it should be and ntpd came back up and started doing its job again. This put out the fire temporarily, and now the remaining question was: how did this happen?

This one turned out to be a matter of reading the shell history. At some point in the past, someone had run a command like this:

date -s @1524580601

I had actually seen this one before and so it all fell into place. We had a human logged in as root a couple of days earlier, and they had apparently come across a Unix timestamp (1524580601) and wanted to know what it looked like in human-readable terms. They knew to run date with a switch and '@' in front of the number, but they made one key mistake.

They used '-s' instead of '-d'.

What's the difference? Well, both of them will translate the number into something like "Tue Apr 24 07:36:41 PDT 2018", but '-s' also sets the clock while doing it.
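For reference, this is roughly what the safe version of that lookup looks like. It is a sketch that assumes GNU date; the function name is my own invention, and it pins the locale and prints UTC so the output doesn't depend on the machine's settings.

```shell
# Convert a Unix timestamp to human-readable UTC without touching the clock.
# Assumes GNU date: -d only parses and prints, -u pins the output to UTC.
# epoch_to_human is a hypothetical helper name, not a standard tool.
epoch_to_human() {
    LC_ALL=C date -u -d "@$1" '+%a %b %e %H:%M:%S UTC %Y'
}

epoch_to_human 1524580601
```

That prints "Tue Apr 24 14:36:41 UTC 2018", which is the same instant as the 07:36:41 PDT shown above. Swap the '-d' for '-s' as root and you get equivalent output plus a clock that now lives on April 24.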

Did this person intend to change the system clock and bump it ahead several days? I'm guessing they did not, so what happened?

Assuming you are using a QWERTY-ish keyboard layout, have a look at the home row and notice what you find. 's' and 'd' are adjacent left-hand keys. It wouldn't be terribly difficult to miss and hit the wrong one. Also, when you run the command, it gives you the output you expected without so much as a comment about what else it did. As a result, you have no idea that you did anything wrong!

There are so many things which could have happened differently here. Here are just a few.

First, logging in as root to the box meant that this command had sharpened teeth. Run as some non-privileged user, it would have had no effect on the clock. Save root for when you actually need it.

Second, logging in -at all- with ssh when all you want to do is look at logs can suggest that you are missing some kind of automation or other "hands-off" access methods.

Third, you could defang 'date' itself, such that '-s' no longer functions on your machine. This could be done with a wrapper which detects attempts to invoke it with that arg and rejects it. You could also try something a little more wacky by making it suid to some non-root user (!), so even if root runs it, it'll wind up casting off its permissions.

Both of these hacks are well outside the realm of "least surprise" and are mighty fragile. Your next OS/package update which happens to touch /bin/date will almost certainly undo the change. That means you'd then have to layer on something else to keep putting your hack in place.
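If you wanted to try the wrapper idea anyway, it might look something like the sketch below. This is an illustration, not something from the original incident: safe_date is a hypothetical name, it is written as a shell function instead of a replacement for /bin/date, and it assumes GNU date, where --set is the only long option starting with "--s" as far as I know.

```shell
# Hypothetical wrapper: refuse anything that looks like a request to set the
# clock, and pass everything else through to the real date.
safe_date() {
    for arg in "$@"; do
        case "$arg" in
            # -s, -sVALUE, --set, --set=VALUE, abbreviations like --se,
            # and short-option clusters such as -us.
            -s*|--s*|-[A-Za-z]*s*)
                echo "safe_date: refusing to set the clock ($arg)" >&2
                return 1
                ;;
        esac
    done
    command date "$@"
}
```

It deliberately errs on the side of refusing, so an attached argument such as -dyesterdays would be rejected too. It also does nothing about someone calling the real binary by its full path, which is one more reason to treat it as a speed bump and not a fix.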

Fourth, the server software probably should decline to accept work if ntpd is not up and running and happily synced. Of course, once you do this, you probably also get to figure out a way to make clients ignore that "unhealthy" signal if too many systems start reporting it. Otherwise, the next time you have an NTP outage (but the clocks are otherwise fine), you'll also have a service outage as they all declare themselves unhealthy.
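Both halves of that fourth idea can be sketched in a few lines. Everything here is an assumption for illustration: the function names, the percentage knob, and the use of ntpstat (which needs to be installed, and which exits 0 only when it considers the clock synchronized). A real service would wire this into whatever health check its load balancer already understands.

```shell
# Server side. Assumption: ntpstat is installed; it exits 0 only when the
# clock is synchronized, so a non-zero status means "don't send me work".
ntp_synced() {
    ntpstat >/dev/null 2>&1
}

# Client side escape hatch: honor "unhealthy" reports only while they come
# from a small share of the fleet. Arguments (all hypothetical knobs):
#   $1 = servers reporting unhealthy, $2 = fleet size,
#   $3 = max percentage of the fleet allowed to opt out at once.
honor_unhealthy_reports() {
    unhealthy=$1 total=$2 max_percent=$3
    [ "$total" -gt 0 ] || return 1
    [ $(( unhealthy * 100 )) -le $(( total * max_percent )) ]
}
```

With a 10 percent limit, one unsynced box out of 50 gets pulled from rotation, but 40 out of 50 all complaining at once looks more like an NTP outage than 40 bad clocks, so the clients would keep using them.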

You get most of the wins by just implementing #1.