Writing

Software, technology, sysadmin war stories, and more.

Tuesday, June 11, 2013

Generating time_t values without touching mktime and TZ

Back in March, I wrote that "time handling is garbage", specifically referring to the hoop-jumping required to do something which should be stupidly simple: convert an unambiguous time string from ASCII to a time_t value. That is, turn "Tue, 01 Nov 2011 10:54:00 -0700" into seconds since the epoch, which should be 1320170040.

Sure, strptime() will turn that string into a "struct tm", and mktime() will allegedly turn that into a time_t, but it's not that simple. mktime honors your local time zone rules. That is, it tries to make it into 10:54 AM where you are. The "-0700" part is lost completely.
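To make that concrete, here is a minimal sketch, not code from the original program. It fills in a struct tm by hand with the fields strptime() would pull out of the example string, then calls mktime() under whatever TZ it is told to use. The helper name mktime_under_tz and the fixed-offset POSIX TZ strings in the note below are assumptions picked for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv() and tzset() */
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: treat 2011-11-01 10:54:00 as local time in the zone
 * named by tz and return whatever mktime() makes of it.
 * Not thread-safe, because it rewrites the process-wide TZ variable. */
time_t mktime_under_tz(const char *tz) {
  struct tm tm = {0};
  tm.tm_year = 2011 - 1900;  /* years since 1900 */
  tm.tm_mon = 11 - 1;        /* months are 0-based */
  tm.tm_mday = 1;
  tm.tm_hour = 10;
  tm.tm_min = 54;
  tm.tm_sec = 0;
  tm.tm_isdst = -1;          /* let the library decide about DST */

  setenv("TZ", tz, 1);
  tzset();
  return mktime(&tm);
}
```

Called with "XXX7" (a fixed UTC-7 zone) this gives 1320170040, which happens to be right. With "UTC0" it gives 1320144840, and with "YYY-9" (a fixed UTC+9 zone) it gives 1320112440. Same fields, three answers, and the "-0700" never entered into it.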

This means if you happen to test it in a place like the west coast of the US where -0700 is your offset part of the year, your code may seem just fine. Then someone else will run it and their result will be off by some number of hours. Yet another person will run it in some other part of the world and will be off by a different amount. It's insanity.

So you say, okay, what I want is mktime() that goes to UTC instead. You can dig around and discover "timegm", and then you will also find that it comes with a nice warning:

These functions are nonstandard GNU extensions that are also present on the BSDs. Avoid their use; see NOTES.

When I see "nonstandard GNU extensions", I think "run away". I don't want to go down that road. The man page goes on to suggest an alternative: save a copy of the "TZ" environment variable, then set it to "", use mktime, then restore the value of TZ.
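For reference, the workaround looks roughly like the sketch below. This is a reconstruction of the idea (save TZ, clear it, call mktime(), put it back), not a verbatim copy of the man page's code, and the name timegm_via_tz is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* setenv(), unsetenv(), tzset(), strdup() */
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical name. Converts broken-down UTC fields to a time_t by
 * temporarily forcing the process-wide time zone to UTC.
 * Deliberately NOT thread-safe: every thread shares the same TZ. */
time_t timegm_via_tz(struct tm *tm) {
  const char *old = getenv("TZ");
  char *saved = NULL;

  if (old) {
    saved = strdup(old);  /* copy it: setenv() may invalidate the pointer */
    if (!saved) return (time_t)-1;
  }

  setenv("TZ", "", 1);  /* an empty TZ is treated as UTC */
  tzset();
  time_t result = mktime(tm);

  /* Put the environment back the way we found it. */
  if (saved) {
    setenv("TZ", saved, 1);
    free(saved);
  } else {
    unsetenv("TZ");
  }
  tzset();
  return result;
}
```

Between the first setenv() and the final tzset(), anything else in the process that asks for local time sees UTC instead.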

I am not making this up. Go look at the man page for timegm on a Linux or BSD box. This includes Mac OS X! Seriously, go look. It's trippy.

What's so bad about that, you might ask? That's easy. Besides just feeling icky and wrong, it also raises the question of thread safety. How, exactly, are you supposed to frob the TZ environment variable back and forth without potentially affecting other accesses to the environment in other threads?

Okay, so let's say you change all of your {g,s}etenv() calls to use your own library, and put a mutex on it to make it thread-safe (and introduce new sources of delays). Great. You've just kept the environment from potentially being screwed up if the get/set calls were themselves non-thread-safe. But what about the actual value?

What if while thread 1 has set TZ to "", thread 2 tries to do something and winds up using that value instead of your local time? Won't it suddenly have a completely insane idea of what your time is? Or, better still, what if two of these time conversion processes start up at the same time? Maybe one of them restores TZ and then the other one goes and does its work, and oops, it just used local time instead of UTC.

So now what, do you wrap the whole time thing in its own mutex and force all calls to it to be serialized?

It just goes on and on like this:

13939 getenv("TZ")                               = NULL
13939 setenv("TZ", "UTC", 1)                     = 0
13939 mktime(0x7fffd45b3640, 0x601010, 6, 7, 1)  = 0x4eafdf40
13939 unsetenv("TZ")                             = <void>
13939 printf("%ld\n", 1320174000 <unfinished ...>
13939 SYS_write(1, "1320174000\n", 11)           = 11
13939 <... printf resumed> )                     = 11

Last night, while talking with a friend about this, I decided I'd had enough. I'm not trying to calculate DST offsets or deal with leap seconds. I'm trying to use UTC here, which has neither of those things as far as Unix is concerned. This should be a simple matter of an offset calculation.

1970-01-01 00:00:00 UTC is 0. Everything since then is some point past that, and it can be calculated based on multiples of things: normal years, leap years, months, days, hours, minutes, and seconds. It's just a matter of figuring out how many of them have elapsed since the epoch, doing the multiplication, and adding it all up.
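As a worked example of that arithmetic, here is the earlier string, "Tue, 01 Nov 2011 10:54:00 -0700", which is 2011-11-01 17:54:00 UTC once the offset is applied. The function name worked_example is made up; it exists only to hold the sums.

```c
#include <stdint.h>

/* Worked example: seconds since the epoch for 2011-11-01 17:54:00 UTC. */
uint64_t worked_example(void) {
  /* 41 whole years have elapsed (1970 through 2010), and 10 of them were
   * leap years (1972, 1976, ..., 2008). */
  uint64_t days = 41 * 365 + 10;  /* 14975 */

  /* Whole months elapsed in 2011 (not a leap year), January to October:
   * 31+28+31+30+31+30+31+31+30+31 = 304 */
  days += 304;                    /* 15279 */

  /* Whole days elapsed in November before the 1st: none. */
  days += 1 - 1;

  return days * 86400             /* seconds per day */
       + 17 * 3600                /* hours */
       + 54 * 60                  /* minutes */
       + 0;                       /* seconds */
}
```

That comes out to 15279 * 86400 + 64440 = 1320170040, the value the string is supposed to produce.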

After poking around in glibc and the BSD sources and even the IANA tzcode stuff, I decided to write my own library. I know that time handling is hard. I know that I probably am going to make mistakes. I know it can go horribly wrong. But, it's the only way to possibly find a way out of this Unix time mess, so I went and did it.

A couple of hours later, I had a simple little function.

time_t time_to_utc(uint64_t year, uint64_t month, uint64_t day,
                   uint64_t hour, uint64_t minute, uint64_t second);

Yep, I used unsigned 64 bit values throughout. I did this whole thing with a distinct lack of caring for optimization. I was focusing on correctness: getting all of the offsets right, and handling all of the different rules which apply, like the whole 4/100/400 thing for leap years.

I also added some rudimentary error checking. If you pass me a time before 1970-01-01 00:00:00, it will kick back a 0. Likewise, if you give me something ridiculous like "February 35th" or "December 0th" or a day in "Smarch" (you know, the 13th month...), it'll also kick back a 0.

This is not really intended to be used for error checking. It's up to the callers to give me sane values. Perhaps I'll write an "extended" version which runs checks, returns a bool and sets a time_t*, and calls time_to_utc internally. Whatever.
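The actual function isn't shown here, so the following is only a sketch of one way to meet the description: same signature, the 4/100/400 leap rule, and a 0 for anything before 1970 or out of range. The is_leap helper, the loop-based day counting, and rejecting seconds above 59 are all assumptions, not the real implementation.

```c
#include <stdint.h>
#include <time.h>

/* The 4/100/400 rule: every 4th year, except centuries, except every 400th. */
static int is_leap(uint64_t y) {
  return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
}

/* Sketch only. Returns 0 for pre-1970 or nonsensical input, which means a
 * caller cannot tell an error from the epoch itself. Assumes time_t is wide
 * enough to hold the result. */
time_t time_to_utc(uint64_t year, uint64_t month, uint64_t day,
                   uint64_t hour, uint64_t minute, uint64_t second) {
  static const uint64_t days_in_month[12] = {31, 28, 31, 30, 31, 30,
                                             31, 31, 30, 31, 30, 31};

  if (year < 1970 || month < 1 || month > 12) return 0;

  uint64_t month_len = days_in_month[month - 1];
  if (month == 2 && is_leap(year)) month_len = 29;
  if (day < 1 || day > month_len) return 0;
  if (hour > 23 || minute > 59 || second > 59) return 0;

  /* Whole years since 1970. Slow for huge years, but simple to check. */
  uint64_t days = 0;
  for (uint64_t y = 1970; y < year; y++) days += is_leap(y) ? 366 : 365;

  /* Whole months elapsed in the current year. */
  for (uint64_t m = 1; m < month; m++) {
    days += days_in_month[m - 1];
    if (m == 2 && is_leap(year)) days += 1;
  }

  /* Whole days elapsed in the current month. */
  days += day - 1;

  return (time_t)(days * 86400 + hour * 3600 + minute * 60 + second);
}
```

Counting years one at a time is about as unoptimized as it gets, but every term in the sum can be checked by hand.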

Then I started testing this. I decided to start with the POSIX.1 reference point which states that 536457599 is 1986-12-31 23:59:59 UTC. The test is simple enough, and looks like this:

  EXPECT_THAT(time_to_utc(1986, 12, 31, 23, 59, 59), 536457599);

It works. So then I threw it a fun time which happened not too long ago and actually is used in fred as a placeholder for naughty posts with no apparent time value:

  time_t utc = time_to_utc(2009, 2, 13, 23, 31, 30);
  EXPECT_THAT(utc, 1234567890);

Right, so far, so good. I went on from there and had it check dates in 2000, 2100, 2200, 2300, and 2400. Yep, at this point, it's well beyond what a 32 bit time_t can represent. You might be aware of the coming signed 32 bit time_t apocalypse in 2038. Well, even an unsigned 32 bit time_t runs out in 2106.

I purposely test values right on those marks:

  EXPECT_THAT(time_to_utc(2038, 1, 19, 3, 14, 7), 2147483647);
  EXPECT_THAT(time_to_utc(2038, 1, 19, 3, 14, 8), 2147483648);
  EXPECT_THAT(time_to_utc(2106, 2, 7, 6, 28, 15), 4294967295);
  EXPECT_THAT(time_to_utc(2106, 2, 7, 6, 28, 16), 4294967296);

They're also fine.
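Those four numbers are just the edges of the 32 bit ranges. A tiny sketch makes that explicit; the two helper names are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Returns 1 if t could be stored in a signed 32 bit time_t. */
int fits_signed_32(int64_t t) {
  return t >= INT32_MIN && t <= INT32_MAX;
}

/* Returns 1 if t could be stored in an unsigned 32 bit time_t. */
int fits_unsigned_32(int64_t t) {
  return t >= 0 && t <= UINT32_MAX;
}
```

2147483647 is INT32_MAX and 4294967295 is UINT32_MAX, so in each pair of tests the second value is the first one that no longer fits.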

So, next, I wrote a "fuzzer". It seeds the C PRNG from /dev/random on my machine and then starts grabbing values from random(). It makes up years anywhere between 1000 and 3000, months between 0 and 19, days between 0 and 40, and so on. Yes, many of those will be invalid. That's the whole point. I want to make sure they get rejected.

I also take the generated date and feed it to GNU date, doing something like this:

date -d "(date string)" +%s

Then I take the number it returns and compare it to the one my function generates. If it differs, it bombs.
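A rough sketch of those pieces follows. It is not the real fuzzer: the struct and function names are hypothetical, the hour, minute, and second ranges are guesses (only the year, month, and day ranges are stated above), and the -u flag is added here on the assumption that date should read the string as UTC.

```c
#define _XOPEN_SOURCE 700  /* for random(), srandom(), popen(), pclose() */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical container for one generated (possibly invalid) date. */
struct fuzz_date {
  uint64_t year, month, day, hour, minute, second;
};

/* Seed the C PRNG from /dev/random. Returns 0 on success, -1 on failure. */
int seed_from_dev_random(void) {
  unsigned int seed;
  FILE *f = fopen("/dev/random", "rb");
  if (!f) return -1;
  setvbuf(f, NULL, _IONBF, 0);  /* read only the bytes we need */
  size_t got = fread(&seed, sizeof seed, 1, f);
  fclose(f);
  if (got != 1) return -1;
  srandom(seed);
  return 0;
}

/* Inclusive range [lo, hi]. Modulo bias does not matter for fuzzing. */
static uint64_t pick(uint64_t lo, uint64_t hi) {
  return lo + (uint64_t)random() % (hi - lo + 1);
}

/* Make up a date, valid or not. */
struct fuzz_date make_fuzz_date(void) {
  struct fuzz_date d;
  d.year = pick(1000, 3000);
  d.month = pick(0, 19);
  d.day = pick(0, 40);
  d.hour = pick(0, 29);    /* assumed range */
  d.minute = pick(0, 69);  /* assumed range */
  d.second = pick(0, 69);  /* assumed range */
  return d;
}

/* Ask GNU date for the epoch value of d. Returns 1 and sets *out if date
 * printed a number, 0 if it rejected the string or could not be run. */
int ask_gnu_date(const struct fuzz_date *d, long long *out) {
  char cmd[160];
  snprintf(cmd, sizeof cmd,
           "date -u -d '%04llu-%02llu-%02llu %02llu:%02llu:%02llu' +%%s "
           "2>/dev/null",
           (unsigned long long)d->year, (unsigned long long)d->month,
           (unsigned long long)d->day, (unsigned long long)d->hour,
           (unsigned long long)d->minute, (unsigned long long)d->second);
  FILE *p = popen(cmd, "r");
  if (!p) return 0;
  int ok = (fscanf(p, "%lld", out) == 1);
  pclose(p);
  return ok;
}
```

A driver loop would call make_fuzz_date(), run the result through both ask_gnu_date() and the function under test, and stop on the first mismatch. How to score dates that date rejects, or that fall before 1970, is left to that loop.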

I ran it a few hundred thousand times. It didn't bomb. It looks like this:

2931-05-05 00:33:09 = 30336942789 [OK]
2214-11-06 20:05:12 = 7726651512 [OK]
2661-11-06 06:38:02 = 21832612682 [OK]
2508-12-09 04:27:08 = 17007251228 [OK]
2375-11-07 05:08:24 = 12807349704 [OK]
2832-09-03 02:46:10 = 27223353970 [OK]
2983-02-25 12:47:17 = 31972020437 [OK]

This little function has no dependencies on /etc/localtime, /usr/share/zoneinfo, or anything else of the sort. It's just a bunch of hard-coded constants (days in a year, days in a month, months in a year, seconds in a day, etc.) and a whole bunch of addition.

It only includes time.h so as to have a time_t definition, and it includes stdint.h to get definitions for uint64_t. Right now it has stdio.h included for debugging prints, but I'll be removing them shortly.

I'm pretty happy with it.

Want to see it? I'm going to put it into fred.

Support the Kickstarter and if funded, it'll become open source. Then the whole world can kick the tires and see how it holds up under pressure.


July 5, 2013: This post has an update.