Software, technology, sysadmin war stories, and more.
Thursday, January 3, 2013

Catching date-related failures before they become critical

There's an interesting date-related bug in the news this week. It seems that Apple's "Do Not Disturb" feature on iPhones and similar devices isn't going to work normally until January 7th rolls around. Snippets of code found here and there suggest it comes from using "YYYY" instead of "yyyy" when formatting dates, and the ISO week-year craziness which results.

My usual reaction to such bugs after I stop chuckling is to wonder if I might have done the same thing anywhere, or if it's just a matter of time until some of my code trips over the same sort of problem. Inevitably, my thoughts lead to things like unit tests, but it's clear that's not a full solution for something like this. It's actually an interesting thought problem to build up to something which could make this not happen.

Let's say I had this same problem in my code. First, I'd need to have a unit test which covered that case and would actually uncover this kind of situation. Then I'd have to make sure it actually gets run. What good is a test which never gets a chance to be executed?

Assuming this situation, then once January 1st rolled around, if I happened to run the test, it would tell me I now had a problem. This might be mildly useful to alert me to something that I might not have discovered through normal use or by reading user feedback. It's better than nothing but it isn't great.

The problem with that setup is that I might not run any given test for many days. Maybe I'm working on another part of the code and am only running those tests regularly, or I'm on vacation and I'm not touching it at all for a week or two. Clearly, there is room for improvement.

The next thing to do here is to make sure tests like this are run on a regular basis and without any humans involved. Some sort of automated process will need to give them a spin every so often to make sure everything stays happy. It also needs to let someone know when the tests fail so they don't just sit there unnoticed in a broken state.

With this change in place, shortly after the clock rolled over to 2013, the test would fail and someone would be notified. This is definitely better than relying on manual runs, but it still hits me at the same time that it hits real users. I might be able to react to it sooner, but they are still going to be affected.

The next step is where things get a little bizarre. In order to stay ahead of the clock, I need to make sure my tests run in the future - as far as they are concerned, that is. Maybe I'd make everything offset by 90 days so that I have a whole quarter's worth of warning when something really bad happens. Running my test right now would make it think the current date was April 3rd, 2013! Likewise, this bug would have been triggered back on October 3, 2012, giving plenty of time to figure it out.

There is the question of how to make this work. Playing around with the system clock is right out. Trying to keep the machine from going insane would be mighty painful. It seems like the most reasonable way would be to abstain from calling things like gettimeofday() and time() in my top-level code and instead rely on a common library. That library would be responsible for calling those functions most of the time. This level of indirection would give me the opportunity to slip in a testing rig when necessary.

For the purposes of my "run in the future" test, I'd replace that time utility library with one which offset all of its "now" calculations by a specified interval. Then I'd just crank that interval up to 90 days and let it fly. It would return adjusted numbers to my test code, and it would carry on as if that really was the current date.

This sort of thing seems like a candidate for a good weekend project: set up a clock abstraction, get your code to use it instead of being wired directly to the real clock, write tests which exercise time-sensitive features, use a fake clock to push them into the future, get those tests running automatically, and then have them phone home when they fail.

How hard could it be?