Friday, March 1, 2013

Debugging script's timing output

Lots of things in the world of free software are a mixed bag. On the one hand, it's free, and you have total access to the code to do whatever you want with it, including fixing it when it's broken. On the other hand, it's still code, so it tends to break just as much as anything else, and you wind up having to fix it as a result. The "script" utility is my target today.

If you are running Linux, chances are you have script by way of the util-linux package. It also exists on other systems like OS X, but it may have a completely different origin. Today, I am specifically looking at the script from util-linux, and even then, just at one feature: the "timing output" switch.

The man page explains the "-t" switch as follows:

Output timing data to standard error. This data contains two fields, separated by a space. The first field indicates how much time elapsed since the previous output. The second field indicates how many characters were output this time. This information can be used to replay typescripts with realistic typing and output delays.

That sounds pretty cool, right? You can make recordings and capture all of the little nuances which went into the session, like when someone paused to think for a moment. It's a great way to see that whoever made that recording is in fact human.

There's a snag, though: the output from this feature isn't entirely sane. I had to work through this the other day, and decided to document my findings for the benefit of anyone else who might run into this down the road.

Yesterday, I posted a "sneak peek" at this problem. In it, I showed a simple "echo" command running inside script with very strange timing. According to the timing data, it would take up to one full second to emit just five letters! This is completely incorrect since my machine was never that slow. script is giving us bad data.

You can reproduce this by running the command a bunch of times. In the output, you should see the time increasing from roughly 0.000000 to 0.999999 and then starting over. What it's actually showing you is how far along you are in the current second of the clock. It's not giving you any other usable timing information in that first blob.

Here's why, found at line 315 in the script.c I have from util-linux 2.19:

double oldtime=time(NULL), newtime;

Upon entry to the dooutput() function, which is the real workhorse for script's logging, it initializes "oldtime" with a value from time(). Even though it's storing the value into a double, which could hold fractional seconds, time() only returns whole seconds. As a result, the value stored is something like 1362172012.000000 and not 1362172012.128947.
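If you want to see the gap for yourself, here's a little standalone program (mine, not part of script) which prints the same moment as reported by time() and by gettimeofday():

#include <stdio.h>
#include <time.h>
#include <sys/time.h>

int main(void) {
  struct timeval tv;
  time_t t = time(NULL);      /* whole seconds only */

  gettimeofday(&tv, NULL);    /* seconds plus microseconds */
  printf("time():         %f\n", (double) t);
  printf("gettimeofday(): %f\n", tv.tv_sec + (double) tv.tv_usec / 1000000);
  return 0;
}

The first line always ends in .000000, and the second one almost never does. That missing fraction is what the first bug is all about.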

Using time() here would almost be okay if the code also used time() later on to get the other value for the subtraction. You'd only have one second of resolution, and really short intervals would usually come out as 0, but at least it would be consistent. Instead, the code does something else down at line 338, inside the main loop:

gettimeofday(&tv, NULL);

Here, it socks away a copy of the clock from gettimeofday(). That function stores both the seconds (like time(), above) and the microseconds from the system clock in a "struct timeval" called "tv". We'll say it stored 1362172012 for seconds and 888888 for microseconds.
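For reference, a "struct timeval" is just those two fields, declared in <sys/time.h> roughly like this:

struct timeval {
  time_t      tv_sec;   /* whole seconds */
  suseconds_t tv_usec;  /* microseconds, 0 through 999999 */
};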

Next, at line 341, it calls read() to wait for input from the session. The amount of time we spend in this read() is what we care about.

After that returns, it uses the value in "tv" to create "newtime" at line 351:

newtime = tv.tv_sec + (double) tv.tv_usec / 1000000;

This takes the whole number of seconds and whole number of microseconds and turns it into a single value. This means newtime is now (approximately) equal to 1362172012.888888. Remember though, oldtime is actually 1362172012.000000, or very close to it. Look what happens next at line 352:

fprintf(stderr, "%f %zd\n", newtime - oldtime, cc);

Yep, it subtracts the little one from the big one, and .888888 pops out. It should have actually gotten .759941, since timing really started at 1362172012.128947, but by using time(), it sliced off that part of the clock and threw it away.
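Here's a tiny standalone sketch of that arithmetic, using the made-up example values from above instead of the live clock:

#include <stdio.h>

int main(void) {
  double oldtime = 1362172012.000000;   /* from time(): fraction already gone */
  long tv_sec = 1362172012;             /* what gettimeofday() put in tv */
  long tv_usec = 888888;
  double newtime = tv_sec + (double) tv_usec / 1000000;

  printf("%f\n", newtime - oldtime);    /* prints 0.888888, not the real delay */
  return 0;
}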

The good news is that this particular bug only bites us once, since the next thing that happens is that newtime becomes oldtime at line 353:

oldtime = newtime;

Now oldtime is 1362172012.888888.

The loop starts over. We actually just tripped over another bug while going through this loop for the first time. Did you spot it?

Allow me to demonstrate the impact of the second bug. Here's the test case. I am going to ask script to run the following sequence:

echo -n 1; sleep 1; echo -n 1; sleep 1; echo -n 1

That sequence should write three characters (echo -n means no newline, so each echo prints just a "1"), and it should take about two seconds. It might come out slightly over two seconds, but it should clearly be right around two seconds.

Let's try it and see what happens:

$ script -t -c "echo -n 1; sleep 1; echo -n 1; sleep 1; echo -n 1" >/dev/null 2>t
$ cat t
0.318966 1
0.002778 1
0.999991 1

We know the first number is going to be crazy due to the first bug, so it can be ignored until we fix it. But hey, what the heck is going on with that second number? Why is it so low?

The answer has to do with where it's reading the clock. The function boils down to this pseudocode when all of the cruft is removed. I've put the line numbers of the matching original code in [brackets] next to the pseudocode:

dooutput() {
  set oldtime from time()    [315]
 
  // (do other initialization stuff.  nothing big.)
 
  loop forever {
    set tv from gettimeofday()    [338]
 
    call read() and wait until something happens    [341]
 
    set newtime from tv    [351]
 
    subtract oldtime from newtime and print result    [352]
 
    copy value of newtime into oldtime    [353]
  }
}

Looking at it this way, it should start making sense. We read the clock on the way into the function. We then do some initialization stuff once, and start the main loop. Once inside the loop, we immediately read the clock... and then go to sleep inside of read(). This is the problem.

What we've actually just done here is measured how long it takes to go from starting the function to starting the loop. Also, because of the first bug (using time() and not getting microseconds as a result), we're not even doing it correctly!

The first time through the loop, oldtime and tv (and thus newtime) are almost identical. Then newtime is copied over oldtime. That means on the second time through the loop, oldtime is whatever the clock was before the first read(). tv and newtime, meanwhile, have whatever the clock was before the second read. Then, that second read happens, taking some amount of time, and we do our subtraction again and get a completely insane number.

Let's unroll the loop to see why.

  1. Function begins. Oldtime initialized (badly) from time().
  2. Loop begins: first time through.
  3. tv is set from gettimeofday().
  4. read() is called (first time). We sit here for some amount of time.
  5. newtime is calculated and used in a subtraction.
  6. oldtime is updated.
  7. Loop begins: second time through.
  8. tv is set from gettimeofday().
  9. read() is called (second time). We sit here for some amount of time.
  10. newtime is calculated and used in a subtraction.
  11. oldtime is updated.

The first time through the loop, we're measuring the time from step #1 to step #3, even though what we care about is the time spent in step #4 (read)!

The second time through the loop, we've measured the amount of time from step #3 to step #8, which happens to include the first read(). What we actually care about is how much time we spent in the second call to read() in step #9!

It's like if I asked you to look at your watch. Then look at your watch again. Then do something. Then tell me how long it took to do it. You couldn't tell me, because you didn't measure the time afterward! But, hey, if I have you do it one more time, you can give me a value, but it'll be delayed one pass. All of your answers will be offset by one.

That's exactly what's going on here: a persistent off-by-one error. What's really crazy is that whoever wrote this thing also wrote a playback tool called "scriptreplay", and ... it adjusts for the second bug. Seriously. Look at this paraphrasing of the code:

size_t oldblk = 0;
 
// ... init stuff ...
 
for (...) {
  size_t blk;   // set to second number from "script" output
 
  if (oldblk) {
    emit(sfile, sname, oldblk);
  }
 
  oldblk = blk;
}

What this code basically does is simple: the first time through, oldblk is zero (from being initialized), so it's ignored. Then it stores the value from blk (the byte count) in oldblk. The next time through the loop, it uses that value. It actually shifts the values back one step to make them match up with the broken logging done by script.
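For contrast, here's a rough sketch of what a replay loop could look like if the timing data weren't shifted: read a "delay count" pair, sleep for the delay, then copy that many bytes out of the typescript. This is my own illustration, not the real scriptreplay; among other simplifications, it ignores error handling and the "Script started" header line that sits at the top of a real typescript.

#include <stdio.h>
#include <time.h>

void replay(FILE *timing, FILE *typescript) {
  double delay;
  size_t cc;
  char buf[BUFSIZ];

  while (fscanf(timing, "%lf %zu", &delay, &cc) == 2) {
    /* wait out the recorded delay, fractional seconds and all */
    struct timespec ts;
    ts.tv_sec = (time_t) delay;
    ts.tv_nsec = (long) ((delay - ts.tv_sec) * 1000000000.0);
    nanosleep(&ts, NULL);

    /* then emit exactly that many bytes from the typescript */
    while (cc > 0) {
      size_t n = cc < sizeof(buf) ? cc : sizeof(buf);
      if (fread(buf, 1, n, typescript) != n)
        return;
      fwrite(buf, 1, n, stdout);
      fflush(stdout);
      cc -= n;
    }
  }
}

No saved "oldblk" and no skipping of the first record: with sane input, the loop body just does what the data says.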

The final bit of craziness here is that as far as I can tell, both the timing patch to script and the scriptreplay tool were merged into the current util-linux git tree as part of a massive import from the "util-linux-2.11m tarball" in December 2006, and they've been like this ever since. I haven't been able to track the history back before that yet.

So, why do I know anything about this? Easy. I had to reproduce this "delay for one pass" insanity when I wrote my little replayer thing for the web a couple of weeks ago. I wondered what was going on at the time but didn't have the chance to decode it. The other night, I didn't have the option of ignoring it any more, and so I started investigating. Two or three hours later, I had uncovered all of what you see above.

What to do? Well, fix the bugs, of course. Tune in next time for the details of how all of that is handled.
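In the meantime, here's roughly what a corrected logging loop could look like: take the initial reading from gettimeofday() so it keeps its fractional part, and read the clock again after read() returns so the delta covers the wait we actually care about. This is just a sketch of the idea with my own names (log_timing, master, timingfd), not necessarily what the actual fix will look like:

#include <stdio.h>
#include <sys/types.h>
#include <sys/time.h>
#include <unistd.h>

void log_timing(int master, FILE *timingfd) {
  struct timeval tv;
  char buf[BUFSIZ];
  ssize_t cc;
  double oldtime, newtime;

  gettimeofday(&tv, NULL);                 /* fractional seconds from the start */
  oldtime = tv.tv_sec + (double) tv.tv_usec / 1000000;

  while ((cc = read(master, buf, sizeof(buf))) > 0) {
    gettimeofday(&tv, NULL);               /* read the clock AFTER the wait */
    newtime = tv.tv_sec + (double) tv.tv_usec / 1000000;
    fprintf(timingfd, "%f %zd\n", newtime - oldtime, cc);
    oldtime = newtime;
    /* (the real dooutput() also writes buf to the typescript at this point) */
  }
}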


March 2, 2013: This post has an update.