Writing

Software, technology, sysadmin war stories, and more.

Wednesday, June 12, 2024

Can you run in a tight loop and still be well-behaved?

Timing things to happen at specific intervals is yet another way that we collectively find out that dealing with time is a hard problem. I've been noticing this while working on feed reader stuff, and I realized that it can apply to other problems.

It goes like this: say you want to have a process that runs at most once an hour. You are okay with it taking a little more than an hour between runs, but really don't want to go faster than that. Maybe you have an arrangement with a service provider to not poke them too often. Whatever.

So maybe you rig something up using cron, and it looks like this:

15 * * * * /home/me/bin/do_something

Then, every hour, at 15 minutes past, cron will run your program. Unfortunately, this by itself is not nearly enough to deliver on your arrangement. It's not even the problem you might imagine at first, which is that system clocks can be sloppy and can get pulled around by external forces.

Nope, this has to do with the time it takes to actually do the work, and not accounting for that when allowing the work to proceed again.

Back to our cron job. We'll say it gets installed at midnight, so 15 minutes later at 00:15:00, it starts a run. Maybe it does a lot of work and talks to many sites over the Internet. Some of them respond quickly, but others are slow. Maybe their DNS is taking forever to resolve the hostnames. Maybe another site is offline and is just dropping packets, so you sit there until a timeout fires on your end. It burns a good minute doing this.

At 00:16:00, it finally gets around to doing the "once an hour" work, and it happens relatively quickly. Then it finishes and goes to sleep.

About an hour later at 01:15:00, cron will run your program again. This time, maybe all of the earlier work happens much more quickly, and all of it completes in 15 seconds. That means you get around to your "once an hour" work at 01:15:15.

Oops. You were supposed to wait at least 3600 seconds - that's one hour - between requests, but you just ran it after only 3555 seconds.
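To make the arithmetic concrete, here is a small Python sketch of the two runs described above. The calendar date is arbitrary (only the times of day come from the example), and the variable names are just for illustration.

```python
from datetime import datetime

# Times at which the rate-limited work actually happened in the example.
# The date itself is an arbitrary placeholder; only the clock times matter.
first_action = datetime(2024, 6, 12, 0, 16, 0)    # slow run: started 00:15:00
second_action = datetime(2024, 6, 12, 1, 15, 15)  # fast run: started 01:15:00

gap = (second_action - first_action).total_seconds()
shortfall = 3600 - gap

print(gap)        # 3555.0
print(shortfall)  # 45.0
```

Both runs started exactly an hour apart, yet the actions landed 45 seconds too close together.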

The problem is that you can't just rely on the start time of your program to know if enough time has elapsed since it last did some work that is supposed to be rate-limited. You have to actually track the time when the work *was attempted*, and then do the math of "elapsed = now - then" to see if enough time has gone by.
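The "elapsed = now - then" check might look something like the following in Python. This is only a sketch: `may_proceed` and `MIN_INTERVAL` are made-up names, and treating "no previous attempt on record" as "go ahead" is an assumption rather than anything required by the approach.

```python
import time

MIN_INTERVAL = 3600  # seconds; the "once an hour" arrangement (assumed value)

def may_proceed(last_attempt, now=None, min_interval=MIN_INTERVAL):
    """Return True if enough time has gone by since the last attempt.

    last_attempt is a Unix timestamp, or None if there is no record yet.
    """
    if last_attempt is None:
        return True  # assumption: nothing on record means it's fine to go
    if now is None:
        now = time.time()
    elapsed = now - last_attempt
    # If the clock was pulled backwards, elapsed can be negative and this
    # says "not yet", which errs on the side of being polite.
    return elapsed >= min_interval
```

With the numbers from the example, `may_proceed(then, now=then + 3555)` comes back False, and it only flips to True once a full 3600 seconds have passed.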

I tend to think of the timeline for this sort of thing as a series of fenceposts, like this:


start       action      end     (rest of the hour here)
    |          |        |
    v          v        v
----*----------*--------*---------------------------------------->

To avoid violating rate limits, you have to time things from when the action happens, not when the program starts up. If you want to really be paranoid about it, then you'll want to time it from when the program is all done with its work and is about to shut down (but this is a lot harder).

What ends up being much easier is to just remember whenever the work last started and/or finished, even if it didn't succeed. The program should never select a target for refreshing until that target has been left alone for long enough. The program must never assume "well, I'm running again, so it must be time to do my thing". What if the box just rebooted, or any of a number of other possibilities? What then?
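One way to remember the last attempt across restarts and reboots is to keep the timestamp in a small file and write it before doing the work, so a failed or crashed attempt still counts. The sketch below assumes a single process, a writable state file path chosen by the caller, and hypothetical helper names (`read_last_attempt`, `record_attempt`, `run_once`). It also assumes a missing or unreadable file means "no attempt yet", which a stricter program might not want.

```python
import os
import time

def read_last_attempt(path):
    """Return the stored Unix timestamp, or None if there is no usable record."""
    try:
        with open(path) as f:
            return float(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None

def record_attempt(path, when):
    """Write the timestamp via a temp file and rename, to avoid a torn write."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(repr(when))
    os.replace(tmp, path)

def run_once(path, action, min_interval=3600, now=None):
    """Call action() only if min_interval seconds have passed since the last
    recorded attempt. Returns True if the action was attempted."""
    now = time.time() if now is None else now
    last = read_last_attempt(path)
    if last is not None and now - last < min_interval:
        return False  # too soon, no matter why we happen to be running
    record_attempt(path, now)  # recorded first, so a failure still counts
    action()
    return True
```

Because the timestamp is written before `action()` runs, an exception in the action leaves the record in place, and the next start-up will still wait out the interval.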

Here's an easy way to know if a program is on the right track: could it be run in a tight loop without causing a giant mess for other people?

$ while true; do run-my-stuff; done

If you can run something in a loop like that and not have it beat the crap out of whatever it's supposed to periodically talk to, then you're probably headed in the right direction. It also means that if the program gets into a start-crash-restart loop some day, maybe it won't unleash a hellstorm on whatever it happens to talk to.
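As a rough illustration of why this works, here is a toy simulation in Python with a fake clock standing in for real time. Everything in it (`simulate_tight_loop`, the one-second step, the in-memory timestamp) is made up for the example; a real program would keep the timestamp somewhere that survives restarts.

```python
def simulate_tight_loop(total_seconds, step, min_interval=3600):
    """Pretend to start the program every `step` seconds for `total_seconds`,
    and return the fake-clock times at which the action was actually allowed."""
    last_attempt = None  # stands in for state that would persist between runs
    action_times = []
    now = 0
    while now < total_seconds:
        if last_attempt is None or now - last_attempt >= min_interval:
            last_attempt = now
            action_times.append(now)
        now += step
    return action_times

# Started once a second for a bit over two hours: 7201 launches in total.
print(simulate_tight_loop(total_seconds=2 * 3600 + 1, step=1))  # [0, 3600, 7200]
```

Thousands of launches, three actions, each at least an hour apart. The other end never notices the loop.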

Running a program in an infinite loop like that might chew a lot of resources on the local machine, but that's (relatively) okay. It's your machine. Feel free to burn your own resources. Where it becomes troublesome is when it reaches out and starts burning those of other people.

As usual, the details are important here.