Software, technology, sysadmin war stories, and more.
Monday, January 14, 2019

Populating an oncall rotation

It seems to be really hard to manage the population of people within an on-call rotation. I'm specifically talking about the technical part where you say that this human receives the pages for service X from time Y to time Z.

I've seen it done two major ways. One is where you just have a plain text file, more or less, and people put themselves in.

Let's say you have a company here in the valley that's using US/Pacific as their time base for everything, and a three-way split for oncall: west coast AM, west coast PM, and Europe AM. You end up with something like this:

01/14 09:00 cheetah
01/14 16:00 tiger
01/15 01:00 lion
01/15 09:00 cheetah
01/15 16:00 tiger
01/16 01:00 lion

You get the idea: it's US-style MM/DD, then the hour and minute relative to US west coast local time, and the account name of the person who's going to be on call. (I used big cats for the engineers. Close enough.)

The thing which reads this file is pretty simple. Now and then it wakes up and says "okay, which is the last line in this file which has not been obsoleted by another line in this file". When it's 12:00 on the 14th, that means "cheetah" is active, and "tiger" is not on the hook yet.

This also means that if the humans stop maintaining this file, "lion" will be on call FOREVER. It has another side-effect: short of emptying the file, there will always be someone listed as being on call, even if they forget to update it.
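A minimal sketch of that reader in Python might look like the following. The function names are made up for illustration, and since the file carries no year, the caller has to supply one (that's an assumption on my part, as is treating every timestamp as naive US/Pacific local time).

```python
from datetime import datetime

SCHEDULE_TEXT = """\
01/14 09:00 cheetah
01/14 16:00 tiger
01/15 01:00 lion
01/15 09:00 cheetah
01/15 16:00 tiger
01/16 01:00 lion
"""

def parse_schedule(text, year):
    """Turn 'MM/DD HH:MM name' lines into sorted (start, name) pairs.

    The file has no year in it, so the caller passes one in (assumption).
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        date_part, time_part, who = line.split()
        start = datetime.strptime(f"{year}/{date_part} {time_part}",
                                  "%Y/%m/%d %H:%M")
        entries.append((start, who))
    return sorted(entries)

def current_oncall(entries, now):
    """Return whoever owns the latest line whose start time has passed."""
    active = None  # stays None only if 'now' is before the very first line
    for start, who in entries:
        if start > now:
            break
        active = who
    return active

entries = parse_schedule(SCHEDULE_TEXT, 2019)
print(current_oncall(entries, datetime(2019, 1, 14, 12, 0)))  # cheetah
print(current_oncall(entries, datetime(2019, 6, 1, 0, 0)))    # lion, months later
```

The second lookup is the "FOREVER" case: nothing after the last line ever supersedes it, so lion keeps the pager until a human adds more lines.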

That, then, is the rub with this system. The humans have to periodically wheel and deal and figure out who's doing what in terms of days, and times, and shifts, and all of this. In my own experience using this system, it was worked out initially with a wiki page that anyone could edit, and then once people were happy, it was flushed to a text file like the one shown above.

As long as our little wiki-to-text dance happened regularly, then the rotation kept happening and everyone had their turn wearing the pager.

The other big way I've seen this done is where you define a "rotation" to just be a whole list of names, and then you pick when a shift ends and the next begins. Maybe you have the same three people as above (cheetah, tiger, lion) and the shifts are one week long and "turn over" on Monday at 2 PM.

You type this in and say "go". It grabs the first person from the list and says, okay cheetah, you're up. Then it extrapolates along from that and says tiger will be up next week and lion gets it on week three. Then it's back around to cheetah.

people: cheetah, tiger, lion
start/end: Mon 2P

week 1: cheetah
week 2: tiger
week 3: lion
week 4: cheetah
week 5: tiger
week 6: lion
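The extrapolation is presumably nothing fancier than a modulo over the list. Here's a sketch under that assumption. The function names are hypothetical, and a real paging service may well compute this differently.

```python
def oncall_for_week(people, week):
    """Pick the owner of a 1-based week number by walking the list in a loop."""
    return people[(week - 1) % len(people)]

def extrapolate(people, weeks):
    """Compute the owner of each of the first 'weeks' weeks, on the fly."""
    return [oncall_for_week(people, w) for w in range(1, weeks + 1)]

rotation = ["cheetah", "tiger", "lion"]
for week, who in enumerate(extrapolate(rotation, 6), start=1):
    print(f"week {week}: {who}")
```

Note that nothing is stored here except the list itself. Every answer is derived from the list as it exists at the moment you ask.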

For a while, this works well enough. The people make plans based on when they will and won't be available. They book flights to go to exotic places where they can relax and forget about being on call in the first place.

Then life happens and things get interesting. The person behind "tiger" goes out on parental leave. They are dropped from the rotation. cheetah is still on call right now, but now *lion* gets it next week. Worse still, cheetah comes back up AGAIN on week three, instead of coming up on week four.

people: cheetah, lion
start/end: Mon 2P

week 1: cheetah
week 2: lion
week 3: cheetah
week 4: lion
week 5: cheetah
week 6: lion

Obviously, this can't stand, so someone else is added to the rotation. Trouble is, they are added at the end of the list. Now you get this:

people: cheetah, lion, lynx
start/end: Mon 2P

week 1: cheetah
week 2: lion
week 3: lynx
week 4: cheetah
week 5: lion
week 6: lynx

In this case, sure, cheetah ends up back at week 4, but now lion has jumped around all over the place. They are 100% offset from both their original on times and their original off times, and any plans they had are ruined.
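To make the damage concrete, here's a small sketch that compares which weeks each person owns before and after the list changes. It assumes the same simple modulo-over-the-list calculation, and the helper name is invented for this example.

```python
def weeks_for(person, people, weeks):
    """Return the set of 1-based week numbers this person is on call."""
    return {w for w in range(1, weeks + 1)
            if people[(w - 1) % len(people)] == person}

original = ["cheetah", "tiger", "lion"]
reshuffled = ["cheetah", "lion", "lynx"]  # tiger dropped, lynx added at the end

for person in ("cheetah", "lion"):
    before = weeks_for(person, original, 6)
    after = weeks_for(person, reshuffled, 6)
    print(person, sorted(before), "->", sorted(after),
          "kept:", sorted(before & after))
```

cheetah keeps weeks 1 and 4. lion goes from weeks 3 and 6 to weeks 2 and 5, with no overlap at all.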

What happens next probably involves a lot of manual fiddling with the list, trying to wrangle people around into just the right order, or countless surgical "overrides" for specific times and days, and worse.

It seems like the better thing to do would be to start with a list, and then periodically (once a quarter, say), use it to build a proposed schedule. This is where you actually take the list, run through it, and assign people sequentially to the shifts. Then you give them all a chance to look at it and do trades and whatever else to get the time they need. After a reasonable amount of time has elapsed, then it gets committed as the schedule.

In this world, you still have to deal with exceptions, but you get to do it without uprooting the entire schedule and causing havoc for everyone on the team. Since it's no longer calculated on the fly, you can safely remove someone from the list, or add a new member, or even swap members around. It won't have any consequences until the next quarter when it's time to build another proposed schedule.
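A rough sketch of that workflow: build a proposal from the list, let people trade, then freeze it. Everything here is an assumption for illustration, including the function names, the 13 one-week shifts standing in for a quarter, and the Monday 2 PM start date.

```python
from datetime import datetime, timedelta
from itertools import cycle

def build_proposed(people, first_start, shifts,
                   shift_length=timedelta(weeks=1)):
    """Assign people to consecutive shifts, in list order, one time only."""
    rotation = cycle(list(people))  # copy, so later list edits can't leak in
    return [(first_start + i * shift_length, next(rotation))
            for i in range(shifts)]

def swap(schedule, i, j):
    """Trade the owners of shifts i and j while the proposal is still open."""
    traded = list(schedule)
    (start_i, who_i), (start_j, who_j) = traded[i], traded[j]
    traded[i], traded[j] = (start_i, who_j), (start_j, who_i)
    return traded

def commit(schedule):
    """Freeze the proposal; a tuple stands in for 'written to storage'."""
    return tuple(schedule)

people = ["cheetah", "tiger", "lion"]
proposed = build_proposed(people, datetime(2019, 1, 14, 14, 0), shifts=13)
proposed = swap(proposed, 1, 2)   # tiger and lion trade weeks 2 and 3
committed = commit(proposed)

# Changing the list now has no effect on what was already committed.
people.remove("tiger")
people.append("lynx")
for start, who in committed[:4]:
    print(start.strftime("%m/%d %H:%M"), who)
```

The point of the design is that the list only feeds the next proposal. The committed schedule is plain data, so edits to the list can't reach back and rearrange it.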

When things *do* happen in this world, you get to simply go in and mark someone as unavailable, and then see what that does to your coverage. The system would just need to flag any of that person's shifts as "unhandled", and it's up to the manager or the team as a whole to work on it until those holes are covered.

Worst case scenario, if they don't plug the holes, then the previous person's shift drags on for longer than they expected. It's not ideal, but at least you don't have the scenario where you have NOBODY on call. Granted, it's no guarantee that the unlucky person will be available, but it gives random folks a place to start when they're trying to find someone to help with whatever problem has arisen.
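Here's one way that could look against a committed schedule, as a sketch. The names are hypothetical, and using None as the "unhandled" marker is just a convenient choice for the example.

```python
from datetime import datetime

UNHANDLED = None  # marker for a shift nobody currently owns (assumption)

def mark_unavailable(schedule, person):
    """Return a copy of the schedule with this person's shifts unhandled."""
    return [(start, UNHANDLED if who == person else who)
            for start, who in schedule]

def unhandled_shifts(schedule):
    """List the start times of shifts that still need an owner."""
    return [start for start, who in schedule if who is UNHANDLED]

def effective_oncall(schedule, now):
    """Latest started shift that has an owner; holes fall back to the
    previous person, so their shift drags on instead of leaving nobody."""
    active = None
    for start, who in sorted(schedule, key=lambda shift: shift[0]):
        if start > now:
            break
        if who is not UNHANDLED:
            active = who
    return active

committed = [
    (datetime(2019, 1, 14, 14, 0), "cheetah"),
    (datetime(2019, 1, 21, 14, 0), "tiger"),
    (datetime(2019, 1, 28, 14, 0), "lion"),
]
patched = mark_unavailable(committed, "tiger")
print(unhandled_shifts(patched))                             # the hole to fill
print(effective_oncall(patched, datetime(2019, 1, 23, 9, 0)))  # cheetah
```

Everyone else's shifts stay exactly where they were. Only the hole moves, and until someone claims it, cheetah is still the name that comes up.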

This came up in a chat recently with a friend. I wanted to know which way a certain provider of paging services worked, but I wasn't a customer, and their docs didn't seem to help. So, I went digging to see if anyone was posting online about the inevitable problem of "I removed someone from my team and the whole schedule changed". Sure enough, I found those reports. Conclusion: they generate it on the fly, and whatever happens, happens.

This seems suboptimal. Scheduling for humans isn't like scheduling for machines. You can't just drop work onto whoever's available and assume it'll "just work out". People don't work that way.

I sure hope there are other places which get this right.

January 28, 2019: This post has an update.