Software, technology, sysadmin war stories, and more.
Monday, January 28, 2019

Feedback: oncall rotations

I got a bunch of feedback about the oncall rotation post.

An unnamed reader writes:

The way we do oncall rotations at my company is that once a quarter, a machine queries everyone in the list of "eligible oncallers" and does a lookup into their work calendar, looking for a list of keywords like "vacation" or "out of office", and it finds a way of assigning oncall rotation that minimizes (a) being oncall too frequently and (b) being oncall during a vacation. This means that as long as everyone plans their events more than a quarter out, the rotation is automatically filled. The bot then outputs the data to a text file, which can be manually edited if people need to.

My reaction: this is probably the most humane automated system I've heard of yet. It starts from a list, understands that people have things going on, and tries to avoid messing up their lives. I still think it would be useful to have it go into some kind of pending state where the humans could do their wheeling and dealing with each other to swap shifts around, but that's just me.

That is, the text file aspect of it could be trouble - people have ways of writing text that won't parse. It might be nicer as some kind of web thing that's just fancy enough to work without going to extremes.
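
To make the reader's description a little more concrete, here's a rough Python sketch of that kind of bot. Everything specific in it is an assumption: the `(title, first_day, last_day)` event tuples stand in for whatever the real calendar lookup returns, the names `busy_weeks` and `assign` are made up, and the greedy pass is just one simple way to keep shifts spread out and away from vacations. It is not a true minimizer and it is not the reader's actual algorithm.

```python
from datetime import date, timedelta

KEYWORDS = ("vacation", "out of office")  # the two examples from the reader's note

def busy_weeks(events, week_starts):
    """Return the week-start dates that overlap a keyword-matching event.

    events: list of (title, first_day, last_day) tuples, a hypothetical
    stand-in for the result of a real calendar query.
    """
    busy = set()
    for title, first, last in events:
        if not any(k in title.lower() for k in KEYWORDS):
            continue
        for ws in week_starts:
            # A week runs from ws through ws + 6 days.
            if first <= ws + timedelta(days=6) and last >= ws:
                busy.add(ws)
    return busy

def assign(calendars, week_starts):
    """Greedy pass: each week goes to the available person with the fewest
    shifts so far, with ties going to whoever was oncall longest ago."""
    shifts = {p: 0 for p in calendars}
    last_week = {p: -1 for p in calendars}
    busy = {p: busy_weeks(ev, week_starts) for p, ev in calendars.items()}
    schedule = {}
    for i, ws in enumerate(week_starts):
        free = [p for p in calendars if ws not in busy[p]]
        pool = free or list(calendars)  # everyone is away: someone still carries it
        pick = min(pool, key=lambda p: (shifts[p], last_week[p]))
        schedule[ws] = pick
        shifts[pick] += 1
        last_week[pick] = i
    return schedule

# Made-up data: three people, six Monday-based weeks.
weeks = [date(2019, 4, 1) + timedelta(weeks=i) for i in range(6)]
calendars = {
    "alice": [("Vacation", date(2019, 4, 1), date(2019, 4, 5))],
    "bob": [("Out of office", date(2019, 4, 10), date(2019, 4, 12))],
    "carol": [("Team lunch", date(2019, 4, 2), date(2019, 4, 2))],
}
for ws, who in assign(calendars, weeks).items():
    print(ws, who)
```

With that data, alice is skipped for the first week, bob is skipped for the second, and all three end up with two shifts each. The printed lines are the sort of thing that would land in the text file for humans to haggle over.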


Another anonymous reader says:

I think part of the answer may be a 'people' thing rather than a 'mechanisms' thing. (yeah, yeah, mechanisms over good intentions, but...) I think that social norms around volunteering to swap for each other's shifts when convenient and a mutual spirit of generosity can go a long way towards making a system bearable when it might otherwise suck.

I agree. In the "text file" life I lived at one point, it boiled down to "edit file, get it reviewed by someone, then check it in". One time, I needed to get out of a really bad situation where I was on call. I found someone back in Mountain View to cover until I could get back in town, put him in the file, fired off a change for review, but ALSO pushed it in without waiting for the review. (The so-called "TBR" commit - as in To Be Reviewed.) That way I could hop on the bus and go off the grid for five hours and he'd "review" it later.

I'm still grateful to that individual for picking it up. It let me escape the badness. If you're out there reading this all these years later, thanks, Mark.

The same reader continues:

I think (ok, I know, because I can math and your posts are dated) that you've been on more rotations than I have, but has that been your experience? The other big thing I see there is that 3 people in a rotation is... not a lot. A bigger rotation (and possibly sharing off-hours duties with an adjacent team, if the system's small enough/this is practical) makes these things easier. More people to swap with if needed. So maybe that's why I think that works. And maybe spreading responsibility a little is part of the solution. (But was the 3-person rotation just to make the example easier to write about than an 8-person rotation, or have you seen a lot of those?) My experience has been at Amazon, where there's not a big SRE/SWE type distinction, so the dev team is usually on call for their system, and you don't necessarily have a possibly smaller pool of SREs to form a too-small rotation.

Rotation size is crucial, too. Let me tell you a little story about what happens when the rotation shrinks on you.

I was on this team which did the whole "swat team" thing for that one very big web site. This team's oncall was sufficiently stressful that the rotation had been doubled up and split into Mon/Tue/Wed and Thu/Fri/Sat/Sun. The two "sides" of this rotation ran roughly 180 degrees out of phase, so a series of weeks might look like this with eleven team members "A" through "K":

[Random note: these are Mon-Sun weeks. It's easier to visualize this way when you don't have wraparounds for the weekend.]

Let's say I was "A". I'd have it for three days on week 1, then it would be five blissful weeks with no pages. Then I'd get it again for four days - the weekend, which was ruined by this. No leaving the house, really. Then it would be another five weeks until I got the Mon/Tue/Wed side of things again.
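
A split rotation like this can be generated with a few lines of Python. This is only a sketch under assumptions: the function name `split_rotation` is made up, both sides are treated as strict round-robins over the same roster, and the six-week lag for the Thu-Sun side is picked so that "A" gets five quiet weeks after week 1. It is not necessarily the exact phase the real schedule used.

```python
def split_rotation(roster, weeks, offset):
    """Return [(week, mon_wed_person, thu_sun_person), ...] for weeks 1..weeks.

    The Mon-Wed side walks the roster in order. The Thu-Sun side walks the
    same roster but runs `offset` weeks behind it.
    """
    n = len(roster)
    return [
        (w, roster[(w - 1) % n], roster[(w - 1 - offset) % n])
        for w in range(1, weeks + 1)
    ]

roster = list("ABCDEFGHIJK")  # eleven people, "A" through "K"
schedule = split_rotation(roster, weeks=12, offset=6)  # offset=6 is an assumption

print("week  Mon-Wed  Thu-Sun")
for week, early, late in schedule:
    print(f"{week:>4}  {early:^7}  {late:^7}")

# Every shift that lands on "A" in this window.
a_shifts = sorted(
    [(w, "Mon-Wed") for w, early, _ in schedule if early == "A"]
    + [(w, "Thu-Sun") for w, _, late in schedule if late == "A"]
)
print(a_shifts)
```

Under those assumptions "A" is up Mon-Wed in week 1, Thu-Sun in week 7, and Mon-Wed again in week 12. With an odd number of people the two quiet stretches can't come out perfectly even in a strict round-robin, which is why the phase is only roughly 180 degrees.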

But, then, one morning, this all changed. We found out that three people were all changing teams, effective immediately. Normally when you do something like this, you have new people who have already joined the team but not the oncall rotation. Then, as an old person leaves, you give their spot to a new person. This is what happened when I joined the team: my manager came out of the rotation (to take paternity leave), and I took his spot.

This time, however, there were no new people on the team to fill in, and there were no new people who would be joining soon. The oncall rotation just shrank by three immediately. It went from 11 people to 8. Look at what the same 12-week period turned into:

Now look at what my life looked like: instead of having three shifts in those 12 weeks, and ruining a single weekend, there were four, and a second ruined weekend. I had done the "short oncall list" thing for this team before, when person B was off doing this, person C was on paternity leave, person D was unavailable for some other reason, and so on. It was not fun, and it was about to start again.
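
The back-of-the-envelope version of that change is easy to compute. The sketch below assumes only what's described above: two shifts per week (one Mon-Wed, one Thu-Sun), one weekend per week, and everything spread evenly across the roster. The function name `average_load` is made up for illustration.

```python
WEEKS = 12
SHIFTS_PER_WEEK = 2  # one Mon-Wed shift and one Thu-Sun shift

def average_load(people, weeks=WEEKS):
    """Average (shifts, weekends) per person over `weeks` weeks."""
    shifts = weeks * SHIFTS_PER_WEEK / people
    weekends = weeks / people  # only the Thu-Sun shift covers a weekend
    return shifts, weekends

for people in (11, 8):
    shifts, weekends = average_load(people)
    print(f"{people:>2} people: {shifts:.2f} shifts and {weekends:.2f} weekends "
          f"per person per {WEEKS} weeks")
```

That comes out to about 2.18 shifts and 1.09 weekends per person with eleven people, against 3.00 shifts and 1.50 weekends with eight. Any one person's count depends on where the 12-week window happens to land, so the averages won't match an individual's tally exactly, but the direction is the same: losing three people adds roughly one more shift per person per quarter.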

What happened? Well, I became the fourth person to leave the team that week. What that did to the remaining people is left as an exercise for the reader.


Finally, one more comment:

At my last job, we had rotations. People were added to the rotation at ~6 months, and removed at ~2.5 years. This gave us some buffer such that we could keep the rotation at a fixed size, even if someone left. When someone was removed, we slotted their replacement in the same position, keeping the rotation stable. You needed to be conscientious about it, but we had the capacity, and kept a stable schedule for several years.

As to "certain provider", I'm guessing you mean PagerDuty? If so, we used that, and it was trivial to swap people without skewing the rotation. It also handles overrides very smoothly. It can be a "whatever happens, happens" situation, but that's avoidable.

This is interesting because you explicitly mentioned swaps: you know what's going to happen and you deliberately go in there to avoid shaking things up. But, I bet if you just remove someone from the schedule, everything goes crazy, right?
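
Here's a toy illustration of the difference between swapping and removing. It assumes the simplest possible scheduler, where the person oncall is just the roster indexed by week number modulo the roster length. That is a made-up model for the sake of the example, not a claim about how PagerDuty or any other provider computes its schedules, and the names `oncall` and `moved` are hypothetical.

```python
def oncall(roster, week):
    """Whoever is up in a given week (0-based) under a plain round-robin."""
    return roster[week % len(roster)]

def moved(before, after, leaver, weeks):
    """Weeks where someone OTHER than the leaver lost the slot they had."""
    return [
        w for w in range(weeks)
        if oncall(before, w) != leaver and oncall(before, w) != oncall(after, w)
    ]

before = list("ABCDEFGH")
swapped = ["X" if p == "C" else p for p in before]  # replacement takes C's position
removed = [p for p in before if p != "C"]           # C is simply deleted

print("swap:  ", len(moved(before, swapped, "C", weeks=16)), "weeks disturbed")
print("remove:", len(moved(before, removed, "C", weeks=16)), "weeks disturbed")
```

In this model the swap disturbs nobody: only C's own weeks change hands. The plain removal shifts 12 of the 14 weeks that belonged to other people in that 16-week window, because everyone after C in the list slides forward and the cycle length changes from 8 to 7.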

If it boils down to "never do the wrong thing", people are going to suffer eventually. I'm glad that you've been able to avoid hurting your coworkers by being conscientious about it, but that can fail too easily. It just takes one totally well-meaning person not realizing the bigger picture to make a change and throw everything into chaos.

Building systems where people can do what comes naturally without hurting themselves is hard, but it's worth it. It scales way better, and it tends to survive longer, too.