Writing

Software, technology, sysadmin war stories, and more.
Sunday, July 10, 2011

Look-ahead scheduling for tech support ranks

In my job as a phone/ticket support monkey, it became obvious that we had to figure out what staffing levels would look like at future dates. With a limited number of employees to go around and 24 hour coverage to provide, there were a number of constraints already. Add holidays, personal vacations, sick days, and people who just plain don't show up and life gets interesting. I was drafted to help out.

I started by writing something which just took a flat file with a list of techs, their days, starting times, and shift lengths. It would load that up and use it to populate a bunch of internal buckets. Assuming 7 days, each with 24 hours, each with 60 minutes, you have 10080 buckets if you want minute-level resolution. For each tech, it would then go through and mark where they showed up in each bucket.
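A minimal sketch of that bucket-filling step, in Python. This is not the original program: the tuple layout, the function names, and the choice to wrap a late Saturday shift around into Sunday morning are all assumptions made for illustration.

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY  # 10080 buckets at minute resolution


def bucket_index(day, hour, minute=0):
    """Minute-of-week bucket. Day 0 is Sunday (an assumption)."""
    return day * MINUTES_PER_DAY + hour * 60 + minute


def build_baseline(shifts):
    """shifts: iterable of (tech, day, start_hour, start_minute, length_minutes).

    Returns a list of 10080 sets, each holding the techs present that minute.
    """
    buckets = [set() for _ in range(MINUTES_PER_WEEK)]
    for tech, day, hour, minute, length in shifts:
        start = bucket_index(day, hour, minute)
        for offset in range(length):
            # Assumed behavior: a shift running past Saturday midnight
            # wraps around to the start of the same generic week.
            buckets[(start + offset) % MINUTES_PER_WEEK].add(tech)
    return buckets


# Hypothetical techs standing in for the flat file.
shifts = [
    ("alice", 0, 8, 0, 8 * 60),   # Sunday 08:00, 8 hours
    ("bob", 6, 22, 0, 8 * 60),    # Saturday 22:00, 8 hours, wraps into Sunday
]
baseline = build_baseline(shifts)
```

Storing a set of names per bucket, rather than a bare count, costs a little memory but makes the later "who is here right now" and per-team views fall out for free.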

Once it had done this, it would draw the whole week using just the top of the hour (minute = 0) buckets. With seven rows, one per day, and 24 columns, one for each hour, there was your week. That gave us our baseline: assuming everyone we know about shows up, every week should look like that. This went over well, so I went to the next level.
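The week view can be sketched in a few lines. This assumes the buckets have already been reduced to per-minute head counts with Sunday first; `render_week` and the text layout are hypothetical, not what the original tool printed.

```python
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def render_week(counts):
    """counts: 10080 per-minute head counts, Sunday 00:00 first.

    Only the top-of-the-hour (minute = 0) bucket of each hour is sampled.
    """
    lines = ["    " + " ".join(f"{hour:02d}" for hour in range(24))]
    for day, name in enumerate(DAYS):
        row = [counts[day * 1440 + hour * 60] for hour in range(24)]
        lines.append(name + " " + " ".join(f"{n:2d}" for n in row))
    return "\n".join(lines)


# Made-up data: two techs on Sunday from 08:00 to 16:00, nobody otherwise.
counts = [0] * 10080
for minute in range(8 * 60, 16 * 60):
    counts[minute] = 2

print(render_week(counts))
```

Sampling only minute 0 means a shift starting at 8:30 does not show up until the 9:00 column, which is a trade-off of the hourly view rather than a bug in the buckets.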

My second iteration involved taking the baseline and then mapping it onto a specific week. That way it wouldn't just be any Sunday, but it would be today, July 10th, for example. Then it would look for any "exceptions" in the database, like vacations and sick days, and if they occurred during this week, it would map them onto the 10080 buckets and adjust things downward accordingly. This meant you could mark someone gone for a few hours or days and they would automatically drop out of the views.
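Mapping exceptions onto a concrete week might look like the following. It is a sketch under assumptions: exceptions arrive as `(tech, start, end)` tuples of naive datetimes, the week starts on Sunday at midnight, and `apply_exceptions` is an invented name.

```python
from datetime import datetime, timedelta

MINUTES_PER_WEEK = 7 * 24 * 60


def apply_exceptions(baseline, week_start, exceptions):
    """Return a copy of the baseline with absent techs removed.

    baseline:   list of 10080 sets of tech names
    week_start: datetime for Sunday 00:00 of the week being viewed
    exceptions: iterable of (tech, start_datetime, end_datetime)
    """
    week_end = week_start + timedelta(minutes=MINUTES_PER_WEEK)
    adjusted = [set(bucket) for bucket in baseline]  # leave the baseline alone
    for tech, start, end in exceptions:
        # Clip the absence to this week; skip it if it does not overlap.
        lo = max(start, week_start)
        hi = min(end, week_end)
        if lo >= hi:
            continue
        first = int((lo - week_start).total_seconds() // 60)
        last = int((hi - week_start).total_seconds() // 60)
        for minute in range(first, last):
            adjusted[minute].discard(tech)
    return adjusted


# Hypothetical tech working Monday 09:00-17:00 in the generic week.
baseline = [set() for _ in range(MINUTES_PER_WEEK)]
for minute in range(1440 + 9 * 60, 1440 + 17 * 60):
    baseline[minute].add("alice")

# She leaves at noon on Monday, July 11, 2011.
exceptions = [("alice", datetime(2011, 7, 11, 12, 0), datetime(2011, 7, 12, 0, 0))]
this_week = apply_exceptions(baseline, datetime(2011, 7, 10), exceptions)
next_week = apply_exceptions(baseline, datetime(2011, 7, 17), exceptions)
```

Because the baseline is copied rather than modified, flipping forward to another week is just another call with a different `week_start`.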

Once this worked, it gave us the ability to look into the future! You could just plug in all of the vacations people wanted to take, and then flip through and look for hot spots. I had color-coded the cells on that 168-hour week view, so you could look for places where the number of techs for a given team was too low: it would be flashing red. Then you could get people to shift around well in advance so there would be no last-minute surprises.
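The hot-spot check behind the red cells boils down to comparing each hourly count against a floor. A rough sketch, where the threshold and the `hot_spots` helper are assumptions (the real thresholds would have varied by team):

```python
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def hot_spots(hourly_counts, minimum):
    """hourly_counts: 168 head counts, Sunday 00:00 first.

    Returns (day, hour, count) for every hour staffed below the minimum.
    """
    return [
        (DAYS[i // 24], i % 24, n)
        for i, n in enumerate(hourly_counts)
        if n < minimum
    ]


# Made-up week: three techs everywhere except two thin spots.
week = [3] * 168
week[3] = 1             # Sunday 03:00
week[5 * 24 + 13] = 2   # Friday 13:00

print(hot_spots(week, minimum=3))
```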

There were other features in all of this. Different techs wore different hats. Some would work teams B and H. Others would work C and G. Still others were one of A, D, or J. Then there were the second and third shift techs, who would usually do a combination, like all of (B, C, G, H) or all of (A, D, J). Then there were a couple of people "blessed" for doing enterprise/intensive customer work, teams E/I and F/L.

By tagging techs and then filtering your view, you could make sure you had enough coverage for team F at 3 AM on a Sunday, even if it was two months out. Or you could do an "all Unix" view, or "all Windows", or whatever. Finally, there were things like "tech lead", so if someone really needed to escalate to a manager, you knew who to call. This all became useful when I added a new default view, which was to display the current bucket. That is, if it's 8:45 PM, show bucket (20 * 60) + 45, and tell me everyone who's supposed to be here working right now.
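Here is one way the "right now" view and the tag filter could fit together. The post's formula gives the minute of the day; to index a full week of buckets you also need a day offset, which is added here as an assumption (Sunday as day 0). The tag dictionary and function names are hypothetical.

```python
from datetime import datetime


def current_bucket(now):
    """Minute of the day, as in the post: (hour * 60) + minute."""
    return now.hour * 60 + now.minute


def week_bucket(now):
    """Minute of the week, with Sunday as day 0."""
    day = (now.weekday() + 1) % 7  # datetime.weekday() has Monday as 0
    return day * 1440 + current_bucket(now)


def on_duty(buckets, tags, now, team=None):
    """Techs scheduled at `now`, optionally limited to one team tag."""
    techs = buckets[week_bucket(now)]
    if team is None:
        return sorted(techs)
    return sorted(t for t in techs if team in tags.get(t, ()))


# Made-up schedule: both techs work all of Sunday evening.
buckets = [set() for _ in range(10080)]
for minute in range(18 * 60, 24 * 60):
    buckets[minute].update({"alice", "bob"})
tags = {"alice": {"B", "H"}, "bob": {"F", "L", "lead"}}

now = datetime(2011, 7, 10, 20, 45)  # Sunday, 8:45 PM
print(on_duty(buckets, tags, now))
print(on_duty(buckets, tags, now, team="F"))
```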

By the end, this thing was smart enough to even generate the dumb paperwork which had to be printed (!) and handed to HR for vacation times. It was building iCal-compatible attachments and mailing them to team leads when their people took time off so it would just show up in their side of the world, as well.
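For the curious, an all-day iCalendar event is just a handful of text lines. This is a bare-bones sketch of what such an attachment could contain, not the original output: the `PRODID`, the UID, and the summary wording are invented, and it skips details a real generator needs, such as escaping commas and semicolons in text fields and folding long lines.

```python
from datetime import date, datetime, timedelta, timezone


def vacation_event(tech, first_day, last_day, uid, stamp):
    """Build a minimal iCalendar body for an all-day absence.

    stamp must be a UTC datetime. For all-day events DTEND is exclusive,
    so it is set to the day after the last day off.
    """
    end_exclusive = last_day + timedelta(days=1)
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//shift-scheduler//EN",  # placeholder identifier
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{stamp.strftime('%Y%m%dT%H%M%SZ')}",
        f"DTSTART;VALUE=DATE:{first_day.strftime('%Y%m%d')}",
        f"DTEND;VALUE=DATE:{end_exclusive.strftime('%Y%m%d')}",
        f"SUMMARY:{tech} out of office",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    return "\r\n".join(lines) + "\r\n"  # iCalendar lines end in CRLF


ics = vacation_event(
    "alice",
    date(2011, 7, 11),
    date(2011, 7, 15),
    uid="vacation-0001@example.invalid",
    stamp=datetime(2011, 7, 10, 20, 45, tzinfo=timezone.utc),
)
```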

Despite all of this, it never really caught on. There were one or two team leads who used it for a while (hence the iCal feature), but I spotted calendars going up on cube walls with dry-erase marker writing all over them. I don't think it was a case of ignorance of the system, since it was a small place and those who used it had shown it to others. They just made a decision to use markers and slick pieces of paper instead of a computerized system that already existed and worked.

One of the things I had been working on for a later revision was to not treat techs as identical parts. There were some people who might close one or two tickets a night on a good night. Then there were others who would absolutely destroy the queue, closing 15 or 20. It was possible to derive a "power value" for each tech based on their past performance. By adding up all of the power values for people scheduled to work at a given time, you could figure out how many tickets would be closed in that hour.
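Since this part never shipped, the following is only a guess at the arithmetic: treat a tech's power value as tickets closed per hour worked, then sum over whoever is on the schedule. The history numbers are invented.

```python
def power_value(tickets_closed, hours_worked):
    """Average tickets closed per hour over some past period."""
    return tickets_closed / hours_worked if hours_worked else 0.0


def hourly_power(techs_on_duty, power):
    """Expected tickets closed in one hour by the given techs."""
    return sum(power.get(tech, 0.0) for tech in techs_on_duty)


# Made-up history: (tickets closed, hours worked) over the same two weeks.
history = {"alice": (160, 80), "bob": (10, 80)}
power = {tech: power_value(*record) for tech, record in history.items()}

print(power)                                  # per-tech rates
print(hourly_power(["alice", "bob"], power))  # combined capacity for the hour
```

A plain average hides a lot (ticket difficulty, time of day, phone duty), so a real version would probably want per-team or per-shift rates.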

The final piece of the puzzle was going to be the "inflow value". This is where we look at historical ticket creation times and see when and where they occur. One team's customers might be very much M-F 8-5. Another seemed to be cottage industries, or second jobs for people, with most of the traffic arriving between 5 and 11 every night, with more on weekends. In any case, you could say, okay, it's Sunday at 8 PM, and all of the Unix customers put together will probably create 1 new ticket this hour on average. You're probably okay with an hourly total power value of 1.0.
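Deriving the inflow value could be as simple as bucketing historical creation timestamps by day of week and hour, then dividing by the number of weeks observed. A sketch, with invented timestamps and a hypothetical `inflow_by_hour` helper:

```python
from collections import Counter
from datetime import datetime


def inflow_by_hour(created_times, weeks_observed):
    """Average tickets created per (day, hour) slot, with Sunday as day 0."""
    counts = Counter()
    for ts in created_times:
        day = (ts.weekday() + 1) % 7  # datetime.weekday() has Monday as 0
        counts[(day, ts.hour)] += 1
    return {slot: n / weeks_observed for slot, n in counts.items()}


# Made-up history covering two weeks.
created = [
    datetime(2011, 7, 3, 20, 5),    # Sunday, 8 PM hour
    datetime(2011, 7, 10, 20, 40),  # Sunday, 8 PM hour
    datetime(2011, 7, 8, 13, 15),   # Friday, 1 PM hour
]
inflow = inflow_by_hour(created, weeks_observed=2)
print(inflow[(0, 20)])  # Sunday 8 PM average
```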

However, other times of the week were much more interesting. One team, all by itself, might have 20 tickets created in the 1-2 PM hour on a typical Friday. If that team's total power value was not at least 20, then those tickets would spill to the next hour, 2-3 PM, when another 17 would be created. Again, they'd need at least 17 to keep up with the new tickets, plus more to deal with the backlog. This would repeat, and the queue would go wild until the inflow finally died off and techs could catch up. Woe to those customers who were left waiting as people repeatedly passed over their tickets, though.
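The spillover is easy to simulate: each hour, the backlog grows by the inflow and shrinks by the power value, never dropping below zero. In the example below, the 20 and 17 come from the Friday afternoon above; the later inflow figures and the flat capacity of 12 are made up to show the queue building and then draining.

```python
def simulate_backlog(inflow, power, backlog=0.0):
    """Track open tickets hour by hour.

    inflow: tickets created in each hour
    power:  tickets the scheduled techs can close in each hour
    Returns the backlog remaining at the end of each hour.
    """
    history = []
    for created, capacity in zip(inflow, power):
        backlog = max(0.0, backlog + created - capacity)
        history.append(backlog)
    return history


inflow = [20, 17, 10, 4, 1]   # 1-2 PM, 2-3 PM, then tapering off (invented)
power = [12] * 5              # assumed constant team capacity
print(simulate_backlog(inflow, power))
```

With those numbers the team falls 8 tickets behind in the first hour, peaks at 13, and does not clear the queue until the fifth hour, even though inflow drops below capacity in the third.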

What I still don't understand is why nobody else seemed to think about things in these terms. It seemed like everyone was down in the trenches, engaged in hand-to-hand combat with the incoming flood of phone calls and tickets and whatever, and nobody ever pulled back to look at the bigger picture. Building better infrastructure so those tickets never existed in the first place would have been a big deal, but those projects tended to be ignored. I just don't get it.