Writing

Software, technology, sysadmin war stories, and more.

Sunday, April 8, 2012

Ticket cruncher power demonstrated

I used to work a weird "wraparound" schedule at my web support monkey job. It ran from Thursday to Monday, and was also second shift. That meant everything was offset relative to the rest of the world. I didn't mind that too much, but there were other reasons why I wanted out.

Ticket queue graph

Here's one of the Sundays where I worked. Way back when, I wrote a little tool which would render the status of the ticket queues. I'd usually run it against days when I'd think to myself "why was this so crazy?" just to see what happened.

In this graph, those numbers across the top are the time of day. 13 is 1 PM, 14 is 2 PM, and so on. The dark blue part of the graph is the number of unassigned/active tickets. The light blue is the number of active tickets. Since unassigned/active is a subset of active, it has the effect of looking like they are layered.
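As a rough illustration of that layering, here is a minimal Python sketch. It is not the original tool: the function names, the text-based output, and the sample numbers are all made up for the example. Each time slot draws the unassigned/active count first (`#`, standing in for dark blue) and then the remaining active tickets (`-`, standing in for light blue), so the two appear stacked.

```python
def render_row(active: int, unassigned: int) -> str:
    """Render one time slot as a layered text bar.

    '#' marks unassigned/active tickets (the "dark blue" layer) and
    '-' marks the remaining active tickets (the "light blue" layer).
    """
    if unassigned > active:
        # Unassigned/active is a subset of active, so this should never happen.
        raise ValueError("unassigned cannot exceed active")
    return "#" * unassigned + "-" * (active - unassigned)


def render_day(samples) -> str:
    """Render (hour, active, unassigned) samples, one line per sample."""
    lines = []
    for hour, active, unassigned in samples:
        lines.append(f"{hour:02d} {render_row(active, unassigned)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real graphs.
    print(render_day([(13, 3, 0), (19, 6, 4), (20, 9, 7)]))
```

Because the unassigned count is drawn as part of the active total rather than next to it, the total bar length is always the active count.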

Things start out nicely. We had almost no unassigned tickets for several hours, and very few active tickets as well. Then, right around 7 PM, everything goes wrong. The number of tickets shoots up and just stays there. Things basically go to hell in a handbasket.

What happened? Well, that takes a little more explanation. First, second shift that night was a "three shift special", as I used to call it. There were only three people working tickets and taking calls.

Then, around 6:30-6:45, I went to lunch for an hour. They paid us for 8 hours, not 9, so I made sure to take my lunch at some point every day. I'd shift it earlier or later, but I always took it. I got back in the queues around 7:30-7:45, and you can see what happened. I knocked that thing back down as fast as I could.

I started graphing more days. There was this visible bump in the data most nights. So, I started keeping track of when I'd take lunch and compared it to the data. Most of the time, it was a perfect match. I got tired of having that much of an impact on things, and ultimately found my way out of support and into a separate team. If other people weren't going to pull the same kind of load, then I sure wasn't going to do it all by myself.

So, okay, let's jump forward several months. Here's another Sunday night from about eight months later, well after I had left support. Notice the difference.

Ticket queue graph eight months later

They never get it down to zero active. They don't even ever get it down to zero unassigned! Even as the night wears on and it gets closer to midnight, things just get more and more broken.

This kind of queue situation leads to horrendous latency in fulfilling customer requests. Customers definitely notice, and they get annoyed. They start calling ahead to try to bump their stuff ahead in the queues, and then things just get worse.

Now for the final zinger: I purposely left off the Y-axis labels on these graphs. That first graph, as bad as it looked during my lunch, still only peaked at 27 tickets active.

This second one peaked at 49 tickets active. It's almost twice as bad.

I have a final graph which should make my point crystal clear.

Yet another ticket queue graph

Here we are, two months later. It's a Friday, and OMG does the queue look miserable. They're running a huge backlog all day long now. They can't get a handle on it. What to do, what to do.

But then... something happens. It just goes *poof*.

You know what happened? That's easy. We happened.

I decided to work support that day along with two friends who had also escaped from support. One had escaped to being an account manager, and one was also on my "meta-support" team. We just put on our headphones and started crunching tickets. While our AM friend got sucked into some AM thing and had to bail out relatively early, the two of us kept on going, smacking down tickets left and right. Bonk, bonk, bonk.

People definitely noticed. I started getting messages like "you're working support? I wondered where the queue was going".

I left the Y-axis scale off again. The peak value, shown on the graph around 7:30 AM, was 104 active tickets (with about 50 of those unassigned). A couple of hours after we jumped in, we brought it way down. At times, there were only 3 or 4 unassigned tickets sitting in the queue.

We actually got to a point where we had to start snaking tickets away from people who had left them open, assigned to themselves, and had gone home for the day. You're not supposed to leave tickets like that, but they had done it anyway. That's right -- we invaded the "assigned/active" view. Many tickets which had stumped the other techs were a trivial fix with the right people on board.

So let's review. We had a situation which had been rolling from day to day to day and was never improving. Then three (later two) heavy hitters dropped in and started crunching. The situation evaporated.

The entirety of the day shift couldn't make a dent in it. Three people did. What does that say about day shift?

Nothing changed as a result of our demonstration. Within a couple of days, the backlog built up, and soon, things were just as miserable again. All of those techs still couldn't keep up with what we did.

There's no silver lining here.