Software, technology, sysadmin war stories, and more.
Saturday, June 9, 2012

Free form tickets are hard to analyze (let's go shopping)

In the years when I cared about ticketing systems and the kinds of broken analysis people would do on them, I had some thoughts about how it should be done. A lot of it came down to recognizing that things only worked because people were reading written text and could figure it out. Trying to make sense of that with a program was out of the question. Here's what I mean.

A typical ticket might start out with a quick note from a customer.

My server seems to be down. Please help!

Sometimes, that's all you get. Some person has to grab that and translate it into reality. Basic triage behavior then leads to someone trying to ping the machine, log into it, or otherwise do something that would confirm its status.

When these fail, the ticket is forwarded to the data center with another short comment. This one might be "private": visible only to other employees (like the data center techs). It might add some details and ask for action, like this:

No ssh, no ping, please check at console. Thanks!

Then the ticket is sent to that data center's queue. This sets off their pagers, and someone gets to grab a crash cart and scope out the machine at its console.

Let's say they discover a kernel panic and the machine is just wedged there. They take down the last few bits of the panic in a note, reboot the machine, and make sure it's healthy from the console. Then they send it back to Support with another private message saying what they found and what they did.

Server was frozen with a kernel panic. Last message was: "BUG() at ...". Rebooted and is back online.

Now it's on the support team to pick it back up and provide an update to the customer. Remember, there's been no public response yet, so someone has to sum up what's happening and what will happen next.

Hello XYZ. We found your server unresponsive to ssh and also at the console, where a panic was found. We rebooted it and it is back up. We'll take a look to see if there are any further clues in the logs and will provide another update.

The customer now knows where things are, and that the ticket hasn't been finished. Now, some tech gets to log in and do a few basic best practices to look for things which may be broken. If there's a known-bad kernel version, or some nasty firmware bug in the RAID controller, now would be the time to catch it. They might also be able to see whether this is an isolated event or not from ticket history.

Finally, with this assessment made, it's time to provide another update.

Hello again, XYZ. I noticed that your server is running version x.y-z of the kernel, and that has a bug under certain load conditions, and it seems we may have tripped it. I recommend we upgrade your kernel and reboot into it as soon as possible.

We can do this now, or we can schedule it for a quieter time for your web site, such as overnight. Please let us know your preference, or if you have any other questions or concerns about this.

The ticket is placed into a state where it will wait a day or two for the customer's response, and the tech moves on to something else. If it is updated with a go-ahead, then more things will happen, but what I've described is enough to continue on to my thoughts about analysis.

Everything which happened here was implicit. A bunch of freeform text messages conveyed content between people who then parsed it and took action and then generated more messages. This is no big deal, since that's what people do every day. We call it language.

The trouble is that computers aren't particularly good at figuring out this sort of thing. Specifically, the sorts of questions analysts like to ask tend not to be trivially answerable when all you have is a bunch of text messages floating around in a database.

Now's when you have to be aware of whether you live in the Silly Valley echo chamber. At this point, a bunch of people are probably thinking about "semantic text analysis" and all sorts of other fancy stuff that makes people think about their Ph.D. dissertations. Yes, there is a lot of work out there which tries to make sense of human language. You just have to realize that these people are not the types who will ever be able to do such a thing... or even know it exists.

These are the bottom rungs of the technology ladder. Things are different in this world.

Instead, you start getting proposals to strangle the free-form expression or otherwise duplicate it with highly structured lists of actions. Everything is now supposed to happen explicitly. The customers are still able to write a message, but some poor employee on the inside now gets the task of sorting it.

Task added: [DC Ops] [Check at console and report] [Reboot if necessary]

This task has business logic attached. When someone adds it to the ticket, it fires off a bunch of policies which were written by the analysts. One of them is "you will always write a public comment to the customer before sending a reboot to the data center". Previously, this had just been a decree, but now it can be implemented in software!
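The "business logic attached to a task" idea can be sketched in a few lines. This is purely illustrative: the names `Ticket`, `add_task`, and the policy list are all invented here, not anything from a real ticketing product.

```python
# Hypothetical sketch: policies fire when a matching task is added.
# All names and structures here are made up for illustration.

CANNED_REPLY = ("Hello. Your message has been received and has been "
                "forwarded to the data center for analysis.")

class Ticket:
    def __init__(self):
        self.tasks = []
        self.public_comments = []

    def add_public_comment(self, text):
        self.public_comments.append(text)

# A policy is a predicate over the task plus an action on the ticket.
POLICIES = [
    # "You will always write a public comment to the customer
    #  before sending a reboot to the data center."
    (lambda task: task["queue"] == "DC Ops"
         and any("reboot" in a.lower() for a in task["actions"]),
     lambda ticket: ticket.add_public_comment(CANNED_REPLY)),
]

def add_task(ticket, task):
    # Run every matching policy, then queue the task itself.
    for matches, act in POLICIES:
        if matches(task):
            act(ticket)
    ticket.tasks.append(task)
```

Adding the `[DC Ops] [Check at console and report] [Reboot if necessary]` task through `add_task` would post the canned comment automatically, which is exactly the decree-turned-software the analysts wanted.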

Hello. Your message has been received and has been forwarded to the data center for analysis. Please stand by while one of our award-winning Platinum Level III (tm) technicians checks on your server. Thank you for choosing StudlyCapsCompanyName.

The data center does their thing and marks the task done, filling in the required fields. They tick "unresponsive at console", and "video present", which flips down a sub-chooser where they pick "kernel panic". This pops open a text field and then they add the "BUG() at..." message. Then they tick "rebooted" and "OK now". They click [done] and it goes away.
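What the tech actually submits, in a system like this, is a structured record instead of a free-form note. A rough sketch of what that might look like (field names invented here, including the one remaining free-text field):

```python
# Hypothetical structured result for the console-check task.
result = {
    "unresponsive_at_console": True,
    "video_present": True,            # flips down the sub-chooser
    "console_state": "kernel_panic",  # picked from the sub-chooser
    "panic_message": "BUG() at ...",  # the only free-text field left
    "rebooted": True,
    "ok_now": True,
}

def validate(result):
    """Enforce the form logic: the sub-chooser value only makes
    sense when video was actually present."""
    if result["console_state"] and not result["video_present"]:
        raise ValueError("console_state requires video_present")
    return result
```

Every tick box becomes a field an analyst can query directly, which is the whole point of strangling the free-form note.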

The company may have decided that kernel panics are not worthy of investigation unless they occur more than N times in M days on a given machine. Since this is the first one, it is below the threshold, and will not appear back in the support queue. Instead, the ticketing system takes it upon itself to respond.
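The threshold rule itself is trivial to write down, which is why this kind of system loves it. A minimal sketch, where N, M, and the shape of the panic log are all assumptions of mine:

```python
import datetime

# "More than N panics in M days" rule. The threshold values and the
# event-log format are assumptions, not from any real system.
PANIC_THRESHOLD = 3   # N
WINDOW_DAYS = 30      # M

def needs_investigation(panic_times, now=None):
    """Return True when this machine has paniced often enough,
    recently enough, to be kicked back to the support queue."""
    now = now or datetime.datetime.now()
    window = datetime.timedelta(days=WINDOW_DAYS)
    recent = [t for t in panic_times if now - t <= window]
    return len(recent) >= PANIC_THRESHOLD
```

One panic comes in under the threshold, so the machine's first panic gets a canned reply and nothing else, exactly as described below.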

Hello again. Your server was found unresponsive at the console. There was a message present, showing a kernel panic and "BUG() at ...". It was rebooted and is now responsive. Thank you for choosing yadda yadda blahity blah.

The ticket disappears into the ether. If the customer doesn't do anything else, it probably auto-closes 24 or 48 hours later. The immediate fire is out, and they might be happy with that. No further investigation ever happens.

This is the kind of dream world they want. In this world, nothing happens unless there is a task assigned for it. Then, every task has conditions when it might be used, and a whole bunch of actions which occur as a result. All of this is used to tune and retune their metrics to squeeze the absolute most out of their techs.

What's kind of disturbing is that it could actually be an improvement depending on how they staff their support teams. If you have a bunch of people who need to be told explicitly what to do at every step of the way, then you really do need a system which works like this. You get someone who's (hopefully) competent to program it, then you turn the peons loose inside it. You give them no leeway to do things "outside the box", since they have been judged incapable of such work.

Of course, having described this whole thing, I keep having this nagging feeling about corner cases. There will be times when an honest request just can't be understood in terms of strictly-defined tasks. This may be due to a lack of tasks to capture something that needs to be done, or it might just be that there are too many if-then-else details in a request.

What worries me is that people would try to shoehorn a request into something which seems remotely related, and this would cause a loss of precision. Important details which could not be squeezed into the rigid task fields simply disappear. Service regresses to the point where only ordinary things can be handled, and you can forget about anything remotely complicated or different. There's just no way to express it.

Have you ever submitted a service request to a big organization and gotten back a result which looks like your original request had some of the sharper points filed off before it went through their "system"? This might be why.