Software, technology, sysadmin war stories, and more.
Sunday, June 5, 2011

Catching problems before the customers do, or not

After you've worked a few thousand support tickets in a couple of months, you start learning things, and you start noticing patterns. I realized that while brute force is one way to keep up with whatever customers may need, it's a soul-sucking technique which ignores one's higher brain functions. I tried to do something about it.

One thing which occurred to me is that many tickets simply don't ever need to exist. There are things which (at the very least) could be noticed by the hosting company and resolved long before the customer ever found out. At least that changes it from a flood of customer-requested tickets to a flood of internal requests. That is actually a net win, since it means less hassle for customers and it also looks like you are being proactive, because, well, you are, and they love that.

We had this service where backups would run on your server periodically and would dump over a private network to a huge backup environment. These dumps and a pass/fail status would show up in a database somewhere. There was another database where you could find out whether the customer had ordered backups for a given server -- call it a SKU, because that's what it was.

So now you have basic set logic going on here. Just using the backup transfer logs, you get a set of hosts which have been backed up recently, a set of hosts which have been backed up (at any point), and a set of hosts which have never been backed up. From the provisioning side, you then have a set of hosts which should be getting backed up, and a set of hosts which should not.

From this point, the rest is a simple matter of taking intersections, unions, and differences to get the things you care about: hosts which should be getting backed up but never have, hosts which were being backed up and then strangely stopped (but should still be getting it), and so on.
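To make the set logic concrete, here is a minimal sketch in Python. It is not the original tool: the function name, the input shapes, and the two-day "recent" window are all assumptions made up for illustration, and the third category (backups arriving for a host with no SKU) is just one guess at what "and so on" might cover.

```python
from datetime import datetime, timedelta


def audit_backups(should_backup, last_success, now, recent=timedelta(days=2)):
    """Compare provisioning data against backup logs (hypothetical inputs).

    should_backup: set of hostnames that have the backup SKU
    last_success:  dict of hostname -> datetime of the last successful backup
    now:           the time the audit is run
    recent:        how far back still counts as "recently" (assumed value)
    """
    # Hosts that have been backed up at any point.
    ever = set(last_success)
    # Hosts whose last successful backup falls inside the recent window.
    recently = {h for h, t in last_success.items() if now - t <= recent}

    return {
        # Paying for backups, but nothing has ever landed.
        "never_backed_up": should_backup - ever,
        # Paying for backups, had them once, then they stopped.
        "stopped": (should_backup & ever) - recently,
        # Assumed extra check: recent backups for a host with no SKU.
        "no_sku": recently - should_backup,
    }
```

Each key in the result is a list of problems the audit console could rattle off. The provisioning set and the log dict would come from the two databases; how they get queried is left out because it depends entirely on the environment.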

I had all of these triggers feed into something I called the audit console, where it would rattle off a list of problems for anyone who cared to look. It could also pick up on a bunch of other random dumb things which could be broken in a customer's configuration: more IPs in use than their firewall's license will support, and so on.

So now I had this magnificent web page which showed plenty of opportunities for fixing things before they went and hurt a customer, but... nobody cared. There was no real incentive to do anything about that kind of problem before the fact. Even though the consequences could be brutal if the worst happened, few seemed to worry about it.

Let's say you are paying for backup service with two-week retention, and it stopped working three weeks ago. That means your last backup disappeared from the environment a week ago. You have no backups. Now imagine your disk dies. You're toast.
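The arithmetic in that scenario is simple enough to check by machine. This is a rough sketch only: the function names are invented, and it assumes retention is counted in whole days from the date of the last successful backup, which may not match how any particular backup product expires data.

```python
from datetime import date, timedelta


def newest_backup_expiry(last_success, retention_days):
    """Date on which the most recent good backup ages out (assumed policy)."""
    return last_success + timedelta(days=retention_days)


def has_restorable_backup(last_success, retention_days, today):
    """True while the most recent good backup is still being retained."""
    return today < newest_backup_expiry(last_success, retention_days)
```

With the post's date as "today" (June 5, 2011), a last good backup three weeks earlier (May 15), and 14 days of retention, the newest backup expired on May 29, a week before, and has_restorable_backup returns False.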

This is one of those things which was obviously correct and yet was never adopted. If the right people don't care, nobody will.