Tuesday, June 1, 2021

Please don't count outages (or SEVs, or whatever)

There are some interesting patterns which repeat in management. I understand that they come from a desire to quantify things and maybe even to point at things which are moving in the right direction (or not). Counting outages (SEVs, whatever you call them) doesn't actually give you that.

This one is subtle, but it has a lot to do with the way people behave in the face of a measurement. In short, if you start counting them, it's probably because you're going to start making reports which say "we had X outages in this span of time". There might even be a *gasp* trend line showing it going up or going down.

This is terrible. You think it's going to help, but it's not. At best, it will have no effect on things, but at worst, it will tell the people in the trenches that "opening a SEV (outage, ...) is baaaaaad", and they will shy away from doing it. Worse still, the avoidance may not even be a conscious thing. It just might not occur to them to hit the [create] button when it's time.

If they know that someone in management will ask "why were there more SEVs in October than September", they might be more willing to go "ehhh, it's not important" and just work the problem without tracking it properly.

I get that people want to measure things and look for stuff that's getting better or worse. They can still do that. They just need to be a bit more subtle and detached from the actual reporting mechanism. Look at your reliability numbers. How many requests does your system get? Then, how many of those actually succeed? Divide successes by total requests and you get a success rate. What is that rate doing over time?
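
Just as a sketch (the counter names and numbers here are invented; the real ones would come from whatever metrics system you already have), the math is nothing fancy:

    # Minimal sketch: success rate per time window from hypothetical counters.
    windows = [
        {"period": "2021-05-31", "requests": 120_000, "successes": 119_640},
        {"period": "2021-06-01", "requests": 125_000, "successes": 123_750},
    ]

    for w in windows:
        rate = w["successes"] / w["requests"]  # fraction of requests that succeeded
        print(f'{w["period"]}: {rate:.2%} success rate')

Plot that rate over days or weeks and you get your trend line without anyone having to confess to a SEV.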

What about latency? Is the system good and responsive? What's the 99th percentile for request times? Is it going up? Is it going down? How wide is the distribution of those latencies? Is everything pretty consistent, or do you have a nasty pit which gobbles up some of them?
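
Here's one way that could look, again just a sketch: the latency samples below are made up and far too few, since a real system would have piles of them, but the percentile math is the same.

    # Minimal sketch: p50/p99 latency from a batch of request timings (seconds).
    import statistics

    latencies = [0.021, 0.018, 0.025, 0.017, 0.340, 0.019, 0.022, 0.020, 0.024, 0.019]

    cuts = statistics.quantiles(latencies, n=100, method="inclusive")  # 99 cut points
    p50, p99 = cuts[49], cuts[98]
    print(f"p50 = {p50*1000:.1f} ms, p99 = {p99*1000:.1f} ms, "
          f"spread = {(p99 - p50)*1000:.1f} ms")

A wide gap between the median and the tail is exactly that "nasty pit" showing up in the numbers.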

That's the usual type of monitoring people think about: look at the whole system and see how things are going. You definitely need those, and they can give you a decent sense of whether your actual customers are happy. Still, with a big enough number of customers, it's possible some of them will have a bad time and you'd never know it.

This is why I have tried to propose a "really bad day" metric for certain companies. It works like this. Take your pizza delivery people, your dog walkers, or whatever else it is your service does. Look at all of them for a given area, and see how much business they've been doing.

Let's say you look at all of the customers in Chicago for a given Monday between 8 AM and 5 PM. What's the median number of jobs that bunch of people has gotten? Who's an outlier? That is, is someone having a "really bad day" with only 2 jobs where most people have gotten at least 15? Does it happen to them a lot? Maybe it's worth looking into.
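
A sketch of that check might look like the following. The worker names and job counts are invented, and the "way below the median" cutoff is arbitrary; the point is just to surface the one person stuck way down in the tail.

    # Minimal sketch of the "really bad day" check, with made-up data.
    import statistics

    # Jobs completed per worker in one area and time window,
    # e.g. Chicago, Monday 8 AM - 5 PM.
    jobs_done = {
        "worker_a": 17, "worker_b": 15, "worker_c": 19,
        "worker_d": 2,   # someone having a really bad day?
        "worker_e": 16, "worker_f": 14,
    }

    median = statistics.median(jobs_done.values())
    threshold = 0.25 * median  # arbitrary cutoff, purely for illustration

    for worker, jobs in jobs_done.items():
        if jobs < threshold:
            print(f"{worker}: only {jobs} jobs vs a median of {median} -- worth a look")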

Sometimes, people's accounts get into broken states and they suffer in silence since nobody believes that they could be that one-in-a-million scenario where they never get scheduled to deliver that pizza, walk that dog, or whatever it is the company does.

The flip side of this is also interesting, but isn't as urgent. It's for detecting people who are too many sigma to the right on that distribution. In other words, who's having a really *good* day? If someone has done 50 jobs in the same period of time where the mean is 15 and the standard deviation is 5, that's probably worth investigating! Maybe they're just really lucky, or maybe they've found some way to game your system and cash in with some kind of fraud. You should probably look at it.
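
Plugging in the numbers from that example, (50 - 15) / 5 puts that person 7 standard deviations above the pack. As a sketch, with the mean, the deviation, and the three-sigma cutoff all taken as givens rather than computed from real data:

    # Minimal sketch of the flip-side check: who is suspiciously far above the mean?
    mean, stdev = 15, 5   # in practice, computed over everyone in the area/window
    jobs_done = {"worker_a": 14, "worker_b": 18, "worker_x": 50}

    for worker, jobs in jobs_done.items():
        z = (jobs - mean) / stdev
        if z > 3:  # arbitrary "too many sigma" cutoff
            print(f"{worker}: {jobs} jobs is {z:.0f} sigma out -- lucky, or gaming the system?")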

Incidentally, back on the topic of counting outages, it's funny what happens. People (managers) count them because they want to see if the number is going down and so things are getting more reliable. What tends to happen is that other people (the workers) stop creating them. This means a genuine problem doesn't raise the attention of the right responders, and the outage might last longer.

It also means that the problem never gets a chance to be reviewed and have followups assigned (and hopefully completed), so there is a very good chance the whole thing gets swept under the rug. People hope it will just go away and not show up on anyone's radar.

This is not how things get fixed. If you assume it won't happen again, it will find a way to do just that and will probably happen at the worst possible time.

So, if you count them, people will hide the truth, things won't get cleaned up properly, and reliability will actually go down. If that's not what you want, then stop counting those things. Switch to some other metric and you'll get better results.

...

Side note: I once worked at a place where upper management actually "got" this. One day, a well-meaning engineer with nothing but positive thoughts came up with a neat little hack: an electronic display that said something like "it has been X days since the last SEV".

It was meant to be one of those things like "it has been X days since the last job-stopping injury" that you see on construction sites and other places like that where serious, honest work happens every day. This person meant it in the best possible way but probably didn't realize the side effects it would have.

The management in this case *did* get this, and asked nicely for the engineer to take it down. They explained the problem: people will start avoiding the [create SEV] button so they don't reset the number, along with everything else that follows from that, as described above.

At that point in my life, it had never occurred to me that counting these events would be a bad thing. Seeing that conversation from the sidelines gave me a new perspective on metrics and human behavior.