Writing

Software, technology, sysadmin war stories, and more.

Monday, April 30, 2018

Company-wide outages and the tendency to spam "ME TOO"

One of the smallest ideas I ever had to improve a system came after observing how people behave organically. It's all about what people do when there's a big outage that seems to affect everyone, and how to keep them from spamming the official response channels.

Let's set the stage here. It's an ordinary day at work, and people all over the company are doing their thing. Then, something happens, and now they can't. We'll say that all connectivity to the outside world went down for the sake of this example. Internal systems are still up and reachable, but their Spotify streams (or Netflix connections for the people who really like to slack at work) or whatever else just keeled over. After a minute or two, people really start to notice and start getting annoyed.

Sooner or later, someone will open an incident. It might be a ticket, a bug, a SEV, or a post on an internal mailing list, but it will become the focal point for the response to the problem. Eventually, someone will post the inevitable to the incident: "ME TOO". They usually don't say it in those exact words, but that's essentially the value of their contribution.

At first, a little bit of reporting is useful, like if you haven't identified the extent of the problem. But, once the exact nature of the failure is known, having more people flood into the system with a broadcast "ME TOO" message is not helpful. It can actually start becoming disruptive, since responders may try to stay up to date on the incident, and these messages aren't doing them any favors.

I saw this happen a bunch of times with many different kinds of outages. It usually happens whenever something is big enough to affect people who normally would not get involved in such an event. After all, the veterans of working outages know better than to spam the thread with just an ordinary "me too" comment. It takes a bunch of well-intentioned novices to turn it into a real mess.

My idea was simple enough: how about we add a "ME TOO" button at the top? It could be next to a number along the lines of "n people report being affected by this". If you click it, then it goes from being "popped out" to being "pushed in", and the number goes up by one.
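
To make the idea concrete, here is a minimal Python sketch of the bookkeeping such a button might need. None of it comes from the actual system: the class name `MeTooButton`, its method names, and the use of usernames as identifiers are all made up for illustration. It also assumes that clicking a second time pops the button back out, which the original pitch didn't specify.

```python
class MeTooButton:
    """Hypothetical per-incident "ME TOO" counter (illustrative only)."""

    def __init__(self):
        # Remember who clicked, so each person counts once
        # no matter how many times they hit the button.
        self._affected = set()

    def toggle(self, user):
        """Flip the button for one user; return True if it is now pushed in."""
        if user in self._affected:
            # Assumption: a second click pops the button back out.
            self._affected.remove(user)
            return False
        self._affected.add(user)
        return True

    def is_pushed_in(self, user):
        """Tell the UI whether to draw the button pushed in for this user."""
        return user in self._affected

    def label(self):
        """Text shown next to the button."""
        n = len(self._affected)
        noun = "person reports" if n == 1 else "people report"
        return f"{n} {noun} being affected by this"
```

Storing a set of users instead of a bare integer is the one design choice that matters here: it keeps a single impatient person from inflating the count, and it lets the page show each viewer their own button state.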

The first time I pitched this, it went nowhere. It was read, acknowledged, and discarded as uninteresting, unimportant, or something like that. So, the ME TOO spam continued.

Time passed. More events happened. More spam went with them.

Then, one day, someone added the ability to "react" to individual comments. Think of clicking "like" or "+1" or "pinning it" or whatever your favorite social network thingy does. All comments started with 0 reactions, and clicking it would make it visibly change on your screen, and the number would increment for everyone viewing it.

This was the case for a while, and then we had another outage that would have brought in the peanut gallery with their onslaught of ME TOO comments. I was still annoyed with the lack of uptake on my original idea from years in the past, and decided to run a little experiment to make my case.

I posted a comment that said something short and sweet. It looked something like this:

<-- If you are affected by this, please click the button over there to react. Please do not post "ME TOO" here.

That comment got a ton of likes, and the event didn't get any ME TOO comments. We got the signal that people were affected, without the noise of their usual method of conveying it. The pattern repeated over a few more outages like this.

This was enough to convince a new set of maintainers that my old idea was worth it. A friend went and came up with a way to make this work, wrote the patch to the system, got it reviewed, and added it. I owe him bigtime for taking me seriously and doing the legwork to make it happen.

From that point forward, all events had a prominent reaction-type button up top for people to click, and click they did. Comments stayed on-topic.

We'll never know exactly how many low-signal comments will never be created due to this. It's one of those things that is really hard to quantify: if you make something never happen, how can you say you've been effective? Proactive work requires the right sort of management to be recognized and compensated properly.

There's another moral here: just like with customer service, "hang up and call back" sometimes can work with software, too. You just have to outlast the maintainers who give you the brush-off. You also have to be super patient, since it can take years for them to rotate out.

...

Epilogue: while writing this, it occurs to me that the "ME TOO" phenomenon is a lot like the electronic food fights that happen when people reply-all on giant company-wide mailing lists. They're probably trying to help, but are failing bigtime.

In that vein, I wonder what would happen if every mail that went out to a sufficiently large distribution list also carried a prominent link at the bottom which offered to do something about it.

"Did you get this mail and you don't know why? Click here."

The target page doesn't even have to do very much. It just has to keep them busy long enough to absorb that human desire to "just do something" about the errant e-mail. If it can capture enough of that energy so they don't generate their own reply to propagate the food fight, it might just stop the problem in its tracks.