Writing

Software, technology, sysadmin war stories, and more.
Saturday, May 23, 2020

Discovering the hypocrisy gap in reliability the hard way

I like the model of having a weekly meeting, preferably Friday morning, where the company sits down and talks about what broke recently. You might call it "SEV Review" or "Incident Management Review" or something else of the sort. It's "the best meeting of the week", as someone else once put it.

Such a meeting only really works if it's attended by the right people. What do I mean by that? Well, obviously, whoever worked on fixing things probably needs to be there. Whoever broke it (if that's someone else) probably needs to be there. Then, whoever *was broken by it* might need to be there. (That's how you find out the actual impact.)

Then you need some people who go every week, who I call my "usual suspects". It should be a good mix of management and non-management types who go often, learn how things work, and keep track of patterns that emerge. They ask questions and don't settle until they know what actually happened. They don't necessarily take the report at face value. They should also know a lot of good contacts in the company in order to send people in the right direction -- "go talk to so and so, since they already solved for this and it'll save you a lot of time".

This meeting also needs a couple of people who have technical chops and are able to convince people to actually do their damn jobs to deliver on the commitments to fix things. This, unfortunately, almost always comes in the form of upper management - VP and above, usually. It's not enough for them to just attend, either. They have to be there and be willing to wield the big stick and shut down teams which are reckless, foolhardy, and which don't give a crap about keeping things from breaking.

One of the best instances of one of these meetings I can remember is when the resident VP told a team to just stop. They were no longer allowed to work on features or whatever shiny stuff they thought they wanted to do. They had to stop right there and clean up the messes which had been breaking the site and bringing them to the review week after week.

It was epic. Nobody forgot that meeting. Just reading about it will probably remind some of my coworkers from back then about it.

Why does this matter? Well, first of all, you have to stop the clowns from setting the rest of the circus on fire. Fixing the reliability issue is a big deal. Second, actions like this show the rest of the company that you can't just screw around forever, and that eventually, teams will answer for their recklessness and disregard for safe practices. Nobody gets a pass at this kind of thing.

Some of the worst instances of these meetings I've seen are where there is no enforcement from on high. There are zero consequences for blowing it off and not taking it seriously. You can ignore the best practices. You can fail to deliver on the requested changes that have been shown to work in the past and that should keep a problem from happening again.

Oh, there might even be a VP in the loop, but if they don't get involved, it doesn't matter. The whole process is a sham and should be shut down. Sometimes they make it obvious by just not attending anymore. That should be your hint that they really don't care about reliability, any work in that dimension is folly, and you should find something else to do posthaste.

It took me a while to figure this out. I realized that there is what upper management is willing to SAY they care about, and then there is what those same people are willing to ACT ON by leaning on teams, and yes, firing people who get in the way.

If you draw these out on a marker board, it might look like this:

              ---       <-- what they say they care about
 
 
 
 
              ---       <-- what they actually deliver on

That space in the middle? I called it the hypocrisy gap.

So then, what happens if you're just a high-level non-manager type who's tasked with improving this stuff? You probably have your own "level", too. What happens next depends on whether it syncs up with upper management or not.

If you're delivering *exactly* what they actually deliver on, you're a top-tier engineer. You'll go far. You are perfectly aligned with their hypocrisy.

If you're below that point, well, then you're not so great, obviously.

If you're somewhere in the hypocrisy gap, they'll probably be very happy with you, but not understand why you're plowing energy into something that they are not. Can't you see that they don't actually care about it that much? You should be more like them and save your effort for something else, like coming up with new ways to spend the VC money.

And finally, if you're somehow managing to come in above their top line (what they say they care about), then ... well, you're just nuts. They'll see you as being completely insane, since you care about something beyond even the point that they CLAIM to care about. Only a maniac would want something like that.

One key thing about this: I haven't mentioned the absolute levels of any of this yet. Whether you are seen as good, bad, or crazy is entirely relative to the powers that be in your organization. It doesn't matter if terrible things are happening. If they don't care about it, YOU caring about it will not be seen as valuable, and indeed, will turn into a liability.

Here, let me invent some scenarios that should seem terrible to you.

"The entire company's database credentials were committed to a public GitHub repository!"

"People are using public gists to store sensitive customer information!"

"Anyone can open this door with a plastic library card and a potato!"

"People are setting up vendor relationships to get kickbacks!"

"Middle management is inventing situations that didn't happen in order to get HR to intimidate people into shutting up about actual problems!"

Guess what? If they don't think those are problems to the point of acting on them, then they don't actually care. If they don't, and you do, that makes YOU the problem. Congratulations! You are in for a rough ride.

Unless you report directly to the CEO or Board of Directors, don't think you can do anything about it. Pack it up and get out of there.

Otherwise, well, prepare to be put out to pasture.

Moooooo.