Reliability meeting experiences
I was a new employee. I wanted to see what the reliability situation was at the company. Nobody was going to just tell me, so I had to try a bunch of things to see how much could be learned from poking around.
How did it start? Well, it started with me showing up at a bunch of disjoint meetings to see what the ground truth looked like. One of them was this Monday morning show and tell for the VP. It supposedly included a focus on reliability, but it was a joke. Taking two minutes to read a one-pager about an outage aloud and glossing over everything is not a reliability review to me.
At one of them, there were two people representing a single outage. One of them ("user") was from this team that used GPUs to compute something or other. The other one ("service") managed the team that ran the gunk that let people ask for machines with GPUs so they could run stuff. What happened was that "service" presented it as "user screwed up and was down because of some stuff that happened".
This didn't seem to sit right. Why would you have the manager of a service team presenting the outage for one of their customers? Questions were asked.
It turned out to be a completely different story than the one they were pitching. In fact, what happened is that the "service" team made some change and started shipping kernels without GPU support (or something like that). Every system that came up after this point therefore had no GPU support and so jobs that needed a GPU would not run on them. Eventually, through attrition, ALL systems had that no-GPU kernel, and so the GPU-requiring job couldn't run anywhere, and hence the outage happened.
That the outage happened is itself of relatively low interest on the grand scale of things. Sure, to make it not happen again, this technical thing needs that technical thing done to it. It's like a Star Trek script full of [TECH] [TECH] [TECH] nonsense. Rotate the shield harmonics and repurpose the main deflector dish, already. Monitor this, alert on that, test this other thing, you know the drill. That's not the part to focus on here.
What was far more interesting to me is the way they tried to spin it, and how the "user" person basically sat there and let it happen at first. The power dynamic was such that the "service" team's manager almost got away with it - and would have too, if not for those pesky questions!
That situation suggested that many more outages would follow, given that the people who ran the service seemed rather skilled in turning things around on their customers. This is one of those cases where a little bit of company C action would actually be preferable, I think!
...
Another meeting was rather interesting in terms of how topsy-turvy things had gotten at that company. There was a whole group of people who did customer support from a distant state. They were good, hard-working people who for the most part wanted to do right by the customers. Any time something went wrong, they were the ones who found out about it. They issued credit memos and tried to put things right.
If something in the product was hurting the customers somehow, it would invariably show up in the support queues, and the support teams would notice patterns. They'd try to report things to the software teams back at HQ and would frequently be ignored, told it wasn't actually a problem, or otherwise dismissed with no progress to be seen. Meanwhile, the support queries would continue flooding in from the customers who were hitting problems.
The point of this meeting was to provide a relief valve of sorts: they'd bring up issues that weren't getting traction, and then the people at HQ (where I was) in that same meeting would "go to bat" for them with the software teams. First, it's harder to outright dismiss someone who's right there in person instead of being 2300 miles away, sure, but there's more to it than that. Anyone from "support" was automatically inferior in the eyes of a great many software people, so coming in as not one of "those support people" also (dubiously) conferred benefits in terms of getting them to maybe pay attention by default.
At some point I had seen enough, and just told the support folks that I was going to inherently trust that they were diligent people who weren't making things up from thin air, and had good reasons for reporting these things. I said that they should go ahead and open a SEV (think outage report, site event, service event, significant event...) any time they thought it was appropriate, and that I would support them by default by chasing people down on the software side of things and getting them to deal with the (now quite visible to the entire company) situation.
...
Some time later, I described the "support" situation to someone new to the company, and their reaction was genuine and spot on: isn't that backwards? Essentially, why are we not finding out about problems until they make it all the way to production, make life miserable for a bunch of customers, turn into trouble reports that then get noticed as a pattern by the senior support people, and then get pushed back to HQ?
I was glad to hear this out of the mouth of this new person, since they were at the sort of high-up position that (in my eyes) implied that they had the power to change this sort of thing for the better. They could theoretically light a fire under these teams to own their problems better lest they incur the wrath of someone who only answers to the president and CEO of the company.
Unfortunately, not nearly enough happened, and while I don't know exactly why, I can always speculate. I considered what it must have been like. If things are broken and you want them fixed, fixing them is supposedly your job, and you report to the top two people at the company, yet it still doesn't happen, then I guess "the fish rots from the head" really is true, huh? I was not surprised when this person left the company earlier this year.
...
This encounter and another one from around the same time taught me a new bit of nasty truth about this line of work: it doesn't matter what title you have. You can be the "(S)VP of XYZ" at a company, but when push comes to shove, if the people truly running the show don't support your decisions, you aren't really the (S)VP of anything, and it's time to leave.