Load 'em up and throw 'em under the bus
In recent times, I've been realizing more and more just how much a screwed-up management situation can lead to screwed-up technical situations. I've written a bit about this in the past few months, and got to thinking about a specific anecdote from not too long ago.
I was working on a team which was supposed to be the "last line of defense" for outages and other badness like that. We kept having issues with one particular service, run by another team, which ran on every system in the fleet and was essential for keeping things going (you know, the cat pics). We couldn't figure out why it kept happening.
Eventually, I wound up transferring from my "fixer" team and into the organization which contained the team in question, and my first "tour of duty" was to embed with that team to figure out what was going on. What I found was interesting.
The original team had been founded some years before, but none of those original members were still there. They had moved on to other things inside the company. There was one person who had joined the team while the original people were still there, and at this point, he was the only one left who had "overlapped" with the original devs.
What I found was that this one person who had history going back to when the "OGs" were still around was basically carrying the load of the entire team. Everyone else was very new, and so it was up to him.
I got to know him, and found out that he wasn't batshit or even malicious. He was just under WAY too much load, and was shipping insanity as a result. Somehow, we managed to call timeout and got them to stop shipping broken things for a while. Then I got lucky and intercepted a few of the zanier ideas while he was still under the stupid-high load, and we got some other people to step up and start spreading the load around.
I pitched in too, trying to help some of the irked customers of the team and doing some general "customer service" work. My thinking was that if I could do some "firewall" type work on behalf of the team, it would give them some headroom so they could relax and figure out how to move forward.
This pretty much worked. The surprise came later, when the biannual review cycle started up and the "calibration sessions" got rolling. They wanted to give this person some bullshit sub-par rating. I basically said that if they gave him anything less than "meets expectations", I would be royally pissed off, since it wasn't his fault.
What's kind of interesting is that they asked the same question of one of my former teammates (who had also been dealing with the fallout from these same reliability issues), and he said the same thing! We didn't know we had both been asked about it until much later. We hadn't even discussed the situation with the overloaded engineer. It was just apparent to both of us.
With both of us giving the same feedback, they took it seriously, and didn't hose him over on the review. He went on to do some pretty interesting work on monitoring and other new things (and bounced it off the rest of the team first), and eventually shoved off for (hopefully) happier shores.
The service, meanwhile, got way better at not breaking things. The team seemed to gel in a way that it hadn't before. It even pulled through a truly crazy Friday night event that you'd think would have caused a full site outage, but didn't. Everyone came together and worked the problem. The biggest impact was that nobody internally could ship new features for a couple of hours while we figured it out and brought things back to normal. The outside world never noticed.
Not long after that event, I considered the team "graduated", decided I no longer needed to embed with them, and went off to the next wacky team in that particular slice of the company's infra organization.
This was never a tech problem. It was one guy with 3 or 4 people's worth of load riding on his shoulders who was doing his very best but was still very much human and so was breaking down under the stress. They tried to throw him under the bus post-facto, but we wouldn't stand for it. This was a management problem: they let it happen in the first place.
See how it works?