Writing

Software, technology, sysadmin war stories, and more.

Thursday, June 28, 2012

Why I hate playbooks and people who shoot down everything

The time has come for me to rant about playbooks for on-call situations. There is a world of difference between helpful information which was discovered during an unlikely event and ordinary checklists. Where it gets messy is when people think they are the same thing.

I first encountered this shortly after being handed a pager for a couple of production services at a prior job. It was my fourth week there, and it happened to fall during Thanksgiving. That week, I saw just how often the on-call person gets paged for that team. It was merciless. I'm talking about 20-30 pages between 9 AM and 6 PM, and that was considered a normal day!

Worse than the quantity was what you were expected to do when the pager went off. All of the pages would look like this: "DogTooFuzzy10Min xxab", and each one would (usually) have a link to a wiki page. There, more often than not, someone had gone in and put in a list of things to look at and do.

  1. Connect to the groomer process in location (xxab)
  2. Verify that number X is higher than number Y
  3. Switch on automatic puppy grooming machine for 10 minutes
  4. Verify that X has dropped to below Y
  5. Switch off the machine

It was seriously that stupid. It was all stuff that could be expressed as a series of if-then-else tests with conditions that could be tested by code and actions which could also be taken by code. But, instead of having code do it, they had us do it.
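To make that concrete, here is a minimal sketch of the five steps above as code. None of this comes from the real system: the `Groomer` class, its callbacks, and the return strings are hypothetical stand-ins, and the 10 minute run time is the only number taken from the checklist.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Groomer:
    """Hypothetical handle to the groomer process in one location."""
    location: str
    read_x: Callable[[], float]          # returns "number X"
    read_y: Callable[[], float]          # returns "number Y"
    set_machine: Callable[[bool], None]  # switches the grooming machine on/off


def handle_dog_too_fuzzy(groomer, run_seconds=600.0, sleep=time.sleep):
    """Run the wiki checklist and report what happened."""
    # Step 2: if X is not higher than Y, there is nothing to fix.
    if groomer.read_x() <= groomer.read_y():
        return "no_action"
    # Step 3: switch the machine on for the stated run time.
    groomer.set_machine(True)
    try:
        sleep(run_seconds)
        # Step 4: check whether X has dropped below Y.
        fixed = groomer.read_x() < groomer.read_y()
    finally:
        # Step 5: always switch the machine off, even if a check blew up.
        groomer.set_machine(False)
    # Only the "escalate" case needs a human to be paged.
    return "resolved" if fixed else "escalate"
```

The `sleep` parameter is there so the wait can be swapped out in a test. The point is just that a person only needs to hear about it when the result is "escalate".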

Of course, these wiki pages would invariably miss steps because the documentation efforts were never particularly good, and things would change anyway. Then there were the things which "everyone knew" and were never written down, but the new person on the team couldn't possibly know them yet!

When I say "they", I'm talking about the other people on the team, of course. The people who wrote those wiki pages were my teammates who had been doing this job longer than me, the brand new employee.

So naturally, I showed up and declared this to be insane. Pages should be the anomaly, not the norm! We should not be setting up for a life where these things just appear and we just sit there turning the crank and occasionally receiving a food pellet.

I expected far more from what was supposedly the gem of the industry.

Did I mention this was all e-mail driven? An alert kicked off a mail to your pager, and you'd acknowledge it, and that was that. If something else fired off, related or not, it would kick off yet another e-mail to your pager. Since these monitoring systems were all based on symptoms and a bunch of them had the same root causes, we'd get storms.

There was actually a command called "STFU" which made it auto-ack all pages for some period of time, like 15 minutes.

Also, there was no way to aggregate pages. Let's say you got 5 pages for the same event. Could you batch them up? Nope. There was no way to say that all of these were part of the same root cause and handle them that way. Worse still, there was no notion of "oh, so and so is working on that", so when someone started messing with something while doing normal maintenance work, you'd probably get pages from it.

I proposed something which would let us batch up related pages, but it was shot down. "You don't want to get detached from the (20 or so things we normally check)" was the response. Great. I was told to worry about knocking out bogus pages. On that note, I actually agreed with them, but not to the point of completely ignoring my idea.

Of course, years later, someone else wound up making a page batching service which did all of that and it was hailed as a wonderful idea. It's just amazing how that works.
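For the curious, the core of a page batcher can be tiny. This is only a sketch under loud assumptions, not the service that eventually got built: it guesses that pages from the same location inside a 15 minute window share a root cause, and `PageBatcher`, `Incident`, and `receive` are names made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Incident:
    key: str
    opened_at: float
    pages: List[str] = field(default_factory=list)


class PageBatcher:
    """Fold related pages into one incident instead of paging for each."""

    def __init__(self, window_seconds: float = 900.0):
        self.window_seconds = window_seconds
        self.open_incidents: Dict[str, Incident] = {}

    def receive(self, alert: str, location: str, now: float) -> bool:
        """Record a page. Return True only if the pager should go off."""
        # Assumption: same location means same root cause. A real system
        # would want a better grouping key, or a human to set one.
        key = location
        incident = self.open_incidents.get(key)
        if incident is not None and now - incident.opened_at <= self.window_seconds:
            # Already being handled: attach the page quietly.
            incident.pages.append(alert)
            return False
        # First page for this key, or the old window expired: notify.
        self.open_incidents[key] = Incident(key, now, [alert])
        return True
```

Nothing is dropped here. Every page is still recorded on its incident, so you can see all of the symptoms without the pager going off once per symptom.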

I also proposed something which would allow you to at least reroute some of these things to whoever was actually responsible for them. If someone was working on X and X kept alerting, point the pages at that person. This all came up one day when I was on-call and being flooded by pages, and my boss got antsy because I wasn't commenting on every single one as the mails came through.

The problem was that he wasn't in the cube to hear the discussions where I found out who was working on something and reminded them to stop making it page me. There was no electronic way to reflect this, so it all came back to me. I had to remember that one person was wrangling database shards, another one was upgrading something, and a third was poking at the monitoring rules. He said a way to track that wasn't useful.

All I could say was "OK, just because you say this doesn't need to exist doesn't mean it doesn't exist, because it does, in here", as I pointed to my head. I was offering to make a way to export this state so that antsy manager types could figure out what was going on and keep them from bugging me directly, but he wouldn't have it.
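The state in question is not complicated. Here is a rough sketch of what exporting it could look like, with everything in it assumed for illustration: the `ClaimRegistry` name, the component strings, and the idea that claims expire on their own after a time limit.

```python
from typing import Dict, Optional, Tuple


class ClaimRegistry:
    """Tracks who said they are working on what, and until when."""

    def __init__(self) -> None:
        # component -> (owner, expiry timestamp)
        self._claims: Dict[str, Tuple[str, float]] = {}

    def claim(self, component: str, owner: str, now: float, ttl_seconds: float) -> None:
        self._claims[component] = (owner, now + ttl_seconds)

    def owner_of(self, component: str, now: float) -> Optional[str]:
        entry = self._claims.get(component)
        if entry is None:
            return None
        owner, expires_at = entry
        if now >= expires_at:
            # Stale claim: forget it so pages go back to on-call.
            del self._claims[component]
            return None
        return owner

    def route(self, component: str, on_call: str, now: float) -> str:
        """Send the page to whoever claimed the component, else to on-call."""
        return self.owner_of(component, now) or on_call

    def snapshot(self, now: float) -> Dict[str, str]:
        """Export active claims so anyone can see what is going on."""
        return {c: o for c, (o, exp) in self._claims.items() if now < exp}
```

The expiry is a design choice: a claim that never times out is just another thing somebody forgets to clean up, and then real pages go to a person who left hours ago.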

It seemed like all of my ideas were little more than balloons ready to be peppered with buckshot from their many shotguns. They shot holes in everything, and it became clear that a lot of it was just because they could. After all, they didn't have to propose anything better.

I wound up making a bunch of little tools for my own purposes. They would do the monkey work of chasing down the tree of dependencies to find actual problems. That saved me a whole lot of clicking around and staring at graphs. I don't think anyone prepares you for just how much time you waste just staring at graphs over there.
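Those tools are not reproduced here, but the general shape of "chase the dependency tree" is easy to sketch. In this example the dependency map and the `is_healthy` check are hypothetical inputs you would have to supply, and `find_root_causes` is a made-up name. It walks down from the alerting service and reports the unhealthy nodes that have no unhealthy dependencies of their own.

```python
from typing import Callable, Dict, List, Set


def find_root_causes(
    service: str,
    deps: Dict[str, List[str]],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Return unhealthy nodes under `service` with no unhealthy dependencies."""
    roots: List[str] = []
    seen: Set[str] = set()

    def visit(node: str) -> None:
        if node in seen:
            return  # already checked; also guards against cycles
        seen.add(node)
        if is_healthy(node):
            # Assumption: nothing below a healthy node is worth chasing.
            return
        unhealthy_children = [d for d in deps.get(node, []) if not is_healthy(d)]
        if not unhealthy_children:
            # Broken, but everything it depends on is fine: likely culprit.
            roots.append(node)
        for child in unhealthy_children:
            visit(child)

    visit(service)
    return roots
```

With a map like `{"frontend": ["api", "cache"], "api": ["db"]}` and a check that flags frontend, api, and db as unhealthy, this points at db and skips the two symptoms above it.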

Anyway, as it became obvious that I was somehow suffering less and less during my on-call shifts, other people wanted a part of this. This is how a bunch of my tools eventually migrated out of my home directory and into the official code depot.

They wouldn't take these things when I offered them, but when I went ahead and used them anyway, then they wanted a part of it.

Sometimes I ask myself if that job even had a "honeymoon" period when everything was perfect and nothing could go wrong. I still wonder.