I am not your black box monitoring service
Remember my story about partial network partitions from last week? A few readers wrote in to say that better monitoring would have taken care of the problem. They were absolutely right!
Using switches and/or routers which send out SNMP traps when an interface changes state would be a start. Having something which polls all of those interfaces to make sure they stay up and running (and are passing data) would be another good thing to do. This is all obvious to anyone who's ever run a sufficiently complicated network, especially after the first meltdown.
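To make the polling idea concrete, here is a minimal sketch in Python. It assumes some collector (SNMP or otherwise) has already turned each poll into a snapshot of link state and octet counters per interface. The `InterfaceSample` type, the `find_problems` function, and the interface names are all made up for illustration and don't come from any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceSample:
    """One poll result for one interface (hypothetical structure)."""
    name: str
    oper_up: bool     # link state as reported by the device
    in_octets: int    # cumulative bytes received
    out_octets: int   # cumulative bytes sent


def find_problems(previous, current):
    """Compare two polls (dicts keyed by interface name) and describe
    every interface that is missing, down, or up but not passing data."""
    problems = []
    for name, old in previous.items():
        new = current.get(name)
        if new is None:
            problems.append(f"{name}: missing from latest poll")
        elif not new.oper_up:
            problems.append(f"{name}: link is down")
        elif (new.in_octets == old.in_octets
              or new.out_octets == old.out_octets):
            # The link claims to be up, but at least one counter
            # has not advanced since the previous poll.
            problems.append(f"{name}: link is up but traffic has stalled")
    return problems
```

The last check is the one that matters for a partial partition: an interface can report "up" the whole time and still move nothing in one direction. Whether a stalled counter deserves a page depends on how busy the link normally is, so treat that threshold as an assumption to tune.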
Given this, here's the question to ask: why did I even have to write that story last week? Well, the answer is simple and yet silly: it's because the network people did not have sufficient monitoring in place. I'm not sure what kind of monitoring they had in place for that machine-to-machine fabric, but if they did have any, it was ineffective.
I was not on any of the infrastructure teams which could have possibly been responsible for that fabric. Instead, I was just one of the users. I could see the network problems from my level, and had the privilege of having root on enough machines that I could poke around and actually verify it. When you can add a network route to the distant rack through the "top" router (which usually takes you to the outside world) instead of using the "intra" router (which connects to the other rack), and that makes it start working again, you know something is very wrong.
This was a fairly common pattern. I'd trip over something and report it as a problem. They'd go, "oh, yeah, huh, X broke", and then they'd poke it and it would come back up. I'd go on with life, and apparently they would too, but they didn't seem to give it a second thought!
So, when it happened again, I started wondering just what was going on. How is it that something like this can be reported by one of your users, and then you just fix it without adding your own monitoring? Isn't that completely embarrassing?
I know that if that sort of failure had happened to the service which I had been running, I would have been mortified in multiple ways. First, my users were affected, and that's bad enough. Second, they saw it before we did because of something we didn't catch first. That sort of failure reflects on the service and on me by extension. I take that kind of stuff personally.
Some services (not just the network people) kept having the same sort of problems happening over and over. I finally started asking "... and what sort of monitoring have you set up?". Some of them tried to dodge the question. I started pushing back. They were not pleased with this!
Finally, I had to make it very clear: if I am your black box monitoring system, you have a serious problem. In other words, if your failure reporting involves first having your system screw over a customer/user (me), and then having me notice and page you, you have failed.
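For the record, the job I was doing by hand is not hard to automate. Here is a rough sketch of a black box probe in Python: it tries to reach each service from the outside, the way a user would, and calls an alert function when it can't. The `probe` and `check_all` names and the target list format are assumptions for illustration, and a bare TCP connect is about the weakest test that still counts. A real probe should exercise an actual request.

```python
import socket


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


def check_all(targets, alert, timeout=2.0):
    """Probe every (name, host, port) target and call alert(name) for
    each one that fails. Returns the list of failed target names."""
    failed = []
    for name, host, port in targets:
        if not probe(host, port, timeout):
            failed.append(name)
            alert(name)
    return failed
```

Run something like this on a schedule from a machine that sits where the users sit, and the service owner gets paged before the user has to do it.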
What really gets me is how people can run services and not be embarrassed at these failures. How is that possible? How can you work on something day in and day out and not have any sense of ownership and connection to the users? Where is the love?
It's this lack of basic human empathy which makes me very worried when I find out that certain outfits are trying to "be social". If it's not in your DNA, don't even try. It's just going to be a big bunch of broken.