Writing

Software, technology, sysadmin war stories, and more.
Monday, July 15, 2019

Your nines are not my nines

I've had some occasions of late to peer through the looking glass into a world that I hadn't seen much of previously. Specifically, I'm talking about the world of so-called "cloud" stuff, where you basically pay someone else to build and run stuff for you, instead of doing it yourself.

I'll skip the analysis of build vs. buy and just jump straight to the point where you've chosen "buy". Then you've had a whole bunch of fun outages caused by something going wrong with their services. Finally, you reach the point of a sit-down talk with the vendor to figure things out. Maybe they send some sales people too, or perhaps it's just engineers. You talk for a while, and before long, you realize what happened.

They are huge. They are like a giant which lumbers around while you are a gnat. You are nothing to them.

This becomes obvious when talking about some problem you experienced at the hands of their system. The whole time, their dashboard stayed green because from their point of view, they had tremendous availability. We're talking 99.999% here! Totally legit!

Meanwhile, you were having a really bad day. Nothing was working. Your business was in shambles. Your customers were at your throat yelling for action, and all you could do was point at the vendor. What happened?

Well, this is the point where you find out that their "99.999%" availability is for their entire system. They see that, and they're good. It's not a problem! Everything is fine.

This also completely misses the fact that for you, everything was failing. It doesn't matter though, since your worst day still won't move the needle on their fail-o-meter. They won't see you. They won't have any idea anything even happened until you complain weeks later.

You are the bug on the windscreen of the locomotive. The train has no idea you were ever there.

The problem is that they weren't monitoring from the customer's perspective. Had they done that, it would have been clear that oodles of requests from some subset of customers were failing. They would have also realized that certain customers had all of their requests failing.

For those customers, there were no nines to be had that day.
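
To put rough numbers on it, here's a minimal sketch of what counting per customer might look like. Everything in it is made up for illustration: the customer names "giant" and "gnat", the request counts, the 99.9% alert threshold, and the function names. It is not how any particular vendor does it, just the general idea of keeping a success ratio per customer instead of one ratio for the whole system.

```python
from collections import defaultdict

def overall_availability(requests):
    """Fraction of all requests that succeeded, across every customer."""
    requests = list(requests)
    successes = sum(1 for _, ok in requests if ok)
    return successes / len(requests)

def availability_by_customer(requests):
    """Fraction of requests that succeeded, tracked separately per customer.

    `requests` is an iterable of (customer_id, succeeded) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # customer -> [successes, total]
    for customer, ok in requests:
        counts[customer][1] += 1
        if ok:
            counts[customer][0] += 1
    return {c: good / total for c, (good, total) in counts.items()}

def customers_below(per_customer, threshold):
    """Customers whose own availability is under the threshold."""
    return sorted(c for c, a in per_customer.items() if a < threshold)

# Hypothetical traffic: one huge customer with no failures,
# and one tiny customer for whom every single request fails.
log = [("giant", True)] * 999_995 + [("gnat", False)] * 5

print(f"whole system: {overall_availability(log):.4%}")
per_customer = availability_by_customer(log)
for name, avail in sorted(per_customer.items()):
    print(f"{name}: {avail:.4%}")
print("needs attention:", customers_below(per_customer, 0.999))
```

With those invented numbers, the whole system sits at 99.9995%, comfortably above five nines, while "gnat" is at 0%. Only the per-customer view flags it.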

Seriously, if you have a multi-tenant system, you owe it to your customers to monitor it from their point of view. Otherwise, how can you possibly know when you've done something that'll leave them in the cold?