
Monday, July 2, 2012

Web hosting and shared fate

If you rely on computers and uptime helps pay your bills, then there's something you should learn about, and it's called shared fate.

Maybe you start out with a single whitebox machine in a bread rack somewhere. This is the original ServerBeach and Rackspace model, and I'm sure it's been copied many times since. All of the stuff you run on it is now vulnerable to anything which may happen to that box... but you knew that.

One day, you decide to order ("pop") a second server. It goes in right next to your original machine on that same bread rack. Now, assuming you've played your cards right and built your sites such that they can be served from either system, you're doing a little better. There's a big problem, though: both of those machines are behind the same switch. If it goes out, you're toast.

Next, you decide to get them to stand up another machine in a different bread rack, since you know that they have a 1:1 ratio of racks to switches. This is better. It gets you off that same switch, but what's this? You're on the same power distribution unit (PDU). If it blows (and trust me, it happens - just ask "Z"), then both of your machines go down. All die, oh, the embarrassment.

When you find out about this, you play hardball with your account manager and get your second system running in a totally different part of the building. It has totally different power, so now the same-PDU situation won't kill you. It also happens to have a different UPS, so that's good, too.

Then, one day, a blue heron decides to go visit Elvis by shorting the primaries outside the big windowless building where your machines live, and the resulting BBQ kills utility power in that business park. That's when you find out that the building you're in had some kind of switching problem with their generator, and the air handlers and chillers won't come back on. The machines themselves are up and on the net, but they're boiling. Someone hits the big red emergency power off button to try to save them, and everything goes down... hard.

The next thing you do is get some space in a facility across the street, since they have separate generators. It would take a lot of work to bring all of this down, and indeed, you lose one or the other a few times in a few years, but never both at once.

Then a tornado or a flood comes along and destroys both at once.

These are all examples of "shared fate", and all of them have already happened to someone. I seem to recall a tornado hitting Herndon, or some similarly server-packed part of Virginia, a couple of years ago. I think it tore up a facility which was being used to serve one of those online games where elves shoot lightning bolts at each other.
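
If you want to see where you stand, one way is to write down what each machine depends on and look for the overlaps. Here's a rough sketch in Python. Everything in it is made up for illustration: the machine names, the four levels (switch, pdu, building, metro), and their values are assumptions, not anything a real hosting provider hands you in this form. You'd have to dig the real answers out of your account manager.

```python
# Hypothetical inventory: every name and value here is invented for illustration.
# Each machine lists the failure domain it sits in at each level.
MACHINES = {
    "web1": {"switch": "sw-a", "pdu": "pdu-1", "building": "bldg-1", "metro": "metro-x"},
    # Different rack and switch, but still on the same PDU as web1.
    "web2": {"switch": "sw-b", "pdu": "pdu-1", "building": "bldg-1", "metro": "metro-x"},
    # Across the street: separate power and building, same metro area.
    "web3": {"switch": "sw-c", "pdu": "pdu-7", "building": "bldg-2", "metro": "metro-x"},
}


def shared_fate(a, b):
    """Return the failure domains that two machines have in common."""
    return {level: a[level] for level in a if b.get(level) == a[level]}


def single_points_of_failure(machines):
    """Return the failure domains shared by every machine in the inventory."""
    hosts = list(machines.values())
    if not hosts:
        return {}
    first = hosts[0]
    return {
        level: value
        for level, value in first.items()
        if all(host.get(level) == value for host in hosts)
    }


if __name__ == "__main__":
    print(shared_fate(MACHINES["web1"], MACHINES["web2"]))
    print(shared_fate(MACHINES["web1"], MACHINES["web3"]))
    print(single_points_of_failure(MACHINES))
```

With this made-up inventory, web1 and web2 still share a PDU, a building, and a metro area, while web1 and web3 share only the metro area. That last one is the tornado.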

There are ways around many of these problems, but they all come at a cost. If all you're doing with your site is pushing around pictures of cats, it might not be worth that kind of effort.

In that case, enjoy your downtime. Those cats can wait.