Thursday, February 16, 2012

Partial network partitions and obstacles to innovation

Imagine a big network with a bunch of hosts on it. Maybe you break them up by logical groupings, like racks, or groups of racks. You give each group separate address space, like a /24 or similar. Then you have a huge router with tons of ports, and each port is an interface to a different group.
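
Just to make that layout concrete, here's a minimal sketch of the addressing scheme in Python. The 10.20.0.0/16 block and the group names are made up for illustration; the only point is "one /24 per rack group, one router port per /24."

```python
# Illustrative only: carve one /24 out of a (hypothetical) 10.20.0.0/16
# for each rack group. Each /24 maps to one port on the big router.
import ipaddress

site = ipaddress.ip_network("10.20.0.0/16")
groups = ["rack-a", "rack-b", "rack-c", "rack-g"]

for group, subnet in zip(groups, site.subnets(new_prefix=24)):
    gateway = subnet.network_address + 1
    print(f"{group}: {subnet} (router port / gateway: {gateway})")
```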

This sounds pretty boring, right? Well, it could become a bottleneck for certain applications. In this scheme, traffic from a host to the outside world and traffic from that same host to another internal host both route through that same router link. Imagine that all hosts are plugged into a switch which does gigabit Ethernet out to them, and 10 gigabit Ethernet to the router via a speedy uplink port. It doesn't take much to realize that 10 machines might be able to saturate that uplink.
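
The arithmetic behind that last sentence is nothing fancy, but here it is spelled out, assuming every host can actually push line rate at the same time (a worst case, not a typical one):

```python
# Back-of-the-envelope: how many 1 GbE hosts does it take to fill a 10 GbE uplink?
host_link_gbps = 1.0    # gigabit Ethernet from the switch to each host
uplink_gbps = 10.0      # the switch's 10 GbE uplink to the big router

print(uplink_gbps / host_link_gbps)   # 10.0 -- ten busy hosts and the uplink is full
```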

So you take a look at your traffic mix, and find out that a substantial amount of traffic stays within a given location and just jumps between racks/groups. You realize that maybe you should start stacking on additional connections just for your intra-location traffic. Eventually, you wind up with a second fabric of routers, each with interfaces in the networks of your rack groups.

Actually using it isn't that difficult. You can just keep "default" pointed up and out to the *big* router, and then add a network route for local machines through the secondary router. Now traffic for other nearby hosts jumps through that pipe instead. Everything's cool, right?
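
Here's a minimal sketch of that route selection, assuming the site's internal space is something like 10.20.0.0/16 (a made-up prefix) and that the host does ordinary longest-prefix matching, preferring the most specific route that contains the destination:

```python
# Toy longest-prefix match: a default route out the big router, plus a more
# specific route for the site's internal space via the secondary fabric.
import ipaddress

routes = [
    (ipaddress.ip_network("0.0.0.0/0"), "big-router"),          # default: up and out
    (ipaddress.ip_network("10.20.0.0/16"), "secondary-router"), # intra-location fabric
]

def next_hop(dest: str) -> str:
    addr = ipaddress.ip_address(dest)
    matches = [(net, hop) for net, hop in routes if addr in net]
    # Prefer the most specific (longest prefix) matching route.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("10.20.7.42"))   # a nearby host -> secondary-router
print(next_hop("192.0.2.10"))   # the outside world -> big-router
```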

Well, yes and no. Your network design has just opened up to allow way more traffic to move around, but you've also now introduced dozens or hundreds of new links within the local cluster. If any of them go down, very bad and strange things will happen to your traffic.

Prepare to enter the world of the partial network partition.

Normally, you'd catch a failed rack link by noticing that all of the hosts in that rack have fallen off. This is pretty obvious. When you lose a whole bunch of machines, other stuff tends to notice. This is trivial to monitor.

Now, with this extra path through your network, you can have a whole new set of craziness. Let's say the link in the secondary fabric between rack A and rack G fails. Hosts on A can't see G and vice-versa. They can still talk to everyone else, though. What happens now?

Well, your "top down" monitoring systems will probably never notice. If they happen to reside outside the cluster, they will come in through the "top" of each rack and will therefore be able to reach everything. From their point of view, everything is good. Likewise, if your monitoring happens *inside* a cluster, but from another rack (not A or G), then it'll also be able to reach everything else.

This may even persist for a while! If nothing on A needs G or the other way around, it will just sit there being broken and nobody will care. Then, one day, things will shift around. Maybe some critical service will have a "leader" land in rack G. Then a host in rack A will need to talk to it to get something done. It will fail. It will look like "host G01 is unreachable" or something.

This will get people looking at host G01. It's fine, naturally. So they'll look at the source host in A, and it'll also be fine. Depending on the people you have investigating this problem, they may just write it off as "one of those things" and go on with life. This is bad!

It will take someone actually logging into the A host and trying to reproduce the problem to realize that something strange is afoot. They will then have to jump around to other A hosts and then finally non-A hosts until the pattern emerges. Then they have to somehow convince the network wranglers that their precious fabric is broken, even though the network monitoring stuff doesn't seem to be complaining.

After seeing this once or twice, the frazzled person who's had to troubleshoot it and then fight to get people to even believe it's a problem might propose a solution. Every host should periodically try to reach out to at least one other host per distant rack to make sure it can. If it can't, then it should report this problem somehow. Even a small number of probes would quickly establish what the problem was.
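
A minimal sketch of that probe might look like the following. The hostnames, the port, and the reporting mechanism (just printing) are all placeholders; a real version would pick probe targets automatically and feed failures into whatever monitoring system is already in place.

```python
# Per-host probe sketch: try a TCP connection to one host in each distant
# rack group and complain about any that can't be reached.
import socket

# Hypothetical probe targets: one representative host per distant rack.
TARGETS = {
    "rack-b": "b01.example.internal",
    "rack-c": "c01.example.internal",
    "rack-g": "g01.example.internal",
}
PORT = 22        # any TCP port known to be listening would do
TIMEOUT = 3.0    # seconds

def can_reach(host: str) -> bool:
    """Return True if a TCP handshake to host:PORT succeeds."""
    try:
        with socket.create_connection((host, PORT), timeout=TIMEOUT):
            return True
    except OSError:
        return False

for rack, host in TARGETS.items():
    if not can_reach(host):
        # In real life this goes to the monitoring system, not stdout.
        print(f"cannot reach {host} in {rack} -- possible partial partition")
```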

Of course, if said frazzled person is the type to discuss a problem first, she might be told that it's a worthless idea and to not pursue it. Then time will pass, and the problem will come up again and again, each time causing someone else to jump through the same hoops as they re-solve the problem from first principles.

This is usually when the original person who was shot down starts chuckling. You could have had the monitoring already, but instead you shot it down just like you shoot down every other idea. So now, some other service had a failure and someone else had to waste time covering old ground.

This is what happens when your company becomes infected with "no men" -- think "yes men", only they say "no" to everything. To be fair, these are actually "no people", since there's nothing limiting this behavior to just men.

In this case, the "correct" answer given the acidic environment would have been to just ignore them and just write the tool anyway. As I've written before, when you have to resort to this kind of behavior to get anything done, you have a sick team.

Believe it or not, there seem to be places where you can have a rational discussion about the need for a tool without automatically eliciting a "no, you shouldn't do that" from everyone within earshot. There are also places where merely discussing a topic does not automatically mean that you are somehow incapable of handling the task.

If you're surrounded by people who will shoot down ideas without discussing them just to get their jollies, you're in trouble. Get out now.