Writing

Software, technology, sysadmin war stories, and more.

Sunday, January 8, 2012

Be careful with those router configs

One fine day, I was just doing my own thing at work. Everything was working fine, and then something changed. All of a sudden, DNS resolution for most of the world quit working. It wasn't just me, either. People I could reach through various channels in other offices were having trouble, too. Something big was happening.

I started trying to get things moving again for myself by querying some public DNS servers by hand. Strangely, those worked. Lookups also worked from the wireless network that had been set up for guests! It was something about trying to resolve things through our internal network that was bad.

There was another clue: almost all of the roots were unavailable. At first it looked like all of them were gone, but cycling through the various letters of root-servers.net eventually turned up one or two which worked. So now I was really curious. What could have happened to filter all but those?

Somehow, a pattern started emerging. The IP addresses of the root servers are not spread out evenly. This has changed over time, but for the most part, they tend to cluster around the old-school "swamp" space. I found it rather strange that the one or two which weren't in or around that space were working.

A few test traceroutes supported this theory. Going to certain destinations just went around and around in circles, while others were unaffected. Granted, I had to do this with "-n", so I couldn't tell what the PTRs of those hops were, but it's still possible to see a loop without meaningful names.
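Spotting a loop in numeric-only output is just a matter of noticing a hop address that shows up twice. Here is a minimal sketch of that check in Python. The function name find_loop and the hop addresses are made up for illustration; they are not from the actual incident.

```python
def find_loop(hops):
    """Return the repeating run of hops if the path loops, else None.

    hops is a list of hop addresses in the order traceroute reported them.
    Hops that timed out (shown as "*") are ignored.
    """
    first_seen = {}
    for index, hop in enumerate(hops):
        if hop == "*":
            continue
        if hop in first_seen:
            # Everything from the first sighting up to here is the cycle.
            return hops[first_seen[hop]:index]
        first_seen[hop] = index
    return None


# Hypothetical hop lists, roughly what "traceroute -n" would give you.
looping_path = ["10.0.0.1", "10.0.1.1", "10.0.2.1", "10.0.1.1", "10.0.2.1"]
healthy_path = ["10.0.0.1", "10.0.1.1", "*", "10.0.9.1"]

print(find_loop(looping_path))
print(find_loop(healthy_path))
```

The first path bounces between the same two routers, so the function returns that pair. The second one never repeats, so it returns None.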

I laid out my guess: someone had managed to filter or otherwise blackhole something huge, like a /4, or a /3, or a /2 or something ridiculous like that. Normally you might deal with a network that's a /20 (4096 IPs) or even a /16 (65536 IPs), but once you get past that, you're in some seriously crazy territory.

After all, if a /1 is half of the IPv4 Internet and /2 is a quarter, then /3 is an eighth. Consider what one-eighth of the entire address space means - 2^29 IPs!
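The arithmetic is easy to check: an IPv4 prefix of length n leaves 32 - n host bits, so it covers 2^(32 - n) addresses. A quick sketch in Python (the helper name addresses_in_prefix is mine), cross-checked against the standard ipaddress module:

```python
import ipaddress


def addresses_in_prefix(prefix_length):
    """Number of IPv4 addresses covered by a prefix of the given length."""
    return 2 ** (32 - prefix_length)


for length in (20, 16, 3, 2, 1):
    count = addresses_in_prefix(length)
    share = count / 2 ** 32
    print(f"/{length}: {count:,} addresses ({share:.6%} of IPv4)")

# Sanity check against the standard library's own idea of a /3.
assert ipaddress.ip_network("0.0.0.0/3").num_addresses == addresses_in_prefix(3)
```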

Some time later, the official word came down as to what had happened. Someone was working on manually configuring a point-to-point link between two sites. One traditional way to set these up is to carve out a /30 for the link. That gives you four addresses: the network and broadcast addresses, which you can't assign to anything, then two more for the near and far ends.
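To see how a /30 breaks down, here is a small sketch using Python's ipaddress module. The network 192.0.2.0/30 comes from the documentation range and is purely an example, and describe_ptp_link is a name I made up for this.

```python
import ipaddress


def describe_ptp_link(cidr):
    """Split a /30 into its network, broadcast, and two usable addresses."""
    link = ipaddress.ip_network(cidr)
    near_end, far_end = link.hosts()  # a /30 has exactly two usable hosts
    return {
        "network": str(link.network_address),
        "near_end": str(near_end),
        "far_end": str(far_end),
        "broadcast": str(link.broadcast_address),
    }


# Example addresses only, not the real link from this story.
for role, address in describe_ptp_link("192.0.2.0/30").items():
    print(f"{role:>9}: {address}")
```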

A /30 is all well and good, but it's just one missed character away from /3. Yep. Someone set up a link and missed the "0", so that router figured it had a great local route to whatever /3 happened to match up with that IP address, and it shared it with its friends. They all gladly accepted it and proceeded to send those packets to someplace that really wasn't expecting that much traffic.
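Here is a rough illustration of what that one missing character does to the connected route, again with Python's ipaddress module. The interface address 192.0.2.1 is a stand-in I picked from the documentation range, not the real one, and connected_route is a hypothetical helper that only mimics how a router derives a network from an interface address and prefix length.

```python
import ipaddress


def connected_route(address_with_prefix):
    """Network a router would consider directly attached for this interface."""
    return ipaddress.ip_interface(address_with_prefix).network


intended = connected_route("192.0.2.1/30")  # what was meant
typo = connected_route("192.0.2.1/3")       # what was typed

print(f"intended: {intended} ({intended.num_addresses:,} addresses)")
print(f"typo:     {typo} ({typo.num_addresses:,} addresses)")
print(f"range:    {typo.network_address} to {typo.broadcast_address}")

# Unrelated destinations now look like they live on the little PTP link.
for destination in ("203.0.113.5", "10.1.2.3"):
    swallowed = ipaddress.ip_address(destination) in typo
    print(f"{destination} swallowed by the bad route: {swallowed}")
```

With this example address, the bogus route covers everything whose first octet is 192 through 223, which is 2^27 times more address space than the link was supposed to claim.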

Fortunately, this only happened internally and didn't leak out via BGP. No other networks were affected.

I was pretty happy when I found out. It's not every day you get to guess at something like that and actually turn out to be right.