Software, technology, sysadmin war stories, and more.
Tuesday, July 31, 2012

Avoiding some types of DNS hosting problems

One of the things I used to do in my web hosting meta-support days (when I supported the support teams, if that makes any sense) was think of ways that bad things could happen to us. I've written about at least one of them previously when describing unreasonable demands and how you might create a "support denial of service attack" by filing a bunch of stupid tickets at your competition.

This one was a little more removed from bad actors. It had more to do with people not liking single points of failure. Our DNS servers had been swung over to anycast, such that their IP addresses were advertised from different data centers, and so it was likely at least one of them would be reachable by a client at any given time. This was a good thing, and customers liked it.

I wondered about another problem, though: that of delegation. Let's say you register a domain, and your hosting company lets you use their nameservers: ns1.example.net and ns2.example.net. You tell your registrar to use those exact names as your primary nameservers, and they start publishing the appropriate NS records in that top-level domain for your zone.

The problem is one of glue, or specifically missing glue. Odds are good that a query for "your.domain IN NS" is only going to return "your.domain. IN NS ns1.example.net." and "your.domain. IN NS ns2.example.net." with no further details about those two example.net hosts. All of your clients will then have to shelve the lookup temporarily and chase down ns1/ns2.example.net.

Normally, this is not a problem. ns1 and ns2.example.net probably are listed as primaries for their own domain, and in this case, the top-level domain will punt back A records as glue to break the deadlock which would otherwise result. Your resolver can then take this and continue on its merry way.
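To make the distinction concrete, here is a minimal Python sketch that sorts a delegation's nameservers into two piles: those that sit inside the zone being delegated, where the parent has to hand out glue to avoid a deadlock, and those that live somewhere else and force a separate lookup. The function names are made up for illustration, and the check is a plain label comparison, not what any registry actually runs.

```python
def labels(name):
    """Lower-case a DNS name and split it into labels, ignoring a trailing dot."""
    return tuple(name.lower().rstrip(".").split("."))


def is_in_zone(ns_name, zone):
    """True if ns_name is the zone itself or a name underneath it."""
    ns_labels = labels(ns_name)
    zone_labels = labels(zone)
    return (len(ns_labels) >= len(zone_labels)
            and ns_labels[-len(zone_labels):] == zone_labels)


def classify_delegation(zone, nameservers):
    """Map each nameserver to what a resolver needs in order to reach it."""
    return {
        ns: "glue required" if is_in_zone(ns, zone) else "separate lookup"
        for ns in nameservers
    }


if __name__ == "__main__":
    # The customer's zone points outside itself: no glue, extra chase.
    print(classify_delegation("your.domain", ["ns1.example.net.", "ns2.example.net."]))
    # The hosting company's own zone points inside itself: glue breaks the deadlock.
    print(classify_delegation("example.net", ["ns1.example.net.", "ns2.example.net."]))
```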

However, sometimes, web hosting companies don't publish glue for the domain in which their primary nameservers live, and instead send visitors on yet another chase into another domain.

Eventually, you either hit a domain which has glue, or you run into some limiter in your resolver, and it gives up and fails the request. Of course, it's not consistent. If you happen to query another domain which has a shorter path to something in that too-long chain, odds are it'll stick around in your local cache for a while. When that happens, the original domain with the deep chain of glueless domains might just magically start resolving for you... for a while, at least.
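The "works after you poke at it" effect is easier to see in a toy model. The sketch below is not a real resolver: the domain names, the addresses (taken from the documentation range), the depth limit of 1, and the crude "last two labels" rule for finding a nameserver's own domain are all assumptions chosen to keep it small. It only shows how a cached entry from an unrelated lookup can shorten a glueless chain enough to get under the limit.

```python
def parent_domain(host):
    """Crude assumption: the domain is the last two labels of the host name."""
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


class ToyResolver:
    def __init__(self, delegations, glue, addresses, max_depth):
        self.delegations = delegations  # domain -> nameserver host names
        self.glue = glue                # host -> IP handed out with the delegation
        self.addresses = addresses      # host -> IP served by the host's own zone
        self.max_depth = max_depth      # how many glueless hops we tolerate
        self.cache = {}                 # host -> IP learned from earlier lookups

    def nameserver_ip(self, domain, depth=0):
        """Return an IP for some nameserver of `domain`, or None if we give up."""
        if depth > self.max_depth:
            return None  # the limiter kicks in
        for ns in self.delegations.get(domain, []):
            ip = self.cache.get(ns) or self.glue.get(ns)
            if ip is None:
                ns_domain = parent_domain(ns)
                # Glueless: first reach the domain the nameserver itself lives in.
                if ns_domain != domain and self.nameserver_ip(ns_domain, depth + 1):
                    ip = self.addresses.get(ns)
            if ip is not None:
                self.cache[ns] = ip
                return ip
        return None


# Hypothetical chain: your.domain -> hosting-a -> hosting-b -> hosting-c (has glue).
DELEGATIONS = {
    "your.domain": ["ns1.hosting-a.net"],
    "hosting-a.net": ["ns1.hosting-b.net"],
    "hosting-b.net": ["ns1.hosting-c.net"],
    "hosting-c.net": ["ns1.hosting-c.net"],
    "other.domain": ["ns1.hosting-b.net"],
}
GLUE = {"ns1.hosting-c.net": "192.0.2.3"}
ADDRESSES = {
    "ns1.hosting-a.net": "192.0.2.1",
    "ns1.hosting-b.net": "192.0.2.2",
    "ns1.hosting-c.net": "192.0.2.3",
}

if __name__ == "__main__":
    resolver = ToyResolver(DELEGATIONS, GLUE, ADDRESSES, max_depth=1)
    print(resolver.nameserver_ip("your.domain"))   # chain too deep: gives up
    print(resolver.nameserver_ip("other.domain"))  # shorter path: succeeds, fills cache
    print(resolver.nameserver_ip("your.domain"))   # now succeeds thanks to the cache
```

Once the cached entries expire in a real resolver, the first lookup would go back to failing, which is exactly what makes this sort of thing so hard to demonstrate to anyone.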

There's another problem here, too. What if something screws up the entries for that example.net domain where ns1 and ns2 live? Attempts to resolve them might fail. This can happen if someone seriously messes up the zone at your web hosting company, or perhaps if the registrar nukes the domain accidentally, or if your dumb web hosting company lets such a key domain expire. I've seen all of this and then some.

It seems like it would make more sense to delegate to ns1.example.com and ns2.example.net, even if they still point to the same IP addresses as before. If nothing else, it would remove the single point of failure that comes from having both of your primaries in the same domain.
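A check for this is simple enough to sketch. The Python below groups a set of nameserver names by the domain they live in and flags the case where they all share one. It assumes the domain is just the last two labels of the name, which is wrong for suffixes like .co.uk, and the function names are invented for this example.

```python
def parent_domain(host):
    """Crude assumption: the domain is the last two labels of the host name."""
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


def nameserver_domains(nameservers):
    """Return the distinct domains that the given nameserver names live in."""
    return {parent_domain(ns) for ns in nameservers}


def has_single_point_of_failure(nameservers):
    """True if losing one domain would take out every listed nameserver."""
    return len(nameserver_domains(nameservers)) < 2


if __name__ == "__main__":
    print(has_single_point_of_failure(["ns1.example.net", "ns2.example.net"]))
    print(has_single_point_of_failure(["ns1.example.com", "ns2.example.net"]))
```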

Of course, you could always just specify that your domain's primaries are hosts inside that same domain and thus specify IP addresses to be published as glue. The objection usually raised to this approach is that you now have to worry about what happens if your DNS provider changes the IP addresses of those hosts. It has been known to happen, and given enough time, it always leads to all sorts of horrible problems.

The caching nature of DNS combined with the "many overlapping paths" aspect of resolving things means it's a ripe environment for heisenbugs. The very act of trying to troubleshoot a problem by running more queries may temporarily make it go away. This makes it difficult to convince some people that something needs to be done.

I brought this up in a meeting. I received one of those standard "nice doggie" responses where people said "wow, that's deep" and then promptly forgot about it. I guess that's easier than taking any action. I mean, all they'd really need to do at a first pass was create identical ns1 and ns2 records in their .net domain to match those already in their .com domain. Then they'd just have to share that tip with customers.

Checking right now as I write this years later, things are exactly as they were before. I guess I shouldn't be surprised.