Monday, August 15, 2011

20 years of experience, or one year 20 times?

I have a collection of experiences which lead me to automatically suspect certain things when a system isn't behaving properly. I've been learning that chalking them up to "intuition" is a good way to get a bunch of people to stop believing you, true though it may be. Even though I can come back later and show them that yes, this is in fact what the problem was, just like I said, it's like they are programmed to disbelieve.

Here's one such example. Someone escalated a support ticket to me for something crazy which was happening on a customer's machine. I don't even remember what they were trying to do. I do recall that they just blew right past the first oddity I noticed, and it turned out to be the key to everything.

Basically, I went to connect to the customer's machine. We'd always log in with a regular account and then su to start doing stuff as root. So I paste in the password for that account and hit enter. It sits, and sits, and sits, and sits. Then it goes, okay, last login: on ptsXX from IP at date, and drops me to a prompt as usual. Nobody else thought that was the least bit odd. I noticed it right away.

Right off, I wanted to know why it did that. Was it really busy or something? Time to check the load average, or to look for other people doing stuff. I did "w".

 16:59:33 up 21 days,  2:12,  1 user,  load average: 0.01, 0.00, 0.05
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
tech     pts/0    64.39.xx.yy      16:59    0.00s  0.00s  0.00s w

Right then, I was pretty sure I knew what was going on. Just those two things had given me a strong lead. Thing #1 was the lag after authenticating but before getting to a prompt. Thing #2 was seeing that IP address in there. Our external NAT interface for the support network had a proper PTR entry, and a matching A record.

It should have shown up as nat-vlanXXXX.YYY.company.TLD but it did not. That right there told me that DNS resolution on this box was broken, and that by itself could easily explain whatever initial cause had led to this ticket being filed in the first place. So now I just had to diagnose the DNS woes.
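
That pause at login fits the same picture: sshd will typically try to reverse-resolve the connecting address before handing over a shell, and with a dead resolver it just sits there until the lookup times out. With working DNS, that lookup would have come back instantly with the name I expected to see. Something like this, using the same placeholders as above:

$ dig +short -x 64.39.xx.yy
nat-vlanXXXX.YYY.company.TLD.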

Sure enough, running my own DNS queries with dig and friends would just hang and eventually time out and fail. This machine's /etc/resolv.conf looked fine. I think I wound up running tcpdump, and then it showed me something quite interesting.

The machine was generating ARP requests for its nameservers. This made no sense, since there was no way any customer box would ever be on the same broadcast domain as our caching nameservers. Those hosts were on their own little segment and subnet off somewhere in a room named after a Greek letter. There was no way they'd ever see the requests, so there was no chance for this box to get a response.
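
Roughly what that looks like on the wire, with made-up placeholder addresses standing in for the box and its configured nameserver: nothing but ARP requests, and no DNS packets ever leaving the machine.

$ tcpdump -n -i eth0 arp or port 53
16:59:45.000000 ARP, Request who-has 69.aa.bb.cc tell 69.20.x.y, length 28
16:59:46.000000 ARP, Request who-has 69.aa.bb.cc tell 69.20.x.y, length 28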

This led me to my next question: why does this machine think that these nameservers are on the same network? Clearly, it must have a seriously screwed up interface somewhere. I ran ifconfig, and the rest became clear.

eth0:0    Link encap:Ethernet  HWaddr xx:yy:zz:aa:bb:cc
          inet addr:69.20.x.y  Bcast:69.255.255.255  Mask:255.0.0.0

Yep. This machine had an alias configured on it with a classful netmask. It thought that 69/8 was attached directly, and that 16.7 million IP addresses starting with 69 could be reached via ARP -- no router hops at all. How wonderfully silly.
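
Another way to see the same thing is the routing table: that /8 mask turns into a directly-connected route covering the entire block. A rough sketch of what that looks like, with the uninteresting routes left out and the addresses still being placeholders:

$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
69.0.0.0        0.0.0.0         255.0.0.0       U         0 0          0 eth0

With that entry in place, anything in 69.0.0.0/8 -- our caching nameservers included -- gets ARPed for directly instead of being handed to the gateway.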

A bit of digging revealed that this machine had a netconfig file for eth0:0 which was missing its NETMASK= line. It was either misspelled or gone completely. In the absence of useful data, either the network "ifup" script or ifconfig itself said "well, this looks like a traditional class A, so let's use a 255.0.0.0 netmask". That set up the situation where all of 69.x.x.x was thought to be locally connected.
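
On a Red Hat-style system of that vintage, the file in question is just a little shell-variable file under /etc/sysconfig/network-scripts/. A corrected version would look roughly like this -- and to be clear, the 255.255.255.0 is a made-up illustration, not the customer's real netmask:

# /etc/sysconfig/network-scripts/ifcfg-eth0:0
DEVICE=eth0:0
IPADDR=69.20.x.y
NETMASK=255.255.255.0   # the line that was missing or mangled
ONBOOT=yes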

This broke DNS resolution because while those caching nameservers were in that bigger block, they were not in fact on that network.

So the last question some people might be asking now is: how did we ever manage to log in to find this out in the first place? That's easy. Remember back to my initial login. I was coming from 64.39.x.x, and since that block doesn't start with 69, traffic back to it still matched the default route and went out through the gateway as usual. Only destinations inside 69/8 got the bogus treat-it-as-local behavior.

After setting this interface up properly, everything started resolving as it should, and the original problem was also solved. We suggested that the customer not try to make up his own additional IP addresses, and that even for the ones which were legitimately his, he leave the configuration to us.
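
The repair itself is nothing exciting: fix the config file and bounce the alias, then make sure lookups actually come back. Something along these lines on a Red Hat-style box, with the test query being whatever name happens to be handy:

$ ifdown eth0:0
$ ifup eth0:0
$ dig +short www.company.TLD
(an actual answer this time, in milliseconds, instead of a hang)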

In the months which followed, any time I got a lag-at-login like that, the first thing I'd do was check ifconfig. More often than not, I'd find some classful interface, and it would be the same thing I described here. People watching me work would be mystified: how did you know that?

I said, it's like the story of the person who charges $1.00 to swing the hammer and $99,999 to know where to hit it. It's experience.