Software, technology, sysadmin war stories, and more.
Friday, May 4, 2012

Slow FTP? We'll fix it live.

How about another troubleshooting story? One evening after I had officially escaped from support, I was logged in from home just snooping on things. By running the VPN on a separate machine, I could get to everything on the corporate network without affecting my usual system. I'd usually run a Jabber client as well, so it wasn't too surprising when someone noticed and opened a chat with me.

One of my friends who was working that night was stuck on an "FTP download" ticket. He wasn't sure what else he could try, or what he should tell the customer. I looked, and the customer was reporting that it would take 30 seconds to download files which were less than 1 KB.

Meanwhile, my friend could connect to the machine and pull in 1 MB/sec from it. I asked if he had tried this from outside the corporate network, and he hadn't, so I did a test of my own from my house. It maxed out my relatively slow DSL inbound, so there was nothing obviously wrong there, either.

I was unable to duplicate it, so I hopped on the box and started looking for signs of insanity in the logs. I found what appeared to be our customer's IP address, and did the usual traceroute stuff. It went off into "* * *", but I ran it from my own connection and found that it was just because their ISP was doing some silly things with their routers. The routers were replying to the traceroutes from interfaces which were configured in RFC 1918 space, and while the company networks filtered that as the bogon it was, my home connection did not.
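For anyone who wants to spot this kind of thing in a list of traceroute hops, here is a minimal sketch in Python. It checks addresses against the three RFC 1918 ranges explicitly, since the standard library's `is_private` flag covers more than just those. The function names and the sample hop list are made up for illustration; the sample uses documentation addresses, not anything from the actual ticket.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address falls inside RFC 1918 private space."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in RFC1918_NETWORKS)


def private_hops(hops):
    """Pick out the hops that answered from private address space."""
    return [hop for hop in hops if is_rfc1918(hop)]


if __name__ == "__main__":
    # Hypothetical hop list (192.0.2.x is reserved for documentation).
    sample = ["192.0.2.1", "10.20.30.1", "172.16.5.9", "192.0.2.77"]
    print(private_hops(sample))
```

A network that filters bogons at its edge drops replies sourced from these ranges, which is one way a traceroute can show "* * *" from one vantage point and real hops from another.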

That wasn't the cause of the problem, but it did look weird. I was getting stuck, too. I figured we could tell him to make sure he could duplicate it from a completely different location, but kept poking at the box in parallel anyway.

About two minutes later, I got it: it was DNS. The customer was actively doing stuff via FTP as we were sniffing around, and I was able to strace the FTP daemon. What I saw gave me some idea of what was happening:

21:01:42.900370 connect(14, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("a.b.c.d")}, 28) = 0
21:01:42.900565 send(14, [...], 43, 0) = 43
21:01:42.900779 gettimeofday({1134266502, 900837}, NULL) = 0
21:01:42.900897 poll([{fd=14, events=POLLIN}], 1, 5000) = 0
21:01:47.910087 send(13, [...], 43, 0) = 43

It was trying to do a DNS query. "a.b.c.d" was the IP address of one of our caching nameservers, and port 53 makes that doubly obvious. The part I cut out with [...] is just a bunch of binary gunk which includes the tell-tale string "\7in-addr\4arpa". So, it's not just any DNS query, but a "reverse DNS" PTR lookup.
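The "\7in-addr\4arpa" bit shows up because DNS names go over the wire as length-prefixed labels: a byte with the label's length, then the label itself, ending with a zero byte. Here is a small Python sketch of that encoding for a reverse lookup name. The address used is a documentation placeholder, not the customer's, and the helper names are hypothetical.

```python
import ipaddress


def ptr_name(address: str) -> str:
    """Build the in-addr.arpa name used for a reverse (PTR) lookup."""
    return ipaddress.ip_address(address).reverse_pointer


def encode_qname(name: str) -> bytes:
    """Encode a DNS name as length-prefixed labels ending in a zero byte."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError(f"bad label length: {label!r}")
        out.append(len(raw))  # one length byte, e.g. 7 for "in-addr"
        out += raw
    out.append(0)  # the root label terminates the name
    return bytes(out)


if __name__ == "__main__":
    name = ptr_name("192.0.2.10")  # placeholder address
    print(name)
    print(encode_qname(name))
```

The length bytes 7 and 4 in front of "in-addr" and "arpa" are exactly what strace renders as "\7" and "\4".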

The FTP daemon issued one of these every time a new data (!) connection came up. It would fire off that query at our nameserver, wait 5 seconds, then time out and try again. After a few tries -- apparently six or so -- it would give up and just let the connection happen. Six tries at 5 seconds each comes to about 30 seconds, which lines up with what the customer was reporting. I did the in-addr query myself. It hung just the same.
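A rough way to repeat that manual check is to time a reverse lookup from a script. This is only a sketch: it goes through the system resolver via `socket.gethostbyaddr`, so the retry count and timeout are whatever the local resolver configuration says, not necessarily the 5 seconds seen in the strace. The `resolver` parameter and both function names are hypothetical conveniences so the timing logic can be exercised without touching the network.

```python
import socket
import time


def time_reverse_lookup(address, resolver=socket.gethostbyaddr):
    """Return (hostname or None, elapsed seconds) for one reverse lookup."""
    start = time.monotonic()
    try:
        hostname = resolver(address)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        hostname = None
    return hostname, time.monotonic() - start


def worst_case_delay(timeout_s, attempts):
    """Total blocking time if every single attempt times out."""
    return timeout_s * attempts


if __name__ == "__main__":
    # 5 s comes from the poll() call above; 6 attempts is the "six or so" estimate.
    print(worst_case_delay(5, 6))
    print(time_reverse_lookup("192.0.2.10"))  # placeholder address
```

A lookup that fails fast returns in milliseconds. One that sits there for tens of seconds points at a nameserver somewhere that never answers.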

It seems this person's ISP had both routers with horribly-configured interfaces and completely nonexistent DNS for their in-addr.arpa space. For whatever reason, it wasn't failing quickly with NXDOMAIN when his server did a query to our caching nameserver, so it would block.

I came up with a plan for my friend: turn off "UseReverseDNS" on the FTP server, reload or restart the daemon to make it pick up the new config, then get the customer to try again. It should Just Work after that.

I also tried to put a short-term no-restart-required hack in place by dropping the IP + hostname combo into /etc/hosts but nothing happened. It kept doing the laggy thing. I figured that was strange, but maybe it had some ridiculous resolver thing going on which ignored the usual "files, dns" scheme from host.conf and/or nsswitch.conf.

Still, I kept poking at it. Then it hit me: the FTP daemon was chrooted, and it was looking at a different /etc/hosts under another path. I dropped the same line into that file, and it started flying.
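The chroot trap is easy to check for on Linux, where /proc/<pid>/root is a symlink to the root directory a process actually sees. Here is a minimal sketch, assuming a Linux box and enough privileges to read that link (typically root, or the same user as the daemon). The function names and the example chroot path are hypothetical; the post doesn't say where this daemon's chroot lived.

```python
import os


def process_root(pid: int) -> str:
    """Linux only: return the root directory as seen by the given process."""
    return os.readlink(f"/proc/{pid}/root")


def hosts_path_in_root(root: str) -> str:
    """Path of the hosts file a process rooted at `root` will consult."""
    return os.path.join(root, "etc", "hosts")


def hosts_line(address: str, hostname: str) -> str:
    """Format one hosts-file entry: address, whitespace, name, newline."""
    return f"{address}\t{hostname}\n"


if __name__ == "__main__":
    # An unchrooted process sees "/", so its hosts file is plain /etc/hosts.
    print(hosts_path_in_root("/"))
    # Hypothetical chroot location, for illustration only.
    print(hosts_path_in_root("/var/ftp"))
```

If `process_root()` comes back with anything other than "/", edits to the system-wide /etc/hosts won't be visible to that process, and the entry has to go into the copy under its own root.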

I told my friend what I had done, and he started laughing at just what had happened. I guess he didn't expect me to be able to "change the engine while the car was running", so to speak, but I had. Since I had meddled with the machine, I added a private comment to his ticket documenting my changes, and left it to him to update the customer publicly.

Everyone was happy, and another problem was solved.

In retrospect, I bet a lot of people at that company have no idea how many tickets I wound up fixing beyond the ones which were officially "credited" to me.

Who are the hidden players in your organization?