Thursday, May 3, 2012

30 second delays should have been a 30 second fix

Toward the end of my tenure in web hosting support, a bunch of us started getting bitter about the caliber of people who were being hired. There were more and more people who were supposedly actual techs but in reality were little more than human interfaces to the ticketing system. They'd answer a call, take a few notes, and open a ticket asking for more details rather than doing anything about it.

One frequent customer complaint was that doing something which involved connecting to their server was slow. The majority of these involved e-mail. Usually, our customer was someone who leased a server and resold access to other people for "vhosting" - web and e-mail hosting. One of those users would have an issue, it would go to our customer, and then it would come to us.

So imagine the situation where someone is trying to send mail, and it sits there for 30 seconds before it actually does anything. Users with e-mail clients that expose debugging information can keep tabs on this. It might open the connection and then block for exactly 30 seconds before proceeding. A tech who was paying attention would notice this number coming up again and again and might realize there's something more to it.
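Measured from the affected user's machine, that pattern is easy to see even without a client that shows debug output. Here's a rough sketch, assuming bash (for its /dev/tcp trick) and a made-up server name:

    # Time how long it takes for the SMTP banner (the first line the
    # server sends) to arrive after the TCP connection opens.
    time head -1 < /dev/tcp/mail.example.com/25

On a healthy setup that returns almost instantly. In this scenario, it sits there for the magic 30 seconds and then spits out the banner.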

Even if they didn't immediately recognize the problem from having encountered it before, there are plenty of things which can be done to further establish what's going on. First, they can try to connect to the mail server themselves to run a test. Granted, this means they need to know how to speak SMTP, and good luck finding that.
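"Speaking SMTP" by hand is not much more than this. The hostnames here are placeholders and the exact responses vary by server, but the shape is always the same:

    $ telnet mail.example.com 25
    Trying 192.0.2.10...
    Connected to mail.example.com.
    Escape character is '^]'.
    220 mail.example.com ESMTP
    HELO tech.example.com
    250 mail.example.com
    QUIT
    221 mail.example.com
    Connection closed by foreign host.

The part that matters for this bug is simply how long it takes that 220 banner to show up after the connection opens.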

In this case, the tech's test would be fine. There would be no delay. So next, they should start thinking about what seems to be limiting this to just that user. One approach would be to dig around in netstat, find the process which is talking to that user, and then attach to it with strace while it's stuck. They'd probably see a "connect(..." sitting there which would eventually return ETIMEDOUT or similar. As soon as that cleared, the user's mail would go through.
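Concretely, that hunt might look something like this. The IP address and PID are invented, and the exact strace output will differ, but the shape of it is what counts:

    # Find the SMTP process handling the stuck user (203.0.113.7 here).
    netstat -tnp | grep 203.0.113.7

    # Attach to that PID and watch only network-related syscalls.
    strace -p 12345 -e trace=network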

This is a subtle but important hint. Clearly, it's trying to connect outward to something, but what? Depending on how lucky they were with that strace, it might have included the important details like the port number. If not, then they have to get clever. Now it's time to find the parent process and run strace on that, and then track all of the forked children to see what they're up to. This will yield enough context to see which port it's trying to reach.
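Following the children is just a flag away. This assumes the listener is xinetd, which it usually was on these boxes; swap in whatever the parent actually is:

    # Trace the listener and everything it forks, but only connect()
    # calls, so the destination address and port stand out.
    strace -f -e trace=connect -p "$(pgrep -o xinetd)"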

Or, they could just go straight to the big guns and run tcpdump with a host filter for the user's IP address. With that view, it becomes quite clear that the server is trying to call back to the user's port 113, and it's not going anywhere. A bunch of SYNs are sent, and nothing is received. When it finally gives up, things start going through.
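Again with a made-up user address, that looks something like:

    # Everything to or from the user's IP. A string of unanswered SYNs
    # from our server to their port 113 (auth/ident) is the giveaway.
    tcpdump -n -i eth0 host 203.0.113.7

    # Or cut straight to the chase:
    tcpdump -n -i eth0 host 203.0.113.7 and tcp port 113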

Now we're getting somewhere. Actually understanding what's going on would probably require some old-school experience on IRC, where servers still do the "ident" dance: calling back to the connecting client's port 113 to ask which local user owns the connection. After all, who runs identd these days? Those without that experience probably have no idea what the port is all about.

What happens next depends on how they like to roll. They might go and search the web to look for some combination of "qmail", "plesk" and "port 113", and they might eventually turn up a magic flag called -Rt0. Assuming they deploy it correctly, then the problem would suddenly disappear.
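On those Plesk boxes, xinetd handed incoming SMTP connections to qmail by way of tcp-env, and tcp-env is the thing doing the ident lookup: -R tells it not to bother, and -t0 zeroes out the timeout, which otherwise defaults to 30 seconds - hence the magic number. The file name and the rest of the arguments varied from version to version, so treat this as a sketch rather than a verbatim config:

    # /etc/xinetd.d/smtp_psa (or similar; only the -Rt0 prepended to
    # server_args matters here - leave the rest of the line alone)
    service smtp
    {
            socket_type = stream
            protocol    = tcp
            wait        = no
            user        = root
            server      = /var/qmail/bin/tcp-env
            server_args = -Rt0 <whatever was already there>
    }

Reload xinetd after the change and the banner comes back instantly.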

Someone with a different view of the world might try to "solve" the problem at the source by forcing outgoing TCP connections to port 113 to fail with some iptables magic. That one is a little more sneaky since it's harder for other techs to find, and it's easier to mess up later.
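For completeness, that version of the "fix" is a one-liner. It makes the server's own ident callbacks fail fast instead of hanging:

    # Reject (not drop!) outbound ident lookups so the connect() gets a
    # RST immediately instead of timing out after 30 seconds.
    iptables -I OUTPUT -p tcp --dport 113 -j REJECT --reject-with tcp-reset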

Ideally, the next time such a situation came up with the magic 30 second value, that tech would just jump on the box, drop -Rt0 on there, and then check back to see if that fixed it. It's a dumb problem with a dumb solution, and the only way to deal with it quickly is to figure it out once and learn from it.

Of course, what was happening was that the company had started hiring people who would take the call and then give up. Instead of trying these things to collect some information (and hopefully discovering the problem in the process), they would just open a ticket to log the call. Then they'd ask the user to "send in a traceroute to the mail server", set it to "require feedback", and go on with life.

Basically, they had found a way to throw the ticket into the future by putting it back on the user to do some nontrivial task. If they were "lucky", the ticket wouldn't come back with a response until after their shift. If that happened, then they wouldn't have to actually work on it! In their mind, they just "won".

Of course, that "win" for them was a loss for the user, who had to wait an unreasonable amount of time to get their problem solved. It was also a loss for the other techs who had to clean up their mess.

All of this happened because a mail server shipped with poor defaults which clashed with user firewalls which (quite reasonably) dropped packets to port 113. Having that sort of check in the critical path for providing service was the real problem. Also, having that sort of tech in the critical path for providing customer service was another real problem.

Such is life in the trenches of tech support.