Writing

Software, technology, sysadmin war stories, and more.
Thursday, October 11, 2012

Web hosting support customers, money, and clues

There are a bunch of customer types you encounter at web hosting companies. Some pay a lot of money, and some pay as little as possible. Some of them are flat-out Unix (or Windows, I suppose) genius sysadmins, and others probably shouldn't be allowed anywhere near a keyboard (or mouse).

If you thought the people who paid the most money were the most clueful, you'd probably be wrong. More often than not, the scrappy little operations with just a single box had someone who knew how to do everything and only called us when something that only we could do had to happen: physically swapping parts, assigning new IP addresses, and so on.

The big firms tended to have a bunch of nice and polite people who really did not know a whole lot about this stuff. They'd usually "know enough to be dangerous", but that was it. Since they were deemed to be Very Important customers, they were routed to a special subset of techs. Other techs were not to touch those machines under normal conditions.

I wound up being one of the people who had been hit with the magic wand which said I could handle the biggest accounts. Due to my odd schedule, this meant I was frequently the only one available to take such calls, and I'd get on these hour-plus marathons while they sorted something out.

One night, a company with a nerd-tastic name from the '70s (but which actually wasn't that company, having merely bought the name and IP from the now-dead company) was stuck with some problem on their web site. They'd start hitting it and it would be okay at first, but then it would just bog down horribly. They figured it was our load balancer, because it's always the web hosting company's fault, right?

With these two guys on the phone, I proceeded to hit their web servers directly. That is, instead of hitting www.$nerd.com, I hit www1.$nerd.com and www2.$nerd.com. The same thing happened. Right there, I could establish that it wasn't the load balancer. They accepted this and so we proceeded to "work the problem".

I also demonstrated that merely hitting the server from itself over loopback with lynx (or links, same basic idea) would hang up in the same places. That further implicated something happening on the web server, or perhaps something it relied on. It probably wasn't our infrastructure, or at least not the external side of their platform.
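For anyone who wants to repeat that comparison without a phone bridge, here is a rough sketch of the idea in Python. It is not what I ran that night (that was just a browser and lynx), and the function names are made up for illustration. It fetches each URL a few times and keeps the worst time, so the balanced name, the individual web servers, and loopback can be lined up side by side.

```python
import time
import urllib.request

# Ignore any HTTP proxy settings so each request goes straight to the named host.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def time_fetch(url, timeout=30.0):
    """Fetch url once and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    with _DIRECT.open(url, timeout=timeout) as resp:
        resp.read()  # pull the whole body so a mid-response stall is counted
    return time.monotonic() - start


def slowest_per_target(urls, attempts=5, timeout=30.0):
    """Fetch every URL `attempts` times and keep the worst time for each."""
    return {
        url: max(time_fetch(url, timeout) for _ in range(attempts))
        for url in urls
    }
```

Feed slowest_per_target() the load-balanced URL, the URL of each web server behind it, and (when run on the box itself) a loopback URL. If they all bog down the same way, the load balancer is probably off the hook.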

While doing this, I realized that a nontrivial part of their site was actually served by some Apache module which was snagging parts of their namespace and was turning those hits into connections to JBoss. Basically, it would effectively proxy them through to localhost:8080 or something like that, and thus JBoss was responsible for making things go.

Now I had something more to use for troubleshooting. I opened a connection to port 80 with netcat and found it in netstat. Then I hooked strace to that httpd child and pasted in a query which was known to hop into JBoss. What I saw was fairly interesting: it was sitting there in connect() to localhost:8080. It would pause for a fairly long time, and then it would "grab on" and start working.
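strace shows you exactly which system call a process is stuck in, and nothing in Python replaces that. Still, a crude client-side approximation can at least split the wait into "time to connect" and "time until the first byte comes back". The sketch below is a hypothetical helper along those lines, not the tooling from the story. Whether a stall shows up in the first number or the second depends on how the server deals with connections it isn't ready for.

```python
import socket
import time


def probe(host, port, request, timeout=60.0):
    """Return (connect_seconds, first_byte_seconds) for one TCP exchange."""
    start = time.monotonic()
    sock = socket.create_connection((host, port), timeout=timeout)
    connected = time.monotonic()
    try:
        sock.sendall(request)
        sock.recv(1)  # block until the server sends anything (or hangs up)
        first_byte = time.monotonic()
    finally:
        sock.close()
    return connected - start, first_byte - connected
```

Pointing it at the backend port from the box itself (localhost:8080 in this story) with a plain HTTP request as the payload gives two numbers to compare against the same probe on port 80.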

We didn't officially support JBoss, but for this tier of customer you tended to avoid saying that outright, so I went looking. That's when I found it buried in a config file: they had the thing set to a maximum of 10 concurrent connections. Apache may have been set to handle hundreds of them, but JBoss wasn't. I demonstrated they had a problem by having them connect to JBoss directly via localhost and showing how it would block until the other connections cleared out. Then I told them about the limit.
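The effect of a cap like that is easy to reproduce in miniature. The following is a toy simulation, assuming nothing about JBoss itself: the class and function names are invented, and a semaphore stands in for the connection limit. Everyone past the limit just sits and waits for a slot, which is the "pause, then grab on" behavior from the front end's point of view.

```python
import threading
import time


class LimitedBackend:
    """Stand-in for a backend that serves at most max_concurrent requests."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def handle(self, work_seconds):
        """Wait for a free slot, do the work, return seconds spent waiting."""
        queued = time.monotonic()
        with self._slots:
            waited = time.monotonic() - queued
            time.sleep(work_seconds)  # pretend to build the page
        return waited


def run_clients(backend, clients, work_seconds):
    """Fire `clients` simultaneous requests and return their sorted wait times."""
    waits = [0.0] * clients

    def client(i):
        waits[i] = backend.handle(work_seconds)

    threads = [threading.Thread(target=client, args=(i,)) for i in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(waits)
```

With a limit of 10 and a few hundred front-end workers all wanting in, the first 10 callers get through immediately and the rest queue up behind them, no matter how many connections the front end is willing to accept.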

"Of course!" they said, and they tweaked it. Then all was well. They thanked me and let me off the phone at last.