Software, technology, sysadmin war stories, and more.
Thursday, June 2, 2011

Broken web mail uncovers quite a mess

One fine day, a customer called in and reported their web mail was no longer working. It had been fine the night before, but then today it was just broken. They swore they hadn't changed anything on the box, and indeed, nobody had logged in via ssh, so what happened? This is my tale of figuring it out and dealing with it.

This one came to me as an escalation. Some web hosting customer running Plesk (a web-based control panel for vhosters) was having trouble with their machine. We supported the machine, the OS, and the control panel, so it was up to us to figure it out. The report was that the web mail would just hang after typing in a user name and password. I had to start from that and work backwards to a cause, then come up with a cure.

I looked up a valid username/password combo in their database and tried it. Sure enough, it was doing exactly what had been described. After clicking the [login] button in my browser, it just sat there seemingly forever. Something was going on, but it wasn't anything productive. It also seemed to affect multiple accounts, including a brand new one I created in the control panel just for this test.

Okay, this was something. The good news was that since the web request was hanging, it was trivial to find the socket in netstat, then find the Apache child which was servicing it, and attach to it with strace. What I found was a little disturbing (this is from memory, since it's been several years, but...):

select(1026, [3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 ...

This crazy process was trying to select on tons of file descriptors, and the first argument was 1026. 1026. That's two higher than 1024, and 1024 is one of those magic powers-of-two numbers. Since that first argument is the highest descriptor number plus one, something up at 1025 was in the set. Hmmm. Oh dear. My own experience with writing code which used select had introduced me to this particular catch from the man page:

An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior. Moreover, POSIX requires fd to be a valid file descriptor.

The machine in question had FD_SETSIZE equal to 1024, and it had been baked into all of the binaries as a result. It was now trying to read past the end of that array, and not surprisingly, it wasn't finding anything.
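To make the man page warning concrete, here is a toy model of an fd_set as a fixed bitmap. It is only a sketch: the helper names (new_fd_set, fd_set, fd_isset) are made up for illustration, the byte-per-eight-descriptors layout is a simplification of what a real libc does, and the bounds check is something the real C FD_SET() macro does not have.

```python
FD_SETSIZE = 1024  # the value the article reports for this machine


def new_fd_set():
    """Return an empty set: one bit per descriptor, FD_SETSIZE bits in total."""
    return bytearray(FD_SETSIZE // 8)


def fd_set(fd, fdset):
    """Mark fd in the set, refusing anything the buffer cannot hold."""
    if not 0 <= fd < FD_SETSIZE:
        # C's FD_SET() performs no such check; this is where the
        # undefined behavior from the man page would begin.
        raise ValueError(f"fd {fd} does not fit in a {FD_SETSIZE}-bit fd_set")
    fdset[fd // 8] |= 1 << (fd % 8)


def fd_isset(fd, fdset):
    """Report whether fd is marked; out-of-range descriptors are never seen."""
    if not 0 <= fd < FD_SETSIZE:
        return False
    return bool(fdset[fd // 8] & (1 << (fd % 8)))


readfds = new_fd_set()
fd_set(1023, readfds)            # the last descriptor that fits
print(fd_isset(1023, readfds))   # True
try:
    fd_set(1025, readfds)        # a descriptor past the end of the buffer
except ValueError as exc:
    print(exc)
```

The buffer is 128 bytes long, so bit 1025 simply has nowhere to live. The model raises an error at that point; the compiled C code had no such guard.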

This was causing the webmail's IMAP login code to hang, since it was waiting for an IMAP server banner which would never arrive. The newly-created TCP connection to localhost port 143 was up there on some fd above 1023 and would never be noticed.

The workaround was annoying but it got the job done for the short term: we reset FD_SETSIZE to some higher number and then recompiled the entire stack: the IMAP client library, PHP itself, OpenSSL, Apache, and so on. Obviously, this was bound to break when any upgrade came along, but it was all we could do on short notice.

The only remaining question was why it broke right then. That part became obvious later on: the PHP IMAP client runs with so many file descriptors in its environment because it inherits them from the Apache child in which it runs. That Apache child, meanwhile, has one file descriptor for every single log file it has open, plus a few other things for housekeeping, plus the network.

At some point the previous night, our customer had logged into their control panel and added another virtual host web site. This created the usual flurry of directives in the bigger Apache configuration, plus ... four new log files. Each one of them got a file descriptor which would be carted around in every httpd child, and that pushed it off the cliff at 1023. That's how they were able to "break it" without ever using root through traditional means.
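The arithmetic behind that cliff can be sketched roughly like this. The four log files per vhost and the 1024 limit come from the story; the baseline of 30 housekeeping descriptors is a made-up number purely for illustration, and the function names are hypothetical. It also assumes the kernel hands out the lowest free descriptor number, so with N descriptors already open the next socket lands on fd N.

```python
def max_selectable_vhosts(base_fds, logs_per_vhost=4, fd_setsize=1024):
    """Largest vhost count for which the next descriptor opened still fits.

    With base_fds + vhosts * logs_per_vhost descriptors already open, the
    next one gets that same number, and it must stay below fd_setsize.
    """
    return (fd_setsize - 1 - base_fds) // logs_per_vhost


def next_fd(base_fds, vhosts, logs_per_vhost=4):
    """Descriptor number the next socket would get (lowest free number)."""
    return base_fds + vhosts * logs_per_vhost


BASE_FDS = 30  # hypothetical: housekeeping plus network descriptors
limit = max_selectable_vhosts(BASE_FDS)
print(limit, next_fd(BASE_FDS, limit))          # 248 1022
print(limit + 1, next_fd(BASE_FDS, limit + 1))  # 249 1026
```

Under those assumptions, one extra vhost is all it takes to move the IMAP socket from a descriptor select can see to one it cannot.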

There's another fun issue which this exposed, too. It meant that arbitrary client code running in PHP essentially had full access to all of the web site log files on the box. It took a little wrangling, but you could grovel around in /proc/self/fd and find the whole set. It wouldn't be terribly difficult to ftruncate() the logs and cover any tracks if you broke into such a machine that way. Normally, PHP exploits would be discovered through the logs (owned by httpd and thus not directly writable by a mere user), but this would knock that out nicely.
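For a sense of how visible those inherited descriptors are, here is a small read-only sketch that lists what the current process has open. It assumes a Linux-style /proc, and list_open_fds is a name invented for this example. Run from inside a process that inherited the log descriptors, a listing like this would show where each one points.

```python
import os


def list_open_fds(fd_dir="/proc/self/fd"):
    """Map each open descriptor number of this process to what it points at."""
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor used to read the directory itself is
            # already closed by the time we look at it, so skip it.
            continue
    return targets


if os.path.isdir("/proc/self/fd"):
    for fd, target in sorted(list_open_fds().items()):
        print(fd, target)
```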

This was a long time ago. Hopefully Apache isn't so loose about its logs these days.