Writing

Software, technology, sysadmin war stories, and more.

Sunday, November 13, 2011

Blocking important threads like UI? Do you hate your users?

Using some software gives me the impression that developers don't always pay attention to what they're doing in their various threads. Sometimes it seems like they allow a program to go off and do something rather costly in such a way that it hangs the user interface or totally messes up the network stack.

Try this some time if you use Adium (Mac chat client): after talking with someone for a while, hit command-P. Odds are, the entire thing will freeze on you while it builds the print dialog. It's not clear exactly what the major malfunction here is. It might be doing the costly work in the UI thread, which would be the obvious explanation.
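Here's a rough sketch of the usual way out of the first case: hand the expensive job to a worker thread and let the UI loop keep ticking until the result shows up. None of this is Adium's actual code. The names build_print_preview and run_ui_loop are made up, and the sleep is a stand-in for whatever the real work is.

```python
import queue
import threading
import time

def build_print_preview(pages):
    """Hypothetical expensive job; the sleep stands in for real work."""
    time.sleep(0.2)
    return f"preview of {pages} pages"

def run_ui_loop(pages):
    """Toy UI loop that keeps 'repainting' while a worker does the slow part."""
    results = queue.Queue()
    threading.Thread(
        target=lambda: results.put(build_print_preview(pages)),
        daemon=True,
    ).start()

    frames = 0
    while True:
        frames += 1  # stand-in for handling events and repainting
        try:
            # Poll without blocking so the loop never stalls on the worker.
            return results.get_nowait(), frames
        except queue.Empty:
            time.sleep(0.01)  # idle until the next frame

if __name__ == "__main__":
    preview, frames = run_ui_loop(3)
    print(preview, "after", frames, "frames")
```

The frame count is the point: the loop gets many turns while the preview is being built, instead of zero.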

Or, it's entirely possible that the print dialog creation process grabs some kind of lock (mutex) which is also required by the UI thread for normal operation. In that case, even though the work runs on another thread and you might think you're in the clear, you're not: the UI thread just stalls waiting for the lock instead.
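A minimal sketch of how to dodge the lock version of this problem, assuming the slow job only needs a consistent copy of the data: take the lock just long enough to copy, do the expensive part with no lock held, then take it again briefly to store the result. The Document class and its methods are invented for illustration.

```python
import threading
import time

class Document:
    """Hypothetical shared state touched by both the UI and a print job."""

    def __init__(self):
        self._lock = threading.Lock()
        self._lines = ["hello", "world"]
        self.rendered = None

    def snapshot(self):
        with self._lock:  # held only long enough to copy the list
            return list(self._lines)

    def append(self, line):
        """What the UI thread does on every keystroke or incoming message."""
        with self._lock:
            self._lines.append(line)

    def render_for_print(self):
        lines = self.snapshot()
        time.sleep(0.2)  # expensive work happens with no lock held
        text = "\n".join(lines)
        with self._lock:  # brief second hold to publish the result
            self.rendered = text
        return text
```

If render_for_print held the lock across the sleep, every append from the UI side would wait out the full 0.2 seconds. With the copy-then-release approach it only ever waits for a list copy.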

This is not to single out Adium. I've seen the same things happen in all sorts of software. How many times have you seen some graphical program in X go AWOL and stop repainting its screen? You can just sort of push it around and it becomes a plain white square. More often than not, it'll pick up the "X: 123 Y: 456" gunk as you move it around.

You might think that people who write server software would be better about this since they don't have to service mouse clicks and repaints. I used to think that. Then, recently, I saw some horrible things. I saw a network server which used a complicated RPC system in userspace. This RPC system relied on getting a few CPU cycles every now and then so it could maintain its connections and health checks.

The server in question apparently had this RPC system shoehorned into it, and managed to have it in-line with some rather expensive operations. In particular, there was a place where adding or removing a storage node would cause a write back to a directory service. That service tended to rate-limit writes in a half-hearted attempt to limit abuse. Unfortunately, the client handled this by blocking, and everything else behind it got stuck as a result.
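For what it's worth, here is a sketch of the shape I would have expected instead: the connection upkeep runs on its own thread, so a blocking write somewhere else can't starve it. This is not the server's real code. slow_directory_write, serve, and the timings are all made up, and the sleep stands in for a rate-limited write.

```python
import threading
import time

def slow_directory_write(node):
    """Hypothetical rate-limited write that blocks its caller."""
    time.sleep(0.3)
    return f"registered {node}"

def serve(node, heartbeat_interval=0.05):
    """Do the slow write while a separate thread keeps the heartbeat going."""
    beats = []
    stop = threading.Event()

    def heartbeat():
        # Dedicated thread: nothing in the request path can stall it.
        while not stop.is_set():
            beats.append(time.monotonic())
            stop.wait(heartbeat_interval)

    hb = threading.Thread(target=heartbeat, daemon=True)
    hb.start()
    try:
        result = slow_directory_write(node)  # blocks only this thread
    finally:
        stop.set()
        hb.join()
    return result, beats

if __name__ == "__main__":
    result, beats = serve("node-7")
    print(result, "with", len(beats), "heartbeats sent meanwhile")
```

The write still takes as long as it takes, but the heartbeats keep going out the whole time, which is all the health checker on the other end cares about.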

If you had opened an RPC connection to this server, you could actually watch the TCP connection underneath drop out and restart every time this happened. The server would go catatonic for 30-40 seconds at a time and would fail health checks, causing the client to restart the underlying connection. It was ridiculous.

Naturally, this caused actual RPC requests to fail. Normally, the RPC system itself would handle an intermittent problem within the allowable deadline, but this was far longer than it could stomach. Instead, my code had to forcibly loop back and try a few times until it finally worked.
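The workaround looked roughly like the sketch below. It is a reconstruction of the idea, not the code I actually ran: TransientRPCError and call_with_retries are invented names, and the exponential backoff is my assumption about what a sane version would do. To ride out a 30-40 second stall, the attempts and delays would need to be sized to cover that window.

```python
import time

class TransientRPCError(Exception):
    """Hypothetical error raised when the connection drops mid-call."""

def call_with_retries(rpc, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call rpc() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            return rpc()
        except TransientRPCError as err:
            last_error = err
            if attempt < attempts - 1:
                # Back off a little longer each time before trying again.
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The sleep parameter is there so the delay can be swapped out in tests. Anything that isn't a TransientRPCError goes straight up to the caller, since retrying a real failure just hides it.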

The worst part was reporting this to the developers.

"Yeah, it does that".

I hate that response. It always makes me wonder: don't you use your own software? How can you live with this?