Writing

Software, technology, sysadmin war stories, and more.

Sunday, June 24, 2012

Web audio error handling

Some "programming problems" aren't completely programming problems. They are actually user experience issues which can be discussed by reasonable people who are lucky enough to not be code wranglers. Naturally, one of the participants in the conversation has to be someone who understands the core technical issue and can relay it to the audience.

Here's one. Today, I started chasing down an intermittent hang in playback on my scanner page. The page itself is primarily driven by JavaScript code with a number of timers and other events firing off to do things. It fetches new calls, scrolls the screen, evicts old calls, and starts jPlayer to actually play the audio files.

Problem is, the Internet being what it is, sometimes those audio files aren't available. There are sufficient gremlins in the tubes to make it occasionally fail. It doesn't happen often, but when it does, users tend to notice, since they have to manually interact with the page to get things moving again. This is how I heard about it this morning: one of my users reported it.
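Why would one missing file freeze the whole page? Here's a hypothetical model of that kind of playback loop. It is not the scanner page's actual code, and `CallQueue`, `onEnded`, and `onError` are made-up names. The queue only starts the next call when the current one reports that it ended. A failed load fires an error instead of an "ended" event, so nothing ever advances.

```typescript
// Hypothetical model of a playback loop: the next call starts only when the
// current one reports "ended". A load error never produces "ended".
class CallQueue {
  private pending: string[] = [];
  private current: string | null = null;
  readonly started: string[] = []; // stands in for "told the player to play X"

  enqueue(url: string): void {
    this.pending.push(url);
    if (this.current === null) this.advance();
  }

  /** Wired to the player's "ended" event. */
  onEnded(): void {
    this.current = null;
    this.advance();
  }

  /** Wired to the player's "error" event, with no recovery logic. */
  onError(): void {
    // Nothing clears `current` here, so advance() is never reached again.
  }

  private advance(): void {
    const next = this.pending.shift();
    if (next !== undefined) {
      this.current = next;
      this.started.push(next);
    }
  }
}
```

The loop only moves forward on success. After an error, `current` stays set forever and new calls just pile up in `pending` until something outside the loop, like a user click, kicks it.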

Upon further inspection, I determined that it can happen when the request for an audio file comes back with an HTTP error. Due to the way I have this thing rigged up, those requests can fail because of network anomalies which are beyond my control. These anomalies tend to be short-lived, such that a retry will resolve them.

This is where it becomes conversation fodder for normal people. I can present my problem thusly: I have people expecting things to keep working, and a retry can solve it. Trouble is, too many retries is just wasteful, and if enough clients get into a loop like this, it'll beat my server to death.

Talking through it with one such non-programmer (a family member, not my user) eventually led to a possible strategy. If it fails, retry right away. It almost always works on the second try. If that fails, then wait a little while, and try it again. If that fails too, wait some more, and try it yet again. If this fails, give up. This makes the common case almost invisible since it's an immediate retry (and usually succeeds), while also giving enough of a time-spread to allow it to recover within 15-20 seconds.
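Here's a minimal sketch of that schedule. The page itself is JavaScript, and the TypeScript types are only there to make the shape clearer. The delay values (0 s, then 5 s, then 10 s) are assumptions chosen to land inside the 15-20 second window. `RetryController`, `play`, and `skip` are hypothetical names. In practice they'd be wired to the player's error event and to whatever starts the next call. The scheduler is injectable so the logic can be exercised without real timers.

```typescript
// Retry policy: retry at once, then back off, then give up.
// The delay values are assumptions chosen to finish within ~15 seconds.
const RETRY_DELAYS_MS: readonly number[] = [0, 5000, 10000];

/** Delay before retry number `retry` (0-based), or null when it's time to give up. */
function nextRetryDelay(retry: number): number | null {
  return retry < RETRY_DELAYS_MS.length ? RETRY_DELAYS_MS[retry] : null;
}

type Scheduler = (fn: () => void, ms: number) => void;

class RetryController {
  private retries = 0;
  private readonly play: (url: string) => void;
  private readonly skip: (url: string) => void;
  private readonly schedule: Scheduler;

  /**
   * play:     (re)start playback of a URL (hypothetical hook into the player)
   * skip:     abandon this call and move on to the next one (hypothetical)
   * schedule: defaults to setTimeout; injectable for testing
   */
  constructor(
    play: (url: string) => void,
    skip: (url: string) => void,
    schedule: Scheduler = (fn, ms) => { setTimeout(fn, ms); },
  ) {
    this.play = play;
    this.skip = skip;
    this.schedule = schedule;
  }

  /** Call from the player's error handler. */
  onError(url: string): void {
    const delay = nextRetryDelay(this.retries);
    if (delay === null) {
      this.retries = 0; // start fresh for the next call
      this.skip(url);   // out of retries: give up on this one
      return;
    }
    this.retries += 1;
    this.schedule(() => this.play(url), delay);
  }

  /** Call once playback actually starts, so the next failure begins at "retry now". */
  onSuccess(): void {
    this.retries = 0;
  }
}
```

Keeping the schedule in a table rather than in nested timeouts makes the worst case easy to read off. Each call gets at most three extra requests, spread over about 15 seconds, which bounds how hard a crowd of unlucky clients can hit the server.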

With that decided, I set off to implement it. It turns out that jPlayer provides a nice event which can be hooked to catch errors. I noticed a problem, though: the 5xx errors coming from these transient failures look no different from a 4xx error. Granted, I never get 404s, because all of the requests are for files which actually exist, but it could happen some day.

This is when I started applying some "would be nice" logic and started digging to see if I could tell them apart. It turns out that's not possible. The HTML5 player element used by jPlayer sets a generic MEDIA_ERR_NETWORK for 4xx, 5xx, and some other failures and doesn't do me the favor of storing the actual HTTP response code anywhere. This isn't jPlayer's fault, incidentally. It's just how the player spec works.
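To make the information loss concrete, here's a sketch of what an error handler actually gets to work with. The numeric constants are the `MediaError` codes from the HTML spec. They're copied in so this runs outside a browser; in a page you'd compare `audio.error.code` against `MediaError.MEDIA_ERR_NETWORK` and friends. The retry/skip/ignore policy is a hypothetical one. Which code a browser reports for a failed HTTP fetch may depend on the browser and on how far the load got.

```typescript
// MediaError codes as defined by the HTML spec, copied so this runs outside a browser.
const MEDIA_ERR_ABORTED = 1;
const MEDIA_ERR_NETWORK = 2;
const MEDIA_ERR_DECODE = 3;
const MEDIA_ERR_SRC_NOT_SUPPORTED = 4;

type ErrorAction = "retry" | "skip" | "ignore";

/**
 * Hypothetical policy built on the only thing the element exposes: a small
 * integer. There is no HTTP status here, so a 404 and a 503 that both arrive
 * as MEDIA_ERR_NETWORK necessarily take the same branch.
 */
function actionForMediaError(code: number): ErrorAction {
  switch (code) {
    case MEDIA_ERR_NETWORK:
      return "retry"; // could be a transient 5xx... or a permanent 4xx
    case MEDIA_ERR_DECODE:
    case MEDIA_ERR_SRC_NOT_SUPPORTED:
      return "skip";  // the bytes arrived but can't be played
    case MEDIA_ERR_ABORTED:
    default:
      return "ignore"; // fetch was aborted, or a code this sketch doesn't know
  }
}
```

Whatever policy goes in the switch, the status code that would separate "try again" from "this file is gone" never reaches it.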

I would have liked it to automatically detect a 404 as a broken call and skip over it. Unfortunately, all it can do is retry pointlessly and ultimately give up. This is less than ideal, but I'll live with it for now.

One thing I do control is what my server returns. I could theoretically rig it to never return 404s, no matter what you fetch. Anything which would throw a 404 could be internally shunted to a zero-length placeholder audio file which would succeed. This would allow the client to move past that call and keep on going with others which (hopefully) will actually exist.
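Here's one way that server-side fallback might look, as a sketch. `AUDIO_ROOT`, `PLACEHOLDER`, and `fileToServe` are hypothetical names and paths. `exists` is passed in so the logic can be tested on its own; in a real Node server it could be `fs.existsSync`. Names that look like path tricks are treated as missing rather than resolved.

```typescript
// Hypothetical locations: where recorded calls live, and the placeholder
// clip handed out instead of a 404.
const AUDIO_ROOT = "/srv/scanner/calls";
const PLACEHOLDER = "/srv/scanner/placeholder.mp3";

// Plain file names only: no slashes, no leading dot.
const SAFE_NAME = /^[A-Za-z0-9_-][A-Za-z0-9._-]*$/;

/**
 * Map a requested file name to the file that should actually be sent.
 * Anything that would have produced a 404 gets the placeholder instead.
 */
function fileToServe(name: string, exists: (p: string) => boolean): string {
  // Refuse anything that could escape AUDIO_ROOT; treat it like a missing file.
  if (!SAFE_NAME.test(name) || name.includes("..")) {
    return PLACEHOLDER;
  }
  const candidate = `${AUDIO_ROOT}/${name}`;
  return exists(candidate) ? candidate : PLACEHOLDER;
}
```

The caller would then send whichever file comes back with a 200 either way, which is what lets the client move past the bad call.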

I'd say the moral of the story is that even non-technical people can be valuable when certain issues come up. Looping them in can provide useful insights, and it also keeps them from feeling excluded when you have to go off and poke at the computer for a while. It's no longer a black box.

Engineers who like to talk their way through problems actually exist. Who knew?