
Thursday, December 29, 2011

Hash table attacks and overzealous analytics

There's a hash table attack making the rounds this week. It involves what happens inside a bunch of web frameworks and other libraries which handle incoming GET and POST requests for you.

There's this assumption that people really do not want to scan through a blob of "?foo=bar&blah=123&launch=true" gunk by hand. This leads to parsers which split those up and turn them into keys of "foo", "blah", and "launch", each pointing at "bar", "123", and "true", respectively.
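Roughly speaking, that parsing boils down to something like this. The class and method names are made up for illustration, and real frameworks also URL-decode each piece and cope with repeated keys, but the shape is the same: split on "&" and "=", then stuff the pairs into a hash map.

    import java.util.HashMap;
    import java.util.Map;

    public class QueryParse {
        // Split "foo=bar&blah=123&launch=true" into a map keyed by
        // parameter name. Illustrative only: no URL-decoding, no
        // handling of repeated keys.
        static Map<String, String> parseQuery(String query) {
            Map<String, String> params = new HashMap<>();
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    params.put(pair, "");
                } else {
                    params.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
            return params;
        }

        public static void main(String[] args) {
            // Prints something like {foo=bar, blah=123, launch=true};
            // iteration order of a HashMap is not guaranteed.
            System.out.println(parseQuery("foo=bar&blah=123&launch=true"));
        }
    }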

The problem is that those keys are under the control of whoever is sending the request. If they can guess a few things about how you store them, then they can purposely generate the kind of pathological case which would cause your system to handle them really inefficiently.
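To make "guess a few things" concrete, here's one classic trick against Java's String.hashCode(). It isn't specific to any particular framework, and newer JDKs blunt it by turning long bucket chains into trees, but the shape of the attack is the point: "Aa" and "BB" hash to the same value, so every string built out of those two blocks collides with every other one, and a bucket-chained map ends up doing O(n) work per insert on them.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class CollisionKeys {
        // Build 2^blocks strings that all share one String.hashCode()
        // value, by choosing "Aa" or "BB" at every position.
        static List<String> collidingKeys(int blocks) {
            List<String> keys = new ArrayList<>();
            keys.add("");
            for (int i = 0; i < blocks; i++) {
                List<String> next = new ArrayList<>();
                for (String k : keys) {
                    next.add(k + "Aa");
                    next.add(k + "BB");
                }
                keys = next;
            }
            return keys;
        }

        public static void main(String[] args) {
            List<String> keys = collidingKeys(12);   // 4096 keys
            Set<Integer> hashes = new HashSet<>();
            for (String k : keys) {
                hashes.add(k.hashCode());
            }
            // Prints: 4096 keys, 1 distinct hash value(s)
            System.out.println(keys.size() + " keys, "
                    + hashes.size() + " distinct hash value(s)");
        }
    }

Send a POST whose parameter names are a big pile of those and the poor parser spends its time walking one enormous bucket instead of serving anyone else.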

What I find really funny is that I saw an attack like this a few years ago which also involved key storage. What was different was that the programmers brought it upon themselves. For some reason, they decided to write something which would track hits to every URL endpoint within their custom servlet engine. Did I mention it was Java?

Normally, this was no big deal. You could connect to the program's embedded HTTP server over the intranet and ask it what was going on. It might say that "/login" had 123,456 hits, and "/logout" had 4,567 hits. I imagine someone actually found that useful.

Where they went wrong is that they stored this stuff forever, and they didn't store these URLs according to their handlers. Basically, there were only so many operable endpoints within any given app setup. You might have /login, /logout, and /settings. The problem is that /LOGIN, /LOGOUT, and /SETTINGS also worked... and they stored them separately.
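To make the distinction concrete, here's a sketch of the two ways to key those counters. Every name here is invented, since the real engine's internals are long gone. Keyed on the raw path, the map grows with whatever spellings clients feel like sending; keyed on the handler that actually served the request, it can never be bigger than the set of real endpoints.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    public class HitCounter {
        // One entry per raw path string the client sent: /login, /Login
        // and /LOGIN all get separate entries.
        private final Map<String, AtomicLong> hitsByRawPath = new ConcurrentHashMap<>();

        // One entry per handler that actually served the request, no
        // matter how the path was spelled.
        private final Map<String, AtomicLong> hitsByHandler = new ConcurrentHashMap<>();

        void recordRaw(String rawPath) {
            hitsByRawPath.computeIfAbsent(rawPath, k -> new AtomicLong()).incrementAndGet();
        }

        void recordByHandler(String handlerName) {
            hitsByHandler.computeIfAbsent(handlerName, k -> new AtomicLong()).incrementAndGet();
        }
    }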

Pretty soon you started getting things like this:

/login
/Login
/LogIn
/logiN
/LOGIN
/LoGiN

... and so on. Every new permutation got its own entry. It was actually far worse than what I'm showing here, since every entry had a series of buckets. It would track the total number of hits, plus the number of hits in the past day, 6 hours, 2 hours, 1 hour, 30 minutes, 15 minutes, and so on.
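Going by that description, each entry probably looked roughly like this. Every name below is invented; only the overall shape matters.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // One counter per time window, per entry.
    class EndpointStats {
        long total;                                              // all-time hits
        long lastDay, last6h, last2h, last1h, last30m, last15m;  // windowed hits
        // ...plus whatever timestamps are needed to age old hits
        // out of each window as time passes.
    }

    class HitTracker {
        // One entry per distinct URL string ever seen, kept forever.
        // This is the thing that grows without bound once /login,
        // /Login, /LOGIN and friends all count separately.
        final Map<String, EndpointStats> stats = new ConcurrentHashMap<>();
    }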

All of those buckets took up memory, obviously, but they also scarfed down CPU time like it was going out of style. Inserting a new entry or updating one meant the system had to traverse that structure to find the right spot. Then it had to keep passing over them to keep the time-based buckets accurate. It was a mess.

The craziest part of this is that all of this tracking was happening inside the process which was also responsible for serving production traffic. Eventually, it would either grow too large and would be whacked by the system's out-of-memory handler, or it would simply become too slow to handle requests sanely. Then someone would notice and would restart it. Until that happened, any request which happened to land on this instance would receive suboptimal service.

Analytics are all well and good, but if you're going to let them chew through unbounded amounts of your finite resources, at least make that happen somewhere it can't bring down your actual site. Otherwise, you might find yourself analyzing a time series of zero hits as people find some other site to frequent.