Writing

Feed Software, technology, sysadmin war stories, and more.

Thursday, May 16, 2013

A half-baked idea about honeywords and giant hash tables

When you think of an authentication system, exactly what sort of design concepts start swirling around in your head? Okay, that's kind of vague, so let's reduce the scope a bit. Imagine something which is meant to authenticate people who log in to a service, like a web page. It could be anything where ordinary people get to send you a username and password and you have to find out if they belong.

One approach is just to store every username and password you've ever seen. It's simple, it's easy, and it's a great way to really let your users down. Storing things in plaintext just makes life really interesting when your systems are compromised and the list of accounts and their passwords is spread around.

If your database looks like "user | user@example.com | password" then it might just be built around this sort of design.

It seems the next step up from that is to take a page from the world of shadow passwords and use some kind of hashing scheme on the password. Then, instead of storing the plaintext, you just store the result of some kind of black box function. In order to verify a login attempt, you just take their password, run it through that same function, and see if you get a match.

Of course, those systems get compromised as well. The lists are obtained and then the hashing function is determined, and then a series of brute force runs of that function eventually turn up the probable plaintext. It's possible to make educated guesses about what a password might be based on the username or other components of their profile. I think tools like "john" will grab bits of the "GECOS" data from the passwd file to help out. Users who had things like "District Office" in their location data would have a password of "district" cracked right away, in other words.

So about two weeks ago, I ran across a paper which talks about creating "honeywords" in the hopes of detecting passwords which have been cracked. It seems to be based around the notion of having passwords which might be valid hashes but aren't actually useful for accessing a given account. If some evil attacker sends in the wrong combination, it might be able to guess that someone's been messing with a copy of the hashed data.

It was pretty late that night when I found this, and that's probably what lead to my half-baked idea which would add to this. You keep the notion of having a bunch of bogus matches, but you go far beyond it in terms of enlarging the problem space.

First, instead of just hashing passwords, hash the usernames too. Then, instead of having separate lists of usernames and passwords, have just one. Third, instead of only having hashes for actual usernames and passwords which are in use, have all of them which have ever been used, including the ones which aren't even valid any more. Finally, hash a whole bunch of random crap and put it in there. Mix it up really well so there's no pattern to which is "filler" and which has actually been used by an actual user.

Let's say your system has 1000 active users, and each one has had two passwords on average. That would normally represent 3000 hashes. To make life interesting, we'll push it up to about 30 million hashes by hashing all kinds of stuff: random strings, dictionary words, whatever. Then we mix it up so it's not apparent which rows are which.

The table is simple enough: a row id which is unique, and a hash which is also unique.

Obviously, we need some way to identify which of these combinations is valid. That part is simple enough: it's just a tuple. You might say that ( 123, 456 ) is a valid user. That is, hash #123 and hash #456 together are a valid user. It might be necessary to get some kind of user ID back from this, so in that case, ( 123, 456 ) = user 1337. Whatever.

If you store this mapping in the same place as your hashes, you're still in the same boat as before, so let's make things even more interesting by storing these tuples somewhere else. Put them under separate cover as a completely different hosted service elsewhere in your organization. That service would only store these tuples -- just the simple mapping of ( 123, 456 ) to 1337, and so on.

Without the hash table, the list of tuples is just a bunch of numbers. Without the list of tuples, the hashes is a giant haystack which is full of plausible needles, and they're all encrypted. You have to brute-force all of them to get back their plaintexts. Only then can you start guessing combinations, and this time, you don't know the order, since how can you tell if it's a username or a password?

Okay, maybe "rachelbythebay" looks like a username and "hunter2" looks like a password, but who's to say that it couldn't also work the other way around?

It would be necessary for the tuple service to be vigilant about traffic patterns, and try to flag those which seem dubious. In that sense it might be possible to notice something bad before it goes too far.

I wonder if it would be useful for some third party to provide the tuple lookup service. It would be one fewer thing to reside on the systems where the login attempts are taking place. Of course, given that, the whole job of storing hashes could also be outsourced to yet some other provider as well.

If both of them were stored offsite, then a login attempt would be pretty interesting. First you collect the username and password, hash them, and ask the hash server for the numeric values for those hashes -- their unique row IDs. Then you take those numbers and fire off a request to the tuple server to see if they are valid. If so, then you allow the user in and proceed as before.

I suppose you could also send a bunch of "fuzz" requests to both the hash server and the tuple server so even they wouldn't know exactly what was going on. That would probably defeat the notion of having those providers run traffic analysis on your queries to find attacks, though. It would be back on you to do that locally.

I'm no crypto person, so this is probably full of holes and has probably been hashed (heh) to death somewhere already. Just in case it's something fresh, I'm putting it out there to hopefully stir some imaginations. Maybe something useful will come from all of this.