Writing

Software, technology, sysadmin war stories, and more.
Monday, April 30, 2012

Inverting web search

Web search is a strange thing right now. You have sites which create content and host it. By definition, these sites know all about what they have stored. They post it online, and then search engines come along and try to make exact copies of it. The search engines go to great lengths to discover all of it and keep their copies as fresh as possible, but they can never find all of it.

Ultimately, people go to those indexing places and ask them to run searches against their copies. The search engines then make decisions about which pages are more important than others and generate results. This is effectively a "pull" strategy. They pull content and then run searches against their own local copies of things.

I started wondering about a "push" strategy. I can't remember exactly how far back I had the original notion about this, but it's not at all like how things work now. Basically, it would make the search process more-or-less owned by the user, and a bunch of different sites could try to cough up good results for them.

The problem with this kind of fan-out approach is knowing who to ask. You might know to search Reddit, Hacker News, Twitter, MetaFilter, and even the accursed G+, but oops, you forgot Tumblr, LiveJournal, Dreamwidth, and ... who knows what else.

Rather than being stuck there, I turned it on its head as well. Instead of finding sources of content, maybe they need to find me. In that world, I publish my search term, then they get to see it and respond. So how would that work, exactly?

My first thought was a clearinghouse for search terms and results, sort of like a stock market or commodity exchange.

Let's say a user connects to the service and registers a search for "rachelbythebay". Data providers who are connected then see that come across the "firehose", and they can provide results for it. Those results work their way back to the user asynchronously. Instead of a web search being a one-time affair, it's more like a standing order which might take a while to fulfill.
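
To make the flow a little more concrete, here's a rough in-memory sketch of what such a clearinghouse might look like. Everything in it is made up for illustration: the `Exchange` class, its method names, the "example-blog" provider, and the idea that a provider is just a callback are all assumptions, and a real service would obviously involve a network and persistence.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    provider: str  # hypothetical provider name
    term: str
    text: str


class Exchange:
    """Toy clearinghouse; every name here is invented for illustration."""

    def __init__(self):
        self.providers = {}               # provider name -> callback(term)
        self.standing = defaultdict(set)  # term -> users with a standing order
        self.inboxes = defaultdict(list)  # (user, term) -> results so far

    def connect_provider(self, name, on_term):
        self.providers[name] = on_term

    def register_search(self, user, term):
        """Record a standing order and publish the term on the 'firehose'."""
        self.standing[term].add(user)
        for on_term in list(self.providers.values()):
            on_term(term)

    def submit_result(self, result):
        """Providers call this whenever they have an answer, maybe much later."""
        for user in self.standing[result.term]:
            self.inboxes[(user, result.term)].append(result)

    def results_for(self, user, term):
        return list(self.inboxes[(user, term)])


exchange = Exchange()
terms_seen = []  # stands in for whatever a real provider would do with a term
exchange.connect_provider("example-blog", terms_seen.append)

exchange.register_search("alice", "rachelbythebay")
# Some time later, the provider gets around to answering.
exchange.submit_result(Result("example-blog", "rachelbythebay", "Inverting web search"))
print(exchange.results_for("alice", "rachelbythebay"))
```

The user never waits on any one provider here. Results just pile up in the inbox for that standing order as they arrive.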

Obviously, you'd have to do something about spammy sources. Charging money to providers to raise the bar would be a good start. Maybe only allowing a provider to send along a single response for a given term might help, too. Allowing users to identify providers and rank them based on their experience with the results wouldn't hurt.

Let's say your search for a given term results in something that's obviously spammy. You know it just by looking at it. Just mark it as such and move on. Any results coming from that provider will be de-emphasized for you in the future. Aggregating these "downvotes" to provide default quality values per provider might also be interesting.
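
One possible way to do that bookkeeping is sketched below. The `SpamVotes` class, the provider names, and the `1 / (1 + n)` quality formula are all assumptions picked for the example. Here "de-emphasized" is interpreted as "sorted below everything you haven't flagged", which is only one of many ways to do it.

```python
from collections import Counter, defaultdict


class SpamVotes:
    """Hypothetical bookkeeping for 'mark as spam' clicks."""

    def __init__(self):
        self.by_user = defaultdict(set)  # user -> providers that user flagged
        self.total = Counter()           # provider -> distinct users who flagged it

    def mark_spam(self, user, provider):
        # Count each user at most once per provider.
        if provider not in self.by_user[user]:
            self.by_user[user].add(provider)
            self.total[provider] += 1

    def default_quality(self, provider):
        """Aggregate score in (0, 1]; the formula is an arbitrary choice."""
        return 1.0 / (1 + self.total[provider])

    def rank(self, user, results):
        """Sort (provider, text) pairs for one user.

        Providers this user flagged go last; everything else is ordered
        by the aggregate default quality, best first.
        """
        flagged = self.by_user[user]
        return sorted(
            results,
            key=lambda r: (r[0] in flagged, -self.default_quality(r[0])),
        )


votes = SpamVotes()
votes.mark_spam("alice", "spamco")
results = [("spamco", "BUY NOW"), ("example-blog", "an actual post")]
print(votes.rank("alice", results))
```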

Anyway, this thing can just keep on rolling. Providers need not take on the full firehose, which could be a pretty nasty amount of data. They could just pull terms at their own pace and provide answers when they can. Those answers can be cached in the "exchange" for other people doing the same search within a certain amount of time. Having TTL values for each result provided by a source could improve this.
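
The "pull at their own pace" part could be as simple as an append-only log of terms with a read cursor per provider. This is just a sketch under that assumption: `TermLog`, `publish`, and `pull` are hypothetical names, and nothing here deals with trimming the log or with providers that never come back.

```python
class TermLog:
    """Hypothetical append-only log of search terms, one cursor per provider."""

    def __init__(self):
        self.terms = []
        self.cursors = {}  # provider -> index of the next term it has not seen

    def publish(self, term):
        self.terms.append(term)

    def pull(self, provider, limit=10):
        """Hand back up to `limit` unseen terms and advance the cursor."""
        start = self.cursors.get(provider, 0)
        batch = self.terms[start:start + limit]
        self.cursors[provider] = start + len(batch)
        return batch


log = TermLog()
for term in ["rachelbythebay", "foo bar show", "sysadmin war stories"]:
    log.publish(term)

# A slow provider takes two terms now and the rest whenever it gets to them.
print(log.pull("slow-provider", limit=2))
print(log.pull("slow-provider", limit=2))
```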

I envision a system where you might provide "foo bar show: live stream" with an appropriate lifetime so it only says that while the show is on. Afterward, you might return "foo bar show: April 30th" instead. You'd want to be very careful about the claims you make, lest you annoy someone who then downvotes your result. Claiming the show is live even after it's ended would be bad.
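
Here's a small sketch of how per-result lifetimes might work inside the exchange's cache. The `ResultCache` class, the "foo-bar-site" provider name, and the specific TTL values are assumptions for the sake of the example. The clock is injectable only so the example can fake the passage of time.

```python
import time


class ResultCache:
    """Hypothetical cache where each result carries its own lifetime."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.entries = {}  # (term, provider) -> (text, expires_at)

    def put(self, term, provider, text, ttl_seconds):
        # One slot per (term, provider): a newer answer replaces the old one.
        self.entries[(term, provider)] = (text, self.clock() + ttl_seconds)

    def get(self, term):
        """Return unexpired (provider, text) pairs and drop expired ones."""
        now = self.clock()
        live = []
        for key, (text, expires_at) in list(self.entries.items()):
            if key[0] != term:
                continue
            if expires_at <= now:
                del self.entries[key]
            else:
                live.append((key[1], text))
        return live


fake_now = [0.0]  # fake clock so the example doesn't have to sleep
cache = ResultCache(clock=lambda: fake_now[0])

# While the show is on: a one-hour lifetime.
cache.put("foo bar show", "foo-bar-site", "foo bar show: live stream", ttl_seconds=3600)
fake_now[0] = 1800.0
print(cache.get("foo bar show"))  # the live result is still there

# After the hour is up, the "live" claim has expired by itself.
fake_now[0] = 3600.0
print(cache.get("foo bar show"))

# The provider then files a longer-lived answer.
cache.put("foo bar show", "foo-bar-site", "foo bar show: April 30th", ttl_seconds=30 * 86400)
print(cache.get("foo bar show"))
```

Keying on the term and provider together also gives you the "single response for a given term" limit for free, since a provider can only overwrite its own slot.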

This would also have the interesting side-effect that dead sites would drop out of search results right away, since they wouldn't be around to respond to queries. After your results expire, you're gone.

Another fun quirk is that all of the providers would be able to see the query stream. Right now, this is largely the exclusive domain of search engines (and people who might be looking at a screen in their visitor lobbies). While this is sure to be gamed for profit, I wonder what kinds of neat things might happen as a result of it.

It all seems so obvious. I imagine this has been tried before.