Brewster Kahle (of Archive.org) made a small fortune by selling WAIS (Wide Area Information Server), developed in the 80s, which turned each machine into its own search engine and defined a protocol for querying all of those engines.
Unfortunately, the Web has gone down this road of having centralised search engines which somehow know their way through a maze.
It would be better to contact several sites in turn (NYT, FT, Archive, etc) and pull information. If you have to pay, you go through the credit card paywall. Then you don't have to worry about NYT's dark kilobytes.
Why aren't we using it? I guess that's what happens when you sell something like that to AOL. ;-)
I should also point out that Google could have a stream for the NYT, where the NYT feeds all its stories to Google on creation, and Google doesn't enable cache for the stuff people pay for. But for all I know, that's already being done.
But a push model, where the service sends its content to the search server, is better than Google's current pull model, where the server crawls content out of the service.
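As a purely illustrative sketch of the push model described above: the publisher notifies the indexer the moment a story is created, and paid content is indexed but flagged non-cacheable so search works without exposing it. The `Indexer` class and `ingest()` method are invented names, not any real Google or NYT API.

```python
# Hypothetical sketch of publisher-push indexing (all names invented).

class Indexer:
    def __init__(self):
        self.index = {}

    def ingest(self, url, text, cacheable):
        """Publisher pushes a story on creation; paid content is indexed
        for search but flagged non-cacheable, so the full text is never
        served from the engine's cache."""
        self.index[url] = {"text": text, "cacheable": cacheable}


indexer = Indexer()
# The publisher calls this for each story as it is published,
# rather than waiting for a crawler to pull it:
indexer.ingest("http://nytimes.com/paid-story", "full text here", cacheable=False)
```

The key difference from a crawl is freshness and control: the publisher decides exactly what the engine sees and when, instead of the engine guessing.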
It's a useful reminder of how distinctly unsolved the search problem is. Google has taken stabs at this area from different directions with Google Base and Product Search, but there's still a whole world of information "out there" which is inaccessible or not usefully organized.
Google's weakness with structured data and its weakness in cultivating third-party developers are mutually reinforcing and seem to have arisen from the same hubris. In other words, bring back the search API already!
It seems the AJAX Search API now lets you get a machine-readable list of search results fairly straightforwardly; I'm not sure that was true back when the SOAP API was canned. http://code.google.com/apis/ajaxsearch/documentation/#fonje But the terms and conditions still seem to prevent you from doing anything useful with the structured results. http://code.google.com/apis/ajaxsearch/terms.html (see especially the start of 1.3)
Actually, we're creating a way for developers to access the web really easily for different kinds of analysis, including building semantic frameworks. The idea is that we give you really cheap, really fast access to millions of pages, and you use our platform to analyze Internet content however you want: $2 per 1 million pages crawled, $0.03 per CPU-hour used for any computing you want to do. We're not yet in beta, but you can check out our site: http://www.80legs.com.
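To make the pricing above concrete, here's a small cost calculation using the two quoted rates ($2 per million pages, $0.03 per CPU-hour). The `crawl_cost` function is just an illustration, not part of any real 80legs API.

```python
# Cost estimate based on the pricing quoted above:
# $2 per 1 million pages crawled, $0.03 per CPU-hour of processing.

def crawl_cost(pages, cpu_hours, price_per_million=2.00, price_per_cpu_hour=0.03):
    """Return the total job cost in dollars for a crawl-and-analyze run."""
    return (pages / 1_000_000) * price_per_million + cpu_hours * price_per_cpu_hour

# e.g. crawling 10 million pages and spending 50 CPU-hours analyzing them:
print(crawl_cost(10_000_000, 50))  # → 21.5
```

So a fairly substantial 10-million-page job works out to about $21.50, with the crawl itself dominating the cost.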
Looks cool. How would it compare to Amazon/Alexa search service? In theory they allow you to build your own search engine, but in practice you can't really amend their ranking formula and don't get access to the raw inverted index (with tf-idf statistics and such). Yahoo BOSS is in the same league.
Yes, in theory you could do something similar with AWS. However, you'd have to put in the work to handle all the complexities of parallel computing and web crawling yourself. We do that for you. And yes, our service is cheaper.
We'd love to see developers using our platform to build some very interesting indexes based on innovative concepts.
I have to admit I have limited familiarity with Pig and Sawzall (just added "Learn more about Pig and Sawzall" to my to-do list :D), but our platform is designed to save developers from having to think about clusters, parallelization, etc.
The two basic functions you would interact with as a developer are a select() and a do(). The select() specifies what content you want. The do() specifies what you want to do with that content. The backend infrastructure is supposed to handle everything else.
I don't have any code examples right now, but we plan to provide some with or during the beta release.
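Purely as a hypothetical sketch of the select()/do() interface described above: the names select() and do() come from the comment; everything else (the `CrawlJob` class, the callback signature, the naive substring matching) is invented for illustration, since no real examples have been published yet.

```python
# Invented pseudo-client for a select()/do() style crawling platform.

class CrawlJob:
    def __init__(self):
        self.selectors = []
        self.actions = []

    def select(self, url_pattern, content_type="text/html"):
        """Specify what content the crawl should fetch."""
        self.selectors.append((url_pattern, content_type))
        return self

    def do(self, action):
        """Specify a function to run against each fetched page.
        The platform's backend would handle clusters and parallelism."""
        self.actions.append(action)
        return self

    def run_locally(self, pages):
        """Stand-in for the real distributed backend: apply every action
        to every page whose URL matches a selector (naive substring match)."""
        results = []
        for url, body in pages:
            if any(pattern in url for pattern, _ in self.selectors):
                for action in self.actions:
                    results.append(action(url, body))
        return results


# Example: count words on every crawled nytimes.com page.
job = CrawlJob()
job.select("nytimes.com").do(lambda url, body: (url, len(body.split())))
print(job.run_locally([("http://nytimes.com/a", "three word story")]))
# → [('http://nytimes.com/a', 3)]
```

The appeal of the design is that the developer only writes the two declarative pieces; distribution, crawling politeness, and fault tolerance would all live behind the backend.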