Brewster Kahle (of Archive.org) made a small fortune by selling WAIS (Wide Area Information Server), developed in the 80s, which turned each machine into its own search engine and defined a protocol for querying all of those engines.
Unfortunately, the Web has gone down this road of having centralised search engines which somehow know their way through a maze.
It would be better to contact several sites in turn (NYT, FT, Archive, etc) and pull information. If you have to pay, you go through the credit card paywall. Then you don't have to worry about NYT's dark kilobytes.
Why aren't we using it? I guess that's what happens when you sell something like that to AOL. ;-)
I should also point out that Google could have a stream for the NYT, where the NYT feeds all its stories to Google on creation, and Google doesn't enable cache for the stuff people pay for. But for all I know, that's already being done.
But a push model, where the service sends its content to the search server, is better than Google's current pull model, where the server crawls content out of the service.
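As a purely illustrative sketch of the push model described above: the publisher notifies the indexer the moment a story is created, and paid content is indexed but flagged non-cacheable so search works without exposing it. The `Indexer` class and `ingest()` method are invented names, not any real Google or NYT API.

```python
# Hypothetical sketch of publisher-push indexing (all names invented).

class Indexer:
    def __init__(self):
        self.index = {}

    def ingest(self, url, text, cacheable):
        """Publisher pushes a story on creation; paid content is indexed
        for search but flagged non-cacheable, so the full text is never
        served from the engine's cache."""
        self.index[url] = {"text": text, "cacheable": cacheable}


indexer = Indexer()
# The publisher calls this for each story as it is published,
# rather than waiting for a crawler to pull it:
indexer.ingest("http://nytimes.com/paid-story", "full text here", cacheable=False)
```

The key difference from a crawl is freshness and control: the publisher decides exactly what the engine sees and when, instead of the engine guessing.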
It's a useful reminder of how distinctly unsolved the search problem is. Google has taken stabs at this area from different directions with Google Base and Product Search, but there's still a whole world of information "out there" which is inaccessible or not usefully organized.
Google's weakness with structured data and its weakness in cultivating third-party developers are mutually reinforcing and seem to have arisen from the same hubris. In other words, bring back the search API already!
It seems the AJAX Search API now lets you get a machine-readable list of search results fairly straightforwardly; I'm not sure that was true back when the SOAP API was canned. http://code.google.com/apis/ajaxsearch/documentation/#fonje But the terms and conditions still seem to prevent you from doing anything useful with the structured results. http://code.google.com/apis/ajaxsearch/terms.html (see especially the start of 1.3)
Actually, we're creating a way for developers to access the web really easily for different kinds of analysis, including building semantic frameworks. The idea is that we give you really cheap, really fast access to millions of pages, and you use our platform to analyze Internet content however you want: $2 per 1 million pages crawled, $0.03 per CPU-hour used for any computing you want to do. We're not yet in beta, but you can check out our site: http://www.80legs.com.
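To make the pricing above concrete, here's a small cost calculation using the two quoted rates ($2 per million pages, $0.03 per CPU-hour). The `crawl_cost` function is just an illustration, not part of any real 80legs API.

```python
# Cost estimate based on the pricing quoted above:
# $2 per 1 million pages crawled, $0.03 per CPU-hour of processing.

def crawl_cost(pages, cpu_hours, price_per_million=2.00, price_per_cpu_hour=0.03):
    """Return the total job cost in dollars for a crawl-and-analyze run."""
    return (pages / 1_000_000) * price_per_million + cpu_hours * price_per_cpu_hour

# e.g. crawling 10 million pages and spending 50 CPU-hours analyzing them:
print(crawl_cost(10_000_000, 50))  # → 21.5
```

So a fairly substantial 10-million-page job works out to about $21.50, with the crawl itself dominating the cost.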
Looks cool. How would it compare to Amazon/Alexa search service? In theory they allow you to build your own search engine, but in practice you can't really amend their ranking formula and don't get access to the raw inverted index (with tf-idf statistics and such). Yahoo BOSS is in the same league.
Yes, in theory you could do something similar with AWS. However, you'd have to put in the work to handle all the complexities of parallel computing and web crawling yourself. We do that for you. And yes, our service is cheaper.
We'd love to see developers using our platform to build some very interesting indexes based on innovative concepts.
I have to admit I have limited familiarity with Pig and Sawzall (just added "Learn more about Pig and Sawzall" to my to-do list :D), but our platform is designed to save developers from having to think about clusters, parallelization, etc.
The two basic functions you would interact with as a developer are a select() and a do(). The select() specifies what content you want. The do() specifies what you want to do with that content. The backend infrastructure is supposed to handle everything else.
I don't have any code examples right now, but we plan to provide some with or during the beta release.
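Purely as a hypothetical sketch of the select()/do() interface described above: the names select() and do() come from the comment; everything else (the `CrawlJob` class, the callback signature, the naive substring matching) is invented for illustration, since no real examples have been published yet.

```python
# Invented pseudo-client for a select()/do() style crawling platform.

class CrawlJob:
    def __init__(self):
        self.selectors = []
        self.actions = []

    def select(self, url_pattern, content_type="text/html"):
        """Specify what content the crawl should fetch."""
        self.selectors.append((url_pattern, content_type))
        return self

    def do(self, action):
        """Specify a function to run against each fetched page.
        The platform's backend would handle clusters and parallelism."""
        self.actions.append(action)
        return self

    def run_locally(self, pages):
        """Stand-in for the real distributed backend: apply every action
        to every page whose URL matches a selector (naive substring match)."""
        results = []
        for url, body in pages:
            if any(pattern in url for pattern, _ in self.selectors):
                for action in self.actions:
                    results.append(action(url, body))
        return results


# Example: count words on every crawled nytimes.com page.
job = CrawlJob()
job.select("nytimes.com").do(lambda url, body: (url, len(body.split())))
print(job.run_locally([("http://nytimes.com/a", "three word story")]))
# → [('http://nytimes.com/a', 3)]
```

The appeal of the design is that the developer only writes the two declarative pieces; distribution, crawling politeness, and fault tolerance would all live behind the backend.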