One interesting discussion from here: http://www.commoncrawl.org/common-crawl-en...

ahadrana · on Nov 8, 2011

Hi, I work at commoncrawl, so I will try to answer your question. We store our crawl data on S3 in the form of 100MB compressed archives, and there are between 40,000 and 50,000 such files in commoncrawl’s bucket today. The key to scanning such a large set of files efficiently on EC2 is to have each of your Mappers (assuming you are running Hadoop) open multiple S3 streams in parallel to maintain some desired level of throughput. For example, assuming that you can maintain on average a 1MByte/sec throughput per S3 stream, and you start 10 parallel streams per Mapper, you should be able to sustain a throughput of 80 Mbits/sec or 10 MBytes/sec. If you were to run one Mapper per EC2 small instance, and start 100 such instances, this would yield an aggregate throughput of close to 3TB/hour. At that rate, you would need 16 hours to scan 50TB of data, or a total of 1600 machine hours at $0.085 per hour, costing you somewhere in the neighborhood of $130.00. Of course, you would then need to add in the cost of running any subsequent aggregation / data consolidation jobs and the cost of storing your final data on S3. So, the $100.00 number is generally in the ballpark but final numbers may vary :-)
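
For anyone who wants to rerun the back-of-envelope estimate above with their own numbers, here is a minimal sketch. The function name `scan_estimate` and its parameters are made up for illustration, and it assumes decimal units (1 TB = 1,000,000 MB) and perfectly linear scaling across streams and instances, which a real job will probably not achieve.

```python
def scan_estimate(stream_mbyte_per_s=1.0, streams_per_mapper=10,
                  instances=100, corpus_tb=50.0, price_per_hour=0.085):
    """Idealized scan time and EC2 cost; every default is taken from the comment above."""
    # Per-Mapper throughput: parallel S3 streams are assumed to add up linearly.
    mapper_mbyte_per_s = stream_mbyte_per_s * streams_per_mapper
    mapper_mbit_per_s = mapper_mbyte_per_s * 8

    # Cluster throughput with one Mapper per instance, converted to TB/hour
    # (assumption: 1 TB = 1,000,000 MB).
    cluster_mbyte_per_s = mapper_mbyte_per_s * instances
    cluster_tb_per_hour = cluster_mbyte_per_s * 3600 / 1_000_000

    # Wall-clock time, total machine hours, and the resulting instance cost.
    wall_clock_hours = corpus_tb / cluster_tb_per_hour
    machine_hours = wall_clock_hours * instances
    cost_usd = machine_hours * price_per_hour

    return {
        "mapper_mbyte_per_s": mapper_mbyte_per_s,
        "mapper_mbit_per_s": mapper_mbit_per_s,
        "cluster_tb_per_hour": cluster_tb_per_hour,
        "wall_clock_hours": wall_clock_hours,
        "machine_hours": machine_hours,
        "cost_usd": cost_usd,
    }


estimate = scan_estimate()
for name, value in estimate.items():
    print(f"{name}: {value:.2f}")
```

With the defaults, the idealized figures come out to about 3.6 TB/hour, roughly 13.9 hours and about $118. The comment rounds down to "close to 3TB/hour" and 16 hours, which gives 1600 machine hours and $136 at $0.085 per hour, so the two estimates agree to within the slack you would expect for real-world overhead.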

As far as comparisons to Yahoo BOSS are concerned, no, we are definitely not comparable to Yahoo BOSS or other such APIs that run on top of an already built (and properly ranked) inverted index of the web. At this stage we only produce bulk snapshots of what we crawl, and we are focusing our engineering resources on improving the frequency and coverage of our crawl (the results of which will hopefully start to bear fruit in early 2012). Perhaps at some point in the near future, we can partner with the community to build a rudimentary full-text inverted index of the Web that we can make available in bulk via S3 as well.

joda_ · on Nov 8, 2011

Hey ahadrana, I haven't found anything about the page ranks on the website. Are they included? Do you know if it is possible to go through only the metadata of the crawl, say to get the page ranks for a list of pages, or do you have to go through the full crawl?

ahadrana · on Nov 8, 2011

The pagerank and other metadata we compute are not part of the S3 corpus, but we do collect this information and will probably make it available in a separate S3 bucket in Hadoop SequenceFile format. Be aware that our pagerank will probably not have a high degree of correlation to Google's pagerank number, since their pagerank calculation is going to be a lot more sophisticated than our version.

Aloisius · on Nov 8, 2011

Does BOSS still exist? I was under the impression that it was defunct.

michels24 · on Nov 8, 2011

I was the former GM of Yahoo BOSS (was there from pre-launch through 11/09). BOSS does still exist - http://developer.yahoo.com/search/boss/. It is now a paid API under the umbrella of Yahoo Developer Network. The pricing plan (http://developer.yahoo.com/search/boss/#pricing) is based on query type and volume. Unfortunately there is no self-serve advertising model (meaning if you incorporate Y!/Bing search ads, the service is free). It's important to note though that this is the Bing search index, not the old Yahoo Search index that is effectively shut down. The original BOSS product was based on Yahoo! Search.

From what I have heard BOSS continues to do very well and is pointed at internally as how to turn an API into a real business and product.

One more note, I am now at Factual where we are very happy consumers of the CommonCrawl service.

nethsix · on Nov 8, 2011

Yes. With Google no longer providing a search result API (not even a paid version, the last I checked), people are turning to BOSS/Bing/(anything else?)

csulok · on Nov 8, 2011

The Custom Search API is the search result API. The CSE has a flag for searching the entire internet. http://www.google.com/support/customsearch/bin/answer.py?hl=...