Hacker News

Quick question. Could you use Scrapy to crawl specific individual pages from thousands (or millions) of sites, or would you be better off using a search engine crawler like Nutch for this? I want to crawl the first page of a number of specific sites and was looking into the technologies for this.


Yes. If you subclass CrawlSpider, you can set rules on your crawls [1]

[1] http://scrapy.readthedocs.org/en/latest/topics/spiders.html?...


Yes. Each spider in Scrapy has a "start_urls" attribute, so you'd just need to fill that with all your domains and make sure the spider has freedom to crawl across domains. Each URL would be accessed, you'd do whatever you want to do with it, and when the spider has visited them all, it would quit.


> would you be better off using a search engine crawler like Nutch for this?

FWIW, we recently met a client who tried Scrapy + Frontera against Nutch, and their assessment was that Scrapy + Frontera is twice as fast. Here's a deck on Frontera, FYI:

http://www.slideshare.net/scrapinghub/frontera-open-source-l...

Aside: don't hesitate to get in touch with our sales team if you need help. We're experienced in this type of project.


Scrapy can do the job, for sure. We use it to crawl more than 2 billion pages a month. :)



