Hacker News

Quick question. Could you use Scrapy to crawl specific individual pages from thousands (or millions) of sites, or would you be better off using a search engine crawler like Nutch for this? I want to crawl the first page of a number of specific sites and was looking into the technologies for this.


Yes. If you subclass CrawlSpider, you can set rules on your crawls [1]

[1] http://scrapy.readthedocs.org/en/latest/topics/spiders.html?...


Yes. Each spider in Scrapy has a "start_urls" attribute, so you'd just need to fill that with all your domains and make sure the spider has freedom to crawl across domains. Each URL would be accessed, you'd do whatever you want to do with it, and when the spider has visited them all, it would quit.


> would you be better off using a search engine crawler like Nutch for this?

FWIW, we recently met a client who tried Scrapy + Frontera against Nutch, and their assessment was that Scrapy + Frontera is twice as fast. Here's a deck on Frontera, FYI:

http://www.slideshare.net/scrapinghub/frontera-open-source-l...

Aside: don't hesitate to get in touch with our sales team if you need help. We're experienced in this type of project.


Scrapy can do the job, for sure. We use it to crawl more than 2 billion pages a month. :)



