Here is a first one: what are the best ways to detect changes in HTML sources with Scrapy, which would otherwise cause missing data in the automated systems that need to be fed?
Well, missing data can come from problems at several different levels:
1) site changes caused the scraped items to be incomplete (missing fields) -- for this, one approach is to use an Item Validation Pipeline in Scrapy, perhaps using a JSON schema or something similar, logging errors or rejecting an item if it doesn't pass the validation.
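A minimal sketch of such a pipeline, assuming the jsonschema library and a made-up schema with illustrative field names (adapt both to your own items):

    # Minimal item validation pipeline sketch using the jsonschema library.
    # The schema and field names below are illustrative only.
    from jsonschema import Draft4Validator
    from scrapy.exceptions import DropItem

    PRODUCT_SCHEMA = {
        "type": "object",
        "properties": {
            "name": {"type": "string", "minLength": 1},
            "price": {"type": "string", "minLength": 1},
            "url": {"type": "string", "minLength": 1},
        },
        "required": ["name", "price", "url"],
    }

    class SchemaValidationPipeline(object):
        def __init__(self):
            self.validator = Draft4Validator(PRODUCT_SCHEMA)

        def process_item(self, item, spider):
            errors = list(self.validator.iter_errors(dict(item)))
            if errors:
                for error in errors:
                    spider.logger.error("Validation error for %s: %s",
                                        item.get('url'), error.message)
                raise DropItem("Item failed schema validation")
            return item

You enable it through ITEM_PIPELINES in settings.py; whether you drop the item or just log depends on how strict the downstream systems are.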
2) site changes caused the item scraping itself to fail: one solution is to store the sources and monitor the spider errors -- when there are errors, you can rescrape from the stored sources (storing sources can get a bit expensive for big crawls). Scrapy doesn't have a complete solution for this out of the box; you have to build your own. You could use the HTTP cache mechanism and build a custom cache policy: http://doc.scrapy.org/en/latest/topics/downloader-middleware...
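For reference, a cache policy is just a class with a handful of methods; here's a rough sketch of a "store everything" policy (the class name is made up -- you'd point HTTPCACHE_POLICY at it and set HTTPCACHE_ENABLED = True in settings):

    # Rough sketch of a cache-everything policy so a crawl can be replayed
    # from the stored responses later. Name and details are illustrative.
    from scrapy.utils.httpobj import urlparse_cached

    class StoreAllPolicy(object):
        def __init__(self, settings):
            self.ignore_schemes = settings.getlist('HTTPCACHE_IGNORE_SCHEMES')

        def should_cache_request(self, request):
            return urlparse_cached(request).scheme not in self.ignore_schemes

        def should_cache_response(self, response, request):
            # store everything, including error responses, so failures can
            # be inspected and rescraped later
            return True

        def is_cached_response_fresh(self, cachedresponse, request):
            # serve from cache when replaying; for a live crawl you would
            # add an expiration check here instead of always returning True
            return True

        def is_cached_response_valid(self, cachedresponse, response, request):
            return True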
3) site changed the navigation structure, and the pages to be scraped were never reached: this is the worst one. It's similar to the previous case, but it's one you want to detect earlier -- saving the sources doesn't help much, since the failure happens early in the crawl, so you want to be monitoring it.
One good practice is to split the crawl in two: one spider does the navigation and pushes the links of the pages to be scraped into a queue (or similar), and another spider reads the URLs from there and just scrapes the data.
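A minimal sketch of that split, using a plain Redis list as the queue (Redis, the key name, the selectors and the spider names are all just placeholders; scrapy-redis is one ready-made option for this pattern):

    # Two-spider split sketch: one spider navigates and enqueues URLs,
    # the other reads the queue and only extracts data.
    import redis
    import scrapy

    QUEUE_KEY = 'myproject:pages_to_scrape'  # hypothetical key name

    class NavigationSpider(scrapy.Spider):
        """Walks the site structure and only enqueues detail-page URLs."""
        name = 'navigation'
        start_urls = ['http://example.com/catalog']  # placeholder

        def __init__(self, *args, **kwargs):
            super(NavigationSpider, self).__init__(*args, **kwargs)
            self.queue = redis.StrictRedis()

        def parse(self, response):
            for href in response.css('a.product::attr(href)').extract():
                self.queue.rpush(QUEUE_KEY, response.urljoin(href))
            for href in response.css('a.next-page::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)

    class ScraperSpider(scrapy.Spider):
        """Reads URLs from the queue and only extracts item data."""
        name = 'scraper'

        def start_requests(self):
            queue = redis.StrictRedis()
            url = queue.lpop(QUEUE_KEY)
            while url:
                yield scrapy.Request(url.decode('utf-8'),
                                     callback=self.parse_item)
                url = queue.lpop(QUEUE_KEY)

        def parse_item(self, response):
            yield {
                'url': response.url,
                'title': response.css('h1::text').extract_first(),  # illustrative
            }

The nice side effect is that a navigation breakage shows up as an empty (or suspiciously small) queue, which is easy to alert on independently of the scraping spider.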
Hey, I'm not sure I understood what you mean. Did you mean:
1) detect pages that had changed since the last crawl, to avoid recrawling pages that hadn't changed?
2) detect pages that have changed their structure, breaking the spider that crawls them?
> 1) detect pages that had changed since the last crawl, to avoid recrawling pages that hadn't changed?
You could use the deltafetch [1] middleware. It ignores requests to pages from which items were extracted in previous crawls.
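Enabling it is basically just a settings change; a hedged example (the module path and option names below are those of the scrapy-deltafetch package -- double-check against the README of whichever version you install):

    # in settings.py
    SPIDER_MIDDLEWARES = {
        'scrapy_deltafetch.DeltaFetch': 100,
    }
    DELTAFETCH_ENABLED = True
    # passing -a deltafetch_reset=1 to the spider wipes the seen-requests
    # database and forces a full recrawl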
> 2) detect pages that have changed their structure, breaking the spider that crawls them?
This is a tough one, since most spiders rely heavily on the HTML structure. You could use Spidermon [2] to monitor your spiders. It's available as an addon on the Scrapy Cloud platform [3], and there are plans to open source it in the near future. Also, dealing automatically with pages that change their structure is on the roadmap for Portia [4].
> 1) detect pages that had changed since the last crawl, to avoid recrawling pages that hadn't changed?
Usually web clients use https://en.wikipedia.org/wiki/HTTP_ETag , AFAIK. If a web app/server doesn't support it, you could compute your own hash and check it yourself, instead of handling that condition at the network layer.
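A minimal sketch of the "compute your own hash" fallback in a Scrapy spider (the shelve file, spider name and start URL are just placeholders; if the server does send ETags you'd store those and send them back as If-None-Match headers instead):

    # Sketch: fingerprint each response body and skip pages whose content
    # is unchanged since the last crawl. Storage and names are illustrative.
    import hashlib
    import shelve

    import scrapy

    class ChangeDetectingSpider(scrapy.Spider):
        name = 'change_detector'
        start_urls = ['http://example.com/']  # placeholder

        def __init__(self, *args, **kwargs):
            super(ChangeDetectingSpider, self).__init__(*args, **kwargs)
            self.seen_hashes = shelve.open('page_hashes.db')

        def parse(self, response):
            digest = hashlib.sha1(response.body).hexdigest()
            if self.seen_hashes.get(response.url) == digest:
                self.logger.info('Unchanged, skipping: %s', response.url)
                return
            self.seen_hashes[response.url] = digest
            # ...page is new or changed: extract items / follow links here...

        def closed(self, reason):
            self.seen_hashes.close()

In practice you usually have to normalize the body first (strip timestamps, session tokens, ads and the like), otherwise the hash changes on every fetch even when the real content hasn't.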