How to do Scrapy historical output comparison using Spidermon

Scrapinghub is releasing a new feature for Scrapy quality assurance. It is said to include historical comparison features that can detect when the current scrape's item count is below 50% of the previous scrape's, which is suspicious. But how can I apply it?



Solution 1:[1]

Spidermon version 1.10 introduced a new stats collector that keeps the stats of your last job executions inside your .scrapy directory (https://spidermon.readthedocs.io/en/latest/stats-collection.html). So every time you execute your spider, you will have a stats_history attribute available on your Spider instance containing a list of the stats of the jobs that were executed before. You no longer need to handle the storage of your stats manually, as Luiz suggested in his answer (but the principle is basically the same).

With that information available, you can create your own monitors that handle these statistics and, for example, calculate the mean number of items scraped in previous runs and compare it with your latest execution (or use the stats however you want). You can see an example of such a monitor in the docs linked above.
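
For reference, here is a minimal sketch of such a monitor. It assumes the stats history collector described in the linked docs is enabled (by pointing Scrapy's STATS_CLASS setting at it) so that stats_history is populated; the monitor and test names are just placeholders:

from spidermon import Monitor, monitors


@monitors.name("History comparison")
class HistoryComparisonMonitor(Monitor):
    @monitors.name("Item count close to historical mean")
    def test_item_count_close_to_historical_mean(self):
        # stats_history holds the stats dicts of previous executions
        previous_stats = self.data.spider.stats_history
        if not previous_stats:
            self.skipTest("No previous executions to compare against")

        # Mean number of items scraped over the stored previous executions
        previous_mean = sum(
            stats.get("item_scraped_count", 0) for stats in previous_stats
        ) / len(previous_stats)

        # Items scraped by the current execution
        items_extracted = getattr(self.data.stats, "item_scraped_count", 0)

        # Fail if the current run scraped less than 50% of the historical mean
        self.assertTrue(
            items_extracted >= 0.5 * previous_mean,
            msg="Scraped {} items, less than 50% of the historical mean ({})".format(
                items_extracted, previous_mean
            ),
        )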

Solution 2:[2]

To compare the current scraped items with a previous run you first need to store the stats of the previous run somewhere.

Take the Spidermon example project on GitHub, especially the monitors.py file. It defines two monitors, ItemCountMonitor and ItemValidationMonitor: the former checks whether the spider scraped fewer than 1000 items and, if so, sends a message on Slack; the latter checks whether the items validate against the item schema and, if not, also sends a message on Slack.

So now to your question.

If you want to detect whether the current scrape extracted 50% fewer items than the previous scrape, you need to store the scrape stats somewhere, or even store the scraped items themselves. Let's say you store the scraped items in a directory as /home/user/scraped_items/%(date)s.json, where %(date)s is the date your spider ran (e.g. 2019-01-01). To simplify, let's say you run the spider every day and there is one file per day.
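
One way to produce such files is Scrapy's feed exports. This is only a sketch: feed storage URI parameters are resolved from spider attributes of the same name, so the %(date)s placeholder works only if the spider defines a date attribute itself, and newer Scrapy versions configure feeds through the FEEDS setting instead of FEED_URI. Paths and names here are assumptions:

# settings.py
FEED_FORMAT = "json"
FEED_URI = "/home/user/scraped_items/%(date)s.json"

# spiders/example.py
from datetime import datetime

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    # %(date)s in FEED_URI is filled from this spider attribute
    date = datetime.now().strftime("%Y-%m-%d")

    def parse(self, response):
        ...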

Then you can write a monitor like this:

import json
from datetime import datetime, timedelta

from spidermon import Monitor, monitors


@monitors.name("Item count dropped")
class ItemCountDroppedMonitor(Monitor):
    @monitors.name("Item count dropped since previous run")
    def test_item_count_dropped(self):
        # Path of the items file written by yesterday's run
        yesterday = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')
        last_day_item_path = f'/home/user/scraped_items/{yesterday}.json'
        minimum_threshold = 0.5  # fail when the drop is 50% or more

        # Items extracted by the current run (from the Scrapy stats)
        items_extracted_now = getattr(self.data.stats, "item_scraped_count", 0)

        # Items extracted by the previous run (count of items in yesterday's file)
        with open(last_day_item_path) as items_file:
            items_extracted_last_run = len(json.load(items_file))

        diff = items_extracted_last_run - items_extracted_now
        self.assertFalse(
            diff >= (items_extracted_last_run * minimum_threshold),
            msg="Extracted fewer items than expected"
        )
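
For this monitor to actually run, it has to be part of a monitor suite and Spidermon has to be enabled in the project settings. A minimal sketch (module and suite names are placeholders):

# monitors.py (continued)
from spidermon import MonitorSuite


class SpiderCloseMonitorSuite(MonitorSuite):
    # Monitors executed when the spider closes
    monitors = [ItemCountDroppedMonitor]

# settings.py
SPIDERMON_ENABLED = True
EXTENSIONS = {
    "spidermon.contrib.scrapy.extensions.Spidermon": 500,
}
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    "myproject.monitors.SpiderCloseMonitorSuite",
)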

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Renne Rocha
Solution 2: Nimantha