Dynamic rules for following links declaratively with Scrapy

When using CrawlSpider, we have a fixed set of rules that declares how we should follow and process links extracted from the website.

But sometimes we don’t want the rules to be static. We need a certain level of dynamism, where the rules vary according to parameters provided as input to our spiders.

Consider that we are scraping product URLs from an ecommerce website, and we have the following patterns for category URLs:

  • https://store.example.com/ - main page of our store
  • https://store.example.com/electronics - list of products of electronics category
  • https://store.example.com/food - list of products of food category

We can notice the pattern https://store.example.com/<CATEGORY_SLUG> in our URLs. Using CrawlSpider, as explained in my last post, this set of rules can be defined as:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StoreSpider(CrawlSpider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    rules = (
        # Callbacks are given by name: self is not available in a class body
        Rule(
            LinkExtractor(allow=r"\/electronics$"),
            callback="parse_category",
        ),
        Rule(
            LinkExtractor(allow=r"\/electronics/ID\d+$"),
            callback="parse_product",
        ),
        Rule(
            LinkExtractor(allow=r"\/food$"),
            callback="parse_category",
        ),
        Rule(
            LinkExtractor(allow=r"\/food/ID\d+$"),
            callback="parse_product",
        ),
    )

    def parse_category(self, response):
        ...  # Code to parse a category

    def parse_product(self, response):
        ...  # Code to parse a product

There are a few potential problems with this approach:

  • We need to create a rule for each category we want to extract data from;
  • We need to change the code to add any new categories that we want to start processing;
  • If processing a particular category takes too long, we might want to run the spider in parallel so that each process extracts data from just one category.

What if we could send the name of the category that we want to process as an argument to our spider? We can do that with -a argument=value when calling scrapy crawl, for example:

scrapy crawl store -a category=food

If we run the spider passing this argument, the spider instance now has an attribute self.category with the value food, which we can use to limit the links we want to extract.
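
Behind the scenes, every -a name=value pair is passed to the spider's constructor and becomes an attribute on the instance. If we want an explicit signature or a default value, we can also declare the argument in __init__. Here is a minimal sketch; the "food" default is just an assumption for illustration:

from scrapy.spiders import CrawlSpider

class StoreSpider(CrawlSpider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    def __init__(self, category="food", *args, **kwargs):
        # -a category=<value> arrives here; "food" is an arbitrary default.
        # Set the attribute before calling the parent constructor so it is
        # already available when the rules are compiled (more on that below).
        self.category = category
        super().__init__(*args, **kwargs)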

Then we can filter inside our callback, skipping any response whose URL does not belong to the desired category.

    def parse_category(self, response):
        if self.category not in response.url:
            return

        ...  # Code to parse a category

The problem with this approach is that we still send a real request to every link, even for the categories we don’t want, only to discard the response afterwards.

A better solution would be to build the rules collection dynamically:

    rules = (
        Rule(
            LinkExtractor(allow=rf"\/{self.category}$"),
            callback="parse_category",
        ),
        Rule(
            LinkExtractor(allow=rf"\/{self.category}/ID\d+$"),
            callback="parse_product",
        ),
    )

Unfortunately, this will not work: rules is a class attribute, so there is no spider instance yet from which to read our input.
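
To see why, remember that a class body is executed before any instance exists, so there is no self to read from at that point:

class Demo:
    # NameError: name 'self' is not defined -- the class body runs before
    # any instance is created
    value = self.category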

Investigating Scrapy’s code, we find that the defined rules are processed in a call to the _compile_rules method.

Inside this method, each rule in self.rules is evaluated and then appended to the self._rules attribute, which is what the spider consults when deciding which links to follow.
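
In simplified form, that behaviour can be pictured roughly like this (a paraphrase of the description above, not Scrapy's exact code; the real implementation also resolves string callback names into bound methods on the spider):

import copy

def _compile_rules(self):
    # Evaluate each rule declared in self.rules and store a copy in
    # self._rules, the collection the spider actually uses to follow links.
    self._rules = []
    for rule in self.rules:
        self._rules.append(copy.copy(rule))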

Knowing this, we can override _compile_rules in our own spider, using the value passed as an argument to build rules that only extract links from the desired category.

Our spider can look like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StoreSpider(CrawlSpider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    def _compile_rules(self):
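        # self.category is set by Scrapy from the -a category=<value> argument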
        self.rules = (
            Rule(
                LinkExtractor(allow=rf"\/{self.category}$"),
                callback=self.parse_category,
            ),
            Rule(
                LinkExtractor(allow=rf"\/{self.category}/ID\d+$"),
                callback=self.parse_product,
            ),
        )

        # After setting our rules, delegate to the parent's _compile_rules() method
        super()._compile_rules()

    def parse_category(self, response):
        ...  # Code to parse a category

    def parse_product(self, response):
        ...  # Code to parse a product

We can now run a separate spider job for each category and extract product data from each category individually:

# Extracts data only from the 'electronics' category
scrapy crawl store -a category=electronics
# Extracts data only from the 'food' category
scrapy crawl store -a category=food

If we have new categories, we just pass them as a new value for the argument:

# Extracts data only from the 'cars' category
scrapy crawl store -a category=cars
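
If we prefer to launch these jobs from a script rather than the command line, the same arguments can be passed programmatically. A minimal sketch, assuming the spider lives in a regular Scrapy project (the myproject.spiders.store module path is hypothetical); note that this runs the crawls concurrently in a single process, whereas separate scrapy crawl invocations give one process per category:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.store import StoreSpider  # hypothetical module path

process = CrawlerProcess(get_project_settings())
# Keyword arguments to crawl() reach the spider exactly like -a arguments
process.crawl(StoreSpider, category="electronics")
process.crawl(StoreSpider, category="food")
process.start()  # blocks until both crawls finish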