Following links declaratively with Scrapy

This post assumes that you have a basic understanding of how Scrapy, a web scraping framework, works, and discusses some of its lesser-known features. You can find an introduction to it in the official documentation.

Extracting links from a response, filtering them by a specific pattern, and following them is a common task that you will face when scraping a website.

As an example, suppose that you are scraping a store website and you want to process two different types of URLs:

  1. Product URLs to gather product details
  2. Category URLs to collect information about a category (e.g. number of items, subcategories, etc.) and/or more product URLs

We can implement our spider as follows:

import scrapy


class StoreSpider(scrapy.Spider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    def parse(self, response):
        product_links = response.css('a.product::attr(href)').getall()
        for link in product_links:
            yield response.follow(link, callback=self.parse_product)

        category_links = response.css('a.category::attr(href)').getall()
        for link in category_links:
            yield response.follow(link, callback=self.parse_category)

    def parse_product(self, response):
        ...  # Code to parse a product

    def parse_category(self, response):
        ...  # Code to parse a category

Only anchor elements with the CSS class product or category are of interest to us, and we use a different parse method depending on the type of link being followed.

As soon as you have more types of links and more complex rules for finding them in the website's responses, your parsing methods become more complicated and prone to code duplication.

For example, if inside parse_category you also want to find more products, more categories, and subcategories, you will need to duplicate some code, such as:

class StoreSpider(scrapy.Spider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    # (...)

    def parse_category(self, response):
        # (...) Code to parse a category

        product_links = response.css('a.product::attr(href)').getall()
        for link in product_links:
            yield response.follow(link, callback=self.parse_product)

        category_links = response.css('a.category::attr(href)').getall()
        for link in category_links:
            yield response.follow(link, callback=self.parse_category)

        sub_category_links = response.css('a.sub_category::attr(href)').getall()
        for link in sub_category_links:
            yield response.follow(link, callback=self.parse_subcategory)

To avoid all this repeated code, Scrapy comes with generic spiders providing special functionality for common scraping cases.

A CrawlSpider is a generic spider (which inherits from the regular scrapy.Spider class) that provides a mechanism for following links, just as we did in our previous example, but by defining a set of rules.

Instead of actively looking for each link in every response, iterating over the results, and sending a request for each one, we can use a declarative pattern: we provide a list of Rule objects whose arguments state which links we want to follow and what to do with each response.

Our previous spider can be rewritten as:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StoreSpider(CrawlSpider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    rules = (
        Rule(
            LinkExtractor(restrict_css="a.product"),
            callback="parse_product",
        ),
        Rule(
            LinkExtractor(restrict_css="a.category"),
            callback="parse_category",
        ),
        Rule(
            LinkExtractor(
                restrict_css="a.next",
                restrict_text="Next Page",
            ),
        ),
    )

    def parse_product(self, response):
        ...  # Code to parse a product

    def parse_category(self, response):
        ...  # Code to parse a category

You may notice that we now have a rules tuple. Each Rule is assigned a LinkExtractor object that defines how (and which) links will be extracted from each page.

In addition, each Rule can have a callback, which tells Scrapy which parse method should be used to process the response of the request made to each of these links.

Take this rule as an example:

Rule(
    LinkExtractor(restrict_css="a.product"),
    callback="parse_product",
)

It extracts all the links with the CSS class product, requests them, and processes each response in the spider's parse_product method.
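
If you want to check which links a given LinkExtractor would pick up before wiring it into a rule, you can run it directly against a response, for example in the Scrapy shell. A minimal sketch, assuming the same a.product selector used above (response is the object the shell gives you):

from scrapy.linkextractors import LinkExtractor

# Same extractor used in the first rule, applied directly to a response
link_extractor = LinkExtractor(restrict_css="a.product")
for link in link_extractor.extract_links(response):
    # extract_links() returns Link objects with the absolute URL and the anchor text
    print(link.url, link.text)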

The following Rule extracts only links that have the CSS class next and contain the text Next Page. So <a class="next" href='https://store.example.com/page/2'>Next Page</a> will be extracted, but <a class="next" href='https://store.example.com/events/'>Future Events</a> will not.

Rule(
    LinkExtractor(
        restrict_css="a.next",
        restrict_text="Next Page",
    ),
),

Notice that we don’t provide a callback for this Rule. If you don’t provide one, the link will still be requested, and the resulting response will be processed by the same set of rules to find more links to follow.
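
This behaviour is controlled by the follow argument of Rule: it defaults to True when no callback is given, and to False otherwise. If you want a response to be handled by a callback and still be scanned for more links (as we did manually inside parse_category earlier), you can set it explicitly. A small sketch of such a rule:

Rule(
    LinkExtractor(restrict_css="a.category"),
    callback="parse_category",
    follow=True,  # keep applying the rules to the category pages as well
),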

You can see more ways to filter the links you want to extract in the Link Extractor documentation.
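
For instance, LinkExtractor also accepts allow and deny arguments, regular expressions that the (absolute) URLs must match or not match to be extracted. A hedged sketch, assuming the store uses /product/ and /product/reviews/ paths (which are not part of our example above):

LinkExtractor(
    allow=r"/product/",          # only follow URLs containing /product/
    deny=r"/product/reviews/",   # ...but skip the review pages
)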

Using CrawlSpider removes some complexity from our parsing methods, keeping them focused only on how to scrape data from a page response, and leaving the task of crawling the website (i.e., finding the links to follow) to a more declarative, easier-to-read pattern.