This post assumes that you have a basic understanding of how Scrapy, a web scraping framework, works, and discusses some of its lesser-known features. If you are new to it, the Scrapy documentation offers a good introduction.
Extracting links from a response, filtering them by a specific pattern, and following them is a common task you will face when scraping a website.
As an example, suppose that you are scraping a store website and you want to process two different URLs:
- Product URLs to gather product details
- Category URLs to collect information about a category (e.g. number of items, subcategories, etc.) and/or more product URLs
We can implement our spider as follows:
import scrapy


class StoreSpider(scrapy.Spider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    def parse(self, response):
        products = response.css('a.product::attr(href)').getall()
        for link in products:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_product)

        categories = response.css('a.category::attr(href)').getall()
        for link in categories:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_category)

    def parse_product(self, response):
        ...  # Code to parse a product

    def parse_category(self, response):
        ...  # Code to parse a category
Only anchor elements with the CSS classes product or category are of interest to us, and we use a different parse method depending on the type of element being followed.
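For illustration, a parse_product callback could look like the minimal sketch below; the selectors (h1.product-title, span.price) are assumptions about the store's markup, not something defined earlier in this post.

    def parse_product(self, response):
        # Hypothetical selectors, for illustration only; adapt them to the
        # actual markup of the site you are scraping.
        yield {
            "name": response.css("h1.product-title::text").get(),
            "price": response.css("span.price::text").get(),
            "url": response.url,
        }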
As soon as you have more types of links and more complex rules for finding them in a response, your parsing methods become more complicated and prone to code duplication.
For example, if inside parse_category you also want to find more products, more categories, and subcategories as well, you will need to duplicate code such as:
class StoreSpider(scrapy.Spider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    # (...)

    def parse_category(self, response):
        # (...) Code to parse a category

        products = response.css('a.product::attr(href)').getall()
        for link in products:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_product)

        categories = response.css('a.category::attr(href)').getall()
        for link in categories:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_category)

        sub_categories = response.css('a.sub_category::attr(href)').getall()
        for link in sub_categories:
            yield scrapy.Request(response.urljoin(link), callback=self.parse_subcategory)
To avoid all this repeated code, Scrapy comes with generic spiders providing special functionality for common scraping cases.
A CrawlSpider is a generic spider (it inherits from the regular scrapy.Spider class) that provides a mechanism for following links by defining a set of rules, much like we did manually in our previous example.
Instead of actively looking for each link in the response, iterating over the results, and sending a request for each one, we can use a declarative pattern: we provide a list of Rule objects whose arguments state which links we want to follow and what to do with the response.
Our previous spider can be rewritten as:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StoreSpider(CrawlSpider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    rules = (
        Rule(
            LinkExtractor(restrict_css="a.product"),
            callback="parse_product",
        ),
        Rule(
            LinkExtractor(restrict_css="a.category"),
            callback="parse_category",
        ),
        Rule(
            LinkExtractor(
                restrict_css=".pagination",
                restrict_text="Next Page",
            ),
        ),
    )

    def parse_product(self, response):
        ...  # Code to parse a product

    def parse_category(self, response):
        ...  # Code to parse a category
You may notice that we now have a rules tuple. Each Rule is assigned a LinkExtractor object that defines how (and which) links will be extracted from each page. In addition, each rule has a callback, given as the name of a spider method, which tells us which parse method should be used to process the response from each of these links.
Take the first rule as an example:

Rule(
    LinkExtractor(restrict_css="a.product"),
    callback="parse_product",
)

It extracts all the links with the CSS class product, requests them, and processes each response in the spider's parse_product method.
The following Rule extracts only links that have the CSS class next and contain the text Next Page. So <a class="next" href="https://store.example.com/page/2">Next Page</a> will be extracted, but <a class="next" href="https://store.example.com/events/">Future Events</a> will not.
Rule(
    LinkExtractor(
        restrict_css="a.next",
        restrict_text="Next Page",
    ),
),
Notice that we don't provide a callback for this Rule. If you don't provide one, the link will still be requested, and its response will be parsed with all the rules we have in order to find more links to follow.
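Conversely, when a Rule does have a callback, the pages it extracts are not scanned for further links by default; if you want both behaviours, you can set the Rule's follow argument explicitly (it defaults to True only when no callback is given):

Rule(
    LinkExtractor(restrict_css="a.category"),
    callback="parse_category",
    follow=True,  # also keep extracting links from category pages
),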
You can see more ways to filter the links you want to extract in the Link Extractor documentation.
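For instance, link extractors can also filter by URL patterns via the allow and deny arguments, and you can try an extractor out against a response (e.g. inside scrapy shell) by calling its extract_links() method. The regular expressions below are made-up examples, not patterns taken from the store above:

from scrapy.linkextractors import LinkExtractor

# Hypothetical URL patterns, shown only to illustrate the arguments.
extractor = LinkExtractor(
    allow=r"/product/\d+",                # keep URLs that look like product pages
    deny=r"/(login|cart|checkout)/",      # drop account and checkout pages
    allow_domains=["store.example.com"],
)

# In "scrapy shell https://store.example.com/", a response object is already available.
for link in extractor.extract_links(response):
    print(link.url, link.text)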
Using CrawlSpider avoids some complexity in our parsing methods, keeping them focused only on how to scrape data from a page response, and leaves the task of crawling the website (i.e., finding the links to follow) to a declarative pattern that is easier to read and understand.
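If you want to try the spider outside a full Scrapy project, a small script using CrawlerProcess is enough; the FEEDS setting below (available in recent Scrapy versions) simply dumps the scraped items to a JSON file:

from scrapy.crawler import CrawlerProcess

# Assumes StoreSpider is defined (or imported) in the same script.
process = CrawlerProcess(settings={"FEEDS": {"items.json": {"format": "json"}}})
process.crawl(StoreSpider)
process.start()  # blocks until the crawl is finished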