When using CrawlSpider, we have a fixed set of rules that declares how we should follow and process the links extracted from the website.
But sometimes we don’t want the rules to be static. We need a certain level of dynamism, where the rules vary according to parameters provided as input in our spiders.
Consider that we are scraping product URLs from an ecommerce website, and we have the following patterns for category URLs:
- https://store.example.com/ - main page of our store
- https://store.example.com/electronics - list of products of the electronics category
- https://store.example.com/food - list of products of the food category
We can notice the pattern https://store.example.com/<CATEGORY_SLUG> in our URLs. Using CrawlSpider as explained in my last post, this set of rules can be defined as:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StoreSpider(CrawlSpider):
    name = "store"
    start_urls = ["https://store.example.com/"]
    rules = (
        Rule(
            # Callbacks are referenced by name: rules is a class attribute,
            # so no spider instance (self) exists at this point.
            LinkExtractor(allow=r"\/electronics$"),
            callback="parse_category",
        ),
        Rule(
            LinkExtractor(allow=r"\/electronics/ID\d+$"),
            callback="parse_product",
        ),
        Rule(
            LinkExtractor(allow=r"\/food$"),
            callback="parse_category",
        ),
        Rule(
            LinkExtractor(allow=r"\/food/ID\d+$"),
            callback="parse_product",
        ),
    )

    def parse_category(self, response):
        ...  # Code to parse a category

    def parse_product(self, response):
        ...  # Code to parse a product
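With the categories hard-coded like this, the spider is run without any extra arguments:
scrapy crawl store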
There are a few potential problems with this approach:
- We need to create a rule for each category we want to extract data from;
- We need to change the code to add any new categories that we want to start processing;
- If processing a particular category takes too long, we might want to run the spider in parallel so that each process extracts data from just one category.
What if we could send the name of the category that we want to process as an argument to our spider? We can do it using -a argument=value when calling scrapy crawl, such as:
scrapy crawl store -a category=food
If we run the spider passing this argument, the spider instance now has the attribute self.category with the value food, which can be used to limit the links we want to extract.
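Scrapy sets every -a argument as an attribute on the spider instance. If we also want a default value when the argument is omitted, one option is to accept it explicitly in the constructor; here is a minimal sketch (the default value shown is hypothetical):

from scrapy.spiders import CrawlSpider

class StoreSpider(CrawlSpider):
    name = "store"

    def __init__(self, category="electronics", *args, **kwargs):
        # -a category=food arrives here as category="food"; without -a,
        # the hypothetical default "electronics" is used instead.
        super().__init__(*args, **kwargs)
        self.category = category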
Then we can filter inside our callback, skipping any page that does not belong to the desired category.
def parse_category(self, response):
    if self.category not in response.url:
        return
    ...  # Code to parse a category
The problem with this approach is that we still send a real request to every extracted link (even for the categories we don't care about), regardless of whether the response ends up being discarded. A better solution would be to build the rules collection dynamically:
rules = (
    Rule(
        LinkExtractor(allow=rf"\/{self.category}$"),
        callback=self.parse_category,
    ),
    Rule(
        LinkExtractor(allow=rf"\/{self.category}/ID\d+$"),
        callback=self.parse_product,
    ),
)
Unfortunately, this will not work: rules is a class attribute, so there is no spider instance yet from which to read the value of our argument.
Investigating Scrapy's code, we find that the defined rules are processed in a call to the _compile_rules method. Inside this method, each rule in self.rules is evaluated and then appended to the self._rules attribute, which is what the spider uses to decide which links to follow.
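Roughly speaking, and simplified from Scrapy's source (the exact implementation can vary between versions), _compile_rules does something like this:

import copy  # standard library; used to copy each Rule before compiling it

def _compile_rules(self):
    # Copy every declared rule and "compile" it against the spider instance
    # (e.g. resolving callbacks given by name into bound methods), storing
    # the results in self._rules, which the crawling logic actually reads.
    self._rules = []
    for rule in self.rules:
        self._rules.append(copy.copy(rule))
        self._rules[-1]._compile(self)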
This means we can define our own _compile_rules method, which takes the value passed as an argument to our spider and defines rules that only extract links belonging to the desired category.
Our spider can look like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StoreSpider(CrawlSpider):
    name = "store"
    start_urls = ["https://store.example.com/"]

    def _compile_rules(self):
        # self.category is available here because _compile_rules is called
        # on the spider instance, after the -a arguments have been set.
        self.rules = (
            Rule(
                LinkExtractor(allow=rf"\/{self.category}$"),
                callback=self.parse_category,
            ),
            Rule(
                LinkExtractor(allow=rf"\/{self.category}/ID\d+$"),
                callback=self.parse_product,
            ),
        )
        # After setting our rules, just use the existing _compile_rules() method
        super()._compile_rules()

    def parse_category(self, response):
        ...  # Code to parse a category

    def parse_product(self, response):
        ...  # Code to parse a product
We can now run a separate spider job for each category and extract product data from each category individually:
# Extracts data only from the 'electronics' category
scrapy crawl store -a category=electronics
# Extracts data only from the 'food' category
scrapy crawl store -a category=food
If we have new categories, we just pass them as a new value for the argument:
# Extracts data only from the 'cars' category
scrapy crawl store -a category=cars
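Finally, if we prefer to launch all of these jobs from a single script rather than separate scrapy crawl invocations, one option is Scrapy's CrawlerProcess. This is just a sketch: it assumes it runs inside the Scrapy project (so get_project_settings can locate the store spider), and the category list is only an example.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# One crawl of the "store" spider per category; keyword arguments reach the
# spider exactly like -a category=<value> does on the command line.
process = CrawlerProcess(get_project_settings())
for category in ["electronics", "food", "cars"]:
    process.crawl("store", category=category)
process.start()  # blocks until every crawl has finished

Note that these crawls run concurrently within a single Scrapy process; if we really want one OS process per category, the per-category scrapy crawl commands above remain the simplest route.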