Web Scraping with Scrapy

2nd April 2018

Author: Trevor Simmons

At some point in a data scientist's work there will be a need to scrape data from a website; it came up right away in my first ever data science project. Back then I used the Python library Beautiful Soup, but these days my tool of choice is Scrapy, an open-source web scraping framework. In this post I will cover the installation and coding of a Scrapy web spider, then demonstrate it on an example website.

The ethics of web scraping

Before getting started with web scraping it would be a good idea to think about the ethics of doing so. While there are currently no laws in the UK that prohibit web scraping, and hypothetically anyone could take whatever data they like from a website provided they do not violate the site's Terms of Use, it is far better to take a responsible approach and have a pre-defined ethical policy. Currently, my policy is to identify my scraper with a user-agent that includes contact details, to respect each site's robots.txt and Terms of Use, and to throttle my requests so that I do not overload the site.

For the purposes of this example, I will be scraping data from http://testing-ground.scraping.pro which is a site set up specifically to test web scraping.

Installing Scrapy

Scrapy can be easily installed via pip, Python's recommended package manager, by opening a terminal and typing:

pip install scrapy

This installs Scrapy system-wide, but a virtual environment could also be used. It can also be installed via Anaconda or Miniconda. More information on the options, plus platform-specific notes, can be found in Scrapy's installation guide.
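
If you would prefer not to install into the system Python, a quick sketch of the virtual environment route (assuming Python 3 and a Unix-like shell) would be:

python3 -m venv scrapy-env          # create a virtual environment in ./scrapy-env
source scrapy-env/bin/activate      # activate it (scrapy-env\Scripts\activate on Windows)
pip install scrapy                  # install Scrapy inside the environment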

Creating a new project

I would like to call my project test_scraper so in the terminal I navigate to the directory I would like to create my project in and type:

scrapy startproject test_scraper

This creates the project directory with the following directory structure:

test_scraper/
    scrapy.cfg        # deploy configuration file
    test_scraper/     # project's Python module, you'll import your code from here
        __init__.py    
        __pycache__    
        items.py        # project items definition file
        middlewares.py  # project middlewares file
        pipelines.py    # project pipelines file
        settings.py     # project settings file
        spiders/        # a directory where you'll later put your spiders

Configuring Scrapy

The Scrapy configuration file settings.py provides many different options to customise the project, and here I will look at the essential settings in this file. As I mentioned in my ethical policy, I will always set a user-agent identifying myself with a contact address. I prefer to give a direct email address rather than a website URL, because the email address also contains my website's domain.

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Wiringbot (example@thewiringundertheboard.net)'

It would also be considerate to obey the site's robots.txt, so I set ROBOTSTXT_OBEY to True.

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

The amount of time that Scrapy should wait before downloading consecutive pages is likely one of the most important settings. As part of my ethical policy I aim not to overload a site with too many requests, so this should be set accordingly. The figure chosen will depend on many factors, such as the number of pages you need to scrape and in what timeframe, but for this demonstration I will set it to one second.

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

DOWNLOAD_DELAY also interacts with one of the following concurrency settings; I have used the per-domain setting, leaving it at the value of 16 suggested in the generated settings file.

# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
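
The generated settings file also points to AutoThrottle, an extension that adjusts the delay dynamically based on how quickly the server responds. It is disabled by default; as a rough sketch (the values here are illustrative rather than recommendations), it can be switched on in settings.py with:

# Enable the AutoThrottle extension (disabled by default)
AUTOTHROTTLE_ENABLED = True
# The initial download delay in seconds
AUTOTHROTTLE_START_DELAY = 1
# The maximum delay to allow when the server is responding slowly
AUTOTHROTTLE_MAX_DELAY = 10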

I have only covered a small fraction of the available options here but a full list can be viewed in the Scrapy settings documentation.

The Scrapy shell

When it comes to writing the code for the spider, the Scrapy shell is extremely useful because it allows Scrapy requests to be run quickly and interactively against a webpage inside a Python console. I usually have it open beside my editor while I write my spider so I can switch between the two and check my code. It can be launched by typing:

scrapy shell http://testing-ground.scraping.pro

I have used the URL of the testing website, but that can be substituted with any other URL. On running this command some initialisation text is printed, along with the available Scrapy objects, followed by a command prompt.
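
As an illustration of the kind of thing that can be tried at the prompt (the /blocks URL below is my assumption of where the blocks link leads), the shell provides the response object plus helpers such as fetch() and view():

response.url                                        # the URL that was fetched
response.xpath('//title/text()').extract_first()    # the page title as a string
fetch('http://testing-ground.scraping.pro/blocks')  # download another page into response
view(response)                                      # open the current response in a web browser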

For my demonstration I would like to use the BLOCKS: Price list page on the testing website. A powerful feature of Scrapy is the ability to navigate through a site's structure, so for demo purposes I will point my spider at the index page and follow the link to the page I want. Using the shell to see what links are available, consider the following HTML from the index page:

<div id="content">
    <div class="caseblock">
        <a href="table">TABLE REPORT</a>
        <div class="casedescr">simulates complicated financial table report</div>
    </div>
    <div class="caseblock">
        <a href="blocks">BLOCKS: Price list</a>
        <div class="casedescr">complicated block layout presented as a price list</div>
    </div>
    <div class="caseblock">
        <a href="textlist">TEXT LIST</a>
        <div class="casedescr">list of items organized as a simple text</div>]
    </div>
    <div class="caseblock">
        <a href="invalid">INVALID HTML</a>
        <div class="casedescr">HTML with markup errors</div>
    </div>
    <div class="caseblock">
        <a href="login">LOGIN</a>
        <div class="casedescr">Form-based authentication via POST method</div>
    </div>
    <div class="caseblock">
        <a href="ajax">AJAX</a>
        <div class="casedescr">Receiving HTML, XML and JSON via AJAX</div>
    </div>
    <div class="caseblock">
        <a href="captcha">CAPTCHA</a>
        <div class="casedescr">CAPTCHA recognition</div>
    </div>
    <div class="caseblock">
        <a href="whoami">WHO AM I?</a>
        <div class="casedescr">Shows web client information</div>
    </div>
    <div class="caseblock">
        <a href="recaptcha">RECAPTCHA</a>
        <div class="casedescr">ReCaptcha solution</div>
    </div>    
    <div class="caseblock_u">
        <a href="popups">POPUPS</a>
        <div class="casedescr"></div>
    </div>
    <div class="caseblock_u">
        <a href="images">IMAGES</a>
        <div class="casedescr"></div>
    </div>
    <div class="caseblock_u">
        <a href="frames">FRAMES</a>
        <div class="casedescr"></div>
    </div>
</div>

To extract all the URLs I can type at the command prompt:

response.xpath('//div[@id="content"]/div[@class="caseblock"]/a/@href').extract()

This selects the href attribute from each anchor link inside the divs with the class caseblock, which sit inside the div with the ID content. I have used an XPath selector here, but it is also possible to use a CSS selector and extract the same data with:

response.css('div#content > div.caseblock > a::attr(href)').extract()

My personal preference is to use XPath because I find it more powerful, but there is some debate over this. I have also used Scrapy's extract() method, which returns a Python list of the matching elements as strings.
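
Based on the HTML above, either selector should return a list along these lines, with the caseblock_u divs left out because their class does not match:

['table', 'blocks', 'textlist', 'invalid', 'login', 'ajax', 'captcha', 'whoami', 'recaptcha']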

Creating the spider

With everything set up and configured, and with the shell open for trying out requests, I can go ahead and create the spider that will do the actual scraping. A Scrapy spider is a class that extends the scrapy.Spider class, and here I will go through the code step by step. The full spider class, along with all the other files, can be downloaded from my GitHub account with the proper indentation, which I have flattened here for ease of formatting. I suggest opening the full code in your preferred editor and reading it alongside my explanations below.

Download the spider code.

Firstly I import the scrapy module:

import scrapy

Then I define the class, which extends scrapy.Spider:

class TestScraperSpider(scrapy.Spider):

I give the spider a name and set the URL that I would like to scrape:

name = 'test_scraper'

start_urls = [
    'http://testing-ground.scraping.pro'
]

Then I define the parse function, which is the default callback used by Scrapy to process downloaded responses. I could have entered the page I would like to parse directly in the start_urls list, but to demonstrate Scrapy's link following I have decided to start at the site's index page and follow the link from there. Where I extract the enclosing class="caseblock" divs, notice that I do not use the extract() function, because I would like to fill the link variable with Scrapy Selector objects, which I can then query using link.xpath(). I also use extract_first() rather than extract() because I want the first match returned as a string rather than as a list. Once I have found the label for the link I require, I yield response.follow() to create a new request for that page, with a callback of parse_block_layout() to process the page once the response has returned.

def parse(self, response):

    # follow links to individual pages
    for link in response.xpath('//div[@id="content"]/div[@class="caseblock"]'):

        # extract the label and url
        label = link.xpath('./div[@class="casedescr"]/text()').extract_first()
        url = link.xpath('./a/@href').extract_first()

        # follow the price list url
        if label == 'complicated block layout presented as a price list':
            yield response.follow(url, callback=self.parse_block_layout)

Now that I have a response for the page that I would like to scrape, the parse_block_layout function is called and I can handle the two example cases by looping through them and storing each ID in the case variable. Unlike some languages, Python does not have a switch statement, so case is not a reserved word.

def parse_block_layout(self, response):
    # loop through the two cases
    for case_array in response.xpath('//div[@id="case_blocks"]/*[starts-with(@id, "case")]'):

        # retrieve the current case ID
        case = case_array.xpath('@id').extract_first()

Case1 is pretty straightforward. I loop through each product div, ignoring the adverts, and extract the information within. Because the prod* class names contain incrementing numbers, I have used a wildcard match with [starts-with(@class, "prod")]. I have also used span[1] and span[2] to select the first and second occurring spans respectively. Finally, I yield the results in a Python dictionary.

# extract differently depending on case
if case == 'case1':

    # loop through the products, ignore ads
    for product in case_array.xpath('./div[starts-with(@class, "prod")]'):
        name = product.xpath('./span/div[@class="name"]/text()').extract_first()
        desc = product.xpath('./span[1]/text()').extract_first()
        price = product.xpath('./span[2]/text()').extract_first()

        yield {
            'name': name,
            'desc': desc,
            'price': price
        }

Case2 is a bit more complicated in that the information I need is split across two side-by-side divs with the classes left and right. The way I have done it is to loop through the products in each left div, keeping track of which row I am on with the row variable (counting from one, since XPath positions are 1-based), and then access the price in the corresponding right div by that row number. There are two left and two right divs in total, so to select the immediately corresponding right div I use following-sibling::div[1][@class="right"] to make sure I pick the correct sibling.

if case == 'case2':

    # loop through each class="left" div within this case (note the relative .// path)
    for div in case_array.xpath('.//div[@class="left"]'):

        # set a row number (starting at 1 to match XPath positions) so we can
        # find the corresponding price row, and loop through each product within
        for row, product in enumerate(div.xpath('./*[starts-with(@class, "prod")]'), start=1):

            name = product.xpath('./div[@class="name"]/text()').extract_first()
            desc = product.xpath('./text()').extract_first()

            # find the price in the corresponding class="right" div
            price = div.xpath('./following-sibling::div[1][@class="right"]/\
                              div[starts-with(@class, "price")][' + str(row) + ']\
                              /text()').extract_first()
            yield {
                'name': name,
                'desc': desc,
                'price': price
            }

Testing and debugging the spider

To test the spider and check for any errors I can use the parse command to see its output.

scrapy parse --spider=test_scraper -d 2 http://testing-ground.scraping.pro

I tell it which spider to use and set a depth of two levels with -d 2. The parse command also allows a particular method of the spider to be run, so it can be used to debug a specific section of a spider.
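
For example, to exercise just the parse_block_layout callback against the price list page directly (assuming the relative blocks link resolves to the URL below), something like this could be used:

scrapy parse --spider=test_scraper -c parse_block_layout http://testing-ground.scraping.pro/blocks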

Running the spider

Once the spider has been tested and debugged then it can be run from the top level of its directory with:

scrapy crawl test_scraper -o test_scraper.csv

I have specified an output file of test_scraper.csv, so the results will be written as CSV; Scrapy provides a number of other output formats such as JSON and XML. It would also be possible to store the results in a database, and for that an Item Pipeline would be used.
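
As a rough sketch of the pipeline approach, and not part of this project's code, a minimal pipeline that writes each yielded item to a hypothetical SQLite database could look like this in pipelines.py:

import sqlite3

class SQLitePipeline(object):

    def open_spider(self, spider):
        # open the database connection and make sure the table exists
        self.conn = sqlite3.connect('products.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS products (name TEXT, desc TEXT, price TEXT)')

    def process_item(self, item, spider):
        # insert the scraped fields, then pass the item along unchanged
        self.conn.execute('INSERT INTO products VALUES (?, ?, ?)',
                          (item.get('name'), item.get('desc'), item.get('price')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # close the connection when the spider finishes
        self.conn.close()

It would then be enabled by adding it to ITEM_PIPELINES in settings.py, for example ITEM_PIPELINES = {'test_scraper.pipelines.SQLitePipeline': 300}.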

What next?

I have only (ahem) scraped the surface of what Scrapy can do here and provided a very basic example. When scraping larger and more complex websites it would be advisable to structure the code differently and perhaps avoid the hard-coded loops I have used here. As I have previously mentioned, there are a number of ways to extend Scrapy, and I recommend the Scrapy documentation for an idea of the possibilities. In future posts I intend to delve deeper into Scrapy's functionality and cover more advanced features.