Previous articles in this series: Introduction to Scrapy (1), Introduction to Scrapy (2)
In the previous articles, I tried using Scrapy to call a Web API. This time, let's create a Spider that downloads files.
Concretely, we will build a Spider that downloads MLB-related data (zip files). The data comes from the Sean Lahman Baseball Database, and we will save the downloaded archives to a directory of our choice.
The processing flow of the Spider is as follows.
get_csv_spider.py
# -*- coding:utf-8 -*-
from scrapy import Spider
from scrapy.http import Request


class GetCSVSpider(Spider):
    name = 'get_csv_spider'
    allowed_domains = ['seanlahman.com']
    custom_settings = {
        'DOWNLOAD_DELAY': 1.5,
    }

    # Any directory in which to save the downloaded files
    DIR_NAME = '/tmp/csv/'

    # Endpoint (list the URLs to start crawling from)
    start_urls = ['http://seanlahman.com/baseball-archive/statistics/']

    # Extract the download URLs from the start page
    def parse(self, response):
        for href in response.css('.entry-content a[href*=csv]::attr(href)'):
            full_url = response.urljoin(href.extract())
            # Create a Request for each extracted URL and download it
            yield Request(full_url, callback=self.parse_item)

    # Save the body of the downloaded response to a file
    def parse_item(self, response):
        file_name = '{0}{1}'.format(self.DIR_NAME, response.url.split('/')[-1])
        # Save the file in binary mode, since response.body is bytes
        with open(file_name, 'wb') as f:
            f.write(response.body)
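One caveat not covered by the spider itself: open() will fail if /tmp/csv/ does not exist yet. A minimal sketch of a guard, assuming you are fine creating the directory from the same script (alternatively, just run mkdir -p /tmp/csv beforehand):
import os

# Create the save directory up front so parse_item can write into it
if not os.path.exists('/tmp/csv/'):
    os.makedirs('/tmp/csv/')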
Crawl using the command-line tool that comes with Scrapy.
scrapy runspider get_csv_spider.py
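As a side note, the same spider can also be launched from a plain Python script via Scrapy's CrawlerProcess API instead of the command line. A minimal sketch, assuming the spider lives in get_csv_spider.py on the Python path:
from scrapy.crawler import CrawlerProcess

from get_csv_spider import GetCSVSpider

# Run the spider in-process; start() blocks until the crawl finishes
process = CrawlerProcess()
process.crawl(GetCSVSpider)
process.start()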
When you execute the runspider command, a log like the following is displayed on the console. It shows useful information such as the URLs being fetched, response statuses, byte counts, and summary statistics.
2016-12-06 10:02:22 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapybot)
2016-12-06 10:02:22 [scrapy] INFO: Overridden settings: {'TELNETCONSOLE_ENABLED': False, 'SPIDER_MODULES': ['crawler.main.spiders'], 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 1}
2016-12-06 10:02:22 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2016-12-06 10:02:22 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-06 10:02:22 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-06 10:02:22 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-06 10:02:22 [scrapy] INFO: Spider opened
2016-12-06 10:02:22 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-06 10:02:23 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/baseball-archive/statistics/> (referer: None)
2016-12-06 10:02:28 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman30_csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:35 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman51-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:38 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman_50-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:39 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman53_csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:41 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman56-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:41 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman52_csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:42 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman54_csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:47 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman591-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:49 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman55_csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:49 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman57-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:52 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman58-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:55 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:02:55 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman-csv_2015-01-24.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:03:00 [scrapy] DEBUG: Crawled (200) <GET http://seanlahman.com/files/database/lahman2012-csv.zip> (referer: http://seanlahman.com/baseball-archive/statistics/)
2016-12-06 10:03:00 [scrapy] INFO: Closing spider (finished)
2016-12-06 10:03:00 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4518,
'downloader/request_count': 15,
'downloader/request_method_count/GET': 15,
'downloader/response_bytes': 104279737,
'downloader/response_count': 15,
'downloader/response_status_count/200': 15,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 12, 6, 1, 3, 0, 285944),
'log_count/DEBUG': 15,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 15,
'scheduler/dequeued': 15,
'scheduler/dequeued/memory': 15,
'scheduler/enqueued': 15,
'scheduler/enqueued/memory': 15,
'start_time': datetime.datetime(2016, 12, 6, 1, 2, 22, 878024)}
2016-12-06 10:03:00 [scrapy] INFO: Spider closed (finished)
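By default this log goes to the console at DEBUG level. If you want a quieter run or a persistent record, Scrapy's standard command-line options can be used; for example (an optional tweak, not something the run above did):
scrapy runspider get_csv_spider.py -L INFO --logfile crawl.log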
Now that the crawl has finished, let's check whether the files were actually downloaded. As the listing below shows, everything was saved without problems.
tree /tmp/csv
/tmp/csv
├── lahman-csv_2014-02-14.zip
├── lahman-csv_2015-01-24.zip
├── lahman2012-csv.zip
├── lahman30_csv.zip
├── lahman51-csv.zip
├── lahman52_csv.zip
├── lahman53_csv.zip
├── lahman54_csv.zip
├── lahman55_csv.zip
├── lahman56-csv.zip
├── lahman57-csv.zip
├── lahman58-csv.zip
├── lahman591-csv.zip
└── lahman_50-csv.zip
0 directories, 14 files
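If you want to go a step beyond eyeballing the directory, here is a small sketch that verifies each downloaded archive is a valid zip file (this check is an addition, not part of the original article):
import glob
import zipfile

# Test every downloaded archive and report any corrupt entries
for path in glob.glob('/tmp/csv/*.zip'):
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()  # returns the first bad entry name, or None
        print('{0}: {1}'.format(path, 'OK' if bad is None else 'corrupt entry ' + bad))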
Scrapy makes it easy to describe even this kind of file-download process. Because Scrapy is a framework for crawling, developers can concentrate on the parts that the framework calls into. Next time, I will cover the item pipeline processing that I did not explain here. Stay tuned!