I would like to use **Scrapy**, a crawling and scraping framework for Python, to recursively collect the title and URL of each page on a website and output them in CSV format (roughly, something like a site map). The target output is a result.csv like this:
URL | title
---|---
http://www.example.com | Top
http://www.example.com/news | news
http://www.example.com/news/2015 | 2015 news
… | …
Python 2.7 comes standard on CentOS 7, so that is what we will use.
Install Scrapy with pip using the following command:
$ sudo pip install Scrapy
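If you want to check that the installation succeeded, the scrapy command itself can report its version (the output naturally depends on the release pip installed):
$ scrapy version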
Create a project for Scrapy using the following command.
$ scrapy startproject HelloScrapy
I think the contents of the created project are as shown below. Of these, items.py and settings.py are the files we will edit this time.
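A project generated by scrapy startproject typically contains something like this (the exact auxiliary files can vary a little between Scrapy versions):

```
HelloScrapy/
├── scrapy.cfg
└── HelloScrapy/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```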
Scrapy uses a class called **Spider** to define how to crawl and scrape the target site. To define this Spider, create a file called "hellospider.py" in the spiders directory above.
Up to this point, the structure of the project is as follows.
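Roughly, assuming the default generated files plus the new Spider file:

```
HelloScrapy/
├── scrapy.cfg
└── HelloScrapy/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── hellospider.py
```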
items.py
First, edit items.py. This time we want the title and URL of each web page, so define the following class.
items.py
from scrapy.item import Item, Field

class PageInfoItem(Item):
    URL = Field()
    title = Field()
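For reference, a Scrapy Item behaves much like a dict, so the fields defined above are read and written by key; a minimal sketch (the values here are just placeholders):

```python
from HelloScrapy.items import PageInfoItem

item = PageInfoItem()
item['URL'] = 'http://www.example.com'  # keys must match the declared Field names
item['title'] = ['Top']
print(item)  # items print much like dicts
```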
settings.py
Next, edit settings.py. I have added the following options:
settings.py
DOWNLOAD_DELAY = 3
ROBOTSTXT_OBEY = True
DEPTH_LIMIT = 5
The intention of these settings is to put a crawl interval of about 3 seconds between requests so as not to load the target server, and to crawl in accordance with robots.txt. (Automatic adjustment with AutoThrottle is also possible; see the official documentation if you want to know more.) A depth limit is also set this time, because crawling too deep takes a long time.
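If you would rather let AutoThrottle manage the delay, the relevant settings look roughly like the following; the values are only illustrative and not part of the setup above, and availability of individual settings depends on the Scrapy version:

```python
# settings.py (AutoThrottle variant; values are illustrative)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3.0  # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0   # upper bound on the delay under high latency
```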
hellospider.py
Finally, define your own Spider in hellospider.py.
hellospider.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector
from HelloScrapy.items import PageInfoItem


class HelloSpider(CrawlSpider):
    # Identifier used when running scrapy from the CLI
    name = 'hello'
    # Domains the spider is allowed to crawl
    allowed_domains = ["www.example.com"]
    # URL where the crawl starts
    start_urls = ["http://www.example.com"]

    # LinkExtractor arguments can restrict which pages are scraped
    # (for example, only pages whose URL contains "new"),
    # but no arguments are given here because all pages are targeted.
    # When a page matching a Rule is downloaded, the function named in callback is called.
    # With follow=True, links are followed recursively.
    rules = [Rule(LinkExtractor(), callback='parse_pageinfo', follow=True)]

    def parse_pageinfo(self, response):
        sel = Selector(response)
        item = PageInfoItem()
        item['URL'] = response.url
        # Specify which part of the page to scrape.
        # Besides XPath, CSS selectors can also be used.
        item['title'] = sel.xpath('/html/head/title/text()').extract()
        return item
Done.
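As the comment in parse_pageinfo says, a CSS selector works as well as XPath; a minimal sketch of that one line, assuming a Scrapy version whose Selector supports .css():

```python
# CSS-selector equivalent of the XPath line above
item['title'] = sel.css('head > title::text').extract()
```

Note also that newer Scrapy releases deprecate the scrapy.contrib paths; the same classes are importable from scrapy.spiders (CrawlSpider, Rule) and scrapy.linkextractors (LinkExtractor).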
After that, running the following command crawls and scrapes recursively from the specified start URL and writes the results out as CSV.
$ scrapy crawl hello -o result.csv
(Note that the argument is not hellospider.py but the name identifier defined inside it.)
By the way, the results can also be output in JSON or XML format. I tried this on my own website, and I think the output matches the target image at the top.
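For example, the feed exporter picks the output format from the file extension, so the following should work as well:
$ scrapy crawl hello -o result.json
$ scrapy crawl hello -o result.xml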
**Please use this at your own risk when running it against websites on the Internet.**
References:
- http://doc.scrapy.org/en/latest/
- http://orangain.hatenablog.com/entry/scrapy
- http://akiniwa.hatenablog.jp/entry/2013/04/15/001411