Recursively get website titles and URLs in Scrapy

I would like to use **Scrapy**, a crawling and scraping framework for Python, to recursively retrieve the title and URL of every page on a website and output them in CSV format, something like a site map listing. The finished result.csv should look like this:

URL title
http://www.example.com Top
http://www.example.com/news news
http://www.example.com/news/2015 2015 news

Environment

Introducing Scrapy

Python 2.7 is installed by default on CentOS 7, so I use that.

Install Scrapy with pip using the following command:

$ sudo pip install Scrapy

Project creation

Create a Scrapy project using the following command:

$ scrapy startproject HelloScrapy

The created project should look like the following. (Figure: Scrapy_skeleton_half.png)

Of these, items.py and settings.py are the files edited this time.

Scrapy also uses a class called **Spider** to define how to crawl and scrape the target site. To define this Spider, create a file called "hellospider.py" in the project's spiders directory.

At this point, the structure of the project is as follows. (Figure: Scrapy_skeleton2_half.png)

items.py

First, edit items.py. Since we want the title and URL of each web page, define the following class.

items.py


from scrapy.item import Item, Field

class PageInfoItem(Item):
    # One field per value collected from each page
    URL = Field()
    title = Field()

settings.py

Next, edit settings.py. I added the following options:

settings.py


DOWNLOAD_DELAY = 3
ROBOTSTXT_OBEY = True
DEPTH_LIMIT = 5

These settings put an interval of about 3 seconds between requests so as not to load the target server, and make the crawler obey robots.txt. (Automatic adjustment with AutoThrottle also seems to be possible; see the official documentation for details.) DEPTH_LIMIT is set because crawling too deep takes a long time.
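As a rough illustration of the AutoThrottle option mentioned above, the fragment below shows how it might be enabled in settings.py. The setting names are Scrapy's own; the values are assumptions chosen for this example, not recommendations from the original article.

```python
# settings.py (fragment): let Scrapy adapt the delay to server latency.
# All values below are illustrative assumptions.
AUTOTHROTTLE_ENABLED = True    # turn the AutoThrottle extension on
AUTOTHROTTLE_START_DELAY = 3   # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60    # upper bound on the delay when latency is high
DOWNLOAD_DELAY = 3             # AutoThrottle does not go below this delay
ROBOTSTXT_OBEY = True
DEPTH_LIMIT = 5
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as a lower bound rather than a fixed interval, so the 3-second politeness delay is still respected.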

hellospider.py

Finally, define your own Spider in hellospider.py.

hellospider.py


# scrapy.contrib.* was deprecated in Scrapy 1.0; these are the current import paths
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from HelloScrapy.items import PageInfoItem

class HelloSpider(CrawlSpider):
    # Identifier used when running the spider from the CLI
    name = 'hello'
    # Domains the spider is allowed to crawl
    allowed_domains = ["www.example.com"]
    # URL where crawling starts
    start_urls = ["http://www.example.com"]
    # LinkExtractor accepts arguments that restrict which links are followed
    # (for example, only URLs that contain "new"); none are given here because
    # every page is targeted.
    # When a page matching the Rule is downloaded, the method named in callback is called.
    # With follow=True, links on each downloaded page are followed recursively.
    rules = [Rule(LinkExtractor(), callback='parse_pageinfo', follow=True)]

    def parse_pageinfo(self, response):
        item = PageInfoItem()
        item['URL'] = response.url
        # Specify which part of the page to scrape.
        # Besides XPath, CSS selectors can also be used.
        # extract_first() returns the first match as a string (None if there is no title)
        item['title'] = response.xpath('/html/head/title/text()').extract_first()
        return item

That completes the spider.
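To make the combined effect of allowed_domains, follow=True and DEPTH_LIMIT easier to picture, here is a small stand-alone sketch of a depth-limited recursive crawl. It is not how Scrapy is implemented and does not use Scrapy at all: the SITE dictionary, the crawl function and the prefix check are hypothetical stand-ins for real pages, the spider and the domain filter, and the visiting order will differ from Scrapy's scheduler.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "http://www.example.com": [
        "http://www.example.com/news",
        "http://other.example.org/",        # off-site link, must be skipped
    ],
    "http://www.example.com/news": [
        "http://www.example.com/news/2015",
        "http://www.example.com",           # link back to the top page
    ],
    "http://www.example.com/news/2015": [],
}

def crawl(start_url, allowed_prefix, depth_limit):
    """Visit each reachable URL once, breadth first.

    Links are followed only if they start with allowed_prefix, and links
    found on a page at depth_limit are not followed any further.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])   # (url, depth); the start URL is depth 0
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= depth_limit:
            continue                  # too deep: do not follow this page's links
        for link in SITE.get(url, []):
            if link.startswith(allowed_prefix) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

pages = crawl("http://www.example.com", "http://www.example.com", depth_limit=5)
for page in pages:
    print(page)
```

Lowering depth_limit to 1 in this sketch stops the crawl at the news page, which is the kind of cut-off DEPTH_LIMIT is meant to provide for a real site.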

Run

After that, running the following command crawls and scrapes recursively from the specified start URL and writes the result as CSV.

$ scrapy crawl hello -o result.csv

Note that the argument is the identifier (name) defined in the spider, not the file name hellospider.py.

By the way, the result can also be output in JSON or XML format. I tried it on my own website, and the output appears to match the expected result shown at the top.
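Once result.csv exists, it can be read back with Python's standard csv module. The following is a minimal sketch written for Python 3. It assumes the file has a header row with the PageInfoItem field names (URL and title); the SAMPLE string and the load_pages helper are made up for illustration and simply reuse the expected output from the top of this article.

```python
import csv
import io

# Stand-in for the contents of result.csv (assumed: a header row with the
# PageInfoItem field names, then one row per crawled page).
SAMPLE = """URL,title
http://www.example.com,Top
http://www.example.com/news,news
http://www.example.com/news/2015,2015 news
"""

def load_pages(csv_file):
    """Return a list of (URL, title) tuples read from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return [(row["URL"], row["title"]) for row in reader]

pages = load_pages(io.StringIO(SAMPLE))
for url, title in pages:
    print(url, title)
```

For a real file, open it with open("result.csv", newline="") and pass the file object to load_pages. Because DictReader looks columns up by name, the order of the columns in the file does not matter.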

**Please use this at your own risk when running it against websites on the Internet.**

References:
http://doc.scrapy.org/en/latest/
http://orangain.hatenablog.com/entry/scrapy
http://akiniwa.hatenablog.jp/entry/2013/04/15/001411
