I would like to use **Scrapy**, a crawling and scraping framework for Python, to recursively collect the title and URL of each page on a website and output them in CSV format (roughly, something like a site map). The target output is a result.csv like this:
URL | title
---|---
http://www.example.com | Top
http://www.example.com/news | news
http://www.example.com/news/2015 | 2015 news
… | …
Python 2.7 comes standard on CentOS 7, so that is what we will use.
Install Scrapy with pip using the following command:
$ sudo pip install Scrapy
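If you want to check that the installation succeeded, the scrapy command itself can report its version (the output naturally depends on the release pip installed):
$ scrapy version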
Create a project for Scrapy using the following command.
$ scrapy startproject HelloScrapy
I think the contents of the created project are as shown below. Of these, items.py and settings.py are the files we will edit this time.
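A project generated by scrapy startproject typically contains something like this (the exact auxiliary files can vary a little between Scrapy versions):

```
HelloScrapy/
├── scrapy.cfg
└── HelloScrapy/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```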
Scrapy uses a class called **Spider** to define how to crawl and scrape the target site. To define this Spider, create a file called "hellospider.py" in the spiders directory above.
Up to this point, the structure of the project is as follows.
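Roughly, assuming the default generated files plus the new Spider file:

```
HelloScrapy/
├── scrapy.cfg
└── HelloScrapy/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── hellospider.py
```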
items.py
First, edit items.py. This time we want the title and URL of each web page, so define the following class.
items.py
from scrapy.item import Item, Field

class PageInfoItem(Item):
    URL = Field()
    title = Field()
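For reference, a Scrapy Item behaves much like a dict, so the fields defined above are read and written by key; a minimal sketch (the values here are just placeholders):

```python
from HelloScrapy.items import PageInfoItem

item = PageInfoItem()
item['URL'] = 'http://www.example.com'  # keys must match the declared Field names
item['title'] = ['Top']
print(item)  # items print much like dicts
```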
settings.py
Next, edit settings.py. I have added the following options:
settings.py
DOWNLOAD_DELAY = 3
ROBOTSTXT_OBEY = True
DEPTH_LIMIT = 5
The intention of these settings is to put a crawl interval of about 3 seconds between requests so as not to load the target server, and to crawl in accordance with robots.txt. (Automatic adjustment with AutoThrottle is also possible; see the official documentation if you want to know more.) A depth limit is also set this time, because crawling too deep takes a long time.
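If you would rather let AutoThrottle manage the delay, the relevant settings look roughly like the following; the values are only illustrative and not part of the setup above, and availability of individual settings depends on the Scrapy version:

```python
# settings.py (AutoThrottle variant; values are illustrative)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3.0  # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0   # upper bound on the delay under high latency
```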
hellospider.py
Finally, define your own Spider in hellospider.py.
hellospider.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector
from HelloScrapy.items import PageInfoItem


class HelloSpider(CrawlSpider):
    # Identifier used when running scrapy from the CLI
    name = 'hello'
    # Domains the spider is allowed to crawl
    allowed_domains = ["www.example.com"]
    # URL where the crawl starts
    start_urls = ["http://www.example.com"]

    # LinkExtractor arguments can restrict which pages are scraped
    # (for example, only pages whose URL contains "new"),
    # but no arguments are given here because all pages are targeted.
    # When a page matching a Rule is downloaded, the function named in callback is called.
    # With follow=True, links are followed recursively.
    rules = [Rule(LinkExtractor(), callback='parse_pageinfo', follow=True)]

    def parse_pageinfo(self, response):
        sel = Selector(response)
        item = PageInfoItem()
        item['URL'] = response.url
        # Specify which part of the page to scrape.
        # Besides XPath, CSS selectors can also be used.
        item['title'] = sel.xpath('/html/head/title/text()').extract()
        return item
Done.
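As the comment in parse_pageinfo says, a CSS selector works as well as XPath; a minimal sketch of that one line, assuming a Scrapy version whose Selector supports .css():

```python
# CSS-selector equivalent of the XPath line above
item['title'] = sel.css('head > title::text').extract()
```

Note also that newer Scrapy releases deprecate the scrapy.contrib paths; the same classes are importable from scrapy.spiders (CrawlSpider, Rule) and scrapy.linkextractors (LinkExtractor).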
After that, running the following command crawls and scrapes recursively from the specified start URL and writes the results out as CSV.
$ scrapy crawl hello -o result.csv
(Note that the argument is not hellospider.py but the name identifier defined inside it.)
By the way, the results can also be output in JSON or XML format. I tried this on my own website, and I think the output matches the target image at the top.
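For example, the feed exporter picks the output format from the file extension, so the following should work as well:
$ scrapy crawl hello -o result.json
$ scrapy crawl hello -o result.xml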
**Please use this at your own risk when running it against websites on the Internet.**
References:
- http://doc.scrapy.org/en/latest/
- http://orangain.hatenablog.com/entry/scrapy
- http://akiniwa.hatenablog.jp/entry/2013/04/15/001411