This article walks through, in an easy-to-follow way, how to run Scrapy, a framework for web scraping. I hope you find it useful as a reference.
Reference: Python - Create a crawler with Scrapy https://qiita.com/naka-j/items/4b2136b7b5a4e2432da8
Time required: 15 minutes
Run the following pip command in the terminal to install Scrapy:
pip install scrapy
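If the installation succeeded, the scrapy command should now be available on your PATH; you can check with, for example:
scrapy version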
Then move to the directory where you want to create the Scrapy project and run the following:
scrapy startproject sake
Since we will be scraping a website about sake, the project is named "sake". The following folder structure is then created under the current directory.
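The layout generated by startproject is roughly as follows (the exact file names may vary slightly with your Scrapy version):
sake/
├── scrapy.cfg
└── sake/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py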
The files above are not enough to scrape a website by themselves, so run the following command to create a spider file in the spiders directory.
#scrapy genspider <file name> <Web URL you want to scrape>
scrapy genspider scrapy_sake https://www.saketime.jp/ranking/
Then you can see that a file called "scrapy_sake.py" is created in the spiders directory. The contents of the created file are as follows.
sake/sake/spiders/scrapy_sake.py
# -*- coding: utf-8 -*-
import scrapy
class ScrapySakeSpider(scrapy.Spider):
    name = 'scrapy_sake'
    allowed_domains = ['https://www.saketime.jp/ranking/']
    start_urls = ['http://https://www.saketime.jp/ranking/']

    def parse(self, response):
        pass
As I will explain in detail later, this "def parse" part is where most of the coding happens. Before coding it, let's first check whether the page can be fetched correctly. Add a print statement to the "def parse" part to see what was retrieved.
sake/sake/spiders/scrapy_sake.py
# -*- coding: utf-8 -*-
import scrapy
class ScrapySakeSpider(scrapy.Spider):
    name = 'scrapy_sake'
    allowed_domains = ['https://www.saketime.jp/ranking/']
    start_urls = ['http://https://www.saketime.jp/ranking/']

    def parse(self, response):
        # Delete pass and add a print statement
        print(response)
If you then run the following command, quite a lot of output is returned, but within it you can confirm that the HTML has indeed been fetched.
Execution command
#scrapy crawl <file name>
scrapy crawl scrapy_sake
Output
<li class="brand_review clearfix">
<div>
<p>
Iso's pride, special brewed raw sake
Click here for today's sake, Iso's proud special brewed raw sake!
Rice... <br>
<span class="brand_review_user">
by
<span>Sue</span>
<span>
<span class="review-star">★</span>
<span>4.5</span>
</span>
<span class="reviewtime">
<span>March 23, 2020</span>
</span>
</span>
</p>
</div>
</li>
</ul>
</a>
</div>
:
:
:
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 4, 9, 3, 23, 24, 461847)}
2020-04-09 12:23:26 [scrapy.core.engine] INFO: Spider closed (finished)
Next, let's extract only the necessary information from here!
Basically, there are only two files you need to edit in Scrapy: items.py and the spider file (scrapy_sake.py).
Let's edit items.py first. When you first open it, it looks like the following.
sake/items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class SakeItem(scrapy.Item):
    pass
Register the pieces of information you want to scrape as fields of this class, using any names you like.
sake/items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class SakeItem(scrapy.Item):
    # <name of the information you want to get (any name)> = scrapy.Field()
    prefecture_maker = scrapy.Field()
    prefecture = scrapy.Field()
    maker = scrapy.Field()
    brand = scrapy.Field()
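As a side note, a scrapy.Item such as SakeItem behaves much like a Python dict, which is how it will be used in the spider below; a minimal sketch with a placeholder value:
item = SakeItem()
item["brand"] = "example brand"   # placeholder value, not taken from the site
print(item)                       # prints a dict-like representation: {'brand': 'example brand'}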
This is the end of the description of items.py. Next, let's move on to coding scrapy_sake.py.
The completed version is as follows. You can see that the inside of def parse() is much richer than the one we saw earlier.
sake/sake/spiders/scrapy_sake.py
# -*- coding: utf-8 -*-
import scrapy
# Don't forget to import items.py
from sake.items import SakeItem


class ScrapySakeSpider(scrapy.Spider):
    name = 'scrapy_sake'
    #allowed_domains = ['ja.wikipedia.org']
    start_urls = ['https://www.saketime.jp/ranking/']

    def parse(self, response):
        items = []
        # The sake information is stored in <li class="clearfix"> elements
        sakes = response.css("li.clearfix")
        # There are multiple li.clearfix elements on the page; look at each one
        for sake in sakes:
            # Declare a SakeItem object defined in items.py
            item = SakeItem()
            item["prefecture_maker"] = sake.css("div.col-center p.brand_info::text").extract_first()
            # For markup like <div class="headline clearfix">, join the classes with a dot: headline.clearfix
            item["brand"] = sake.css("div.headline.clearfix h2 a span::text").extract_first()
            # Cleanse the acquired data
            if (item["prefecture_maker"] is not None) or (item["brand"] is not None):
                # Delete \n and spaces
                item["prefecture_maker"] = item["prefecture_maker"].replace(' ', '').replace('\n', '')
                # Separate prefecture and maker
                item["prefecture"] = item["prefecture_maker"].split('|')[0]
                item["maker"] = item["prefecture_maker"].split('|')[1]
                items.append(item)
        print(items)

        # Follow pagination with recursive processing
        # Get the href of the <a rel="next"> element
        next_page = response.css('a[rel="next"]::attr(href)').extract_first()
        if next_page is not None:
            # Convert the URL to an absolute path if it is a relative path
            next_page = response.urljoin(next_page)
            # yield a Request; the next page is fetched and the parse above runs again on it
            yield scrapy.Request(next_page, callback=self.parse)
When this is executed, the output looks like the following.
:
:
:
2020-04-10 16:52:58 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.saketime.jp/ranking/page:110/> from <GET https://www.saketime.jp/ranking/page:110>
2020-04-10 16:52:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.saketime.jp/ranking/page:110/> (referer: https://www.saketime.jp/ranking/page:109/)
[{'brand': 'Orochi's tongue',
'maker': 'Kisuki Brewery',
'prefecture': 'Shimane',
'prefecture_maker': 'Shimane|Kisuki Brewery'}, {'brand': '禱 and Minoru',
'maker': 'Fukumitsuya',
'prefecture': 'Ishikawa',
'prefecture_maker': 'Ishikawa|Fukumitsuya'}, {'brand': 'Kanazawa beauty',
'maker': 'Fukumitsuya',
'prefecture': 'Ishikawa',
'prefecture_maker': 'Ishikawa|Fukumitsuya'}, {'brand': 'Jinkuro',
'maker': 'Hokusetsu Sake Brewery',
'prefecture': 'Niigata',
'prefecture_maker': 'Niigata|Hokusetsu Sake Brewery'}, {'brand': 'Kenroku Sakura',
'maker': 'Nakamura Sake Brewery',
'prefecture': 'Ishikawa',
'prefecture_maker': 'Ishikawa|Nakamura Sake Brewery'}, {'brand': 'birth',
'maker': 'Tohoku Meijo',
'prefecture': 'Yamagata',
'prefecture_maker': 'Yamagata|Tohoku Meijo'}, {'brand': 'SUMMERGODDESS',
'maker': 'Mana Tsuru Sake Brewery',
'prefecture': 'Fukui',
:
:
:
'scheduler/dequeued/memory': 221,
'scheduler/enqueued': 221,
'scheduler/enqueued/memory': 221,
'start_time': datetime.datetime(2020, 4, 10, 7, 51, 13, 756973)}
2020-04-10 16:53:00 [scrapy.core.engine] INFO: Spider closed (finished)
We got 110 pages' worth of sake information, printed in a JSON-like format, in about 20 seconds. Very convenient.
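Note that the spider above only prints the items. If you want Scrapy itself to write the results to a file, one common approach (not used in this article) is to yield each item from parse() instead of collecting them in a list, and then use the built-in feed export from the command line:
scrapy crawl scrapy_sake -o sake.json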
Try scraping the sites you are interested in to get information.
As a basic tip for reading the HTML of the site you want to scrape: in Chrome on macOS you can open the developer tools with Cmd + Option + I. You can also press Cmd + Shift + C and then click an element on the page to see where it appears in the HTML code.
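You can also experiment with CSS selectors from the terminal before writing them into the spider, using Scrapy's interactive shell; a minimal sketch against the same ranking page:
scrapy shell "https://www.saketime.jp/ranking/"
>>> response.css("li.clearfix div.headline.clearfix h2 a span::text").extract_first()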