I did some basic crawling and scraping with Scrapy.
The goal is a Scrapy project that collects the title and overall score of each winter 2016 anime from the following anime information site: https://www.anikore.jp/chronicle/2016/winter/
As the official tutorial at https://doc.scrapy.org/en/1.2/intro/tutorial.html shows, a project is created by running the following command:
% scrapy startproject project_name
This time, I set project_name to `anime`.
Next, create a Python file for the Spider (the scraper) in the project's spiders directory.
This time, I named it `anime_spider.py`.
The finished product looks like this:
anime_spider.py
import scrapy


class AnimeSpider(scrapy.Spider):
    name = "anime"
    start_urls = [
        'https://www.anikore.jp/chronicle/2016/winter/'
    ]

    def parse(self, response):
        for anime in response.css('div.animeSearchResultBody'):
            yield {
                'title': anime.css('span.animeTitle a::text').extract_first(),
                'score': anime.css('span.totalRank::text').extract_first()
            }

        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
name = "anime"
This is the name of the Spider. To run the scraper, pass the name declared here to the crawl command as follows.
% scrapy crawl anime
start_urls = [
    'https://www.anikore.jp/chronicle/2016/winter/'
]
This is the URL where crawling starts. This time I want a list of the winter 2016 anime, so I set it to the top page of the winter 2016 listing.
Scraping and crawling are both done in a method called parse(). Below is the scraping part.
for anime in response.css('div.animeSearchResultBody'):
    yield {
        'title': anime.css('span.animeTitle a::text').extract_first(),
        'score': anime.css('span.totalRank::text').extract_first()
    }
In Scrapy you can select data with CSS selectors or XPath; this time I used CSS.
On this site, each anime's description is wrapped in a div tag with the class `animeSearchResultBody`, so the information for every anime displayed on the page is obtained as follows.
response.css('div.animeSearchResultBody')
From the extracted anime information I only want the title and the overall score, so I extract them as follows.
yield {
    'title': anime.css('span.animeTitle a::text').extract_first(),
    'score': anime.css('span.totalRank::text').extract_first()
}
`extract_first()` extracts the first matching element.
You can also access it with a subscript, as follows:
anime.css('span.animeTitle a::text')[0].extract()
However, I use `extract_first()` because it returns None when nothing matches instead of raising an IndexError.
Crawling is done in the following part.
next_page = response.css('a.next::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
The URL of the next page is built from the href string of the Next button, which response.urljoin() turns into an absolute URL. Because each new request names parse() as its callback, parse() is effectively called recursively, and the crawl continues until there are no more pages.
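To see what the URL-joining step does, here is a minimal sketch using the standard library's `urljoin`, which the Scrapy documentation describes `response.urljoin()` as wrapping. The href values are hypothetical; the real site's pagination links may look different.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the spider's start URL).
page_url = 'https://www.anikore.jp/chronicle/2016/winter/'

# Hypothetical href values a "next" link might carry.
relative_href = 'page/2/'
root_relative_href = '/chronicle/2016/winter/page/2/'
absolute_href = 'https://www.anikore.jp/chronicle/2016/winter/page/2/'

# All three forms resolve to the same absolute URL.
joined_relative = urljoin(page_url, relative_href)
joined_root = urljoin(page_url, root_relative_href)
joined_absolute = urljoin(page_url, absolute_href)

print(joined_relative)
print(joined_root)
print(joined_absolute)
```

This is why the spider can pass the href through `response.urljoin()` without caring whether the site uses relative or absolute links.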
Finally, let's run the program. I showed the run command above when explaining the Spider name; here I add an option to write the output to a file in JSON format. Run the following command from the project directory.
% scrapy crawl anime -o anime.json
I got the title and overall evaluation of the 2016 winter anime.
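As a possible next step, the exported items could be ranked by score. The sketch below is only an illustration: it uses made-up sample data in place of the real anime.json, and it assumes each score is a string that can be parsed as a number, which I have not verified against the actual output.

```python
import json

# Made-up sample in the shape the spider yields: a JSON array of
# {"title": ..., "score": ...} objects. Real values will differ.
SAMPLE_OUTPUT = '''[
{"title": "Sample Anime A", "score": "65.1"},
{"title": "Sample Anime B", "score": "78.5"},
{"title": "Sample Anime C", "score": null}
]'''


def rank_by_score(items):
    """Return (title, score) pairs sorted from highest to lowest score."""
    scored = []
    for item in items:
        try:
            score = float(item["score"])
        except (TypeError, ValueError):
            # Skip items whose score is missing or not numeric.
            continue
        scored.append((item["title"], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_score(json.loads(SAMPLE_OUTPUT))
for title, score in ranking:
    print(title, score)
```

To use it on real output, the sample string would be replaced by the contents of anime.json, for example with `json.load()` on the opened file.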
We have also published a video that briefly explains Scrapy. In the video, the scraping is done with XPath.
"Scrapy: Automatically collects information on web pages !! Crawling & scraping framework" https://www.youtube.com/watch?v=Zfcukqxvia0&t=3s
https://doc.scrapy.org/en/1.2/intro/tutorial.html
https://ja.wikipedia.org/wiki/ウェブスクレイピング