I heard a rumor that a Python library called Scrapy is simple and easy to use, so I immediately tried it out.
Environment: pyenv + anaconda (Python 3)
The plan: from the Latest Additional Songs page here, collect the lyrics of the anime songs added between July 31st and November 30th.
It can be installed with pip.

```
$ pip install scrapy
```
## Project creation
You can choose the project name freely; this time I used `aipa_commander`.

```
$ scrapy startproject aipa_commander
```
I'm too new to scraping to know what the generated files mean. For the time being, I'll leave them untouched until I can use Scrapy reasonably well.
The only directory a beginner like me needs to touch is `aipa_commander` (the project name)`/spiders/`. Create a Python script file there. After coding through various trials and errors, it finally ended up like this:

`get_kashi.py`
```python
# -*- coding: utf-8 -*-
import scrapy


class KashiSpider(scrapy.Spider):
    name = 'kashi'

    start_urls = ['http://www.jtw.zaq.ne.jp/animesong/tuika.html']

    custom_settings = {
        # Wait 1 second between requests so as not to hammer the server
        "DOWNLOAD_DELAY": 1,
    }

    def parse(self, response):
        # Follow the link in the second <td> of each song row
        for href in response.xpath('//td[2]/a/@href'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_item)

    def parse_item(self, response):
        # The lyrics live in a <pre> tag; the first line is the song title
        kashi = response.xpath('//pre/text()').extract()
        kashi = kashi[0].split('\n')
        # Note: the ./lyrics directory must already exist
        with open('./lyrics/{}.txt'.format(kashi[0]), 'w') as f:
            for line in kashi:
                f.write(line + '\n')
```
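As a side note on `response.urljoin()`: the links on the index page are relative paths, and `urljoin` resolves them against the current page URL. A minimal sketch of the same resolution using only the standard library (the example link is one of the hrefs from the index page):

```python
from urllib.parse import urljoin

# response.urljoin() in the spider resolves relative links against the
# page URL, essentially like this:
base = 'http://www.jtw.zaq.ne.jp/animesong/tuika.html'
full_url = urljoin(base, 'ku/qualidea/brave.html')
print(full_url)
```

This is why the spider can `yield scrapy.Request(full_url, ...)` without worrying about whether the site uses relative or absolute links.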
Scrapy is amazing: just a few lines of code fetch the lyrics of 200 songs in one go.
I wrote it with reference to the code in the official tutorial, so there is not much I can explain about it. What I struggled with most, having no knowledge of HTML and CSS, was specifying XPath expressions such as `response.xpath('//td[2]/a/@href')` and `response.xpath('//pre/text()').extract()`.
However, a feature that felt like a savior was there for me: the Scrapy shell.

```
$ scrapy shell "url"
```

starts an interactive shell, and running

```
>>> sel.xpath('//td[2]/a/@href')
```

prints:
```
[<Selector xpath='//td[2]/a/@href' data='ku/qualidea/brave.html'>,
 <Selector xpath='//td[2]/a/@href' data='ku/qualidea/axxxis.html'>,
 <Selector xpath='//td[2]/a/@href' data='ku/qualidea/gravity.html'>,
 <Selector xpath='//td[2]/a/@href' data='ku/qualidea/yakusoku.html'>,
 <Selector xpath='//td[2]/a/@href' data='ku/qualidea/clever.html'>,
 <Selector xpath='//td[2]/a/@href' data='to/drefes/pleasure.html'>,
 ... (omitted below)
```
The result can be confirmed easily this way. With the shell you can experiment with how to get the data you want without rewriting the script every time. It is really convenient, so scraping beginners should definitely take advantage of it.
I may write about how to specify XPath in detail if there is an opportunity. The two expressions used this time do the following:

- `xpath('//td[2]/a/@href')` gets only the `href` attribute (the link URL) of the `<a>` inside every second `<td>`.
- `xpath('//pre/text()').extract()` gets only the text inside every `<pre>` tag.
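To see what the first expression is doing, here is a minimal offline sketch using only the standard library's `xml.etree.ElementTree`. The table snippet is a made-up miniature of the index page, not the site's real markup:

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the index page (hypothetical markup for illustration)
html = (
    "<table>"
    "<tr><td>song</td><td><a href='ku/qualidea/brave.html'>lyrics</a></td></tr>"
    "<tr><td>song</td><td><a href='ku/qualidea/axxxis.html'>lyrics</a></td></tr>"
    "</table>"
)
root = ET.fromstring(html)

# Equivalent of //td[2]/a/@href: the href of the <a> inside each row's second <td>
hrefs = []
for tr in root.findall('.//tr'):
    tds = tr.findall('td')
    if len(tds) >= 2:
        a = tds[1].find('a')
        if a is not None:
            hrefs.append(a.get('href'))

print(hrefs)
```

In the spider itself, Scrapy's selectors do this matching for you from a single XPath string.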
# Execution result
Execute

```
$ scrapy crawl kashi
```

(the `kashi` part is the name specified in the spider's `name` attribute), and
200 text files like this were generated.
![スクリーンショット 2016-12-14 0.20.23.png](https://qiita-image-store.s3.amazonaws.com/0/125193/b660612a-0d67-1311-5238-ecb093b06b15.png)
The contents of a text file look like this (only part of it, since it is long):
![スクリーンショット 2016-12-14 0.23.52.png](https://qiita-image-store.s3.amazonaws.com/0/125193/053df284-92ce-a9fa-9c99-e3c6d0020d97.png)
# In conclusion
I was impressed by how much easier it was to collect the lyrics than I had imagined.
Next time I would like to try it with images.