When people say "scraping", there are really two activities involved: "crawling" (following links and fetching pages) and "scraping" (extracting data from them). I was confused about the difference, so let me sort it out once. For example, pulling my favorite shogi player's titles out of the Shogi Federation page would be the "scraping" part.
scrapy
Let's actually try scraping. Come to think of it, I've only ever used PHP, struggling to extract the information I wanted from pages with Goutte and the like.
Then I learned that Python, which I recently picked up, has a library (framework?) called Scrapy that makes scraping very easy.
So this time I'll use it to collect information about my favorite shogi players from the Shogi Federation page.
$ pip install scrapy
Complete
Well, I'm a super beginner who really doesn't understand Python at all, so I'll try the tutorial step by step to get a feel for it.
There was a tutorial corner in the documentation. https://docs.scrapy.org/en/latest/intro/tutorial.html
It's in English, but it's quite approachable.
I'll work through it in order.
scrapy startproject tutorial
This seems to be good.
[vagrant@localhost test]$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/lib64/python3.5/site-packages/scrapy/templates/project', created in:
/home/vagrant/test/tutorial
You can start your first spider with:
cd tutorial
scrapy genspider example example.com
[vagrant@localhost test]$ ll
total 0
drwxr-xr-x 3 vagrant vagrant 38 Apr 16 04:15 tutorial
A directory called tutorial has been created!
So, there are various things in this, but according to the document, each file has the following roles.
tutorial/
    scrapy.cfg            # deployment configuration file
    tutorial/             # project's Python module; you'll import your code from here
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
I didn't understand anything other than the deployment configuration file lol
Create a file called quotes_spider.py under tutorial/spiders/ and paste in the code from the tutorial.
[vagrant@localhost tutorial]$ vi tutorial/spiders/quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
scrapy crawl quotes
Apparently this is how you run it. After a stream of log output, quotes-1.html and quotes-2.html had been created.
[vagrant@localhost tutorial]$ ll
total 32
-rw-rw-r--1 vagrant vagrant 11053 April 16 04:27 quotes-1.html
-rw-rw-r--1 vagrant vagrant 13734 April 16 04:27 quotes-2.html
-rw-r--r--1 vagrant vagrant 260 April 16 04:15 scrapy.cfg
drwxr-xr-x 4 vagrant vagrant 129 April 16 04:15 tutorial
The tutorial says "let's output the extracted information from the command line", but looking at the contents of the parse method, it's really just doing the following:

- extract the page-number part from the URL of the crawled page
- substitute that number into the %s part of quotes-%s.html
- write the body of the response (a TextResponse) to that file and save it
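Taken in isolation, that filename trick is plain Python string handling; no Scrapy needed to see how it works:

```python
# build 'quotes-<n>.html' from the page URL, the way parse() does
url = 'http://quotes.toscrape.com/page/1/'
page = url.split("/")[-2]            # second-to-last segment: '1'
filename = 'quotes-%s.html' % page
print(filename)                      # quotes-1.html
```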
Also, start_requests ultimately just returns scrapy.Request objects, and it turns out the same thing can be achieved by simply writing start_urls:
    name = "quotes"

    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
With this, there's no need to define the start_requests method at all.
The tutorial says, "to learn how Scrapy actually extracts data, use the scrapy shell." I'll try it right away.
[vagrant@localhost tutorial]$ scrapy shell 'http://quotes.toscrape.com/page/1/'
...Omission...
2017-04-16 04:36:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fbb13dd0080>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7fbb129308d0>
[s] spider <DefaultSpider 'default' at 0x7fbb11f14828>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
Oh, it looks like the title element can be extracted.
When you call response.css(xxx), it returns a SelectorList: an object wrapping the matched XML/HTML nodes. From there you can extract more data, or narrow the selection down further.
As a trial, let's extract the text of the title.
>>> response.css('title::text').extract()
['Quotes to Scrape']
::text means that only the text inside the element is extracted. Without it, the whole element comes back:

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']

extract() on a SelectorList returns a list, so you basically always get a list back (that's why everything above was wrapped in []).
If you want one specific result, index into the list, or grab the first element with extract_first:
>>> response.css('title::text').extract_first()
'Quotes to Scrape'
>>> response.css('title::text')[0].extract()
'Quotes to Scrape'
## There is only one title element in this page, so asking for a second one raises an error.
>>> response.css('title::text')[1].extract()
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/lib/python3.5/site-packages/parsel/selector.py", line 58, in __getitem__
o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
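That is the practical difference between the two: indexing raises IndexError on an empty result, while extract_first() quietly returns None (or a default you pass). A rough sketch of that behavior in plain Python — the helper name here is mine, not Scrapy's API:

```python
def first_or_default(results, default=None):
    # mimics SelectorList.extract_first(): no IndexError on an empty result
    return results[0] if results else default

print(first_or_default(['Quotes to Scrape']))  # Quotes to Scrape
print(first_or_default([]))                    # None
```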
What is XPath? @merrill's article (in Japanese) made it very easy to understand: http://qiita.com/merrill/items/aa612e6e865c1701f43b
With XPath you can specify things like "the a tag inside the fourth td of the tbody" of the HTML.
Trying it on this example right away, it looks like this:
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'
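As an aside, you can see what //title matches without Scrapy at all; the stdlib's xml.etree supports a limited XPath subset (the toy document below is my own, not from the tutorial):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<html><head><title>Quotes to Scrape</title></head></html>')
# './/title' is ElementTree's limited-XPath equivalent of '//title'
print(doc.findtext('.//title'))  # Quotes to Scrape
```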
Now let's extract the quote text and the author from http://quotes.toscrape.com/page/1/, the page we're scraping.
First, put the first quote div into a variable called quote:
>>> quote = response.css("div.quote")[0]
>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
Succeeded in extracting the text part.
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'
It's insanely easy.
>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
The tags come out properly as a list, too.
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
I will rewrite it like this and execute it.
[vagrant@localhost tutorial]$ scrapy crawl quotes
2017-04-16 05:27:09 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tutorial)
2017-04-16 05:27:09 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'BOT_NAME': 'tutorial', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True}
...Omission...
{'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'tags': ['abilities', 'choices']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein', 'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Jane Austen', 'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Marilyn Monroe', 'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'tags': ['be-yourself', 'inspirational']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein', 'text': '“Try not to become a man of success. Rather become a man of value.”', 'tags': ['adulthood', 'success', 'value']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'tags': ['life', 'love']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
...Omission...
There's a lot of log output mixed in, but the data is clearly being extracted.

**Output it to a file and take a look**
[vagrant@localhost tutorial]$ scrapy crawl quotes -o result.json
Let's see the result
[vagrant@localhost tutorial]$ cat result.json
[
{"tags": ["change", "deep-thoughts", "thinking", "world"], "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein"},
{"tags": ["abilities", "choices"], "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling"},
{"tags": ["inspirational", "life", "live", "miracle", "miracles"], "text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein"},
{"tags": ["aliteracy", "books", "classic", "humor"], "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"tags": ["be-yourself", "inspirational"], "text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d", "author": "Marilyn Monroe"},
{"tags": ["adulthood", "success", "value"], "text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein"},
{"tags": ["life", "love"], "text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide"},
{"tags": ["edison", "failure", "inspirational", "paraphrased"], "text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison"},
{"tags": ["misattributed-eleanor-roosevelt"], "text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt"},
{"tags": ["humor", "obvious", "simile"], "text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"], "text": "\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d", "author": "Marilyn Monroe"},
{"tags": ["courage", "friends"], "text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d", "author": "J.K. Rowling"},
{"tags": ["simplicity", "understand"], "text": "\u201cIf you can't explain it to a six year old, you don't understand it yourself.\u201d", "author": "Albert Einstein"},
{"tags": ["love"], "text": "\u201cYou may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect\u2014you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break\u2014her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.\u201d", "author": "Bob Marley"},
{"tags": ["fantasy"], "text": "\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d", "author": "Dr. Seuss"},
{"tags": ["life", "navigation"], "text": "\u201cI may not have gone where I intended to go, but I think I have ended up where I needed to be.\u201d", "author": "Douglas Adams"},
{"tags": ["activism", "apathy", "hate", "indifference", "inspirational", "love", "opposite", "philosophy"], "text": "\u201cThe opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.\u201d", "author": "Elie Wiesel"},
{"tags": ["friendship", "lack-of-friendship", "lack-of-love", "love", "marriage", "unhappy-marriage"], "text": "\u201cIt is not a lack of love, but a lack of friendship that makes unhappy marriages.\u201d", "author": "Friedrich Nietzsche"},
{"tags": ["books", "contentment", "friends", "friendship", "life"], "text": "\u201cGood friends, good books, and a sleepy conscience: this is the ideal life.\u201d", "author": "Mark Twain"},
{"tags": ["fate", "life", "misattributed-john-lennon", "planning", "plans"], "text": "\u201cLife is what happens to us while we are making other plans.\u201d", "author": "Allen Saunders"}
It pours right out!!! Ridiculously easy lol
So far I've listed every target URL directly in start_urls. In practice, though, you often want to follow particular links within a page and fetch the data recursively.
In that case, you grab the link's URL and call your own parse again.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
It feels like this: if next_page exists, go around again. urljoin builds the absolute URL to crawl next from the (possibly relative) href.
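response.urljoin resolves the href against the current page's URL following the standard rules; the stdlib's urllib.parse.urljoin behaves the same way, so you can check the resolution without Scrapy:

```python
from urllib.parse import urljoin

base = 'http://quotes.toscrape.com/page/1/'
# an absolute path replaces the whole path
print(urljoin(base, '/page/2/'))    # http://quotes.toscrape.com/page/2/
# a relative path resolves against the current directory
print(urljoin(base, 'tag/love/'))   # http://quotes.toscrape.com/page/1/tag/love/
```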
Next, since the author names on http://quotes.toscrape.com link to detail pages, the tutorial shows how to follow them for more information.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        # Follow links to the author detail pages
        for href in response.css('.author + a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_author)

        # Follow the pagination link
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        # Extract with the given query and strip (trim) the surrounding whitespace
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
If you write it like this:

1. Follow each author link and run parse_author (extracting the name, birth date, and description)
2. If pagination exists, parse the next page the same way
3. Repeat until there are no more pages

...and all of that fits in just a few dozen lines.
(There was also a section whose usage I didn't understand, so I skipped it.)
To recap:

- Create a project using scrapy
- Write what you want to do in spiders
- Crawling is possible too, by following links
- Extracting data is super easy
When outputting to JSON with -o, non-ASCII strings come out Unicode-escaped and unreadable. This can be solved by adding the line FEED_EXPORT_ENCODING = 'utf-8' to [project_name]/settings.py.
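Those \u201c sequences are just JSON's ASCII escaping of the curly-quote characters. Python's own json module shows the same behavior, and disabling the ASCII escaping (which is effectively what FEED_EXPORT_ENCODING = 'utf-8' does for the feed export) keeps the text readable:

```python
import json

quote = '“Try not to become a man of success.”'
print(json.dumps(quote))                      # "\u201cTry not to become a man of success.\u201d"
print(json.dumps(quote, ensure_ascii=False))  # "“Try not to become a man of success.”"
```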
I made something that scrapes shogi player data.

What it does:

- Start from the player list page of the Shogi Federation
- Follow the link to each player's detail page
- Extract the name, date of birth, and master (mentor)

The actual code looks like this (it's this easy lol):
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "kisi"

    start_urls = [
        'https://www.shogi.or.jp/player/',
    ]

    def parse(self, response):
        # Follow the link to each player's detail page
        for href in response.css("p.ttl a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_kisi)

    def parse_kisi(self, response):
        def extract_with_xpath(query):
            return response.xpath(query).extract_first().strip()

        yield {
            'name': extract_with_xpath('//*[@id="contents"]/div[2]/div/div[2]/div/div/h1/span[1]/text()'),
            'birth': extract_with_xpath('//*[@id="contents"]/div[2]/div/div[2]/table/tbody/tr[2]/td/text()'),
            'sisho': extract_with_xpath('//*[@id="contents"]/div[2]/div/div[2]/table/tbody/tr[4]/td/text()'),
        }
[vagrant@localhost tutorial]$ head kisi.json
[
{"name": "Akira Watanabe", "birth": "April 23, 1984(32 years old)", "sisho": "Kazuharu Shoshi 7th Dan"},
{"name": "Masahiko Urano", "birth": "March 14, 1964(53 years old)", "sisho": "(Late) Sutekichi Nakai 8th Dan"},
{"name": "Masaki Izumi", "birth": "January 11, 1961(56 years old)", "sisho": "Shigeru Sekine 9th Dan"},
{"name": "Koji Tosa", "birth": "March 30, 1955(62 years old)", "sisho": "(Late) Shizuo Seino 8th Dan"},
{"name": "Hiroshi Kamiya", "birth": "April 21, 1961(55 years old)", "sisho": "(Late) Hisao Hirotsu 9th Dan"},
{"name": "Kensuke Kitahama", "birth": "December 28, 1975(41 years old)", "sisho": "Yoshimasa Saeki 9th Dan"},
{"name": "Chikara Akutsu", "birth": "June 24, 1982(34 years old)", "sisho": "Seiichiro Taki 8th Dan"},
{"name": "Takayuki Yamazaki", "birth": "February 14, 1981(36 years old)", "sisho": "Nobuo Mori 7th Dan"},
{"name": "Akihito Hirose", "birth": "January 18, 1987(30 years old)", "sisho": "Osamu Katsura 9th Dan"},
You can see everyone is being fetched properly. It's really easy.
Next, I'd like to try:

- Starting from a specific page
- Specifying search conditions
- Extracting the search results based on rules

I'll write that up if I can. (Well, I still don't really understand yield, I can't debug, and I need to study Python...)