Scrapy has a shell mode that lets you scrape interactively. Used together with Chrome, it makes scraping a web page relatively easy, and it is useful for working out what kind of XPath to write before writing a program.
In Scrapy, you specify the data you want to retrieve from a web page with XPath. Writing your own XPath is not difficult on a page whose HTML structure you know, but it is hard to write an XPath for data on a page you did not create. That is where Chrome comes in.
For example, suppose you want to extract the title and link of each comic from the page http://toyokeizai.net/category/diary.
Open this page in Chrome, right-click the top title "Engineers can't go home on Premium Friday", and select "Inspect" from the menu. Developer Tools opens as shown in the figure below, with the corresponding tag selected.
Right-click the `<span>` tag and select "Copy" → "Copy XPath" from the menu to copy this tag's XPath to the clipboard. In this example, the XPath is
//*[@id="latest-items"]/div/ul/li[1]/div[2]/a/span[2]
In this way, you can get an XPath easily with just Chrome. See the following sites for more on XPath.
TECHSCORE: Location Path
XML Path Language (XPath) Version 1.0
## Installing scrapy
For installation, see "Install scrapy in a Python Anaconda environment".
First, start the scrapy shell.
$ scrapy shell
2017-03-16 10:44:42 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot)
2017-03-16 10:44:42 [scrapy] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2017-03-16 10:44:42 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-03-16 10:44:42 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-16 10:44:42 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-16 10:44:42 [scrapy] INFO: Enabled item pipelines:
[]
2017-03-16 10:44:42 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-03-16 10:44:43 [traitlets] DEBUG: Using default logger
2017-03-16 10:44:43 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x1083d7668>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x108f2cb70>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]:
Then load the web page with the `fetch()` command.
In [1]: fetch('http://toyokeizai.net/category/diary')
2017-03-16 10:46:30 [scrapy] INFO: Spider opened
2017-03-16 10:46:31 [scrapy] DEBUG: Crawled (200) <GET http://toyokeizai.net/category/diary> (referer: None)
You can also specify the URL when starting the scrapy shell to load it in one step.
$ scrapy shell http://toyokeizai.net/category/diary
The loaded page is stored in the `response` object.
You can check whether the target page loaded with a command such as
In [3]: view(response)
Out[3]: True
The `view()` command displays the loaded page in your default browser.
Now let's retrieve the desired data, using the XPath obtained above.
In [4]: response.xpath('//*[@id="latest-items"]/div/ul/li[1]/div[2]/a/span[2]/text()').extract()
Out[4]: ["Engineers can't go home on premium Friday"]
You now have the title. The `text()` appended to the XPath copied from Chrome selects the child text nodes of the selected node, and `extract()` returns their text content. The result comes back as a list.
Next, get all the comic titles listed on the page. The HTML around the XPath used so far looks like this:
<div id="latest-items">
<div class="article-list">
<ul class="business">
<li class="clearfix">
<div class="ico">…</div>
<div class="ttl">
<a href="/articles/-/161892" class="link-box">
<span class="column-ttl">Will be incorporated as work time</span><br>
<span class="column-main-ttl">Engineers can't go home on premium Friday</span>
<span class="date">March 12, 2017</span>
<span class="summary">From February 24th, there will be a Friday "Premium Friday" where you can leave the office once a month ...</span>
</a>
</div>
</li>
<li class="clearfix">…</li>
<li class="clearfix">…</li>
<li class="clearfix">…</li>
<li class="clearfix">…</li>
<li class="clearfix">…</li>
</ul>
</div>
</div>
Each `<li class="clearfix">…</li>` contains the information for one manga.
In the XPath used earlier,
//*[@id="latest-items"]/div/ul/li[1]/div[2]/a/span[2]
`li[1]` refers to the first `<li class="clearfix">…</li>`. If you leave out this index, the expression matches all the `<li>` elements, so to get every title you can simply specify
//*[@id="latest-items"]/div/ul/li/div[2]/a/span[2]
Let's actually try it.
In [5]: response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/span[2]/text()').extract()
Out[5]:
["Engineers can't go home on premium Friday",
 "If you can't beat the machine, become a machine!",
 'Data in the cloud may disappear!',
 'What is the unexpectedly large number of paperless offices?',
 'Unfortunate common points of unpopular male engineers',
 'What you need to do when you challenge advanced programming work',
 "New Year's Day 2017 was a second longer than usual",
 "The latest situation of the engineer's advent calendar",
 'There are "unexpected enemies" in the Amazon cloud',
 "When will Mizuho Bank's system be completed?",
 'Do you remember the nostalgic "Konami Command"?',
 '"DV" has a different meaning in the engineer community',
 'Amazing evolution of the game over the last 40 years',
 '"Pitfalls" hidden in long autumn night programming',
 'Former Sony engineers are popular at work']
That retrieves all the titles. Comparing the HTML and XPath above, it looks as if you could simply target the title tag `<span class="column-main-ttl">` directly, but this page also contains:
<div id="ranking-items" style="display: none;"> <!-- ranking order -->
<div class="article-list ranking category">
<ul class="ranked business">
<li class="clearfix">
...
<div id="latest-items"> <!-- latest order -->
<div class="article-list">
<ul class="business">
<li class="clearfix">
The ranking section has almost the same structure as the latest section, so if you are not careful, extra data gets mixed in. Actually trying it:
In [6]: response.xpath('//span[@class="column-main-ttl"]/text()').extract()
Out[6]:
["Engineers aren't playing with Pokemon GO!",
 "When will Mizuho Bank's system be completed?",
 'Unfortunate common points of unpopular male engineers',
 "Engineers can't go home on premium Friday",
 'Students who no longer know desktop PCs!',
 'Why former Sony engineers are popular in the workplace',
 'Cloud data may disappear!',
 "Why I don't envy Yahoo 3 days a week",
 'The memory of the first computer I bought is vivid',
 'What is the most profitable programming language',
 'Who is attracted to "Famicom Mini"?',
 'Programming has become a very popular lesson!',
 '"Self-driving cars" do not run automatically',
 "The truth about engineer girls' same clothes and staying suspicions",
 'New employees will learn the basics by "creating minutes"',
 "Engineers can't go home on premium Friday",
 "If you can't beat the machine, become a machine!",
 'Cloud data may disappear!',
 'What is the unexpectedly large number of paperless offices?',
 'Unfortunate common points of unpopular male engineers',
 'What you need to do when you challenge advanced programming work',
 "New Year's Day 2017 was a second longer than usual",
 "The latest situation of the engineer's advent calendar",
 'There are "unexpected enemies" in the Amazon cloud',
 "When will Mizuho Bank's system be completed?",
 'Do you remember the nostalgic "Konami Command"?',
 '"DV" has a different meaning in the engineer community',
 'Amazing evolution of the game in the last 40 years',
 '"Pitfalls" hidden in long autumn night programming',
 'Why former Sony engineers are popular at work']
The same data comes back twice: once from the ranking list and once from the latest list. The XPath you use must therefore point uniquely to the data you need.
## Extract the link
Looking at the HTML, the URL of each manga's page is stored in the `href` attribute of the `<a>` tag, the parent of the title `<span>`. The XPath pointing to it looks like this:
//*[@id="latest-items"]/div/ul/li/div[2]/a/@href
The trailing `@href` refers to the `href` attribute of the `<a>` tag. This time we want the attribute value of the `<a>` tag rather than its child text nodes, which is why the path ends this way. Running it:
In [7]: response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/@href').extract()
Out[7]:
['/articles/-/161892',
'/articles/-/159846',
'/articles/-/157777',
'/articles/-/153378',
'/articles/-/153367',
'/articles/-/152301',
'/articles/-/152167',
'/articles/-/149922',
'/articles/-/149911',
'/articles/-/146637',
'/articles/-/146559',
'/articles/-/144778',
'/articles/-/144756',
'/articles/-/142415',
'/articles/-/142342']
That gives the links. Now that you have XPaths for the title and link of each manga, you can build a scraping program around them to collect the information you need.
## Export the acquired data
If you only scrape once, you can simply write out the data you need as-is. Save the scraped data in variables, then write it to a file.
In [8]: titles = response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/span[2]/text()').extract()
In [9]: links = response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/@href').extract()
In [10]: f = open('bohebohe.txt', 'w')
In [11]: for title, link in zip(titles, links):
...: f.write(title + ', ' + link + '\n')
In [12]: f.close()
This writes the scraping results to the file `bohebohe.txt`.
$ cat bohebohe.txt
Engineers can't go home on Premium Friday, /articles/-/161892
If you can't beat the machine, become a machine!, /articles/-/159846
Data in the cloud may disappear!, /articles/-/157777
What is the unexpectedly large number of paperless offices?, /articles/-/153378
Unfortunate common points of unpopular male engineers, /articles/-/153367
What you need to do when you challenge advanced programming work, /articles/-/152301
New Year's Day 2017 was a second longer than usual, /articles/-/152167
The latest situation of the engineer's advent calendar, /articles/-/149922
There are "unexpected enemies" in the Amazon cloud, /articles/-/149911
When will Mizuho Bank's system be completed?, /articles/-/146637
Do you remember the nostalgic "Konami Command"?, /articles/-/146559
"DV" has a different meaning in the engineer community, /articles/-/144778
Amazing evolution of the game over the last 40 years, /articles/-/144756
"Pitfalls" hidden in long autumn night programming, /articles/-/142415
Former Sony engineers are popular at work, /articles/-/142342
# In conclusion
Debugging the XPath that pinpoints your data while building a program is a bit of a hassle, and writing a full program for something you will use only once can be a waste. In such cases the scrapy shell, which lets you try things interactively and run Python code as-is, is quite convenient. It is handy for all sorts of quick experiments, such as pulling a little data out of a page you made in the past.
# Bonus: XPath
Here is a brief description of the XPath expressions used in this article. The following HTML
1: <div id="latest-items">
2: <div class="article-list">
3: <ul class="business">
4: <li class="clearfix">
5: <div class="ttl">
6: <a href="/articles/-/161892" class="link-box">
7: <span class="column-ttl">Incorporated as work time</span><br>
8: <span class="column-main-ttl">Engineers can't go home on Premium Friday</span>
9: <span class="date">March 12, 2017</span>
11: </a>
12: </div>
13: </li>
14: </ul>
15: </div>
16:</div>
is used as the example.
| XPath | Function |
|---|---|
| `//e` | All nodes matching tag `e` anywhere in the document. `//div` selects all the `div` tags (lines 1, 2, 5). |
| `//e1/e2` | All nodes with tag `e2` that are children of tag `e1`. `//div/ul` selects line 3; `//div/a/span` selects lines 7, 8, 9. |
| `//e1/e2[1]` | The first child `e2` of tag `e1`. `//li/div/a/span[1]` selects line 7. |
| `//e[@name="value"]` | Nodes with tag `e` whose attribute `name` equals `value`. `//div[@class="article-list"]` selects line 2. |
| `@name` | The `name` attribute of the selected node. `//div/a/@href` gets the `href` value on line 6. |
| `text()` | The text nodes of all children of the selected node. |