Scrapy has a shell mode that lets you scrape interactively. Used together with Chrome, it makes scraping a web page relatively easy, and it is useful for working out what kind of XPath to write before writing a program.
In Scrapy, you specify the data you want to retrieve from a web page with XPath. Writing your own XPath is not difficult on a page whose HTML structure you know, but it is hard to write an XPath for data on a page you did not create. That is where Chrome comes in.
For example, suppose you want to extract the title and link of each comic from the page http://toyokeizai.net/category/diary.
Open this page in Chrome, right-click the top title "Engineers can't go home on Premium Friday", and select "Inspect" from the menu. Developer Tools opens as shown in the figure below, with the corresponding tag selected.
Right-click the `<span>` tag and select "Copy" → "Copy XPath" from the menu to copy this tag's XPath to the clipboard. In this example, the XPath is
//*[@id="latest-items"]/div/ul/li[1]/div[2]/a/span[2]
In this way, you can get an XPath easily with just Chrome. See the following sites for more on XPath.
TECHSCORE: Location Path
XML Path Language (XPath) Version 1.0
## Installing scrapy
For installation, see "Install scrapy in a Python Anaconda environment".
First, start the scrapy shell.
$ scrapy shell
2017-03-16 10:44:42 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot)
2017-03-16 10:44:42 [scrapy] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2017-03-16 10:44:42 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-03-16 10:44:42 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-16 10:44:42 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-16 10:44:42 [scrapy] INFO: Enabled item pipelines:
[]
2017-03-16 10:44:42 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-03-16 10:44:43 [traitlets] DEBUG: Using default logger
2017-03-16 10:44:43 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x1083d7668>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x108f2cb70>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]:
Then load the web page with the `fetch()` command.
In [1]: fetch('http://toyokeizai.net/category/diary')
2017-03-16 10:46:30 [scrapy] INFO: Spider opened
2017-03-16 10:46:31 [scrapy] DEBUG: Crawled (200) <GET http://toyokeizai.net/category/diary> (referer: None)
You can also specify the URL when starting the scrapy shell to load it in one step.
$ scrapy shell http://toyokeizai.net/category/diary
The loaded page is stored in the `response` object.
You can check whether the target page loaded with a command such as
In [3]: view(response)
Out[3]: True
The `view()` command displays the loaded page in your default browser.
Now let's retrieve the desired data, using the XPath obtained above.
In [4]: response.xpath('//*[@id="latest-items"]/div/ul/li[1]/div[2]/a/span[2]/text()').extract()
Out[4]: ["Engineers can't go home on premium Friday"]
You now have the title. The `text()` appended to the XPath copied from Chrome selects the child text nodes of the selected node, and `extract()` returns their text content. The result comes back as a list.
Next, get all the comic titles listed on the page. The HTML around the XPath used so far looks like this:
<div id="latest-items">
<div class="article-list">
<ul class="business">
<li class="clearfix">
<div class="ico">…</div>
<div class="ttl">
<a href="/articles/-/161892" class="link-box">
<span class="column-ttl">Will be incorporated as work time</span><br>
<span class="column-main-ttl">Engineers can't go home on premium Friday</span>
<span class="date">March 12, 2017</span>
<span class="summary">From February 24th, there will be a Friday "Premium Friday" where you can leave the office once a month ...</span>
</a>
</div>
</li>
<li class="clearfix">…</li>
<li class="clearfix">…</li>
<li class="clearfix">…</li>
<li class="clearfix">…</li>
<li class="clearfix">…</li>
</ul>
</div>
</div>
Each `<li class="clearfix">…</li>` contains the information for one manga.
In the XPath used earlier,
//*[@id="latest-items"]/div/ul/li[1]/div[2]/a/span[2]
`li[1]` refers to the first `<li class="clearfix">…</li>`. If you leave out this index, the expression matches all the `<li>` elements, so to get every title you can simply specify
//*[@id="latest-items"]/div/ul/li/div[2]/a/span[2]
Let's actually try it.
In [5]: response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/span[2]/text()').extract()
Out[5]:
["Engineers can't go home on premium Friday",
 "If you can't beat the machine, become a machine!",
 'Data in the cloud may disappear!',
 'What is the unexpectedly large number of paperless offices?',
 'Unfortunate common points of unpopular male engineers',
 'What you need to do when you challenge advanced programming work',
 "New Year's Day 2017 was a second longer than usual",
 "The latest situation of the engineer's advent calendar",
 'There are "unexpected enemies" in the Amazon cloud',
 "When will Mizuho Bank's system be completed?",
 'Do you remember the nostalgic "Konami Command"?',
 '"DV" has a different meaning in the engineer community',
 'Amazing evolution of the game over the last 40 years',
 '"Pitfalls" hidden in long autumn night programming',
 'Former Sony engineers are popular at work']
That retrieves all the titles. Comparing the HTML and XPath above, it looks as if you could simply target the title tag `<span class="column-main-ttl">` directly, but this page also contains:
<div id="ranking-items" style="display: none;"> <!-- ranking order -->
<div class="article-list ranking category">
<ul class="ranked business">
<li class="clearfix">
...
<div id="latest-items"> <!-- latest order -->
<div class="article-list">
<ul class="business">
<li class="clearfix">
The ranking section has almost the same structure as the latest section, so if you are not careful, extra data gets mixed in. Actually trying it:
In [6]: response.xpath('//span[@class="column-main-ttl"]/text()').extract()
Out[6]:
["Engineers aren't playing with Pokemon GO!",
 "When will Mizuho Bank's system be completed?",
 'Unfortunate common points of unpopular male engineers',
 "Engineers can't go home on premium Friday",
 'Students who no longer know desktop PCs!',
 'Why former Sony engineers are popular in the workplace',
 'Cloud data may disappear!',
 "Why I don't envy Yahoo 3 days a week",
 'The memory of the first computer I bought is vivid',
 'What is the most profitable programming language',
 'Who is attracted to "Famicom Mini"?',
 'Programming has become a very popular lesson!',
 '"Self-driving cars" do not run automatically',
 "The truth about engineer girls' same clothes and staying suspicions",
 'New employees will learn the basics by "creating minutes"',
 "Engineers can't go home on premium Friday",
 "If you can't beat the machine, become a machine!",
 'Cloud data may disappear!',
 'What is the unexpectedly large number of paperless offices?',
 'Unfortunate common points of unpopular male engineers',
 'What you need to do when you challenge advanced programming work',
 "New Year's Day 2017 was a second longer than usual",
 "The latest situation of the engineer's advent calendar",
 'There are "unexpected enemies" in the Amazon cloud',
 "When will Mizuho Bank's system be completed?",
 'Do you remember the nostalgic "Konami Command"?',
 '"DV" has a different meaning in the engineer community',
 'Amazing evolution of the game in the last 40 years',
 '"Pitfalls" hidden in long autumn night programming',
 'Why former Sony engineers are popular at work']
The same data comes back twice: once from the ranking list and once from the latest list. The XPath you use must therefore point uniquely to the data you need.
## Extract the link
Looking at the HTML, the URL of each manga's page is stored in the `href` attribute of the `<a>` tag, the parent of the title `<span>`. The XPath pointing to it looks like this:
//*[@id="latest-items"]/div/ul/li/div[2]/a/@href
The trailing `@href` refers to the `href` attribute of the `<a>` tag. This time we want the attribute value of the `<a>` tag rather than its child text nodes, which is why the path ends this way. Running it:
In [7]: response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/@href').extract()
Out[7]:
['/articles/-/161892',
'/articles/-/159846',
'/articles/-/157777',
'/articles/-/153378',
'/articles/-/153367',
'/articles/-/152301',
'/articles/-/152167',
'/articles/-/149922',
'/articles/-/149911',
'/articles/-/146637',
'/articles/-/146559',
'/articles/-/144778',
'/articles/-/144756',
'/articles/-/142415',
'/articles/-/142342']
That gives the links. Now that you have XPaths for the title and link of each manga, you can build a scraping program around them to collect the information you need.
## Export the acquired data
If you only scrape once, you can simply write out the data you need as-is. Save the scraped data in variables, then write it to a file.
In [8]: titles = response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/span[2]/text()').extract()
In [9]: links = response.xpath('//*[@id="latest-items"]/div/ul/li/div[2]/a/@href').extract()
In [10]: f = open('bohebohe.txt', 'w')
In [11]: for title, link in zip(titles, links):
...: f.write(title + ', ' + link + '\n')
In [12]: f.close()
This writes the scraping results to the file `bohebohe.txt`.
$ cat bohebohe.txt
Engineers can't go home on Premium Friday, /articles/-/161892
If you can't beat the machine, become a machine!, /articles/-/159846
Data in the cloud may disappear!, /articles/-/157777
What is the unexpectedly large number of paperless offices?, /articles/-/153378
Unfortunate common points of unpopular male engineers, /articles/-/153367
What you need to do when you challenge advanced programming work, /articles/-/152301
New Year's Day 2017 was a second longer than usual, /articles/-/152167
The latest situation of the engineer's advent calendar, /articles/-/149922
There are "unexpected enemies" in the Amazon cloud, /articles/-/149911
When will Mizuho Bank's system be completed?, /articles/-/146637
Do you remember the nostalgic "Konami Command"?, /articles/-/146559
"DV" has a different meaning in the engineer community, /articles/-/144778
Amazing evolution of the game over the last 40 years, /articles/-/144756
"Pitfalls" hidden in long autumn night programming, /articles/-/142415
Former Sony engineers are popular at work, /articles/-/142342
# In conclusion
Debugging the XPath that pinpoints your data while building a program is a bit of a hassle, and writing a full program for something you will use only once can be a waste. In such cases the scrapy shell, which lets you try things interactively and run Python code as-is, is quite convenient. It is handy for all sorts of quick experiments, such as pulling a little data out of a page you made in the past.
# Bonus: XPath
Here is a brief description of the XPath expressions used in this article. The following HTML
1: <div id="latest-items">
2: <div class="article-list">
3: <ul class="business">
4: <li class="clearfix">
5: <div class="ttl">
6: <a href="/articles/-/161892" class="link-box">
7: <span class="column-ttl">Incorporated as work time</span><br>
8: <span class="column-main-ttl">Engineers can't go home on Premium Friday</span>
9: <span class="date">March 12, 2017</span>
11: </a>
12: </div>
13: </li>
14: </ul>
15: </div>
16:</div>
is used as the example.
| XPath | Function |
|---|---|
| `//e` | All nodes matching tag `e` anywhere in the document. `//div` selects all the `div` tags (lines 1, 2, 5). |
| `//e1/e2` | All nodes with tag `e2` that are children of tag `e1`. `//div/ul` selects line 3; `//div/a/span` selects lines 7, 8, 9. |
| `//e1/e2[1]` | The first child `e2` of tag `e1`. `//li/div/a/span[1]` selects line 7. |
| `//e[@name="value"]` | Nodes with tag `e` whose attribute `name` equals `value`. `//div[@class="article-list"]` selects line 2. |
| `@name` | The `name` attribute of the selected node. `//div/a/@href` gets the `href` value on line 6. |
| `text()` | The text nodes of all children of the selected node. |