You can use the Python library Scrapy to crawl a website automatically, following its links and retrieving data from each page.
To extract the desired data from a website, you must specify **where** the data is. What you specify is called a **selector**. Scrapy supports both CSS and XPath selectors; this article explains how to use XPath.
Install Scrapy with pip.

```commandline
$ pip install scrapy
```
**Scrapy Shell**

Scrapy has a tool called the Scrapy shell that lets you interactively test your data extraction.
```commandline
$ scrapy shell "http://hogehoge.com/hoge/page1"
```
Running a command like this launches an interactive Python shell with an instance named **response** that holds the contents of the specified page. When you develop an actual spider (crawler), you extract data from this same response instance.
Basically, you extract data with syntax like this:
```shell
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
```
In this example, the body text (`text()`) of every `title` tag (`//title`) in the received HTML is selected. Left as it is, however, the return value is a list of selector objects, as shown above. Use `.extract()` to get the actual strings.
```shell
>>> response.xpath('//title/text()').extract()
[u'exsample title']
```
The extracted data is a list, so index into it to get a single string:
```shell
>>> response.xpath('//title/text()').extract()[0]
u'exsample title'
```
By the way, the `u` in `u'string'` means Unicode: Python handles strings as Unicode.
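As an aside, what `//title/text()` matches can be imitated with nothing but the standard library's `html.parser`. This is a rough illustration of the selection only, not how Scrapy actually works (Scrapy uses lxml-based selectors):

```python
from html.parser import HTMLParser


class TitleTextExtractor(HTMLParser):
    """Collects the text of every <title> tag, like //title/text()."""

    def __init__(self):
        super().__init__()
        self.inside_title = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.inside_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.inside_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>
        if self.inside_title:
            self.texts.append(data)


parser = TitleTextExtractor()
parser.feed("<html><head><title>exsample title</title></head></html>")
print(parser.texts)  # ['exsample title']
```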
If you are crawling multiple pages, the XPath you specify may not match on every page. In that case, taking the 0th element with `response.xpath(hoge).extract()[0]` as above raises an error, so to avoid this you can write:
```shell
>>> item['hoge'] = response.xpath('//title/text()').extract_first()
```

`extract_first()` returns `None` instead of raising an error when there is no match.
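To see why the bare `[0]` index is fragile, here is the same situation reproduced with a plain Python list (a sketch of the behavior only, not Scrapy internals):

```python
# extract() returns an ordinary list; on a page where the XPath matches
# nothing, that list is empty and [0] raises IndexError.
matches = []  # simulates extract() with zero matches

try:
    first = matches[0]
except IndexError:
    first = None  # effectively what extract_first() hands back

print(first)  # None
```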
Also, if you want to concatenate all the elements of an extracted list such as `[u'hoge1', u'hoge2', u'hoge3']` into a single string, you can do this:
```shell
>>> extract_list = [u'hoge1', u'hoge2', u'hoge3']
>>> ''.join(extract_list)
u'hoge1hoge2hoge3'
```
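In practice the extracted pieces often carry stray whitespace from the page layout. A common variant (my own example, not from the article) strips each piece before joining:

```python
# Pieces as they might come back from extract() on a whitespace-heavy page
pieces = ["  hoge1\n", "\thoge2", "hoge3  "]

# Strip each piece, then concatenate into one clean string
joined = "".join(p.strip() for p in pieces)
print(joined)  # hoge1hoge2hoge3
```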
| xpath | Meaning |
|---|---|
| //div | All div tags |
| //div[@class='aaa'] | All div tags with class 'aaa' |
| //div[@id='aaa']/text() | Body text of every div tag with id 'aaa' |
| //a[text()='aaa']/@href | href attribute value of every a tag whose text is 'aaa' |
| //div/tr | All tr tags that are child elements of a div |
| //table/tr/th[text()='price']/following-sibling::td[1]/text() | In every table, the row whose th is 'price' → the first following td element → its body text |
The last XPath is convenient because it lets you pull a value out of a table on a web page by naming its field (the amount next to 'price' in the case above). If you specify just `td`, you get every td element in the same row, so `td[1]` extracts only the first one. Note that XPath positions start at 1: it is `[1]`, not `[0]`.