This is a follow-up to my second summary article on using Python for investing. This time I will actually try scraping.
I will experiment with Stock Investment Memo, the only site in the previous article whose robots.txt contained "Allow: /".
https://kabuoji3.com/
```python
# First, check with reppy (a review of the previous article)
from reppy.robots import Robots

robots = Robots.fetch('https://kabuoji3.com/robots.txt')
print(robots.allowed('https://kabuoji3.com/', '*'))
```

Execution result

```
False
```
**The result above is `False`, but this is caused by how Stock Investment Memo writes its robots.txt, which reppy fails to read.** Normally there is a space after `Allow:`, but this site omits it. That seems to be why reppy, introduced in the previous article, could not fetch the rules correctly. When you get an unexpected NG, it is important to go and check robots.txt yourself.
The site's current robots.txt
How the robots.txt should have been written
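As a side note, the two spellings can be compared offline with the standard library's `urllib.robotparser` (the rule lines below are illustrative). It happens to tolerate the missing space, which shows that parsers differ in strictness; this is exactly why checking robots.txt yourself matters when a tool says NG:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_lines, url):
    """Parse robots.txt lines and check whether url may be fetched by any agent."""
    rp = RobotFileParser()
    rp.modified()          # mark the file as "read" so can_fetch gives an answer
    rp.parse(robots_lines)
    return rp.can_fetch("*", url)

# The site's actual spelling (no space after "Allow:") and the normal spelling
print(allowed(["User-agent: *", "Allow:/"], "https://kabuoji3.com/"))
print(allowed(["User-agent: *", "Allow: /"], "https://kabuoji3.com/"))
```

Note that this does not reproduce reppy's behavior; it only demonstrates that different parsers treat the same edge case differently.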
**This time, as an example, we will retrieve the part in the red frame in the figure below (URL: https://kabuoji3.com/stock/).**
Let's step away from Python for a moment. The goal this time is to "get the latest stock price information for all listed stocks", but if you fetch the page without thinking, you will also pick up unnecessary information that exists on the same page (for example, links to help pages, title information, and so on). You need to understand where in the HTML the part you actually need is written.
Does that mean you have to study HTML as yet another language? Knowing just the minimum is enough. Moreover, Google Chrome has a feature called "Inspect", so you can manage without studying HTML in depth. If you right-click on the target page, you should see a menu like the one below.
**View page source: shows the raw HTML as-is.**
**Inspect: shows immediately which part of the HTML a given part of the page corresponds to (this is the one mainly used in this article).**
Try opening Inspect on the target page https://kabuoji3.com/stock/.
The Inspect panel will appear on the right.
The HTML is shown in the Inspect panel; try moving the mouse cursor over the `<header id="header" class="">` part.
As you do, the upper part of the page is highlighted in blue, as shown in the figure above.
This means that the upper part of the page (the title and so on) is written in this part of the HTML.
This time we want the table of stock price data, so search for it in the HTML shown by Inspect.
As you search, you find that the `<table class="stock_table">` element looks right. Inspecting further, you can see it contains `<thead>` and `<tbody>`, which hold the header and each row of stock price data, respectively. Since tables are written with the `table` element in HTML, you can also simply search for that tag.

From here, we will fetch the data with Python based on what we examined in section 2.
As mentioned briefly last time, you should have Beautiful Soup installed (`pip install beautifulsoup4`).
**You also need to check your user agent (UA) information in advance**, for example with a UA-checking site. The string shown as your "current browser" is your UA, so rewrite the code below to match your environment.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL to scrape
url = 'https://kabuoji3.com/stock/'
# Paste in the user agent of your current browser (rewrite to match your own environment)
headers = {"User-Agent": "Mozilla/*** Chrome/*** Safari/***"}
# Fetch the page (HTML) with the Requests library
response = requests.get(url, headers=headers)
print(response)
```

Execution result

```
<Response [200]>
```
A three-digit HTTP status code is returned. If it is 200 (request succeeded), you are fine.
If 403 (forbidden) or 404 (Not Found, i.e. an invalid URL) is returned, the request failed, so review your code.
Reference: HTTP Status Code Wiki
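As a sketch of defensive handling (the function name is my own, not from the article), the status check can be wrapped so that a non-200 response raises immediately via requests' `raise_for_status` instead of silently handing an error page to the parser:

```python
import requests

def fetch_page(url, headers=None):
    """Fetch a page, raising requests.HTTPError on 403, 404 and other error codes."""
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # does nothing for 2xx, raises HTTPError for 4xx/5xx
    return response
```

With this, a typo in the URL fails loudly at the fetch step rather than surfacing later as a confusing parsing error.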
Next, parse the fetched HTML with BeautifulSoup, specify the parts you need by tag, and extract them.
```python
# Create a BeautifulSoup object from the fetched HTML
soup = BeautifulSoup(response.content, "html.parser")

# First, get the header part of the stock price table.
# From section 2 we know the header is tagged with "tr" inside <thead>,
# so look for that. There are several approaches, but extracting every
# <tr> inside the <thead> tag gives us the whole header.

# Find <thead> with the find method, then extract all <tr> inside it with find_all
tag_thead_tr = soup.find('thead').find_all('tr')
print(tag_thead_tr)
```
Execution result

```
[<tr>
<th>Code / name</th>
<th>market</th>
<th>Open price</th>
<th>High price</th>
<th>Low price</th>
<th>closing price</th>
</tr>]
```
```python
# Get the stock price rows in the same way: we already know the <tr> tags
# are grouped inside the <tbody> tag
tag_tbody_tr = soup.find('tbody').find_all('tr')
# There are many rows, so display only the first (index 0)
print(tag_tbody_tr[0])
```
Execution result

```
<tr data-href="https://kabuoji3.com/stock/1305/">
<td><a href="https://kabuoji3.com/stock/1305/">1305 Daiwa Exchange Traded Fund-Topics</a></td>
<td>TSE ETF</td>
<td>1883</td>
<td>1888</td>
<td>1878</td>
<td>1884</td>
</tr>
```
You can see that the data has been retrieved correctly.
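The same extraction can also be written with BeautifulSoup's CSS-selector method `select`, which scopes the search to `table.stock_table` directly instead of taking the first `thead`/`tbody` on the page. A minimal sketch against a small hypothetical HTML fragment:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page, with the same table structure
html = """
<table class="stock_table">
  <thead><tr><th>Code / name</th><th>market</th></tr></thead>
  <tbody>
    <tr><td>1305 Daiwa Exchange Traded Fund-Topics</td><td>TSE ETF</td></tr>
    <tr><td>1306 (NEXT FUNDS) TOPIX-linked exchange-traded fund</td><td>TSE ETF</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selectors name the exact table we want, so stray tables elsewhere
# on the page cannot sneak in
header = [th.text for th in soup.select("table.stock_table thead th")]
rows = [[td.text for td in tr.select("td")]
        for tr in soup.select("table.stock_table tbody tr")]
print(header)  # ['Code / name', 'market']
```

This is handy on pages that contain more than one table.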
Let's assemble it into a table with pandas, which makes the data easy to handle in Python.
```python
# Extract the header cells (th) from the header row and convert them to text.
# find_all returned a list, so take element [0] first.
head = [h.text for h in tag_thead_tr[0].find_all('th')]

# Do the same for the stock price rows
data = []
for i in range(len(tag_tbody_tr)):
    # Each column's value is stored in a td tag, so extract them
    data.append([d.text for d in tag_tbody_tr[i].find_all('td')])

df = pd.DataFrame(data, columns=head)
# Show only the first two rows of the DataFrame
df.head(2)
```
Display result

| | Code / name | market | Open price | High price | Low price | closing price |
|---|---|---|---|---|---|---|
| 0 | 1305 Daiwa Exchange Traded Fund-Topics | TSE ETF | 1883 | 1888 | 1878 | 1884 |
| 1 | 1306 (NEXT FUNDS) TOPIX-linked exchange-traded fund | TSE ETF | 1861 | 1867 | 1856 | 1863 |
Now the data is in a form that is easy to handle in Python. From here you can process it however you like.
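For example, every cell scraped with `.text` is still a string at this point, so one typical first step is converting the price columns to numbers with `pd.to_numeric` (the sample row and column names below follow the table above):

```python
import pandas as pd

df = pd.DataFrame(
    [["1305 Daiwa Exchange Traded Fund-Topics", "TSE ETF", "1883", "1888", "1878", "1884"]],
    columns=["Code / name", "market", "Open price", "High price", "Low price", "closing price"],
)

price_cols = ["Open price", "High price", "Low price", "closing price"]
# If real data contains thousands separators, strip them first,
# e.g. df[c].str.replace(',', '') for each column
df[price_cols] = df[price_cols].apply(pd.to_numeric)
print(df["closing price"].mean())  # now works as arithmetic, not string concatenation
```

Without this step, operations like `mean()` or sorting by price behave as string operations.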
You can also save it to CSV with pandas' `to_csv` method.
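A minimal sketch (the file name is arbitrary): `index=False` drops the row numbers, and `encoding="utf-8-sig"` adds a BOM so that Japanese company names open correctly in Excel:

```python
import pandas as pd

# A small stand-in DataFrame; in practice this would be the scraped table
df = pd.DataFrame({"Code / name": ["1305 Daiwa Exchange Traded Fund-Topics"],
                   "closing price": [1884]})
# Write without the index column, BOM-prefixed UTF-8 for Excel compatibility
df.to_csv("stock_prices.csv", index=False, encoding="utf-8-sig")
```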
The article is long, but the code itself is short. In short, all you have to do is find which HTML tag holds the information you want. Of course, there are many cases where this alone is not enough (for example, when JavaScript is involved), but you can pick up those techniques as the need arises. **Next time, in the third article, we plan to combine the first and second articles and store the stock prices obtained by scraping in a database.**
As a supplement, I will briefly explain the HTML of https://kabuoji3.com/stock/. If you display the HTML in Inspect and fold up the `<body>...` part, it looks like the figure below.
In other words, simplified, it looks like this:
```html
<!-- First, the HTML declaration. lang="ja" means the page is in Japanese -->
<html class=...>
<!-- The head part, not displayed in the browser. It holds the character code
     and the text shown as the search result for this page -->
<head>...</head>
<!-- The body of the HTML. The page is built up while grouping sections with div tags -->
<body class=...>...</body>
<!-- End of the HTML -->
</html>
```
The rough structure of the body of this page's HTML is written below.
Indentation indicates nesting depth. It is easiest to follow if you open the actual page and check with Inspect. It looks like a mess, but the `div id` values you need to check are not duplicated, so just pay attention to those.
```html
<!-- ▼ Excerpt of the body part only -->
<body>
  <div id="wrapper">
    <!-- ▼ Header part -->
    <header id="header">...</header>
    <!-- ▼ Global navigation (appears when you press MENU) -->
    <div id="gNav_wrap">...</div>
    <!-- ▼ Main part of the page -->
    <div id="contents_wrap">
      <!-- ▼ Main part -->
      <div id="container_in">
        <!-- ▼ Main part -->
        <div id="main">
          <!-- ▼ Only the important parts are shown from here; the rest is omitted -->
          <div class="data_contents">
            <!-- ▼ Stock price table -->
            <table class="stock_table">
              <!-- ▼ Table header (column names) -->
              <thead>
                <!-- ▼ Column row -->
                <tr>...</tr>
              </thead>
              <!-- ▼ Table body data -->
              <tbody>
                <!-- ▼ Stock price data for each row -->
                <tr>...</tr>
              </tbody>
            </table>
          </div>
        </div>
        <!-- ▼ Data menu part -->
        <div id="side">...</div>
      </div>
      <!-- ▼ Navigation with HOME and PAGE TOP links -->
      <div id="gNav_wrap">...</div>
      <!-- ▼ Footer part (information gathered at the bottom of the page) -->
      <div id="gNav_wrap">...</div>
    </div>
  </div>
  <!-- ▼ Script part, used to load JavaScript or external scripts -->
  <script>...</script>
</body>
```