This is a follow-up to my second summary article on using Python for investing. This time I will actually try scraping.
I will experiment with Stock Investment Memo, the only site in the previous article whose robots.txt contained "Allow: /".
https://kabuoji3.com/
```python
# First, check with reppy (a review of the previous article)
from reppy.robots import Robots

robots = Robots.fetch('https://kabuoji3.com/robots.txt')
print(robots.allowed('https://kabuoji3.com/', '*'))
```

Execution result

```
False
```
**The result above is `False`, but this is caused by how Stock Investment Memo writes its robots.txt, which reppy fails to read.** Normally there is a space after `Allow:`, but this site omits it. That seems to be why reppy, introduced in the previous article, could not fetch the rules correctly. When you get an unexpected NG, it is important to go and check robots.txt yourself.
The site's current robots.txt
How the robots.txt should have been written
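As a side note, the two spellings can be compared offline with the standard library's `urllib.robotparser` (the rule lines below are illustrative). It happens to tolerate the missing space, which shows that parsers differ in strictness; this is exactly why checking robots.txt yourself matters when a tool says NG:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_lines, url):
    """Parse robots.txt lines and check whether url may be fetched by any agent."""
    rp = RobotFileParser()
    rp.modified()          # mark the file as "read" so can_fetch gives an answer
    rp.parse(robots_lines)
    return rp.can_fetch("*", url)

# The site's actual spelling (no space after "Allow:") and the normal spelling
print(allowed(["User-agent: *", "Allow:/"], "https://kabuoji3.com/"))
print(allowed(["User-agent: *", "Allow: /"], "https://kabuoji3.com/"))
```

Note that this does not reproduce reppy's behavior; it only demonstrates that different parsers treat the same edge case differently.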
**This time, as an example, we will retrieve the part in the red frame in the figure below (URL: https://kabuoji3.com/stock/).**
Let's step away from Python for a moment. The goal this time is to "get the latest stock price information for all listed stocks", but if you fetch the page without thinking, you will also pick up unnecessary information that exists on the same page (for example, links to help pages, title information, and so on). You need to understand where in the HTML the part you actually need is written.
Does that mean you have to study HTML as yet another language? Knowing just the minimum is enough. Moreover, Google Chrome has a feature called "Inspect", so you can manage without studying HTML in depth. If you right-click on the target page, you should see a menu like the one below.
**View page source: shows the raw HTML as-is.**
**Inspect: shows immediately which part of the HTML a given part of the page corresponds to (this is the one mainly used in this article).**
Try opening Inspect on the target page https://kabuoji3.com/stock/.
The Inspect panel will appear on the right.
The HTML is shown in the Inspect panel; try moving the mouse cursor over the `<header id="header" class="">` part.
As you do, the upper part of the page is highlighted in blue, as shown in the figure above.
This means that the upper part of the page (the title and so on) is written in this part of the HTML.
This time we want the table of stock price data, so search for it in the HTML shown by Inspect.
As you search, you find that the `<table class="stock_table">` element looks right. Inspecting further, you can see it contains `<thead>` and `<tbody>`, which hold the header and each row of stock price data, respectively. Since tables are written with the `table` element in HTML, you can also simply search for that tag.

From here, we will fetch the data with Python based on what we examined in section 2.
As mentioned briefly last time, you should have Beautiful Soup installed (`pip install beautifulsoup4`).
**You also need to check your user agent (UA) information in advance**, for example with a UA-checking site. The string shown as your "current browser" is your UA, so rewrite the code below to match your environment.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL to scrape
url = 'https://kabuoji3.com/stock/'
# Paste in the user agent of your current browser (rewrite to match your own environment)
headers = {"User-Agent": "Mozilla/*** Chrome/*** Safari/***"}
# Fetch the page (HTML) with the Requests library
response = requests.get(url, headers=headers)
print(response)
```

Execution result

```
<Response [200]>
```
A three-digit HTTP status code is returned. If it is 200 (request succeeded), you are fine.
If 403 (forbidden) or 404 (Not Found, i.e. an invalid URL) is returned, the request failed, so review your code.
Reference: HTTP Status Code Wiki
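As a sketch of defensive handling (the function name is my own, not from the article), the status check can be wrapped so that a non-200 response raises immediately via requests' `raise_for_status` instead of silently handing an error page to the parser:

```python
import requests

def fetch_page(url, headers=None):
    """Fetch a page, raising requests.HTTPError on 403, 404 and other error codes."""
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # does nothing for 2xx, raises HTTPError for 4xx/5xx
    return response
```

With this, a typo in the URL fails loudly at the fetch step rather than surfacing later as a confusing parsing error.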
Next, parse the fetched HTML with BeautifulSoup, specify the parts you need by tag, and extract them.
```python
# Create a BeautifulSoup object from the fetched HTML
soup = BeautifulSoup(response.content, "html.parser")

# First, get the header part of the stock price table.
# From section 2 we know the header is tagged with "tr" inside <thead>,
# so look for that. There are several approaches, but extracting every
# <tr> inside the <thead> tag gives us the whole header.

# Find <thead> with the find method, then extract all <tr> inside it with find_all
tag_thead_tr = soup.find('thead').find_all('tr')
print(tag_thead_tr)
```
Execution result

```
[<tr>
<th>Code / name</th>
<th>market</th>
<th>Open price</th>
<th>High price</th>
<th>Low price</th>
<th>closing price</th>
</tr>]
```
```python
# Get the stock price rows in the same way: we already know the <tr> tags
# are grouped inside the <tbody> tag
tag_tbody_tr = soup.find('tbody').find_all('tr')
# There are many rows, so display only the first (index 0)
print(tag_tbody_tr[0])
```
Execution result

```
<tr data-href="https://kabuoji3.com/stock/1305/">
<td><a href="https://kabuoji3.com/stock/1305/">1305 Daiwa Exchange Traded Fund-Topics</a></td>
<td>TSE ETF</td>
<td>1883</td>
<td>1888</td>
<td>1878</td>
<td>1884</td>
</tr>
```
You can see that the data has been retrieved correctly.
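The same extraction can also be written with BeautifulSoup's CSS-selector method `select`, which scopes the search to `table.stock_table` directly instead of taking the first `thead`/`tbody` on the page. A minimal sketch against a small hypothetical HTML fragment:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page, with the same table structure
html = """
<table class="stock_table">
  <thead><tr><th>Code / name</th><th>market</th></tr></thead>
  <tbody>
    <tr><td>1305 Daiwa Exchange Traded Fund-Topics</td><td>TSE ETF</td></tr>
    <tr><td>1306 (NEXT FUNDS) TOPIX-linked exchange-traded fund</td><td>TSE ETF</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selectors name the exact table we want, so stray tables elsewhere
# on the page cannot sneak in
header = [th.text for th in soup.select("table.stock_table thead th")]
rows = [[td.text for td in tr.select("td")]
        for tr in soup.select("table.stock_table tbody tr")]
print(header)  # ['Code / name', 'market']
```

This is handy on pages that contain more than one table.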
Let's assemble it into a table with pandas, which makes the data easy to handle in Python.
```python
# Extract the header cells (th) from the header row and convert them to text.
# find_all returned a list, so take element [0] first.
head = [h.text for h in tag_thead_tr[0].find_all('th')]

# Do the same for the stock price rows
data = []
for i in range(len(tag_tbody_tr)):
    # Each column's value is stored in a td tag, so extract them
    data.append([d.text for d in tag_tbody_tr[i].find_all('td')])

df = pd.DataFrame(data, columns=head)
# Show only the first two rows of the DataFrame
df.head(2)
```
Display result

| | Code / name | market | Open price | High price | Low price | closing price |
|---|---|---|---|---|---|---|
| 0 | 1305 Daiwa Exchange Traded Fund-Topics | TSE ETF | 1883 | 1888 | 1878 | 1884 |
| 1 | 1306 (NEXT FUNDS) TOPIX-linked exchange-traded fund | TSE ETF | 1861 | 1867 | 1856 | 1863 |
Now the data is in a form that is easy to handle in Python. From here you can process it however you like.
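For example, every cell scraped with `.text` is still a string at this point, so one typical first step is converting the price columns to numbers with `pd.to_numeric` (the sample row and column names below follow the table above):

```python
import pandas as pd

df = pd.DataFrame(
    [["1305 Daiwa Exchange Traded Fund-Topics", "TSE ETF", "1883", "1888", "1878", "1884"]],
    columns=["Code / name", "market", "Open price", "High price", "Low price", "closing price"],
)

price_cols = ["Open price", "High price", "Low price", "closing price"]
# If real data contains thousands separators, strip them first,
# e.g. df[c].str.replace(',', '') for each column
df[price_cols] = df[price_cols].apply(pd.to_numeric)
print(df["closing price"].mean())  # now works as arithmetic, not string concatenation
```

Without this step, operations like `mean()` or sorting by price behave as string operations.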
You can also save it to CSV with pandas' `to_csv` method.
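A minimal sketch (the file name is arbitrary): `index=False` drops the row numbers, and `encoding="utf-8-sig"` adds a BOM so that Japanese company names open correctly in Excel:

```python
import pandas as pd

# A small stand-in DataFrame; in practice this would be the scraped table
df = pd.DataFrame({"Code / name": ["1305 Daiwa Exchange Traded Fund-Topics"],
                   "closing price": [1884]})
# Write without the index column, BOM-prefixed UTF-8 for Excel compatibility
df.to_csv("stock_prices.csv", index=False, encoding="utf-8-sig")
```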
The article is long, but the code itself is short. In short, all you have to do is find which HTML tag holds the information you want. Of course, there are many cases where this alone is not enough (for example, when JavaScript is involved), but you can pick up those techniques as the need arises. **Next time, in the third article, we plan to combine the first and second articles and store the stock prices obtained by scraping in a database.**
As a supplement, I will briefly explain the HTML of https://kabuoji3.com/stock/. If you display the HTML in Inspect and fold up the `<body>...` part, it looks like the figure below.
In other words, simplified, it looks like this:
```html
<!-- First, the HTML declaration. lang="ja" means the page is in Japanese -->
<html class=...>
<!-- The head part, not displayed in the browser. It holds the character code
     and the text shown as the search result for this page -->
<head>...</head>
<!-- The body of the HTML. The page is built up while grouping sections with div tags -->
<body class=...>...</body>
<!-- End of the HTML -->
</html>
```
The rough structure of the body of this page's HTML is written below.
Indentation indicates nesting depth. It is easiest to follow if you open the actual page and check with Inspect. It looks like a mess, but the `div id` values you need to check are not duplicated, so just pay attention to those.
```html
<!-- ▼ Excerpt of the body part only -->
<body>
  <div id="wrapper">
    <!-- ▼ Header part -->
    <header id="header">...</header>
    <!-- ▼ Global navigation (appears when you press MENU) -->
    <div id="gNav_wrap">...</div>
    <!-- ▼ Main part of the page -->
    <div id="contents_wrap">
      <!-- ▼ Main part -->
      <div id="container_in">
        <!-- ▼ Main part -->
        <div id="main">
          <!-- ▼ Only the important parts are shown from here; the rest is omitted -->
          <div class="data_contents">
            <!-- ▼ Stock price table -->
            <table class="stock_table">
              <!-- ▼ Table header (column names) -->
              <thead>
                <!-- ▼ Column row -->
                <tr>...</tr>
              </thead>
              <!-- ▼ Table body data -->
              <tbody>
                <!-- ▼ Stock price data for each row -->
                <tr>...</tr>
              </tbody>
            </table>
          </div>
        </div>
        <!-- ▼ Data menu part -->
        <div id="side">...</div>
      </div>
      <!-- ▼ Navigation with HOME and PAGE TOP links -->
      <div id="gNav_wrap">...</div>
      <!-- ▼ Footer part (information gathered at the bottom of the page) -->
      <div id="gNav_wrap">...</div>
    </div>
  </div>
  <!-- ▼ Script part, used to load JavaScript or external scripts -->
  <script>...</script>
</body>
```