**This is the third (and final) part** of my series summarizing what I have investigated about using Python for investing. As a wrap-up of the series, this article walks through the whole flow of scraping stock data and storing it in a DB. It assumes you have read all of the articles below. (The code here is commented as much as possible, but please see the past articles for details.)
**Previous articles**
**Handling a database easily with Python (SQLite3)**
**Web scraping with Python ① (prior knowledge for scraping)**
**Web scraping with Python ② (actually scraping a stock site)**
The target site's robots.txt contains `Allow: /`, so scraping is permitted. For example, if you search for Toyota Motor Corporation from the top page of the stock investment memo site, you can jump to its annual transition DB page, and selecting "2020" there gives you the data for 2020.

If you look at the URL at that point, you can see that it has the form `/stock/<securities code>/<year>`. In other words, by changing the securities-code part you can switch to the data of any stock you like.
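As a quick illustration of this URL pattern, here is a minimal sketch (the code and year below are just example placeholders):

python
#A minimal sketch of the URL pattern described above; code and year are placeholders
def build_url(code, year):
    return 'https://kabuoji3.com/stock/{}/{}/'.format(code, year)

print(build_url(7203, 2020))  #-> https://kabuoji3.com/stock/7203/2020/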
**The details were already explained in the previous article, so please check there.**
If you inspect the target elements with Chrome's developer tools, you can see that the data sits in the `thead` and `tbody` parts under `<table class="stock_table">`. The difference from the previous article is that the stock prices are not all placed directly under a single `tbody`; each row is stored in its own `tbody`. It is not a big deal, though.
Check your current user agent here: https://www.ugtop.com/spill.shtml
python
import requests
from bs4 import BeautifulSoup
#URL to be scraped
url = 'https://kabuoji3.com/stock/7203/2020/'
#User agent of your current browser (checked at the URL above). Rewrite it to match your environment.
headers = {"User-Agent": "Mozilla/*** Chrome/*** Safari/***"}
#Fetch the page (HTML) with the requests library
response = requests.get(url, headers = headers)
#Create a BeautifulSoup object from the fetched HTML
soup = BeautifulSoup(response.content, "html.parser")
#First find the thead with find(), then extract all tr elements inside it with find_all()
tag_thead_tr = soup.find('thead').find_all('tr')
print(tag_thead_tr)
Execution result
[<tr>
<th>date</th>
<th>Open price</th>
<th>High price</th>
<th>Low price</th>
<th>closing price</th>
<th>Volume</th>
<th>Closing price adjustment</th>
</tr>]
python
'''
In the table with class="stock_table stock_data_table", each price row is stored in its own tbody directly below it.
So first get the table element, then extract its contents.
'''
#Search for the table with findAll and take the 0th element (there is only one anyway)
table = soup.findAll("table", {"class":"stock_table stock_data_table"})[0]
#Get the tr tags inside it. The first one is the thead header row, so skip it with [1:]. (Why not tbody? Explained later.)
tag_tbody_tr = table.findAll("tr")[1:]
#Show only the first row as an example
print(tag_tbody_tr[0])
Execution result
<tr>
<td>2020-01-06</td>
<td>7599</td>
<td>7626</td>
<td>7530</td>
<td>7565</td>
<td>6672300</td>
<td>7565</td>
</tr>
As for the price rows, the actual HTML looks like the figure below: if you do `findAll` on `tbody`, you only get the first `<tbody> ~ </tbody>`, i.e. the 2020/1/6 row. This comes from the HTML rule that once `<tbody>` has been written, it can be omitted afterwards. That is why, unlike the previous article, we had to search for the `<tr>` tags, which are written without omission for every date.
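If you want to confirm this behaviour yourself, a quick check like the following (a sketch that reuses the soup and table objects from the code above) compares how many hits a tbody search gives versus a tr search:

python
#Sanity check: explicit tbody tags found by the parser vs. tr rows (header excluded)
print(len(soup.findAll('tbody')))    #few hits, because omitted tbody tags are not in the raw HTML
print(len(table.findAll('tr')[1:]))  #one tr per trading day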
The stock name and securities code, which we need when storing into the DB, are also scraped from the same page. As the execution result below shows, the securities code cannot be obtained on its own from this page, but in practice we only need to extract its first four digits.
python
#In the page header (class base_box_jp), the stock name and code are embedded in a span tag with class "jp"
name = soup.findAll("span", {"class":"jp"})[0]
print(name.get_text())
Execution result
'7203 Toyota Motor(stock)'
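Since the first four characters of this string are the securities code, a simple slice is enough to pull it out (the same trick is used inside the scraping function later):

python
#Take only the first 4 characters, which are the securities code
code = name.get_text()[:4]
print(code)  #7203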
Now that we know how to extract both pieces of data, let's apply this to all stocks and write the results to the DB. First, we have to think about how to store them in the database (**this is called the schema**).
Since this is a simple example, the ER diagram (a diagram representing the DB design) looks like the one below. I used draw.io because it is a simple ER-diagram tool. Reference: How to make an ER diagram [Simplified version]
・ The stock master stores the securities code and stock name (ideally also the listed market).
・ The stock price table holds all the price data for every stock; the `code` column identifies which stock each row belongs to.
・ **This article only covers acquiring the prices (stock price table) as an example. The stock master is easy, so please refer to this article and try it yourself.**
・ **Strictly speaking, further processing of the "adjusted closing price" (stock splits, etc.) is also required, but it is not covered in this article.** A rough SQL sketch of the two tables follows this list.
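Here is that rough sketch of the schema in SQL. The table and column names are illustrative assumptions only; in this article the price table is actually created implicitly by `to_sql` later, so you do not need to run this:

python
import sqlite3

#Illustrative schema only; names are assumptions, not the tables created later by to_sql
schema_sql = """
CREATE TABLE IF NOT EXISTS brand (   -- stock master
    code TEXT PRIMARY KEY,           -- securities code
    name TEXT                        -- stock name (the listed market could be added too)
);
CREATE TABLE IF NOT EXISTS price (   -- stock price table
    date TEXT,                       -- trading date
    open REAL, high REAL, low REAL, close REAL,
    volume INTEGER,
    adj_close REAL,                  -- adjusted closing price
    code TEXT                        -- which stock the row belongs to
);
"""
with sqlite3.connect('schema_sample.db') as con:
    con.executescript(schema_sql)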
There are various ways to proceed from here, but since I am used to pandas, I build the price table as a pandas DataFrame first and then convert it to SQL with `to_sql`, and that is how this article proceeds. You do not have to go through pandas; feel free to do it your own way.
**Note that the `time.sleep(1)` part is there to limit the load we put on the site while scraping.** If the site specifies a required interval, replace the number of seconds accordingly.
python
import time
import pandas as pd
def get_brand(code):
    """
    Scrape the page for a given securities code, convert it to a DataFrame and return it.
    """
    #User agent of your current browser (checked earlier). Rewrite it to match your environment.
    headers = {"User-Agent": "Mozilla/*** Chrome/*** Safari/***"}
    #Build the URL of the stock to fetch
    url = 'https://kabuoji3.com/stock/' + str(code) + '/2020/'
    #If the price data does not exist, return None via exception handling
    try:
        #Fetch the page and create a BeautifulSoup object from the HTML
        response = requests.get(url, headers = headers)
        soup = BeautifulSoup(response.content, "html.parser")
        #Get the stock name (the securities code is its first 4 characters)
        code_name = soup.findAll("span", {"class":"jp"})[0]
        #Get the header (column) information
        tag_thead_tr = soup.find('thead').find_all('tr')
        head = [h.text for h in tag_thead_tr[0].find_all('th')]
        #Get the price data and convert it to a DataFrame
        table = soup.findAll("table", {"class":"stock_table stock_data_table"})[0]
        tag_tbody_tr = table.findAll("tr")[1:]
        data = []
        for i in range(len(tag_tbody_tr)):
            data.append([d.text for d in tag_tbody_tr[i].find_all('td')])
        df = pd.DataFrame(data, columns = head)
        #Add a code column with assign: the first 4 characters of code_name are the securities code
        df = df.assign(code=code_name.get_text()[:4])
        #For debugging: print the fetched URL. Not required in production code
        print(url)
    except (ValueError, IndexError, AttributeError):
        return None
    return df

def brands_generator(code_range):
    """
    Generate securities codes one by one and concatenate the scraped results.
    """
    for i,code in enumerate(code_range):
        #Create an empty DataFrame only on the first iteration
        if i==0:
            cols = ['date', 'Open price', 'High price', 'Low price', 'closing price', 'Volume', 'Closing price adjustment', 'code']
            df = pd.DataFrame(index=[], columns=cols)
        #Pass the generated securities code to the scraping function
        brand = get_brand(code)
        #If data was obtained, concatenate it
        if brand is not None:
            df = pd.concat([df, brand]).reset_index(drop=True)
        #Pause for 1 second (to limit scraping load)
        time.sleep(1)
    return df
"""
As an example, we fetch securities codes 1301 through 1309 here (the range end is exclusive).
For the real run you can go from 1301 to 9999 (it naturally takes time).
"""
df = brands_generator(range(1301,1310))
Execution result
https://kabuoji3.com/stock/1301/2020/
https://kabuoji3.com/stock/1305/2020/
https://kabuoji3.com/stock/1306/2020/
https://kabuoji3.com/stock/1308/2020/
https://kabuoji3.com/stock/1309/2020/
The URLs of the securities codes that were actually fetched appear in the output. Codes such as 1302 and 1307 do not exist, so nothing was fetched for them, which is exactly the expected behaviour. It looks like the scraping works correctly; a small sketch for checking the skipped codes follows.
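To check programmatically which codes in the range produced no data, here is a small sketch using the df obtained above:

python
#Codes that actually produced data
fetched = set(df['code'].astype(str).unique())
#Codes in the requested range that produced nothing (non-existent or delisted)
missing = [c for c in range(1301, 1310) if str(c) not in fetched]
print(missing)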
python
#Show only the first two rows as a sample
df.head(2)
Displaying the resulting df with pandas shows the data in chronological order, as below. The securities code is also stored in the `code` column we added, so everything looks fine.
| | date | Open price | High price | Low price | closing price | Volume | Closing price adjustment | code |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-06 | 7599 | 7626 | 7530 | 7565 | 6672300 | 7565 | 1301 |
| 1 | 2020-01-07 | 7654 | 7722 | 7619 | 7715 | 4960700 | 7715 | 1301 |
Now that the data is in DataFrame form, let's finally put it into the DB.
python
import sqlite3
#Database name, with a .db extension
dbname = 'sample.db'
#Create (or connect to) the database
db = sqlite3.connect(dbname, isolation_level=None)
#Convert df into the database with to_sql. The DB table name is price
df.to_sql('price', db, if_exists='append', index=None)
#Create a cursor object for the database
cursor = db.cursor()
#Check with SQL whether the price table was really created
sql = """SELECT name FROM sqlite_master WHERE TYPE='table'"""
for t in cursor.execute(sql):
print(t)
Execution result
('price',)
You can see that the price table was created properly. Finally, let's read this database back into a DataFrame to confirm that its contents are the same.
python
#Read the price table of the created database again with pandas
df_db = pd.read_sql('SELECT * FROM price', db)
#Show only the first two rows, as before
df_db.head(2)
Naturally, the result is the same, which confirms that the contents of the DB match the DataFrame we extracted by scraping earlier. (You can also inspect the contents with SQL commands, but personally **pandas is a hundred times easier to read**; a plain-SQL version of the same check follows the table.)
| | date | Open price | High price | Low price | closing price | Volume | Closing price adjustment | code |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-06 | 7599 | 7626 | 7530 | 7565 | 6672300 | 7565 | 1301 |
| 1 | 2020-01-07 | 7654 | 7722 | 7619 | 7715 | 4960700 | 7715 | 1301 |
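For comparison, here is the same spot check done with plain SQL through the cursor created earlier, as a minimal sketch (pandas output is just easier on the eyes):

python
#Equivalent check with plain SQL, using the cursor created above
cursor.execute('SELECT * FROM price LIMIT 2')
for row in cursor.fetchall():
    print(row)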
python
#Finally, close the connection
db.close()
How was that? Saving to your own DB was surprisingly easy, wasn't it? Fetching every securities code certainly takes time, but if you, say, collect everything up to the end of 2020 and then create a batch job that stores new data in the DB every day from 2021, the DB will keep itself up to date without much effort. (A rough sketch of such a batch follows below.)
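As one possible shape for such a batch, here is a sketch, not the article's actual code: it reuses `brands_generator` and the `price` table from above, and the assumption that a `(date, code)` pair uniquely identifies a row is mine:

python
import sqlite3
import pandas as pd

def daily_update(db_path='sample.db', codes=range(1301, 1310)):
    """Hypothetical daily batch: re-scrape and append only rows not yet stored."""
    db = sqlite3.connect(db_path)
    #Rows already stored, identified by (date, code)
    existing = pd.read_sql('SELECT date, code FROM price', db)
    #Re-scrape the yearly pages with the function defined earlier
    new_df = brands_generator(codes)
    #Keep only rows whose (date, code) pair is not yet in the table
    merged = new_df.merge(existing, on=['date', 'code'], how='left', indicator=True)
    to_add = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
    if not to_add.empty:
        to_add.to_sql('price', db, if_exists='append', index=False)
    db.close()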
Over these three articles I have shared what I researched about investing, but since Qiita is a programming knowledge-sharing site, I will refrain from going into detailed analysis beyond this point. I hope you, the reader, will scrape various sites yourself, build up your own analysis data, and that together we can turn a profit in 2021. Let's go!
Also, although they are not investment related, I post machine learning and other articles every week, so if you like them, an LGTM or a stock helps keep my motivation up!