**This is the third (and final) part** of my series summarizing what I have investigated about using Python for investing. As a wrap-up of the series, this article walks through the whole flow of scraping stock data and storing it in a DB. It assumes you have read all of the articles below. (The code here is commented as much as possible, but please see the past articles for details.)
**Previous articles**
**Handling a database easily with Python (SQLite3)**
**Web scraping with Python ① (prior knowledge for scraping)**
**Web scraping with Python ② (actually scraping a stock site)**
The target site's robots.txt contains `Allow: /`, so scraping is permitted. For example, if you search for Toyota Motor Corporation from the top page of the stock investment memo site, you can jump to its annual transition DB page, and selecting "2020" there gives you the data for 2020.

If you look at the URL at that point, you can see that it has the form `/stock/<securities code>/<year>`. In other words, by changing the securities-code part you can switch to the data of any stock you like.
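As a quick illustration of this URL pattern, here is a minimal sketch (the code and year below are just example placeholders):

python
#A minimal sketch of the URL pattern described above; code and year are placeholders
def build_url(code, year):
    return 'https://kabuoji3.com/stock/{}/{}/'.format(code, year)

print(build_url(7203, 2020))  #-> https://kabuoji3.com/stock/7203/2020/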
**The details were already explained in the previous article, so please check there.**
If you inspect the target elements with Chrome's developer tools, you can see that the data sits in the `thead` and `tbody` parts under `<table class="stock_table">`. The difference from the previous article is that the stock prices are not all placed directly under a single `tbody`; each row is stored in its own `tbody`. It is not a big deal, though.
Check your current user agent here: https://www.ugtop.com/spill.shtml
python
import requests
from bs4 import BeautifulSoup
#URL to be scraped
url = 'https://kabuoji3.com/stock/7203/2020/'
#User agent of your current browser (checked at the URL above). Rewrite it to match your environment.
headers = {"User-Agent": "Mozilla/*** Chrome/*** Safari/***"}
#Fetch the page (HTML) with the requests library
response = requests.get(url, headers = headers)
#Create a BeautifulSoup object from the fetched HTML
soup = BeautifulSoup(response.content, "html.parser")
#First find the thead with find(), then extract all tr elements inside it with find_all()
tag_thead_tr = soup.find('thead').find_all('tr')
print(tag_thead_tr)
Execution result
[<tr>
<th>date</th>
<th>Open price</th>
<th>High price</th>
<th>Low price</th>
<th>closing price</th>
<th>Volume</th>
<th>Closing price adjustment</th>
</tr>]
python
'''
In the table with class="stock_table stock_data_table", each price row is stored in its own tbody directly below it.
So first get the table element, then extract its contents.
'''
#Search for the table with findAll and take the 0th element (there is only one anyway)
table = soup.findAll("table", {"class":"stock_table stock_data_table"})[0]
#Get the tr tags inside it. The first one is the thead header row, so skip it with [1:]. (Why not tbody? Explained later.)
tag_tbody_tr = table.findAll("tr")[1:]
#Show only the first row as an example
print(tag_tbody_tr[0])
Execution result
<tr>
<td>2020-01-06</td>
<td>7599</td>
<td>7626</td>
<td>7530</td>
<td>7565</td>
<td>6672300</td>
<td>7565</td>
</tr>
As for the price rows, the actual HTML looks like the figure below: if you do `findAll` on `tbody`, you only get the first `<tbody> ~ </tbody>`, i.e. the 2020/1/6 row. This comes from the HTML rule that once `<tbody>` has been written, it can be omitted afterwards. That is why, unlike the previous article, we had to search for the `<tr>` tags, which are written without omission for every date.
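If you want to confirm this behaviour yourself, a quick check like the following (a sketch that reuses the soup and table objects from the code above) compares how many hits a tbody search gives versus a tr search:

python
#Sanity check: explicit tbody tags found by the parser vs. tr rows (header excluded)
print(len(soup.findAll('tbody')))    #few hits, because omitted tbody tags are not in the raw HTML
print(len(table.findAll('tr')[1:]))  #one tr per trading day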
The stock name and securities code, which we need when storing into the DB, are also scraped from the same page. As the execution result below shows, the securities code cannot be obtained on its own from this page, but in practice we only need to extract its first four digits.
python
#In the page header (class base_box_jp), the stock name and code are embedded in a span tag with class "jp"
name = soup.findAll("span", {"class":"jp"})[0]
print(name.get_text())
Execution result
'7203 Toyota Motor(stock)'
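Since the first four characters of this string are the securities code, a simple slice is enough to pull it out (the same trick is used inside the scraping function later):

python
#Take only the first 4 characters, which are the securities code
code = name.get_text()[:4]
print(code)  #7203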
Now that we know how to extract both pieces of data, let's apply this to all stocks and write the results to the DB. First, we have to think about how to store them in the database (**this is called the schema**).
Since this is a simple example, the ER diagram (a diagram representing the DB design) looks like the one below. I used draw.io because it is a simple ER-diagram tool. Reference: How to make an ER diagram [Simplified version]
・ The stock master stores the securities code and stock name (ideally also the listed market).
・ The stock price table holds all the price data for every stock; the `code` column identifies which stock each row belongs to.
・ **This article only covers acquiring the prices (stock price table) as an example. The stock master is easy, so please refer to this article and try it yourself.**
・ **Strictly speaking, further processing of the "adjusted closing price" (stock splits, etc.) is also required, but it is not covered in this article.** A rough SQL sketch of the two tables follows this list.
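Here is that rough sketch of the schema in SQL. The table and column names are illustrative assumptions only; in this article the price table is actually created implicitly by `to_sql` later, so you do not need to run this:

python
import sqlite3

#Illustrative schema only; names are assumptions, not the tables created later by to_sql
schema_sql = """
CREATE TABLE IF NOT EXISTS brand (   -- stock master
    code TEXT PRIMARY KEY,           -- securities code
    name TEXT                        -- stock name (the listed market could be added too)
);
CREATE TABLE IF NOT EXISTS price (   -- stock price table
    date TEXT,                       -- trading date
    open REAL, high REAL, low REAL, close REAL,
    volume INTEGER,
    adj_close REAL,                  -- adjusted closing price
    code TEXT                        -- which stock the row belongs to
);
"""
with sqlite3.connect('schema_sample.db') as con:
    con.executescript(schema_sql)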
There are various ways to proceed from here, but since I am used to pandas, I build the price table as a pandas DataFrame first and then convert it to SQL with `to_sql`, and that is how this article proceeds. You do not have to go through pandas; feel free to do it your own way.
**Note that the `time.sleep(1)` part is there to limit the load we put on the site while scraping.** If the site specifies a required interval, replace the number of seconds accordingly.
python
import time
import pandas as pd
def get_brand(code):
    """
    Scrape the page for a given securities code, convert it to a DataFrame and return it.
    """
    #User agent of your current browser (checked earlier). Rewrite it to match your environment.
    headers = {"User-Agent": "Mozilla/*** Chrome/*** Safari/***"}
    #Build the URL of the stock to fetch
    url = 'https://kabuoji3.com/stock/' + str(code) + '/2020/'
    #If the price data does not exist, return None via exception handling
    try:
        #Fetch the page and create a BeautifulSoup object from the HTML
        response = requests.get(url, headers = headers)
        soup = BeautifulSoup(response.content, "html.parser")
        #Get the stock name (the securities code is its first 4 characters)
        code_name = soup.findAll("span", {"class":"jp"})[0]
        #Get the header (column) information
        tag_thead_tr = soup.find('thead').find_all('tr')
        head = [h.text for h in tag_thead_tr[0].find_all('th')]
        #Get the price data and convert it to a DataFrame
        table = soup.findAll("table", {"class":"stock_table stock_data_table"})[0]
        tag_tbody_tr = table.findAll("tr")[1:]
        data = []
        for i in range(len(tag_tbody_tr)):
            data.append([d.text for d in tag_tbody_tr[i].find_all('td')])
        df = pd.DataFrame(data, columns = head)
        #Add a code column with assign: the first 4 characters of code_name are the securities code
        df = df.assign(code=code_name.get_text()[:4])
        #For debugging: print the fetched URL. Not required in production code
        print(url)
    except (ValueError, IndexError, AttributeError):
        return None
    return df

def brands_generator(code_range):
    """
    Generate securities codes one by one and concatenate the scraped results.
    """
    for i,code in enumerate(code_range):
        #Create an empty DataFrame only on the first iteration
        if i==0:
            cols = ['date', 'Open price', 'High price', 'Low price', 'closing price', 'Volume', 'Closing price adjustment', 'code']
            df = pd.DataFrame(index=[], columns=cols)
        #Pass the generated securities code to the scraping function
        brand = get_brand(code)
        #If data was obtained, concatenate it
        if brand is not None:
            df = pd.concat([df, brand]).reset_index(drop=True)
        #Pause for 1 second (to limit scraping load)
        time.sleep(1)
    return df
"""
As an example, we fetch securities codes 1301 through 1309 here (the range end is exclusive).
For the real run you can go from 1301 to 9999 (it naturally takes time).
"""
df = brands_generator(range(1301,1310))
Execution result
https://kabuoji3.com/stock/1301/2020/
https://kabuoji3.com/stock/1305/2020/
https://kabuoji3.com/stock/1306/2020/
https://kabuoji3.com/stock/1308/2020/
https://kabuoji3.com/stock/1309/2020/
The URLs of the securities codes that were actually fetched appear in the output. Codes such as 1302 and 1307 do not exist, so nothing was fetched for them, which is exactly the expected behaviour. It looks like the scraping works correctly; a small sketch for checking the skipped codes follows.
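To check programmatically which codes in the range produced no data, here is a small sketch using the df obtained above:

python
#Codes that actually produced data
fetched = set(df['code'].astype(str).unique())
#Codes in the requested range that produced nothing (non-existent or delisted)
missing = [c for c in range(1301, 1310) if str(c) not in fetched]
print(missing)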
python
#Show only the first two rows as a sample
df.head(2)
Displaying the resulting df with pandas shows the data in chronological order, as below. The securities code is also stored in the `code` column we added, so everything looks fine.
| | date | Open price | High price | Low price | closing price | Volume | Closing price adjustment | code |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-06 | 7599 | 7626 | 7530 | 7565 | 6672300 | 7565 | 1301 |
| 1 | 2020-01-07 | 7654 | 7722 | 7619 | 7715 | 4960700 | 7715 | 1301 |
Now that the data is in DataFrame form, let's finally put it into the DB.
python
import sqlite3
#Database name, with a .db extension
dbname = 'sample.db'
#Create (or connect to) the database
db = sqlite3.connect(dbname, isolation_level=None)
#Convert df into the database with to_sql. The DB table name is price
df.to_sql('price', db, if_exists='append', index=None)
#Create a cursor object for the database
cursor = db.cursor()
#Check with SQL whether the price table was really created
sql = """SELECT name FROM sqlite_master WHERE TYPE='table'"""
for t in cursor.execute(sql):
print(t)
Execution result
('price',)
You can see that the price table was created properly. Finally, let's read this database back into a DataFrame to confirm that its contents are the same.
python
#Read the price table of the created database again with pandas
df_db = pd.read_sql('SELECT * FROM price', db)
#Show only the first two rows, as before
df_db.head(2)
Naturally, the result is the same, which confirms that the contents of the DB match the DataFrame we extracted by scraping earlier. (You can also inspect the contents with SQL commands, but personally **pandas is a hundred times easier to read**; a plain-SQL version of the same check follows the table.)
| | date | Open price | High price | Low price | closing price | Volume | Closing price adjustment | code |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-06 | 7599 | 7626 | 7530 | 7565 | 6672300 | 7565 | 1301 |
| 1 | 2020-01-07 | 7654 | 7722 | 7619 | 7715 | 4960700 | 7715 | 1301 |
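For comparison, here is the same spot check done with plain SQL through the cursor created earlier, as a minimal sketch (pandas output is just easier on the eyes):

python
#Equivalent check with plain SQL, using the cursor created above
cursor.execute('SELECT * FROM price LIMIT 2')
for row in cursor.fetchall():
    print(row)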
python
#Finally, close the connection
db.close()
How was that? Saving to your own DB was surprisingly easy, wasn't it? Fetching every securities code certainly takes time, but if you, say, collect everything up to the end of 2020 and then create a batch job that stores new data in the DB every day from 2021, the DB will keep itself up to date without much effort. (A rough sketch of such a batch follows below.)
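As one possible shape for such a batch, here is a sketch, not the article's actual code: it reuses `brands_generator` and the `price` table from above, and the assumption that a `(date, code)` pair uniquely identifies a row is mine:

python
import sqlite3
import pandas as pd

def daily_update(db_path='sample.db', codes=range(1301, 1310)):
    """Hypothetical daily batch: re-scrape and append only rows not yet stored."""
    db = sqlite3.connect(db_path)
    #Rows already stored, identified by (date, code)
    existing = pd.read_sql('SELECT date, code FROM price', db)
    #Re-scrape the yearly pages with the function defined earlier
    new_df = brands_generator(codes)
    #Keep only rows whose (date, code) pair is not yet in the table
    merged = new_df.merge(existing, on=['date', 'code'], how='left', indicator=True)
    to_add = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
    if not to_add.empty:
        to_add.to_sql('price', db, if_exists='append', index=False)
    db.close()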
Over these three articles I have shared what I researched about investing, but since Qiita is a programming knowledge-sharing site, I will refrain from going into detailed analysis beyond this point. I hope you, the reader, will scrape various sites yourself, build up your own analysis data, and that together we can turn a profit in 2021. Let's go!
Also, although they are not investment related, I post machine learning and other articles every week, so if you like them, an LGTM or a stock helps keep my motivation up!