Recently, web crawling has become popular in a community I belong to, so I wanted to try it myself.
First, the target screen.
Is it okay to crawl a site on my own, or will I get in trouble? On that point, I believe most sites are fine with it as long as it is not used commercially, so I will assume it is okay here.
Now, I would like to crawl the investment index shown on this screen.
The execution environment is as follows
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.15.1
BuildVersion: 19B88
$ python --version
Python 3.7.4
I will use Beautiful Soup as the scraping tool.
Install the package with pip:
$ pip install beautifulsoup4
For site access, I also use the requests package.
Since the screen to be crawled requires login, I implement the login process, along with the access to the target screen. It looks like this:
app.py
import requests
from bs4 import BeautifulSoup


class Scraper():
    def __init__(self, user_id, password):
        self.base_url = "https://site1.sbisec.co.jp/ETGate/"
        self.user_id = user_id
        self.password = password
        self.login()

    def login(self):
        # Form parameters taken from the site's login page
        post = {
            'JS_FLG': "0",
            'BW_FLG': "0",
            "_ControlID": "WPLETlgR001Control",
            "_DataStoreID": "DSWPLETlgR001Control",
            "_PageID": "WPLETlgR001Rlgn20",
            "_ActionID": "login",
            "getFlg": "on",
            "allPrmFlg": "on",
            "_ReturnPageInfo": "WPLEThmR001Control/DefaultPID/DefaultAID/DSWPLEThmR001Control",
            "user_id": self.user_id,
            "user_password": self.password
        }
        # Keep the session so the login cookie is reused on later requests
        self.session = requests.Session()
        res = self.session.post(self.base_url, data=post)
        res.encoding = res.apparent_encoding

    def financePage_html(self, ticker):
        # Form parameters for the stock-detail screen
        post = {
            "_ControlID": "WPLETsiR001Control",
            "_DataStoreID": "DSWPLETsiR001Control",
            "_PageID": "WPLETsiR001Idtl10",
            "getFlg": "on",
            "_ActionID": "stockDetail",
            "s_rkbn": "",
            "s_btype": "",
            "i_stock_sec": "",
            "i_dom_flg": "1",
            "i_exchange_code": "JPN",
            "i_output_type": "0",
            "exchange_code": "TKY",
            "stock_sec_code_mul": str(ticker),
            "ref_from": "1",
            "ref_to": "20",
            "wstm4130_sort_id": "",
            "wstm4130_sort_kbn": "",
            "qr_keyword": "",
            "qr_suggest": "",
            "qr_sort": ""
        }
        html = self.session.post(self.base_url, data=post)
        html.encoding = html.apparent_encoding
        return html

    def get_fi_param(self, ticker):
        html = self.financePage_html(ticker)
        soup = BeautifulSoup(html.text, 'html.parser')
        print(soup)
Execute it with the user ID, password, and securities code as arguments. Handling login and page access is straightforward: I post the form parameters obtained from the URL using the requests package. To check that the page was accessed properly, I print the result parsed with BeautifulSoup(html.text, 'html.parser').
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="ja">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="text/css" http-equiv="Content-Style-Type"/>
<meta content="text/javascript" http-equiv="Content-Script-Type"/>
<meta content="IE=EmulateIE8" http-equiv="X-UA-Compatible"/><!--gethtml content start-->
<!-- header_domestic_001.html/// -->
・
・
・
<h4 class="fm01"><em>Investment index</em> 20/09th term (consolidated)</h4>
</div>
</div>
<div class="mgt5" id="posElem_19-1">
<table border="0" cellpadding="0" cellspacing="0" class="tbl690" style="width: 295px;" summary="Investment index">
<col style="width:75px;"/>
<col style="width:70px;"/>
<col style="width:80px;"/>
<col style="width:65px;"/>
<tbody>
<tr>
<th><p class="fm01">Expected PER</p></th>
<td><p class="fm01">23.86 times</p></td>
<th><p class="fm01">Expected EPS</p></th>
<td><p class="fm01">83.9</p></td>
</tr>
<tr>
<th><p class="fm01">Actual PBR</p></th>
<td><p class="fm01">5.92 times</p></td>
<th><p class="fm01">Actual BPS</p></th>
<td><p class="fm01">338.33</p></td>
</tr>
<tr>
<th><p class="fm01">Expected dividend yield</p></th>
<td><p class="fm01">0.45%</p></td>
<th><p class="fm01">Expected dividend per share</p></th>
<td><p class="fm01">9〜10</p></td>
・
・
・
</script>
<script language="JavaScript" type="text/javascript">_satellite.pageBottom();</script></body>
</html>
I was able to successfully access the target screen and get the HTML as text.
As it stands, however, the output contains a lot of tags and CSS, so I will extract the necessary parts step by step.
First, let's get the block that contains the investment index. To get a specific block, use BeautifulSoup's find_all() function.
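As a minimal, self-contained illustration (the HTML below is a made-up snippet, not the actual page), find_all() returns a list of all tags matching the given name and attributes:

```python
from bs4 import BeautifulSoup

# Made-up snippet loosely mimicking the structure of the target page
html = """
<div id="clmMain">main contents</div>
<div id="clmSubArea">
  <h4 class="fm01"><em>Investment index</em></h4>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list of matching tags; take the first hit
div = soup.find_all('div', {'id': 'clmSubArea'})[0]
print(div.h4.em.string)
```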
On the screen, the investment index is included in the following block, which in the HTML sits inside <div id="clmSubArea">. Now, let's pass <div id="clmSubArea"> to the find_all() function.
app.py
def get_fi_param(self, ticker):
    dict_ = {}
    html = self.financePage_html(ticker)
    soup = BeautifulSoup(html.text, 'html.parser')
    div_clmsubarea = soup.find_all('div', {'id': 'clmSubArea'})[0]
    print(div_clmsubarea)
Running this produces:
<div id="clmSubArea">
<div class="mgt10">
<table border="0" cellpadding="0" cellspacing="0" class="tbl02" summary="layout">
<tbody>
・
・
・
<h4 class="fm01"><em>Investment index</em> 20/09th term (consolidated)</h4>
</div>
</div>
<div class="mgt5" id="posElem_19-1">
・
・
・
</tr>
</tbody>
</table>
</div>
I was able to get the target block.
From here, the same operation is simply repeated until the desired character string is obtained. Let's do it all at once.
app.py
def get_fi_param(self, ticker):
    dict_ = {}
    html = self.financePage_html(ticker)
    soup = BeautifulSoup(html.text, 'html.parser')
    div_clmsubarea = soup.find_all('div', {'id': 'clmSubArea'})[0]
    table = div_clmsubarea.find_all('table')[1]
    p_list = table.tbody.find_all('p', {'class': 'fm01'})
    per = p_list[1].string.replace('\n', '')
    print('Expected PER:' + per)
Get the <table> block of the investment index, then get all the <p class="fm01"> elements inside it. Finally, extract the character string from the target <p class="fm01"> and the process is complete.
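The same chain of lookups can be tried on a self-contained dummy snippet (the structure mirrors the real page, but the values here are made up):

```python
from bs4 import BeautifulSoup

# Dummy HTML mimicking the investment-index table of the real page
html = """
<div id="clmSubArea">
  <table summary="layout"><tbody><tr><td>layout</td></tr></tbody></table>
  <table summary="Investment index">
    <tbody>
      <tr>
        <th><p class="fm01">Expected PER</p></th>
        <td><p class="fm01">23.86 times</p></td>
      </tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
div_clmsubarea = soup.find_all('div', {'id': 'clmSubArea'})[0]
table = div_clmsubarea.find_all('table')[1]           # the second table holds the index
p_list = table.tbody.find_all('p', {'class': 'fm01'})
per = p_list[1].string.replace('\n', '')              # value cell follows the label
print('Expected PER:' + per)
```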
$ python -m unittest tests.test -v
test_lambda_handler (tests.test.TestHandlerCase) ...Expected PER:23.86 times
It's done. The result will be easier to use if you process the obtained character strings into JSON or a similar format.
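For example, the extracted strings could be cleaned up and serialized as JSON. This is only a sketch; the key names and the unit-stripping rule are my own assumptions, not part of the original code:

```python
import json

# Strings as they come out of the page, with units attached
raw = {'per': '23.86 times', 'pbr': '5.92 times', 'eps': '83.9'}

def to_number(text):
    """Strip a trailing unit such as ' times' and convert to float."""
    return float(text.split()[0])

data = {key: to_number(value) for key, value in raw.items()}
print(json.dumps(data))
```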
With crawling, I was able to get the information I wanted from the site quite easily. I think it is a useful technique in many situations, as long as you use it responsibly. Next, I plan to make this code handle multiple securities codes and display the results in HTML. If anything worth sharing comes out of that process, I would like to write about it another time.