Recently, web crawling has become popular in a community I belong to, so I wanted to try it myself.
First, the target screen.
Is it okay to crawl a site on my own, or will I get in trouble? On that point, I believe most sites are fine with it as long as it is not used commercially, so I will assume it is okay here.
Now, I would like to crawl the investment index shown on this screen.
The execution environment is as follows
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.15.1
BuildVersion: 19B88
$ python --version
Python 3.7.4
I will use Beautiful Soup as the scraping tool.
Install the package with pip:
$ pip install beautifulsoup4
For site access, I also use the requests package.
Since the screen to be crawled requires login, I implement the login process, along with the access to the target screen. It looks like this:
app.py
import requests
from bs4 import BeautifulSoup


class Scraper():
    def __init__(self, user_id, password):
        self.base_url = "https://site1.sbisec.co.jp/ETGate/"
        self.user_id = user_id
        self.password = password
        self.login()

    def login(self):
        # Form parameters taken from the site's login page
        post = {
            'JS_FLG': "0",
            'BW_FLG': "0",
            "_ControlID": "WPLETlgR001Control",
            "_DataStoreID": "DSWPLETlgR001Control",
            "_PageID": "WPLETlgR001Rlgn20",
            "_ActionID": "login",
            "getFlg": "on",
            "allPrmFlg": "on",
            "_ReturnPageInfo": "WPLEThmR001Control/DefaultPID/DefaultAID/DSWPLEThmR001Control",
            "user_id": self.user_id,
            "user_password": self.password
        }
        # Keep the session so the login cookie is reused on later requests
        self.session = requests.Session()
        res = self.session.post(self.base_url, data=post)
        res.encoding = res.apparent_encoding

    def financePage_html(self, ticker):
        # Form parameters for the stock-detail screen
        post = {
            "_ControlID": "WPLETsiR001Control",
            "_DataStoreID": "DSWPLETsiR001Control",
            "_PageID": "WPLETsiR001Idtl10",
            "getFlg": "on",
            "_ActionID": "stockDetail",
            "s_rkbn": "",
            "s_btype": "",
            "i_stock_sec": "",
            "i_dom_flg": "1",
            "i_exchange_code": "JPN",
            "i_output_type": "0",
            "exchange_code": "TKY",
            "stock_sec_code_mul": str(ticker),
            "ref_from": "1",
            "ref_to": "20",
            "wstm4130_sort_id": "",
            "wstm4130_sort_kbn": "",
            "qr_keyword": "",
            "qr_suggest": "",
            "qr_sort": ""
        }
        html = self.session.post(self.base_url, data=post)
        html.encoding = html.apparent_encoding
        return html

    def get_fi_param(self, ticker):
        html = self.financePage_html(ticker)
        soup = BeautifulSoup(html.text, 'html.parser')
        print(soup)
Execute it with the user ID, password, and securities code as arguments. Handling login and page access is straightforward: I post the form parameters obtained from the URL using the requests package. To check that the page was accessed properly, I print the result parsed with BeautifulSoup(html.text, 'html.parser').
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="ja">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="text/css" http-equiv="Content-Style-Type"/>
<meta content="text/javascript" http-equiv="Content-Script-Type"/>
<meta content="IE=EmulateIE8" http-equiv="X-UA-Compatible"/><!--gethtml content start-->
<!-- header_domestic_001.html/// -->
・
・
・
<h4 class="fm01"><em>Investment index</em> 20/09th term (consolidated)</h4>
</div>
</div>
<div class="mgt5" id="posElem_19-1">
<table border="0" cellpadding="0" cellspacing="0" class="tbl690" style="width: 295px;" summary="Investment index">
<col style="width:75px;"/>
<col style="width:70px;"/>
<col style="width:80px;"/>
<col style="width:65px;"/>
<tbody>
<tr>
<th><p class="fm01">Expected PER</p></th>
<td><p class="fm01">23.86 times</p></td>
<th><p class="fm01">Expected EPS</p></th>
<td><p class="fm01">83.9</p></td>
</tr>
<tr>
<th><p class="fm01">Actual PBR</p></th>
<td><p class="fm01">5.92 times</p></td>
<th><p class="fm01">Actual BPS</p></th>
<td><p class="fm01">338.33</p></td>
</tr>
<tr>
<th><p class="fm01">Expected dividend yield</p></th>
<td><p class="fm01">0.45%</p></td>
<th><p class="fm01">Expected dividend per share</p></th>
<td><p class="fm01">9〜10</p></td>
・
・
・
</script>
<script language="JavaScript" type="text/javascript">_satellite.pageBottom();</script></body>
</html>
I was able to successfully access the target screen and get the HTML as text.
As it stands, however, the output contains a lot of tags and CSS, so I will extract the necessary parts step by step.
First, let's get the block that contains the investment index. To get a specific block, use BeautifulSoup's find_all() function.
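As a minimal, self-contained illustration (the HTML below is a made-up snippet, not the actual page), find_all() returns a list of all tags matching the given name and attributes:

```python
from bs4 import BeautifulSoup

# Made-up snippet loosely mimicking the structure of the target page
html = """
<div id="clmMain">main contents</div>
<div id="clmSubArea">
  <h4 class="fm01"><em>Investment index</em></h4>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list of matching tags; take the first hit
div = soup.find_all('div', {'id': 'clmSubArea'})[0]
print(div.h4.em.string)
```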
On the screen, the investment index is included in the following block, which in the HTML sits inside <div id="clmSubArea">. Now, let's pass <div id="clmSubArea"> to the find_all() function.
app.py
def get_fi_param(self, ticker):
    dict_ = {}
    html = self.financePage_html(ticker)
    soup = BeautifulSoup(html.text, 'html.parser')
    div_clmsubarea = soup.find_all('div', {'id': 'clmSubArea'})[0]
    print(div_clmsubarea)
Running this produces:
<div id="clmSubArea">
<div class="mgt10">
<table border="0" cellpadding="0" cellspacing="0" class="tbl02" summary="layout">
<tbody>
・
・
・
<h4 class="fm01"><em>Investment index</em> 20/09th term (consolidated)</h4>
</div>
</div>
<div class="mgt5" id="posElem_19-1">
・
・
・
</tr>
</tbody>
</table>
</div>
I was able to get the target block.
From here, the same operation is simply repeated until the desired character string is obtained. Let's do it all at once.
app.py
def get_fi_param(self, ticker):
    dict_ = {}
    html = self.financePage_html(ticker)
    soup = BeautifulSoup(html.text, 'html.parser')
    div_clmsubarea = soup.find_all('div', {'id': 'clmSubArea'})[0]
    table = div_clmsubarea.find_all('table')[1]
    p_list = table.tbody.find_all('p', {'class': 'fm01'})
    per = p_list[1].string.replace('\n', '')
    print('Expected PER:' + per)
Get the <table> block of the investment index, then get all the <p class="fm01"> elements inside it. Finally, extract the character string from the target <p class="fm01"> and the process is complete.
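The same chain of lookups can be tried on a self-contained dummy snippet (the structure mirrors the real page, but the values here are made up):

```python
from bs4 import BeautifulSoup

# Dummy HTML mimicking the investment-index table of the real page
html = """
<div id="clmSubArea">
  <table summary="layout"><tbody><tr><td>layout</td></tr></tbody></table>
  <table summary="Investment index">
    <tbody>
      <tr>
        <th><p class="fm01">Expected PER</p></th>
        <td><p class="fm01">23.86 times</p></td>
      </tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
div_clmsubarea = soup.find_all('div', {'id': 'clmSubArea'})[0]
table = div_clmsubarea.find_all('table')[1]           # the second table holds the index
p_list = table.tbody.find_all('p', {'class': 'fm01'})
per = p_list[1].string.replace('\n', '')              # value cell follows the label
print('Expected PER:' + per)
```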
$ python -m unittest tests.test -v
test_lambda_handler (tests.test.TestHandlerCase) ...Expected PER:23.86 times
It's done. The result will be easier to use if you process the obtained character strings into JSON or a similar format.
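For example, the extracted strings could be cleaned up and serialized as JSON. This is only a sketch; the key names and the unit-stripping rule are my own assumptions, not part of the original code:

```python
import json

# Strings as they come out of the page, with units attached
raw = {'per': '23.86 times', 'pbr': '5.92 times', 'eps': '83.9'}

def to_number(text):
    """Strip a trailing unit such as ' times' and convert to float."""
    return float(text.split()[0])

data = {key: to_number(value) for key, value in raw.items()}
print(json.dumps(data))
```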
With crawling, I was able to get the information I wanted from the site quite easily. I think it is a useful technique in many situations, as long as you use it responsibly. Next, I plan to make this code handle multiple securities codes and display the results in HTML. If anything worth sharing comes out of that process, I would like to write about it another time.