Collecting data is a hassle, isn't it? I want to analyze boat races in the future, so as data-collection practice I scrape the odds table on the official boat race website (https://www.boatrace.jp/owpc/pc/race/odds3t?rno=12&jcd=02&hd=20200511).
- The only language I could think of that seemed useful for scraping was python, so I use python3.7.
- python3.7 has a library called Beautiful Soup (~~the name sounds a little erotic~~) that seems useful for scraping.
- With beautifulsoup's css selectors, you can specify the location of the table without deciphering the html piece by piece!
- The CSS selector can be copied with the developer tools built into the browser (easy).
- Since this is beautifulsoup practice, I won't explain the method in detail (there are many other good articles!).
- This time, the goal is to pull out the information in the trifecta (3-rentan) odds table and put it into a dictionary.
print('sample1:', three_rentan_odds_dict['1']['2']['3'])
print('sample2:', three_rentan_odds_dict['6']['5']['4'])
# output:
# sample1: 47.2
# sample2: 285.7
Of course, a list would work too, but with a dictionary I can look up the odds directly by boat number without worrying about the order inside an array, so please don't dig into this too much!
I developed this with python3.7, but I think any 3.x version will work! It might be easier to copy and paste the code into jupyter or similar!
Install the libraries with pip.
# For pip
pip install requests beautifulsoup4 numpy
# For pipenv
pipenv install requests beautifulsoup4 numpy
Whether intentionally or not, **never put a load on the other party's server**. There have been cases of arrest. Specifically, copying and pasting the source code in this article as is and running it just once is fine, but **scraping the information for the entire schedule with a for statement increases the load on the server and may cause trouble, so please do not do it**. Scraping itself for the purpose of data analysis does not appear to be illegal.
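One way to stick to the "fetch it only once" rule while experimenting is to save the downloaded page to disk and reuse it on later runs. The following is only a minimal sketch of that idea: `load_html_cached`, its `fetch` parameter, and the cache file name are hypothetical names made up for this illustration, not part of the original code.

```python
from pathlib import Path
from urllib.request import urlopen


def load_html_cached(url, cache_path, fetch=None):
    """Return the page bytes, hitting the network only if no cached copy exists.

    Hypothetical helper for illustration; not part of the original article.
    """
    cache_file = Path(cache_path)
    if cache_file.exists():
        # Reuse the saved copy so re-running the script sends no new request.
        return cache_file.read_bytes()
    if fetch is None:
        # Default fetcher: the same urlopen(...).read() call used in this article.
        def fetch(target):
            with urlopen(target) as response:
                return response.read()
    html_bytes = fetch(url)
    cache_file.write_bytes(html_bytes)
    return html_bytes
```

With this, something like `html_content = load_html_cached(target_url, 'odds3t.html')` would contact the server on the first run only; later runs read the local file.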
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Example: the trifecta (3-rentan) odds table for race 12 at Toda Racecourse on May 11, 2020
target_url = \
    'https://www.boatrace.jp/owpc/pc/race/odds3t?rno=12&jcd=02&hd=20200511'
# Load the html
html_content = urlopen(target_url).read()
print(type(html_content))
# output
# <class 'bytes'>
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
print(type(soup))
# output
# <class 'bs4.BeautifulSoup'>
Use the `select` method provided by beautifulsoup. With the `select` method, you can scrape by specifying the location of an html tag with a css selector.
This time, I want to take out the trifecta odds table.
So, use the browser's developer tools to get the css selector of the target part.
Specifically, copy the selector of the table and paste it into `target_table_selector`.
# Paste the copied css selector
target_table_selector = \
'body > main > div > div > div > '\
'div.contentsFrame1_inner > div:nth-child(6) > table'
# Fetch the html matched by the selector with the select_one method
odds_table = soup.select_one(target_table_selector)
print(type(odds_table))
# output:
# <class 'bs4.element.Tag'>
# Running print(odds_table) displays the html of only the specified table part.
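To see how `select_one` and `select` behave without touching the real site, here is a small sketch on hand-written HTML. The markup below is a made-up stand-in, not the actual boatrace.jp page, and the `sample_` names are chosen only for this illustration. It also shows that `select_one` returns `None` when the selector matches nothing, which is what you would see if the page layout changed and the copied selector went stale.

```python
from bs4 import BeautifulSoup

# Tiny hand-written stand-in for the real page (NOT the actual site markup).
sample_html = """
<body>
  <div class="contentsFrame1_inner">
    <div>title</div>
    <div><table><tbody>
      <tr><td class="oddsPoint">47.2</td><td class="oddsPoint">60.3</td></tr>
      <tr><td class="oddsPoint">14.7</td><td class="oddsPoint">13.3</td></tr>
    </tbody></table></div>
  </div>
</body>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# select_one returns the first matching Tag, or None when nothing matches.
found = sample_soup.select_one(
    'div.contentsFrame1_inner > div:nth-child(2) > table')
missing = sample_soup.select_one(
    'div.contentsFrame1_inner > div:nth-child(6) > table')

if missing is None:
    print('selector did not match; the page layout may have changed')

# select returns a list of every matching Tag, in document order.
cells = found.select('td.oddsPoint')
print([cell.text for cell in cells])
# ['47.2', '60.3', '14.7', '13.3']
```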
Looking at the browser's developer tools we saw earlier, only the `tbody` part of `odds_table` is needed to extract the elements, so specify it with `select_one`. Then, to store each row as a list, use `select` to specify the `tr` part, which returns the rows as a list.
# Select the tbody
odds_table_elements = odds_table.select_one('tbody')
# Select every tr and store the rows as a list
row_list = odds_table_elements.select('tr')
print(len(row_list))
# output:
# 20 :Matches the number of rows in the table
Next, looking at the tags that store the odds values themselves, we can see that they are td tags with the class `oddsPoint`. Since we want to extract these for each row, we first create a function.
# Processing performed for each row
def getoddsPoint2floatlist(odds_tr):
    # Get the list of html tags where the odds values are stored
    html_list = odds_tr.select('td.oddsPoint')
    # print(html_list[0])
    # example output:
    # <td class="oddsPoint">47.2</td>

    # .text extracts only the content enclosed by the tags
    text_list = list(map(lambda x: x.text, html_list))
    # print(text_list)
    # example output:
    # ['47.2', '60.3', '588.7', '52.8', '66.0', '248.7']

    # Odds are decimal numbers, so cast to float
    float_list = list(map(
        lambda x: float(x), text_list))
    return float_list
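The cast to float above assumes every cell contains a plain number, which holds for the example race. As a hedged sketch only: if a cell ever held something else, `float()` would raise `ValueError`. The helper below, `parse_odds_text`, is a hypothetical name, and the `'-'` placeholder and comma-separated value in the demo are assumptions for illustration, not something observed on the site.

```python
def parse_odds_text(text):
    """Convert one cell's text to float, or None if it is not a number.

    Hypothetical helper for illustration; not part of the original article.
    """
    # Strip surrounding whitespace and drop thousands separators (assumed format).
    cleaned = text.strip().replace(',', '')
    try:
        return float(cleaned)
    except ValueError:
        # Non-numeric cell: return None instead of stopping the whole scrape.
        return None


print([parse_odds_text(t) for t in ['47.2', ' 60.3 ', '1,276', '-']])
# [47.2, 60.3, 1276.0, None]
```

Returning `None` keeps the row length intact, so the position of each value in the table is preserved for the later conversion step.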
Use the map function to build a matrix that holds only the odds values of the entire table.
odds_matrix = list(map(
    lambda x: getoddsPoint2floatlist(x),
    row_list
))
print(odds_matrix)
# output
# [[47.2, 60.3, 588.7, 52.8, 66.0, 248.7],
# [14.7, 13.3, 994.9, 361.6, 363.8, 1276.0],
# [12.0, 11.1, 747.7, 67.1, 137.8, 503.6],
# [26.7, 26.6, 1155.0, 96.5, 123.7, 414.5],
# [157.0, 188.8, 566.8, 50.4, 64.3, 241.5],
# [242.2, 215.7, 660.5, 261.5, 314.5, 1037.0],
# [237.5, 190.8, 561.6, 36.4, 66.8, 183.4],
# [403.5, 281.1, 926.8, 49.2, 73.1, 183.6],
# [35.0, 25.4, 1276.0, 750.0, 930.3, 2462.0],
# [219.2, 152.2, 959.6, 517.5, 799.1, 1950.0],
# [59.6, 23.6, 963.4, 650.0, 1139.0, 1779.0],
# [89.4, 38.4, 1433.0, 639.7, 1237.0, 2321.0],
# [34.6, 23.8, 1019.0, 63.9, 119.7, 387.5],
# [212.5, 143.8, 752.3, 36.9, 64.1, 174.3],
# [76.3, 30.5, 1231.0, 270.8, 452.2, 952.1],
# [79.6, 35.8, 1614.0, 44.9, 84.1, 244.4],
# [83.7, 90.6, 2031.0, 110.1, 171.1, 391.8],
# [356.3, 308.5, 1552.0, 63.2, 103.9, 201.7],
# [159.7, 77.7, 1408.0, 326.7, 560.3, 1346.0],
# [136.0, 69.0, 1562.0, 71.4, 148.1, 285.7]]
**This completes the scraping!!**
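The scraped matrix has 20 rows of 6 values, which is 120 = 6 × 5 × 4, the number of ordered first/second/third combinations among six boats. A quick sanity check along those lines can catch a changed page layout before the data is used. This is only a sketch; `check_odds_matrix_shape` is a hypothetical helper name.

```python
from itertools import permutations


def check_odds_matrix_shape(matrix, boats=6):
    """Raise ValueError unless the matrix can hold every trifecta combination.

    Hypothetical helper for illustration; not part of the original article.
    """
    # Number of ordered triples of distinct boats: 6 * 5 * 4 = 120.
    expected_total = len(list(permutations(range(boats), 3)))
    # One column per first-place boat, so 120 / 6 = 20 rows.
    expected_rows = expected_total // boats
    if len(matrix) != expected_rows:
        raise ValueError(f'expected {expected_rows} rows, got {len(matrix)}')
    for index, row in enumerate(matrix):
        if len(row) != boats:
            raise ValueError(
                f'row {index} has {len(row)} values, expected {boats}')
    return True
```

Calling `check_odds_matrix_shape(odds_matrix)` right after scraping would raise immediately if a row or column went missing.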
Converting the matrix into a dictionary is not an essential part of scraping, so I will omit a detailed explanation.
import numpy as np
# Convert to a numpy array
odds_matrix = np.array(odds_matrix)
# Transpose, flatten, and convert to a list
odds_list = list(odds_matrix.T.reshape(-1))
# Store in a dictionary (keys are boat numbers as strings)
three_rentan_odds_dict = {}
for fst in range(1, 7):
    if str(fst) not in three_rentan_odds_dict:
        three_rentan_odds_dict[str(fst)] = {}
    for snd in range(1, 7):
        if snd != fst:
            if str(snd) not in three_rentan_odds_dict[str(fst)]:
                three_rentan_odds_dict[str(fst)][str(snd)] = {}
            for trd in range(1, 7):
                if trd != fst and trd != snd:
                    three_rentan_odds_dict[str(fst)][str(snd)][str(trd)] = \
                        odds_list.pop(0)
print('sample1:', three_rentan_odds_dict['1']['2']['3'])
print('sample2:', three_rentan_odds_dict['6']['5']['4'])
# output:
# sample1: 47.2
# sample2: 285.7
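For reference, the same mapping can also be written with only the standard library, using `zip` for the transpose and `itertools.permutations` for the second/third pairs. This is an alternative sketch, not the method used above. `build_trifecta_dict` and `demo_matrix` are hypothetical names, and the demo values are synthetic (row * 10 + column) so the positions are easy to check by eye.

```python
from itertools import permutations


def build_trifecta_dict(odds_matrix):
    """Map a 20-row x 6-column matrix to dict[first][second][third].

    Hypothetical helper for illustration; not part of the original article.
    """
    # zip(*rows) transposes: each column holds the 20 odds for one first-place boat.
    columns = list(zip(*odds_matrix))
    result = {}
    for fst, column in zip(range(1, 7), columns):
        # Ordered (second, third) pairs among the other five boats, in ascending
        # order: the same row order the nested loops above assume.
        others = [boat for boat in range(1, 7) if boat != fst]
        for (snd, trd), odds in zip(permutations(others, 2), column):
            result.setdefault(str(fst), {}).setdefault(str(snd), {})[str(trd)] = odds
    return result


# Synthetic demo data: the value encodes its own position as row * 10 + column.
demo_matrix = [[row * 10 + col for col in range(6)] for row in range(20)]
demo_dict = build_trifecta_dict(demo_matrix)
print(demo_dict['1']['2']['3'])  # row 0, column 0 -> 0
print(demo_dict['6']['5']['4'])  # row 19, column 5 -> 195
```

This version avoids the repeated `pop(0)` on a list and does not need numpy, at the cost of being a little denser to read.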