Since it is summer vacation, I tried out scikit-learn as a bit of study. Think of it as a summer-vacation independent research project. The content is beginner-level, so please bear with me.
I really wanted to do something machine-learning-like, but since I lacked both knowledge and data, I started with what I could manage. The summer high school baseball tournament at Koshien is just heating up (for me personally, at least), so I decided to cluster the representative schools using data from their local tournaments.
Collecting data such as individual player stats would probably allow all sorts of analysis, but for a first attempt I decided to use basic team-level data such as team batting average and ERA.
I created the batting data by referring to this site. It summarizes each representative school's batting average, home runs, sacrifice bunts, and stolen bases in its local tournament. Only one school stands out for its number of home runs. Incidentally, if you look closely, the schools are listed in prefecture-code order.
https://github.com/radiocat/study-sklearn/blob/master/hs-bb/batting-2016.csv
I created the pitching data by referring to [this site](http://koshien.site/wp/2016/08/05/%E9%AB%98%E6%A0%A1%E9%87%8E%E7%90%83%E5%A4%8F%E3%81%AE%E7%94%B2%E5%AD%90%E5%9C%92%E5%87%BA%E5%A0%B4%E6%A0%A1%E6%8A%95%E6%89%8B%E6%88%90%E7%B8%BE/). It summarizes the main pitchers, innings pitched, runs allowed, and ERA for each representative school's local tournament. When no single pitcher accounts for more than 60% of the pitching, the figures appear to be calculated over two or three pitchers. I counted the number of main pitchers and added it as a separate column.
https://github.com/radiocat/study-sklearn/blob/master/hs-bb/pitching-2016.csv
I set the number of clusters to 5 for the time being. There is no particular basis for that number.
The algorithm is k-means. I chose it not because I judged it to be the best fit, but because I don't yet know enough to pick a different one.
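The cluster count of 5 was picked without a particular basis. One common way to sanity-check such a choice is to compare silhouette scores across several candidate counts. The following is only a minimal sketch of that idea, not something the original analysis did: the matrix `X` is made-up random data standing in for the 49-school feature matrix, and `silhouette_by_k` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up stand-in for a 49 x 5 feature matrix (NOT the real tournament data).
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 5))


def silhouette_by_k(X, k_values, seed=0):
    """Return {k: silhouette score} for each candidate cluster count."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


scores = silhouette_by_k(X, range(2, 9))
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")

# A higher silhouette score suggests better-separated clusters.
best_k = max(scores, key=scores.get)
print("highest silhouette at k =", best_k)
```

With only 49 schools the scores are likely to be noisy, so the result is better treated as a hint than as a firm answer.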
#coding:utf-8
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
dataframe = pd.read_csv('batting-2016.csv')
array = np.array([dataframe['Number of games'].tolist(),
dataframe['batting average'].tolist(),
dataframe['Home run'].tolist(),
dataframe['Sacrifice'].tolist(),
dataframe['Stolen base'].tolist()
], dtype=float)  # np.float was removed in NumPy 1.24; the built-in float is equivalent
array = array.T
predict = KMeans(n_clusters=5).fit_predict(array)
print(predict)
#coding:utf-8
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
dataframe = pd.read_csv('pitching-2016.csv')
array = np.array([dataframe['Number of pitchers'].tolist(),
dataframe['Number of pitches'].tolist(),
dataframe['Conceded'].tolist(),
dataframe['Earned run average'].tolist()
], dtype=float)  # np.float was removed in NumPy 1.24; the built-in float is equivalent
array = array.T
predict = KMeans(n_clusters=5).fit_predict(array)
print(predict)
Next, I combined the batting and pitching stats and clustered them together.
#coding:utf-8
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
batting_dataframe = pd.read_csv('batting-2016.csv')
pitching_dataframe = pd.read_csv('pitching-2016.csv')
array = np.array([batting_dataframe['Number of games'].tolist(),
batting_dataframe['batting average'].tolist(),
batting_dataframe['Home run'].tolist(),
batting_dataframe['Sacrifice'].tolist(),
batting_dataframe['Stolen base'].tolist(),
pitching_dataframe['Number of pitchers'].tolist(),
pitching_dataframe['Number of pitches'].tolist(),
pitching_dataframe['Conceded'].tolist(),
pitching_dataframe['Earned run average'].tolist()
], dtype=float)  # np.float was removed in NumPy 1.24; the built-in float is equivalent
array = array.T
predict = KMeans(n_clusters=5).fit_predict(array)
print(predict)
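The three runs above share most of their code. As a rough sketch, the clustering step could be wrapped in one helper and the three label arrays gathered into a single table like the one that follows. The data here is randomly generated filler rather than the real CSV contents, the `School` column and the `cluster_labels` helper are hypothetical names, and fixing `random_state` is my own addition to make reruns repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def cluster_labels(frame, n_clusters=5, seed=0):
    """Run k-means on every column of `frame` and return one label per row."""
    X = frame.to_numpy(dtype=float)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)


# Made-up filler data (NOT the real tournament stats).
rng = np.random.default_rng(0)
schools = [f"School {i}" for i in range(49)]
batting = pd.DataFrame(
    rng.random((49, 5)),
    columns=['Number of games', 'batting average', 'Home run',
             'Sacrifice', 'Stolen base'],
)
pitching = pd.DataFrame(
    rng.random((49, 4)),
    columns=['Number of pitchers', 'Number of pitches',
             'Conceded', 'Earned run average'],
)

# Assumes both frames list the schools in the same row order.
combined = pd.concat([batting, pitching], axis=1)

summary = pd.DataFrame({
    'School': schools,
    'Batting': cluster_labels(batting),
    'Pitching': cluster_labels(pitching),
    'Combined': cluster_labels(combined),
})
print(summary.head())
```

If the two CSV files were not in the same row order, joining them on a school-name column would be safer than concatenating side by side.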
School | Batting | Pitching | Combined |
---|---|---|---|
Clark Memorial International | 1 | 2 | 2 |
Hokkai | 2 | 0 | 1 |
Hachinohe Gakuin Kosei | 3 | 2 | 3 |
Morioka Dai Fuzoku | 0 | 1 | 0 |
Tohoku | 4 | 0 | 1 |
Omagari | 0 | 2 | 2 |
Tsuruoka Higashi | 1 | 4 | 0 |
Seiko Gakuin | 0 | 3 | 0 |
Joso Gakuin | 0 | 1 | 2 |
Sakushin Gakuin | 0 | 3 | 0 |
Maebashi Ikuei | 0 | 0 | 1 |
Hanasaki Tokuharu | 1 | 1 | 3 |
Kisarazu Sogo | 4 | 2 | 1 |
Kanto Daiichi | 1 | 2 | 2 |
Hachioji | 3 | 3 | 4 |
Yokohama | 1 | 2 | 2 |
Chuetsu | 3 | 3 | 4 |
Toyama Daiichi | 4 | 4 | 3 |
Seiryo | 1 | 3 | 0 |
Hokuriku | 0 | 3 | 0 |
Yamanashi Gakuin | 0 | 4 | 0 |
Saku Chosei | 1 | 1 | 2 |
Chukyo | 0 | 4 | 0 |
Tokoha Kikugawa | 3 | 1 | 3 |
Toho | 3 | 1 | 4 |
Inabe Sogo | 0 | 4 | 0 |
Omi | 0 | 1 | 2 |
Kyoto Shoei | 0 | 1 | 2 |
Riseisha | 1 | 0 | 1 |
Ichi Amagasaki | 4 | 0 | 1 |
Chiben Gakuen | 1 | 2 | 2 |
Ichi Wakayama | 4 | 2 | 3 |
Sakai | 0 | 3 | 0 |
Izumo | 4 | 1 | 3 |
Soshi Gakuen | 0 | 1 | 2 |
Hiroshima Shinjo | 4 | 2 | 3 |
Takagawa Gakuen | 3 | 0 | 3 |
Naruto | 1 | 4 | 0 |
Jinsei Gakuen | 4 | 3 | 3 |
Matsuyama Seiryo | 4 | 2 | 3 |
Meitoku Gijuku | 0 | 4 | 0 |
Kyushu International University High School | 0 | 2 | 2 |
Karatsu Sho | 1 | 3 | 2 |
Nagasaki Commercial | 1 | 2 | 2 |
Shugakukan | 3 | 4 | 4 |
Oita | 0 | 3 | 2 |
Nichinan Gakuen | 0 | 4 | 0 |
Shonan | 2 | 2 | 1 |
Kadena | 1 | 2 | 2 |
There is probably plenty to nitpick here. Note that a label of 0 or 4 does not mean a stronger team: the labels are arbitrary and only group schools whose numbers show similar trends. Besides, local-tournament numbers tend to look completely different once teams reach Koshien, so there is little point in digging too deeply.
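Because the label numbers themselves carry no ranking, one way to see what each cluster represents is to look at the per-cluster averages of the input columns. The snippet below is a small sketch of that idea using made-up numbers, not the real stats; only the column names are taken from the batting script above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up filler data (NOT the real tournament stats).
rng = np.random.default_rng(1)
stats = pd.DataFrame({
    'batting average': rng.uniform(0.25, 0.45, size=49),
    'Home run': rng.integers(0, 10, size=49),
    'Stolen base': rng.integers(0, 25, size=49),
})

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(
    stats.to_numpy(dtype=float)
)

# Mean of each column within each cluster, plus the number of schools per cluster.
profile = stats.groupby(labels).mean()
profile['schools'] = stats.groupby(labels).size()
print(profile.round(3))
```

Reading the rows of `profile` side by side shows which columns actually separate the clusters, which is more informative than the label numbers alone.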
Since predicting the winning school is not the goal, I won't discuss the results any further. Still, I suspect the clusters would come out differently if the inputs were refined: for example, converting the stats to per-game averages, averaging across pitchers when a school has several main pitchers, or examining information such as the number of pitches for each pitcher and feeding more detailed numbers into the clustering. I think this kind of data preparation is an important part of data science, but I will stop here for now.
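The per-game recalculation mentioned above might look roughly like the sketch below. Standardizing the columns before k-means is my own addition rather than something the original analysis did: k-means relies on Euclidean distance, so columns with larger raw values would otherwise dominate. The data is randomly generated filler, and only the column names come from the batting script.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up filler data (NOT the real tournament stats).
rng = np.random.default_rng(2)
batting = pd.DataFrame({
    'Number of games': rng.integers(4, 9, size=49),
    'batting average': rng.uniform(0.25, 0.45, size=49),
    'Home run': rng.integers(0, 10, size=49),
    'Sacrifice': rng.integers(5, 30, size=49),
    'Stolen base': rng.integers(0, 25, size=49),
})

# Convert counting stats to per-game rates so schools that played
# more games do not look stronger just because of the extra games.
count_columns = ['Home run', 'Sacrifice', 'Stolen base']
per_game = batting[count_columns].div(batting['Number of games'], axis=0)
per_game.columns = [f"{name} per game" for name in count_columns]

# Batting average is already a rate, so it is used as-is.
features = pd.concat([batting[['batting average']], per_game], axis=1)

# Scale every column to zero mean and unit variance, then cluster.
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=5, n_init=10, random_state=0),
)
labels = model.fit_predict(features)
print(labels)
```

Whether per-game rates and scaling give more meaningful groups on the real data is something that would need to be checked against the actual CSV files.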