Hello! Are you enjoying the stove league (the baseball off-season)?
In this entry, I would like to introduce the baseball statistics known as "Sabermetrics", using professional baseball, which everyone loves, and pandas, which has become popular lately and is used for both work and hobbies.
This article is the December 4 entry of Python Advent Calendar 2016. **The time required to read the article, copy the code, and get it running is assumed to be about 33 minutes and 4 seconds.**
Starting Member
This article is a rewrite of my past article, "Let's see how great the batter Shohei Ohtani is with just a few lines of Python code", from the perspective of Python programming and baseball statistics.
Before we write Python code and analyze a batter, I would like to introduce Sabermetrics.
Explaining it properly would take more than a whole day, so I will introduce only the points needed for the hacking later in this article.
This is a quote from Wikipedia (Sabermetrics).
Sabermetrics (SABRmetrics, Sabermetrics) is an analysis method that objectively analyzes data from a statistical point of view in baseball and considers player evaluations and strategies.
Batting average, RBI, home runs, wins, ERA, errors, and so on: baseball has many indicators. Sabermetrics sets these aside for a moment, creates indexes for "objective analysis" based on hypotheses, and uses them to think about player evaluation and strategy. Both the analysis method and the way of thinking behind it are called Sabermetrics.
The indexes are built by making full use of existing data (score data such as at-bats, hits, and innings pitched) and detailed sensor data captured by speed guns, cameras, and radar.
This time, we will use score data (there is not much sensor data on the Web in the first place).
It is enough to grasp the following three points.
**When a team's run differential is zero, its winning percentage is 50% (Pythagorean expectation)**
**For batters, "not making an out" and "advancing to the next base" are what count!**
**Pitchers and fielders should "get as many outs as possible" and "not let the opponent advance to the next base"!**
Sabermetrics has the idea of "[Pythagorean expectation](https://ja.wikipedia.org/wiki/Pythagorean expectation)", and this winning percentage prediction is the basis of all baseball statistics.
**If the run differential is zero, the winning percentage is 50%; if it is positive, the team is likely to finish with more wins than losses, and if it is negative, with more losses than wins.**
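The relationship between runs and winning percentage described above can be sketched in a few lines. This is only a minimal illustration that assumes the classic exponent of 2; the function name `pythagorean_expectation` and the run totals are made up for the example and are not real team data.

```python
def pythagorean_expectation(runs_scored, runs_allowed, exponent=2.0):
    """Estimate winning percentage from runs scored and runs allowed.

    Classic form: RS^2 / (RS^2 + RA^2). The exponent is left as a parameter
    because other values are sometimes used; 2 is assumed here.
    """
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    if scored + allowed == 0:
        # No runs either way: treat it as an even record (an assumption).
        return 0.5
    return scored / (scored + allowed)


# Hypothetical season totals, not real team data
even = pythagorean_expectation(600, 600)      # run differential of zero
positive = pythagorean_expectation(650, 550)  # positive run differential
negative = pythagorean_expectation(550, 650)  # negative run differential
print(even, positive, negative)
```

With equal runs scored and allowed the estimate is exactly 50%, and it moves above or below 50% as the run differential becomes positive or negative.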
Also, in baseball the offense continues until three outs are made, so the principles above for batters and for pitchers and fielders should be valued. That is the guideline that emerges.
(There are various views, and I accept objections.) Sabermetrics tends to avoid plays with a high risk of making an out, such as sacrifice bunts and stolen bases.
For those who want to know a little more, please refer to "Article explained in 30 minutes" (a past article of mine that introduces the background and simple use cases; yes, this is a shameless plug).
Now, at last, we get to programming (sorry to keep you waiting).
Let's actually use Python.
This article assumes a particular environment, so please adapt the steps to your own environment as you read.
Fragmented code snippets will appear along the way, but the complete code is also provided at the end.
(To put it plainly, if you copy the final version, you should be able to reproduce everything.)
This time, we write the code with the following configuration.
If you are using pip, running this one line is enough (if you use Anaconda or similar, please adapt it accordingly).
> $ pip install ipython pandas beautifulsoup4 numpy lxml html5lib jupyter matplotlib seaborn
Start Jupyter notebook.
> $ jupyter notebook
Since we will draw graphs, write the magic function (the one that starts with %) and import pandas.
While we are at it, specify the maximum number of columns and rows displayed in the notebook to make later debugging easier.
%matplotlib inline
import pandas
# Set the maximum number of displayed columns / rows (30 columns, 10 rows)
pandas.options.display.max_columns = 30
pandas.options.display.max_rows = 10
Get the data with pandas' read_html function.
Behind the scenes, it uses beautifulsoup, html5lib, etc. to scrape the table tags.
url = 'http://npb.jp/bis/players/21825112.html'  # Dai-Kang Yang's player page (npb.jp)
df = pandas.read_html(url)  # Scrape the table tags; returns a list of pandas.core.frame.DataFrame objects
There are multiple tables on this page, so find out which table holds the batting record.
To give the answer first, index = 3 (the 4th table) is the batting record.
Displaying df[3] should show it.
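One way to find the right table is to print the index and shape of each element of the list. The sketch below uses small hand-made frames in place of the real read_html result, so the data and the variable name `tables` are hypothetical. Picking the widest table is only a rough heuristic, not a rule.

```python
import pandas

# Hand-made stand-ins for the list that read_html returns (hypothetical data)
tables = [
    pandas.DataFrame({0: ["Position", "Outfielder"]}),
    pandas.DataFrame({0: ["Year", "2015", "2016"],
                      1: ["Team", "A", "A"],
                      2: ["Games", "100", "120"]}),
]

# Print the index and shape of every table to see which one looks like a batting record
for i, table in enumerate(tables):
    print(i, table.shape)

# Rough heuristic (an assumption): the batting record is the table with the most columns
widest_index = max(range(len(tables)), key=lambda i: tables[i].shape[1])
print(tables[widest_index].head())
```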
Select the required table from the multiple tables on the page, and clean up the data types and so on to make later processing easier.
# Discard junk rows and move the rest to another data frame (header rows, results before 2010, and the totals row)
atbats = df[3].drop([0, 1, 2, 3, 4, 11])
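For reference, here is a small sketch of how `drop` behaves with index labels. The frame `raw` and its values are made up for illustration and do not come from the real page.

```python
import pandas

# Hypothetical scraped table: row 0 is a header row, the last row is a totals row
raw = pandas.DataFrame({0: ["year", "2015", "2016", "total"],
                        1: ["hits", "70", "130", "200"]})

# drop() takes index labels and returns a new frame; raw itself is unchanged
seasons = raw.drop([0, 3])
print(seasons)
```

Because the remaining rows keep their original index labels (1 and 2 here), the labels passed to `drop` always refer to the positions in the scraped table.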
Also, the columns of this data are just numeric indexes (the actual Japanese column names are treated as data), so assign the column names again.
Japanese names are fine, but I personally recommend the English baseball abbreviations (they should be familiar to anyone who watches MLB).
Also, every column has the generic object dtype (the types are not converted automatically as they are with read_csv), so specify the types explicitly.
# Assign column names (English baseball abbreviations)
atbats.columns = ['year', 'team', 'g', 'pa', 'ab', 'r', 'h', '_2b', '_3b', 'hr', 'tb', 'rbi', 'sb', 'cs', 'sh', 'sf', 'bb', 'hbp', 'so', 'dp', 'ba', 'slg', 'obp']
# Preprocess each column: fill missing values with 0 and convert to float64 ('team' stays as it is)
import numpy as np
numeric_columns = [c for c in atbats.columns if c != 'team']
for column in numeric_columns:
    atbats[column] = atbats[column].fillna(0).astype(np.float64)
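If a scraped column contains cells that are not numbers at all (for example a "-" placeholder), `astype` raises an error. As an alternative, `pandas.to_numeric` with `errors="coerce"` can be used. This is a different technique from the one in the article, shown here on made-up data with a hypothetical frame named `scraped`.

```python
import pandas

# Hypothetical scraped columns: every value is a string, one cell is a placeholder
scraped = pandas.DataFrame({"team": ["A", "A", "B"],
                            "h": ["120", "-", "98"],
                            "ba": [".293", ".000", ".271"]})

numeric_columns = [c for c in scraped.columns if c != "team"]
for column in numeric_columns:
    # errors="coerce" turns unparseable cells such as "-" into NaN, which fillna then replaces with 0
    scraped[column] = pandas.to_numeric(scraped[column], errors="coerce").fillna(0)

print(scraped.dtypes)
```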
Once you have the data in this state, it is time for the fun part: visualization and index calculation.
Here we use seaborn.
Let's see the transition of batting average from 2011 to 2016.
# Draw a graph (with seaborn)
import seaborn as sns
# Plot batting average (ba) as a line graph
sns.pointplot(x="year", y="ba", data=atbats)
You should see a line graph of batting average by year.
You can see that the numbers dropped in 2015.
That year was his worst, due to injuries during the season on top of the shoulder and thigh injuries he had carried since 2014 (from my memory and Wikipedia).
There is other data as well, so if you are interested, please draw more graphs and play with it.
For how to use seaborn, it is best to check the official website.
seaborn official documentation
This time, I will try some simple indexes.
OPS (On-base Plus Slugging): on-base percentage + slugging percentage. Adding the ability to not make an out (on-base percentage) and the ability to advance bases (slugging percentage) gives a rough grasp of offensive power.
BB/K (Base on Balls per Strikeout): walks divided by strikeouts.
Both can be computed with basic arithmetic between data frame columns.
# Calculate OPS and BB/K
atbats['ops'] = atbats['obp'] + atbats['slg'] # OPS
atbats['bb_k'] = atbats['bb'] / atbats['so'] # BB/K
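The article uses the on-base and slugging percentages already listed on the page. As a sketch, the same indexes can also be derived from counting stats with the standard definitions OBP = (H + BB + HBP) / (AB + BB + HBP + SF) and SLG = TB / AB. The frame `stats` and its numbers are hypothetical, not real player data.

```python
import pandas

# Hypothetical counting stats for two seasons (not real player data)
stats = pandas.DataFrame({"year": [2015, 2016],
                          "ab": [400, 500], "h": [100, 150], "tb": [160, 240],
                          "bb": [40, 50], "hbp": [5, 5], "sf": [5, 5],
                          "so": [80, 100]})

# On-base percentage: times on base divided by the plate appearances that count toward OBP
stats["obp"] = (stats["h"] + stats["bb"] + stats["hbp"]) / (
    stats["ab"] + stats["bb"] + stats["hbp"] + stats["sf"])
# Slugging percentage: total bases per at-bat
stats["slg"] = stats["tb"] / stats["ab"]
# OPS and BB/K, computed the same way as in the article
stats["ops"] = stats["obp"] + stats["slg"]
stats["bb_k"] = stats["bb"] / stats["so"]
print(stats[["year", "obp", "slg", "ops", "bb_k"]])
```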
OPS
BB/K
There is also K/BB, the inverse of BB/K, so let's look at that as well.
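K/BB can be computed the same way as BB/K. The sketch below uses made-up totals in a hypothetical frame named `seasons`, and guards against seasons with zero walks, which can occur when missing values have been filled with 0.

```python
import numpy as np
import pandas

# Hypothetical walk and strikeout totals (not real player data)
seasons = pandas.DataFrame({"year": [2014, 2015, 2016],
                            "bb": [50, 0, 40],
                            "so": [100, 30, 120]})

# Replace zero walks with NaN first so the ratio becomes NaN instead of infinity
seasons["k_bb"] = seasons["so"] / seasons["bb"].replace(0, np.nan)
print(seasons)
```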
**You can see that this year Dai-Kang Yang struck out about three times per walk.**
His K/BB does not fall within the 1.5-2.0 range, and although it is not as bad as 6.09 in 2011 or 4.42 in 2015, **he is a leadoff-hitter type in terms of OPS but strikes out far more often than he walks**, which looks like a weak point.
The Hokkaido Nippon-Ham Fighters are a team that looks at data in considerable detail (although their fondness for sacrifice bunts is a mystery), and it seems possible that the decline in these numbers led to his free agency (in reality, more detailed numbers, and defense, which cannot be expressed here, also play a part).
Here is the complete version of the snippets shown in pieces above.
Let's Sabermetrics of Dai-Kang Yang
The main thing I want to say
**Why does this have nothing to do with Hanshin!**
Tomorrow's entry is by kimihiro_n. Over to you!
For those who want to learn more about Sabermetrics.
If you cover these, you cannot go wrong!