Data analysis for improving POG 2 ~ Analysis with jupyter notebook ~

Review up to the last time

Data analysis for improving POG 1-web scraping with Python- shows the profile (gender, pedigree, stables) of horses born between 2010 and 2013. Etc.) and earned prizes during the POG period. The obtained data is saved separately for each year of birth under the file name "horse_db / horse_prof_ (yyyy) .csv".

Purpose of this time

This time, based on these data, I would like to analyze the causal relationship between the horse profile and the prize money won during the POG period, and find out the law of POG winning.

Unfortunately, I'm not a data analysis specialist. Therefore, it will be possible to draw some conclusions through trial and error. Ultimately, it is expected that multiple regression analysis and machine learning will be taken care of, but first, I would like to understand the characteristics of each factor by simple analysis.

Data analysis

This time, we will proceed with the analysis on a jupyter notebook that seems to be suitable for data analysis including trial and error. In addition, the pandas module is used to calculate statistics (mean, variance, etc.) for each factor.

Preparation

First, a data frame to be analyzed is generated.

AnalyseUmaData_160105.ipynb


import os
import pandas as pd

year_l = range(2010, 2014)
masta_df = pd.DataFrame()
for year in year_l:
    i_dname = './horse_db/'
    i_fname = 'horse_prof_%d.csv' % year
    i_fpath = os.path.join(i_dname, i_fname)
    tmp_df = pd.read_csv(i_fpath,index_col=0, header=0, encoding='utf-8')
    masta_df = pd.concat([masta_df, tmp_df])
    
masta_df[:10]

A part of the generated data frame is shown below. AnalyseUmaData_160105.jpg

Gender, date of birth, trainer, owner, producer, auction transaction price, father, mother and father are likely to affect the POG period prize money.

As it is, the date of birth and the auction transaction price are difficult to handle, so I will mold it a little.

AnalyseUmaData_160105.ipynb



import datetime

#Birthday=>Year of birth, month of birth
birth_y = masta_df[u'Birthday'].dropna().map(lambda x: datetime.datetime.strptime(x.encode('utf-8'), '%Y year%m month%d day').strftime('%Y'))
birth_y.name = u'Year of birth'
birth_m = masta_df[u'Birthday'].dropna().map(lambda x: datetime.datetime.strptime(x.encode('utf-8'), '%Y year%m month%d day').strftime('%m'))
birth_m.name = u'Birth month'
df = pd.concat([masta_df, birth_y, birth_m], axis=1)

#Auction transaction price character string=>Quantify
df[u'Auction transaction price'] = masta_df[u'Auction transaction price'].fillna('-1')
df[u'Auction transaction price'] = df[u'Auction transaction price'].dropna().map(lambda x: x.replace(',', ''))
df[u'Auction transaction price'] = df[u'Auction transaction price'].dropna().map(lambda x: int(x.split(u'Ten thousand yen')[0]))

df[:10]

AnalyseUmaData_160105.jpg

The date of birth was divided into the year of birth and the month of birth. Now the auction transaction price, year of birth, and month of birth can be handled as numerical data.

analysis

Let's start the analysis immediately. As a result of trial and error, the data aggregation script settled in the following form.

AnalyseUmaData_160105.ipynb


import numpy as np

#Explanatory variable
param_l = [u'sex', u'Birth month', u'Trainer', u'Horse owner', u'Producer', u'father', u'母father']

#Objective variable
prize_l = [u'POG period prize_half period',u'POG period prize_Year-round']

#Setting
param = param_l[0]
prize = prize_l[1]
pts_filter = 5
prize_filter = 1000 #Prize filter

#Aggregate
ser_ave = df.groupby(param).mean()[prize]
ser_ave.name = u'prize>=0_ave'
ser_std = df.groupby(param).std()[prize]
ser_std.name = u'prize>=0_std'
ser_pts = df.groupby(param).size()
ser_pts.name = u'prize>=0_pts'
ser_fave = df[df[prize]>=prize_filter].groupby(param).mean()[prize]
ser_fave.name = u'prize>=%d_ave' % prize_filter
ser_fstd = df[df[prize]>=prize_filter].groupby(param).std()[prize]
ser_fstd.name = u'prize>=%d_std' % prize_filter
ser_fpts = df[df[prize]>=prize_filter].groupby(param).size()
ser_fpts.name = u'prize>=%d_pts' % prize_filter
ser_fper = (df[df[prize]>=prize_filter].groupby(param).size()/df.groupby(param).size()).map(lambda x: float(x)*100)
ser_fper.name = u'prize>=%d_pts / prize>=0_pts [%%]' % prize_filter
result = pd.concat([ser_ave, ser_std, ser_pts, ser_fave, ser_fstd, ser_fpts, ser_fper],axis=1)
result.index.name = '%s_%s' % (param, prize)
result = np.round(result.sort_values(by=ser_fper.name, ascending=0),2)
result[result[ser_fpts.name] >= pts_filter][:10]

Each column has a prize of 0 or more, that is, the mean, variance, and score for all horses as a population. Mean, variance, score, when the population is a horse with a prize of 10 million or more It also shows the percentage of horses with prizes of 10 million or more.

sex

The result was a gelding advantage, but it is uncertain whether it was a gelding during the POG period. At least it turns out that mares are at a disadvantage. AnalyseUmaData_160105.jpg

Birth month

The sooner you are born, the better. AnalyseUmaData_160105.jpg

Trainer

I feel that the ranking will change as much as possible depending on the setting of the prize filter, but when sorted at a rate of 10 million or more, the following results were obtained. It is a pity that Matsuda Stables, which is scheduled to retire in February 2016, ranked first. AnalyseUmaData_160105.jpg

Horse owner

Surprisingly, the clubs that produce active horses in G1 have not come to the top. Is there a big difference between hit and miss for club horses? AnalyseUmaData_160105.jpg

Producer

Until now, I had chosen horses with the idea of "Northern Farm for the time being if I was at a loss", but I got data that confirms that it is a reasonably reasonable idea. AnalyseUmaData_160105.jpg

father

Unfamiliar overseas horses came to the top. Considering the population parameter and average value, it can be said that Deep, Da Major, Hearts, and King Kamehameha are for POG. AnalyseUmaData_160105.jpg

Mother father

I don't know what it is anymore. As the good results of horses born from the combination of "Stay Gold" x "Mejiro McQueen" and "Deep Impact" x "Storm Cat" have become a hot topic in recent years, it seems necessary to re-evaluate the combination of father and mother. AnalyseUmaData_160105.jpg

Auction transaction price

Horizontal axis: auction transaction price, vertical axis: POG period prize money scatter plot was created. It seems that there is no need to worry about the auction price in POG as no significant correlation can be found.

AnalyseUmaData_160105.ipynb



%matplotlib inline
import matplotlib.pyplot as plt

#Explanatory variable
param = u'Auction transaction price'

#Objective variable
prize_l = [u'POG period prize_half period',u'POG period prize_Year-round']
prize = prize_l[1]

#Scatter plot
x = df[df[param]>0][param]
y = df[df[param]>0][prize]
plt.plot(x, y, linestyle='None', marker='o')
plt.xlabel('seri')
plt.ylabel('prize')
plt.show()

AnalyseUmaData_160105.jpg

This summary

To summarize the results obtained in this analysis, if you select a horse that is an early-born stallion and is produced by Deep Impact, Daiwa Major, Heart's Cry, or King Kamehameha and is owned by an individual owner, you can select a horse that will win with a reasonable probability. It turns out that it is.

However, in this analysis, there is still room for improvement, such as the fact that the prize threshold is set to 10 million and the effect of the combination of each factor has not been verified, so it is not possible to easily draw a conclusion. Based on this result, it seems necessary to review the analysis method itself.

from now on

Work on data visualization, multiple regression analysis, machine learning, etc.

Recommended Posts

Data analysis for improving POG 2 ~ Analysis with jupyter notebook ~
Data analysis for improving POG 3-Regression analysis-
Data analysis for improving POG 1 ~ Web scraping with Python ~
Convenient analysis with Pandas + Jupyter notebook
<Python> Build a dedicated server for Jupyter Notebook data analysis
Data analysis with python 2
Data analysis with Python
Data analysis environment construction with Python (IPython notebook + Pandas)
Shortcut key for Jupyter notebook
Embed audio data with Jupyter
Using Graphviz with Jupyter Notebook
Python for Data Analysis Chapter 4
Use pip with Jupyter Notebook
Use Cython with Jupyter Notebook
Play with Jupyter Notebook (IPython Notebook)
Python for Data Analysis Chapter 2
Tips for data analysis ・ Notes
Python for Data Analysis Chapter 3
Snippet settings for python jupyter notebook
Jupyter Notebook essential for software development
Visualize decision trees with jupyter notebook
Make a sound with Jupyter notebook
Use markdown with jupyter notebook (with shortcut)
Preprocessing template for data analysis (Python)
Add more kernels with Jupyter Notebook
Data analysis starting with python (data visualization 1)
Data analysis starting with python (data visualization 2)
Output log file with Job (Notebook) of Cloud Pak for Data
Build a comfortable psychological experiment / analysis environment with PsychoPy + Jupyter Notebook
Create a USB boot Ubuntu with a Python environment for data analysis
Use nb extensions with Anaconda's Jupyter notebook
Jupyter Notebook extension, nbextensions settings for myself
Python visualization tool for data analysis work
Use apache Spark with jupyter notebook (IPython notebook)
I want to blog with Jupyter Notebook
Use Jupyter Lab and Jupyter Notebook with EC2
Allow Jupyter Notebook to embed audio data in HTML tables for playback
I tried factor analysis with Titanic data!
Try SVM with scikit-learn on Jupyter Notebook
Deploy functions with Cloud Pak for Data
How to use jupyter notebook with ABCI
Initial setting of Jupyter Notebook for Vim lovers ・ Exit with jj (jupyter-vim-binding)
Linking python and JavaScript with jupyter notebook
Data analysis starting with python (data preprocessing-machine learning)
[Jupyter Notebook memo] Display kanji with matplotlib
JupyterLab Basic Setting 2 (pip) for data analysis
Rich cell output with Jupyter Notebook (IPython)
JupyterLab Basic Setup for Data Analysis (pip)
Best practices for messing with data with pandas
Analysis for Data Scientists: Qiita Self-Article Summary 2020
Execute API of Cloud Pak for Data analysis project Job with environment variables
How to replace with Pandas DataFrame, which is useful for data analysis (easy)
Library for "I want to do that" of data science on Jupyter Notebook
How to debug with Jupyter or iPython Notebook
When Html cannot be output with Jupyter Notebook
Analytical environment construction with Docker (jupyter notebook + PostgreSQL)
Prepare a programming language environment for data analysis
Analysis for Data Scientists: Qiita Self-Article Summary 2020 (Practice)
[CovsirPhy] COVID-19 Python Package for Data Analysis: Data loading
Verify NLC accuracy with Watson Studio's Jupyter Notebook
Enable Jupyter Notebook with conda on remote server