This is a continuation of last time. I will predict the 3 types (Cu, Co, Pa) from the profile data of the 183 idols (as of April 2017) of [THE IDOLM@STER CINDERELLA GIRLS](https://ja.wikipedia.org/wiki/%E3%82%A2%E3%82%A4%E3%83%89%E3%83%AB%E3%83%9E%E3%82%B9%E3%82%BF%E3%83%BC_%E3%82%B7%E3%83%B3%E3%83%87%E3%83%AC%E3%83%A9%E3%82%AC%E3%83%BC%E3%83%AB%E3%82%BA).
The following 16 items were obtained, giving a 183 × 16 matrix: [Type, Name, Age, Birthday, Constellation, Blood type, Height, Weight, B, W, H, Handedness, Hometown, Hobbies, CV, Implementation date]
Of these, we will use the following 6 items this time to predict the type: [Age, Height, Weight, B, W, H]
Since all columns in the acquired data are of dtype object, we convert them to numeric types.
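This is easy to confirm before converting. A minimal check (assuming the same `aimasudata.csv` read in the script below):

```python
import pandas as pd

df = pd.read_csv('aimasudata.csv')
print(df.shape)   # expected: (183, 16)
print(df.dtypes)  # every column shows up as 'object' before conversion
```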
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt


def translate(df):
    # Convert each column to float; extract() pulls the digits out of
    # strings that mix in Japanese text (expand=False returns a Series)
    df['age'] = df['age'].str.extract('([0-9]+)', expand=False).astype(float)
    df['height'] = df['height'].astype(float)
    df['body weight'] = df['body weight'].str.extract('([0-9]+)', expand=False).astype(float)
    df['B'] = df['B'].str.extract('([0-9]+)', expand=False).astype(float)
    df['W'] = df['W'].str.extract('([0-9]+)', expand=False).astype(float)
    df['H'] = df['H'].str.extract('([0-9]+)', expand=False).astype(float)
    # Encode the attribute values as integers (Cu=0, Co=1, Pa=2)
    df.loc[df['attribute'] == "Cu", 'attribute'] = 0
    df.loc[df['attribute'] == "Co", 'attribute'] = 1
    df.loc[df['attribute'] == "Pa", 'attribute'] = 2
    df['attribute'] = df['attribute'].astype(int)
    return df


if __name__ == '__main__':
    # Read the data
    df = pd.read_csv('aimasudata.csv')
    df = translate(df)
```
- Since Japanese text is sometimes mixed into fields such as age, `str.extract('([0-9]+)')` is used to pull out only the digits ("eternally ○ years old" → "○". Nicely done!); see the sketch below.
- The attribute values are converted to numbers so they can be used for the SVM classification.
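As a quick illustration, a minimal sketch of what the extraction does (the sample strings here are made up, not taken from the actual dataset):

```python
import pandas as pd

# Hypothetical examples of the kind of values found in the age column
s = pd.Series(['17歳', '永遠の17歳', '28'])
print(s.str.extract('([0-9]+)', expand=False).astype(float))
# -> 17.0, 17.0, 28.0
```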
Let's graph the data to see whether the types really look distinguishable by machine learning.
```python
def checkdata(df, index):
    # Split the column by type (0=Cu, 1=Co, 2=Pa), dropping missing values
    x1 = df[index][df['attribute'] == 0].dropna()
    x2 = df[index][df['attribute'] == 1].dropna()
    x3 = df[index][df['attribute'] == 2].dropna()
    # Histogram of the three distributions side by side
    plt.hist([x1, x2, x3], bins=16)
    # Save the image
    plt.savefig("%s_graph.png" % index)
    # Show the plot
    plt.show()


if __name__ == '__main__':
    # Read the data
    df = pd.read_csv('row_data.csv')
    df = translate(df)
    checkdata(df, "age")
```
In each histogram, blue is Cu, orange is Co, and green is Pa (the per-type means are also computed in the sketch below).

- Age: Co has a high proportion of older characters.
- Height: Cu tends low and Co high; this feature shows the clearest difference.
- Weight: the difference is not large, but Co is slightly heavier. Everyone is quite light overall.
- B, W, H: across the body measurements Co is high overall, while the separation between Cu and Pa looks subtle.
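To put numbers on these visual impressions, a minimal sketch (assuming `df` is the DataFrame returned by `translate` above):

```python
cols = ['age', 'height', 'body weight', 'B', 'W', 'H']
# Mean of each feature per type (0=Cu, 1=Co, 2=Pa)
print(df.groupby('attribute')[cols].mean())
```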
This time, we will use an SVM to classify the three types (Cu, Co, Pa). Since an SVM requires its parameters to be set, we first run a grid search to determine which parameters to apply.
[Parameter optimization by grid search with scikit-learn](http://qiita.com/SE96UoC5AfUt7uY/items/c81f7cea72a44a7bfd3a)
```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def gridsearch(df):
    tuned_parameters = [{'C': [1, 10, 100, 1000, 10000],
                         'kernel': ['rbf'],
                         'gamma': [0.01, 0.001, 0.0001]}]
    score = 'f1'
    clf = GridSearchCV(
        SVC(),             # classifier
        tuned_parameters,  # parameter grid to search over
        cv=5,              # number of cross-validation folds
        scoring='%s_weighted' % score)  # model evaluation metric
    # Drop rows with missing values
    df = df.dropna(subset=['age', 'height', 'body weight', 'B', 'W', 'H'])
    X = df[['age', 'height', 'body weight', 'B', 'W', 'H']]
    y = df['attribute']
    clf.fit(X, y)
    print("mean score for cross-validation:\n")
    results = clf.cv_results_
    for mean, std, params in zip(results['mean_test_score'],
                                 results['std_test_score'],
                                 results['params']):
        print("{:.3f} (+/- {:.3f}) for {}".format(mean, std / 2, params))
    print(clf.best_params_)
```
The result seems to be best when C = 100 and gamma = 0.0001.
Implement SVM using the parameters obtained by grid search.
```python
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score


def dosvm(df):
    # Drop rows with missing values
    df = df.dropna(subset=['age', 'height', 'body weight', 'B', 'W', 'H'])
    X = df[['age', 'height', 'body weight', 'B', 'W', 'H']]
    y = df['attribute']
    # Hold out 20% of the data for testing
    data_train, data_test, label_train, label_test = train_test_split(X, y, test_size=0.2)
    # Train with the parameters found by the grid search
    clf = svm.SVC(kernel='rbf', C=100, gamma=0.0001)
    clf.fit(data_train, label_train)
    result = clf.predict(data_test)
    cmat = confusion_matrix(label_test, result)
    acc = accuracy_score(label_test, result)
    print(cmat)
    print(acc)
```
Averaged over about 100 trials, I was able to classify with an accuracy of about 0.45. Looking at the confusion matrix, Pa in particular does not seem to be predicted well.
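For reference, here is a minimal sketch of how such an average could be computed. `run_trial` is a hypothetical helper, essentially `dosvm` returning the accuracy instead of printing it, and `df` is the translated DataFrame from above:

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def run_trial(X, y):
    # One random 80/20 split: train, predict, and score
    data_train, data_test, label_train, label_test = train_test_split(X, y, test_size=0.2)
    clf = svm.SVC(kernel='rbf', C=100, gamma=0.0001)
    clf.fit(data_train, label_train)
    return accuracy_score(label_test, clf.predict(data_test))


cols = ['age', 'height', 'body weight', 'B', 'W', 'H']
d = df.dropna(subset=cols)
accs = [run_trial(d[cols], d['attribute']) for _ in range(100)]
print(np.mean(accs))
```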
- When I started, I wondered whether the types could be identified at all, but they turned out to be surprisingly identifiable.
- Parameter tuning is required when using an SVM. (Without any tuning, the accuracy was about 0.3.)
- This time I predicted the type from 6 features, but using only height still gives an accuracy of about 0.42, while using the 5 features excluding height drops it to about 0.36 (a sketch for comparing feature subsets follows below). I want to learn how to analyze the causes of results like this.
- I myself was Co (as expected).
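Reusing the hypothetical `run_trial` helper from the sketch above, the feature-subset comparison could look like this:

```python
# Compare the accuracy of different feature subsets over repeated trials
subsets = {
    'height only': ['height'],
    'all but height': ['age', 'body weight', 'B', 'W', 'H'],
}
for name, cols in subsets.items():
    d = df.dropna(subset=cols)
    accs = [run_trial(d[cols], d['attribute']) for _ in range(100)]
    print(name, np.mean(accs))
```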