This is a continuation of last time. I will predict the 3 types (Cu, Co, Pa) from the profile data of the 183 idols (as of April 2017) of [THE IDOLM@STER CINDERELLA GIRLS](https://ja.wikipedia.org/wiki/%E3%82%A2%E3%82%A4%E3%83%89%E3%83%AB%E3%83%9E%E3%82%B9%E3%82%BF%E3%83%BC_%E3%82%B7%E3%83%B3%E3%83%87%E3%83%AC%E3%83%A9%E3%82%AC%E3%83%BC%E3%83%AB%E3%82%BA).
The following 16 items were obtained, giving a 183 × 16 matrix: [Type, Name, Age, Birthday, Constellation, Blood type, Height, Weight, B, W, H, Handedness, Hometown, Hobbies, CV, Implementation date]
Of these, we will use the following 6 items this time to predict the type: [Age, Height, Weight, B, W, H]
Since all columns in the acquired data are of dtype object, we convert them to numeric types.
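This is easy to confirm before converting. A minimal check (assuming the same `aimasudata.csv` read in the script below):

```python
import pandas as pd

df = pd.read_csv('aimasudata.csv')
print(df.shape)   # expected: (183, 16)
print(df.dtypes)  # every column shows up as 'object' before conversion
```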
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt


def translate(df):
    # Convert each column to float; extract() pulls the digits out of
    # strings that mix in Japanese text (expand=False returns a Series)
    df['age'] = df['age'].str.extract('([0-9]+)', expand=False).astype(float)
    df['height'] = df['height'].astype(float)
    df['body weight'] = df['body weight'].str.extract('([0-9]+)', expand=False).astype(float)
    df['B'] = df['B'].str.extract('([0-9]+)', expand=False).astype(float)
    df['W'] = df['W'].str.extract('([0-9]+)', expand=False).astype(float)
    df['H'] = df['H'].str.extract('([0-9]+)', expand=False).astype(float)
    # Encode the attribute values as integers (Cu=0, Co=1, Pa=2)
    df.loc[df['attribute'] == "Cu", 'attribute'] = 0
    df.loc[df['attribute'] == "Co", 'attribute'] = 1
    df.loc[df['attribute'] == "Pa", 'attribute'] = 2
    df['attribute'] = df['attribute'].astype(int)
    return df


if __name__ == '__main__':
    # Read the data
    df = pd.read_csv('aimasudata.csv')
    df = translate(df)
```
- Since Japanese text is sometimes mixed into fields such as age, `str.extract('([0-9]+)')` is used to pull out only the digits ("eternally ○ years old" → "○". Nicely done!); see the sketch below.
- The attribute values are converted to numbers so they can be used for the SVM classification.
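As a quick illustration, a minimal sketch of what the extraction does (the sample strings here are made up, not taken from the actual dataset):

```python
import pandas as pd

# Hypothetical examples of the kind of values found in the age column
s = pd.Series(['17歳', '永遠の17歳', '28'])
print(s.str.extract('([0-9]+)', expand=False).astype(float))
# -> 17.0, 17.0, 28.0
```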
Let's graph the data to see whether the types really look distinguishable by machine learning.
```python
def checkdata(df, index):
    # Split the column by type (0=Cu, 1=Co, 2=Pa), dropping missing values
    x1 = df[index][df['attribute'] == 0].dropna()
    x2 = df[index][df['attribute'] == 1].dropna()
    x3 = df[index][df['attribute'] == 2].dropna()
    # Histogram of the three distributions side by side
    plt.hist([x1, x2, x3], bins=16)
    # Save the image
    plt.savefig("%s_graph.png" % index)
    # Show the plot
    plt.show()


if __name__ == '__main__':
    # Read the data
    df = pd.read_csv('row_data.csv')
    df = translate(df)
    checkdata(df, "age")
```
In each histogram, blue is Cu, orange is Co, and green is Pa (the per-type means are also computed in the sketch below).

- Age: Co has a high proportion of older characters.
- Height: Cu tends low and Co high; this feature shows the clearest difference.
- Weight: the difference is not large, but Co is slightly heavier. Everyone is quite light overall.
- B, W, H: across the body measurements Co is high overall, while the separation between Cu and Pa looks subtle.
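To put numbers on these visual impressions, a minimal sketch (assuming `df` is the DataFrame returned by `translate` above):

```python
cols = ['age', 'height', 'body weight', 'B', 'W', 'H']
# Mean of each feature per type (0=Cu, 1=Co, 2=Pa)
print(df.groupby('attribute')[cols].mean())
```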
This time, we will use an SVM to classify the three types (Cu, Co, Pa). Since an SVM requires its parameters to be set, we first run a grid search to determine which parameters to apply.
[Parameter optimization by grid search with scikit-learn](http://qiita.com/SE96UoC5AfUt7uY/items/c81f7cea72a44a7bfd3a)
```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def gridsearch(df):
    tuned_parameters = [{'C': [1, 10, 100, 1000, 10000],
                         'kernel': ['rbf'],
                         'gamma': [0.01, 0.001, 0.0001]}]
    score = 'f1'
    clf = GridSearchCV(
        SVC(),             # classifier
        tuned_parameters,  # parameter grid to search over
        cv=5,              # number of cross-validation folds
        scoring='%s_weighted' % score)  # model evaluation metric
    # Drop rows with missing values
    df = df.dropna(subset=['age', 'height', 'body weight', 'B', 'W', 'H'])
    X = df[['age', 'height', 'body weight', 'B', 'W', 'H']]
    y = df['attribute']
    clf.fit(X, y)
    print("mean score for cross-validation:\n")
    results = clf.cv_results_
    for mean, std, params in zip(results['mean_test_score'],
                                 results['std_test_score'],
                                 results['params']):
        print("{:.3f} (+/- {:.3f}) for {}".format(mean, std / 2, params))
    print(clf.best_params_)
```
The result seems to be best when C = 100 and gamma = 0.0001.
Implement SVM using the parameters obtained by grid search.
```python
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score


def dosvm(df):
    # Drop rows with missing values
    df = df.dropna(subset=['age', 'height', 'body weight', 'B', 'W', 'H'])
    X = df[['age', 'height', 'body weight', 'B', 'W', 'H']]
    y = df['attribute']
    # Hold out 20% of the data for testing
    data_train, data_test, label_train, label_test = train_test_split(X, y, test_size=0.2)
    # Train with the parameters found by the grid search
    clf = svm.SVC(kernel='rbf', C=100, gamma=0.0001)
    clf.fit(data_train, label_train)
    result = clf.predict(data_test)
    cmat = confusion_matrix(label_test, result)
    acc = accuracy_score(label_test, result)
    print(cmat)
    print(acc)
```
Averaged over about 100 trials, I was able to classify with an accuracy of about 0.45. Looking at the confusion matrix, Pa in particular does not seem to be predicted well.
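For reference, here is a minimal sketch of how such an average could be computed. `run_trial` is a hypothetical helper, essentially `dosvm` returning the accuracy instead of printing it, and `df` is the translated DataFrame from above:

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def run_trial(X, y):
    # One random 80/20 split: train, predict, and score
    data_train, data_test, label_train, label_test = train_test_split(X, y, test_size=0.2)
    clf = svm.SVC(kernel='rbf', C=100, gamma=0.0001)
    clf.fit(data_train, label_train)
    return accuracy_score(label_test, clf.predict(data_test))


cols = ['age', 'height', 'body weight', 'B', 'W', 'H']
d = df.dropna(subset=cols)
accs = [run_trial(d[cols], d['attribute']) for _ in range(100)]
print(np.mean(accs))
```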
- When I started, I wondered whether the types could be identified at all, but they turned out to be surprisingly identifiable.
- Parameter tuning is required when using an SVM. (Without any tuning, the accuracy was about 0.3.)
- This time I predicted the type from 6 features, but using only height still gives an accuracy of about 0.42, while using the 5 features excluding height drops it to about 0.36 (a sketch for comparing feature subsets follows below). I want to learn how to analyze the causes of results like this.
- I myself was Co (as expected).
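Reusing the hypothetical `run_trial` helper from the sketch above, the feature-subset comparison could look like this:

```python
# Compare the accuracy of different feature subsets over repeated trials
subsets = {
    'height only': ['height'],
    'all but height': ['age', 'body weight', 'B', 'W', 'H'],
}
for name, cols in subsets.items():
    d = df.dropna(subset=cols)
    accs = [run_trial(d[cols], d['attribute']) for _ in range(100)]
    print(name, np.mean(accs))
```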