I keep looking up the same things, so I took this opportunity to write them down as a note.
In this post, I read a CSV with Python, apply preprocessing such as dummy variable conversion, and run predictions with an SVM.
When you feed data into machine learning, it may contain categorical variables (e.g. gender, country of origin). If you simply replace these with numbers (e.g. Japan = 1, USA = 2), an unintended ordinal meaning gets encoded into the data, and learning may not go well.
Here, we deal with this by converting categorical variables into numerical values using a technique called dummy variables.
For example, assume the following data: a column `country` stores one of {Japan, USA, China}. Dummy variable conversion replaces the column `country` with three columns `country.Japan`, `country.USA`, and `country.China`. Only the column matching the row's value is set to `1`; the others are set to `0`.
Before:

country |
---|
Japan |

After:

country.Japan | country.USA | country.China |
---|---|---|
1 | 0 | 0 |
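As a minimal sketch of this conversion (my own toy example, not from the original; note that pandas names the new columns `country_Japan` etc. with underscores rather than dots):

```python
import pandas as pd

# Toy data matching the country example above
df_toy = pd.DataFrame({'country': ['Japan', 'USA', 'China']})

# One 0/1 column per category (dtype=int gives 1/0 instead of True/False)
dummies = pd.get_dummies(df_toy['country'], prefix='country', dtype=int)
print(dummies)
#    country_China  country_Japan  country_USA
# 0              0              1              0
# 1              0              0              1
# 2              1              0              0
```

The implementation below uses `pd.get_dummies` in exactly this way, just over several columns at once.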
This time, we use the "Adult Income Data Set", which is commonly used in experiments such as anonymization research. You can probably find it by googling, but here I fetched it with R (never mind that this is a Python post and I reach for R right away).
This dataset has a column called `income` that takes three values: `large`, `small`, and `NaN`. In this implementation, we want to predict `large` or `small` for the rows where `income` is `NaN` (missing).
Therefore, the rows without `NaN` are used as training data, and the rows with `NaN` are used as evaluation data.
# Load the AdultUCI dataset from the arules package and export it as CSV
library('arules')
data("AdultUCI")
id <- 1:nrow(AdultUCI)  # add a row id column
d <- data.frame(id, AdultUCI)
write.csv(d, "AdultDataSet.csv", quote = FALSE, fileEncoding = 'cp932', row.names = FALSE)
import numpy as np
import pandas as pd
from sklearn import svm
df = pd.read_csv("AdultDataSet.csv", encoding='cp932', low_memory=False)
# Training labels: keep only the rows where income is known
Y_train = df.copy()
Y_train['income'] = Y_train['income'].map({"large": 1, "small": 0})
Y_train = Y_train[Y_train['income'].notnull()]
Y_train = Y_train.iloc[:, 15].values  # income column only
# Create dummy variables for the categorical columns
X = df.iloc[:, 0:15]  # everything except income
colnames_categorical = ['workclass', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
X_dummy = pd.get_dummies(X[colnames_categorical], drop_first=True)
# Join the dummy variable columns back onto X
X = pd.merge(X, X_dummy, left_index=True, right_index=True)
# Drop the now-duplicated categorical columns and unused columns
X = X.drop(colnames_categorical, axis=1)
X = X.drop(['id', 'education'], axis=1)
# Split into train and test depending on whether income is NaN or not
X_train = X[df['income'].notnull()].values
X_test = X[df['income'].isnull()].values
# Training
clf = svm.LinearSVC()  # chosen because it trains fast; alternatives include svm.SVC(kernel='rbf'), etc.
print('start!')
clf.fit(X_train, Y_train)
print('end!')
# Prediction
Y_predict = clf.predict(X_test)
# Fill the missing rows with the predicted values
df2 = df.copy()
df2.loc[df2['income'].isnull(), 'income'] = Y_predict
df2['income'] = df2['income'].map({1.: "large", 0.: "small", "small": "small", "large": "large"})
df2.head()
Originally, you would hold out some of the correctly labeled data in advance to evaluate performance. This time, though, the goal was to predict the missing data and, above all, to apply dummy variable conversion, so I skipped that; a sketch of what such an evaluation could look like follows.
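This sketch is my addition (not in the original flow); it reuses the `X_train` and `Y_train` built above and holds out part of the labeled rows for scoring:

```python
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the labeled rows purely for scoring (illustrative split)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, Y_train, test_size=0.2, random_state=0)

clf_eval = svm.LinearSVC()
clf_eval.fit(X_tr, y_tr)
print('validation accuracy:', accuracy_score(y_val, clf_eval.predict(X_val)))
```

That aside, for the time being let's check that there are no missing values left.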
# Distribution of income before prediction
count_before = df['income'].value_counts(dropna=False)
pd.DataFrame(count_before)  # print(count_before) works too
# Distribution of income after prediction
count_after = df2['income'].value_counts(dropna=False)
pd.DataFrame(count_after)
If `NaN` has disappeared after prediction, we're OK for the time being.
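For a quick programmatic version of this check (my addition, not from the original), an assert works too:

```python
# Should be 0 if every NaN was replaced by a prediction
assert df2['income'].isnull().sum() == 0
```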
df2.to_csv('AfterAdultDataSet.csv', index=False)
None of this should be particularly difficult, but figuring out how to use pandas and scikit-learn took me a while... sad...