I keep looking up the same things, so I took this opportunity to write them down as a note.
In this post, I read a CSV with Python, apply preprocessing such as dummy variable conversion, and run predictions with an SVM.
When you feed data into machine learning, it may contain categorical variables (e.g. gender, country of origin). If you simply replace these with numbers (e.g. Japan = 1, USA = 2), an unintended ordinal meaning gets encoded into the data, and learning may not go well.
Here, we deal with this by converting categorical variables into numerical values using a technique called dummy variables.
For example, assume the following data: a column `country` stores one of {Japan, USA, China}. Dummy variable conversion replaces the column `country` with three columns `country.Japan`, `country.USA`, and `country.China`. Only the column matching the row's value is set to `1`; the others are set to `0`.
Before:

country |
---|
Japan |

After:

country.Japan | country.USA | country.China |
---|---|---|
1 | 0 | 0 |
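As a minimal sketch of this conversion (my own toy example, not from the original; note that pandas names the new columns `country_Japan` etc. with underscores rather than dots):

```python
import pandas as pd

# Toy data matching the country example above
df_toy = pd.DataFrame({'country': ['Japan', 'USA', 'China']})

# One 0/1 column per category (dtype=int gives 1/0 instead of True/False)
dummies = pd.get_dummies(df_toy['country'], prefix='country', dtype=int)
print(dummies)
#    country_China  country_Japan  country_USA
# 0              0              1              0
# 1              0              0              1
# 2              1              0              0
```

The implementation below uses `pd.get_dummies` in exactly this way, just over several columns at once.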
This time, we use the "Adult Income Data Set", which is commonly used in experiments such as anonymization research. You can probably find it by googling, but here I fetched it with R (never mind that this is a Python post and I reach for R right away).
This dataset has a column called `income` that takes three values: `large`, `small`, and `NaN`. In this implementation, we want to predict `large` or `small` for the rows where `income` is `NaN` (missing).
Therefore, the rows without `NaN` are used as training data, and the rows with `NaN` are used as evaluation data.
# Load the AdultUCI dataset from the arules package and export it as CSV
library('arules')
data("AdultUCI")
id <- 1:nrow(AdultUCI)  # add a row id column
d <- data.frame(id, AdultUCI)
write.csv(d, "AdultDataSet.csv", quote = FALSE, fileEncoding = 'cp932', row.names = FALSE)
import numpy as np
import pandas as pd
from sklearn import svm
df = pd.read_csv("AdultDataSet.csv", encoding='cp932', low_memory=False)
# Training labels: keep only the rows where income is known
Y_train = df.copy()
Y_train['income'] = Y_train['income'].map({"large": 1, "small": 0})
Y_train = Y_train[Y_train['income'].notnull()]
Y_train = Y_train.iloc[:, 15].values  # income column only
# Create dummy variables for the categorical columns
X = df.iloc[:, 0:15]  # everything except income
colnames_categorical = ['workclass', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
X_dummy = pd.get_dummies(X[colnames_categorical], drop_first=True)
# Join the dummy variable columns back onto X
X = pd.merge(X, X_dummy, left_index=True, right_index=True)
# Drop the now-duplicated categorical columns and unused columns
X = X.drop(colnames_categorical, axis=1)
X = X.drop(['id', 'education'], axis=1)
# Split into train and test depending on whether income is NaN or not
X_train = X[df['income'].notnull()].values
X_test = X[df['income'].isnull()].values
# Training
clf = svm.LinearSVC()  # chosen because it trains fast; alternatives include svm.SVC(kernel='rbf'), etc.
print('start!')
clf.fit(X_train, Y_train)
print('end!')
# Prediction
Y_predict = clf.predict(X_test)
# Fill the missing rows with the predicted values
df2 = df.copy()
df2.loc[df2['income'].isnull(), 'income'] = Y_predict
df2['income'] = df2['income'].map({1.: "large", 0.: "small", "small": "small", "large": "large"})
df2.head()
Originally, you would hold out some of the correctly labeled data in advance to evaluate performance. This time, though, the goal was to predict the missing data and, above all, to apply dummy variable conversion, so I skipped that; a sketch of what such an evaluation could look like follows.
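This sketch is my addition (not in the original flow); it reuses the `X_train` and `Y_train` built above and holds out part of the labeled rows for scoring:

```python
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the labeled rows purely for scoring (illustrative split)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, Y_train, test_size=0.2, random_state=0)

clf_eval = svm.LinearSVC()
clf_eval.fit(X_tr, y_tr)
print('validation accuracy:', accuracy_score(y_val, clf_eval.predict(X_val)))
```

That aside, for the time being let's check that there are no missing values left.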
# Distribution of income before prediction
count_before = df['income'].value_counts(dropna=False)
pd.DataFrame(count_before)  # print(count_before) works too
# Distribution of income after prediction
count_after = df2['income'].value_counts(dropna=False)
pd.DataFrame(count_after)
If `NaN` has disappeared after prediction, we're OK for the time being.
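For a quick programmatic version of this check (my addition, not from the original), an assert works too:

```python
# Should be 0 if every NaN was replaced by a prediction
assert df2['income'].isnull().sum() == 0
```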
df2.to_csv('AfterAdultDataSet.csv', index=False)
None of this should be particularly difficult, but figuring out how to use pandas and scikit-learn took me a while... sad...