This article uses pandas 0.18.1.
If you are coming from R and try to do the same analysis in Python (scikit-learn), handling categorical variables can be awkward. scikit-learn cannot use categorical data as-is (when the input is a numpy.ndarray), so convert it to dummy variables first.
The data is as follows. In sex, 1 means male and 2 means female; age takes the values 1 to 3, one for each age group.
df1
     id  sex  age
0  1001    1    3
1  1002    2    2
2  1003    1    3
3  1004    2    1
4  1005    2    1
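The article does not show how df1 was loaded, so here is a minimal sketch for reproducing the table. It assumes the sex and age codes are stored as strings (object dtype), because get_dummies only encodes object or category columns of a DataFrame by default. The construction itself is hypothetical, not the author's original code.

```python
import pandas as pd

# Hypothetical reconstruction of the sample data shown above.
# Assumption: the codes are kept as strings so pandas treats them as categories.
df1 = pd.DataFrame(
    {
        'id': [1001, 1002, 1003, 1004, 1005],
        'sex': ['1', '2', '1', '2', '2'],
        'age': ['3', '2', '3', '1', '1'],
    },
    columns=['id', 'sex', 'age'],  # fix the column order explicitly
)
print(df1)
```

If the codes are stored as integers instead, passing columns=['sex', 'age'] to get_dummies should give the same dummy columns.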
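df1 = df1.reset_index(drop=True)  # Reset the index, because the merge later is done on the index.
# By default get_dummies only encodes object/category columns;
# pass columns=['sex', 'age'] if the codes are stored as integers.
dummy_df = pd.get_dummies(df1[['sex', 'age']], drop_first=True)
print(dummy_df)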
   sex_2  age_2  age_3
0    0.0    0.0    1.0
1    1.0    1.0    0.0
2    0.0    0.0    1.0
3    1.0    0.0    0.0
4    1.0    0.0    0.0
The columns have been converted to dummy variables. get_dummies creates one indicator column for every level of each variable, and drop_first=True then removes the first level of each (here sex_1 and age_1). If you keep all levels, the indicator columns of one variable always sum to 1 and are therefore linearly dependent, which is inconvenient for models such as linear regression, so the first level is dropped here. Note that drop_first is available in pandas 0.18.0 or later.
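To make the effect of drop_first concrete, the following sketch compares the two settings on a single column. The Series is a hypothetical stand-in for the age column, with the codes assumed to be strings, and .astype(int) is only there so the result prints as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical single-column example; values mirror the age codes in the article.
age = pd.Series(['3', '2', '3', '1', '1'], name='age')

# One indicator column per level.
full = pd.get_dummies(age, prefix='age').astype(int)
# Same encoding with the first level (age_1) removed.
reduced = pd.get_dummies(age, prefix='age', drop_first=True).astype(int)

print(full.columns.tolist())     # ['age_1', 'age_2', 'age_3']
print(reduced.columns.tolist())  # ['age_2', 'age_3']

# Without drop_first, every row of indicators sums to 1,
# which is the linear dependence that drop_first avoids.
print(full.sum(axis=1).tolist())  # [1, 1, 1, 1, 1]
```

A row in which both age_2 and age_3 are 0 then stands for the dropped level, age group 1.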
df2 = pd.merge(df1, dummy_df, left_index=True, right_index=True)  # Join row by row on the index.
print(df2)
     id sex age  sex_2  age_2  age_3
0  1001   1   3    0.0    0.0    1.0
1  1002   2   2    1.0    1.0    0.0
2  1003   1   3    0.0    0.0    1.0
3  1004   2   1    1.0    0.0    0.0
4  1005   2   1    1.0    0.0    0.0
After the merge, each row's dummy columns line up with its original sex and age values, so you can confirm that the dummy variables were created correctly.
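Since the goal is to pass the data to scikit-learn as a numpy.ndarray, one possible next step is to select only the dummy columns from the merged frame. This is a sketch under assumptions rather than part of the original article: the data is rebuilt here with string codes, and the variable name X is chosen only for illustration.

```python
import pandas as pd

# Hypothetical reconstruction of the article's data (codes assumed to be strings).
df1 = pd.DataFrame(
    {
        'id': [1001, 1002, 1003, 1004, 1005],
        'sex': ['1', '2', '1', '2', '2'],
        'age': ['3', '2', '3', '1', '1'],
    },
    columns=['id', 'sex', 'age'],
)

# Same steps as in the article; astype(int) keeps the output as 0/1 on any pandas version.
dummy_df = pd.get_dummies(df1[['sex', 'age']], drop_first=True).astype(int)
df2 = pd.merge(df1, dummy_df, left_index=True, right_index=True)

# Keep only the dummy columns and convert them to a numpy.ndarray of features.
X = df2[['sex_2', 'age_2', 'age_3']].values
print(X.shape)  # (5, 3)
```

The id column and the original sex and age codes are left out of X, because they are either an identifier or already represented by the dummy columns.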