Create dummy variables in pandas (get_dummies)

This article uses pandas 0.18.1.

If you come from R and try to do the same thing in Python (scikit-learn), categorical variables can be awkward to handle. sklearn cannot use categorical data as is (when a numpy.ndarray is the input), so convert it to dummy variables first.

The data is as follows. sex is 1 for men and 2 for women, and age takes the values 1 to 3, one for each age group.

df1
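The listing of df1 itself is not reproduced here. As a minimal sketch, the frame can be rebuilt from the merged output shown later in this article. Storing sex and age as strings is an assumption, chosen so that get_dummies treats them as categorical.

```python
import pandas as pd

# Values reconstructed from the merged output later in the article.
# Assumption: sex and age are held as strings (object dtype), not integers.
df1 = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005],
    "sex": ["1", "2", "1", "2", "2"],
    "age": ["3", "2", "3", "1", "1"],
})
print(df1)
```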

df1 = df1.reset_index(drop=True)    # Reset the index just in case; the merge below is done on the index.

dummy_df = pd.get_dummies(df1[['sex', 'age']], drop_first=True)
print(dummy_df)

	sex_2	age_2	age_3
0	0.0	0.0	1.0
1	1.0	1.0	0.0
2	0.0	0.0	1.0
3	1.0	0.0	0.0
4	1.0	0.0	0.0
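One caveat applies if the codes are stored as integers rather than strings, which is an assumption about the data and not something the article states. By default get_dummies only encodes object, string, and category columns, so integer columns pass through unchanged. Naming the columns explicitly with the columns argument forces the encoding. A small sketch with a hypothetical frame df_int:

```python
import pandas as pd

# Hypothetical frame where the category codes are integers.
df_int = pd.DataFrame({
    "sex": [1, 2, 1, 2, 2],
    "age": [3, 2, 3, 1, 1],
})

# Without `columns`, integer columns are left as they are.
untouched = pd.get_dummies(df_int, drop_first=True)
print(list(untouched.columns))

# With `columns`, the listed columns are encoded regardless of dtype.
encoded = pd.get_dummies(df_int, columns=["sex", "age"], drop_first=True)
print(list(encoded.columns))
```

Converting the columns with astype(str) before calling get_dummies would be another way to get the same result.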

The columns have been converted to dummy variables. get_dummies creates one dummy column per category of each variable, and drop_first=True then drops the first of those columns. (If every column is kept, the dummy columns of a variable are linearly dependent, which is inconvenient, so the first one is excluded here.) Note that drop_first is available in pandas 0.18.0 and later.
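The dependency mentioned above can be checked directly. The following is a small sketch using only the age values from this article: with drop_first=False every row of the dummy columns sums to 1, so any one column can be computed from the others.

```python
import pandas as pd

# Age values taken from the example data, stored as strings (an assumption).
df = pd.DataFrame({"age": ["3", "2", "3", "1", "1"]})

full = pd.get_dummies(df, drop_first=False)     # one column per category
reduced = pd.get_dummies(df, drop_first=True)   # first category dropped

# In the full encoding exactly one column is set per row.
row_sums = full.astype(int).sum(axis=1)
print(list(full.columns))
print(row_sums.tolist())
print(list(reduced.columns))
```

The dropped category (age 1 here) is still represented: it is the row where all remaining dummy columns are 0.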


df2 = pd.merge(df1, dummy_df, left_index=True, right_index=True)
print(df2)

    id sex age  sex_2  age_2  age_3
0  1001   1   3    0.0    0.0    1.0
1  1002   2   2    1.0    1.0    0.0
2  1003   1   3    0.0    0.0    1.0
3  1004   2   1    1.0    0.0    0.0
4  1005   2   1    1.0    0.0    0.0

After merging, you can see that the dummy variables have been added correctly alongside the original columns.
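Because both frames share the same index, DataFrame.join is a shorter way to write the same index-based merge. This is an alternative to the pd.merge call used in this article, sketched here on a reconstructed copy of the data (the string dtype of sex and age is an assumption).

```python
import pandas as pd

# Data reconstructed from the merged output above; string codes are assumed.
df1 = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005],
    "sex": ["1", "2", "1", "2", "2"],
    "age": ["3", "2", "3", "1", "1"],
})
dummy_df = pd.get_dummies(df1[["sex", "age"]], drop_first=True)

# The article's approach: merge on the index of both frames.
merged = pd.merge(df1, dummy_df, left_index=True, right_index=True)

# Alternative: join aligns on the index by default.
joined = df1.join(dummy_df)
print(list(joined.columns))
```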
