Introduction

Used as is, scikit-learn's OneHotEncoder creates one dummy variable for each level of a categorical variable. In linear regression this causes multicollinearity, so we want to reduce the dummy variables to the number of levels minus one. I found out how to do it, so this is a note for future reference.
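
To see why the full set of dummy variables is a problem, here is a minimal sketch using only NumPy. The small matrix is made-up illustration data, not the data used later in this note. It assumes the regression includes an intercept term: with one dummy column per level, the dummy columns add up to the intercept column, so the design matrix loses rank.

```python
import numpy as np

# Hypothetical full one-hot encoding of a 3-level variable (one column per level)
full = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)

# Intercept column of ones, as a linear regression with an intercept would use
intercept = np.ones((full.shape[0], 1))

design_full = np.hstack([intercept, full])            # intercept + 3 dummies
design_reduced = np.hstack([intercept, full[:, 1:]])  # intercept + 2 dummies

rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)

print(design_full.shape[1], rank_full)        # number of columns vs. rank
print(design_reduced.shape[1], rank_reduced)  # number of columns vs. rank
```

The full design matrix has four columns but only rank three, while the reduced one has three columns and rank three, which is what dropping one level achieves.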

Environment

scikit-learn 0.21.2

Method

Setting the drop option of OneHotEncoder to "first" removes the dummy variable for the first category of each feature.
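
As a small sketch of what "first" refers to: with the default categories setting, the encoder sorts the levels, so the dropped level is the first one in sorted order rather than the first one that appears in the data. The three-row array below is illustration data only. The categories_ and drop_idx_ attributes show which level was removed.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with three levels (illustration data)
category = np.array([["T"], ["N"], ["D"]])

encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(category).toarray()

print(encoder.categories_)  # levels of each feature, in sorted order
print(encoder.drop_idx_)    # index of the dropped level for each feature
print(encoded.shape)        # 3 levels are encoded into 2 columns
```

Here "T" comes first in the data, but the dropped level is "D" because it comes first after sorting.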

Let's try it

Here, let's extract the column that holds the following categorical values and convert it into dummy variables.

[['D']
 ['D']
 ['D']
 ['T']
 ['T']
 ['T']
 ['N']
 ['N']
 ['N']]

The source is as follows. Note that when the drop option is set to "first", handle_unknown must be 'error' (which is also the default); in scikit-learn 0.21.2, any other setting raises an error.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np

def main():

    X = np.array([
        [1, "D"],
        [3, "D"],
        [5, "D"],
        [2, "T"],
        [4, "T"],
        [6, "T"],
        [1, "N"],
        [8, "N"],
        [2, "N"],
        ])
    y = np.array([2, 6, 10, 6, 12, 18, 1, 8, 2])

    # Take out the second column (the categorical one)
    category = X[:, [1]]
    print(category)
    encoder = OneHotEncoder(handle_unknown='error', drop='first')
    encoder.fit(category)
    result = encoder.transform(category)
    print(result.toarray())


if __name__ == "__main__":
    main()

Result

Without drop='first':

[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]]

With drop='first':

[[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]]

As expected, the first column (the one for "D") is gone. The encoded data can now be passed to the fit method of LinearRegression without the redundant dummy variable.
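
For completeness, here is one possible way to go on to the fit step. This is a sketch rather than part of the original script: it reuses the same data, and joining the numeric column to the dummy columns with np.hstack is my own assumption about how the features would be combined.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Same data as above, with the numeric and categorical columns kept separate
numeric = np.array([[1], [3], [5], [2], [4], [6], [1], [8], [2]], dtype=float)
category = np.array([["D"]] * 3 + [["T"]] * 3 + [["N"]] * 3)
y = np.array([2, 6, 10, 6, 12, 18, 1, 8, 2])

# Encode the categorical column into (number of levels - 1) dummy columns
dummies = OneHotEncoder(drop="first").fit_transform(category).toarray()

# Assumed feature layout: [numeric value, dummy for "N", dummy for "T"]
features = np.hstack([numeric, dummies])

model = LinearRegression()
model.fit(features, y)

print(features.shape)     # 9 samples, 1 numeric + 2 dummy columns
print(model.coef_.shape)  # one coefficient per feature column
```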

Reduce redundant dummy variables with OneHotEncoder