When you do classification or regression in machine learning, the computer treats numbers as ordered, continuous values. In other words, given numbers from 1 to 10, 10 is always recognized as being larger than 1.

"Well, obviously!" you might think.

But consider this.

For example, if animals are converted to numbers as shown below, does that really mean Human (3) is numerically "larger" than Tiger (0)? Worse, based on this table, if you take the average of Tiger (0) and Cat (2), you end up with Panda (1), which makes no sense (a quick sketch of this pitfall follows the table).
| Animal | Transform to Numbers |
|---|---|
| Tiger | 0 |
| Panda | 1 |
| Cat | 2 |
| Human | 3 |
| Python | 4 |
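To see the pitfall concretely, here is a tiny sketch using the codes from the table above (the `codes` dict is just my own illustration):

```python
import numpy as np

codes = {"Tiger": 0, "Panda": 1, "Cat": 2, "Human": 3, "Python": 4}
names = {v: k for k, v in codes.items()}

# Averaging two labels produces another label -- a meaningless "in between" animal
mean = np.mean([codes["Tiger"], codes["Cat"]])  # 1.0
print(names[int(mean)])                         # "Panda"
```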
OneHotEncoder is for exactly this situation: categorical values where the magnitude and order of the assigned numbers carry no meaning.

To make it easier to understand, here is what the table above looks like after it is processed by OneHotEncoder.
| Tiger | Panda | Cat | Human | Python |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 |
In other words, each discrete category gets its own column, and membership is marked with a 1. This way, each animal is treated as an independent value rather than as a point on a continuous number line.
```python
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])
```
Retrieved from the official scikit-learn documentation.
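Note that the snippet above comes from an older scikit-learn release; newer versions dropped the `n_values_` and `feature_indices_` attributes in favor of `categories_`. A minimal sketch of the same example with the current API:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One array of learned categories per input column
print(enc.categories_)
# [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]

# transform() returns a sparse matrix; toarray() makes it dense
print(enc.transform([[0, 1, 1]]).toarray())
# [[1. 0. 0. 1. 0. 0. 1. 0. 0.]]
```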
Let's bring back the table used above. It replaces animal names such as Tiger and Panda with numbers, and LabelEncoder is what performs this replacement. So the workflow is: apply LabelEncoder first, then apply OneHotEncoder (a sketch of the full chain follows the table).
| Animal | Transform to Numbers |
|---|---|
| Tiger | 0 |
| Panda | 1 |
| Cat | 2 |
| Human | 3 |
| Python | 4 |
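Here is a minimal sketch of that two-step chain on the animal column (assuming a recent scikit-learn; note that LabelEncoder assigns codes in alphabetical order, so the numbers differ from the table above):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

animals = np.array(["Tiger", "Panda", "Cat", "Human", "Python"])

# Step 1: strings -> integer codes (alphabetical: Cat=0, Human=1, ...)
le = LabelEncoder()
codes = le.fit_transform(animals)  # array([4, 2, 0, 1, 3])

# Step 2: integer codes -> one 0/1 column per category
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()
print(onehot)
```

(In recent scikit-learn versions, OneHotEncoder can also take the string column directly, which makes the LabelEncoder step optional.)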
```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])
# => array([2, 2, 1])
list(le.inverse_transform([2, 2, 1]))
# => ['tokyo', 'tokyo', 'paris']

# By the way, applying it to a DataFrame column looks like this:
# LabelEncoder is applied to the column called City in df.
df.City = le.fit_transform(df.City)
# Or
df.City = le.fit_transform(df['City'].values)
# When you want to undo it
df.City = le.inverse_transform(df.City)
```
See the official LabelEncoder documentation.
I was able to use LabelEncoder without trouble, but I couldn't get OneHotEncoder to work well. After some research, I found that pandas' get_dummies does almost the same thing. (By the way, if anyone can explain OneHotEncoder well or knows a nicely organized site about it, please let me know.) So get_dummies plays the same role as OneHotEncoder: it creates one column for each unique categorical value. The result looks roughly like this.
| Tiger | Panda | Cat | Human | Python |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 |
```python
import pandas as pd

df = pd.get_dummies(df, columns=['animal'])
# Creates one column per unique value of 'animal', encoded as 0/1.
```
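For a concrete feel, here is a tiny made-up DataFrame run through get_dummies (in recent pandas the dummy columns come out as booleans; older versions used 0/1 integers):

```python
import pandas as pd

df = pd.DataFrame({"animal": ["Tiger", "Panda", "Cat"]})
print(pd.get_dummies(df, columns=["animal"]))
#    animal_Cat  animal_Panda  animal_Tiger
# 0       False         False          True
# 1       False          True         False
# 2        True         False         False
```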
See the official get_dummies documentation.
This post nicely summarizes the differences between LabelEncoder and OneHotEncoder.