When you do classification or regression in machine learning, the computer treats numbers as ordered, continuous values. In other words, given numbers from 1 to 10, 10 is always recognized as being larger than 1.

"Well, obviously!" you might think.

But consider this.

For example, if animals are converted to numbers as shown below, does that really mean Human (3) is numerically "larger" than Tiger (0)? Worse, based on this table, if you take the average of Tiger (0) and Cat (2), you end up with Panda (1), which makes no sense (a quick sketch of this pitfall follows the table).
| Animal | Transform to Numbers |
|---|---|
| Tiger | 0 |
| Panda | 1 |
| Cat | 2 |
| Human | 3 |
| Python | 4 |
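To see the pitfall concretely, here is a tiny sketch using the codes from the table above (the `codes` dict is just my own illustration):

```python
import numpy as np

codes = {"Tiger": 0, "Panda": 1, "Cat": 2, "Human": 3, "Python": 4}
names = {v: k for k, v in codes.items()}

# Averaging two labels produces another label -- a meaningless "in between" animal
mean = np.mean([codes["Tiger"], codes["Cat"]])  # 1.0
print(names[int(mean)])                         # "Panda"
```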
OneHotEncoder is for exactly this situation: categorical values where the magnitude and order of the assigned numbers carry no meaning.

To make it easier to understand, here is what the table above looks like after it is processed by OneHotEncoder.
| Tiger | Panda | Cat | Human | Python |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 |
In other words, each discrete category gets its own column, and membership is marked with a 1. This way, each animal is treated as an independent value rather than as a point on a continuous number line.
```python
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])
```
Retrieved from the official scikit-learn documentation.
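Note that the snippet above comes from an older scikit-learn release; newer versions dropped the `n_values_` and `feature_indices_` attributes in favor of `categories_`. A minimal sketch of the same example with the current API:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One array of learned categories per input column
print(enc.categories_)
# [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]

# transform() returns a sparse matrix; toarray() makes it dense
print(enc.transform([[0, 1, 1]]).toarray())
# [[1. 0. 0. 1. 0. 0. 1. 0. 0.]]
```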
Let's bring back the table used above. It replaces animal names such as Tiger and Panda with numbers, and LabelEncoder is what performs this replacement. So the workflow is: apply LabelEncoder first, then apply OneHotEncoder (a sketch of the full chain follows the table).
| Animal | Transform to Numbers |
|---|---|
| Tiger | 0 |
| Panda | 1 |
| Cat | 2 |
| Human | 3 |
| Python | 4 |
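Here is a minimal sketch of that two-step chain on the animal column (assuming a recent scikit-learn; note that LabelEncoder assigns codes in alphabetical order, so the numbers differ from the table above):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

animals = np.array(["Tiger", "Panda", "Cat", "Human", "Python"])

# Step 1: strings -> integer codes (alphabetical: Cat=0, Human=1, ...)
le = LabelEncoder()
codes = le.fit_transform(animals)  # array([4, 2, 0, 1, 3])

# Step 2: integer codes -> one 0/1 column per category
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()
print(onehot)
```

(In recent scikit-learn versions, OneHotEncoder can also take the string column directly, which makes the LabelEncoder step optional.)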
```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])
# => array([2, 2, 1])
list(le.inverse_transform([2, 2, 1]))
# => ['tokyo', 'tokyo', 'paris']

# By the way, applying it to a DataFrame column looks like this:
# LabelEncoder is applied to the column called City in df.
df.City = le.fit_transform(df.City)
# Or
df.City = le.fit_transform(df['City'].values)
# When you want to undo it
df.City = le.inverse_transform(df.City)
```
See the official LabelEncoder documentation.
I was able to use LabelEncoder without trouble, but I couldn't get OneHotEncoder to work well. After some research, I found that pandas' get_dummies does almost the same thing. (By the way, if anyone can explain OneHotEncoder well or knows a nicely organized site about it, please let me know.) So get_dummies plays the same role as OneHotEncoder: it creates one column for each unique categorical value. The result looks roughly like this.
| Tiger | Panda | Cat | Human | Python |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 1 |
```python
import pandas as pd

df = pd.get_dummies(df, columns=['animal'])
# Creates one column per unique value of 'animal', encoded as 0/1.
```
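For a concrete feel, here is a tiny made-up DataFrame run through get_dummies (in recent pandas the dummy columns come out as booleans; older versions used 0/1 integers):

```python
import pandas as pd

df = pd.DataFrame({"animal": ["Tiger", "Panda", "Cat"]})
print(pd.get_dummies(df, columns=["animal"]))
#    animal_Cat  animal_Panda  animal_Tiger
# 0       False         False          True
# 1       False          True         False
# 2        True         False         False
```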
See the official get_dummies documentation.
This post nicely summarizes the differences between LabelEncoder and OneHotEncoder.