In machine learning, string data such as categorical values cannot be fed into a model unless it is converted to numerical data. Conversely, numerical data that is not on an ordinal scale should be treated as a categorical variable. In this article, I'll show you how to convert categorical variables into a form a machine learning model can understand.
This time, as in "Feature Engineering: Traveling with Pokemon - Numerical Edition -", we will use the [Pokemon Dataset](https://www.kaggle.com/abcsds/pokemon).
```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Load the Pokemon dataset (the zip archive downloaded from Kaggle)
df = pd.read_csv('./data/121_280_bundle_archive.zip')
df.head()
```
The data looks like this:
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
Dummy encoding (also called one-hot encoding) is the most popular and most frequently used technique for handling categorical variables in feature engineering. Each category value is represented with 0/1 bits: the bit for the matching category value is 1, and the bits for all other values are 0.
pandas has a simple function for dummy encoding. Let's take a look at the code.
```python
# One-hot Encoding
gdm = pd.get_dummies(df['Type 1'])          # one 0/1 column per 'Type 1' value
gdm = pd.concat([df['Name'], gdm], axis=1)  # keep the Pokemon name next to the bits
gdm.head()
```
You can see that the bit corresponding to Bulbasaur's Grass type is 1.
Name | Bug | Dark | Dragon | Electric | Fairy | Fighting | Fire | Flying | Ghost | Grass | Ground | Ice | Normal | Poison | Psychic | Rock | Steel | Water |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Bulbasaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Ivysaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Venusaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
VenusaurMega Venusaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Charmander | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Dummy encoding represents a category with 0/1 bits, whereas label encoding represents each category value with a single integer.
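As a side note (this snippet is my addition, not part of the original code), label encoding itself is a one-liner with sklearn's LabelEncoder or pandas' factorize:

```python
from sklearn.preprocessing import LabelEncoder

# Label Encoding: each 'Type 1' value becomes a single integer code
le = LabelEncoder()
type1_label = le.fit_transform(df['Type 1'])
# pandas equivalent: codes, uniques = pd.factorize(df['Type 1'])
```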
The difference between feature hashing and the conversions above is that feature hashing reduces the number of features after conversion. It is enough to picture a magic function called a hash function that shrinks the number of input features. ~~Anything is easier to remember with an image.~~
Let's take a look at the code. sklearn provides a FeatureHasher class, so let's use it. Here we compress the Pokemon types down to 5 features with feature hashing.
You might wonder, "When would I use this?" Remember it for the times when you have too many category values.
```python
# Feature Hashing: compress 'Type 1' into 5 hashed features
fh = FeatureHasher(n_features=5, input_type='string')
hash_table = pd.DataFrame(fh.transform(df['Type 1']).todense())
hash_table.head()
```
Features after conversion
0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
2 | 0 | 0 | 0 | -1 |
2 | 0 | 0 | 0 | -1 |
2 | 0 | 0 | 0 | -1 |
2 | 0 | 0 | 0 | -1 |
1 | -1 | 0 | -1 | 1 |
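One caveat (my note, not from the original article): with `input_type='string'`, FeatureHasher expects each sample to be an iterable of string tokens, so passing the raw Series hashes the individual characters of each type name, which is what produces the table above. If you would rather hash each whole type name as a single token, a minimal sketch looks like this:

```python
# Hash each whole type name as one token instead of its characters
fh = FeatureHasher(n_features=5, input_type='string')
hash_table_token = pd.DataFrame(
    fh.transform([[t] for t in df['Type 1']]).todense()
)
```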
When should you choose which method? One answer is to decide based on the specs of the machine you use for the analysis. Dummy encoding and label encoding are simple, but when there are too many category values they can cause memory errors. In that case, consider feature hashing, which compresses the number of features.
However, high-spec machines have recently become available for free, and GBDT, the decision-tree model commonly used on Kaggle, can handle label encoding as is, so feature hashing does not get its turn all that often.
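For reference, here is a minimal sketch (my addition, assuming lightgbm is installed and using Legendary as a toy target) of feeding a label-encoded / category-typed column straight into a GBDT:

```python
import lightgbm as lgb

# Toy example: predict Legendary from a few stats plus the categorical type
X = df[['Type 1', 'HP', 'Attack', 'Defense', 'Speed']].copy()
X['Type 1'] = X['Type 1'].astype('category')  # LightGBM treats pandas category columns as categorical
y = df['Legendary']

model = lgb.LGBMClassifier(n_estimators=50)
model.fit(X, y)
```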