In machine learning, string data such as categorical values cannot be fed into a model unless it is converted to numerical data. Conversely, numerical data that is not on an ordinal scale should be treated as a categorical variable. In this article, I'll show you how to convert categorical variables into a form a machine learning model can understand.
This time, as in "Feature Engineering: Traveling with Pokemon - Numerical Edition -", we will use the [Pokemon Dataset](https://www.kaggle.com/abcsds/pokemon).
```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Load the Pokemon dataset (the zip archive downloaded from Kaggle)
df = pd.read_csv('./data/121_280_bundle_archive.zip')
df.head()
```
The data looks like this:
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
Dummy encoding (also called one-hot encoding) is the most popular and most frequently used technique for handling categorical variables in feature engineering. Each category value is represented with 0/1 bits: the bit for the matching category value is 1, and the bits for all other values are 0.
pandas has a simple function for dummy encoding. Let's take a look at the code.
```python
# One-hot Encoding
gdm = pd.get_dummies(df['Type 1'])          # one 0/1 column per 'Type 1' value
gdm = pd.concat([df['Name'], gdm], axis=1)  # keep the Pokemon name next to the bits
gdm.head()
```
You can see that the bit corresponding to Bulbasaur's Grass type is 1.
Name | Bug | Dark | Dragon | Electric | Fairy | Fighting | Fire | Flying | Ghost | Grass | Ground | Ice | Normal | Poison | Psychic | Rock | Steel | Water |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Bulbasaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Ivysaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Venusaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
VenusaurMega Venusaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Charmander | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Dummy encoding represents a category with 0/1 bits, whereas label encoding represents each category value with a single integer.
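As a side note (this snippet is my addition, not part of the original code), label encoding itself is a one-liner with sklearn's LabelEncoder or pandas' factorize:

```python
from sklearn.preprocessing import LabelEncoder

# Label Encoding: each 'Type 1' value becomes a single integer code
le = LabelEncoder()
type1_label = le.fit_transform(df['Type 1'])
# pandas equivalent: codes, uniques = pd.factorize(df['Type 1'])
```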
The difference between feature hashing and the conversions above is that feature hashing reduces the number of features after conversion. It is enough to picture a magic function called a hash function that shrinks the number of input features. ~~Anything is easier to remember with an image.~~
Let's take a look at the code. sklearn provides a FeatureHasher class, so let's use it. Here we compress the Pokemon types down to 5 features with feature hashing.
You might wonder, "When would I use this?" Remember it for the times when you have too many category values.
```python
# Feature Hashing: compress 'Type 1' into 5 hashed features
fh = FeatureHasher(n_features=5, input_type='string')
hash_table = pd.DataFrame(fh.transform(df['Type 1']).todense())
hash_table.head()
```
Features after conversion
0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|
2 | 0 | 0 | 0 | -1 |
2 | 0 | 0 | 0 | -1 |
2 | 0 | 0 | 0 | -1 |
2 | 0 | 0 | 0 | -1 |
1 | -1 | 0 | -1 | 1 |
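One caveat (my note, not from the original article): with `input_type='string'`, FeatureHasher expects each sample to be an iterable of string tokens, so passing the raw Series hashes the individual characters of each type name, which is what produces the table above. If you would rather hash each whole type name as a single token, a minimal sketch looks like this:

```python
# Hash each whole type name as one token instead of its characters
fh = FeatureHasher(n_features=5, input_type='string')
hash_table_token = pd.DataFrame(
    fh.transform([[t] for t in df['Type 1']]).todense()
)
```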
When should you choose which method? One answer is to decide based on the specs of the machine you use for the analysis. Dummy encoding and label encoding are simple, but when there are too many category values they can cause memory errors. In that case, consider feature hashing, which compresses the number of features.
However, high-spec machines have recently become available for free, and GBDT, the decision-tree model commonly used on Kaggle, can handle label encoding as is, so feature hashing does not get its turn all that often.
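For reference, here is a minimal sketch (my addition, assuming lightgbm is installed and using Legendary as a toy target) of feeding a label-encoded / category-typed column straight into a GBDT:

```python
import lightgbm as lgb

# Toy example: predict Legendary from a few stats plus the categorical type
X = df[['Type 1', 'HP', 'Attack', 'Defense', 'Speed']].copy()
X['Type 1'] = X['Type 1'].astype('category')  # LightGBM treats pandas category columns as categorical
y = df['Legendary']

model = lgb.LGBMClassifier(n_estimators=50)
model.fit(X, y)
```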