Convert qualitative variables (categorical variables) to One-hot vectors
Data: Kaggle's Titanic data
Environment: kaggle notebook
onehot_encoding.py
# Import modules and list the available input files
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
Read data
onehot_encoding.py
train_data=pd.read_csv('../input/titanic/train.csv')
test_data=pd.read_csv('../input/titanic/test.csv')
Take a look at the data
onehot_encoding.py
train_data.head()
You can see that some columns hold categorical variables. We aim to convert these into one-hot vectors.
Strings are awkward to handle as they are, so first assign a distinct integer to each category.
Use pandas' factorize(). It returns both the integer codes (train_cat_encoded) and the list of categories (train_categories).
onehot_encoding.py
train_cat=train_data['Embarked']
train_cat_encoded,train_categories=train_cat.factorize()
#Take a look
print(train_cat.head())
print(train_cat_encoded[:10])
print(train_categories)
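To make the behaviour of factorize() easier to see, here is a minimal sketch on a small hand-made Series. The values are illustrative stand-ins for the Embarked column, not the real Titanic data, and the names embarked, codes and categories are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Embarked column (illustrative values, not the real data)
embarked = pd.Series(['S', 'C', 'S', 'Q', np.nan, 'C'])

# factorize() returns an integer code per row and the unique labels
codes, categories = embarked.factorize()

print(codes)       # one integer per row; a missing value is coded as -1
print(categories)  # unique labels in order of first appearance
```

Note that a missing value receives the code -1 by default. If the column contains missing values and the codes are passed to OneHotEncoder unchanged, -1 is treated as a category of its own, so it may be worth filling or dropping missing values first.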
Then convert the integer codes to one-hot vectors.
Use the OneHotEncoder provided by scikit-learn.
onehot_encoding.py
# Import OneHotEncoder from scikit-learn
from sklearn.preprocessing import OneHotEncoder
# Convert to one-hot vectors
oe=OneHotEncoder(categories='auto')
train_cat_1hot=oe.fit_transform(train_cat_encoded.reshape(-1,1))
#Take a look inside
train_cat_1hot
Conversion completed.
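By default, fit_transform() on a OneHotEncoder returns a SciPy sparse matrix, so displaying the result only shows a summary rather than the 0/1 values. The following is a minimal sketch of how the result could be inspected, using toy integer codes in place of train_cat_encoded. The names codes, encoder, one_hot and dense are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy integer codes standing in for train_cat_encoded (illustrative values)
codes = np.array([0, 1, 0, 2, 1])

encoder = OneHotEncoder(categories='auto')
# OneHotEncoder expects a 2-D input, hence the reshape to a single column
one_hot = encoder.fit_transform(codes.reshape(-1, 1))

# Convert the sparse result to a dense NumPy array for viewing
dense = one_hot.toarray()

print(encoder.categories_)  # the category values the encoder found
print(dense)                # one row per sample, one column per category
```

A dense array is convenient for checking a small sample, but for large data the sparse form uses far less memory, so it is usually better to keep it sparse when passing it to a model.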