"Kaggle Memorandum" Conversion to One-hot Vector

Purpose

Convert qualitative (categorical) variables to one-hot vectors.

Data and environment

Data: Kaggle's Titanic data

Environment: Kaggle notebook

Method

onehot_encoding.py


# Import modules and list the files available under /kaggle/input
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Read data

onehot_encoding.py


train_data=pd.read_csv('../input/titanic/train.csv')
test_data=pd.read_csv('../input/titanic/test.csv')

Take a look at the data

onehot_encoding.py


train_data.head()
(Screenshot: output of train_data.head())

You can see that several columns of the data frame hold categorical variables. The goal is to convert these into one-hot vectors.
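As a rough sketch of how the categorical columns could be picked out programmatically rather than by eye, the snippet below selects the non-numeric columns of a small stand-in frame. The frame `toy` and its values are made up for illustration and are not the Titanic data itself.

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic-like columns (values are invented)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Non-numeric columns are the usual candidates for one-hot encoding
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

On the real data, the same call on train_data would also return free-text columns such as names, so the list still needs a manual check.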

Strings are awkward to handle as they are, so first assign a distinct integer to each category using pandas's factorize().

factorize() returns both the integer codes (train_cat_encoded) and the list of categories (train_categories).
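To make the two return values concrete, here is a minimal sketch on a hand-made Series. The Series `ports` is an assumption for illustration only. It also shows that, with pandas's default settings, a missing value is given the code -1 and does not appear in the category list.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of port codes, including one missing value
ports = pd.Series(["S", "C", "S", "Q", np.nan])

# codes: one integer per row; categories: the distinct labels in order of first appearance
codes, categories = ports.factorize()

print(codes)        # integer code for each row; the missing value becomes -1
print(categories)   # Index of the distinct labels
```

The integer assigned to each label depends only on the order in which labels first appear, so the codes carry no ordinal meaning.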

onehot_encoding.py


train_cat=train_data['Embarked']
train_cat_encoded,train_categories=train_cat.factorize()

# Inspect the original column, the integer codes, and the categories
print(train_cat.head())
print(train_cat_encoded[:10])
print(train_categories)
(Screenshot: output of the three print statements)

Next, convert the integer codes to one-hot vectors

Use the OneHotEncoder provided by scikit-learn.

onehot_encoding.py


# Import OneHotEncoder from scikit-learn
from sklearn.preprocessing import OneHotEncoder

# Convert to one-hot vectors (the encoder expects a 2-D array, hence the reshape)
oe=OneHotEncoder(categories='auto')
train_cat_1hot=oe.fit_transform(train_cat_encoded.reshape(-1,1))

#Take a look inside
train_cat_1hot
(Screenshot: output of train_cat_1hot)

Conversion completed.
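By default OneHotEncoder returns a SciPy sparse matrix, so printing the result shows only a summary. The sketch below, using a small made-up array of codes rather than the Titanic column, shows one way to view the actual 0/1 rows with toarray() and to check which code each column corresponds to.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer codes, as factorize() might return them
codes = np.array([0, 1, 0, 2])

encoder = OneHotEncoder(categories="auto")
onehot = encoder.fit_transform(codes.reshape(-1, 1))  # sparse matrix, one column per code

dense = onehot.toarray()     # convert to a regular NumPy array for inspection
print(dense)
print(encoder.categories_)   # the distinct codes, in column order
```

Note that if the column contains missing values, factorize() marks them with -1, and the encoder treats -1 as one more category with its own column.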
