Introduction

I forget what I study quickly, so I am posting this article on Qiita as a memo and as practice in writing things up. I would appreciate comments pointing out mistakes or better approaches.

Idea

I want to one-hot encode features for machine learning, but I don't know in advance which values will appear in the test data. Every site says to use get_dummies for one-hot encoding. But suppose train_df['sex'] contains both Male and Female while test_df['sex'] contains only Male. In that case, plain get_dummies produces a different number of columns for the two frames, which is no good.
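The mismatch is easy to reproduce. Here is a minimal sketch, assuming toy frames with a single sex column (the rows are made up for illustration):

```python
import pandas as pd

# Toy data (assumed for illustration): the test set happens to contain only "Male".
train_df = pd.DataFrame({"sex": ["Male", "Female", "Male"]})
test_df = pd.DataFrame({"sex": ["Male", "Male"]})

# Plain get_dummies only creates columns for the values it actually sees.
train_cols = list(pd.get_dummies(train_df).columns)
test_cols = list(pd.get_dummies(test_df).columns)

print(train_cols)  # two columns: sex_Female and sex_Male
print(test_cols)   # one column: sex_Male
```

A model fitted on the two-column train frame cannot take the one-column test frame as input without extra alignment work.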

After a lot of research, I arrived at the following article.

[Python] Don't use pandas.get_dummies for machine learning

That article does not use get_dummies; it uses sklearn's OneHotEncoder instead. However, I wanted to analyze the data as pandas objects and convert to NumPy only at the very end, so I was set on finding a way to do it in pandas.

That approach is explained in the comments on the article above. In this article I break it down to a level I can digest myself.

Implementation

The implementation ends up using get_dummies.

import pandas as pd

# (i) In df_train, the unique values of A are "hoge" and "fuga"; those of B are "a" and "b"
df_train = pd.DataFrame({"A": ["hoge", "fuga"], "B": ["a", "b"]})

# (ii) In df_test, the unique values of A are "hoge" and "piyo"; those of B are "a" and "c"
df_test = pd.DataFrame({"A": ["hoge", "piyo"], "B": ["a", "c"]})

# (iii) Use Categorical to fix the categories: A is "hoge" or "fuga", B is "a" or "b".
# Values outside these categories ("piyo", "c") become NaN.
df_train["A"] = pd.Categorical(df_train["A"], categories=["hoge", "fuga"])
df_train["B"] = pd.Categorical(df_train["B"], categories=["a", "b"])
df_test["A"] = pd.Categorical(df_test["A"], categories=["hoge", "fuga"])
df_test["B"] = pd.Categorical(df_test["B"], categories=["a", "b"])

# (iv) One-hot encode with get_dummies
# dtype=int gives 0/1 columns; recent pandas versions return booleans by default.
df_train = pd.get_dummies(df_train, dtype=int)
df_test = pd.get_dummies(df_test, dtype=int)

The final one-hot data is as follows.

df_train
   A_hoge  A_fuga  B_a  B_b
0       1       0    1    0
1       0       1    0    1
df_test
   A_hoge  A_fuga  B_a  B_b
0       1       0    1    0
1       0       0    0    0

This way, only the categories seen in train are used, and unseen test values such as "piyo" and "c" become all-zero rows. The categories are hard-coded here, but deriving them from train with unique would make this more flexible.
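One way to avoid the hard-coding is to read the categories off the train frame. The following is only a sketch under my own assumptions: the helper name encode is hypothetical, every column is treated as categorical, and missing values in train are dropped before collecting categories.

```python
import pandas as pd

df_train = pd.DataFrame({"A": ["hoge", "fuga"], "B": ["a", "b"]})
df_test = pd.DataFrame({"A": ["hoge", "piyo"], "B": ["a", "c"]})

# Learn the categories from train only, in order of first appearance.
# dropna() is used because NaN cannot be listed as a category.
categories = {col: list(df_train[col].dropna().unique()) for col in df_train.columns}


def encode(df, categories):
    """Hypothetical helper: one-hot encode df using fixed per-column categories."""
    df = df.copy()
    for col, cats in categories.items():
        # Values not listed in cats become NaN and end up as all-zero rows.
        df[col] = pd.Categorical(df[col], categories=cats)
    return pd.get_dummies(df, dtype=int)


train_encoded = encode(df_train, categories)
test_encoded = encode(df_test, categories)

print(train_encoded)
print(test_encoded)
```

Because both frames go through the same categories dictionary, they always end up with the same columns in the same order, whatever values appear in test.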

Supplement

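The categories are fixed for df_train as well because, without that, get_dummies orders the columns by sorted category value. A_fuga would then come before A_hoge in df_train, and the column order would no longer match df_test.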
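The difference in column order can be checked directly. This is a small sketch using the same toy train frame as above:

```python
import pandas as pd

df_train = pd.DataFrame({"A": ["hoge", "fuga"], "B": ["a", "b"]})

# Without fixed categories, the dummy columns follow sorted category order.
default_cols = list(pd.get_dummies(df_train).columns)

# With explicit categories, the columns follow the order given.
fixed = df_train.copy()
fixed["A"] = pd.Categorical(fixed["A"], categories=["hoge", "fuga"])
fixed["B"] = pd.Categorical(fixed["B"], categories=["a", "b"])
fixed_cols = list(pd.get_dummies(fixed).columns)

print(default_cols)  # A_fuga comes before A_hoge
print(fixed_cols)    # A_hoge comes before A_fuga
```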

"Usable" one-hot Encoding method for machine learning
