Even when I study something, I soon forget it, so I am posting this article on Qiita as a memo and as output practice. I would be grateful for comments pointing out mistakes or better approaches.
I want to one-hot encode features for machine learning, but I don't know in advance which categories will appear in the test data.
Every site says that if you want to do one-hot encoding, you should use `get_dummies`. But suppose, for example, that **`train_df['sex']` contains Male and Female, but `test_df['sex']` contains only Male**. In that case, if you call `get_dummies` in the usual way, the number of columns created differs between the training and test data. That's no good.
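The mismatch described above can be reproduced with a minimal sketch. The two-row data frames here are hypothetical toy data made up for illustration, not data from the original article.

```python
import pandas as pd

# Hypothetical toy data: train has both values, test has only one.
train_df = pd.DataFrame({"sex": ["Male", "Female"]})
test_df = pd.DataFrame({"sex": ["Male", "Male"]})

# get_dummies only creates columns for the values it sees in each frame.
train_cols = list(pd.get_dummies(train_df).columns)
test_cols = list(pd.get_dummies(test_df).columns)

print(train_cols)  # ['sex_Female', 'sex_Male']
print(test_cols)   # ['sex_Male']
```

A model trained on two columns cannot be fed a test frame with only one, which is why the columns need to be fixed in advance.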
After a lot of research, I arrived at the following article.
[Python] Don't use pandas.get_dummies for machine learning
That article does not use `get_dummies`; it uses sklearn's `OneHotEncoder` instead. However, I wanted to analyze the data in pandas format and only convert it to NumPy format at the very end, so I was determined to find a way to do it with pandas.
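For comparison, the `OneHotEncoder` route might look roughly like the sketch below. This is not code from the linked article: the toy data and the choice of `handle_unknown="ignore"` are my own assumptions. It also shows why this route leaves pandas, since the result is a NumPy array rather than a DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data for illustration.
df_train = pd.DataFrame({"A": ["hoge", "fuga"], "B": ["a", "b"]})
df_test = pd.DataFrame({"A": ["hoge", "piyo"], "B": ["a", "c"]})

# Assumption: handle_unknown="ignore" encodes categories that were
# not seen during fit as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(df_train)

# transform returns a sparse matrix by default; convert it to a dense array.
encoded_test = encoder.transform(df_test).toarray()

print(encoder.categories_)  # categories learned from the training data
print(encoded_test)
```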
A way to do this is explained in a comment on the article above, and in this post I break it down to the point where I can digest it in my own way.
The implementation ends up using `get_dummies` after all.
import pandas as pd

# (i) In df_train, the unique values of A are "hoge" and "fuga"; those of B are "a" and "b"
df_train = pd.DataFrame({"A": ["hoge", "fuga"], "B": ["a", "b"]})
# (ii) In df_test, the unique values of A are "hoge" and "piyo"; those of B are "a" and "c"
df_test = pd.DataFrame({"A": ["hoge", "piyo"], "B": ["a", "c"]})
# (iii) Use Categorical to hard-code that A is "hoge" or "fuga" and B is "a" or "b"
df_train["A"] = pd.Categorical(df_train["A"], categories=["hoge", "fuga"])
df_train["B"] = pd.Categorical(df_train["B"], categories=["a", "b"])
df_test["A"] = pd.Categorical(df_test["A"], categories=["hoge", "fuga"])
df_test["B"] = pd.Categorical(df_test["B"], categories=["a", "b"])
# (iv) One-hot encode with get_dummies
df_train = pd.get_dummies(df_train)
df_test = pd.get_dummies(df_test)
The final one-hot data is as follows.
df_train
   A_hoge  A_fuga  B_a  B_b
0       1       0    1    0
1       0       1    0    1
df_test
   A_hoge  A_fuga  B_a  B_b
0       1       0    1    0
1       0       0    0    0
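I was able to encode using only the unique values of the training data: "piyo" and "c", which appear only in the test data, become all zeros. This time the categories were hard-coded, but if you obtain them with `unique` instead, you can handle this more flexibly.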
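A minimal sketch of the `unique`-based variant follows, using the same toy data. The loop, the `dropna()` call, and the `dtype=int` argument are my own additions for illustration and are not from the referenced comment.

```python
import pandas as pd

df_train = pd.DataFrame({"A": ["hoge", "fuga"], "B": ["a", "b"]})
df_test = pd.DataFrame({"A": ["hoge", "piyo"], "B": ["a", "c"]})

for col in ["A", "B"]:
    # Take the categories from the training data only, in order of first
    # appearance. dropna() is an added safeguard against missing values.
    categories = df_train[col].dropna().unique()
    df_train[col] = pd.Categorical(df_train[col], categories=categories)
    # Test values outside `categories` (here "piyo" and "c") become NaN.
    df_test[col] = pd.Categorical(df_test[col], categories=categories)

# dtype=int makes the result 0/1 rather than booleans on newer pandas.
train_encoded = pd.get_dummies(df_train, dtype=int)
test_encoded = pd.get_dummies(df_test, dtype=int)

print(train_encoded)
print(test_encoded)
```

Because the categories are derived from `df_train` alone, both frames end up with the same columns in the same order, whatever values the test data contains.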
The reason the categories are fixed for df_train as well is that otherwise `get_dummies` creates the columns in sorted order, so fuga would come before hoge and the column order would no longer match df_test.
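The ordering difference can be checked with a small sketch. The single-column frame here is a hypothetical example cut down from the data above.

```python
import pandas as pd

df = pd.DataFrame({"A": ["hoge", "fuga"]})

# Without fixed categories, the dummy columns come out in sorted order.
plain_cols = list(pd.get_dummies(df).columns)

# With explicit categories, the columns follow the order that was given.
df_fixed = df.copy()
df_fixed["A"] = pd.Categorical(df_fixed["A"], categories=["hoge", "fuga"])
fixed_cols = list(pd.get_dummies(df_fixed).columns)

print(plain_cols)  # ['A_fuga', 'A_hoge']
print(fixed_cols)  # ['A_hoge', 'A_fuga']
```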