Converting the MovieLens dataset in this way turned out to be awkward, so I am leaving this as a memo.
The dataset is provided in a one-hot encoded form, as shown below:
   movie_id  action  horror  romance  sf
0         1       1       0        0   0
1         2       0       0        1   0
2         2       1       0        0   0
3         3       0       0        0   1
4         3       1       0        0   0
5         4       0       1        0   0
6         5       0       0        0   1
7         5       0       1        0   0
8         5       1       0        0   0
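To follow along, the sample frame above can be rebuilt by hand. This is a minimal sketch, assuming pandas is installed and reusing the variable name df_onehot that appears later in this memo; the values are copied from the table above, not loaded from the real MovieLens files.

```python
import pandas as pd

# Hand-built copy of the one-hot table shown above (illustrative data only).
df_onehot = pd.DataFrame({
    'movie_id': [1, 2, 2, 3, 3, 4, 5, 5, 5],
    'action':   [1, 0, 1, 0, 1, 0, 0, 0, 1],
    'horror':   [0, 0, 0, 0, 0, 1, 0, 1, 0],
    'romance':  [0, 1, 0, 0, 0, 0, 0, 0, 0],
    'sf':       [0, 0, 0, 1, 0, 0, 1, 0, 0],
})

print(df_onehot)
```

Each row carries exactly one flag, so a movie with several genres shows up as several rows with the same movie_id.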
I want to convert it back to the categorical form it had before one-hot encoding, as shown below:
   movie_id    genre
0         1   action
1         2  romance
2         2   action
3         3       sf
4         3   action
5         4   horror
6         5       sf
7         5   horror
8         5   action
To do this, define the following function:
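import pandas as pd

def convert_onehot_to_category(df, id_col, one_hot_columns, category_col='category'):
    frames = []
    for col in one_hot_columns:
        # Keep only the rows where this one-hot column is 1 or more
        df_each = df.loc[df[col] >= 1, [id_col, col]].copy()
        # Replace the flag with the column name, i.e. the category value
        df_each[col] = col
        df_each.columns = [id_col, category_col]
        frames.append(df_each)
    # Stack the per-category rows (collecting them in a list avoids
    # concatenating onto an empty DataFrame)
    df_concat = pd.concat(frames, axis=0)
    # Drop duplicates, then sort by id; the stable sort keeps the category
    # order within each id, and the index keeps its pre-sort labels
    df_concat = df_concat.drop_duplicates().reset_index(drop=True).sort_values(by=id_col, kind='stable')
    return df_concat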
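Call the function with the following arguments:
- the names of the one-hot encoded columns (one_hot_columns)
- the name of the column containing the id (id_col)
- the name to give the column of converted category values (category_col)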
genres = ['action', 'romance', 'sf', 'horror']
id_col = 'movie_id'
category_col = 'genre'
df_category = convert_onehot_to_category(df_onehot, id_col=id_col, one_hot_columns=genres, category_col=category_col)
print(df_category)
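This converts the data back to the original category values. In the output below, the genres within each movie_id follow the order of the genres list, and the index is not renumbered after sorting: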
   movie_id    genre
0         1   action
1         2   action
4         2  romance
2         3   action
5         3       sf
7         4   horror
3         5   action
6         5       sf
8         5   horror
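As an aside, the same reshaping can probably be done without an explicit loop by using DataFrame.melt. This is an alternative sketch, not the function from this memo: the helper name onehot_to_category_melt and the temporary column name '_flag' are made up for illustration, and the sample frame is rebuilt by hand from the table at the top.

```python
import pandas as pd

def onehot_to_category_melt(df, id_col, one_hot_columns, category_col='category'):
    # Hypothetical helper: unpivot the one-hot columns into long form,
    # one row per (id, column name, flag value).
    long_df = df.melt(id_vars=id_col, value_vars=one_hot_columns,
                      var_name=category_col, value_name='_flag')
    # Keep only the rows whose flag is set, then drop the flag itself.
    kept = long_df.loc[long_df['_flag'] >= 1, [id_col, category_col]]
    # Stable sort so categories keep the order of one_hot_columns within each id.
    return (kept.drop_duplicates()
                .sort_values(by=id_col, kind='stable')
                .reset_index(drop=True))

# Hand-built copy of the sample table (illustrative data only).
df_onehot = pd.DataFrame({
    'movie_id': [1, 2, 2, 3, 3, 4, 5, 5, 5],
    'action':   [1, 0, 1, 0, 1, 0, 0, 0, 1],
    'horror':   [0, 0, 0, 0, 0, 1, 0, 1, 0],
    'romance':  [0, 1, 0, 0, 0, 0, 0, 0, 0],
    'sf':       [0, 0, 0, 1, 0, 0, 1, 0, 0],
})

genres = ['action', 'romance', 'sf', 'horror']
df_melted = onehot_to_category_melt(df_onehot, 'movie_id', genres, category_col='genre')
print(df_melted)
```

Unlike the function above, this version resets the index after sorting, so the rows come out numbered from 0 in display order.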