When working with pandas data frames, Japanese column names can be inconvenient. Converting them by hand is tedious, so this post uses googletrans to automate the translation.
googletrans is a Python library for using the Google Translate API. On Google Colaboratory, you can install it with the following command.
!pip install googletrans==4.0.0-rc1
Note: As of January 12, installing without specifying a version installs googletrans 3.0.0, which does not work well. Reference: https://qiita.com/_yushuu/items/83c51e29771530646659
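Because the behavior depends on the installed release, it can help to confirm which version is actually present before running the rest of the code. The following is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of googletrans.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if googletrans is not installed.
print(installed_version("googletrans"))
```

If this prints a 3.x version, reinstalling with the pinned version above is the likely fix.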
from googletrans import Translator
# A single Translator instance can be reused for every call
translator = Translator()
text = 'こんにちは'
print(translator.translate(text, dest='en').text)
Output result
Hello
The default value of dest is English, so dest='en' can be omitted. As an aside, you can translate into other languages by changing dest.
print(translator.translate(text, dest='fr').text)
Output result
Bonjour
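The object returned by translate carries more than the translated text. As a sketch, assuming the result exposes the src, dest, origin, and text attributes (as the googletrans Translated object does), a small helper can summarize one translation. The helper name describe is hypothetical, and a SimpleNamespace stands in for a real result so the snippet runs without network access.

```python
from types import SimpleNamespace


def describe(result):
    """Format a translation result as 'origin (src) -> text (dest)'."""
    return f"{result.origin} ({result.src}) -> {result.text} ({result.dest})"


# Stand-in for the object returned by translator.translate(...)
fake_result = SimpleNamespace(origin="こんにちは", src="ja", text="Hello", dest="en")
print(describe(fake_result))  # こんにちは (ja) -> Hello (en)
```

With a real translator, passing the return value of translator.translate(...) to describe should show which source language was detected.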
Now for the main topic: converting Japanese column names to English. First, prepare a data frame.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(25).reshape(5, 5),
columns=['Customer ID', 'Store ID', 'Quantity', 'price', 'Store area'])
df.head()
The data frame is ready. If it is a Japanese column, it is troublesome such as an error occurs when training with lightGBM. Let's convert it to English.
eng_columns = {}
columns = df.columns
translator = Translator()
for column in columns:
eng_columns[column] = translator.translate(column).text
print(eng_columns)
Output result
{'Customer ID': 'Customer ID', 'Store ID': 'Store ID', 'Quantity': 'Quantity', 'price': 'price', 'Store area': 'Store area'}
The column names were translated to English successfully. However, they still contain spaces, which are awkward to work with, so the following code replaces the spaces with underscores.
eng_columns = {}
columns = df.columns
translator = Translator()
for column in columns:
    eng_column = translator.translate(column).text
    # Replace spaces so the names are easier to use as identifiers
    eng_column = eng_column.replace(' ', '_')
    eng_columns[column] = eng_column
# Rename the columns in place using the mapping
df.rename(columns=eng_columns, inplace=True)
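Replacing spaces handles the common case, but translated names may also contain other punctuation, and two columns may translate to the same word. The following is a sketch of a stricter cleanup using only the standard library; the function name to_safe_names, the lowercasing, and the numeric-suffix scheme are assumptions for illustration, not part of the approach above.

```python
import re


def to_safe_names(names):
    """Convert names to lowercase snake_case and make duplicates unique."""
    seen = {}
    result = []
    for name in names:
        # Collapse every run of non-alphanumeric characters into one underscore
        safe = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
        count = seen.get(safe, 0)
        seen[safe] = count + 1
        # Append a numeric suffix to the second and later occurrences
        result.append(safe if count == 0 else f"{safe}_{count + 1}")
    return result


print(to_safe_names(["Customer ID", "Store ID", "price", "Price", "Store area (m2)"]))
# ['customer_id', 'store_id', 'price', 'price_2', 'store_area_m2']
```

The cleaned list can then be assigned back with df.columns = to_safe_names(df.columns).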
The columns have now been renamed to English.