There are situations where you want to predict gender from a name. For example, asking for gender on the registration form of a membership service can lower the conversion rate (CVR), so you might drop the field and fill in the value with a prediction instead.
There are several ways to predict gender from a name, such as training a machine-learning classifier yourself or calling an external API. This post takes the second approach and predicts gender from names using Gender API from Python.
Gender API is an overseas service that appears to base its gender predictions on a huge amount of name data. There are several similar services, but this post uses Gender API.
Gender API
First, create an account for Gender API. After creating it, get your API key (API_KEY). The free tier covers up to 500 names.
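As a rough sketch of how the key is used, a request is a GET against `https://gender-api.com/get?` with `name`, `country`, and `key` as query parameters, the same ones used in the script later in this post. Reading the key from an environment variable named `GENDER_API_KEY` is my own assumption for illustration, not something the service requires, and `build_gender_api_url` is a made-up helper name.

```python
import os
from urllib import parse

GENDER_API_BASE_URL = "https://gender-api.com/get?"


def build_gender_api_url(name, api_key=None, country="JP"):
    """Build the request URL for one name without sending anything."""
    # GENDER_API_KEY is a hypothetical variable name chosen for this sketch.
    key = api_key or os.environ.get("GENDER_API_KEY", "")
    if not key:
        raise ValueError("No API key given and GENDER_API_KEY is not set")
    query = parse.urlencode({"name": name, "country": country, "key": key})
    return GENDER_API_BASE_URL + query


if __name__ == "__main__":
    print(build_gender_api_url("iori", api_key="your api_key"))
```

Keeping the key out of the source file makes it harder to publish it by accident along with the script.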
Use Personal Generator to generate pseudo personal information. You can freely choose which items to output; since we also want to check the predictions against the correct answers, this time we export the serial number, full name, name (katakana), and gender. We will predict gender from the names of about 30 people.
Pykakasi
The name to predict is the first name (first_name). Whether the name is passed in kanji, katakana, hiragana, or romaji greatly affects the accuracy. In short, converting the name to romaji gave the most accurate predictions, probably because Gender API is an overseas service. (The verification process is omitted.)
The first names therefore need to be converted to romaji. For how to use pykakasi, refer to the developer's documentation (How to use pykakasi). Install the following packages.
pip install six semidbm
pip install pykakasi
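For reference, newer pykakasi releases (2.x) also provide a `convert()` method that returns one dictionary per text segment, including a `hepburn` key. The sketch below assumes such a release is installed. The helper name `to_romaji` is made up for illustration, and the script later in this post uses the older `setMode` style instead.

```python
def to_romaji(text):
    """Return a Hepburn romaji rendering of a Japanese string."""
    # Imported lazily so this module still loads if pykakasi is missing.
    import pykakasi

    kks = pykakasi.kakasi()
    # convert() splits the text into segments; each segment is a dict
    # with keys such as 'orig', 'hira', 'kana' and 'hepburn'.
    return "".join(segment["hepburn"] for segment in kks.convert(text))


if __name__ == "__main__":
    print(to_romaji("イオリ"))
```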
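Now let's actually predict the gender of the 30 subjects. The general procedure, as implemented in the script below, is: load the generated CSV, take the first name from the katakana name, convert it to romaji with pykakasi, query Gender API for each name, and merge the responses with the original data.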
gender_estimation.py
import json
from urllib import parse, request

import pandas as pd
import pykakasi


class GenderEstimation:
    """
    Predict gender from romaji-converted name
    """
    __GENDER_API_BASE_URL = 'https://gender-api.com/get?'
    __API_KEY = "your api_key"

    def create_estimated_genders_date_frame(self):
        df = pd.DataFrame(self._estimate_gender())
        print('\nCompleted gender prediction for {} people.'.format(len(df)))
        df.columns = [
            'estimated_gender', 'accuracy', 'samples', 'duration'
        ]
        df1 = self._create_member_data_frame()
        # Both frames keep the default 0..n-1 index, so rows line up by position.
        estimated_genders_df = pd.merge(df1, df, left_index=True, right_index=True)
        return estimated_genders_df

    def _estimate_gender(self):
        unique_names = self._convert_first_name_to_romaji()
        genders = []
        print('Predicting the gender of {} people'.format(len(unique_names)))
        # One API call per name; each call counts against the API quota.
        for name in unique_names:
            res = request.urlopen(self._gender_api_endpoint(params={
                'name': name,
                'country': 'JP',
                'key': self.__API_KEY
            }))
            decoded = res.read().decode('utf-8')
            data = json.loads(decoded)
            genders.append(
                [data['gender'], data['accuracy'], data['samples'], data['duration']])
        return genders

    def _gender_api_endpoint(self, params):
        return '{base_url}{param_str}'.format(
            base_url=self.__GENDER_API_BASE_URL, param_str=parse.urlencode(params))

    def _convert_first_name_to_romaji(self):
        df = self._create_member_data_frame()
        df['first_name_roma'] = df['first_name'].apply(
            lambda x: self._set_kakasi(x))
        return df['first_name_roma']

    def _set_kakasi(self, x):
        # Convert hiragana (H), katakana (K) and kanji (J) to ASCII romaji.
        kakasi = pykakasi.kakasi()
        kakasi.setMode('H', 'a')
        kakasi.setMode('K', 'a')
        kakasi.setMode('J', 'a')
        kakasi.setMode('r', 'Hepburn')
        kakasi.setMode('s', False)
        kakasi.setMode('C', False)
        return kakasi.getConverter().do(x)

    def _create_member_data_frame(self):
        df = pd.read_csv('personal_infomation.csv').rename(columns={
            'Serial number': 'row_num',
            'Full name': 'name',
            'Name (Katakana)': 'name_katakana',
            'sex': 'gender'
        })
        # The katakana name is "family given", so the second token is the first name.
        df['first_name'] = df.name_katakana.str.split().str[1]
        print('Extracted {} people to predict.'.format(len(df)))
        return df
The prediction results are returned as a data frame. The columns taken from the Gender API response are defined as follows.
estimated_gender | accuracy | samples | duration |
---|---|---|---|
Predicted gender | Reliability of the prediction | Number of samples used for the prediction | Time taken to process one call |
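To make the mapping from one response to one data frame row concrete, here is a small sketch that pulls these four fields out of a response body. The JSON string in the example is hand-written for illustration and is not real API output, and `parse_gender_response` is a hypothetical helper name.

```python
import json

RESPONSE_FIELDS = ("gender", "accuracy", "samples", "duration")


def parse_gender_response(body):
    """Extract the four fields used in this post from a JSON response body."""
    data = json.loads(body)
    # .get() yields None for a missing field instead of raising KeyError,
    # for example when the service returns an error payload.
    return {field: data.get(field) for field in RESPONSE_FIELDS}


if __name__ == "__main__":
    # Made-up payload in the shape the script expects; values are illustrative.
    sample = ('{"name": "iori", "gender": "male", "accuracy": 60, '
              '"samples": 10, "duration": "20ms"}')
    print(parse_gender_response(sample))
```

Unlike the direct `data['gender']` lookups in the script, this version does not stop the whole run when a single response lacks a field.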
Finally, let's examine the accuracy of the gender predictions. The first table below counts each combination of correct answer and forecast, and the second arranges those counts as a matrix. The correct answer rate was close to 100% (29 of 30). Only one case was wrong: a person who is actually a woman was predicted to be a man. It does seem difficult to predict names such as "Iori" that are used by both men and women.
Correct answer | Forecast | Count |
---|---|---|
male | male | 11 |
male | female | 0 |
male | unknown | 0 |
female | male | 1 |
female | female | 18 |
female | unknown | 0 |
unknown | male | 0 |
unknown | female | 0 |
unknown | unknown | 0 |
Correct answer (rows) / Forecast (columns) | male | female | unknown | Correct answer rate |
---|---|---|---|---|
male | 11 | 0 | 0 | 100.00% |
female | 1 | 18 | 0 | 94.74% |
unknown | 0 | 0 | 0 | N/A |
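The matrix and the per-class rates above can be reproduced from the (correct answer, forecast) counts with a few lines of standard-library Python. This is only a sketch of the tallying step: the counts are copied from the first table, and the function names are my own.

```python
from collections import Counter

LABELS = ("male", "female", "unknown")

# (correct answer, forecast) -> count, copied from the table in this post.
COUNTS = Counter({
    ("male", "male"): 11,
    ("female", "male"): 1,
    ("female", "female"): 18,
})


def per_class_rate(counts, label):
    """Share of people whose correct answer is `label` that were predicted correctly."""
    total = sum(counts[(label, predicted)] for predicted in LABELS)
    if total == 0:
        return None  # nobody has this correct answer, so the rate is undefined
    return counts[(label, label)] / total


def overall_accuracy(counts):
    """Share of all people whose forecast matches the correct answer."""
    total = sum(counts.values())
    correct = sum(counts[(label, label)] for label in LABELS)
    return correct / total if total else None


if __name__ == "__main__":
    for label in LABELS:
        rate = per_class_rate(COUNTS, label)
        print(label, "n/a" if rate is None else f"{rate:.2%}")
    print("overall", f"{overall_accuracy(COUNTS):.2%}")
```

A `Counter` returns 0 for combinations that never occurred, so the empty cells of the matrix do not need to be listed explicitly.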