Predict gender from name using Gender API and Pykakasi in Python

Introduction

There are situations where you want to predict gender from a name. For example, asking for gender on the registration form of a membership service can lower the conversion rate (CVR), so you might drop the field and make up for it with a prediction instead.

There are several ways to predict gender from a name, such as training a classifier with machine learning or calling an external API. This article takes the latter approach and predicts gender from names using Gender API in Python.

Gender API is, as far as I can tell, an American company whose service predicts gender from a huge amount of name data. There are several similar services, but this time we will use Gender API.

Preparation

Gender API

First, create an account for Gender API and get your API_KEY. The free plan lets you look up up to 500 names.

Pseudo personal information acquisition

Use Personal Generator to generate pseudo personal information. You can freely select the items to be output, but since we also want to check the predictions against the correct answers, this time we get the serial number, full name, name (katakana), and gender. I will try to predict the gender from the names of about 30 people.

Pykakasi

The value to be predicted is the first name (first_name), and whether it is passed in kanji, katakana, hiragana, or romaji greatly affects the accuracy. To give the conclusion first: probably because Gender API is an overseas service, converting the name to romaji before predicting was the most accurate. (The verification process is omitted.)

Therefore, the names need to be converted to romaji, and pykakasi is used for that. For how to use it, refer to the developer's documentation ("How to use pykakasi"). Install the following packages.

pip install six semidbm
pip install pykakasi

Gender prediction

Gender prediction with Python

We will actually predict the gender of the 30 subjects. The general procedure is as follows.

Load the target people into a dataframe, split the name on the full-width (double-byte) space, and generate a first-name column
Create a Pykakasi instance, set it to convert to romaji, and convert each first name into a romaji string
Pass the romaji names to the Gender API and get the prediction results
Merge the prediction results with the original dataframe
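
As a small illustration of the first step, here is a minimal sketch of splitting a katakana full name on the full-width space. The sample name is made up, and the variable names are only for illustration. It relies on the fact that `str.split()` with no argument splits on any Unicode whitespace, which includes the full-width space (U+3000).

```python
# Made-up sample value; the separator is a full-width space (U+3000).
full_name_katakana = "ヤマダ\u3000イオリ"

# With no argument, str.split() splits on any Unicode whitespace,
# so the full-width space needs no special handling.
parts = full_name_katakana.split()

# Japanese order is family name first, so the first name is the second part.
family_name, first_name = parts[0], parts[1]
print(family_name, first_name)
```

The script in this article does the same thing on a whole column with pandas, taking the second element of the split result.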

`gender_estimation.py`


import json
from urllib import request, parse

import pandas as pd
import pykakasi


class GenderEstimation:
    """
    Predict gender from romaji-converted name
    """
    __GENDER_API_BASE_URL = 'https://gender-api.com/get?'
    __API_KEY = "your api_key"

    def create_estimated_genders_date_frame(self):
        df = pd.DataFrame(self._estimate_gender())
        print('\nCompleted gender prediction for {} people.'.format(len(df)))
        df.columns = [
            'estimated_gender', 'accuracy', 'samples', 'duration'
        ]
        df1 = self._create_member_data_frame()
        estimated_genders_df = pd.merge(df1, df, left_index=True, right_index=True)
        
        return estimated_genders_df
    
    def _estimate_gender(self):
        unique_names = self._convert_first_name_to_romaji()
        genders = []
        print('Predicting the gender of {} people.'.format(len(unique_names)))
        for name in unique_names:
            res = request.urlopen(self._gender_api_endpoint(params={
                'name': name,
                'country': 'JP',
                'key': self.__API_KEY
            }))
            decoded = res.read().decode('utf-8')
            data = json.loads(decoded)
            genders.append(
                [data['gender'], data['accuracy'], data['samples'], data['duration']])
            
        return genders
    
    def _gender_api_endpoint(self, params):
        return '{base_url}{param_str}'.format(
            base_url=self.__GENDER_API_BASE_URL, param_str=parse.urlencode(params))
    
    def _convert_first_name_to_romaji(self):
        df = self._create_member_data_frame()
        df['first_name_roma'] = df['first_name'].apply(
            lambda x: self._set_kakasi(x))
        
        return df['first_name_roma']
    
    def _set_kakasi(self, x):
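        kakasi = pykakasi.kakasi()
        kakasi.setMode('H', 'a')  # hiragana -> ASCII (romaji)
        kakasi.setMode('K', 'a')  # katakana -> ASCII (romaji)
        kakasi.setMode('J', 'a')  # kanji -> ASCII (romaji)
        kakasi.setMode('r', 'Hepburn')  # use Hepburn romanization
        kakasi.setMode('s', False)  # do not insert separators
        kakasi.setMode('C', False)  # do not capitalize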
        
        return kakasi.getConverter().do(x)

    def _create_member_data_frame(self):
        df = pd.read_csv('personal_infomation.csv').rename(columns={
            'Serial number':'row_num',
            'Full name':'name',
            'Name (Katakana)':'name_katakana',
            'sex':'gender'
        })
        # The katakana name is "family-name first-name", so take the second part.
        df['first_name'] = df.name_katakana.str.split().str[1]
        print("Extracted {} people to predict.".format(len(df)))
        return df
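
One thing to keep in mind is that the script above sends one request per row, and the free plan is limited to 500 names. The following is a minimal sketch of building the request URLs only once per distinct name. `build_request_urls` is a hypothetical helper that is not part of the script, the sample names are made up, and the deduplication itself is my assumption about how to save quota, not something the API requires.

```python
from urllib import parse

# Same endpoint as the script above.
BASE_URL = "https://gender-api.com/get?"


def build_request_urls(names, api_key, country="JP"):
    """Return a {name: url} dict with one entry per distinct name."""
    urls = {}
    for name in names:
        if name in urls:
            # Already seen: no second request is needed for this name.
            continue
        query = parse.urlencode({"name": name, "country": country, "key": api_key})
        urls[name] = BASE_URL + query
    return urls


# "iori" appears twice but only produces one URL.
urls = build_request_urls(["iori", "tarou", "iori"], api_key="YOUR_API_KEY")
print(len(urls))
print(urls["iori"])
```

Each result could then be looked up by name and mapped back onto every row that shares it.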

Gender prediction results

The data frame of the prediction results adds the following columns, taken from the Gender API response, to the original data.

estimated_gender	accuracy	samples	duration
Predicted gender	Reliability of the prediction	Number of samples used for the prediction	Time taken to process one call
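
To make the mapping concrete, here is a minimal sketch of turning one response body into a row with those four fields. The JSON below is a made-up example, not real API output, and `to_row` is a hypothetical helper. Unlike the script, it uses `.get()` so that a response without the expected keys yields `None` values instead of raising `KeyError`.

```python
import json

# Made-up response body for illustration only; real values will differ.
sample_body = (
    '{"name": "iori", "gender": "female", '
    '"accuracy": 60, "samples": 120, "duration": "25ms"}'
)

# The four fields the script keeps from each response.
FIELDS = ("gender", "accuracy", "samples", "duration")


def to_row(body):
    """Parse a JSON response body into a list ordered like FIELDS."""
    data = json.loads(body)
    # dict.get() returns None for a missing key instead of raising KeyError.
    return [data.get(field) for field in FIELDS]


row = to_row(sample_body)
print(row)
```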

Gender prediction accuracy verification

Finally, let's examine the accuracy of the gender prediction results. The tables below count each combination of correct answer and prediction, first as a list and then as a matrix. The overall correct answer rate was 29 out of 30 (about 96.7%). Only one case was mispredicted: a person who is actually female was predicted to be male. After all, names such as "Iori" that are used by both men and women seem difficult to predict.

Correct answer	Forecast	num
male	male	11
male	female	0
male	unknown	0
female	male	1
female	female	18
female	unknown	0
unknown	male	0
unknown	female	0
unknown	unknown	0

Correct answer \ Forecast	male	female	Correct answer rate
male	11	0	100.00%
female	1	18	94.74%
unknown	0	0	0%
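
The rates in the table can be checked with a few lines of Python. This is a minimal sketch that copies the non-zero rows of the counts above into a dict (the all-zero "unknown" combinations are left out), and `correct_rate` is a hypothetical helper name.

```python
# (correct answer, forecast) -> count, copied from the table above.
counts = {
    ("male", "male"): 11,
    ("male", "female"): 0,
    ("female", "male"): 1,
    ("female", "female"): 18,
}

total = sum(counts.values())
correct = sum(n for (actual, predicted), n in counts.items() if actual == predicted)
accuracy = correct / total


def correct_rate(label):
    """Share of people whose correct answer is `label` that were predicted correctly."""
    row_total = sum(n for (actual, _), n in counts.items() if actual == label)
    if row_total == 0:
        # No people with this correct answer, so the rate is undefined.
        return None
    return counts[(label, label)] / row_total


print(f"overall: {accuracy:.2%}")
for label in ("male", "female"):
    print(f"{label}: {correct_rate(label):.2%}")
```

With a dataframe at hand, the same matrix could also be produced with a cross-tabulation of the correct and predicted columns.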