This is the record of the 93rd exercise, "Calculation of the accuracy rate of analogy tasks," from Language Processing 100 Knocks 2015. It is an easy win: all it requires is computing the accuracy rate over the results of the previous knock. My self-made program comes out at about 25%, while the Gensim-based result is 58%, a surprisingly large gap (though I have some doubts about how the accuracy rate should be calculated).
Link | Remarks |
---|---|
093.Calculation of accuracy rate of analogy task.ipynb | GitHub link to the answer program |
100 amateur language processing knocks: 93 | The blog I always rely on when working through the 100 knocks |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running on a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes work with multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv; there is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In the environment above, I use the following additional Python package. Just install it with regular pip.
type | version |
---|---|
pandas | 0.25.3 |
In Chapter 10, we will continue to study word vectors from the previous chapter.
> Use the data created in 92 to find the accuracy rate of the analogy task for each model.
```python
import pandas as pd

def calc_accuracy(file):
    # Columns 3-5 of the knock-92 output: expected word, predicted word, cosine similarity
    df = pd.read_table(file, header=None, usecols=[3, 4, 5],
                       names=['word4', 'result', 'similarity'])
    print(df.info())
    print('Total records:', len(df))
    # similarity is -1 when a word in the question was not found in the corpus
    print('Available records:', (df['similarity'] != -1).sum())
    print('Correct records:', (df['word4'] == df['result']).sum())
    print('Accuracy', (df['word4'] == df['result']).sum() / (df['similarity'] != -1).sum())

calc_accuracy('092.analogy_word2vec_1.txt')
calc_accuracy('092.analogy_word2vec_2.txt')
```
There is probably a smarter way to write this, but I deliberately didn't spend much time on it. The program reads the file and calculates the accuracy rate. I wasn't sure what to use as the denominator: in the end, when a word could not be found in the corpus, I excluded that record from the denominator (the sketch after the excerpt below shows the alternative convention).
```python
df = pd.read_table(file, header=None, usecols=[3, 4, 5],
                   names=['word4', 'result', 'similarity'])
print(df.info())
print('Total records:', len(df))
print('Available records:', (df['similarity'] != -1).sum())
print('Correct records:', (df['word4'] == df['result']).sum())
print('Accuracy', (df['word4'] == df['result']).sum() / (df['similarity'] != -1).sum())
```
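Since I was unsure about the denominator, here is a minimal sketch of both conventions side by side. This is my own addition, not part of the original notebook, and `calc_accuracy_both` is a hypothetical name:

```python
import pandas as pd

def calc_accuracy_both(file):
    # Same file layout as calc_accuracy above (columns 3-5 of the knock-92 output)
    df = pd.read_table(file, header=None, usecols=[3, 4, 5],
                       names=['word4', 'result', 'similarity'])
    correct = (df['word4'] == df['result']).sum()
    # Convention 1 (used above): drop records whose words were missing (similarity == -1)
    available = (df['similarity'] != -1).sum()
    print('Accuracy (available only):', correct / available)
    # Convention 2: keep every record, i.e. missing words count as wrong answers
    print('Accuracy (all records):', correct / len(df))
```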
This is the result of my own program.
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 3 columns):
word4         506 non-null object
result        462 non-null object
similarity    504 non-null float64
dtypes: float64(1), object(2)
memory usage: 12.0+ KB
None
Total records: 506
Available records: 462
Correct records: 114
Accuracy 0.24675324675324675
```
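As a sanity check on the arithmetic: 114 / 462 ≈ 0.2468 matches the printed accuracy; if the 44 unavailable records were instead counted as wrong, it would drop to 114 / 506 ≈ 0.2253.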
This is the result when using Gensim. What bothers me is that the number of "Available records" has dropped. Gensim's Word2Vec apparently has logic that excludes low-frequency words from the vocabulary; a sketch of that behavior follows these results.
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 3 columns):
word4         506 non-null object
result        400 non-null object
similarity    506 non-null float64
dtypes: float64(1), object(2)
memory usage: 12.0+ KB
None
Total records: 506
Available records: 400
Correct records: 231
Accuracy 0.5775
```
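Why do records become unavailable at all? Gensim's Word2Vec prunes words that appear fewer than `min_count` times (default 5) from the vocabulary, so analogy questions involving such words cannot be answered. Below is a minimal sketch of that behavior on a toy corpus of my own; the API shown is gensim 3.x (in gensim 4.x, `size` becomes `vector_size` and the membership test is `'rare' in model.wv.key_to_index`):

```python
from gensim.models import Word2Vec

# Toy corpus: 'rare' appears once, every other word appears three times
sentences = [['king', 'queen', 'man', 'woman'],
             ['king', 'queen', 'man', 'woman'],
             ['rare', 'king', 'queen', 'man', 'woman']]

# min_count=2 drops any word seen fewer than 2 times (gensim's default is 5)
model = Word2Vec(sentences, size=10, min_count=2, seed=1)

print('rare' in model.wv.vocab)  # False: pruned, so analogies using it fail
print('king' in model.wv.vocab)  # True
```

By the same arithmetic as before, 231 / 400 = 0.5775 would fall to 231 / 506 ≈ 0.4565 if the 106 skipped records counted as wrong, which is exactly my doubt about how the rate should be calculated. As far as I know, gensim also ships a built-in evaluator, `KeyedVectors.evaluate_word_analogies()` (gensim >= 3.4), which by default skips questions containing out-of-vocabulary words and counts them as errors only when `dummy4unknown=True` is passed.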