This is the record of the 93rd exercise, "Calculation of the accuracy rate of analogy tasks," from Language Processing 100 Knocks 2015. It is an easy win: all it requires is computing the accuracy rate over the results of the previous knock. My self-made program comes out at about 25%, while the Gensim-based result is 58%, a surprisingly large gap (though I have some doubts about how the accuracy rate should be calculated).
Link | Remarks |
---|---|
093.Calculation of accuracy rate of analogy task.ipynb | GitHub link to the answer program |
100 amateur language processing knocks: 93 | The blog I always rely on when working through the 100 knocks |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running on a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes work with multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv; there is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In the environment above, I use the following additional Python package. Just install it with regular pip.
type | version |
---|---|
pandas | 0.25.3 |
In Chapter 10, we will continue to study word vectors from the previous chapter.
> Use the data created in 92 to find the accuracy rate of the analogy task for each model.
```python
import pandas as pd

def calc_accuracy(file):
    # Columns 3-5 of the knock-92 output: expected word, predicted word, cosine similarity
    df = pd.read_table(file, header=None, usecols=[3, 4, 5],
                       names=['word4', 'result', 'similarity'])
    print(df.info())
    print('Total records:', len(df))
    # similarity is -1 when a word in the question was not found in the corpus
    print('Available records:', (df['similarity'] != -1).sum())
    print('Correct records:', (df['word4'] == df['result']).sum())
    print('Accuracy', (df['word4'] == df['result']).sum() / (df['similarity'] != -1).sum())

calc_accuracy('092.analogy_word2vec_1.txt')
calc_accuracy('092.analogy_word2vec_2.txt')
```
There is probably a smarter way to write this, but I deliberately didn't spend much time on it. The program reads the file and calculates the accuracy rate. I wasn't sure what to use as the denominator: in the end, when a word could not be found in the corpus, I excluded that record from the denominator (the sketch after the excerpt below shows the alternative convention).
```python
df = pd.read_table(file, header=None, usecols=[3, 4, 5],
                   names=['word4', 'result', 'similarity'])
print(df.info())
print('Total records:', len(df))
print('Available records:', (df['similarity'] != -1).sum())
print('Correct records:', (df['word4'] == df['result']).sum())
print('Accuracy', (df['word4'] == df['result']).sum() / (df['similarity'] != -1).sum())
```
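Since I was unsure about the denominator, here is a minimal sketch of both conventions side by side. This is my own addition, not part of the original notebook, and `calc_accuracy_both` is a hypothetical name:

```python
import pandas as pd

def calc_accuracy_both(file):
    # Same file layout as calc_accuracy above (columns 3-5 of the knock-92 output)
    df = pd.read_table(file, header=None, usecols=[3, 4, 5],
                       names=['word4', 'result', 'similarity'])
    correct = (df['word4'] == df['result']).sum()
    # Convention 1 (used above): drop records whose words were missing (similarity == -1)
    available = (df['similarity'] != -1).sum()
    print('Accuracy (available only):', correct / available)
    # Convention 2: keep every record, i.e. missing words count as wrong answers
    print('Accuracy (all records):', correct / len(df))
```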
This is the result of my own program.
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 3 columns):
word4         506 non-null object
result        462 non-null object
similarity    504 non-null float64
dtypes: float64(1), object(2)
memory usage: 12.0+ KB
None
Total records: 506
Available records: 462
Correct records: 114
Accuracy 0.24675324675324675
```
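As a sanity check on the arithmetic: 114 / 462 ≈ 0.2468 matches the printed accuracy; if the 44 unavailable records were instead counted as wrong, it would drop to 114 / 506 ≈ 0.2253.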
This is the result when using Gensim. What bothers me is that the number of "Available records" has dropped. Gensim's Word2Vec apparently has logic that excludes low-frequency words from the vocabulary; a sketch of that behavior follows these results.
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 3 columns):
word4         506 non-null object
result        400 non-null object
similarity    506 non-null float64
dtypes: float64(1), object(2)
memory usage: 12.0+ KB
None
Total records: 506
Available records: 400
Correct records: 231
Accuracy 0.5775
```
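Why do records become unavailable at all? Gensim's Word2Vec prunes words that appear fewer than `min_count` times (default 5) from the vocabulary, so analogy questions involving such words cannot be answered. Below is a minimal sketch of that behavior on a toy corpus of my own; the API shown is gensim 3.x (in gensim 4.x, `size` becomes `vector_size` and the membership test is `'rare' in model.wv.key_to_index`):

```python
from gensim.models import Word2Vec

# Toy corpus: 'rare' appears once, every other word appears three times
sentences = [['king', 'queen', 'man', 'woman'],
             ['king', 'queen', 'man', 'woman'],
             ['rare', 'king', 'queen', 'man', 'woman']]

# min_count=2 drops any word seen fewer than 2 times (gensim's default is 5)
model = Word2Vec(sentences, size=10, min_count=2, seed=1)

print('rare' in model.wv.vocab)  # False: pruned, so analogies using it fail
print('king' in model.wv.vocab)  # True
```

By the same arithmetic as before, 231 / 400 = 0.5775 would fall to 231 / 506 ≈ 0.4565 if the 106 skipped records counted as wrong, which is exactly my doubt about how the rate should be calculated. As far as I know, gensim also ships a built-in evaluator, `KeyedVectors.evaluate_word_analogies()` (gensim >= 3.4), which by default skips questions containing out-of-vocabulary words and counts them as errors only when `dummy4unknown=True` is passed.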