Language Processing 100 Knocks, 92 (using Gensim): Application to analogy data

This is a record of problem 92, "Application to analogy data", from Language Processing 100 Knocks 2015. The word-vector arithmetic and the extraction of similar words are done in two ways: with the NumPy-format word vectors built by hand in Chapter 9, and with Gensim. The comparison shows how good Gensim is, for example in computation speed.

Reference link

| Link | Remarks |
|:--|:--|
| 092.Application to analogy data_1.ipynb | Answer program GitHub link (self-made) |
| 092.Application to analogy data_2.ipynb | Answer program GitHub link (Gensim version) |
| 100 amateur language processing knocks: 92 | A site I have relied on constantly while working through the 100 knocks |


Environment

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu18.04.01 LTS | Running as a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |

In the above environment, I use the following additional Python packages. They can be installed with plain pip.

| Type | Version |
|:--|:--|
| gensim | 3.8.1 |
| numpy | 1.17.4 |
| pandas | 0.25.3 |


Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

92. Application to analogy data

For each case in the evaluation data created in 91, compute vec(word in the second column) - vec(word in the first column) + vec(word in the third column), and find the word whose vector is most similar to the result, together with its similarity. Append the word and the similarity to the end of each case. Apply this program to both the word vectors created in 85 and the word vectors created in 90.
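To make the arithmetic concrete, here is a minimal sketch on a hand-made toy vocabulary. The four 2-dimensional vectors and the helper name `analogy` are assumptions made up for illustration; they are not the vectors from knock 85 or 90. The three query words are excluded from the candidates, as the self-made program in this article does.

```python
import numpy as np

# Toy 2-D vectors (invented for illustration): axis 0 ~ "family role", axis 1 ~ "gender"
vocab = ['boy', 'girl', 'brother', 'sister']
vectors = np.array([
    [1.0, 1.0],    # boy
    [1.0, -1.0],   # girl
    [2.0, 1.0],    # brother
    [2.0, -1.0],   # sister
])
word_to_row = {word: row for row, word in enumerate(vocab)}


def analogy(first, second, third):
    """Return the word closest to vec(second) - vec(first) + vec(third) and its cosine similarity."""
    target = (vectors[word_to_row[second]]
              - vectors[word_to_row[first]]
              + vectors[word_to_row[third]])
    # Cosine similarity of every row against the target vector
    sims = vectors @ target / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(target))
    # Exclude the three words used in the calculation
    for word in (first, second, third):
        sims[word_to_row[word]] = -1
    best = int(np.argmax(sims))
    return vocab[best], float(sims[best])


word, sim = analogy('boy', 'girl', 'brother')
print(word, sim)
```

With these toy values the target vector equals the `sister` vector, so `sister` comes back with a similarity of (approximately) 1.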


Self-made answer program [092.Application to analogy data_1.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/092.%E3%82%A2%E3%83%8A%E3%83%AD%E3%82%B8%E3%83%BC%E3%83%87%E3%83%BC%E3%82%BF%E3%81%B8%E3%81%AE%E9%81%A9%E7%94%A8_1.ipynb)

import csv

import numpy as np
import pandas as pd

# No key was specified when saving, so the array is stored under 'arr_0'
matrix_x300 = np.load('./../09.Vector space method(I)/085.matrix_x300.npz')['arr_0']

print('matrix_x300 Shape:', matrix_x300.shape)

group_t = pd.read_pickle('./../09.Vector space method(I)/083_group_t.zip')

# Cosine similarity calculation
def get_cos_similarity(v1, v1_norm, v2):
    # If the vector is all zeros, return -1
    if np.count_nonzero(v2) == 0:
        return -1
    return np.dot(v1, v2) / (v1_norm * np.linalg.norm(v2))

# Get the most similar word
def get_similar_word(cols):
    try:
        vec = matrix_x300[group_t.index.get_loc(cols[1])] \
              - matrix_x300[group_t.index.get_loc(cols[0])] \
              + matrix_x300[group_t.index.get_loc(cols[2])]
        vec_norm = np.linalg.norm(vec)
        # Exclude the 3 words used in the calculation
        cos_sim = [-1 if group_t.index[i] in cols[:3] else get_cos_similarity(vec, vec_norm, matrix_x300[i]) for i in range(len(group_t))]
        index = np.argmax(cos_sim)
        cols.extend([group_t.index[index], cos_sim[index]])
    # For words not in the corpus
    except KeyError:
        cols.extend(['', -1])
    return cols

# Read the evaluation data
with open('./091.analogy_family.txt') as file_in:
    result = [get_similar_word(line.split()) for line in file_in]

# Write the results as tab-separated values
with open('092.analogy_word2vec_1.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)

Answer commentary

This is where the similar word is retrieved. The problem statement does not say so, but I exclude the three words used in the calculation. I am not sure whether this is the right thing to do, but excluding them raises the percentage of correct answers.

cos_sim = [-1 if group_t.index[i] in cols[:3] else get_cos_similarity(vec, vec_norm, matrix_x300[i]) for i in range(len(group_t))]

Words that are not in the corpus are given a similarity of -1.

except KeyError:
    cols.extend(['', -1])

The rest repeats what earlier knocks already covered, and there is nothing special in the code, so I will not explain it further. A run takes a good 17 minutes, so I tried to write as much as possible as list comprehensions. The first 10 lines of the output file look like this. Some predictions match the expected word and some do not.


boy	girl	brother	sister	son	0.8804225566858075
boy	girl	brothers	sisters	sisters	0.8426790631091488
boy	girl	dad	mom	mum	0.8922065515297802
boy	girl	father	mother	mother	0.847494164274725
boy	girl	grandfather	grandmother	grandmother	0.820584129035444
boy	girl	grandpa	grandma		-1
boy	girl	grandson	granddaughter	grandfather	0.6794604718339272
boy	girl	groom	bride	seduce	0.5951703092628703
boy	girl	he	she	she	0.8144501058726975
boy	girl	his	her	Mihailov	0.5752869854520882
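Much of the runtime of the self-made version probably goes into calling the cosine function once per vocabulary word from Python. One possible alternative, sketched here under assumptions, is to compute every similarity with a single matrix-vector product. The function name `most_similar_vectorized` and the small stand-in matrix are hypothetical, and this sketch has not been timed against the real `matrix_x300`.

```python
import numpy as np


def most_similar_vectorized(matrix, query, exclude_rows):
    """Return (row index, cosine similarity) of the row most similar to query.

    All-zero rows and the rows listed in exclude_rows get a similarity of -1,
    in the same spirit as the loop-based version.
    """
    row_norms = np.linalg.norm(matrix, axis=1)
    query_norm = np.linalg.norm(query)
    sims = np.full(len(matrix), -1.0)
    valid = row_norms > 0
    if query_norm > 0:
        sims[valid] = matrix[valid] @ query / (row_norms[valid] * query_norm)
    # Exclude the words used in the calculation
    sims[list(exclude_rows)] = -1.0
    best = int(np.argmax(sims))
    return best, float(sims[best])


# Stand-in data: 5 "words" with 3-dimensional vectors (made-up values)
matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],   # all-zero row, never selected
    [1.0, 1.0, 0.0],
    [2.0, 2.1, 0.0],
])
query = matrix[1] - matrix[0] + matrix[3]
best, sim = most_similar_vectorized(matrix, query, exclude_rows=[0, 1, 3])
print(best, sim)
```

In the real program, the row indices to exclude would come from `group_t.index.get_loc` for the three words of each case.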

Gensim version answer program [092.Application to analogy data_2.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/092.%E3%82%A2%E3%83%8A%E3%83%AD%E3%82%B8%E3%83%BC%E3%83%87%E3%83%BC%E3%82%BF%E3%81%B8%E3%81%AE%E9%81%A9%E7%94%A8_2.ipynb)

import csv

from gensim.models import Word2Vec

model = Word2Vec.load('./090.word2vec.model')


# Get the most similar word
def get_similar_word(cols):
    try:
        cos_sim = model.wv.most_similar(positive=[cols[1], cols[2]], negative=[cols[0]], topn=4)
        for word, similarity in cos_sim:
            # Exclude the 3 words used in the calculation
            if word not in cols[:3]:
                cols.extend([word, similarity])
                # Keep only the best remaining candidate
                break
    # For words not in the original corpus
    except KeyError:
        cols.extend(['', -1])
    return cols

# Read the evaluation data
with open('./091.analogy_family.txt') as file_in:
    result = [get_similar_word(line.split()) for line in file_in]

# Write the results as tab-separated values
with open('./092.analogy_word2vec_2.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)

Answer commentary

Because it relies on a package, this version is a little slimmer than the self-made one. And, as you will see when you run it, it is fast! It takes about 4 seconds, which is **more than 200 times faster than the self-made version**. Gensim is amazing. This is the output. **The percentage of correct answers is also higher.**


boy	girl	brother	sister	sister	0.745887041091919
boy	girl	brothers	sisters	sisters	0.8522343039512634
boy	girl	dad	mom	mum	0.7720432281494141
boy	girl	father	mother	mother	0.8608728647232056
boy	girl	grandfather	grandmother	granddaughter	0.8341050148010254
boy	girl	grandpa	grandma		-1
boy	girl	grandson	granddaughter	granddaughter	0.8497666120529175
boy	girl	groom	bride	bride	0.7476662397384644
boy	girl	he	she	she	0.7702984809875488
boy	girl	his	her	her	0.6540039777755737
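The difference in correct answers can be checked by comparing the fourth column (expected word) with the fifth column (predicted word). The sketch below does this only for the ten sample rows shown in this article; the function name `accuracy` is an assumption, and the full output files would give different figures.

```python
def accuracy(expected, predicted):
    """Fraction of cases where the predicted word equals the expected word."""
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected)


# Fourth column of the first 10 cases (expected words)
expected = ['sister', 'sisters', 'mom', 'mother', 'grandmother',
            'grandma', 'granddaughter', 'bride', 'she', 'her']
# Fifth column from the self-made output ('' means the case had a word not in the corpus)
selfmade_pred = ['son', 'sisters', 'mum', 'mother', 'grandmother',
                 '', 'grandfather', 'seduce', 'she', 'Mihailov']
# Fifth column from the Gensim output
gensim_pred = ['sister', 'sisters', 'mum', 'mother', 'granddaughter',
               '', 'granddaughter', 'bride', 'she', 'her']

print(accuracy(expected, selfmade_pred))  # 0.4
print(accuracy(expected, gensim_pred))    # 0.7
```

For the real files, the two columns could be read with `csv.reader(file, delimiter='\t')` and passed to the same function.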

