This is the record of the 94th knock, "Similarity calculation with WordSimilarity-353", from Language Processing 100 Knocks 2015. It calculates the similarity between word pairs listed in a file. Technically, it is only a small coding change from what has been done so far.
Link | Remarks |
---|---|
094.Similarity calculation with WordSimilarity-353_1.ipynb | GitHub link to the answer program |
094.Similarity calculation with WordSimilarity-353_2.ipynb | GitHub link to the Gensim-version answer program |
100 amateur language processing knocks: 94 | A blog I always rely on when working through the 100 knocks |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running virtually |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | python 3.6.9 on pyenv. There is no deep reason for not using 3.7 or the 3.8 series. Packages are managed with venv |
In the above environment, I use the following additional Python packages, installed with plain pip.
type | version |
---|---|
gensim | 3.8.1 |
numpy | 1.17.4 |
pandas | 0.25.3 |
In Chapter 10, we continue studying word vectors from the previous chapter.
Load the evaluation data of The WordSimilarity-353 Test Collection, and create a program that computes the similarity between the words in the first and second columns and appends the similarity value to the end of each line. Apply this program to the word vector created in knock 85 and the word vector created in knock 90.
The downloaded ZIP file contained several files; I used combined.tab among them.
The first row is a header, the first two columns hold the two words, and the third column holds the human-judged similarity (out of 10 points). The program computes the cosine similarity of each pair and appends it as a 4th column.
```text:combined.tab
Word 1 Word 2 Human (mean)
love sex 6.77
tiger cat 7.35
tiger tiger 10.00
book paper 7.46
computer keyboard 7.62
computer internet 7.58
plane car 5.77
train car 6.31
telephone communication 7.50
```
(Omitted below)
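Before the full program, cosine similarity itself is just the dot product of the two vectors divided by the product of their norms. A minimal NumPy sketch with toy vectors (hypothetical, not from the actual data):

```python
import numpy as np

# Toy vectors, just to illustrate the formula
v1 = np.array([1.0, 2.0, 0.0])
v2 = np.array([2.0, 4.0, 0.0])

# cos(v1, v2) = v1 . v2 / (|v1| * |v2|)
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos_sim)  # parallel vectors -> 1.0
```

Since v2 is exactly 2 * v1, the vectors are parallel and the cosine similarity is 1.0.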
```python
import csv

import numpy as np
import pandas as pd

# Saved without naming the array, so it is stored under the default key 'arr_0'
matrix_x300 = np.load('./../09.Vector space method(I)/085.matrix_x300.npz')['arr_0']
print('matrix_x300 Shape:', matrix_x300.shape)

group_t = pd.read_pickle('./../09.Vector space method(I)/083_group_t.zip')

# Cosine similarity calculation
def get_cos_similarity(line):
    try:
        v1 = matrix_x300[group_t.index.get_loc(line[0])]
        v2 = matrix_x300[group_t.index.get_loc(line[1])]

        # If either vector is all zeros, return -1
        if np.count_nonzero(v1) == 0 \
           or np.count_nonzero(v2) == 0:
            line.extend([-1])
        else:
            line.extend([np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))])
    except KeyError:
        # Word not found in the index
        line.extend([-1])
    return line

# Read the evaluation data
with open('./combined.tab') as file_in:
    reader = csv.reader(file_in, delimiter='\t')
    header = next(reader)
    result = [get_cos_similarity(line) for line in reader]

with open('094.combine_1.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)
```
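The -1 sentinel covers two failure modes: a word missing from the index (the lookup raises KeyError) and an all-zero vector (the norm is 0, so the division is undefined). A self-contained sketch of that logic, with a toy dictionary standing in for group_t / matrix_x300 (all names and vectors here are hypothetical):

```python
import numpy as np

# Toy word-vector table standing in for group_t / matrix_x300
vectors = {
    "cat":   np.array([1.0, 0.0]),
    "tiger": np.array([1.0, 1.0]),
    "void":  np.array([0.0, 0.0]),  # all-zero vector
}

def cos_or_sentinel(w1, w2):
    try:
        v1, v2 = vectors[w1], vectors[w2]
    except KeyError:
        return -1  # word not in the index
    if np.count_nonzero(v1) == 0 or np.count_nonzero(v2) == 0:
        return -1  # zero vector: cosine similarity is undefined
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cos_or_sentinel("cat", "tiger"))  # ~0.707
print(cos_or_sentinel("cat", "void"))   # -1 (zero vector)
print(cos_or_sentinel("cat", "dog"))    # -1 (unknown word)
```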
There is no technical explanation, since this just combines what has come so far. The result is output as a tab-delimited text file without a header line. Perhaps unsurprisingly, the values differ considerably from the human-judged similarities. Since this does not search for similar words, it finishes in under a second.
```text:094.combine_1.txt
love sex 6.77 0.28564147035983395
tiger cat 7.35 0.848285056343736
tiger tiger 10.00 1.0000000000000002
book paper 7.46 0.4900762715360672
computer keyboard 7.62 0.09513773584009234
computer internet 7.58 0.2659421289876719
plane car 5.77 0.48590778050802136
train car 6.31 0.2976902017313069
telephone communication 7.50 0.1848868997304664
television radio 6.77 0.7724947668094843
```
(Omitted below)
```python
import csv

import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load('./090.word2vec.model')
print(model)

# Cosine similarity calculation
def get_cos_similarity(line):
    try:
        v1 = model.wv[line[0]]
        v2 = model.wv[line[1]]

        # If either vector is all zeros, return -1
        if np.count_nonzero(v1) == 0 \
           or np.count_nonzero(v2) == 0:
            line.extend([-1])
        else:
            line.extend([np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))])
    except KeyError:
        # Word not in the model's vocabulary
        line.extend([-1])
    return line

# Read the evaluation data
with open('./combined.tab') as file_in:
    reader = csv.reader(file_in, delimiter='\t')
    header = next(reader)
    result = [get_cos_similarity(line) for line in reader]

with open('094.combine_2.txt', 'w') as file_out:
    writer = csv.writer(file_out, delimiter='\t', lineterminator='\n')
    writer.writerows(result)
```
The code is not much different from the self-made version. As for the results, however, the Gensim version tracks the human judgments far better than my handmade vectors do.
```text:094.combine_2.txt
love sex 6.77 0.5481953
tiger cat 7.35 0.7811356
tiger tiger 10.00 1.0
book paper 7.46 0.5549785
computer keyboard 7.62 0.6746693
computer internet 7.58 0.6775914
plane car 5.77 0.5873176
train car 6.31 0.6229327
telephone communication 7.50 0.52026355
television radio 6.77 0.7744317
```
(Omitted below)