This is my record of problem 87, "Word similarity," from the 2015 edition of the 100 Language Processing Knocks. At last the word vectors are used to find the similarity between words; it feels like the preprocessing is finally over and we are getting to the main subject. Someday I would like to compute word similarities from my own email or meeting minutes. Cosine similarity is used as the similarity measure. When I was learning trigonometric functions in high school I wondered, "What is this ever useful for?", so it is satisfying to see a real use. Programmatically, it is not difficult.
Link | Remarks
---|---
087.Word similarity.ipynb | GitHub link to the answer program
100 amateur language processing knocks: 87 | The series of 100 language-processing knocks I always rely on
type | version | Contents
---|---|---
OS | Ubuntu 18.04.01 LTS | Runs as a virtual machine
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments
Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv
In the above environment, I use the following additional Python packages; just install them with regular pip.
type | version |
---|---|
numpy | 1.17.4 |
pandas | 0.25.3 |
enwiki-20150112-400-r10-105752.txt.bz2 is a bzip2-compressed text of 105,752 articles, randomly sampled at a 1/10 rate from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meanings of words. In the first half of Chapter 9, the process of learning word vectors is implemented as several steps that apply principal component analysis to a word-context co-occurrence matrix built from the corpus. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to calculate word similarities and perform analogies.
Note that a straightforward implementation of problem 83 requires a large amount (about 7 GB) of main memory. If you run out of memory, devise a workaround or use the 1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).
This time, the *1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2)* is used.
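As a side note on the memory concerns above, the compressed corpus does not need to be decompressed on disk: Python's standard `bz2` module can stream it line by line. A minimal sketch (the file name matches the corpus above; everything else is only illustrative):

```python
import bz2

# Stream the bzip2-compressed corpus line by line instead of
# decompressing the whole file into memory at once.
with bz2.open('enwiki-20150112-400-r100-10576.txt.bz2', 'rt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        print(line.rstrip('\n'))
        if i >= 4:  # show only the first few lines
            break
```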
> Read the word meaning vectors obtained in problem 85 and calculate the cosine similarity between "United States" and "U.S.". However, note that "U.S." is internally expressed as "U.S".
Cosine similarity is given by the following formula: the inner product of the vectors divided by the product of their norms. Search the web and you will find plenty of detailed explanations.
\frac{\boldsymbol{A}\cdot\boldsymbol{B}}{|\boldsymbol{A}|\,|\boldsymbol{B}|}
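Expressed directly in NumPy, the formula is a one-liner. A tiny self-contained check with made-up vectors (the numbers are only for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Inner product divided by the product of the norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # parallel to a, so similarity is 1.0
print(cosine_similarity(a, b))   # -> 1.0
print(cosine_similarity(a, -b))  # -> -1.0 (opposite direction)
```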
```python
import numpy as np
import pandas as pd

# No keyword argument was given when saving, so the array is stored under 'arr_0'
matrix_x300 = np.load('085.matrix_x300.npz')['arr_0']
print('matrix_x300 Shape:', matrix_x300.shape)

# group_t maps each target word to its row index; it was built in an earlier
# knock (the pickle file name below is an assumption based on that step)
group_t = pd.read_pickle('./083_group_t.zip')

# Word vectors for 'United_States' and 'U.S'
v1 = matrix_x300[group_t.index.get_loc('United_States')]
v2 = matrix_x300[group_t.index.get_loc('U.S')]

print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```
The first half is the same program as the previous knock. Each word vector is extracted into the variables `v1` and `v2`.
```python
# Word vectors for 'United_States' and 'U.S'
v1 = matrix_x300[group_t.index.get_loc('United_States')]
v2 = matrix_x300[group_t.index.get_loc('U.S')]
```
All that remains is the calculation: the inner product is computed with `dot`, and the norms with `norm`.
```python
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```
Since the maximum value is 1 (identical directions give exactly 1), a similarity of 0.83 means the two words are quite similar.
```
0.837516976284694
```
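To compare other word pairs, the same computation can be wrapped in a small helper. A sketch assuming the `matrix_x300` and `group_t` variables from the program above (a word missing from the corpus would raise a `KeyError`):

```python
def word_similarity(w1: str, w2: str) -> float:
    """Cosine similarity between the vectors of two words in group_t."""
    v1 = matrix_x300[group_t.index.get_loc(w1)]
    v2 = matrix_x300[group_t.index.get_loc(w2)]
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(word_similarity('United_States', 'U.S'))
```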