100 amateur language processing knocks: 87

This is a record of my attempt at the 100 Language Processing Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of past knocks is available here (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 9: Vector Space Methods (I)

enwiki-20150112-400-r10-105752.txt.bz2 is a bzip2-compressed file containing the text of 105,752 articles, randomly sampled at a 1/10 rate from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meanings of words. In the first half of Chapter 9, this is implemented as a series of steps: a word-context co-occurrence matrix is built from the corpus, and principal component analysis is applied to it to learn word vectors. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to calculate word similarities and solve analogies.

Note that a straightforward implementation of problem 83 requires a large amount of main memory (about 7 GB). If you run out of memory, devise a workaround or use the 1/100-sampled corpus enwiki-20150112-400-r100-10576.txt.bz2 (/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).

87. Word similarity

Read the word meaning vectors obtained in problem 85, and calculate the cosine similarity between "United States" and "U.S.". Note, however, that "U.S." is internally represented as "U.S".

The finished code:

main.py


# coding: utf-8
import pickle
from collections import OrderedDict
from scipy import io
import numpy as np

fname_dict_index_t = 'dict_index_t'
fname_matrix_x300 = 'matrix_x300'


def cos_sim(vec_a, vec_b):
	'''Compute the cosine similarity of vectors vec_a and vec_b.

	Return value:
	Cosine similarity
	'''
	norm_ab = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
	if norm_ab != 0:
		return np.dot(vec_a, vec_b) / norm_ab
	else:
		# If a vector norm is 0, similarity cannot even be determined,
		# so return the lowest possible value
		return -1


# Read the dictionary
with open(fname_dict_index_t, 'rb') as data_file:
	dict_index_t = pickle.load(data_file)

# Read the matrix
matrix_x300 = io.loadmat(fname_matrix_x300)['matrix_x300']

# Display the cosine similarity of 'United_States' and 'U.S'
vec_a = matrix_x300[dict_index_t['United_States']]
vec_b = matrix_x300[dict_index_t['U.S']]

print(cos_sim(vec_a, vec_b))

Execution result:


0.832760703627

Cosine similarity

Cosine similarity is a measure of the degree to which two vectors point in the same direction. Words themselves cannot be compared for similarity directly, but because the words have been vectorized, we can now judge whether they are similar by examining the cosine similarity of their vectors.

The cosine similarity of vectors $\boldsymbol{A}$ and $\boldsymbol{B}$ can be calculated by the following formula.

\frac{\boldsymbol{A}\cdot\boldsymbol{B}}{|\boldsymbol{A}|\,|\boldsymbol{B}|}

Here, $|\boldsymbol{A}|$ is the length (norm) of the vector $\boldsymbol{A}$, which can be obtained with numpy.linalg.norm().
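As a sanity check, the formula's behavior can be tried on toy vectors (these vectors are made up for illustration; they are not the actual word vectors from problem 85):

```python
import numpy as np


def cos_sim(vec_a, vec_b):
    '''Cosine similarity: dot product divided by the product of the norms.'''
    norm_ab = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if norm_ab != 0:
        return np.dot(vec_a, vec_b) / norm_ab
    else:
        # A zero vector has no direction, so return the lowest value
        return -1


a = np.array([1.0, 2.0, 3.0])
print(cos_sim(a, a))    # same direction: ~1.0
print(cos_sim(a, -a))   # opposite direction: ~-1.0
print(cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal: ~0.0
```

Identical vectors score 1, opposite vectors score -1, and orthogonal (unrelated) vectors score 0, which is why a score of about 0.83 between two word vectors indicates strong similarity.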

The maximum cosine similarity is 1, and the higher the value, the more similar the vectors. Since this run produced about 0.83, we can infer that "United States" and "U.S." are quite similar words. It's pretty cool that the program can figure out that "United States" and "U.S." are close words just by reading Wikipedia articles.

For more information on cosine similarity, try googling "cosine similarity".

That's all for the 88th knock. If you find any mistakes, I'd appreciate it if you could point them out.

