This is a record of knock 90, "Learning with word2vec", from Language Processing 100 Knocks 2015. The task is to redo, easily and with a package, what we built by hand in Chapter 9. It is impressive, and a little humbling, that what I struggled to build while worrying about running out of memory can be done in about three lines of code. This time, instead of Google's word2vec specified in the task, I use the open-source Gensim. I have heard that the package is updated frequently and is widely used (I have not researched this thoroughly).

Reference link

Link	Remarks
090.Learning with word2vec.ipynb	Answer program GitHub link
100 amateur language processing knocks:90	I always rely on this article when working through the 100 knocks

Environment

Type	Version	Notes
OS	Ubuntu 18.04.01 LTS	Running in a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes use multiple Python environments
Python	3.6.9	Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv

In the above environment, I use the following additional Python packages. They can be installed with plain pip.

type	version
gensim	3.8.1
numpy	1.17.4

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

90. Learning with word2vec

Apply word2vec to the corpus created in 81 and learn word vectors. Then convert the learned word vectors to a suitable format and run the programs from 86 to 89.

Answer

Answer program [090.word2vec learning.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/090.word2vec%E3%81%AB%E3%82%88%E3%82%8B%E5%AD%A6%E7%BF%92.ipynb)

from pprint import pprint

import numpy as np  # needed for the cosine similarity in knock 87
from gensim.models import word2vec

corpus = word2vec.Text8Corpus('./../09.Vector space method(I)/081.corpus.txt')

model = word2vec.Word2Vec(corpus, size=300)
model.save('090.word2vec.model')

# 86.Display word vector
pprint(model.wv['United_States'])

# 87.Word similarity
print(np.dot(model.wv['United_States'], model.wv['U.S']) / (np.linalg.norm(model.wv['United_States']) * np.linalg.norm(model.wv['U.S'])))

# 88.10 words with high similarity
pprint(model.wv.most_similar('England'))

# 89.Analogy by additive construct
# vec("Spain") - vec("Madrid") + vec("Athens")
pprint(model.wv.most_similar(positive=['Spain', 'Athens'], negative=['Madrid']))

Answer commentary

Word vector generation

First, read the file. I had seen many examples that use the Text8Corpus function, so I wondered what text8 actually is. According to the article "Making a Japanese version of the text8 corpus and learning distributed expressions" (https://hironsan.hatenablog.com/entry/japanese-text8-corpus), text8 appears to be Wikipedia data that has been processed as follows.

- Keep text and image captions
- Remove tables and links to foreign language versions
- Remove citations, footnotes, and markup
- For hypertext, keep only the anchor text and remove everything else
- Spell out numbers digit by digit; for example, "20" becomes "two zero"
- Convert uppercase to lowercase
- Convert characters outside the a-z range to spaces
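The last three rules in the list are simple enough to sketch in a few lines. This is only an illustration of the rules as described in the article, not the script that actually produced text8; the function name `normalize_like_text8` and the sample sentence are made up for this example.

```python
import re

# Digit-to-word table used to spell numbers out one digit at a time.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def normalize_like_text8(text):
    """Hypothetical helper: spell out digits, lowercase, keep only a-z."""
    # Spell out every ASCII digit, padded with spaces so "20" -> " two  zero ".
    spelled = re.sub(r"[0-9]", lambda m: " " + DIGIT_WORDS[m.group()] + " ", text)
    lowered = spelled.lower()
    # Anything outside a-z becomes a space; split/join collapses repeated spaces.
    letters_only = re.sub(r"[^a-z]", " ", lowered)
    return " ".join(letters_only.split())


print(normalize_like_text8("In 20 years, the United_States changed."))
```

Under these rules a compound token such as "United_States" would be split into two lowercase words, which is one way to see that the corpus from knock 81 is not strictly text8-formatted.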

My corpus does contain capital letters, but otherwise it roughly meets these conditions, so I used Text8Corpus.

corpus = word2vec.Text8Corpus('./../09.Vector space method(I)/081.corpus.txt')
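As far as I understand it, Text8Corpus treats the file as one long stream of whitespace-separated words and cuts that stream into fixed-length pseudo-sentences (I believe the default limit in gensim 3.x is 10000 words, but treat that number as an assumption). The sketch below imitates that idea in plain Python; `chunk_words` is a made-up name and this is not gensim's actual implementation.

```python
def chunk_words(words, max_sentence_length=10000):
    """Yield consecutive lists of at most max_sentence_length words.

    Hypothetical stand-in for the way a text8-style reader splits a stream.
    """
    for start in range(0, len(words), max_sentence_length):
        yield words[start:start + max_sentence_length]


text = "the quick brown fox\njumps over the lazy dog"
words = text.split()  # newlines are treated like any other whitespace
sentences = list(chunk_words(words, max_sentence_length=4))
print(sentences)
```

If this reading is right, line breaks in the corpus file do not act as sentence boundaries, so a context window can span two original lines.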

All you have to do is call the Word2Vec function to get 300-dimensional word vectors. Training took less than 4 minutes. Wow... I did not use any other options, but an article listing gensim's word2vec options was easy to follow.

model = word2vec.Word2Vec(corpus, size=300)

Then save the model to a file for the subsequent knocks.

model.save('090.word2vec.model')

Saving creates the following three files. It is a little annoying that the model is not a single file.

File	size
090.word2vec.model	5MB
090.word2vec.model.trainables.syn1neg.npy	103MB
090.word2vec.model.wv.vectors.npy	103MB
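The two 103MB files are plausibly two matrices of the same shape: one row of 300 values per vocabulary word. Assuming the values are stored as 32-bit floats (my assumption, not something verified here), the file size gives a rough estimate of the vocabulary size. The function names below are made up for this back-of-the-envelope calculation.

```python
import numpy as np

VECTOR_SIZE = 300  # size=300 from the training call
# Assumption: each value is a 4-byte float32.
BYTES_PER_VALUE = np.dtype(np.float32).itemsize


def matrix_bytes(vocab_size, vector_size=VECTOR_SIZE):
    """Bytes needed for a vocab_size x vector_size float32 matrix."""
    return vocab_size * vector_size * BYTES_PER_VALUE


def estimate_vocab_size(file_bytes, vector_size=VECTOR_SIZE):
    """Invert matrix_bytes, ignoring the small .npy file header."""
    return file_bytes // (vector_size * BYTES_PER_VALUE)


# Treat "103MB" as either 10**6-byte or 2**20-byte megabytes.
print(estimate_vocab_size(103 * 10**6))
print(estimate_vocab_size(103 * 2**20))
```

Under these assumptions the vocabulary would be somewhere around 86,000 to 90,000 words; the exact figure can be read from the trained model itself.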

86. Display of word vector

Read the word meaning vectors obtained in 85 and display the "United States" vector. However, note that "United States" is internally represented as "United_States".

The vectors are stored in model.wv, so just index it with the word.

pprint(model.wv['United_States'])

array([ 2.3478289 , -0.61461514,  0.0478639 ,  0.6709404 ,  1.1090833 ,
       -1.0814637 , -0.78162867, -1.2584596 , -0.04286158,  1.2928476 ,
Result omitted

87. Word similarity

Read the word meaning vectors obtained in 85 and calculate the cosine similarity between "United States" and "U.S.". However, note that "U.S." is internally represented as "U.S".

Use the model to calculate the cosine similarity between the same pair of words as in Chapter 9. In Chapter 9 the value was 0.837516976284694, so the similarity is higher this time.

print(np.dot(model.wv['United_States'], model.wv['U.S']) / (np.linalg.norm(model.wv['United_States']) * np.linalg.norm(model.wv['U.S'])))

0.8601596
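The one-line expression above is easier to reuse as a small helper. The sketch below uses made-up toy vectors instead of the trained model so that it runs on its own; `cos_sim` is a name chosen for this example.

```python
import numpy as np


def cos_sim(vec_a, vec_b):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either is all zeros."""
    norm_product = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if norm_product == 0:
        return 0.0
    return float(np.dot(vec_a, vec_b) / norm_product)


# Toy vectors (not taken from the trained model).
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a: 3 + 0 - 3 = 0

print(cos_sim(a, b))
print(cos_sim(a, c))
```

With the trained model, the call would be cos_sim(model.wv['United_States'], model.wv['U.S']). gensim also provides model.wv.similarity('United_States', 'U.S'), which should give the same value.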

88. 10 words with high similarity

Read the word meaning vectors obtained in 85, and output the 10 words with the highest cosine similarity to "England" together with their similarity.

You can output them just by using the most_similar function.

pprint(model.wv.most_similar('England'))

[('Scotland', 0.7884809970855713),
 ('Wales', 0.7721374034881592),
 ('Ireland', 0.6838206052780151),
 ('Britain', 0.6335258483886719),
 ('Hampshire', 0.6147407293319702),
 ('London', 0.6021863222122192),
 ('Cork', 0.5809425115585327),
 ('Manchester', 0.5767091512680054),
 ('Liverpool', 0.5765234231948853),
 ('Orleans', 0.5624016523361206)]

For comparison, the result in Chapter 9 was as follows. This time, words related to the United Kingdom rank higher, so the output looks more appropriate.

Scotland    0.6364961613062289
Italy   0.6033905306935802
Wales   0.5961887337227456
Australia   0.5953277272306978
Spain   0.5752511915429617
Japan   0.5611603300967408
France  0.5547284075334182
Germany 0.5539239745925412
United_Kingdom  0.5225684232409136
Cheshire    0.5125286144779688

89. Analogy by additive construct

Read the word meaning vectors obtained in 85, calculate vec("Spain") - vec("Madrid") + vec("Athens"), and output the 10 words with the highest similarity to that vector together with their similarity.

If you pass positive and negative to the most_similar function, it does the calculation and outputs the 10 words with the highest similarity.

pprint(model.wv.most_similar(positive=['Spain', 'Athens'], negative=['Madrid']))

[('Denmark', 0.7606724500656128),
 ('Italy', 0.7585107088088989),
 ('Austria', 0.7528095841407776),
 ('Greece', 0.7401891350746155),
 ('Egypt', 0.7314825057983398),
 ('Russia', 0.7225484848022461),
 ('Great_Britain', 0.7184625864028931),
 ('Norway', 0.7148114442825317),
 ('Rome', 0.7076312303543091),
 ('kingdom', 0.6994863748550415)]

For comparison, the result in Chapter 9 was as follows. This time "Greece" appears in 4th place, so the output looks more appropriate.

Spain   0.8178213952646727
Sweden  0.8071582503798717
Austria 0.7795030693787409
Italy   0.7466099164394225
Germany 0.7429125848677439
Belgium 0.729240312232219
Netherlands 0.7193045612969573
Télévisions   0.7067876635156688
Denmark 0.7062857691945504
France  0.7014078181006329
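To make the positive/negative idea concrete, here is a minimal sketch on hand-made toy vectors. It assumes the common recipe of normalizing each vector, averaging the positive ones with the negative ones sign-flipped, ranking by cosine similarity, and leaving the input words out of the result; I believe gensim works along these lines, but this is not its actual code. `toy_most_similar` and the 4-dimensional vectors are invented for illustration.

```python
import numpy as np

# Hypothetical toy vectors; axes: (country, capital, "Spanish", "Greek").
TOY_VECTORS = {
    "Spain":  np.array([1.0, 0.0, 1.0, 0.0]),
    "Madrid": np.array([0.0, 1.0, 1.0, 0.0]),
    "Greece": np.array([1.0, 0.0, 0.0, 1.0]),
    "Athens": np.array([0.0, 1.0, 0.0, 1.0]),
    "Italy":  np.array([1.0, 0.0, 0.0, 0.0]),
    "Rome":   np.array([0.0, 1.0, 0.0, 0.0]),
}


def toy_most_similar(vectors, positive, negative, topn=10):
    """Rank words by cosine similarity to mean(+positive, -negative)."""
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    signed = [unit[w] for w in positive] + [-unit[w] for w in negative]
    query = np.mean(signed, axis=0)
    query = query / np.linalg.norm(query)
    inputs = set(positive) | set(negative)
    # Assumption: the query words themselves are excluded from the ranking.
    scores = [(w, float(np.dot(query, u))) for w, u in unit.items() if w not in inputs]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


result = toy_most_similar(TOY_VECTORS, positive=["Spain", "Athens"], negative=["Madrid"])
for word, score in result:
    print(word, round(score, 4))
```

On this toy data "Greece" comes first. If gensim does exclude the query words as assumed here, that would also explain why "Spain" tops the Chapter 9 list but does not appear in the gensim output.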

100 language processing knock-90 (using Gensim): learning with word2vec