This is a record of the 90th knock, "Learning with word2vec", from Language Processing 100 Knock 2015. The task is essentially to redo what we did in Chapter 9, but easily, using a package. What I painstakingly wrote while worrying about running out of memory can be done in about three lines of code, which really drives home how powerful these packages are. Instead of Google's word2vec specified in the question, I use the open-source Gensim. I hear it is frequently updated and widely used (I haven't researched this thoroughly, so this is just my impression).
Link | Remarks |
---|---|
090.Learning with word2vec.ipynb | Answer program GitHub link |
100 amateur language processing knocks: 90 | The blog I always consult when working through the 100 knocks |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running on a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | I use Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In the above environment, I am using the following additional Python packages. Just install with regular pip.
type | version |
---|---|
gensim | 3.8.1 |
numpy | 1.17.4 |
In Chapter 10, we will continue to study word vectors from the previous chapter.
Apply word2vec to the corpus created in 81 and learn word vectors. Furthermore, convert the format of the learned word vectors and run the programs from 86 to 89.
from pprint import pprint

import numpy as np
from gensim.models import word2vec
corpus = word2vec.Text8Corpus('./../09.Vector space method(I)/081.corpus.txt')
model = word2vec.Word2Vec(corpus, size=300)
model.save('090.word2vec.model')
# 86.Display word vector
pprint(model.wv['United_States'])
# 87.Word similarity
print(np.dot(model.wv['United_States'], model.wv['U.S']) / (np.linalg.norm(model.wv['United_States']) * np.linalg.norm(model.wv['U.S'])))
# 88.10 words with high similarity
pprint(model.wv.most_similar('England'))
# 89.Analogy by additive construct
# vec("Spain") - vec("Madrid") + vec("Athens")
pprint(model.wv.most_similar(positive=['Spain', 'Athens'], negative=['Madrid']))
First, read the file. I saw many examples using the Text8Corpus function, so I wondered what Text8Corpus actually is in the first place.
According to the article "Making a Japanese version of the text8 corpus and learning distributed representations" (https://hironsan.hatenablog.com/entry/japanese-text8-corpus), text8 seems to be Wikipedia data processed as follows:
- Keep text and image captions
- Remove tables and links to foreign-language versions
- Remove citations, footnotes, and markup
- For hypertext, keep only the anchor text and remove everything else
- Spell out numbers; for example, "20" becomes "two zero"
- Convert uppercase to lowercase
- Convert characters outside the a-z range to spaces
My corpus still contains capital letters, but it generally meets these conditions, so I used Text8Corpus.
corpus = word2vec.Text8Corpus('./../09.Vector space method(I)/081.corpus.txt')
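As a quick check (not part of the answer), you can peek at what Text8Corpus actually yields: each item is one "sentence", i.e. a list of tokens.

```python
from gensim.models import word2vec

# Quick peek at what Text8Corpus yields: each item is a list of tokens
# (chunks of at most max_sentence_length words, 10,000 by default).
corpus = word2vec.Text8Corpus('./../09.Vector space method(I)/081.corpus.txt')
first_chunk = next(iter(corpus))
print(len(first_chunk))     # number of tokens in the first chunk
print(first_chunk[:10])     # first 10 tokens
```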
Then all you have to do is call the Word2Vec function and the 300-dimensional word vectors are done. Generating them took less than 4 minutes. Wow ...
I barely used any options, but the list of gensim word2vec options was easy to understand.
model = word2vec.Word2Vec(corpus, size=300)
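For reference, here is the same call as a sketch with the commonly tuned options written out. Apart from size, these are just the gensim 3.8 defaults, not values I actually tuned.

```python
# The same training call with common options made explicit.
# Apart from size, these are the gensim 3.8 defaults (an illustrative sketch).
model = word2vec.Word2Vec(
    corpus,
    size=300,     # dimensionality of the word vectors
    window=5,     # context window size
    min_count=5,  # drop words appearing fewer than 5 times
    sg=0,         # 0: CBOW, 1: skip-gram
    workers=3,    # number of worker threads
)
```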
Then save the file for subsequent knocks.
model.save('090.word2vec.model')
Saving creates the following three files. It is a little awkward that it is not a single file.
File | Size |
---|---|
090.word2vec.model | 5MB |
090.word2vec.model.trainables.syn1neg.npy | 103MB |
090.word2vec.model.wv.vectors.npy | 103MB |
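In later knocks the model is loaded back from these files. A minimal sketch of reloading; the .npy files are found automatically as long as they sit next to the .model file.

```python
from gensim.models import word2vec

# Reload the saved model; the accompanying .npy files in the same
# directory are loaded automatically.
model = word2vec.Word2Vec.load('090.word2vec.model')
print(model.wv['United_States'][:5])  # quick sanity check
```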
Read the word meaning vectors obtained in 85 and display the vector for "United States". Note, however, that "United States" is internally expressed as "United_States".
The vectors are stored in model.wv, so just index it with the word.
pprint(model.wv['United_States'])
array([ 2.3478289 , -0.61461514, 0.0478639 , 0.6709404 , 1.1090833 ,
-1.0814637 , -0.78162867, -1.2584596 , -0.04286158, 1.2928476 ,
Result omitted
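Not part of the answer, but a quick sanity check (assuming the gensim 3.8 API, where the vocabulary is exposed as model.wv.vocab):

```python
# Confirm the token survived min_count and the vector is 300-dimensional.
print('United_States' in model.wv.vocab)   # True if the token is in the vocabulary
print(model.wv['United_States'].shape)     # (300,)
```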
Read the word meaning vectors obtained in 85 and calculate the cosine similarity between "United States" and "U.S.". Note, however, that "U.S." is internally expressed as "U.S".
The cosine similarity is calculated from model in the same way as in Chapter 9. In Chapter 9 the value was 0.837516976284694, so this result shows a higher similarity.
print(np.dot(model.wv['United_States'], model.wv['U.S']) / (np.linalg.norm(model.wv['United_States']) * np.linalg.norm(model.wv['U.S'])))
0.8601596
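gensim also has this built in: model.wv.similarity computes the same cosine similarity, so the numpy expression above can be replaced with a one-liner.

```python
# Equivalent built-in: cosine similarity between the two word vectors.
print(model.wv.similarity('United_States', 'U.S'))  # should match the value above
```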
Read the word meaning vectors obtained in 85, and output 10 words with high cosine similarity to "England" together with their similarity.
You can output them just by using the most_similar function.
pprint(model.wv.most_similar('England'))
[('Scotland', 0.7884809970855713),
('Wales', 0.7721374034881592),
('Ireland', 0.6838206052780151),
('Britain', 0.6335258483886719),
('Hampshire', 0.6147407293319702),
('London', 0.6021863222122192),
('Cork', 0.5809425115585327),
('Manchester', 0.5767091512680054),
('Liverpool', 0.5765234231948853),
('Orleans', 0.5624016523361206)]
For reference, the Chapter 9 result was as follows; this time words related to the United Kingdom rank higher, so the output looks more correct.
Scotland 0.6364961613062289
Italy 0.6033905306935802
Wales 0.5961887337227456
Australia 0.5953277272306978
Spain 0.5752511915429617
Japan 0.5611603300967408
France 0.5547284075334182
Germany 0.5539239745925412
United_Kingdom 0.5225684232409136
Cheshire 0.5125286144779688
Read the word meaning vectors obtained in 85, calculate vec("Spain") - vec("Madrid") + vec("Athens"), and output 10 words with high similarity to that vector together with their similarity.
If you pass positive and negative to the most_similar function, it computes the combined vector and outputs 10 words with high similarity.
pprint(model.wv.most_similar(positive=['Spain', 'Athens'], negative=['Madrid']))
[('Denmark', 0.7606724500656128),
('Italy', 0.7585107088088989),
('Austria', 0.7528095841407776),
('Greece', 0.7401891350746155),
('Egypt', 0.7314825057983398),
('Russia', 0.7225484848022461),
('Great_Britain', 0.7184625864028931),
('Norway', 0.7148114442825317),
('Rome', 0.7076312303543091),
('kingdom', 0.6994863748550415)]
For reference, the Chapter 9 result was as follows; this time Greece appears in 4th place, so the output looks more correct.
Spain 0.8178213952646727
Sweden 0.8071582503798717
Austria 0.7795030693787409
Italy 0.7466099164394225
Germany 0.7429125848677439
Belgium 0.729240312232219
Netherlands 0.7193045612969573
Télévisions 0.7067876635156688
Denmark 0.7062857691945504
France 0.7014078181006329
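Incidentally, the same analogy can be built by hand: combine the raw vectors yourself and look up neighbours with similar_by_vector. A sketch; note that unlike most_similar it does not exclude the input words, and the ranking can differ slightly because most_similar combines unit-normalised vectors.

```python
# Build vec("Spain") - vec("Madrid") + vec("Athens") manually and query it.
analogy_vec = model.wv['Spain'] - model.wv['Madrid'] + model.wv['Athens']
pprint(model.wv.similar_by_vector(analogy_vec, topn=10))
```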