The 2020 edition of 100 Language Processing Knock, a well-known collection of natural language processing problems, has been released. This article presents the results of solving "Chapter 7: Word Vectors" out of Chapters 1 to 10 listed below.
- Chapter 1: Preparatory Movement
- Chapter 2: UNIX Commands
- Chapter 3: Regular Expressions
- Chapter 4: Morphological Analysis
- Chapter 5: Dependency Analysis
- Chapter 6: Machine Learning
- Chapter 7: Word Vectors
- Chapter 8: Neural Nets
- Chapter 9: RNN, CNN
- Chapter 10: Machine Translation
We use Google Colaboratory for the answers. For details on how to set up and use Google Colaboratory, see this article. A notebook containing the execution results of the answers below is available on GitHub.
Create programs that perform the following processing on word vectors (word embeddings), which represent the meaning of a word as a real-valued vector.
Download the pre-trained word vectors (3 million words/phrases, 300 dimensions) trained on the Google News dataset (about 100 billion words), and display the word vector for "United States". Note that "United States" is internally represented as "United_States".
First, download the specified pre-trained word vectors.
FILE_ID = "0B7XkCwpI5KDYNlNUTTlSS21pQmM"
FILE_NAME = "GoogleNews-vectors-negative300.bin.gz"
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=$FILE_ID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=$FILE_ID" -O $FILE_NAME && rm -rf /tmp/cookies.txt
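If the cookie-based wget trick above ever stops working, the third-party gdown package offers an alternative route. A minimal sketch, assuming gdown's CLI accepts a Google Drive URL and an output path (its flags may change between versions):

!pip install gdown
# Download the same file via gdown ($FILE_ID and $FILE_NAME are the Python variables defined above)
!gdown "https://drive.google.com/uc?id=$FILE_ID" -O $FILE_NAME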
Next, load the word vectors with Gensim, a library used for a wide range of natural language processing tasks.
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin.gz', binary=True)
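Loading all 3 million vectors takes a few minutes and several gigabytes of memory. While experimenting, you can load only the beginning of the file with the `limit` parameter of `load_word2vec_format`; a sketch, where 500,000 is an arbitrary cutoff:

# Optional: load only the first 500,000 vectors to save time and memory
small_model = KeyedVectors.load_word2vec_format(
    './GoogleNews-vectors-negative300.bin.gz', binary=True, limit=500000)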
Once loaded, you can retrieve a word vector simply by indexing the model with the word you want to vectorize.
model['United_States']
output
array([-3.61328125e-02, -4.83398438e-02, 2.35351562e-01, 1.74804688e-01,
-1.46484375e-01, -7.42187500e-02, -1.01562500e-01, -7.71484375e-02,
1.09375000e-01, -5.71289062e-02, -1.48437500e-01, -6.00585938e-02,
1.74804688e-01, -7.71484375e-02, 2.58789062e-02, -7.66601562e-02,
-3.80859375e-02, 1.35742188e-01, 3.75976562e-02, -4.19921875e-02,
...
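The returned value is a 300-dimensional NumPy array, which can be confirmed as follows:

model['United_States'].shape  # (300,)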
Calculate the cosine similarity between "United States" and "U.S.".
Here we use the `similarity` method. Given two words, it returns the cosine similarity between them.
model.similarity('United_States', 'U.S.')
output
0.73107743
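As a sanity check, `similarity` should match the cosine similarity computed by hand from the raw vectors; a minimal sketch:

import numpy as np

# Cosine similarity = dot product divided by the product of the vector norms
a, b = model['United_States'], model['U.S.']
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # ~0.7311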
Output the 10 words with the highest cosine similarity to "United States", along with their similarity scores.
Here we use the `most_similar` method. Given a word, it returns the `topn` most similar words together with their similarity scores.
model.most_similar('United_States', topn=10)
output
[('Unites_States', 0.7877248525619507),
('Untied_States', 0.7541370391845703),
('United_Sates', 0.74007248878479),
('U.S.', 0.7310774326324463),
('theUnited_States', 0.6404393911361694),
('America', 0.6178410053253174),
('UnitedStates', 0.6167312264442444),
('Europe', 0.6132988929748535),
('countries', 0.6044804453849792),
('Canada', 0.6019070148468018)]
Subtract the "Madrid" vector from the "Spain" word vector, add the "Athens" vector, and output the 10 words most similar to the resulting vector along with their similarity scores.
The `most_similar` method used in the previous problem can also return the words most similar to a computed vector when you specify the vectors to add and subtract via the `positive` and `negative` arguments. Following the problem statement, we display the words most similar to the vector `Spain - Madrid + Athens`, and `Greece` comes out in first place.
# Compute the analogy vector directly (used again in the sketch below)
vec = model['Spain'] - model['Madrid'] + model['Athens']
model.most_similar(positive=['Spain', 'Athens'], negative=['Madrid'], topn=10)
output
[('Greece', 0.6898481249809265),
('Aristeidis_Grigoriadis', 0.5606848001480103),
('Ioannis_Drymonakos', 0.5552908778190613),
('Greeks', 0.545068621635437),
('Ioannis_Christou', 0.5400862693786621),
('Hrysopiyi_Devetzi', 0.5248444676399231),
('Heraklio', 0.5207759737968445),
('Athens_Greece', 0.516880989074707),
('Lithuania', 0.5166866183280945),
('Iraklion', 0.5146791934967041)]
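The `vec` computed above can also be passed to `most_similar` directly. A sketch; note the result can differ slightly from the `positive`/`negative` form, which combines unit-normalized vectors internally:

# Query with the raw analogy vector instead of word lists
model.most_similar(positive=[vec], topn=10)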
Download the word analogy evaluation data. For each case, compute vec(word in the second column) - vec(word in the first column) + vec(word in the third column), find the word with the highest similarity to that vector, and compute its similarity. Append the obtained word and similarity to the end of each case.
Download the specified data.
!wget http://download.tensorflow.org/data/questions-words.txt
# Check the first 10 lines
!head -10 questions-words.txt
output
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
Athens Greece Bern Switzerland
Athens Greece Cairo Egypt
Athens Greece Canberra Australia
Athens Greece Hanoi Vietnam
Athens Greece Havana Cuba
This data includes a set for evaluating semantic analogies, such as (Athens - Greece, Tokyo - Japan), and a set for evaluating syntactic analogies, such as (walk - walks, write - writes). It consists of the following 14 categories in total; the first 5 correspond to the former and the rest to the latter.
No. | category |
---|---|
1 | capital-common-countries |
2 | capital-world |
3 | currency |
4 | city-in-state |
5 | family |
6 | gram1-adjective-to-adverb |
7 | gram2-opposite |
8 | gram3-comparative |
9 | gram4-superlative |
10 | gram5-present-participle |
11 | gram6-nationality-adjective |
12 | gram7-past-tense |
13 | gram8-plural |
14 | gram9-plural-verbs |
The following reads the file line by line, computes the most similar word and its similarity, and writes out the formatted data.
with open('./questions-words.txt', 'r') as f1, open('./questions-words-add.txt', 'w') as f2:
    for line in f1:  # Read f1 line by line, append the desired word and its similarity, and write to f2
        line = line.split()
        if line[0] == ':':
            category = line[1]
        else:
            word, cos = model.most_similar(positive=[line[1], line[2]], negative=[line[0]], topn=1)[0]
            f2.write(' '.join([category] + line + [word, str(cos)]) + '\n')
!head -10 questions-words-add.txt
output
capital-common-countries Athens Greece Baghdad Iraq Iraqi 0.6351870894432068
capital-common-countries Athens Greece Bangkok Thailand Thailand 0.7137669324874878
capital-common-countries Athens Greece Beijing China China 0.7235777974128723
capital-common-countries Athens Greece Berlin Germany Germany 0.6734622120857239
capital-common-countries Athens Greece Bern Switzerland Switzerland 0.4919748306274414
capital-common-countries Athens Greece Cairo Egypt Egypt 0.7527809739112854
capital-common-countries Athens Greece Canberra Australia Australia 0.583732545375824
capital-common-countries Athens Greece Hanoi Vietnam Viet_Nam 0.6276341676712036
capital-common-countries Athens Greece Havana Cuba Cuba 0.6460992097854614
capital-common-countries Athens Greece Helsinki Finland Finland 0.6899983882904053
Measure the accuracy on the semantic analogies and the syntactic analogies using the execution results of problem 64.
We count correct answers separately over the corresponding category groups.
with open('./questions-words-add.txt', 'r') as f:
    sem_cnt = 0  # number of semantic analogy cases
    sem_cor = 0  # number of correct semantic answers
    syn_cnt = 0  # number of syntactic analogy cases
    syn_cor = 0  # number of correct syntactic answers
    for line in f:
        line = line.split()
        if not line[0].startswith('gram'):
            sem_cnt += 1
            if line[4] == line[5]:
                sem_cor += 1
        else:
            syn_cnt += 1
            if line[4] == line[5]:
                syn_cor += 1

print(f'Semantic analogy correct answer rate: {sem_cor/sem_cnt:.3f}')
print(f'Syntactic analogy correct answer rate: {syn_cor/syn_cnt:.3f}')
output
Semantic analogy correct answer rate: 0.731
Syntactic analogy correct answer rate: 0.740
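For reference, Gensim also ships a built-in evaluator for this dataset. A sketch, assuming your Gensim version provides `evaluate_word_analogies` (3.4+); its handling of case and out-of-vocabulary words differs, so the numbers need not match the manual calculation above:

# Built-in analogy evaluation: returns the overall accuracy and per-section details
analogy_scores = model.evaluate_word_analogies('./questions-words.txt')
print(f'Overall accuracy: {analogy_scores[0]:.3f}')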
Download the evaluation data of The WordSimilarity-353 Test Collection, and compute the Spearman correlation coefficient between the similarity rankings calculated from word vectors and the rankings of human similarity judgments.
This data assigns human-rated similarity scores to pairs of words. We compute the word-vector similarity for each pair, then the Spearman rank correlation coefficient between the two sets of scores.
!wget http://www.gabrilovich.com/resources/data/wordsim353/wordsim353.zip
!unzip wordsim353.zip
output
Archive: wordsim353.zip
inflating: combined.csv
inflating: set1.csv
inflating: set2.csv
inflating: combined.tab
inflating: set1.tab
inflating: set2.tab
inflating: instructions.txt
!head -10 './combined.csv'
output
Word 1,Word 2,Human (mean)
love,sex,6.77
tiger,cat,7.35
tiger,tiger,10.00
book,paper,7.46
computer,keyboard,7.62
computer,internet,7.58
plane,car,5.77
train,car,6.31
telephone,communication,7.50
ws353 = []
with open('./combined.csv', 'r') as f:
    next(f)  # skip the header line
    for line in f:  # Read line by line and compute the word-vector similarity
        line = [s.strip() for s in line.split(',')]
        line.append(model.similarity(line[0], line[1]))
        ws353.append(line)

# Check the first few entries
for i in range(5):
    print(ws353[i])
output
['love', 'sex', '6.77', 0.2639377]
['tiger', 'cat', '7.35', 0.5172962]
['tiger', 'tiger', '10.00', 0.99999994]
['book', 'paper', '7.46', 0.3634626]
['computer', 'keyboard', '7.62', 0.39639163]
import numpy as np
from scipy.stats import spearmanr

# Calculate Spearman's rank correlation coefficient
# (np.array on mixed rows yields strings, so cast to float before ranking)
human = np.array(ws353).T[2].astype(float)  # human-judged similarity
w2v = np.array(ws353).T[3].astype(float)  # word-vector similarity
correlation, pvalue = spearmanr(human, w2v)
print(f'Spearman correlation coefficient: {correlation:.3f}')
output
Spearman correlation coefficient: 0.685
Extract the word vectors for country names, and run k-means clustering with the number of clusters k = 5.
Since a suitable source for a list of country names was not readily available, we collect the names from the word analogy evaluation data.
# Collect country names
countries = set()
with open('./questions-words-add.txt') as f:
    for line in f:
        line = line.split()
        if line[0] in ['capital-common-countries', 'capital-world']:
            countries.add(line[2])
        elif line[0] in ['currency', 'gram6-nationality-adjective']:
            countries.add(line[1])
countries = list(countries)

# Get the word vector for each country
countries_vec = [model[country] for country in countries]
from sklearn.cluster import KMeans

# k-means clustering
kmeans = KMeans(n_clusters=5)
kmeans.fit(countries_vec)
for i in range(5):
    cluster = np.where(kmeans.labels_ == i)[0]
    print('cluster', i)
    print(', '.join([countries[k] for k in cluster]))
output
cluster 0
Taiwan, Afghanistan, Iraq, Lebanon, Indonesia, Turkey, Egypt, Libya, Syria, Korea, China, Nepal, Cambodia, India, Bhutan, Qatar, Laos, Malaysia, Iran, Vietnam, Oman, Bahrain, Pakistan, Thailand, Bangladesh, Morocco, Jordan, Israel
cluster 1
Madagascar, Uganda, Botswana, Guinea, Malawi, Tunisia, Nigeria, Mauritania, Kenya, Zambia, Algeria, Mozambique, Ghana, Niger, Somalia, Angola, Mali, Senegal, Sudan, Zimbabwe, Gambia, Eritrea, Liberia, Burundi, Gabon, Rwanda, Namibia
cluster 2
Suriname, Uruguay, Tuvalu, Nicaragua, Colombia, Belize, Venezuela, Ecuador, Fiji, Peru, Guyana, Jamaica, Brazil, Honduras, Samoa, Bahamas, Dominica, Philippines, Cuba, Chile, Mexico, Argentina
cluster 3
Netherlands, Sweden, USA, Ireland, Canada, Spain, Malta, Greenland, Europe, Greece, France, Austria, Norway, Finland, Australia, Japan, Iceland, England, Italy, Denmark, Belgium, Switzerland, Germany, Portugal, Liechtenstein
cluster 4
Croatia, Belarus, Uzbekistan, Latvia, Tajikistan, Slovakia, Ukraine, Hungary, Albania, Poland, Montenegro, Georgia, Russia, Kyrgyzstan, Armenia, Romania, Cyprus, Lithuania, Azerbaijan, Serbia, Slovenia, Turkmenistan, Moldova, Bulgaria, Estonia, Kazakhstan, Macedonia
Run hierarchical clustering by Ward's method on the word vectors for country names. Then visualize the clustering result as a dendrogram.
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
plt.figure(figsize=(15, 5))
Z = linkage(countries_vec, method='ward')
dendrogram(Z, labels=countries)
plt.show()
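To compare with the k-means result, the dendrogram can also be cut into a fixed number of flat clusters with scipy's `fcluster`. A sketch, where 5 matches the k used above:

from scipy.cluster.hierarchy import fcluster

# Cut the Ward linkage into 5 flat clusters (labels are 1-indexed)
labels = fcluster(Z, t=5, criterion='maxclust')
for i in range(1, 6):
    print('cluster', i)
    print(', '.join([countries[k] for k in np.where(labels == i)[0]]))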
Visualize the vector space of the word vectors for country names with t-SNE.
Here we compress the word vectors to two dimensions with t-SNE and visualize them in a scatter plot.
!pip install bhtsne
import bhtsne
embedded = bhtsne.tsne(np.array(countries_vec).astype(np.float64), dimensions=2, rand_seed=123)
plt.figure(figsize=(10, 10))
plt.scatter(np.array(embedded).T[0], np.array(embedded).T[1])
for (x, y), name in zip(embedded, countries):
    plt.annotate(name, (x, y))
plt.show()
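If installing bhtsne is a problem, scikit-learn's `TSNE` produces a similar visualization. A sketch under that substitution; `random_state=123` is an arbitrary choice:

from sklearn.manifold import TSNE

# Two-dimensional t-SNE embedding with scikit-learn instead of bhtsne
embedded_sk = TSNE(n_components=2, random_state=123).fit_transform(np.array(countries_vec))
plt.figure(figsize=(10, 10))
plt.scatter(embedded_sk[:, 0], embedded_sk[:, 1])
for (x, y), name in zip(embedded_sk, countries):
    plt.annotate(name, (x, y))
plt.show()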
100 Language Processing Knock is designed so that you can learn not only natural language processing itself but also basic data handling and general-purpose machine learning. Even those studying machine learning through online courses will find it an excellent source of practice, so please give it a try.