This is a record of my attempt at the Language Processing 100 Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of past knocks is available here (http://qiita.com/segavvy/items/fb50ba8097d59475f760).
enwiki-20150112-400-r10-105752.txt.bz2 is the bzip2-compressed text of 105,752 articles, randomly sampled at a rate of 1/10 from the English Wikipedia articles that consisted of more than 400 words as of January 12, 2015. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meaning of words. In the first half of Chapter 9, principal component analysis is applied to the word-context co-occurrence matrix built from the corpus, and the process of learning word vectors is implemented in several steps. In the latter half of Chapter 9, the learned word vectors (300 dimensions) are used to calculate word similarity and to perform analogies.
Note that a straightforward implementation of problem 83 requires a large amount of main memory (about 7 GB). If you run out of memory, either devise a more economical approach or use the 1/100 sampled corpus enwiki-20150112-400-r100-10576.txt.bz2 (/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).
Read the word meaning vectors obtained in problem 85 and display the vector for "United States". Note, however, that "United States" is represented internally as "United_States".
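The naming convention in the problem statement can be illustrated with a small sketch. It assumes that multi-word names are stored with underscores in place of spaces, as the "United_States" example suggests. The helper names `to_token` and `lookup_vector`, the toy dictionary, and the toy 3-element vectors are hypothetical and are not part of the actual solution.

```python
import numpy as np


def to_token(phrase):
    """Convert a phrase such as 'United States' to its internal token form."""
    # Assumption: spaces are the only difference between the two forms.
    return phrase.replace(' ', '_')


# Toy stand-ins for the real dictionary and matrix (made-up values).
toy_dict_index_t = {'United_States': 0, 'England': 1}
toy_matrix = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
])


def lookup_vector(phrase):
    """Return the row for the phrase, or None if the word is not in the dictionary."""
    index = toy_dict_index_t.get(to_token(phrase))
    if index is None:
        return None
    return toy_matrix[index]


print(lookup_vector('United States'))
print(lookup_vector('Atlantis'))
```

Returning None for an unknown word is only one possible design; indexing the dictionary directly raises a KeyError instead.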
main.py
# coding: utf-8
import pickle
from collections import OrderedDict
from scipy import io
import numpy as np
fname_dict_index_t = 'dict_index_t'
fname_matrix_x300 = 'matrix_x300'
# Read the dictionary that maps each word to its row index
with open(fname_dict_index_t, 'rb') as data_file:
    dict_index_t = pickle.load(data_file)
# Read the matrix
matrix_x300 = io.loadmat(fname_matrix_x300)['matrix_x300']
# Display the word vector for 'United States'
print(matrix_x300[dict_index_t['United_States']])
Execution result
[ 3.60083662e+00 -7.88128084e-01 2.61994036e-01 6.54614795e+00
2.77978401e+00 -1.34643069e+00 -4.14679788e-01 -2.36571397e+00
6.48454026e-01 -1.81798692e-01 8.09115337e-01 1.87915076e+00
8.89790386e-01 2.91057396e+00 -2.05724474e+00 3.95015466e+00
-1.16785393e+00 -2.29594086e+00 2.87483314e-01 -3.22295491e-01
5.27833027e-02 1.32591124e+00 7.19768653e-02 -2.79842130e-01
-9.13285892e-03 -3.48882763e+00 2.80629048e+00 1.81757020e+00
1.01202749e+00 4.60392799e+00 4.35931867e-01 -9.47200476e-02
1.41464997e+00 -1.20815237e+00 1.59811576e+00 -7.90093385e-01
1.56584573e+00 -5.84883096e-01 -2.98866212e-01 -2.40969175e-01
2.01912319e+00 1.25522702e+00 -7.00878790e-01 2.54301034e+00
1.29071807e+00 4.99864524e-01 -2.08366007e-01 -3.34177888e-01
2.82855195e-01 -2.03289817e+00 -1.83255892e-02 6.94784136e-01
-1.68611375e-01 -6.54874637e-01 1.68042850e+00 1.89579749e-01
-4.58780381e-01 -1.39461125e+00 3.96468153e-02 1.07982308e+00
-2.01647855e+00 -6.31583022e-01 1.17090230e+00 -5.17860032e-01
-1.54354587e-01 -1.90240747e+00 4.24975361e-01 8.51292185e-01
-6.75733687e-01 -1.95373302e+00 2.86401504e+00 7.83145997e-01
3.60769615e-01 2.18517822e-01 9.09328784e-01 -2.31164499e+00
-5.29962261e-02 4.64315874e-01 -1.14818717e+00 4.37807725e-01
-8.72936322e-01 2.75689461e-01 -4.98886439e-01 -1.39319595e-01
1.70818184e+00 -1.42530608e+00 6.12346577e-01 -1.53452675e+00
-5.78771041e-01 5.74044574e-01 7.92225223e-01 -6.06557682e-01
4.20942844e-01 -6.44757207e-01 3.01797352e-02 -6.70597324e-01
-9.94382162e-01 -4.99397126e-01 5.90489124e-01 -3.31522663e-01
-1.49982021e+00 1.04485370e+00 1.30888498e+00 -7.15508080e-01
1.19164194e+00 5.10634752e-01 -6.83826569e-01 -1.70204338e+00
-3.06551527e-01 -7.96233183e-02 -8.78035415e-01 4.85365765e-01
-1.10059988e-01 1.08476384e+00 3.70272417e-01 -1.66487297e-02
2.53257364e+00 6.92406581e-01 -1.75201566e+00 -8.92891751e-02
-1.17317031e+00 -8.04520667e-01 -3.72208639e-01 -5.87968726e-01
6.33897294e-02 -4.25470101e-03 -1.07647720e-02 -1.43349655e+00
1.17827771e+00 -3.15443937e-01 1.12394158e+00 -1.26831340e+00
-9.69257805e-01 2.26313588e-01 2.13254757e-01 -1.03473199e+00
-9.07201782e-01 -9.96541296e-01 -1.09652409e+00 -1.95598158e+00
-1.44103220e+00 -6.48140969e-02 -9.82980349e-01 -8.45786568e-02
5.25832288e-01 -3.41535417e-01 1.67332240e+00 1.04440244e-01
-4.89830507e-01 1.47568054e-01 1.70129190e+00 1.14422426e+00
8.26973739e-01 7.07649835e-01 3.63384617e-01 -1.40773247e+00
-4.84105306e-01 -1.59593171e+00 1.01640270e+00 5.11171720e-01
-1.81608472e-01 2.09511452e-02 -3.97071523e-01 -3.68544617e-01
-3.03775580e-01 -7.36060412e-03 3.47125090e-01 -8.10847522e-01
-5.94050339e-02 1.04952201e+00 -1.81959226e-01 6.39576649e-01
-2.13652769e+00 2.21193903e-01 2.22833706e+00 3.15404529e-01
2.94974306e-02 1.81699352e+00 -2.52513345e-01 1.21497867e+00
1.93127372e+00 -1.40049583e+00 -3.92976140e-01 2.01746604e+00
3.48323962e-01 -1.27851426e+00 -8.37106664e-01 -6.77627274e-01
-7.55016169e-01 -7.26088763e-01 8.90254556e-01 2.05618152e+00
4.35043576e-01 -3.47253538e-01 2.45200710e+00 9.80268307e-01
-2.27851060e-01 9.84062157e-01 -4.81094077e-02 -2.76938831e-02
-1.73872055e+00 6.27352186e-01 3.69610149e-01 -2.39375141e+00
1.20634311e+00 9.16879237e-03 1.88932943e+00 -2.12446506e-01
-3.73810763e-01 -4.52664744e-01 1.33658447e+00 1.63348846e+00
-4.04242171e-01 1.24396257e+00 1.13995636e-01 1.56077956e-01
4.29892571e-01 -2.39289326e-01 7.55437299e-01 -1.35220485e-01
4.13112184e-01 1.69808593e+00 8.45655139e-01 -3.05053132e-01
4.26313358e-01 2.01935897e-01 -8.95808938e-02 -1.19706029e-01
8.58620660e-01 9.59342393e-02 6.90601959e-01 -9.52093790e-02
2.40653407e+00 1.26924728e+00 1.12005766e+00 -6.04110426e-01
6.64593790e-01 1.13045660e+00 3.73053754e-01 2.23601520e-01
-1.83664534e-01 -1.34208051e-01 4.52265923e-01 -1.95617572e-02
-1.09954830e+00 9.14058618e-01 4.16648849e-01 -1.73232268e-01
5.54256279e-01 6.43481094e-01 -6.14527995e-01 -9.87756033e-01
3.97245967e-01 -6.42933978e-02 1.14324979e+00 -5.75599318e-01
2.42005373e-01 -6.40143947e-01 2.95192002e-01 -7.13038483e-01
1.85032144e-02 -3.71692793e-01 6.69838053e-01 9.63435135e-01
-7.09443979e-01 1.12105308e-01 8.40726109e-01 5.08524168e-01
1.75758555e-01 1.44432107e-01 -2.55235895e-01 -4.54393729e-01
-5.18120965e-01 4.48156373e-01 -1.44818035e+00 1.51757130e-01
1.40229798e-01 -8.22383805e-01 5.12547787e-01 -5.62853223e-01
7.14130048e-01 5.20936783e-01 7.34849473e-01 8.70674020e-01
4.74195393e-01 7.28794927e-02 -1.08662671e-01 -1.28023393e+00
-1.21562850e-01 -7.30747051e-02 -6.98371195e-01 9.99403058e-01
2.21572245e-01 5.06539721e-01 -4.67786005e-02 -2.60209096e-01
3.52071509e-01 -7.90130862e-01 -4.07834390e-01 2.54070128e-01]
Problem 85 compressed the vectors to 300 dimensions, so the 300 numbers corresponding to "United_States" are the resulting word vector.
Note that the elements no longer correspond to individual context words, because they are recombined during dimensionality reduction. What remains is simply a sequence of 300 numbers, reconstructed so that as little information as possible is lost.
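To make the point about lost correspondence concrete, here is a minimal sketch on a toy matrix. It performs PCA through NumPy's SVD, which is not necessarily how problem 85 computed it, and the 4x3 matrix values are made up for illustration. Each row of `components` holds the weights that one reduced dimension places on all of the original context columns.

```python
import numpy as np

# Toy word-context matrix: 4 words (rows) x 3 context words (columns).
# The values are invented purely for illustration.
X = np.array([
    [2.0, 0.0, 1.0],
    [1.5, 0.2, 0.8],
    [0.1, 3.0, 0.0],
    [0.0, 2.5, 0.3],
])

# PCA via SVD of the column-centered matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2                                  # number of dimensions to keep
components = Vt[:k]                    # shape (k, 3): weights over the original columns
X_reduced = X_centered @ components.T  # shape (4, k): the compressed word vectors

# Each reduced dimension is a weighted mix of several context columns,
# so no single element can be read as "the count for context word j".
print(components)
print(X_reduced.shape)
```

The same reasoning applies to the 300-dimensional vectors: each element is a combination of many context-word columns.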
That's all for the 87th knock. If you notice any mistakes, I would appreciate it if you could point them out.