I wanted to use word2vec vectors in the embedding layer of a model such as seq2seq, so I have summarized the word2vec functions that are useful for this. The examples use the pretrained vectors published by Tohoku University (http://www.cl.ecei.tohoku.ac.jp/~m-suzuki/jawiki_vector/).
First, load the model. Set model_dir to the path of the downloaded entity_vector.model.bin.
from gensim.models import KeyedVectors
model_dir = 'path/to/entity_vector.model.bin'  # path to the downloaded model file
model = KeyedVectors.load_word2vec_format(model_dir, binary=True)
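If the load succeeds, a couple of optional sanity checks can be run. These are my own additions, and as far as I know the calls below work in both gensim 3.x and 4.x.
# confirm the dimensionality of the loaded vectors (the Tohoku vectors are 200-dimensional)
print(model.vector_size)  # 200
# a membership test avoids a KeyError when a word is not in the vocabulary
print('I' in model)  # True if the word is in the vocabulary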
The word vector can be obtained as follows.
print(model['I'])
output
[-0.9154563 0.97780323 -0.43780354 -0.6441212 -1.6350892 0.8619687
0.41775486 -1.094953 0.74489385 -1.6742945 -0.34408432 0.5552686
-3.9321985 0.3369842 1.5550056 1.3112832 -0.64914817 0.5996928
1.6174266 0.8126016 -0.75741744 1.7818885 2.1305397 1.8607832
3.0353768 -0.8547809 -0.87280065 -0.54103154 0.752979 3.8159864
-1.4066186 0.78604376 1.2102902 3.9960158 2.9654515 -2.6870391
-1.3829899 0.993459 0.86303824 0.29373714 4.0691266 -1.4974884
-1.5601915 1.4936608 0.550254 2.678553 0.53790915 -1.7294457
-0.46390963 -0.34776476 -1.2361934 -2.433244 -0.21348757 0.0488718
0.8197853 -0.59188586 1.7276062 0.9713122 -0.06519659 2.4763248
-0.93890053 0.36217824 1.887851 -0.0175399 -0.21866432 -0.81253886
-3.9667509 2.5340643 0.02985824 0.338091 -1.3745826 -2.3509336
-1.5615206 0.8273324 -1.263886 -1.2259561 0.9079302 2.0258994
-0.8576754 -2.5492477 -2.45557 -0.5216256 -1.3474834 2.3590422
1.0459667 2.0919168 1.6904455 1.7064931 0.7376105 0.2567448
-0.8194208 0.8788849 -0.89287275 -0.22960001 1.8320689 -1.7200342
0.8977642 1.5119879 -0.3325551 0.7429934 -1.2087826 0.5350336
-0.03887295 -1.9642036 1.0406445 -0.80972534 0.49987233 2.419521
-0.30317742 0.96494234 0.6184119 1.2633535 2.688754 -0.7226699
-2.8695397 -0.8986926 0.1258761 -0.75310475 1.099076 0.90656924
0.24586082 0.44014114 0.85891217 0.34273988 0.07071286 -0.71412176
1.4705397 3.6965442 -2.5951867 -2.4142458 1.2733719 -0.22638321
0.15742263 -0.717717 2.2888887 3.3045793 -0.8173686 1.368556
0.34260234 1.1644434 2.2652006 -0.47847173 1.5130697 3.481819
-1.5247481 2.166555 0.7633031 0.61121356 -0.11627229 1.0461875
1.4994645 -2.8477156 -2.9415505 -0.86640745 -1.1220155 0.10772963
-1.6050811 -2.519997 -0.13945188 -0.06943721 0.83996797 0.29909992
0.7927955 -1.1932545 -0.375592 0.4437512 -1.4635806 -0.16438413
0.93455386 -0.4142645 -0.92249537 -1.0754105 0.07403489 1.0781559
1.7206618 -0.69100255 -2.6112185 1.4985414 -1.8344582 -0.75036854
1.6177907 -0.47727013 0.88055164 -1.057859 -2.0196638 -3.5305111
1.1221203 3.3149185 0.859528 2.3817215 -1.1856595 -0.03347144
-0.84533554 2.201596 -2.1573794 -0.6228852 0.12370715 3.030279
-1.9215534 0.09835044]
In this way, a 200-dimensional vector is output.
You can get the vectors for all words as a single matrix with the following code.
w2v_vector = model.wv.syn0  # the full embedding matrix (gensim 3.x attribute)
Check the shape of the matrix.
print(w2v_vector.shape)
output
(1015474, 200)
From this output, we can see that 1015474 words are stored, each as a 200-dimensional vector.
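By the way, syn0 is the old attribute name; in newer gensim (and in gensim 4.x, where syn0 was removed) the same matrix should be available as model.vectors. A minimal sketch, assuming a newer gensim version:
# newer gensim: the embedding matrix is exposed as model.vectors (syn0 was removed in 4.x)
w2v_vector = model.vectors
print(w2v_vector.shape)  # (1015474, 200)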
We now have the vectors for all the words, but we still don't know which row corresponds to which word. So let's look up a word's ID (row index) in word2vec.
print(model.vocab['I'].index)
output
1027
This shows that the word "I" corresponds to index 1027. Now let's output the vector at index 1027.
print(w2v_vector[model.vocab['I'].index])
output
[-0.9154563 0.97780323 -0.43780354 -0.6441212 -1.6350892 0.8619687
0.41775486 -1.094953 0.74489385 -1.6742945 -0.34408432 0.5552686
-3.9321985 0.3369842 1.5550056 1.3112832 -0.64914817 0.5996928
1.6174266 0.8126016 -0.75741744 1.7818885 2.1305397 1.8607832
3.0353768 -0.8547809 -0.87280065 -0.54103154 0.752979 3.8159864
-1.4066186 0.78604376 1.2102902 3.9960158 2.9654515 -2.6870391
-1.3829899 0.993459 0.86303824 0.29373714 4.0691266 -1.4974884
-1.5601915 1.4936608 0.550254 2.678553 0.53790915 -1.7294457
-0.46390963 -0.34776476 -1.2361934 -2.433244 -0.21348757 0.0488718
0.8197853 -0.59188586 1.7276062 0.9713122 -0.06519659 2.4763248
-0.93890053 0.36217824 1.887851 -0.0175399 -0.21866432 -0.81253886
-3.9667509 2.5340643 0.02985824 0.338091 -1.3745826 -2.3509336
-1.5615206 0.8273324 -1.263886 -1.2259561 0.9079302 2.0258994
-0.8576754 -2.5492477 -2.45557 -0.5216256 -1.3474834 2.3590422
1.0459667 2.0919168 1.6904455 1.7064931 0.7376105 0.2567448
-0.8194208 0.8788849 -0.89287275 -0.22960001 1.8320689 -1.7200342
0.8977642 1.5119879 -0.3325551 0.7429934 -1.2087826 0.5350336
-0.03887295 -1.9642036 1.0406445 -0.80972534 0.49987233 2.419521
-0.30317742 0.96494234 0.6184119 1.2633535 2.688754 -0.7226699
-2.8695397 -0.8986926 0.1258761 -0.75310475 1.099076 0.90656924
0.24586082 0.44014114 0.85891217 0.34273988 0.07071286 -0.71412176
1.4705397 3.6965442 -2.5951867 -2.4142458 1.2733719 -0.22638321
0.15742263 -0.717717 2.2888887 3.3045793 -0.8173686 1.368556
0.34260234 1.1644434 2.2652006 -0.47847173 1.5130697 3.481819
-1.5247481 2.166555 0.7633031 0.61121356 -0.11627229 1.0461875
1.4994645 -2.8477156 -2.9415505 -0.86640745 -1.1220155 0.10772963
-1.6050811 -2.519997 -0.13945188 -0.06943721 0.83996797 0.29909992
0.7927955 -1.1932545 -0.375592 0.4437512 -1.4635806 -0.16438413
0.93455386 -0.4142645 -0.92249537 -1.0754105 0.07403489 1.0781559
1.7206618 -0.69100255 -2.6112185 1.4985414 -1.8344582 -0.75036854
1.6177907 -0.47727013 0.88055164 -1.057859 -2.0196638 -3.5305111
1.1221203 3.3149185 0.859528 2.3817215 -1.1856595 -0.03347144
-0.84533554 2.201596 -2.1573794 -0.6228852 0.12370715 3.030279
-1.9215534 0.09835044]
Comparing this with the first output, you can see that the two vectors are identical.
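Instead of comparing the printed output by eye, you can also verify this programmatically. A small sketch (on gensim 4.x you would look up the index with model.key_to_index instead of model.vocab):
import numpy as np

# look up the row index of "I" (gensim 3.x; on 4.x: idx = model.key_to_index['I'])
idx = model.vocab['I'].index
# confirm that the row in the full matrix matches the vector returned for the word
print(np.array_equal(model['I'], w2v_vector[idx]))  # True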
Now that the word2vec vectors are ready to use, next time I will try training with them in the Embedding layer of the encoder of a seq2seq model.
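As a preview, here is a minimal sketch of how the matrix could be plugged into an Embedding layer. The framework (TensorFlow/Keras) and the argument choices are my own assumptions for illustration, not something decided in this article.
import tensorflow as tf

# minimal sketch (assuming TensorFlow/Keras): initialize an Embedding layer
# with the pretrained word2vec matrix and keep it frozen during training
vocab_size, embedding_dim = w2v_vector.shape  # (1015474, 200)
embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(w2v_vector),
    trainable=False,
)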