I wanted to use word2vec vectors in the embedding layer of a model such as seq2seq, so I have summarized the word2vec functions that are useful for this. The examples use the pretrained vectors published by Tohoku University (http://www.cl.ecei.tohoku.ac.jp/~m-suzuki/jawiki_vector/).
First, load the model. Set model_dir to the path of the downloaded entity_vector.model.bin.
from gensim.models import KeyedVectors
model_dir = 'path/to/entity_vector.model.bin'  # path to the downloaded model file
model = KeyedVectors.load_word2vec_format(model_dir, binary=True)
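If the load succeeds, a couple of optional sanity checks can be run. These are my own additions, and as far as I know the calls below work in both gensim 3.x and 4.x.
# confirm the dimensionality of the loaded vectors (the Tohoku vectors are 200-dimensional)
print(model.vector_size)  # 200
# a membership test avoids a KeyError when a word is not in the vocabulary
print('I' in model)  # True if the word is in the vocabulary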
The word vector can be obtained as follows.
print(model['I'])
output
[-0.9154563 0.97780323 -0.43780354 -0.6441212 -1.6350892 0.8619687
0.41775486 -1.094953 0.74489385 -1.6742945 -0.34408432 0.5552686
-3.9321985 0.3369842 1.5550056 1.3112832 -0.64914817 0.5996928
1.6174266 0.8126016 -0.75741744 1.7818885 2.1305397 1.8607832
3.0353768 -0.8547809 -0.87280065 -0.54103154 0.752979 3.8159864
-1.4066186 0.78604376 1.2102902 3.9960158 2.9654515 -2.6870391
-1.3829899 0.993459 0.86303824 0.29373714 4.0691266 -1.4974884
-1.5601915 1.4936608 0.550254 2.678553 0.53790915 -1.7294457
-0.46390963 -0.34776476 -1.2361934 -2.433244 -0.21348757 0.0488718
0.8197853 -0.59188586 1.7276062 0.9713122 -0.06519659 2.4763248
-0.93890053 0.36217824 1.887851 -0.0175399 -0.21866432 -0.81253886
-3.9667509 2.5340643 0.02985824 0.338091 -1.3745826 -2.3509336
-1.5615206 0.8273324 -1.263886 -1.2259561 0.9079302 2.0258994
-0.8576754 -2.5492477 -2.45557 -0.5216256 -1.3474834 2.3590422
1.0459667 2.0919168 1.6904455 1.7064931 0.7376105 0.2567448
-0.8194208 0.8788849 -0.89287275 -0.22960001 1.8320689 -1.7200342
0.8977642 1.5119879 -0.3325551 0.7429934 -1.2087826 0.5350336
-0.03887295 -1.9642036 1.0406445 -0.80972534 0.49987233 2.419521
-0.30317742 0.96494234 0.6184119 1.2633535 2.688754 -0.7226699
-2.8695397 -0.8986926 0.1258761 -0.75310475 1.099076 0.90656924
0.24586082 0.44014114 0.85891217 0.34273988 0.07071286 -0.71412176
1.4705397 3.6965442 -2.5951867 -2.4142458 1.2733719 -0.22638321
0.15742263 -0.717717 2.2888887 3.3045793 -0.8173686 1.368556
0.34260234 1.1644434 2.2652006 -0.47847173 1.5130697 3.481819
-1.5247481 2.166555 0.7633031 0.61121356 -0.11627229 1.0461875
1.4994645 -2.8477156 -2.9415505 -0.86640745 -1.1220155 0.10772963
-1.6050811 -2.519997 -0.13945188 -0.06943721 0.83996797 0.29909992
0.7927955 -1.1932545 -0.375592 0.4437512 -1.4635806 -0.16438413
0.93455386 -0.4142645 -0.92249537 -1.0754105 0.07403489 1.0781559
1.7206618 -0.69100255 -2.6112185 1.4985414 -1.8344582 -0.75036854
1.6177907 -0.47727013 0.88055164 -1.057859 -2.0196638 -3.5305111
1.1221203 3.3149185 0.859528 2.3817215 -1.1856595 -0.03347144
-0.84533554 2.201596 -2.1573794 -0.6228852 0.12370715 3.030279
-1.9215534 0.09835044]
In this way, a 200-dimensional vector is output.
You can get the vectors for all words as a single matrix with the following code.
w2v_vector = model.wv.syn0  # the full embedding matrix (gensim 3.x attribute)
Check the shape of the matrix.
print(w2v_vector.shape)
output
(1015474, 200)
From this output, we can see that 1015474 words are stored, each as a 200-dimensional vector.
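By the way, syn0 is the old attribute name; in newer gensim (and in gensim 4.x, where syn0 was removed) the same matrix should be available as model.vectors. A minimal sketch, assuming a newer gensim version:
# newer gensim: the embedding matrix is exposed as model.vectors (syn0 was removed in 4.x)
w2v_vector = model.vectors
print(w2v_vector.shape)  # (1015474, 200)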
We now have the vectors for all the words, but we still don't know which row corresponds to which word. So let's look up a word's ID (row index) in word2vec.
print(model.vocab['I'].index)
output
1027
This shows that the word "I" corresponds to index 1027. Now let's output the vector at index 1027.
print(w2v_vector[model.vocab['I'].index])
output
[-0.9154563 0.97780323 -0.43780354 -0.6441212 -1.6350892 0.8619687
0.41775486 -1.094953 0.74489385 -1.6742945 -0.34408432 0.5552686
-3.9321985 0.3369842 1.5550056 1.3112832 -0.64914817 0.5996928
1.6174266 0.8126016 -0.75741744 1.7818885 2.1305397 1.8607832
3.0353768 -0.8547809 -0.87280065 -0.54103154 0.752979 3.8159864
-1.4066186 0.78604376 1.2102902 3.9960158 2.9654515 -2.6870391
-1.3829899 0.993459 0.86303824 0.29373714 4.0691266 -1.4974884
-1.5601915 1.4936608 0.550254 2.678553 0.53790915 -1.7294457
-0.46390963 -0.34776476 -1.2361934 -2.433244 -0.21348757 0.0488718
0.8197853 -0.59188586 1.7276062 0.9713122 -0.06519659 2.4763248
-0.93890053 0.36217824 1.887851 -0.0175399 -0.21866432 -0.81253886
-3.9667509 2.5340643 0.02985824 0.338091 -1.3745826 -2.3509336
-1.5615206 0.8273324 -1.263886 -1.2259561 0.9079302 2.0258994
-0.8576754 -2.5492477 -2.45557 -0.5216256 -1.3474834 2.3590422
1.0459667 2.0919168 1.6904455 1.7064931 0.7376105 0.2567448
-0.8194208 0.8788849 -0.89287275 -0.22960001 1.8320689 -1.7200342
0.8977642 1.5119879 -0.3325551 0.7429934 -1.2087826 0.5350336
-0.03887295 -1.9642036 1.0406445 -0.80972534 0.49987233 2.419521
-0.30317742 0.96494234 0.6184119 1.2633535 2.688754 -0.7226699
-2.8695397 -0.8986926 0.1258761 -0.75310475 1.099076 0.90656924
0.24586082 0.44014114 0.85891217 0.34273988 0.07071286 -0.71412176
1.4705397 3.6965442 -2.5951867 -2.4142458 1.2733719 -0.22638321
0.15742263 -0.717717 2.2888887 3.3045793 -0.8173686 1.368556
0.34260234 1.1644434 2.2652006 -0.47847173 1.5130697 3.481819
-1.5247481 2.166555 0.7633031 0.61121356 -0.11627229 1.0461875
1.4994645 -2.8477156 -2.9415505 -0.86640745 -1.1220155 0.10772963
-1.6050811 -2.519997 -0.13945188 -0.06943721 0.83996797 0.29909992
0.7927955 -1.1932545 -0.375592 0.4437512 -1.4635806 -0.16438413
0.93455386 -0.4142645 -0.92249537 -1.0754105 0.07403489 1.0781559
1.7206618 -0.69100255 -2.6112185 1.4985414 -1.8344582 -0.75036854
1.6177907 -0.47727013 0.88055164 -1.057859 -2.0196638 -3.5305111
1.1221203 3.3149185 0.859528 2.3817215 -1.1856595 -0.03347144
-0.84533554 2.201596 -2.1573794 -0.6228852 0.12370715 3.030279
-1.9215534 0.09835044]
Comparing this with the first output, you can see that the two vectors are identical.
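Instead of comparing the printed output by eye, you can also verify this programmatically. A small sketch (on gensim 4.x you would look up the index with model.key_to_index instead of model.vocab):
import numpy as np

# look up the row index of "I" (gensim 3.x; on 4.x: idx = model.key_to_index['I'])
idx = model.vocab['I'].index
# confirm that the row in the full matrix matches the vector returned for the word
print(np.array_equal(model['I'], w2v_vector[idx]))  # True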
Now that the word2vec vectors are ready to use, next time I will try training with them in the Embedding layer of the encoder of a seq2seq model.
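As a preview, here is a minimal sketch of how the matrix could be plugged into an Embedding layer. The framework (TensorFlow/Keras) and the argument choices are my own assumptions for illustration, not something decided in this article.
import tensorflow as tf

# minimal sketch (assuming TensorFlow/Keras): initialize an Embedding layer
# with the pretrained word2vec matrix and keep it frozen during training
vocab_size, embedding_dim = w2v_vector.shape  # (1015474, 200)
embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(w2v_vector),
    trainable=False,
)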