A sequel to Word2Vec using the Microsoft Cognitive Toolkit (CNTK).
In Part 3, Word2Vec is again trained with CNTK, using the Japanese corpus prepared in Part 1. It is assumed that CNTK and NVIDIA GPU CUDA are already installed.
Natural Language: Word2Vec Part2 - Skip-gram model dealt with the Skip-gram model, so in Part 3 I trained the other model, Continuous Bag-of-Words (CBOW), and compared it with Skip-gram.
Creating CBOW, the other model proposed in Word2Vec [1], requires only minor changes to the code used in Part 2.
The embedding layer has a dimension of 100, the bias term of the output layer is not used, and the window size is 5.
The loss function, optimization algorithm, and hyperparameters used for training are exactly the same as in Part 2.
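For reference, here is a minimal sketch of how layers with these settings could be declared with CNTK's layers API. This is not the actual training script (that is on GitHub); num_word stands for the vocabulary size built in Part 1 and the value used here is only a placeholder.
from cntk.layers import Dense, Embedding

num_hidden = 100   # dimension of the embedding layer
num_window = 5     # context words taken on each side of the target
num_word = 3369    # hypothetical vocabulary size from Part 1

# 100-dimensional word embedding and an output layer without a bias term
embedding = Embedding(num_hidden)
output = Dense(num_word, activation=None, bias=False)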
In order to train CBOW, the text file read by CTFDeserializer also needs to be slightly modified. All you have to do is replace the last part of word2vec_corpus.py, which you ran in Part 2, with the following code.
word2vec_corpus.py
...

#
# CBOW
#
# take num_window words on each side of every target word
targets = corpus[num_window:-num_window]
words = []
for i in range(num_window, len(corpus) - num_window):
    word_list = []
    for j in range(-num_window, num_window + 1):
        if j == 0:  # skip the target word itself
            continue
        word_list.append(corpus[i + j])
    words.append(word_list)

words, targets = np.array(words, dtype=int), np.array(targets, dtype=int)

print("\nCBOW\n")

num_samples = 0
with open("./cbow_corpus.txt", "w") as word_file:
    for i in range(len(words)):
        # the first context word shares its line with the target word;
        # the remaining 2 * num_window - 1 context words get their own lines
        word_file.write("{} |word {}:1\t|target {}:1\n".format(i, words[i, 0], targets[i]))
        for j in range(1, num_window * 2):
            word_file.write("{} |word {}:1\n".format(i, words[i, j]))

        num_samples += 1
        if num_samples % 10000 == 0:
            print("Now %d samples..." % num_samples)

print("\nNumber of samples", num_samples)
The CBOW we are training this time looks at 5 words before and after the target word, so the text file read by CTFDeserializer looks like this:
cbow_corpus.txt
0 |word 982:1 |target 254:1
0 |word 3368:1
0 |word 2178:1
0 |word 3368:1
0 |word 2179:1
0 |word 545:1
0 |word 2180:1
0 |word 3368:1
0 |word 2181:1
0 |word 254:1
1 |word 3368:1 |target 545:1
1 |word 2178:1
1 |word 3368:1
1 |word 2179:1
1 |word 254:1
1 |word 2180:1
1 |word 3368:1
1 |word 2181:1
1 |word 254:1
1 |word 169:1
...
Unlike in Skip-gram, there is one target word for every 10 input words.
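For reference, this is a sketch of how the file above could be fed to training with CTFDeserializer. The stream names word and target match the fields written by word2vec_corpus.py; num_word (the vocabulary size) and the randomization settings are assumptions, not taken from the actual script. Because each sample id carries 10 |word entries and one |target entry, the word stream arrives as a sequence of length 10 per sample.
import cntk as C

num_word = 3369  # hypothetical vocabulary size

# each sample id has 10 sparse |word entries (a sequence) and one |target entry
streams = C.io.StreamDefs(
    word=C.io.StreamDef(field="word", shape=num_word, is_sparse=True),
    target=C.io.StreamDef(field="target", shape=num_word, is_sparse=True))

minibatch_source = C.io.MinibatchSource(
    C.io.CTFDeserializer("./cbow_corpus.txt", streams),
    randomize=True, max_sweeps=C.io.INFINITELY_REPEAT)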
・CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
・GPU: NVIDIA GeForce GTX 1060 6GB
・Windows 10 Pro 1909
・CUDA 10.0
・cuDNN 7.6
・Python 3.6.6
・cntk-gpu 2.7
・pandas 0.25.0
The training program is available on GitHub.
word2vec_training.py
I will excerpt and explain some parts of the program being run. I hope it helps you understand the CNTK implementation and the training techniques.
This time I used Dynamic Axis, one of CNTK's strengths, to create the CBOW model.
word2vec_training.py
input = C.sequence.input_variable(shape=(num_word,))
When the input is declared this way, CNTK interprets the input variable as ([#, *], [num_word]), where # stands for the batch axis and * stands for the Dynamic Axis. Dynamic Axis is originally useful for dealing with variable-length data, and here it is fixed at 10 context words, but I used it because it made the CBOW model easier to implement.
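A quick way to see this interpretation is the toy check below (not part of the training script); num_word is again a hypothetical vocabulary size.
import cntk as C

num_word = 3369  # hypothetical vocabulary size
input = C.sequence.input_variable(shape=(num_word,))

print(input.dynamic_axes)  # batch axis '#' and the default sequence axis '*'
print(input.shape)         # (3369,) -> the static [num_word] part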
The major difference between the CBOW and Skip-gram implementations is that CBOW averages the output of the Embedding layer.
word2vec_training.py
embed = C.sequence.reduce_sum(Embedding(num_hidden)(input)) / (num_window * 2)
The function C.sequence.reduce_sum that appears here calculates the sum over the Dynamic Axis. This reduces away the Dynamic Axis, and the output becomes ([#], [num_hidden]); dividing by num_window * 2 then gives the average of the context-word embeddings.
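The following toy example (not from the article's script) checks this behavior: a sequence of four 3-dimensional vectors is summed over the Dynamic Axis, leaving a single 3-dimensional vector per batch item.
import numpy as np
import cntk as C

x = C.sequence.input_variable(shape=(3,))
summed = C.sequence.reduce_sum(x)  # sums over the sequence (Dynamic) axis

# one batch item whose sequence has 4 steps of 3-dimensional vectors
seq = [np.arange(12, dtype=np.float32).reshape(4, 3)]
print(summed.eval({x: seq}))  # [[18. 22. 26.]] -> the Dynamic Axis is gone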
Subsequent processing is exactly the same.
I tried the same verification as in Part 2 using the distributed representation of words acquired by CBOW training.
[similarity]magic
transfiguration:0.54
Produced:0.48
Slaughter:0.47
use:0.46
Fluctuation:0.39
Unlike in Skip-gram, the word with the highest similarity to "magic" is "transfiguration."
[analogy]Hazuki-lotus+Jin= ?
directed by:0.57
Confluence:0.50
Extra:0.48
woman:0.47
You:0.45
The word analogy also gave a slightly different result than in Skip-gram.
As with Skip-gram, I used t-distributed Stochastic Neighbor Embedding (t-SNE) [2] to visualize the word embedding layer acquired by the CBOW model in two dimensions. Perplexity, one of the t-SNE parameters, is set to 5, 10, 20, 30, and 50 from left to right. The upper row shows the CBOW model and the lower row shows the Skip-gram model.
I get the impression that both CBOW and Skip-gram have similar distributions.
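For reference, this is a sketch of how such a visualization could be produced with scikit-learn's t-SNE. The embeddings array here is only a random stand-in for the weights extracted from the trained Embedding layer, and the variable names are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# stand-in for the learned (num_word, 100) embedding matrix; replace it with
# the weights taken from the trained Embedding layer
embeddings = np.random.randn(500, 100).astype(np.float32)

perplexities = [5, 10, 20, 30, 50]
fig, axes = plt.subplots(1, len(perplexities), figsize=(25, 5))
for ax, perplexity in zip(axes, perplexities):
    reduced = TSNE(n_components=2, perplexity=perplexity).fit_transform(embeddings)
    ax.scatter(reduced[:, 0], reduced[:, 1], s=1)
    ax.set_title("perplexity = %d" % perplexity)
plt.show()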
Natural Language : Word2Vec Part1 - Japanese Corpus
Natural Language : Word2Vec Part2 - Skip-gram model