A sequel to Word2Vec using the Microsoft Cognitive Toolkit (CNTK).
In Part 3, Word2Vec is again trained with CNTK, using the Japanese corpus prepared in Part 1. It is assumed that CNTK and NVIDIA GPU CUDA are already installed.
Natural Language: Word2Vec Part2 - Skip-gram model dealt with the Skip-gram model, so in Part 3 I trained the other model, Continuous Bag-of-Words (CBOW), and compared it with Skip-gram.
Creating CBOW, the other model proposed in Word2Vec [1], requires only minor changes to the code used in Part 2.
The embedding layer has a dimension of 100, the bias term of the output layer is not used, and the window size is 5.
The loss function, optimization algorithm, and hyperparameters used for training are exactly the same as in Part 2.
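For reference, here is a minimal sketch of how layers with these settings could be declared with CNTK's layers API. This is not the actual training script (that is on GitHub); num_word stands for the vocabulary size built in Part 1 and the value used here is only a placeholder.
from cntk.layers import Dense, Embedding

num_hidden = 100   # dimension of the embedding layer
num_window = 5     # context words taken on each side of the target
num_word = 3369    # hypothetical vocabulary size from Part 1

# 100-dimensional word embedding and an output layer without a bias term
embedding = Embedding(num_hidden)
output = Dense(num_word, activation=None, bias=False)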
In order to train CBOW, the text file read by CTFDeserializer also needs to be slightly modified. All you have to do is replace the last part of word2vec_corpus.py, which you ran in Part 2, with the following code.
word2vec_corpus.py
...

#
# CBOW
#
# take num_window words on each side of every target word
targets = corpus[num_window:-num_window]
words = []
for i in range(num_window, len(corpus) - num_window):
    word_list = []
    for j in range(-num_window, num_window + 1):
        if j == 0:  # skip the target word itself
            continue
        word_list.append(corpus[i + j])
    words.append(word_list)

words, targets = np.array(words, dtype=int), np.array(targets, dtype=int)

print("\nCBOW\n")

num_samples = 0
with open("./cbow_corpus.txt", "w") as word_file:
    for i in range(len(words)):
        # the first context word shares its line with the target word;
        # the remaining 2 * num_window - 1 context words get their own lines
        word_file.write("{} |word {}:1\t|target {}:1\n".format(i, words[i, 0], targets[i]))
        for j in range(1, num_window * 2):
            word_file.write("{} |word {}:1\n".format(i, words[i, j]))

        num_samples += 1
        if num_samples % 10000 == 0:
            print("Now %d samples..." % num_samples)

print("\nNumber of samples", num_samples)
The CBOW we are training this time looks at 5 words before and after the target word, so the text file read by CTFDeserializer looks like this:
cbow_corpus.txt
0 |word 982:1 |target 254:1
0 |word 3368:1
0 |word 2178:1
0 |word 3368:1
0 |word 2179:1
0 |word 545:1
0 |word 2180:1
0 |word 3368:1
0 |word 2181:1
0 |word 254:1
1 |word 3368:1 |target 545:1
1 |word 2178:1
1 |word 3368:1
1 |word 2179:1
1 |word 254:1
1 |word 2180:1
1 |word 3368:1
1 |word 2181:1
1 |word 254:1
1 |word 169:1
...
Unlike in Skip-gram, there is one target word for every 10 input words.
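For reference, this is a sketch of how the file above could be fed to training with CTFDeserializer. The stream names word and target match the fields written by word2vec_corpus.py; num_word (the vocabulary size) and the randomization settings are assumptions, not taken from the actual script. Because each sample id carries 10 |word entries and one |target entry, the word stream arrives as a sequence of length 10 per sample.
import cntk as C

num_word = 3369  # hypothetical vocabulary size

# each sample id has 10 sparse |word entries (a sequence) and one |target entry
streams = C.io.StreamDefs(
    word=C.io.StreamDef(field="word", shape=num_word, is_sparse=True),
    target=C.io.StreamDef(field="target", shape=num_word, is_sparse=True))

minibatch_source = C.io.MinibatchSource(
    C.io.CTFDeserializer("./cbow_corpus.txt", streams),
    randomize=True, max_sweeps=C.io.INFINITELY_REPEAT)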
・CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
・GPU: NVIDIA GeForce GTX 1060 6GB
・Windows 10 Pro 1909
・CUDA 10.0
・cuDNN 7.6
・Python 3.6.6
・cntk-gpu 2.7
・pandas 0.25.0
The training program is available on GitHub.
word2vec_training.py
I will excerpt and explain some parts of the program being run. I hope it helps you understand the CNTK implementation and the training techniques.
This time I used Dynamic Axis, one of CNTK's strengths, to create the CBOW model.
word2vec_training.py
input = C.sequence.input_variable(shape=(num_word,))
When the input is declared this way, CNTK interprets the input variable as ([#, *], [num_word]), where # stands for the batch axis and * stands for the Dynamic Axis. Dynamic Axis is originally useful for dealing with variable-length data, and here it is fixed at 10 context words, but I used it because it made the CBOW model easier to implement.
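A quick way to see this interpretation is the toy check below (not part of the training script); num_word is again a hypothetical vocabulary size.
import cntk as C

num_word = 3369  # hypothetical vocabulary size
input = C.sequence.input_variable(shape=(num_word,))

print(input.dynamic_axes)  # batch axis '#' and the default sequence axis '*'
print(input.shape)         # (3369,) -> the static [num_word] part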
The major difference between the CBOW and Skip-gram implementations is that CBOW averages the output of the Embedding layer.
word2vec_training.py
embed = C.sequence.reduce_sum(Embedding(num_hidden)(input)) / (num_window * 2)
The function C.sequence.reduce_sum that appears here calculates the sum over the Dynamic Axis. This reduces away the Dynamic Axis, and the output becomes ([#], [num_hidden]); dividing by num_window * 2 then gives the average of the context-word embeddings.
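The following toy example (not from the article's script) checks this behavior: a sequence of four 3-dimensional vectors is summed over the Dynamic Axis, leaving a single 3-dimensional vector per batch item.
import numpy as np
import cntk as C

x = C.sequence.input_variable(shape=(3,))
summed = C.sequence.reduce_sum(x)  # sums over the sequence (Dynamic) axis

# one batch item whose sequence has 4 steps of 3-dimensional vectors
seq = [np.arange(12, dtype=np.float32).reshape(4, 3)]
print(summed.eval({x: seq}))  # [[18. 22. 26.]] -> the Dynamic Axis is gone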
Subsequent processing is exactly the same.
I tried the same verification as in Part 2 using the distributed representation of words acquired by CBOW training.
[similarity]magic
transfiguration:0.54
Produced:0.48
Slaughter:0.47
use:0.46
Fluctuation:0.39
Unlike in Skip-gram, the word with the highest similarity to "magic" is "transfiguration."
[analogy]Hazuki-lotus+Jin= ?
directed by:0.57
Confluence:0.50
Extra:0.48
woman:0.47
You:0.45
The word analogy also gave a slightly different result than in Skip-gram.
As with Skip-gram, I used t-distributed Stochastic Neighbor Embedding (t-SNE) [2] to visualize the word embedding layer acquired by the CBOW model in two dimensions. Perplexity, one of the t-SNE parameters, is set to 5, 10, 20, 30, and 50 from left to right. The upper row shows the CBOW model and the lower row shows the Skip-gram model.
I get the impression that both CBOW and Skip-gram have similar distributions.
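For reference, this is a sketch of how such a visualization could be produced with scikit-learn's t-SNE. The embeddings array here is only a random stand-in for the weights extracted from the trained Embedding layer, and the variable names are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# stand-in for the learned (num_word, 100) embedding matrix; replace it with
# the weights taken from the trained Embedding layer
embeddings = np.random.randn(500, 100).astype(np.float32)

perplexities = [5, 10, 20, 30, 50]
fig, axes = plt.subplots(1, len(perplexities), figsize=(25, 5))
for ax, perplexity in zip(axes, perplexities):
    reduced = TSNE(n_components=2, perplexity=perplexity).fit_transform(embeddings)
    ax.scatter(reduced[:, 0], reduced[:, 1], s=1)
    ax.set_title("perplexity = %d" % perplexity)
plt.show()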
Natural Language : Word2Vec Part1 - Japanese Corpus
Natural Language : Word2Vec Part2 - Skip-gram model