This is a continuation of Word2Vec using the Microsoft Cognitive Toolkit (CNTK).
In Part 2, Word2Vec is trained with CNTK on the Japanese corpus prepared in Part 1. It is assumed that CNTK and NVIDIA GPU CUDA are installed.
In Natural Language: Word2Vec Part1 - Japanese Corpus, we prepared a Japanese corpus.
In Part 2, we will create and train a Skip-gram model, a well-known neural language model.
Word2Vec [1] proposes two models: Continuous Bag-of-Words (CBOW) and Skip-gram.
The CBOW model takes the surrounding words as input and predicts the center word. The Skip-gram model, conversely, takes a word as input and predicts the words that appear around it. The number of words considered before and after is called the window size, and values of 2 to 5 are typically used.
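As a concrete illustration (not taken from the original code), the following sketch generates the (center word, context word) training pairs that Skip-gram learns from, for a given window size:

```python
def skipgram_pairs(word_ids, window=5):
    """Generate (center, context) pairs within the given window size."""
    pairs = []
    for i, center in enumerate(word_ids):
        lo, hi = max(0, i - window), min(len(word_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, word_ids[j]))
    return pairs

# example: window size 2 on a five-word sentence of word IDs
print(skipgram_pairs([0, 1, 2, 3, 4], window=2))
```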
The dimension of the embedding layer is 100, and no bias term is used in the output layer.
This time, I train a Skip-gram model with a window size of 5 to obtain distributed representations of words.
The default value of each parameter uses the CNTK default settings, which in most cases is Glorot's uniform distribution [2].
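A minimal sketch of this model in CNTK might look like the following; the vocabulary size of 3,369 comes from Part 1, and the variable names are illustrative:

```python
import cntk as C

num_word = 3369   # vocabulary size from Part 1
embed_dim = 100   # dimension of the embedding layer

# one-hot encoded center word
word = C.input_variable(shape=(num_word,), is_sparse=True)

# embedding layer; with no explicit init, CNTK falls back to Glorot uniform
embedding = C.layers.Embedding(embed_dim, name="embedding")(word)
```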
Since Word2Vec can be regarded as a classification problem that predicts which word appears for an input word, the obvious approach is to apply the Softmax function to the output layer and use the cross-entropy error as the loss function. However, when the vocabulary is very large, computing the Softmax function becomes slow, so various methods [3] that approximate the Softmax function have been devised to speed up the output layer. This time, I chose Sampled Softmax [4] from among them.
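The sketch below condenses the Sampled Softmax cross-entropy from the CNTK 207 tutorial, with the output-layer bias dropped as described above; `sampling_weights` is assumed to be a (1, vocabulary size) constant holding the sampling distribution over words:

```python
import cntk as C

def sampled_softmax_loss(hidden, target, vocab_dim, hidden_dim,
                         num_samples, sampling_weights):
    # output-layer weight matrix (no bias term, as noted above)
    W = C.Parameter(shape=(vocab_dim, hidden_dim), init=C.initializer.glorot_uniform())

    # draw negative samples and their expected inclusion frequencies
    samples = C.random_sample(sampling_weights, num_samples, allow_duplicates=False)
    inclusion = C.random_sample_inclusion_frequency(sampling_weights, num_samples,
                                                    allow_duplicates=False)
    log_prior = C.log(inclusion)

    # logits of the sampled words, corrected by the log of the sampling prior
    ws = C.times(samples, W)          # weight rows of the sampled words
    zs = C.times_transpose(ws, hidden) - C.times_transpose(samples, log_prior)

    # logit of the true target word
    wt = C.times(target, W)           # weight row of the target word
    zt = C.times_transpose(wt, hidden) - C.times_transpose(target, log_prior)

    # cross entropy restricted to the target word and the drawn samples
    return C.log_add_exp(zt, C.reduce_log_sum_exp(zs)) - zt
```

Instead of normalizing over the entire vocabulary, the softmax here is computed only over the true target word and a handful of sampled words, which is what makes the output layer cheap.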
Adam [5] was used as the optimization algorithm. Adam's learning rate is 0.01, the hyperparameter $β_1$ is set to 0.9, and $β_2$ is set to the CNTK default value.
Model training ran for 100 epochs with a mini-batch size of 128.
- CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
- GPU: NVIDIA GeForce GTX 1060 6GB
- Windows 10 Pro 1909
- CUDA 10.0
- cuDNN 7.6
- Python 3.6.6
- cntk-gpu 2.7
- Pandas 0.25.0
The training program is available on GitHub.
word2vec_training.py
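Combining the sketches above, the core of the training setup might look roughly like the following. The `minibatches` iterator, the `pairs` list, `sampling_weights`, and the sample count of 5 are assumptions for illustration, not the actual code:

```python
# one-hot context word to be predicted
target = C.input_variable(shape=(num_word,), is_sparse=True)

# Sampled Softmax loss from the sketch above; num_samples=5 is an assumption
loss = sampled_softmax_loss(embedding, target, num_word, embed_dim,
                            num_samples=5, sampling_weights=sampling_weights)

# Adam with lr=0.01 and beta1=0.9; beta2 is left at the CNTK default
learner = C.adam(loss.parameters,
                 lr=C.learning_parameter_schedule(0.01),
                 momentum=C.momentum_schedule(0.9))
trainer = C.Trainer(None, (loss, None), [learner])

for epoch in range(100):
    # `minibatches` is a hypothetical iterator yielding one-hot encoded batches of 128 pairs
    for batch_words, batch_targets in minibatches(pairs, batch_size=128):
        trainer.train_minibatch({word: batch_words, target: batch_targets})
```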
I ran several verifications using the distributed representations of words acquired by the Skip-gram training.
[similarity] magic
Yu: 0.80
Hiding: 0.79
Produced: 0.77
beneficial: 0.77
New: 0.76
The five words most similar to "magic" are shown above. The word "magic" here is an expression specific to the work, so its meaning differs from the general one.
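Similarity scores like these can be computed with cosine similarity over the learned embedding matrix. In the sketch below, `embedding_matrix` is the (vocabulary x 100) weight array extracted from the trained model, and `word2id` / `id2word` are the vocabulary dictionaries built in Part 1; all three names are assumptions:

```python
import numpy as np

def most_similar(query, embedding_matrix, word2id, id2word, topn=5):
    """Return the topn words with the highest cosine similarity to `query`."""
    normed = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    scores = normed @ normed[word2id[query]]
    order = [i for i in np.argsort(-scores) if i != word2id[query]]
    return [(id2word[i], float(scores[i])) for i in order[:topn]]
```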
[analogy] Hazuki - lotus + Jin = ?
directed by: 0.27
Confluence: 0.25
Role: 0.25
building: 0.24
You: 0.23
This is the result of inferring a word from the relationships between the characters. Subtracting "lotus" from the protagonist "Hazuki" and adding "Jin", who is hostile to them, yields "directed by", which seems a reasonable result.
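The analogy is the usual vector arithmetic, vec(Hazuki) - vec(lotus) + vec(Jin), followed by a nearest-neighbor search. A sketch, reusing the same assumed names as above:

```python
import numpy as np

def analogy(a, b, c, embedding_matrix, word2id, id2word, topn=5):
    """Nearest neighbors of vec(a) - vec(b) + vec(c), excluding the query words."""
    normed = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    query = normed[word2id[a]] - normed[word2id[b]] + normed[word2id[c]]
    query /= np.linalg.norm(query)
    scores = normed @ query
    exclude = {word2id[a], word2id[b], word2id[c]}
    order = [i for i in np.argsort(-scores) if i not in exclude]
    return [(id2word[i], float(scores[i])) for i in order[:topn]]
```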
The word embedding layer acquired by the Skip-gram model is high-dimensional data that is hard to grasp intuitively. t-distributed Stochastic Neighbor Embedding (t-SNE) [6] is a well-known method for mapping such high-dimensional data into a 2D or 3D space for visualization.
This time, I varied Perplexity, a t-SNE parameter that controls how large a neighborhood is taken into account, over 5, 10, 20, 30, and 50, and visualized the embeddings in two-dimensional space.
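A visualization along these lines can be produced with scikit-learn's t-SNE, sweeping the perplexity values listed above; this is a sketch assuming the same `embedding_matrix` as before:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

for perplexity in [5, 10, 20, 30, 50]:
    emb2d = TSNE(n_components=2, perplexity=perplexity,
                 random_state=0).fit_transform(embedding_matrix)
    plt.figure(figsize=(6, 6))
    plt.scatter(emb2d[:, 0], emb2d[:, 1], s=1)
    plt.title("t-SNE (perplexity=%d)" % perplexity)
    plt.show()
```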
This time I used Sampled Softmax as an approximation of the Softmax function. The corpus prepared in Part 1 contains only 3,369 words, so I checked how much of a speedup Sampled Softmax provides when the vocabulary is larger.
The table below shows the average execution speed per epoch, excluding the first and last epochs, when running 10 epochs on a corpus with a vocabulary of 500,000 words. Sampled Softmax uses 5 samples.
| | mean speed per epoch |
|---|---|
| Full Softmax | 17.5s |
| Sampled Softmax | 8.3s |
Sampled Softmax seems to be about twice as fast.
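For reference, a full-softmax baseline for such a comparison can be set up with an ordinary output layer without bias and CNTK's built-in cross entropy; this is a sketch using the names from the earlier sketches, not the measured code:

```python
# full vocabulary logits, no bias term
z = C.layers.Dense(num_word, activation=None, bias=False)(embedding)
loss_full = C.cross_entropy_with_softmax(z, target)
```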
CNTK 207: Sampled Softmax
Deep learning library that builds on and extends Microsoft CNTK
Natural Language : Word2Vec Part1 - Japanese Corpus