Natural Language: ChatBot Part2-Sequence To Sequence Attention

Target

This is a continuation of the chatbot series using the Microsoft Cognitive Toolkit (CNTK).

In Part 2, we will train a chatbot with CNTK using the conversation data prepared in Part 1. It is assumed that CNTK and NVIDIA GPU CUDA are installed.

Introduction

In Natural Language: ChatBot Part1 - Twitter API Corpus, we prepared a conversation dataset of tweets and replies collected with the Twitter API.

In Part 2, we will build and train a chatbot using a sequence-to-sequence model [1].

Seq2Seq with Attention

The overall structure follows GNMT [2]. Both the Encoder and the Decoder are recurrent neural networks composed of 5 LSTM [3] layers, and an attention mechanism [4] is introduced.

stsa.png

Both the Encoder and the Decoder apply Layer Normalization [5] to the LSTM outputs.

The Encoder adopts a bidirectional RNN [6] and concatenates the forward and backward outputs.

In the Encoder, Layer Normalization is applied to the forward and backward outputs separately before they are concatenated; in the Decoder, Dropout [7] is inserted after Layer Normalization to improve generalization performance.

In addition, both the Encoder and the Decoder use residual connections [8] to mitigate the vanishing gradients that come with stacking more LSTM layers. However, the output of the Embedding layer is not residually connected.
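As an illustration of this layer structure, here is a minimal sketch of a single Decoder layer; the hidden size and dropout rate are assumptions and not necessarily the values used in the actual training script.

from cntk.layers import LSTM, Recurrence, LayerNormalization, Dropout

def decoder_layer(h, hidden_dim=512, dropout_rate=0.2):
    # LSTM -> Layer Normalization -> Dropout, with a residual connection
    # around the block (the Embedding output itself is not residually connected)
    out = Recurrence(LSTM(hidden_dim))(h)
    out = LayerNormalization()(out)
    out = Dropout(dropout_rate)(out)
    return out + h  # requires h to already have hidden_dim features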

Training settings

The parameters were initialized from a uniform distribution over [-0.04, 0.04] [2].

Since predicting the next word is a classification problem, we set the loss function to cross-entropy error and adopted Adam [9] as the optimization algorithm. Adam's hyperparameter $\beta_1$ was set to 0.9, and $\beta_2$ was left at the CNTK default value.
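For reference, here is a minimal sketch of how the loss and learner described above might be set up in CNTK; the variable names (model_output, targets) are illustrative and not taken from the actual training script.

import cntk as C

def build_loss_and_learner(model_output, targets, lr=0.01):
    # cross-entropy loss for next-word prediction
    loss = C.cross_entropy_with_softmax(model_output, targets)
    # Adam with beta_1 = 0.9; beta_2 (variance_momentum) is left at the CNTK default
    learner = C.adam(model_output.parameters,
                     lr=C.learning_parameter_schedule(lr),
                     momentum=0.9)
    return loss, learner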

For the learning rate, we used the Cyclical Learning Rate (CLR) [10] with a maximum learning rate of 0.01, a base learning rate of 1e-4, a step size of 4 times the number of epochs, and the exp_range policy.
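For reference, a minimal sketch of the exp_range CLR schedule in plain Python; the decay factor gamma is an illustrative assumption, not necessarily the value used in training.

import math

def clr_exp_range(iteration, step_size, base_lr=1e-4, max_lr=0.01, gamma=0.99994):
    # cyclical learning rate with the exp_range policy [10]: a triangular wave
    # between base_lr and max_lr whose amplitude decays by gamma ** iteration
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * gamma ** iteration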

The model was trained for 100 epochs using mini-batch learning.

Implementation

Execution environment

Hardware

・ CPU: Intel(R) Core(TM) i7-5820K 3.30GHz
・ GPU: NVIDIA Quadro RTX 5000 16GB

Software

・ Windows 10 Pro 1909
・ CUDA 10.0
・ cuDNN 7.6
・ Python 3.6.6
・ cntk-gpu 2.7
・ cntkx 0.1.33
・ pandas 0.25.0

Program to run

The training program is available on GitHub.

stsa_training.py


Commentary

Here I will expand on the main points of this implementation.

Attention Mechanism

In the sequence-to-sequence model, as shown in the figure below, the Encoder compresses the information of the input sequence into a fixed-length vector, and the hidden state at the final time step becomes the initial hidden state of the Decoder. However, the longer the input sequence, the harder it becomes to compress all of its information.

Moreover, the information at each time step should be available in the corresponding hidden state ($h^E_1, h^E_2, ..., h^E_{S-1}$ in the figure below), but with the naive Seq2Seq the Decoder only receives the final hidden state $h^E_S$.

seq2seq.png

The attention mechanism was proposed as a remedy for these problems and improved the performance of the sequence-to-sequence model.

Let the input sequence be $x_s = (x_1, x_2, ..., x_S)$, the output sequence be $y_t = (y_1, y_2, ..., y_T)$, and the RNN transition functions be $\Psi^E$ and $\Psi^D$. Then the hidden states $h^E_s$ and $h^D_t$ of the Encoder and Decoder at each time step can be expressed as follows.

h^E_s = \Psi^E(x_s, h^E_{s-1}) \\
h^D_t = \Psi^D(y_t, h^D_{t-1})

Next, define a function $\Omega$ that computes a score between $h^E_s$ and $h^D_t$. Here it is defined as follows, using the parameters $W_{encoder}, W_{decoder}, W_{tanh}$. This form is called additive attention [4]; dot-product attention [11] has also been proposed.

\Omega(h^E_s, h^D_t) = W_{tanh} \cdot \tanh \left( W_{decoder} \cdot h^D_t + W_{encoder} \cdot h^E_s \right)

The Softmax function is then used to normalize these scores so that they sum to 1, giving the importance of each time step of the input sequence.

a_s = \frac{\exp \left( \Omega(h^E_s, h^D_{t-1}) \right)}{\sum^S_{s'=1} \exp \left( \Omega(h^E_{s'}, h^D_{t-1}) \right)}

This weighting factor $a_s$ is then used to compute the weighted sum $\overline{h}$ over the input sequence.

\overline{h} = \sum^S_{s=1} a_s h^E_s

Finally, this weighted sum is concatenated with the Decoder input and fed into the first layer of the Decoder.

h^D_t = \Psi^D \left( \left[ \overline{h}, y_t \right], h^D_{t-1} \right)

In this way, the hidden state at every time step can be used effectively, and the Decoder receives compressed information that emphasizes the time steps of the input sequence that matter most.
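The attention step described above can be summarized in a few lines of NumPy; this is a sketch with illustrative shapes and parameter names, not the code used in the training script.

import numpy as np

def additive_attention(h_enc, h_dec_prev, W_enc, W_dec, w_tanh):
    # h_enc: (S, H) encoder hidden states, h_dec_prev: (H,) previous decoder state
    # W_enc, W_dec: (A, H) projection matrices, w_tanh: (A,) scoring vector
    scores = np.tanh(h_enc @ W_enc.T + h_dec_prev @ W_dec.T) @ w_tanh  # Omega(h^E_s, h^D_{t-1})
    a = np.exp(scores - scores.max())
    a /= a.sum()      # softmax weights a_s
    return a @ h_enc  # context vector: weighted sum of the encoder states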

Bidirectional RNN

A bidirectional RNN processes the input sequence in the forward direction $x_t = (x_1, x_2, ..., x_T)$ and in the reverse direction $x_{T-t+1} = (x_T, x_{T-1}, ..., x_1)$, so it can take the information of the entire input sequence into account. However, a bidirectional RNN requires the whole input sequence up to time $T$ to be available.

The implementation itself is simple: two RNNs, a forward RNN and a backward RNN, process the sequence in the forward and reverse directions, respectively. Here $\overrightarrow{h\strut}_t$ and $\overleftarrow{h}_t$ are the forward and backward hidden states at time $t$, $b$ and $W$ are the bias and the weight applied to the current input, and $H$ is the weight applied to the hidden state at the previous time step.

\overrightarrow{h\strut}_t = \overrightarrow{b\strut} + x_t \overrightarrow{W\strut} + \overrightarrow{h\strut}_{t-1} \overrightarrow{H\strut} \\
\overleftarrow{h}_t = \overleftarrow{b} + x_{T-t+1} \overleftarrow{W} + \overleftarrow{h}_{t-1} \overleftarrow{H}

There are several ways to combine the outputs of a bidirectional RNN; here, the outputs of the forward and backward LSTMs are concatenated.

In CNTK, this can be implemented simply by setting the go_backwards argument of the Recurrence function to True.

sequence_to_sequence_attention


# forward and backward LSTM passes over the i-th Encoder layer input
h_enc_forward = Recurrence(lstm_forward[i])(h_enc)
h_enc_backward = Recurrence(lstm_backward[i], go_backwards=True)(h_enc)
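The forward and backward outputs can then be concatenated, for example with CNTK's splice (a sketch; the variable names follow the snippet above):

import cntk as C

# concatenate the forward and backward LSTM outputs along the feature axis
h_enc = C.splice(h_enc_forward, h_enc_backward)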

Residual Connection

Residual connections have been widely used since ResNet [8] showed that networks with more than 1,000 layers can be trained. ResNet introduces residual connections that span two or more convolutional layers, but the idea behind a residual connection is simple.

If the input is $h$ and the function computed by layer $l$ is $f^{(l)}$, the output of layer $l$ can be expressed as follows.

f^{(l)}(h)

The operation of adding the output of the previous layer, that is, the input $h$, to this output is the residual connection.

f^{(l)}(h) + h

The derivative of this expression with respect to $h$ is

\begin{align}
\frac{\partial (f^{(l)}(h) + h)}{\partial h} &= \frac{\partial f^{(l)}(h)}{\partial h} + \frac{\partial h}{\partial h} \\
&= w^{(l)} + 1
\end{align}

Therefore, even if the gradient propagated back from the layers above is small, the additional term keeps the overall factor close to 1, so vanishing gradients are suppressed.
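For example, if $w^{(l)} = 0.1$, a gradient passing back through this layer without the residual connection would be scaled by 0.1, whereas with the residual connection the factor becomes $0.1 + 1 = 1.1$, so the signal is not attenuated.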

Result

Training loss and perplexity

The figure below visualizes the logs of the loss function and perplexity during training. The left graph shows the loss function and the right graph shows the perplexity; the horizontal axis is the number of epochs, and the vertical axis is the value of the loss function and the perplexity, respectively.

stsa_logging.png

Both the loss value and the perplexity are still large, so there is clearly room for improvement.
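For reference, the perplexity here can be obtained from the cross-entropy loss in the usual way; a minimal sketch:

import numpy as np

def perplexity(mean_cross_entropy_per_word):
    # perplexity is the exponential of the average per-word cross-entropy
    return np.exp(mean_cross_entropy_per_word)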

Conversation with chatbot

Below is a conversation with the trained chatbot. Lines starting with > are the input, and lines starting with >> are the chatbot's responses.

>Hello
>>Hello!
>thank you for your hard work
>>There is a tsu!
>Please follow me
>>Followed!
>What is your name?
>> w
>What is your hobby?
>>Is a hobby!
>It's nice weather today, is not it
>>It is good weather!
>Thank you for yesterday
>>I'm the one who should be thanking you!
>quit

The chatbot manages simple responses, but it only answers "w" (Japanese internet slang akin to "lol") when asked its name, and it has not learned the concept of a hobby.

Visualization of Attention

Attention not only improves the performance of the naive Seq2Seq but also allows you to visualize the relationship between the input and the predicted word sequence with an attention map. The figure below shows an example of an attention map: the horizontal axis represents the input word sequence, the vertical axis represents the predicted word sequence, and the hot color map is used.

attention.png

In this example, the input word "good weather" seems to be more important than the other words.
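A minimal sketch of how such an attention map could be drawn with matplotlib; the function and argument names are illustrative, not those of the training script.

import matplotlib.pyplot as plt

def plot_attention_map(attention, input_words, output_words):
    # attention: array of shape (len(output_words), len(input_words)) with attention weights
    plt.imshow(attention, cmap="hot")
    plt.xticks(range(len(input_words)), input_words, rotation=90)
    plt.yticks(range(len(output_words)), output_words)
    plt.xlabel("input word sequence")
    plt.ylabel("predicted word sequence")
    plt.colorbar()
    plt.show()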

References

CNTK 204: Sequence to Sequence Networks with Text Data

Natural Language : ChatBot Part1 - Twitter API Corpus

  1. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks", Advances in neural information processing systems. 2014, pp 3104-3112.
  2. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, et al. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", arXiv preprint arXiv:1609.08144, 2016.
  3. Sepp Hochreiter, and Jürgen Schmidhuber. "Long Short-Term Memory", Neural Computation. 1997, pp 1735-1780.
  4. Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate", arXiv preprint arXiv:1409.0473, 2014.
  5. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer Normalization", arXiv preprint arXiv:1607.06450 (2016).
  6. Mike Schuster and Kuldip K. Paliwal. "Bidirectional Recurrent Neural Networks", IEEE Transactions on Signal Processing, 45(11), 1997, pp 2673-2681.
  7. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks", Advances in neural information processing systems. 2012, pp 1097-1105.
  8. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp 770-778.
  9. Diederik P. Kingma and Jimmy Lei Ba. "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980 (2014).
  10. Leslie N. Smith. "Cyclical Learning Rates for Training Neural Networks", 2017 IEEE Winter Conference on Applications of Computer Vision. 2017, pp 464-472.
  11. Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. "Effective Approaches to Attention-based Neural Machine Translation", arXiv preprint arXiv:1508.04025, 2015.
