This is a continuation of the chatbot series using the Microsoft Cognitive Toolkit (CNTK).
In Part 2, we will train a chatbot with CNTK using the conversation data prepared in Part 1. It is assumed that CNTK and NVIDIA GPU CUDA are installed.
In Natural Language : ChatBot Part1 - Twitter API Corpus, we prepared a conversation dataset of tweets and replies collected with the Twitter API.
In Part 2, we will build and train a chatbot using a sequence-to-sequence model [1].
Seq2Seq with Attention
For the overall architecture, I referred to GNMT [2]. As the recurrent components, both the Encoder and the Decoder consist of 5 layers of LSTM [3], and an Attention Mechanism [4] is introduced.
Both the Encoder and the Decoder apply Layer Normalization [5] to the output of each LSTM.
The Encoder adopts a Bidirectional RNN [6] and concatenates the forward and backward outputs.
In the Encoder, Layer Normalization is applied to the forward and backward outputs separately before they are concatenated; in the Decoder, Dropout [7] is inserted after Layer Normalization to improve generalization performance.
In addition, both the Encoder and the Decoder use Residual Connections [8] to mitigate the vanishing gradients that accompany stacking many LSTM layers. However, no residual connection is applied to the output of the Embedding layer.
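As a point of reference, a minimal sketch of one such bidirectional Encoder layer using the standard CNTK layers API might look like the following; the layer size, function name, and exact composition are illustrative assumptions, not the code of stsa_training.py.

import cntk as C
from cntk.layers import LSTM, Recurrence, LayerNormalization

hidden_dim = 512  # illustrative size, not necessarily the value used in training

def bidirectional_encoder_layer(h):
    # forward and backward LSTM passes over the input sequence
    h_fwd = Recurrence(LSTM(hidden_dim))(h)
    h_bwd = Recurrence(LSTM(hidden_dim), go_backwards=True)(h)
    # Layer Normalization on each direction, then concatenation
    h_cat = C.splice(LayerNormalization()(h_fwd), LayerNormalization()(h_bwd))
    # residual connection (assumes h already has 2 * hidden_dim channels;
    # omitted for the output of the Embedding layer, as noted above)
    return h_cat + h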
The initial values of the parameters were drawn from a uniform distribution over [-0.04, 0.04] [2].
Since predicting the next word is a classification problem, the loss function is the Cross Entropy error, and Adam [9] was adopted as the optimization algorithm. Adam's hyperparameter $\beta_1$ was set to 0.9, and $\beta_2$ was left at CNTK's default value.
For the learning rate, the Cyclical Learning Rate (CLR) [10] was used, with a maximum learning rate of 0.01, a base learning rate of 1e-4, a step size of 4 times the number of epochs, and the exp_range policy.
The model was trained for 100 epochs with mini-batch learning.
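For reference, the exp_range CLR policy from [10] can be written as a small helper like the one below; this is only a sketch of the schedule itself, and step_size and gamma are illustrative placeholders rather than the values used here.

import math

def clr_exp_range(iteration, base_lr=1e-4, max_lr=1e-2, step_size=10000, gamma=0.99994):
    # Cyclical Learning Rate, exp_range policy: the rate oscillates between base_lr
    # and max_lr while the amplitude decays by gamma ** iteration
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * (gamma ** iteration)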
・CPU Intel(R) Core(TM) i7-5820K 3.30GHz
・GPU NVIDIA Quadro RTX 5000 16GB
・Windows 10 Pro 1909
・CUDA 10.0
・cuDNN 7.6
・Python 3.6.6
・cntk-gpu 2.7
・cntkx 0.1.33
・pandas 0.25.0
The training program is available on GitHub.
stsa_training.py
Below, I will elaborate on the main points of this implementation.
Attention Mechanism
In a sequence-to-sequence model, as shown in the figure below, the Encoder encodes the information of the input sequence into a fixed-length vector, and its hidden state at the last time step becomes the initial hidden state of the Decoder. However, the longer the input sequence, the harder it becomes to compress its information.
Moreover, the information at each time step is contained in the corresponding hidden state ($h^E_1, h^E_2, ..., h^E_{S-1}$ in the figure below), but with the naive Seq2Seq the Decoder only receives $h^E_S$ from the last time step.
The attention mechanism was proposed as a remedy for these problems and improved the performance of sequence-to-sequence models.
Here, let the input sequence be $x_s = (x_1, x_2, ..., x_S)$, the output sequence be $y_t = (y_1, y_2, ..., y_T)$, and the RNN transition functions be $\Psi^E$ and $\Psi^D$. Then the hidden states $h^E_s$ and $h^D_t$ of the Encoder and the Decoder at each time step can be written as follows.
h^E_s = \Psi^E(x_s, h^E_{s-1}) \\
h^D_t = \Psi^D(y_t, h^D_{t-1})
Next, define a function $\Omega$ that computes a weight between $h^E_s$ and $h^D_t$. Here it is defined as follows using the parameters $W_{encoder}, W_{decoder}, W_{tanh}$. This is called additive attention [4]; dot-product attention [11] has also been proposed.
\Omega(h^E_s, h^D_t) = W_{tanh} \cdot \tanh \left( W_{decoder} \cdot h^D_t + W_{encoder} \cdot h^E_s \right)
Next, the Softmax function is used to normalize the weights so that they sum to 1, which gives the importance of each time step of the input sequence.
a_s = \frac{\exp \left( \Omega(h^E_s, h^D_{t-1}) \right)}{\sum^S_{s'=1} \exp \left( \Omega(h^E_{s'}, h^D_{t-1}) \right)}
Then, this weighting factor $a_s$ is used to compute the weighted sum $\overline{h}$ over the input sequence.
\overline{h} = \sum^S_{s=1} a_s h^E_s
Finally, this weighted sum is concatenated with the input to the first layer of the Decoder.
h^D_t = \Psi^D \left( \left[ \overline{h}, y_t \right], h^D_{t-1} \right)
In this way, the hidden state at every time step can be used effectively, and information compressed with emphasis on the relevant parts of the input sequence can be passed to the Decoder.
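Putting the equations above together, a small NumPy sketch of one additive-attention step could look like this; the array shapes and names are illustrative, not taken from the training script.

import numpy as np

def additive_attention(h_enc, h_dec_prev, W_enc, W_dec, w_tanh):
    # h_enc: (S, d_enc) encoder hidden states, h_dec_prev: (d_dec,) previous decoder state
    scores = np.tanh(h_enc @ W_enc.T + h_dec_prev @ W_dec.T) @ w_tanh  # Omega(h^E_s, h^D_{t-1})
    a = np.exp(scores - scores.max())
    a /= a.sum()                                  # softmax weights a_s
    return (a[:, None] * h_enc).sum(axis=0)       # weighted sum over the input sequence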
Bidirectional RNN
A bidirectional RNN can take the whole input sequence into account by using the forward input sequence $x_t = (x_1, x_2, ..., x_T)$ together with the reversed input sequence $x_{T-t+1} = (x_T, x_{T-1}, ..., x_1)$. However, a bidirectional RNN requires the entire input sequence up to time $T$ to be available.
The implementation itself is simple: two RNNs, a forward RNN and a backward RNN, process the sequence in the forward and reverse directions, respectively. Here $\overrightarrow{h\strut}_t$ and $\overleftarrow{h}_t$ are the forward and backward hidden states at time $t$, $b$ and $W$ are the bias and the weight for the current time step, and $H$ is the weight for the hidden state at the previous time step.
\overrightarrow{h\strut}_t = \overrightarrow{b\strut} + x_t \overrightarrow{W\strut} + \overrightarrow{h\strut}_{t-1} \overrightarrow{H\strut} \\
\overleftarrow{h}_t = \overleftarrow{b} + x_{T-t+1} \overleftarrow{W} + \overleftarrow{h}_{t-1} \overleftarrow{H}
There are several ways to combine the outputs of a bidirectional RNN; here, the outputs of the forward and backward LSTMs are concatenated.
In CNTK, this can be implemented simply by setting go_backwards=True in the Recurrence function.
sequence_to_sequence_attention
h_enc_forward = Recurrence(lstm_forward[i])(h_enc)                       # forward LSTM
h_enc_backward = Recurrence(lstm_backward[i], go_backwards=True)(h_enc)  # backward LSTM over the reversed sequence
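The two directional outputs can then be concatenated along the feature axis, for example with C.splice as sketched below (assuming cntk is imported as C; whether the actual script does exactly this is an assumption).

h_enc = C.splice(h_enc_forward, h_enc_backward)  # concatenate forward and backward outputs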
Residual Connection
Residual connections have been widely used since ResNet [8] showed that they make it possible to train networks with more than 1,000 layers. ResNet proposes residual connections that span two or more convolutional layers, but the idea behind a residual connection is simple.
If $f^{(l)}$ is the function of layer $l$ and $h$ is its input, the output of layer $l$ can be written as follows.
f^{(l)}(h)
The operation of adding the output of the previous layer, that is, the input $ h $, to this is the residual connection.
f^{(l)}(h) + h
Taking the derivative of this expression with respect to $h$ gives
\begin{align}
\frac{\partial (f^{(l)}(h) + h)}{\partial h} &= \frac{\partial f^{(l)}(h)}{\partial h} + \frac{\partial h}{\partial h} \\
&= w^{(l)} + 1
\end{align}
Therefore, even if the gradient flowing through the layer, $w^{(l)}$, is small, the overall factor stays close to 1, so vanishing gradients can be suppressed.
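In code, a residual connection is nothing more than adding the layer's input back to its output; the generic sketch below is illustrative and not tied to the training script.

def residual_block(f, h):
    # output of layer f plus its own input; the derivative w.r.t. h gains a "+1" term,
    # which is why the gradient does not vanish completely
    return f(h) + h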
Training loss and perplexity
The figure below visualizes the loss and perplexity logged during training. The graph on the left shows the loss function and the graph on the right shows the perplexity; the horizontal axis is the number of epochs, and the vertical axes show the value of the loss function and the perplexity, respectively.
Both the value of the loss function and the perplexity are still large, so there seems to be room for improvement.
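For reference, perplexity is simply the exponential of the average per-word cross-entropy loss, which is presumably how the right-hand graph was computed; mean_cross_entropy below is a placeholder value.

import numpy as np

mean_cross_entropy = 4.5                   # placeholder: epoch-averaged cross-entropy per word
perplexity = np.exp(mean_cross_entropy)    # exp(loss); about 90 for this placeholder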
The following shows a conversation with the trained chatbot. Lines starting with > are the inputs, and lines starting with >> are the chatbot's responses.
>Hello
>>Hello!
>thank you for your hard work
>>There is a tsu!
>Please follow me
>>Followed!
>What is your name?
>> w
>What is your hobby?
>>Is a hobby!
>It's nice weather today, is not it
>>It is good weather!
>Thank you for yesterday
>>I'm the one who should be thanking you!
>quit
It manages simple replies, but it answered with 'w' (internet slang for laughing) when asked its name, and it has not learned the concept of hobbies.
Attention not only improves the performance of the naive Seq2Seq but also lets you visualize the relationship between the input and predicted word sequences with an attention map. The figure below shows an example of an attention map: the horizontal axis is the input word sequence, the vertical axis is the predicted word sequence, and the weights are displayed with the hot color map.
In this example, the input word "good weather" seems to be more important than the other words.
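A heat map like this can be drawn with matplotlib, for example as in the sketch below; the word lists and weight matrix are dummy values standing in for the attention weights collected during decoding.

import numpy as np
import matplotlib.pyplot as plt

# dummy values for illustration; in practice these come from the decoder's attention weights
input_words = ["today", "good", "weather"]
output_words = ["it", "is", "good", "weather", "!"]
attention_weights = np.random.rand(len(output_words), len(input_words))

plt.imshow(attention_weights, cmap="hot")
plt.xticks(range(len(input_words)), input_words, rotation=90)   # input word sequence
plt.yticks(range(len(output_words)), output_words)              # predicted word sequence
plt.colorbar()
plt.show()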
CNTK 204: Sequence to Sequence Networks with Text Data
Natural Language : ChatBot Part1 - Twitter API Corpus