WHY
Many people are interested in deep learning, so this article describes a deep learning implementation for dialogue.
The chat-response part uses Chainer, so that is what I focus on. Note, however, that the version is old:
the code has been confirmed to work with Chainer 1.5.1.
There may be mistakes. Because there were parts I wanted to understand deeply, I also walk through some of Chainer's own code. I would appreciate it if you could point out any errors.
-- The content announced at PyCon 2016 was more of a concept and outline, without any explanation of the code actually implemented, so this article is also partly a reflection on that.
-- I wrote this article because I wanted more people to understand and use the project by adding a code walkthrough. (More stars on GitHub would be welcome.)
Docker Hub
https://hub.docker.com/r/masayaresearch/dialogue/
github
https://github.com/SnowMasaya/Chainer-Slack-Twitter-Dialogue
There are many other parts, such as question answering, topic classification, and parallelized data acquisition, so I will write about those if requested.
WHAT
Training is done on the classified data, and the deep learning part uses an attention model. What is an attention model?
In neural machine translation, the sequence-to-sequence model compresses the whole input sentence into a single vector, so for long inputs the contribution of the first words is weakened as gradients accumulate over the sequence. This matters especially in English, where the first words tend to be the most important.
To work around this, earlier work improved translation accuracy by feeding the input in reverse order. For languages such as Japanese and Chinese, however, the important words come at the end, so reversing the input is not a fundamental solution.
The attention model was therefore proposed: instead of strictly separating the input encoding from the decoding, it predicts each decoder output from a weighted average of the encoder hidden states corresponding to that decoding step. Attention was originally successful in image tasks, and it is now producing results in machine translation and sentence summarization.
(Figure: attention model)
To predict "mo", it is the posterior probability when the input "I" ("I am an engineer") is obtained. The posterior probability is the score of the previous word (me), the state of the hidden layer, and the context vector ("I am an engineer"). Ignore the context vector for now. I will explain it later. The function g is generally a softmax function
As shown in the above figure, the formula used when predicting from the current state and context in consideration of the prior output is as follows.
p(y_i|y_1,...,y_{i-1}, \vec{x}) = g(y_{i-1}, s_i, c_i)
Here, the hidden state at step i (the state used to predict "mo") is as follows. It is determined by the previous state, the previous output word ("I"), and the context vector for the input ("I am an engineer"). The function f is generally a sigmoid.
s_i=f(s_{i-1}, y_{i-1},c_i)
The context vector is the weighted sum of the encoder hidden states (for "I am an engineer"), with weights $\alpha$.
c_{i} = \sum^{T_x}_{j=1}\alpha_{ij}h_{j}
So how are the weights defined above obtained? A score e is computed from the encoder hidden state h and the decoder state s immediately before the current output ("I" in the example). This form is used because h on the encoder side has a special shape, which is described later. The scores are exponentiated and normalized by the sum over all input positions, which turns them into weights that match each input-output pair.
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})} \\
e_{ij} = a(s_{i-1}, h_j)
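To make these formulas concrete, here is a minimal NumPy sketch (not the repository's code) that turns the scores e into weights alpha and builds the context vector c as the weighted sum of the encoder states:
import numpy as np

def attention_context(e_i, h):
    """e_i: scores e_{ij} for one output step, shape (T_x,);
    h: encoder hidden states h_j, shape (T_x, hidden_size)."""
    exp_e = np.exp(e_i - e_i.max())          # exponentiate (shifted for numerical stability)
    alpha = exp_e / exp_e.sum()              # alpha_{ij} = exp(e_{ij}) / sum_k exp(e_{ik})
    c_i = (alpha[:, None] * h).sum(axis=0)   # c_i = sum_j alpha_{ij} h_j
    return alpha, c_i

alpha, c = attention_context(np.random.randn(5), np.random.randn(5, 4))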
So what is special about the hidden state h? Unlike a normal sequence-to-sequence encoder, it combines a forward and a backward pass. The forward and backward states are defined as below and concatenated. This is the hidden representation of the encoder input ("I am an engineer").
(\overrightarrow{h_1},...,\overrightarrow{h_{T_x}})\\
(\overleftarrow{h_1},...,\overleftarrow{h_{T_x}})\\
h_j = [\overrightarrow{h_j}^T;\overleftarrow{h_j}^T]^T
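A short NumPy sketch of this concatenation (the shapes are illustrative). Note that the repository's attention.py, shown later, keeps the forward and backward lists separate and weights them individually instead of concatenating them, but the idea is the same:
import numpy as np

T_x, hidden_size = 5, 4
forward_h = np.random.randn(T_x, hidden_size)         # states read left to right
backward_h = np.random.randn(T_x, hidden_size)        # states read right to left
h = np.concatenate([forward_h, backward_h], axis=1)   # each h_j has size 2 * hidden_size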
Now let's follow how this formula is actually realized in the code base.
src_embed.py -- maps the input language data into the neural network's embedding space.
attention_encoder.py -- propagates the embedded input language through the encoder (this corresponds to the user's utterance in the dialogue).
attention.py -- builds the context information.
attention_decoder.py -- the neural network for the output language; it produces the output from the context information and the target language and propagates the hidden states.
attention_dialogue.py -- ties everything together: model loading, model saving, weight initialization, embedding, encoding, and decoding.
The implementation consists of the above five files.
HOW
src_embed.py
Let's start with the part that embeds the input language. It takes the input vocabulary size and the embedding size of the neural network. In the dialogue setting, the input language is the user's utterance.
def __init__(self, vocab_size, embed_size):
super(SrcEmbed, self).__init__(
weight_xi=links.EmbedID(vocab_size, embed_size),
)
The concrete processing inside `links.EmbedID` is shown below.
`W` (a `chainer.Variable`) is the embedding matrix; its initial weights are drawn from a normal distribution with mean 0 and variance 1.0.
def __init__(self, in_size, out_size, initialW=None, ignore_label=None):
super(EmbedID, self).__init__(W=(in_size, out_size))
if initialW is None:
initialW = initializers.Normal(1.0)
initializers.init_weight(self.W.data, initialW)
self.ignore_label = ignore_label
Specifically, the part below draws the values from the normal distribution.
`xp` is used so that the same code works on the GPU: `xp.random.normal` is called instead of `numpy.random.normal` directly.
Reference: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html
class Normal(initializer.Initializer):
def __init__(self, scale=0.05, dtype=None):
self.scale = scale
super(Normal, self).__init__(dtype)
def __call__(self, array):
xp = cuda.get_array_module(array)
array[...] = xp.random.normal(
loc=0.0, scale=self.scale, size=array.shape)
The initial weights returned here are set below.
`init_weight` fills the weight data with the given `initializer`, which may be a `numpy.ndarray`, a `cupy.ndarray`, a scalar, or an initializer object.
def init_weight(weights, initializer, scale=1.0):
if initializer is None:
initializer = HeNormal(1 / numpy.sqrt(2))
elif numpy.isscalar(initializer):
initializer = Constant(initializer)
elif isinstance(initializer, numpy.ndarray):
initializer = Constant(initializer)
assert callable(initializer)
initializer(weights)
weights *= scale
When `initializer` is not `None`, a `Constant` initializer fills the array, checking whether it is a GPU array or an ordinary NumPy array.
class Constant(initializer.Initializer):
def __init__(self, fill_value, dtype=None):
self.fill_value = fill_value
super(Constant, self).__init__(dtype)
def __call__(self, array):
if self.dtype is not None:
assert array.dtype == self.dtype
xp = cuda.get_array_module(array)
array[...] = xp.asarray(self.fill_value)
The check that decides which array module to use is shown below.
def get_array_module(*args):
if available:
return cupy.get_array_module(*args)
else:
return numpy
The `__call__` method embeds the input language into the neural network's space.
The embedding is passed through the hyperbolic tangent (`functions.tanh`), which keeps the mapping differentiable, so it can be trained by error backpropagation.
def __call__(self, source):
return functions.tanh(self.weight_xi(source))
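A hedged usage sketch, assuming Chainer 1.5.1 and the `SrcEmbed` class above (the sizes and word IDs are made up for illustration):
import numpy as np
from chainer import Variable

embed = SrcEmbed(vocab_size=5000, embed_size=300)
word_ids = Variable(np.array([12, 7, 42], dtype=np.int32))  # a mini-batch of word IDs
vectors = embed(word_ids)  # Variable of shape (3, 300), squashed into (-1, 1) by tanh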
attention_encoder.py
The embedded input layer is passed to the hidden layer. Why `4 * hidden_size`? Because the LSTM needs four sets of parameters:
-- the input gate
-- the forget gate
-- the output gate
-- the cell input, which takes the previous inputs into account
Why these are needed is covered in other materials, so I will not explain it in detail here, but this gating mechanism is what lets the LSTM keep or forget information over long sequences.
def __init__(self, embed_size, hidden_size):
super(AttentionEncoder, self).__init__(
source_to_hidden=links.Linear(embed_size, 4 * hidden_size),
hidden_to_hidden=links.Linear(hidden_size, 4 * hidden_size),
)
This is the relevant `links.Linear` processing:
-- weight initialization
-- registration of the weight parameter
-- bias initialization
-- registration of the bias parameter
def __init__(self, in_size, out_size, wscale=1, bias=0, nobias=False,
initialW=None, initial_bias=None):
super(Linear, self).__init__()
self.initialW = initialW
self.wscale = wscale
self.out_size = out_size
self._W_initializer = initializers._get_initializer(initialW, math.sqrt(wscale))
if in_size is None:
self.add_uninitialized_param('W')
else:
self._initialize_params(in_size)
if nobias:
self.b = None
else:
if initial_bias is None:
initial_bias = bias
bias_initializer = initializers._get_initializer(initial_bias)
self.add_param('b', out_size, initializer=bias_initializer)
def _initialize_params(self, in_size):
self.add_param('W', (self.out_size, in_size), initializer=self._W_initializer)
This is the concrete initialization.
`scale` defaults to 1 and is multiplied into the created array.
With the `Constant` initializer shown earlier, the array is filled with a fixed value and then scaled.
class _ScaledInitializer(initializer.Initializer):
def __init__(self, initializer, scale=1.0):
self.initializer = initializer
self.scale = scale
dtype = getattr(initializer, 'dtype', None)
super(Identity, self).__init__(dtype)
def __call__(self, array):
self.initializer(array)
array *= self.scale
def _get_initializer(initializer, scale=1.0):
if initializer is None:
return HeNormal(scale / numpy.sqrt(2))
if numpy.isscalar(initializer):
return Constant(initializer * scale)
if isinstance(initializer, numpy.ndarray):
return Constant(initializer * scale)
assert callable(initializer)
if scale == 1.0:
return initializer
return _ScaledInitializer(initializer, scale)
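As a quick hedged illustration of the dispatch above (the argument values are just examples):
init = _get_initializer(None)           # HeNormal(1 / sqrt(2)): default He initialization
init = _get_initializer(0.5)            # Constant(0.5): fill every weight with 0.5
init = _get_initializer(numpy.ones(3))  # Constant(ndarray): fill with the given array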
In `__call__`, the current cell state, the previous hidden state, and the embedded input are passed to the LSTM.
def __call__(self, source, current, hidden):
return functions.lstm(current, self.source_to_hidden(source) + self.hidden_to_hidden(hidden))
The processing invoked when the forward pass of `functions.lstm` above runs is as follows. It lives in `chainer/functions/activation/lstm.py`.
The input is split into the four LSTM gates.
-- `len(x)`: the number of rows (the batch size)
-- `x.shape[1]`: the number of columns
-- `x.shape[2:]`: used for data with three or more dimensions
def _extract_gates(x):
r = x.reshape((len(x), x.shape[1] // 4, 4) + x.shape[2:])
return [r[:, :, i] for i in six.moves.range(4)]
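A toy-sized NumPy sketch (not the repository's code) of what this reshape and indexing do:
import numpy as np

batch, hidden = 2, 3
x = np.arange(batch * 4 * hidden).reshape(batch, 4 * hidden)  # (batch, 4 * hidden)
r = x.reshape((len(x), x.shape[1] // 4, 4))                   # (batch, hidden, 4)
a, i, f, o = [r[:, :, k] for k in range(4)]                   # each gate: (batch, hidden)
print(a.shape, i.shape, f.shape, o.shape)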
CPU processing:
-- get the input and the previous cell state
-- split the input into the four LSTM gates
-- following the paper, the cell input `a` goes through a hyperbolic tangent and the gates `i`, `f`, `o` through a sigmoid
-- allocate an (uninitialized) array for the next cell state
-- for the batch, the next cell state is the product of the cell input and the input gate plus the product of the forget gate and the previous cell state
-- the hidden value is the output gate times the hyperbolic tangent of the next cell state
The GPU processing is the same, but because it runs as CUDA C++ it is defined as follows. It mirrors the LSTM defined in Python, rewritten for C++ processing.
_preamble = '''
template <typename T> __device__ T sigmoid(T x) {
const T half = 0.5;
return tanh(x * half) * half + half;
}
template <typename T> __device__ T grad_sigmoid(T y) { return y * (1 - y); }
template <typename T> __device__ T grad_tanh(T y) { return 1 - y * y; }
#define COMMON_ROUTINE \
T aa = tanh(a); \
T ai = sigmoid(i_); \
T af = sigmoid(f); \
T ao = sigmoid(o);
'''
def forward(self, inputs):
c_prev, x = inputs
a, i, f, o = _extract_gates(x)
batch = len(x)
if isinstance(x, numpy.ndarray):
self.a = numpy.tanh(a)
self.i = _sigmoid(i)
self.f = _sigmoid(f)
self.o = _sigmoid(o)
c_next = numpy.empty_like(c_prev)
c_next[:batch] = self.a * self.i + self.f * c_prev[:batch]
h = self.o * numpy.tanh(c_next[:batch])
else:
c_next = cuda.cupy.empty_like(c_prev)
h = cuda.cupy.empty_like(c_next[:batch])
cuda.elementwise(
'T c_prev, T a, T i_, T f, T o', 'T c, T h',
'''
COMMON_ROUTINE;
c = aa * ai + af * c_prev;
h = ao * tanh(c);
''',
'lstm_fwd', preamble=_preamble)(
c_prev[:batch], a, i, f, o, c_next[:batch], h)
c_next[batch:] = c_prev[batch:]
self.c = c_next[:batch]
return c_next, h
The GPU path calls into CUDA through CuPy. About CuPy:
http://docs.chainer.org/en/stable/cupy-reference/overview.html
The function below builds a kernel, memoizes it per device, and runs it on the CUDA device. See the following for why values computed in GPU memory have to be handled this way:
http://www.nvidia.com/docs/io/116711/sc11-cuda-c-basics.pdf
@memoize(for_each_device=True)
def elementwise(in_params, out_params, operation, name, **kwargs):
check_cuda_available()
return cupy.ElementwiseKernel(
in_params, out_params, operation, name, **kwargs)
The backward pass is as follows.
Chainer is helpful because it hides this processing from the user.
It mirrors the forward pass, except that it uses the gradients of the outputs as well as the inputs.
`gc_prev[:batch]` accumulates, for the batch, the gradient flowing back through the output gate and the tanh of the cell state plus the incoming cell-state gradient, and the individual gate gradients are computed with `_grad_tanh` and `_grad_sigmoid`.
co = numpy.tanh(self.c)
gc_prev = numpy.empty_like(c_prev)
# multiply f later
gc_prev[:batch] = gh * self.o * _grad_tanh(co) + gc_update
gc = gc_prev[:batch]
ga[:] = gc * self.i * _grad_tanh(self.a)
gi[:] = gc * self.a * _grad_sigmoid(self.i)
gf[:] = gc * c_prev[:batch] * _grad_sigmoid(self.f)
go[:] = gh * co * _grad_sigmoid(self.o)
gc_prev[:batch] *= self.f # multiply f here
gc_prev[batch:] = gc_rest
This is the GPU part. It is the same as the CPU version, but the computation is written as C++ passed to `cuda.elementwise`.
a, i, f, o = _extract_gates(x)
gc_prev = xp.empty_like(c_prev)
cuda.elementwise(
'T c_prev, T c, T gc, T gh, T a, T i_, T f, T o',
'T gc_prev, T ga, T gi, T gf, T go',
'''
COMMON_ROUTINE;
T co = tanh(c);
T temp = gh * ao * grad_tanh(co) + gc;
ga = temp * ai * grad_tanh(aa);
gi = temp * aa * grad_sigmoid(ai);
gf = temp * c_prev * grad_sigmoid(af);
go = gh * co * grad_sigmoid(ao);
gc_prev = temp * af;
''',
'lstm_bwd', preamble=_preamble)(
c_prev[:batch], self.c, gc_update, gh, a, i, f, o,
gc_prev[:batch], ga, gi, gf, go)
gc_prev[batch:] = gc_rest
attention.py
This is the part that builds the context information.
-- `annotion_weight` is the weight for the forward (annotation) states
-- `back_weight` is the weight for the backward states
-- `pw` is the weight for the current decoder state
-- `weight_exponential` produces the score that is fed to the exp function in the network
def __init__(self, hidden_size):
super(Attention, self).__init__(
annotion_weight=links.Linear(hidden_size, hidden_size),
back_weight=links.Linear(hidden_size, hidden_size),
pw=links.Linear(hidden_size, hidden_size),
weight_exponential=links.Linear(hidden_size, 1),
)
self.hidden_size = hidden_size
-- `annotion_list` is the list of encoder states from the forward pass (one per input word)
-- `back_word_list` is the list of encoder states from the backward pass
-- `p` is the current decoder hidden state
def __call__(self, annotion_list, back_word_list, p):
Initialization for batch processing
batch_size = p.data.shape[0]
exponential_list = []
sum_exponential = XP.fzeros((batch_size, 1))
A weight is created that combines the forward list, the backward list, and the current state, which corresponds to:
e_{ij} = a(s_{i-1}, h_j)
Each score is then passed through the exp function and collected in a list, and the running sum is also accumulated:
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})} \\
Because the encoder runs in both directions, the loop walks the forward annotation list and the backward list together, computes a weight that also includes the current state, exponentiates it, appends it to the weight list, and accumulates the sum of the exponentials.
for annotion, back_word in zip(annotion_list, back_word_list):
weight = functions.tanh(self.annotion_weight(annotion) + self.back_weight(back_word) + self.pw(p))
exponential = functions.exp(self.weight_exponential(weight))
exponential_list.append(exponential)
sum_exponential += exponential
After initialization, the forward and backward context values are computed from the weights and returned for the whole batch. The matrix computation is done with `functions.batch_matmul`.
-- `a` is the left matrix
-- `b` is the right matrix
-- if `transa` is set, the left matrix is transposed
-- if `transb` is set, the right matrix is transposed
def batch_matmul(a, b, transa=False, transb=False):
return BatchMatMul(transa=transa, transb=transb)(a, b)
The actual matrix calculation first reshapes each matrix into a form that can be multiplied batch element by batch element:
a = a.reshape(a.shape[:2] + (-1,))
For example, a matrix like the one below
array([[1, 2, 3],
[4, 5, 6],
[3, 4, 5]])
It is converted as follows.
array([[[1],
[2],
[3]],
[[4],
[5],
[6]],
[[3],
[4],
[5]]])
-- transpose if necessary
-- allocate an (uninitialized) array for the result
-- with NumPy the product is computed per batch element with `numpy.dot`; with CuPy on the GPU it is computed with `matmul`, which does not accept scalars and operates on stacked matrices
def _batch_matmul(a, b, transa=False, transb=False, transout=False):
a = a.reshape(a.shape[:2] + (-1,))
b = b.reshape(b.shape[:2] + (-1,))
trans_axis = (0, 2, 1)
if transout:
transa, transb = not transb, not transa
a, b = b, a
if transa:
a = a.transpose(trans_axis)
if transb:
b = b.transpose(trans_axis)
xp = cuda.get_array_module(a)
if xp is numpy:
ret = numpy.empty(a.shape[:2] + b.shape[2:], dtype=a.dtype)
for i in six.moves.range(len(a)):
ret[i] = numpy.dot(a[i], b[i])
return ret
return xp.matmul(a, b)
The forward and backward values are initialized with a zero matrix of shape (batch size, hidden size), and the weighted sums computed from `annotion` and `back_word` are returned.
ZEROS = XP.fzeros((batch_size, self.hidden_size))
annotion_value = ZEROS
back_word_value = ZEROS
# Calculate the Convolution Value each annotion and back word
for annotion, back_word, exponential in zip(annotion_list, back_word_list, exponential_list):
exponential /= sum_exponential
annotion_value += functions.reshape(functions.batch_matmul(annotion, exponential), (batch_size, self.hidden_size))
back_word_value += functions.reshape(functions.batch_matmul(back_word, exponential), (batch_size, self.hidden_size))
return annotion_value, back_word_value
attention_decoder.py
This is the output part; in the dialogue setting it is the system's response.
It is more involved than the input side.
-- `embed_vocab`: maps the output language into the neural network's space
-- `embed_hidden`: propagates the embedded value to the LSTM
-- `hidden_hidden`: propagates the hidden layer
-- `annotation_hidden`: the forward-direction context vector
-- `back_word_hidden`: the backward-direction context vector
-- `hidden_embed`: propagation from the hidden layer to the output embedding (corresponding to the system response)
-- `embded_target`: propagation from the output embedding to the system output (corresponding to the system response)
super(AttentionDecoder, self).__init__(
embed_vocab=links.EmbedID(vocab_size, embed_size),
embed_hidden=links.Linear(embed_size, 4 * hidden_size),
hidden_hidden=links.Linear(hidden_size, 4 * hidden_size),
annotation_hidden=links.Linear(embed_size, 4 * hidden_size),
back_word_hidden=links.Linear(hidden_size, 4 * hidden_size),
hidden_embed=links.Linear(hidden_size, embed_size),
embded_target=links.Linear(embed_size, vocab_size),
)
The output word is mapped to its embedding with the differentiable hyperbolic tangent. The LSTM is then given the sum of the embedded output word, the hidden layer, the forward context vector, and the backward context vector, and predicts the next cell state and hidden state. The predicted hidden state is passed through another hyperbolic tangent to produce the output embedding, which is used to predict the output word. The method returns the predicted word scores, the cell state, and the hidden state.
embed = functions.tanh(self.embed_vocab(target))
current, hidden = functions.lstm(current, self.embed_hidden(embed) + self.hidden_hidden(hidden) +
self.annotation_hidden(annotation) + self.back_word_hidden(back_word))
embed_hidden = functions.tanh(self.hidden_embed(hidden))
return self.embded_target(embed_hidden), current, hidden
attention_dialogue.py
This is the part that performs the actual dialogue processing.
It uses the modules described above.
-- `emb` maps the input language into the neural network's space
-- `forward_encode` encodes in the forward direction, preparing to build the context vector
-- `back_encdode` encodes in the backward direction, likewise
-- `attention` is the attention module
-- `dec` produces the output words
The constructor takes the vocabulary size, the embedding size, the hidden layer size, and `XP`, which decides whether the CPU or the GPU is used.
super(AttentionDialogue, self).__init__(
emb=SrcEmbed(vocab_size, embed_size),
forward_encode=AttentionEncoder(embed_size, hidden_size),
back_encdode=AttentionEncoder(embed_size, hidden_size),
attention=Attention(hidden_size),
dec=AttentionDecoder(vocab_size, embed_size, hidden_size),
)
self.vocab_size = vocab_size
self.embed_size = embed_size
self.hidden_size = hidden_size
self.XP = XP
`reset` zeroes the gradients and clears the list of embedded source words.
def reset(self):
self.zerograds()
self.source_list = []
The input language (user's utterance) is held as a word list.
def embed(self, source):
self.source_list.append(self.emb(source))
`encode` performs the encoding. The batch size is taken from the shape of the first embedded source word.
The states are initialized with `self.XP.fzeros`, because the initial values have to be created differently on CPU and GPU.
A list of forward states is collected to build the forward context vector.
The backward direction is handled in the same way.
def encode(self):
batch_size = self.source_list[0].data.shape[0]
ZEROS = self.XP.fzeros((batch_size, self.hidden_size))
context = ZEROS
annotion = ZEROS
annotion_list = []
# Get the annotion list
for source in self.source_list:
context, annotion = self.forward_encode(source, context, annotion)
annotion_list.append(annotion)
context = ZEROS
back_word = ZEROS
back_word_list = []
# Get the back word list
for source in reversed(self.source_list):
context, back_word = self.back_encdode(source, context, back_word)
back_word_list.insert(0, back_word)
self.annotion_list = annotion_list
self.back_word_list = back_word_list
self.context = ZEROS
self.hidden = ZEROS
`decode` obtains the forward and backward context values from the attention module using the current hidden state, passes the target word, the cell state, the hidden state, and the two context values to `dec`, and returns the predicted output word scores.
def decode(self, target_word):
annotion_value, back_word_value = self.attention(self.annotion_list, self.back_word_list, self.hidden)
target_word, self.context, self.hidden = self.dec(target_word, self.context, self.hidden, annotion_value, back_word_value)
return target_word
Saving the model specification: it stores the vocabulary size, the embedding size, and the hidden layer size.
def save_spec(self, filename):
with open(filename, 'w') as fp:
print(self.vocab_size, file=fp)
print(self.embed_size, file=fp)
print(self.hidden_size, file=fp)
Loading the model specification: the values read from the file are passed to the model constructor.
def load_spec(filename, XP):
with open(filename) as fp:
vocab_size = int(next(fp))
embed_size = int(next(fp))
hidden_size = int(next(fp))
return AttentionDialogue(vocab_size, embed_size, hidden_size, XP)
EncoderDecoderModelAttention.py
This part actually uses the modules explained above. Various parameters are set here.
def __init__(self, parameter_dict):
self.parameter_dict = parameter_dict
self.source = parameter_dict["source"]
self.target = parameter_dict["target"]
self.test_source = parameter_dict["test_source"]
self.test_target = parameter_dict["test_target"]
self.vocab = parameter_dict["vocab"]
self.embed = parameter_dict["embed"]
self.hidden = parameter_dict["hidden"]
self.epoch = parameter_dict["epoch"]
self.minibatch = parameter_dict["minibatch"]
self.generation_limit = parameter_dict["generation_limit"]
self.word2vec = parameter_dict["word2vec"]
self.word2vecFlag = parameter_dict["word2vecFlag"]
self.model = parameter_dict["model"]
self.attention_dialogue = parameter_dict["attention_dialogue"]
XP.set_library(False, 0)
self.XP = XP
This is the forward pass. The batch size and the source and target lengths are taken, and the string-to-index (and index-to-string) mappings are looked up for each vocabulary.
def forward_implement(self, src_batch, trg_batch, src_vocab, trg_vocab, attention, is_training, generation_limit):
batch_size = len(src_batch)
src_len = len(src_batch[0])
trg_len = len(trg_batch[0]) if trg_batch else 0
src_stoi = src_vocab.stoi
trg_stoi = trg_vocab.stoi
trg_itos = trg_vocab.itos
attention.reset()
The input language is fed in reverse order. Reversing the input is known to improve machine translation results, so the dialogue model does the same, though I suspect it has little effect here.
x = self.XP.iarray([src_stoi('</s>') for _ in range(batch_size)])
attention.embed(x)
for l in reversed(range(src_len)):
x = self.XP.iarray([src_stoi(src_batch[k][l]) for k in range(batch_size)])
attention.embed(x)
attention.encode()
The target sequence to be generated is initialized with `<s>`.
t = self.XP.iarray([trg_stoi('<s>') for _ in range(batch_size)])
hyp_batch = [[] for _ in range(batch_size)]
This is the training branch.
Words cannot be learned directly, so `stoi` converts them to index information.
The predicted target (here, the dialogue output) is compared with the correct data using softmax cross entropy.
Since cross entropy measures the distance between probability distributions, the smaller the loss, the closer the output is to the target.
The branch returns the hypothesis candidates and the accumulated loss.
if is_training:
loss = self.XP.fzeros(())
for l in range(trg_len):
y = attention.decode(t)
t = self.XP.iarray([trg_stoi(trg_batch[k][l]) for k in range(batch_size)])
loss += functions.softmax_cross_entropy(y, t)
output = cuda.to_cpu(y.data.argmax(1))
for k in range(batch_size):
hyp_batch[k].append(trg_itos(output[k]))
return hyp_batch, loss
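To make the loss term concrete, here is a tiny NumPy sketch of softmax cross entropy for a single example (the numbers are made up); `functions.softmax_cross_entropy` computes the batched, differentiable version of this:
import numpy as np

logits = np.array([2.0, 0.5, -1.0])   # unnormalized scores over a 3-word vocabulary
target = 0                            # index of the correct word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax
loss = -np.log(probs[target])         # small when the model puts its mass on the target
print(loss)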
This is the test branch.
A neural network can keep generating candidates indefinitely, and an LSTM in particular feeds its past state back in and can fall into an infinite loop, so the generation length is capped.
Decoding starts from the initialized target word sequence.
The arg-max of the output scores becomes the output and `t` is updated with it.
The indices produced for each batch element are converted back into words.
The loop `break`s when every candidate ends with the `</s>` termination symbol.
else:
while len(hyp_batch[0]) < generation_limit:
y = attention.decode(t)
output = cuda.to_cpu(y.data.argmax(1))
t = self.XP.iarray(output)
for k in range(batch_size):
hyp_batch[k].append(trg_itos(output[k]))
if all(hyp_batch[k][-1] == '</s>' for k in range(batch_size)):
break
return hyp_batch
This is the overall training procedure.
The input-utterance and output-utterance vocabularies are initialized: `gens.word_list` creates a generator over each corpus, and `self.vocab` gives the vocabulary size.
src_vocab = Vocabulary.new(gens.word_list(self.source), self.vocab)
trg_vocab = Vocabulary.new(gens.word_list(self.target), self.vocab)
`Vocabulary.new()` builds the vocabulary information for the input and output utterances.
`gens.word_list(self.source)` creates the generator shown below; the input file name is given in `self.source`.
def word_list(filename):
with open(filename) as fp:
for l in fp:
yield l.split()
The conversion from words to indices is done in the following part.
`<unk>` (unknown word) is 0, `<s>` (start of sentence) is 1, and `</s>` (end of sentence) is 2.
Because these values are reserved in advance, ordinary words are indexed from 3 onward (hence the `+ 3`).
@staticmethod
def new(list_generator, size):
self = Vocabulary()
self.__size = size
word_freq = defaultdict(lambda: 0)
for words in list_generator:
for word in words:
word_freq[word] += 1
self.__stoi = defaultdict(lambda: 0)
self.__stoi['<unk>'] = 0
self.__stoi['<s>'] = 1
self.__stoi['</s>'] = 2
self.__itos = [''] * self.__size
self.__itos[0] = '<unk>'
self.__itos[1] = '<s>'
self.__itos[2] = '</s>'
for i, (k, v) in zip(range(self.__size - 3), sorted(word_freq.items(), key=lambda x: -x[1])):
self.__stoi[k] = i + 3
self.__itos[i + 3] = k
return self
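A hedged usage sketch of the vocabulary (the corpus here is hypothetical). `stoi` and `itos` are used as callables in the forward code shown earlier (`src_stoi('</s>')`, `trg_itos(output[k])`), so this assumes they are methods wrapping the dictionaries built in `new`:
corpus = [['hello', 'world'], ['hello', 'there']]
vocab = Vocabulary.new(iter(corpus), size=10)
print(vocab.stoi('hello'))   # 3: 'hello' is the most frequent word and 0-2 are reserved
print(vocab.stoi('unseen'))  # 0: unknown words fall back to <unk>
print(vocab.itos(2))         # '</s>'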
Creating the attention model.
It is given the vocabulary size, the embedding size, the hidden layer size, and `XP`.
`XP` is the part that switches between CPU and GPU computation.
trace('making model ...')
self.attention_dialogue = AttentionDialogue(self.vocab, self.embed, self.hidden, self.XP)
This is the transfer learning part: the weights trained with word2vec are copied in.
The word2vec model's weight is named `weight_xi`, which matches the input-utterance embedding, but on the output side the layer is called `embded_target`, so the special handling below is needed.
In `namedparams()`, element `[0]` is the weight's name and element `[1]` is its value.
if dst["embded_target"] and child.name == "weight_xi" and self.word2vecFlag:
for a, b in zip(child.namedparams(), dst["embded_target"].namedparams()):
b[1].data = a[1].data
This is the weight-copying routine.
It iterates over the children of the source model and copies weights when the following conditions are met.
Condition 1: the destination has a child with the matching name
Condition 2: the child types are the same
Condition 3: a `link.Link`, that is, an actual parameterized layer, has been reached
Condition 4: the parameter names and shapes match
def copy_model(self, src, dst, dec_flag=False):
print("start copy")
for child in src.children():
if dec_flag:
if dst["embded_target"] and child.name == "weight_xi" and self.word2vecFlag:
for a, b in zip(child.namedparams(), dst["embded_target"].namedparams()):
b[1].data = a[1].data
print('Copy weight_jy')
if child.name not in dst.__dict__: continue
dst_child = dst[child.name]
if type(child) != type(dst_child): continue
if isinstance(child, link.Chain):
self.copy_model(child, dst_child)
if isinstance(child, link.Link):
match = True
for a, b in zip(child.namedparams(), dst_child.namedparams()):
if a[0] != b[0]:
match = False
break
if a[1].data.shape != b[1].data.shape:
match = False
break
if not match:
print('Ignore %s because of parameter mismatch' % child.name)
continue
for a, b in zip(child.namedparams(), dst_child.namedparams()):
b[1].data = a[1].data
print('Copy %s' % child.name)
if self.word2vecFlag:
self.copy_model(self.word2vec, self.attention_dialogue.emb)
self.copy_model(self.word2vec, self.attention_dialogue.dec, dec_flag=True)
Create a generator for input and output utterances.
gen1 = gens.word_list(self.source)
gen2 = gens.word_list(self.target)
gen3 = gens.batch(gens.sorted_parallel(gen1, gen2, 100 * self.minibatch), self.minibatch)
Both generators are grouped into mini-batches. The `batch` generator below collects items into batches; when the items are tuples (source-target pairs) it yields the batch in tuple form.
def batch(generator, batch_size):
batch = []
is_tuple = False
for l in generator:
is_tuple = isinstance(l, tuple)
batch.append(l)
if len(batch) == batch_size:
yield tuple(list(x) for x in zip(*batch)) if is_tuple else batch
batch = []
if batch:
yield tuple(list(x) for x in zip(*batch)) if is_tuple else batch
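A hedged example of how this generator behaves with plain (non-tuple) items and a batch size of 2; `sentences` is made up for illustration:
sentences = [['a'], ['b', 'c'], ['d'], ['e', 'f', 'g'], ['h']]
print(list(batch(iter(sentences), 2)))
# [[['a'], ['b', 'c']], [['d'], ['e', 'f', 'g']], [['h']]]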
Input and output utterances are pooled together and sorted by length before batching, so that utterances of similar length end up in the same batch.
def sorted_parallel(generator1, generator2, pooling, order=1):
gen1 = batch(generator1, pooling)
gen2 = batch(generator2, pooling)
for batch1, batch2 in zip(gen1, gen2):
#yield from sorted(zip(batch1, batch2), key=lambda x: len(x[1]))
for x in sorted(zip(batch1, batch2), key=lambda x: len(x[order])):
yield x
AdaGrad is used for optimization. It is a method in which the update step becomes smaller as updates accumulate.
r \leftarrow r + g_{\vec{w}}^2\\
w \leftarrow w - \frac{\alpha}{\sqrt{r} + \epsilon}\,g_{\vec{w}}
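A minimal NumPy sketch of one AdaGrad update following the formula above (the `lr` and `eps` values are illustrative; Chainer's `optimizers.AdaGrad` applies this per parameter):
import numpy as np

def adagrad_step(w, grad, r, lr=0.01, eps=1e-8):
    r += grad * grad                     # accumulate squared gradients
    w -= lr * grad / (np.sqrt(r) + eps)  # the effective step shrinks as r grows
    return w, r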
`optimizer.GradientClipping(5)` clips the gradient by its L2 norm so that it stays within a fixed range.
opt = optimizers.AdaGrad(lr = 0.01)
opt.setup(self.attention_dialogue)
opt.add_hook(optimizer.GradientClipping(5))
In the following, the input utterances and the corresponding response utterances are padded to a common length by `fill_batch` (using the `</s>` token) so that they can be processed as mini-batches.
def fill_batch(batch, token='</s>'):
max_len = max(len(x) for x in batch)
return [x + [token] * (max_len - len(x) + 1) for x in batch]
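For example, a hedged illustration of the padding (the `+ 1` in the code always appends at least one `</s>`):
print(fill_batch([['hi'], ['how', 'are', 'you']]))
# [['hi', '</s>', '</s>', '</s>'], ['how', 'are', 'you', '</s>']]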
The backward pass is run with the loss obtained from the forward pass, and the weights are updated.
The backward computation depends on the activation functions used.
The update step in Chainer looks like this: the batch is converted for CPU or GPU, and the optimizer is called differently depending on whether the data is given as a `tuple`, a `dict`, or something else.
def update_core(self):
batch = self._iterators['main'].next()
in_arrays = self.converter(batch, self.device)
optimizer = self._optimizers['main']
loss_func = self.loss_func or optimizer.target
if isinstance(in_arrays, tuple):
in_vars = tuple(variable.Variable(x) for x in in_arrays)
optimizer.update(loss_func, *in_vars)
elif isinstance(in_arrays, dict):
in_vars = {key: variable.Variable(x)
for key, x in six.iteritems(in_arrays)}
optimizer.update(loss_func, **in_vars)
else:
in_var = variable.Variable(in_arrays)
optimizer.update(loss_func, in_var)
for src_batch, trg_batch in gen3:
src_batch = fill_batch(src_batch)
trg_batch = fill_batch(trg_batch)
K = len(src_batch)
hyp_batch, loss = self.forward_implement(src_batch, trg_batch, src_vocab, trg_vocab, self.attention_dialogue, True, 0)
loss.backward()
opt.update()
Saving the trained model: `save` and `save_spec` are not part of standard Chainer but were written separately to store language information.
`save` stores the vocabulary (utterance word) information
`save_spec` stores the vocabulary size, the embedding size, and the hidden layer size
`save_hdf5` saves the model weights in HDF5 format
trace('saving model ...')
prefix = self.model
model_path = APP_ROOT + "/model/" + prefix
src_vocab.save(model_path + '.srcvocab')
trg_vocab.save(model_path + '.trgvocab')
self.attention_dialogue.save_spec(model_path + '.spec')
serializers.save_hdf5(model_path + '.weights', self.attention_dialogue)
This is the test routine. The model saved during training is loaded, and a response is generated for each input utterance.
def test(self):
trace('loading model ...')
prefix = self.model
model_path = APP_ROOT + "/model/" + prefix
src_vocab = Vocabulary.load(model_path + '.srcvocab')
trg_vocab = Vocabulary.load(model_path + '.trgvocab')
self.attention_dialogue = AttentionDialogue.load_spec(model_path + '.spec', self.XP)
serializers.load_hdf5(model_path + '.weights', self.attention_dialogue)
trace('generating translation ...')
generated = 0
with open(self.test_target, 'w') as fp:
for src_batch in gens.batch(gens.word_list(self.source), self.minibatch):
src_batch = fill_batch(src_batch)
K = len(src_batch)
trace('sample %8d - %8d ...' % (generated + 1, generated + K))
hyp_batch = self.forward_implement(src_batch, None, src_vocab, trg_vocab, self.attention_dialogue, False, self.generation_limit)
source_cuont = 0
for hyp in hyp_batch:
hyp.append('</s>')
hyp = hyp[:hyp.index('</s>')]
print("src : " + "".join(src_batch[source_cuont]).replace("</s>", ""))
print('hyp : ' +''.join(hyp))
print(' '.join(hyp), file=fp)
source_cuont = source_cuont + 1
generated += K
trace('finished.')
This content was announced at PyCon 2016, but what is covered here is still only one part, and there is a long way to go once the other parts are explained as well. At present the range that plain deep learning can handle is limited, so the project combines several techniques. There are many deep learning models for dialogue, so fixing an evaluation metric and swapping in different models should lead to further performance improvements.
Attention and Memory in Deep Learning and NLP