WHY
Many people are interested in deep learning, so this article describes a deep learning implementation for dialogue.
The chat-response part uses Chainer, so that is what I focus on. Note, however, that the version is old:
the code has been confirmed to work with Chainer 1.5.1.
There may be mistakes. Because there were parts I wanted to understand deeply, I also walk through some of Chainer's own code. I would appreciate it if you could point out any errors.
-- The content announced at PyCon 2016 was more of a concept and outline, without any explanation of the code actually implemented, so this article is also partly a reflection on that.
-- I wrote this article because I wanted more people to understand and use the project by adding a code walkthrough. (More stars on GitHub would be welcome.)
Docker Hub
https://hub.docker.com/r/masayaresearch/dialogue/
github
https://github.com/SnowMasaya/Chainer-Slack-Twitter-Dialogue
There are many other parts, such as question answering, topic classification, and parallelized data acquisition, so I will write about those if requested.
WHAT
Training is done on the classified data, and the deep learning part uses an attention model. What is an attention model?
In neural machine translation, the sequence-to-sequence model compresses the whole input sentence into a single vector, so for long inputs the contribution of the first words is weakened as gradients accumulate over the sequence. This matters especially in English, where the first words tend to be the most important.
To work around this, earlier work improved translation accuracy by feeding the input in reverse order. For languages such as Japanese and Chinese, however, the important words come at the end, so reversing the input is not a fundamental solution.
The attention model was therefore proposed: instead of strictly separating the input encoding from the decoding, it predicts each decoder output from a weighted average of the encoder hidden states corresponding to that decoding step. Attention was originally successful in image tasks, and it is now producing results in machine translation and sentence summarization.
(Figure: attention model)
To predict "mo", it is the posterior probability when the input "I" ("I am an engineer") is obtained. The posterior probability is the score of the previous word (me), the state of the hidden layer, and the context vector ("I am an engineer"). Ignore the context vector for now. I will explain it later. The function g is generally a softmax function
As shown in the above figure, the formula used when predicting from the current state and context in consideration of the prior output is as follows.
p(y_i|y_1,...,y_{i-1}, \vec{x}) = g(y_{i-1}, s_i, c_i)
Here, the hidden state at step i (the state used to predict "mo") is as follows. It is determined by the previous state, the previous output word ("I"), and the context vector for the input ("I am an engineer"). The function f is generally a sigmoid.
s_i=f(s_{i-1}, y_{i-1},c_i)
The context vector is the weighted sum of the encoder hidden states (for "I am an engineer"), with weights $\alpha$.
c_{i} = \sum^{T_x}_{j=1}\alpha_{ij}h_{j}
So how are the weights defined above obtained? A score e is computed from the encoder hidden state h and the decoder state s immediately before the current output ("I" in the example). This form is used because h on the encoder side has a special shape, which is described later. The scores are exponentiated and normalized by the sum over all input positions, which turns them into weights that match each input-output pair.
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})} \\
e_{ij} = a(s_{i-1}, h_j)
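To make these formulas concrete, here is a minimal NumPy sketch (not the repository's code) that turns the scores e into weights alpha and builds the context vector c as the weighted sum of the encoder states:
import numpy as np

def attention_context(e_i, h):
    """e_i: scores e_{ij} for one output step, shape (T_x,);
    h: encoder hidden states h_j, shape (T_x, hidden_size)."""
    exp_e = np.exp(e_i - e_i.max())          # exponentiate (shifted for numerical stability)
    alpha = exp_e / exp_e.sum()              # alpha_{ij} = exp(e_{ij}) / sum_k exp(e_{ik})
    c_i = (alpha[:, None] * h).sum(axis=0)   # c_i = sum_j alpha_{ij} h_j
    return alpha, c_i

alpha, c = attention_context(np.random.randn(5), np.random.randn(5, 4))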
So what is special about the hidden state h? Unlike a normal sequence-to-sequence encoder, it combines a forward and a backward pass. The forward and backward states are defined as below and concatenated. This is the hidden representation of the encoder input ("I am an engineer").
(\overrightarrow{h_1},...,\overrightarrow{h_{T_x}})\\
(\overleftarrow{h_1},...,\overleftarrow{h_{T_x}})\\
h_j = [\overrightarrow{h_j}^T;\overleftarrow{h_j}^T]^T
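A short NumPy sketch of this concatenation (the shapes are illustrative). Note that the repository's attention.py, shown later, keeps the forward and backward lists separate and weights them individually instead of concatenating them, but the idea is the same:
import numpy as np

T_x, hidden_size = 5, 4
forward_h = np.random.randn(T_x, hidden_size)         # states read left to right
backward_h = np.random.randn(T_x, hidden_size)        # states read right to left
h = np.concatenate([forward_h, backward_h], axis=1)   # each h_j has size 2 * hidden_size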
Now let's follow how this formula is actually realized in the code base.
src_embed.py -- maps the input language data into the neural network's embedding space.
attention_encoder.py -- propagates the embedded input language through the encoder (this corresponds to the user's utterance in the dialogue).
attention.py -- builds the context information.
attention_decoder.py -- the neural network for the output language; it produces the output from the context information and the target language and propagates the hidden states.
attention_dialogue.py -- ties everything together: model loading, model saving, weight initialization, embedding, encoding, and decoding.
The implementation consists of the above five files.
HOW
src_embed.py
Let's start with the part that embeds the input language. It takes the input vocabulary size and the embedding size of the neural network. In the dialogue setting, the input language is the user's utterance.
def __init__(self, vocab_size, embed_size):
super(SrcEmbed, self).__init__(
weight_xi=links.EmbedID(vocab_size, embed_size),
)
The concrete processing inside `links.EmbedID` is shown below.
`W` (a `chainer.Variable`) is the embedding matrix; its initial weights are drawn from a normal distribution with mean 0 and variance 1.0.
def __init__(self, in_size, out_size, initialW=None, ignore_label=None):
super(EmbedID, self).__init__(W=(in_size, out_size))
if initialW is None:
initialW = initializers.Normal(1.0)
initializers.init_weight(self.W.data, initialW)
self.ignore_label = ignore_label
Specifically, the part below draws the values from the normal distribution.
`xp` is used so that the same code works on the GPU: `xp.random.normal` is called instead of `numpy.random.normal` directly.
Reference: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html
class Normal(initializer.Initializer):
def __init__(self, scale=0.05, dtype=None):
self.scale = scale
super(Normal, self).__init__(dtype)
def __call__(self, array):
xp = cuda.get_array_module(array)
array[...] = xp.random.normal(
loc=0.0, scale=self.scale, size=array.shape)
The initial weights returned here are set below.
`init_weight` fills the weight data with the given `initializer`, which may be a `numpy.ndarray`, a `cupy.ndarray`, a scalar, or an initializer object.
def init_weight(weights, initializer, scale=1.0):
if initializer is None:
initializer = HeNormal(1 / numpy.sqrt(2))
elif numpy.isscalar(initializer):
initializer = Constant(initializer)
elif isinstance(initializer, numpy.ndarray):
initializer = Constant(initializer)
assert callable(initializer)
initializer(weights)
weights *= scale
When `initializer` is not `None`, a `Constant` initializer fills the array, checking whether it is a GPU array or an ordinary NumPy array.
class Constant(initializer.Initializer):
def __init__(self, fill_value, dtype=None):
self.fill_value = fill_value
super(Constant, self).__init__(dtype)
def __call__(self, array):
if self.dtype is not None:
assert array.dtype == self.dtype
xp = cuda.get_array_module(array)
array[...] = xp.asarray(self.fill_value)
The check that decides which array module to use is shown below.
def get_array_module(*args):
if available:
return cupy.get_array_module(*args)
else:
return numpy
The `__call__` method embeds the input language into the neural network's space.
The embedding is passed through the hyperbolic tangent (`functions.tanh`), which keeps the mapping differentiable, so it can be trained by error backpropagation.
def __call__(self, source):
return functions.tanh(self.weight_xi(source))
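A hedged usage sketch, assuming Chainer 1.5.1 and the `SrcEmbed` class above (the sizes and word IDs are made up for illustration):
import numpy as np
from chainer import Variable

embed = SrcEmbed(vocab_size=5000, embed_size=300)
word_ids = Variable(np.array([12, 7, 42], dtype=np.int32))  # a mini-batch of word IDs
vectors = embed(word_ids)  # Variable of shape (3, 300), squashed into (-1, 1) by tanh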
attention_encoder.py
The embedded input layer is passed to the hidden layer. Why `4 * hidden_size`? Because the LSTM needs four sets of parameters:
-- the input gate
-- the forget gate
-- the output gate
-- the cell input, which takes the previous inputs into account
Why these are needed is covered in other materials, so I will not explain it in detail here, but this gating mechanism is what lets the LSTM keep or forget information over long sequences.
def __init__(self, embed_size, hidden_size):
super(AttentionEncoder, self).__init__(
source_to_hidden=links.Linear(embed_size, 4 * hidden_size),
hidden_to_hidden=links.Linear(hidden_size, 4 * hidden_size),
)
This is the relevant `links.Linear` processing:
-- weight initialization
-- registration of the weight parameter
-- bias initialization
-- registration of the bias parameter
def __init__(self, in_size, out_size, wscale=1, bias=0, nobias=False,
initialW=None, initial_bias=None):
super(Linear, self).__init__()
self.initialW = initialW
self.wscale = wscale
self.out_size = out_size
self._W_initializer = initializers._get_initializer(initialW, math.sqrt(wscale))
if in_size is None:
self.add_uninitialized_param('W')
else:
self._initialize_params(in_size)
if nobias:
self.b = None
else:
if initial_bias is None:
initial_bias = bias
bias_initializer = initializers._get_initializer(initial_bias)
self.add_param('b', out_size, initializer=bias_initializer)
def _initialize_params(self, in_size):
self.add_param('W', (self.out_size, in_size), initializer=self._W_initializer)
This is the concrete initialization.
`scale` defaults to 1 and is multiplied into the created array.
With the `Constant` initializer shown earlier, the array is filled with a fixed value and then scaled.
class _ScaledInitializer(initializer.Initializer):
def __init__(self, initializer, scale=1.0):
self.initializer = initializer
self.scale = scale
dtype = getattr(initializer, 'dtype', None)
super(Identity, self).__init__(dtype)
def __call__(self, array):
self.initializer(array)
array *= self.scale
def _get_initializer(initializer, scale=1.0):
if initializer is None:
return HeNormal(scale / numpy.sqrt(2))
if numpy.isscalar(initializer):
return Constant(initializer * scale)
if isinstance(initializer, numpy.ndarray):
return Constant(initializer * scale)
assert callable(initializer)
if scale == 1.0:
return initializer
return _ScaledInitializer(initializer, scale)
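As a quick hedged illustration of the dispatch above (the argument values are just examples):
init = _get_initializer(None)           # HeNormal(1 / sqrt(2)): default He initialization
init = _get_initializer(0.5)            # Constant(0.5): fill every weight with 0.5
init = _get_initializer(numpy.ones(3))  # Constant(ndarray): fill with the given array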
In `__call__`, the current cell state, the previous hidden state, and the embedded input are passed to the LSTM.
def __call__(self, source, current, hidden):
return functions.lstm(current, self.source_to_hidden(source) + self.hidden_to_hidden(hidden))
The processing invoked when the forward pass of `functions.lstm` above runs is as follows. It lives in `chainer/functions/activation/lstm.py`.
The input is split into the four LSTM gates.
-- `len(x)`: the number of rows (the batch size)
-- `x.shape[1]`: the number of columns
-- `x.shape[2:]`: used for data with three or more dimensions
def _extract_gates(x):
r = x.reshape((len(x), x.shape[1] // 4, 4) + x.shape[2:])
return [r[:, :, i] for i in six.moves.range(4)]
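A toy-sized NumPy sketch (not the repository's code) of what this reshape and indexing do:
import numpy as np

batch, hidden = 2, 3
x = np.arange(batch * 4 * hidden).reshape(batch, 4 * hidden)  # (batch, 4 * hidden)
r = x.reshape((len(x), x.shape[1] // 4, 4))                   # (batch, hidden, 4)
a, i, f, o = [r[:, :, k] for k in range(4)]                   # each gate: (batch, hidden)
print(a.shape, i.shape, f.shape, o.shape)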
CPU processing:
-- get the input and the previous cell state
-- split the input into the four LSTM gates
-- following the paper, the cell input `a` goes through a hyperbolic tangent and the gates `i`, `f`, `o` through a sigmoid
-- allocate an (uninitialized) array for the next cell state
-- for the batch, the next cell state is the product of the cell input and the input gate plus the product of the forget gate and the previous cell state
-- the hidden value is the output gate times the hyperbolic tangent of the next cell state
The GPU processing is the same, but because it runs as CUDA C++ it is defined as follows. It mirrors the LSTM defined in Python, rewritten for C++ processing.
_preamble = '''
template <typename T> __device__ T sigmoid(T x) {
const T half = 0.5;
return tanh(x * half) * half + half;
}
template <typename T> __device__ T grad_sigmoid(T y) { return y * (1 - y); }
template <typename T> __device__ T grad_tanh(T y) { return 1 - y * y; }
#define COMMON_ROUTINE \
T aa = tanh(a); \
T ai = sigmoid(i_); \
T af = sigmoid(f); \
T ao = sigmoid(o);
'''
def forward(self, inputs):
c_prev, x = inputs
a, i, f, o = _extract_gates(x)
batch = len(x)
if isinstance(x, numpy.ndarray):
self.a = numpy.tanh(a)
self.i = _sigmoid(i)
self.f = _sigmoid(f)
self.o = _sigmoid(o)
c_next = numpy.empty_like(c_prev)
c_next[:batch] = self.a * self.i + self.f * c_prev[:batch]
h = self.o * numpy.tanh(c_next[:batch])
else:
c_next = cuda.cupy.empty_like(c_prev)
h = cuda.cupy.empty_like(c_next[:batch])
cuda.elementwise(
'T c_prev, T a, T i_, T f, T o', 'T c, T h',
'''
COMMON_ROUTINE;
c = aa * ai + af * c_prev;
h = ao * tanh(c);
''',
'lstm_fwd', preamble=_preamble)(
c_prev[:batch], a, i, f, o, c_next[:batch], h)
c_next[batch:] = c_prev[batch:]
self.c = c_next[:batch]
return c_next, h
The GPU path calls into CUDA through CuPy. About CuPy:
http://docs.chainer.org/en/stable/cupy-reference/overview.html
The function below builds a kernel, memoizes it per device, and runs it on the CUDA device. See the following for why values computed in GPU memory have to be handled this way:
http://www.nvidia.com/docs/io/116711/sc11-cuda-c-basics.pdf
@memoize(for_each_device=True)
def elementwise(in_params, out_params, operation, name, **kwargs):
check_cuda_available()
return cupy.ElementwiseKernel(
in_params, out_params, operation, name, **kwargs)
The backward pass is as follows.
Chainer is helpful because it hides this processing from the user.
It mirrors the forward pass, except that it uses the gradients of the outputs as well as the inputs.
`gc_prev[:batch]` accumulates, for the batch, the gradient flowing back through the output gate and the tanh of the cell state plus the incoming cell-state gradient, and the individual gate gradients are computed with `_grad_tanh` and `_grad_sigmoid`.
co = numpy.tanh(self.c)
gc_prev = numpy.empty_like(c_prev)
# multiply f later
gc_prev[:batch] = gh * self.o * _grad_tanh(co) + gc_update
gc = gc_prev[:batch]
ga[:] = gc * self.i * _grad_tanh(self.a)
gi[:] = gc * self.a * _grad_sigmoid(self.i)
gf[:] = gc * c_prev[:batch] * _grad_sigmoid(self.f)
go[:] = gh * co * _grad_sigmoid(self.o)
gc_prev[:batch] *= self.f # multiply f here
gc_prev[batch:] = gc_rest
This is the GPU part. It is the same as the CPU version, but the computation is written as C++ passed to `cuda.elementwise`.
a, i, f, o = _extract_gates(x)
gc_prev = xp.empty_like(c_prev)
cuda.elementwise(
'T c_prev, T c, T gc, T gh, T a, T i_, T f, T o',
'T gc_prev, T ga, T gi, T gf, T go',
'''
COMMON_ROUTINE;
T co = tanh(c);
T temp = gh * ao * grad_tanh(co) + gc;
ga = temp * ai * grad_tanh(aa);
gi = temp * aa * grad_sigmoid(ai);
gf = temp * c_prev * grad_sigmoid(af);
go = gh * co * grad_sigmoid(ao);
gc_prev = temp * af;
''',
'lstm_bwd', preamble=_preamble)(
c_prev[:batch], self.c, gc_update, gh, a, i, f, o,
gc_prev[:batch], ga, gi, gf, go)
gc_prev[batch:] = gc_rest
attention.py
This is the part that builds the context information.
-- `annotion_weight` is the weight for the forward (annotation) states
-- `back_weight` is the weight for the backward states
-- `pw` is the weight for the current decoder state
-- `weight_exponential` produces the score that is fed to the exp function in the network
def __init__(self, hidden_size):
super(Attention, self).__init__(
annotion_weight=links.Linear(hidden_size, hidden_size),
back_weight=links.Linear(hidden_size, hidden_size),
pw=links.Linear(hidden_size, hidden_size),
weight_exponential=links.Linear(hidden_size, 1),
)
self.hidden_size = hidden_size
-- `annotion_list` is the list of encoder states from the forward pass (one per input word)
-- `back_word_list` is the list of encoder states from the backward pass
-- `p` is the current decoder hidden state
def __call__(self, annotion_list, back_word_list, p):
Initialization for batch processing
batch_size = p.data.shape[0]
exponential_list = []
sum_exponential = XP.fzeros((batch_size, 1))
A weight is created that combines the forward list, the backward list, and the current state, which corresponds to:
e_{ij} = a(s_{i-1}, h_j)
Each score is then passed through the exp function and collected in a list, and the running sum is also accumulated:
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})} \\
Because the encoder runs in both directions, the loop walks the forward annotation list and the backward list together, computes a weight that also includes the current state, exponentiates it, appends it to the weight list, and accumulates the sum of the exponentials.
for annotion, back_word in zip(annotion_list, back_word_list):
weight = functions.tanh(self.annotion_weight(annotion) + self.back_weight(back_word) + self.pw(p))
exponential = functions.exp(self.weight_exponential(weight))
exponential_list.append(exponential)
sum_exponential += exponential
After initialization, the forward and backward context values are computed from the weights and returned for the whole batch. The matrix computation is done with `functions.batch_matmul`.
-- `a` is the left matrix
-- `b` is the right matrix
-- if `transa` is set, the left matrix is transposed
-- if `transb` is set, the right matrix is transposed
def batch_matmul(a, b, transa=False, transb=False):
return BatchMatMul(transa=transa, transb=transb)(a, b)
The actual matrix calculation first reshapes each matrix into a form that can be multiplied batch element by batch element:
a = a.reshape(a.shape[:2] + (-1,))
For example, a matrix like the one below
array([[1, 2, 3],
[4, 5, 6],
[3, 4, 5]])
It is converted as follows.
array([[[1],
[2],
[3]],
[[4],
[5],
[6]],
[[3],
[4],
[5]]])
-- transpose if necessary
-- allocate an (uninitialized) array for the result
-- with NumPy the product is computed per batch element with `numpy.dot`; with CuPy on the GPU it is computed with `matmul`, which does not accept scalars and operates on stacked matrices
def _batch_matmul(a, b, transa=False, transb=False, transout=False):
a = a.reshape(a.shape[:2] + (-1,))
b = b.reshape(b.shape[:2] + (-1,))
trans_axis = (0, 2, 1)
if transout:
transa, transb = not transb, not transa
a, b = b, a
if transa:
a = a.transpose(trans_axis)
if transb:
b = b.transpose(trans_axis)
xp = cuda.get_array_module(a)
if xp is numpy:
ret = numpy.empty(a.shape[:2] + b.shape[2:], dtype=a.dtype)
for i in six.moves.range(len(a)):
ret[i] = numpy.dot(a[i], b[i])
return ret
return xp.matmul(a, b)
The forward and backward values are initialized with a zero matrix of shape (batch size, hidden size), and the weighted sums computed from `annotion` and `back_word` are returned.
ZEROS = XP.fzeros((batch_size, self.hidden_size))
annotion_value = ZEROS
back_word_value = ZEROS
# Calculate the Convolution Value each annotion and back word
for annotion, back_word, exponential in zip(annotion_list, back_word_list, exponential_list):
exponential /= sum_exponential
annotion_value += functions.reshape(functions.batch_matmul(annotion, exponential), (batch_size, self.hidden_size))
back_word_value += functions.reshape(functions.batch_matmul(back_word, exponential), (batch_size, self.hidden_size))
return annotion_value, back_word_value
attention_decoder.py
This is the output part; in the dialogue setting it is the system's response.
It is more involved than the input side.
-- `embed_vocab`: maps the output language into the neural network's space
-- `embed_hidden`: propagates the embedded value to the LSTM
-- `hidden_hidden`: propagates the hidden layer
-- `annotation_hidden`: the forward-direction context vector
-- `back_word_hidden`: the backward-direction context vector
-- `hidden_embed`: propagation from the hidden layer to the output embedding (corresponding to the system response)
-- `embded_target`: propagation from the output embedding to the system output (corresponding to the system response)
super(AttentionDecoder, self).__init__(
embed_vocab=links.EmbedID(vocab_size, embed_size),
embed_hidden=links.Linear(embed_size, 4 * hidden_size),
hidden_hidden=links.Linear(hidden_size, 4 * hidden_size),
annotation_hidden=links.Linear(embed_size, 4 * hidden_size),
back_word_hidden=links.Linear(hidden_size, 4 * hidden_size),
hidden_embed=links.Linear(hidden_size, embed_size),
embded_target=links.Linear(embed_size, vocab_size),
)
The output word is mapped to its embedding with the differentiable hyperbolic tangent. The LSTM is then given the sum of the embedded output word, the hidden layer, the forward context vector, and the backward context vector, and predicts the next cell state and hidden state. The predicted hidden state is passed through another hyperbolic tangent to produce the output embedding, which is used to predict the output word. The method returns the predicted word scores, the cell state, and the hidden state.
embed = functions.tanh(self.embed_vocab(target))
current, hidden = functions.lstm(current, self.embed_hidden(embed) + self.hidden_hidden(hidden) +
self.annotation_hidden(annotation) + self.back_word_hidden(back_word))
embed_hidden = functions.tanh(self.hidden_embed(hidden))
return self.embded_target(embed_hidden), current, hidden
attention_dialogue.py
This is the part that performs the actual dialogue processing.
It uses the modules described above.
-- `emb` maps the input language into the neural network's space
-- `forward_encode` encodes in the forward direction, preparing to build the context vector
-- `back_encdode` encodes in the backward direction, likewise
-- `attention` is the attention module
-- `dec` produces the output words
The constructor takes the vocabulary size, the embedding size, the hidden layer size, and `XP`, which decides whether the CPU or the GPU is used.
super(AttentionDialogue, self).__init__(
emb=SrcEmbed(vocab_size, embed_size),
forward_encode=AttentionEncoder(embed_size, hidden_size),
back_encdode=AttentionEncoder(embed_size, hidden_size),
attention=Attention(hidden_size),
dec=AttentionDecoder(vocab_size, embed_size, hidden_size),
)
self.vocab_size = vocab_size
self.embed_size = embed_size
self.hidden_size = hidden_size
self.XP = XP
`reset` zeroes the gradients and clears the list of embedded source words.
def reset(self):
self.zerograds()
self.source_list = []
The input language (user's utterance) is held as a word list.
def embed(self, source):
self.source_list.append(self.emb(source))
`encode` performs the encoding. The batch size is taken from the shape of the first embedded source word.
The states are initialized with `self.XP.fzeros`, because the initial values have to be created differently on CPU and GPU.
A list of forward states is collected to build the forward context vector.
The backward direction is handled in the same way.
def encode(self):
batch_size = self.source_list[0].data.shape[0]
ZEROS = self.XP.fzeros((batch_size, self.hidden_size))
context = ZEROS
annotion = ZEROS
annotion_list = []
# Get the annotion list
for source in self.source_list:
context, annotion = self.forward_encode(source, context, annotion)
annotion_list.append(annotion)
context = ZEROS
back_word = ZEROS
back_word_list = []
# Get the back word list
for source in reversed(self.source_list):
context, back_word = self.back_encdode(source, context, back_word)
back_word_list.insert(0, back_word)
self.annotion_list = annotion_list
self.back_word_list = back_word_list
self.context = ZEROS
self.hidden = ZEROS
`decode` obtains the forward and backward context values from the attention module using the current hidden state, passes the target word, the cell state, the hidden state, and the two context values to `dec`, and returns the predicted output word scores.
def decode(self, target_word):
annotion_value, back_word_value = self.attention(self.annotion_list, self.back_word_list, self.hidden)
target_word, self.context, self.hidden = self.dec(target_word, self.context, self.hidden, annotion_value, back_word_value)
return target_word
Saving the model specification: it stores the vocabulary size, the embedding size, and the hidden layer size.
def save_spec(self, filename):
with open(filename, 'w') as fp:
print(self.vocab_size, file=fp)
print(self.embed_size, file=fp)
print(self.hidden_size, file=fp)
Loading the model specification: the values read from the file are passed to the model constructor.
def load_spec(filename, XP):
with open(filename) as fp:
vocab_size = int(next(fp))
embed_size = int(next(fp))
hidden_size = int(next(fp))
return AttentionDialogue(vocab_size, embed_size, hidden_size, XP)
EncoderDecoderModelAttention.py
This part actually uses the modules explained above. Various parameters are set here.
def __init__(self, parameter_dict):
self.parameter_dict = parameter_dict
self.source = parameter_dict["source"]
self.target = parameter_dict["target"]
self.test_source = parameter_dict["test_source"]
self.test_target = parameter_dict["test_target"]
self.vocab = parameter_dict["vocab"]
self.embed = parameter_dict["embed"]
self.hidden = parameter_dict["hidden"]
self.epoch = parameter_dict["epoch"]
self.minibatch = parameter_dict["minibatch"]
self.generation_limit = parameter_dict["generation_limit"]
self.word2vec = parameter_dict["word2vec"]
self.word2vecFlag = parameter_dict["word2vecFlag"]
self.model = parameter_dict["model"]
self.attention_dialogue = parameter_dict["attention_dialogue"]
XP.set_library(False, 0)
self.XP = XP
This is the forward pass. The batch size and the source and target lengths are taken, and the string-to-index (and index-to-string) mappings are looked up for each vocabulary.
def forward_implement(self, src_batch, trg_batch, src_vocab, trg_vocab, attention, is_training, generation_limit):
batch_size = len(src_batch)
src_len = len(src_batch[0])
trg_len = len(trg_batch[0]) if trg_batch else 0
src_stoi = src_vocab.stoi
trg_stoi = trg_vocab.stoi
trg_itos = trg_vocab.itos
attention.reset()
The input language is fed in reverse order. Reversing the input is known to improve machine translation results, so the dialogue model does the same, though I suspect it has little effect here.
x = self.XP.iarray([src_stoi('</s>') for _ in range(batch_size)])
attention.embed(x)
for l in reversed(range(src_len)):
x = self.XP.iarray([src_stoi(src_batch[k][l]) for k in range(batch_size)])
attention.embed(x)
attention.encode()
The target sequence to be generated is initialized with `<s>`.
t = self.XP.iarray([trg_stoi('<s>') for _ in range(batch_size)])
hyp_batch = [[] for _ in range(batch_size)]
This is the training branch.
Words cannot be learned directly, so `stoi` converts them to index information.
The predicted target (here, the dialogue output) is compared with the correct data using softmax cross entropy.
Since cross entropy measures the distance between probability distributions, the smaller the loss, the closer the output is to the target.
The branch returns the hypothesis candidates and the accumulated loss.
if is_training:
loss = self.XP.fzeros(())
for l in range(trg_len):
y = attention.decode(t)
t = self.XP.iarray([trg_stoi(trg_batch[k][l]) for k in range(batch_size)])
loss += functions.softmax_cross_entropy(y, t)
output = cuda.to_cpu(y.data.argmax(1))
for k in range(batch_size):
hyp_batch[k].append(trg_itos(output[k]))
return hyp_batch, loss
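To make the loss term concrete, here is a tiny NumPy sketch of softmax cross entropy for a single example (the numbers are made up); `functions.softmax_cross_entropy` computes the batched, differentiable version of this:
import numpy as np

logits = np.array([2.0, 0.5, -1.0])   # unnormalized scores over a 3-word vocabulary
target = 0                            # index of the correct word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax
loss = -np.log(probs[target])         # small when the model puts its mass on the target
print(loss)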
This is the test branch.
A neural network can keep generating candidates indefinitely, and an LSTM in particular feeds its past state back in and can fall into an infinite loop, so the generation length is capped.
Decoding starts from the initialized target word sequence.
The arg-max of the output scores becomes the output and `t` is updated with it.
The indices produced for each batch element are converted back into words.
The loop `break`s when every candidate ends with the `</s>` termination symbol.
else:
while len(hyp_batch[0]) < generation_limit:
y = attention.decode(t)
output = cuda.to_cpu(y.data.argmax(1))
t = self.XP.iarray(output)
for k in range(batch_size):
hyp_batch[k].append(trg_itos(output[k]))
if all(hyp_batch[k][-1] == '</s>' for k in range(batch_size)):
break
return hyp_batch
This is the overall training procedure.
The input-utterance and output-utterance vocabularies are initialized: `gens.word_list` creates a generator over each corpus, and `self.vocab` gives the vocabulary size.
src_vocab = Vocabulary.new(gens.word_list(self.source), self.vocab)
trg_vocab = Vocabulary.new(gens.word_list(self.target), self.vocab)
`Vocabulary.new()` builds the vocabulary information for the input and output utterances.
`gens.word_list(self.source)` creates the generator shown below; the input file name is given in `self.source`.
def word_list(filename):
with open(filename) as fp:
for l in fp:
yield l.split()
The conversion from words to indices is done in the following part.
`<unk>` (unknown word) is 0, `<s>` (start of sentence) is 1, and `</s>` (end of sentence) is 2.
Because these values are reserved in advance, ordinary words are indexed from 3 onward (hence the `+ 3`).
@staticmethod
def new(list_generator, size):
self = Vocabulary()
self.__size = size
word_freq = defaultdict(lambda: 0)
for words in list_generator:
for word in words:
word_freq[word] += 1
self.__stoi = defaultdict(lambda: 0)
self.__stoi['<unk>'] = 0
self.__stoi['<s>'] = 1
self.__stoi['</s>'] = 2
self.__itos = [''] * self.__size
self.__itos[0] = '<unk>'
self.__itos[1] = '<s>'
self.__itos[2] = '</s>'
for i, (k, v) in zip(range(self.__size - 3), sorted(word_freq.items(), key=lambda x: -x[1])):
self.__stoi[k] = i + 3
self.__itos[i + 3] = k
return self
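A hedged usage sketch of the vocabulary (the corpus here is hypothetical). `stoi` and `itos` are used as callables in the forward code shown earlier (`src_stoi('</s>')`, `trg_itos(output[k])`), so this assumes they are methods wrapping the dictionaries built in `new`:
corpus = [['hello', 'world'], ['hello', 'there']]
vocab = Vocabulary.new(iter(corpus), size=10)
print(vocab.stoi('hello'))   # 3: 'hello' is the most frequent word and 0-2 are reserved
print(vocab.stoi('unseen'))  # 0: unknown words fall back to <unk>
print(vocab.itos(2))         # '</s>'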
Creating the attention model.
It is given the vocabulary size, the embedding size, the hidden layer size, and `XP`.
`XP` is the part that switches between CPU and GPU computation.
trace('making model ...')
self.attention_dialogue = AttentionDialogue(self.vocab, self.embed, self.hidden, self.XP)
This is the transfer learning part: the weights trained with word2vec are copied in.
The word2vec model's weight is named `weight_xi`, which matches the input-utterance embedding, but on the output side the layer is called `embded_target`, so the special handling below is needed.
In `namedparams()`, element `[0]` is the weight's name and element `[1]` is its value.
if dst["embded_target"] and child.name == "weight_xi" and self.word2vecFlag:
for a, b in zip(child.namedparams(), dst["embded_target"].namedparams()):
b[1].data = a[1].data
This is the weight-copying routine.
It iterates over the children of the source model and copies weights when the following conditions are met.
Condition 1: the destination has a child with the matching name
Condition 2: the child types are the same
Condition 3: a `link.Link`, that is, an actual parameterized layer, has been reached
Condition 4: the parameter names and shapes match
def copy_model(self, src, dst, dec_flag=False):
print("start copy")
for child in src.children():
if dec_flag:
if dst["embded_target"] and child.name == "weight_xi" and self.word2vecFlag:
for a, b in zip(child.namedparams(), dst["embded_target"].namedparams()):
b[1].data = a[1].data
print('Copy weight_jy')
if child.name not in dst.__dict__: continue
dst_child = dst[child.name]
if type(child) != type(dst_child): continue
if isinstance(child, link.Chain):
self.copy_model(child, dst_child)
if isinstance(child, link.Link):
match = True
for a, b in zip(child.namedparams(), dst_child.namedparams()):
if a[0] != b[0]:
match = False
break
if a[1].data.shape != b[1].data.shape:
match = False
break
if not match:
print('Ignore %s because of parameter mismatch' % child.name)
continue
for a, b in zip(child.namedparams(), dst_child.namedparams()):
b[1].data = a[1].data
print('Copy %s' % child.name)
if self.word2vecFlag:
self.copy_model(self.word2vec, self.attention_dialogue.emb)
self.copy_model(self.word2vec, self.attention_dialogue.dec, dec_flag=True)
Create a generator for input and output utterances.
gen1 = gens.word_list(self.source)
gen2 = gens.word_list(self.target)
gen3 = gens.batch(gens.sorted_parallel(gen1, gen2, 100 * self.minibatch), self.minibatch)
Both generators are grouped into mini-batches. The `batch` generator below collects items into batches; when the items are tuples (source-target pairs) it yields the batch in tuple form.
def batch(generator, batch_size):
batch = []
is_tuple = False
for l in generator:
is_tuple = isinstance(l, tuple)
batch.append(l)
if len(batch) == batch_size:
yield tuple(list(x) for x in zip(*batch)) if is_tuple else batch
batch = []
if batch:
yield tuple(list(x) for x in zip(*batch)) if is_tuple else batch
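A hedged example of how this generator behaves with plain (non-tuple) items and a batch size of 2; `sentences` is made up for illustration:
sentences = [['a'], ['b', 'c'], ['d'], ['e', 'f', 'g'], ['h']]
print(list(batch(iter(sentences), 2)))
# [[['a'], ['b', 'c']], [['d'], ['e', 'f', 'g']], [['h']]]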
Input and output utterances are pooled together and sorted by length before batching, so that utterances of similar length end up in the same batch.
def sorted_parallel(generator1, generator2, pooling, order=1):
gen1 = batch(generator1, pooling)
gen2 = batch(generator2, pooling)
for batch1, batch2 in zip(gen1, gen2):
#yield from sorted(zip(batch1, batch2), key=lambda x: len(x[1]))
for x in sorted(zip(batch1, batch2), key=lambda x: len(x[order])):
yield x
AdaGrad is used for optimization. It is a method in which the update step becomes smaller as updates accumulate.
r \leftarrow r + g_{\vec{w}}^2\\
w \leftarrow w - \frac{\alpha}{\sqrt{r} + \epsilon}\,g_{\vec{w}}
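A minimal NumPy sketch of one AdaGrad update following the formula above (the `lr` and `eps` values are illustrative; Chainer's `optimizers.AdaGrad` applies this per parameter):
import numpy as np

def adagrad_step(w, grad, r, lr=0.01, eps=1e-8):
    r += grad * grad                     # accumulate squared gradients
    w -= lr * grad / (np.sqrt(r) + eps)  # the effective step shrinks as r grows
    return w, r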
`optimizer.GradientClipping(5)` clips the gradient by its L2 norm so that it stays within a fixed range.
opt = optimizers.AdaGrad(lr = 0.01)
opt.setup(self.attention_dialogue)
opt.add_hook(optimizer.GradientClipping(5))
In the following, the input utterances and the corresponding response utterances are padded to a common length by `fill_batch` (using the `</s>` token) so that they can be processed as mini-batches.
def fill_batch(batch, token='</s>'):
max_len = max(len(x) for x in batch)
return [x + [token] * (max_len - len(x) + 1) for x in batch]
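For example, a hedged illustration of the padding (the `+ 1` in the code always appends at least one `</s>`):
print(fill_batch([['hi'], ['how', 'are', 'you']]))
# [['hi', '</s>', '</s>', '</s>'], ['how', 'are', 'you', '</s>']]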
The backward pass is run with the loss obtained from the forward pass, and the weights are updated.
The backward computation depends on the activation functions used.
The update step in Chainer looks like this: the batch is converted for CPU or GPU, and the optimizer is called differently depending on whether the data is given as a `tuple`, a `dict`, or something else.
def update_core(self):
batch = self._iterators['main'].next()
in_arrays = self.converter(batch, self.device)
optimizer = self._optimizers['main']
loss_func = self.loss_func or optimizer.target
if isinstance(in_arrays, tuple):
in_vars = tuple(variable.Variable(x) for x in in_arrays)
optimizer.update(loss_func, *in_vars)
elif isinstance(in_arrays, dict):
in_vars = {key: variable.Variable(x)
for key, x in six.iteritems(in_arrays)}
optimizer.update(loss_func, **in_vars)
else:
in_var = variable.Variable(in_arrays)
optimizer.update(loss_func, in_var)
for src_batch, trg_batch in gen3:
src_batch = fill_batch(src_batch)
trg_batch = fill_batch(trg_batch)
K = len(src_batch)
hyp_batch, loss = self.forward_implement(src_batch, trg_batch, src_vocab, trg_vocab, self.attention_dialogue, True, 0)
loss.backward()
opt.update()
Saving the trained model: `save` and `save_spec` are not part of standard Chainer but were written separately to store language information.
`save` stores the vocabulary (utterance word) information
`save_spec` stores the vocabulary size, the embedding size, and the hidden layer size
`save_hdf5` saves the model weights in HDF5 format
trace('saving model ...')
prefix = self.model
model_path = APP_ROOT + "/model/" + prefix
src_vocab.save(model_path + '.srcvocab')
trg_vocab.save(model_path + '.trgvocab')
self.attention_dialogue.save_spec(model_path + '.spec')
serializers.save_hdf5(model_path + '.weights', self.attention_dialogue)
This is the test routine. The model saved during training is loaded, and a response is generated for each input utterance.
def test(self):
trace('loading model ...')
prefix = self.model
model_path = APP_ROOT + "/model/" + prefix
src_vocab = Vocabulary.load(model_path + '.srcvocab')
trg_vocab = Vocabulary.load(model_path + '.trgvocab')
self.attention_dialogue = AttentionDialogue.load_spec(model_path + '.spec', self.XP)
serializers.load_hdf5(model_path + '.weights', self.attention_dialogue)
trace('generating translation ...')
generated = 0
with open(self.test_target, 'w') as fp:
for src_batch in gens.batch(gens.word_list(self.source), self.minibatch):
src_batch = fill_batch(src_batch)
K = len(src_batch)
trace('sample %8d - %8d ...' % (generated + 1, generated + K))
hyp_batch = self.forward_implement(src_batch, None, src_vocab, trg_vocab, self.attention_dialogue, False, self.generation_limit)
source_cuont = 0
for hyp in hyp_batch:
hyp.append('</s>')
hyp = hyp[:hyp.index('</s>')]
print("src : " + "".join(src_batch[source_cuont]).replace("</s>", ""))
print('hyp : ' +''.join(hyp))
print(' '.join(hyp), file=fp)
source_cuont = source_cuont + 1
generated += K
trace('finished.')
This content was announced at PyCon 2016, but what is covered here is still only one part, and there is a long way to go once the other parts are explained as well. At present the range that plain deep learning can handle is limited, so the project combines several techniques. There are many deep learning models for dialogue, so fixing an evaluation metric and swapping in different models should lead to further performance improvements.
Attention and Memory in Deep Learning and NLP