NStep LSTM was introduced in Chainer 1.16.0. As the name suggests, NStep LSTM is a link that makes it easy to build multi-layer LSTMs. Internally it uses the cuDNN-optimized RNN, so it runs faster than the conventional LSTM. In addition, with NStep LSTM **the sequences in a mini-batch no longer need to have the same length**: you can pass each sample as an element of a list, as is. There is no need to pad with -1 and combine ignore_label=-1 with where, or to sort the samples by length and feed them in transposed.
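To make the contrast concrete, here is a minimal sketch in plain Python (no Chainer required) of the two older workarounds next to the list-style input that NStep LSTM accepts. The helper names `pad_batch` and `transpose_sorted` are hypothetical and only illustrate the data layouts; they are not Chainer functions, and the token IDs are made up.

```python
# Three variable-length sequences of token IDs (made-up values).
batch = [[5, 8, 2], [7], [4, 9]]


def pad_batch(seqs, pad=-1):
    """Old workaround 1: pad every sequence to the longest length with -1."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad] * (max_len - len(s)) for s in seqs]


def transpose_sorted(seqs):
    """Old workaround 2: sort by length (longest first), then regroup by time step."""
    ordered = sorted(seqs, key=len, reverse=True)
    steps = []
    for t in range(len(ordered[0])):
        # Only sequences that still have a token at time t contribute.
        steps.append([s[t] for s in ordered if len(s) > t])
    return steps


padded = pad_batch(batch)              # rectangular; padding must be ignored later
time_major = transpose_sorted(batch)   # one shrinking batch per time step
nstep_input = batch                    # NStep LSTM style: the list is passed as is

print(padded)
print(time_major)
```

With the list-style input, neither the padding value nor the sort order has to be tracked through the rest of the model.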
So this time I used NStep LSTM to train a sequence labeling model.
Because the inputs and outputs of NStep LSTM differ from those of the conventional LSTM, you cannot simply swap NStep LSTM into a model you have already implemented.
The arguments and return values of `__init__()` and `__call__()` of NStepLSTM are as follows.
NStepLSTM.__init__(n_layers, in_size, out_size, dropout, use_cudnn=True)
"""
n_layers (int): Number of layers.
in_size (int): Dimensionality of input vectors.
out_size (int): Dimensionality of hidden states and output vectors.
dropout (float): Dropout ratio.
use_cudnn (bool): Use cuDNN.
"""
...
NStepLSTM.__call__(hx, cx, xs, train=True)
"""
hx (~chainer.Variable): Initial hidden states.
cx (~chainer.Variable): Initial cell states.
xs (list of ~chainer.Variable): List of input sequences.
    Each element ``xs[i]`` is a :class:`chainer.Variable` holding a sequence.
"""
...
return hy, cy, ys
On the other hand, the conventional LSTM was as follows.
LSTM.__init__(in_size, out_size, **kwargs)
"""
in_size (int) – Dimension of input vectors. If None, parameter initialization will be deferred until the first forward data pass at which time the size will be determined.
out_size (int) – Dimensionality of output vectors.
lateral_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value.
    It is used for initialization of the lateral connections.
    May be None to use default initialization.
upward_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value.
    It is used for initialization of the upward connections.
    May be None to use default initialization.
bias_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value.
    It is used for initialization of the biases of cell input, input gate and output gate, and gates of the upward connection.
    May be a scalar; in that case, the bias is initialized by this value.
    May be None to use default initialization.
forget_bias_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value.
    It is used for initialization of the biases of the forget gate of the upward connection.
    May be a scalar; in that case, the bias is initialized by this value.
    May be None to use default initialization.
"""
...
LSTM.__call__(x)
"""
x (~chainer.Variable): A new batch from the input sequence.
"""
...
return y
NStep LSTM therefore differs from LSTM in the following ways.
- The number of layers and the dropout ratio are specified in `__init__()`.
- `__call__()` must be given the **initial hidden states** and **initial cell states**.
- The input to `__call__()` is not a single chainer.Variable but a **list** of chainer.Variable.
- `__call__()` returns the hidden states and cell states after the forward computation over the whole sequence, together with a **list** of outputs (chainer.Variable).
The key differences are that `__call__()` takes the initial hidden states and cell states as arguments, and that the input and output are lists.
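As a rough illustration of these differences, the sketch below builds NumPy arrays with the shapes that the arguments would have for a small batch. It does not call Chainer. The sizes are arbitrary, and the state shape `(n_layers, batch, out_size)` follows the zero-initialization used in the wrapper class in this article.

```python
import numpy as np

n_layers, in_size, out_size = 2, 4, 3   # illustrative sizes only
lengths = [5, 2, 3]                      # three sequences of different length
batch = len(lengths)

# xs: one (length, in_size) array per sample; no padding is involved.
xs = [np.zeros((n, in_size), dtype=np.float32) for n in lengths]

# hx, cx: one initial state per layer and per sample.
hx = np.zeros((n_layers, batch, out_size), dtype=np.float32)
cx = np.zeros((n_layers, batch, out_size), dtype=np.float32)

# Each element of the returned ys is expected to have shape (length, out_size).
ys_shapes = [(x.shape[0], out_size) for x in xs]
print(hx.shape, [x.shape for x in xs], ys_shapes)
```

In actual use, each of these arrays would be wrapped in a chainer.Variable before being passed to the link.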
To bring the initialization and call interface of NStep LSTM as close as possible to those of LSTM, implement a subclass.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from chainer import Variable
import chainer.links as L
import numpy as np


class LSTM(L.NStepLSTM):
    """Single-layer NStepLSTM that keeps hx/cx internally, like L.LSTM."""

    def __init__(self, in_size, out_size, dropout=0.5, use_cudnn=True):
        n_layers = 1
        super(LSTM, self).__init__(n_layers, in_size, out_size, dropout, use_cudnn)
        self.state_size = out_size
        self.reset_state()

    def to_cpu(self):
        super(LSTM, self).to_cpu()
        if self.cx is not None:
            self.cx.to_cpu()
        if self.hx is not None:
            self.hx.to_cpu()

    def to_gpu(self, device=None):
        super(LSTM, self).to_gpu(device)
        if self.cx is not None:
            self.cx.to_gpu(device)
        if self.hx is not None:
            self.hx.to_gpu(device)

    def set_state(self, cx, hx):
        assert isinstance(cx, Variable)
        assert isinstance(hx, Variable)
        cx_ = cx
        hx_ = hx
        # Move the given states to the device this link lives on.
        if self.xp == np:
            cx_.to_cpu()
            hx_.to_cpu()
        else:
            cx_.to_gpu()
            hx_.to_gpu()
        self.cx = cx_
        self.hx = hx_

    def reset_state(self):
        self.cx = self.hx = None

    def __call__(self, xs, train=True):
        batch = len(xs)
        # Lazily create zero states of shape (n_layers, batch, state_size).
        if self.hx is None:
            xp = self.xp
            self.hx = Variable(
                xp.zeros((self.n_layers, batch, self.state_size), dtype=xs[0].dtype),
                volatile='auto')
        if self.cx is None:
            xp = self.xp
            self.cx = Variable(
                xp.zeros((self.n_layers, batch, self.state_size), dtype=xs[0].dtype),
                volatile='auto')

        hy, cy, ys = super(LSTM, self).__call__(self.hx, self.cx, xs, train)
        self.hx, self.cx = hy, cy
        return ys
With the class above, `__init__()` takes only in_size and out_size, as with the conventional LSTM (dropout defaults to 0.5, and n_layers is fixed to 1, that is, no multi-layer LSTM). `__call__()` initializes cx and hx automatically, so its input and output are just lists of chainer.Variable.
Next, implement a bi-directional LSTM using NStep LSTM. The input for the backward LSTM is built by reversing each sample in the list of chainer.Variable that is passed to the forward LSTM. After computing the outputs of the forward and backward LSTMs, **re-align the direction of each sample in the two output lists** and concatenate them into a single vector. In the class below, a linear layer is added so that the output size equals the number of labels for sequence labeling.
from chainer import Chain
import chainer.functions as F
import chainer.links as L


class BLSTMBase(Chain):

    def __init__(self, embeddings, n_labels, dropout=0.5, train=True):
        vocab_size, embed_size = embeddings.shape
        feature_size = embed_size
        super(BLSTMBase, self).__init__(
            embed=L.EmbedID(
                in_size=vocab_size,
                out_size=embed_size,
                initialW=embeddings,
            ),
            # LSTM here is the NStepLSTM wrapper class defined above.
            f_lstm=LSTM(feature_size, feature_size, dropout),
            b_lstm=LSTM(feature_size, feature_size, dropout),
            linear=L.Linear(feature_size * 2, n_labels),
        )
        self._dropout = dropout
        self._n_labels = n_labels
        self.train = train

    def reset_state(self):
        self.f_lstm.reset_state()
        self.b_lstm.reset_state()

    def __call__(self, xs):
        self.reset_state()
        xs_f = []
        xs_b = []
        for x in xs:
            _x = self.embed(self.xp.array(x))
            xs_f.append(_x)
            xs_b.append(_x[::-1])  # reversed copy for the backward LSTM
        hs_f = self.f_lstm(xs_f, self.train)
        hs_b = self.b_lstm(xs_b, self.train)
        # Flip the backward outputs back so that position t lines up in both.
        ys = [
            self.linear(F.dropout(F.concat([h_f, h_b[::-1]]),
                                  ratio=self._dropout, train=self.train))
            for h_f, h_b in zip(hs_f, hs_b)
        ]
        return ys
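The re-alignment step (`h_b[::-1]` before the concatenation) is easy to get wrong, so here is a small NumPy check of the idea. This is only a sketch: `toy_rnn` is a hypothetical stand-in (a running sum over time) and not an LSTM. It is used because its output at step t visibly depends on everything it has read so far.

```python
import numpy as np


def toy_rnn(x):
    """Stand-in for an LSTM: running sum over time, so step t depends on steps 0..t."""
    return np.cumsum(x, axis=0)


# One sequence of length 4 with 1-dimensional features.
x = np.array([[1.0], [2.0], [3.0], [4.0]])

h_f = toy_rnn(x)            # forward pass has read x[0..t] at row t
h_b = toy_rnn(x[::-1])      # backward pass reads the sequence right to left
h_b_aligned = h_b[::-1]     # flip back so row t again corresponds to x[t]

# Row t now holds [left context, right context] for the same position t.
h = np.concatenate([h_f, h_b_aligned], axis=1)
print(h)
```

Without the second flip, row t of the backward output would describe position `len(x) - 1 - t`, and the concatenated vector would mix features of two different characters.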
Now let's apply the model implemented above to an actual sequence labeling task. I chose Chinese word segmentation, a sequence labeling problem for which bi-directional LSTMs are often used. Unlike English, Chinese does not separate words with spaces, so word boundaries have to be identified before the text can be processed.
Example)
冬天 (winter), 能 (can) 穿 (wear) 多少 (amount) 穿 (wear) 多少 (amount); 夏天 (summer), 能 (can) 穿 (wear) 多 (more) 少 (little) 穿 (wear) 多 (more) 少 (little).
[Chen+, 2015]
In the example above, the meaning changes depending on whether the characters are segmented as "多少" (amount) or as "多" (more) and "少" (little). Because the two clauses are almost identical in structure, the segmentation has to be decided from the surrounding context.
To learn Chinese word segmentation as a sequence labeling problem, each character is assigned one of four labels: B (Begin, the first character of a word of two or more characters), M (Middle, a character between the first and last characters of a word), E (End, the last character of a word of two or more characters), and S (Single, a one-character word). Using text annotated with these labels, the model learns which label to assign to each character from the context of the character sequence.
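The labeling scheme can be sketched in a few lines of plain Python. The function name `words_to_bmes` is hypothetical and is not taken from the repository. It converts an already segmented sentence into one label per character, which is the form the training data takes.

```python
def words_to_bmes(words):
    """Convert a segmented sentence (list of words) into one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            # First character B, last character E, everything in between M.
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# A short hand-segmented phrase, used only for illustration.
words = ["冬天", "，", "能", "穿", "多少"]
chars = [c for w in words for c in w]
tags = words_to_bmes(words)
print(list(zip(chars, tags)))
```

For the phrase above this yields the tag sequence B E S S S B E, one tag for each of the seven characters.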
PKU (Peking University corpus, a standard dataset for benchmarking Chinese word segmentation)
[Yao+, 2016] (very similar to the model above) [^1]
The training log is shown below as is.
hiroki-t:/private/work/blstm-cws$ python app/train.py --save -e 10 --gpu 0
2016-12-03 09:34:06.27 JST 13a653 [info] LOG Start with ACCESSID=[13a653] UNIQUEID=[UNIQID] ACCESSTIME=[2016-12-03 09:34:06.026907 JST]
2016-12-03 09:34:06.27 JST 13a653 [info] *** [START] ***
2016-12-03 09:34:06.27 JST 13a653 [info] initialize preprocessor with /private/work/blstm-cws/app/../data/zhwiki-embeddings-100.txt
2016-12-03 09:34:06.526 JST 13a653 [info] load train dataset from /private/work/blstm-cws/app/../data/icwb2-data/training/pku_training.utf8
2016-12-03 09:34:14.134 JST 13a653 [info] load test dataset from /private/work/blstm-cws/app/../data/icwb2-data/gold/pku_test_gold.utf8
2016-12-03 09:34:14.589 JST 13a653 [trace]
2016-12-03 09:34:14.589 JST 13a653 [trace] initialize ...
2016-12-03 09:34:14.589 JST 13a653 [trace] --------------------------------
2016-12-03 09:34:14.589 JST 13a653 [info] # Minibatch-size: 20
2016-12-03 09:34:14.589 JST 13a653 [info] # epoch: 10
2016-12-03 09:34:14.589 JST 13a653 [info] # gpu: 0
2016-12-03 09:34:14.589 JST 13a653 [info] # hyper-parameters: {'adagrad_lr': 0.2, 'dropout_ratio': 0.2, 'weight_decay': 0.0001}
2016-12-03 09:34:14.590 JST 13a653 [trace] --------------------------------
2016-12-03 09:34:14.590 JST 13a653 [trace]
100% (19054 of 19054) |#######################################| Elapsed Time: 0:07:50 Time: 0:07:50
2016-12-03 09:42:05.642 JST 13a653 [info] [training] epoch 1 - #samples: 19054, loss: 9.640346, accuracy: 0.834476
100% (1944 of 1944) |#########################################| Elapsed Time: 0:00:29 Time: 0:00:29
2016-12-03 09:42:34.865 JST 13a653 [info] [evaluation] epoch 1 - #samples: 1944, loss: 6.919845, accuracy: 0.890557
2016-12-03 09:42:34.866 JST 13a653 [trace] -
100% (19054 of 19054) |#######################################| Elapsed Time: 0:07:40 Time: 0:07:40
2016-12-03 09:50:15.258 JST 13a653 [info] [training] epoch 2 - #samples: 19054, loss: 5.526157, accuracy: 0.903373
100% (1944 of 1944) |#########################################| Elapsed Time: 0:00:24 Time: 0:00:24
2016-12-03 09:50:39.400 JST 13a653 [info] [evaluation] epoch 2 - #samples: 1944, loss: 6.233129, accuracy: 0.900318
2016-12-03 09:50:39.401 JST 13a653 [trace] -
100% (19054 of 19054) |#######################################| Elapsed Time: 0:08:41 Time: 0:08:41
2016-12-03 09:59:21.301 JST 13a653 [info] [training] epoch 3 - #samples: 19054, loss: 4.217260, accuracy: 0.921377
100% (1944 of 1944) |#########################################| Elapsed Time: 0:00:24 Time: 0:00:24
2016-12-03 09:59:45.587 JST 13a653 [info] [evaluation] epoch 3 - #samples: 1944, loss: 5.650668, accuracy: 0.913843
2016-12-03 09:59:45.587 JST 13a653 [trace] -
100% (19054 of 19054) |#######################################| Elapsed Time: 0:07:25 Time: 0:07:25
2016-12-03 10:07:11.451 JST 13a653 [info] [training] epoch 4 - #samples: 19054, loss: 3.488712, accuracy: 0.931668
100% (1944 of 1944) |#########################################| Elapsed Time: 0:00:26 Time: 0:00:26
2016-12-03 10:07:37.889 JST 13a653 [info] [evaluation] epoch 4 - #samples: 1944, loss: 5.342249, accuracy: 0.917103
2016-12-03 10:07:37.890 JST 13a653 [trace] -
100% (19054 of 19054) |#######################################| Elapsed Time: 0:07:26 Time: 0:07:26
2016-12-03 10:15:03.919 JST 13a653 [info] [training] epoch 5 - #samples: 19054, loss: 2.995683, accuracy: 0.938305
100% (1944 of 1944) |#########################################| Elapsed Time: 0:00:15 Time: 0:00:15
2016-12-03 10:15:19.749 JST 13a653 [info] [evaluation] epoch 5 - #samples: 1944, loss: 5.320374, accuracy: 0.921863
2016-12-03 10:15:19.750 JST 13a653 [trace] -
100% (19054 of 19054) |########################################| Elapsed Time: 0:07:29 Time: 0:07:29
2016-12-03 10:22:49.393 JST 13a653 [info] [training] epoch 6 - #samples: 19054, loss: 2.680496, accuracy: 0.943861
100% (1944 of 1944) |##########################################| Elapsed Time: 0:00:27 Time: 0:00:27
2016-12-03 10:23:16.985 JST 13a653 [info] [evaluation] epoch 6 - #samples: 1944, loss: 5.326864, accuracy: 0.924161
2016-12-03 10:23:16.986 JST 13a653 [trace] -
100% (19054 of 19054) |########################################| Elapsed Time: 0:07:28 Time: 0:07:28
2016-12-03 10:30:45.772 JST 13a653 [info] [training] epoch 7 - #samples: 19054, loss: 2.425466, accuracy: 0.947673
100% (1944 of 1944) |##########################################| Elapsed Time: 0:00:22 Time: 0:00:22
2016-12-03 10:31:08.448 JST 13a653 [info] [evaluation] epoch 7 - #samples: 1944, loss: 5.270019, accuracy: 0.925341
2016-12-03 10:31:08.449 JST 13a653 [trace] -
100% (19054 of 19054) |########################################| Elapsed Time: 0:08:39 Time: 0:08:39
2016-12-03 10:39:47.461 JST 13a653 [info] [training] epoch 8 - #samples: 19054, loss: 2.233068, accuracy: 0.950928
100% (1944 of 1944) |##########################################| Elapsed Time: 0:00:26 Time: 0:00:26
2016-12-03 10:40:14.2 JST 13a653 [info] [evaluation] epoch 8 - #samples: 1944, loss: 5.792994, accuracy: 0.924707
2016-12-03 10:40:14.2 JST 13a653 [trace] -
100% (19054 of 19054) |########################################| Elapsed Time: 0:07:10 Time: 0:07:10
2016-12-03 10:47:24.806 JST 13a653 [info] [training] epoch 9 - #samples: 19054, loss: 2.066807, accuracy: 0.953524
100% (1944 of 1944) |##########################################| Elapsed Time: 0:00:26 Time: 0:00:26
2016-12-03 10:47:51.745 JST 13a653 [info] [evaluation] epoch 9 - #samples: 1944, loss: 5.864374, accuracy: 0.925294
2016-12-03 10:47:51.746 JST 13a653 [trace] -
100% (19054 of 19054) |########################################| Elapsed Time: 0:08:43 Time: 0:08:43
2016-12-03 10:56:34.758 JST 13a653 [info] [training] epoch 10 - #samples: 19054, loss: 1.946193, accuracy: 0.955782
100% (1944 of 1944) |##########################################| Elapsed Time: 0:00:22 Time: 0:00:22
2016-12-03 10:56:57.641 JST 13a653 [info] [evaluation] epoch 10 - #samples: 1944, loss: 5.284819, accuracy: 0.930201
2016-12-03 10:56:57.642 JST 13a653 [trace] -
2016-12-03 10:56:57.642 JST 13a653 [info] saving the model to /private/work/blstm-cws/app/../output/cws.model ...
2016-12-03 10:56:58.520 JST 13a653 [info] *** [DONE] ***
2016-12-03 10:56:58.521 JST 13a653 [info] LOG End with ACCESSID=[13a653] UNIQUEID=[UNIQID] ACCESSTIME=[2016-12-03 09:34:06.026907 JST] PROCESSTIME=[4972.494370000]
Note that the figure reported here is accuracy, not precision, recall, or F-score; it reaches 93.0% at the 10th epoch. Training for 10 epochs took a little over 80 minutes (PROCESSTIME in the log is about 4,972 seconds).
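The distinction matters because per-character accuracy and word-level precision, recall, and F-score can differ noticeably for the same prediction. The sketch below is a plain-Python illustration with made-up tag sequences. The function names are hypothetical, and this is not the evaluation code used for the log above or in the cited papers.

```python
def spans(tags):
    """Return the set of (start, end) word spans encoded by a B/M/E/S sequence."""
    result, start = set(), 0
    for i, tag in enumerate(tags):
        if tag in ("E", "S"):  # a word closes at E or S
            result.add((start, i + 1))
            start = i + 1
    return result


def evaluate(gold, pred):
    """Per-character accuracy plus word-level precision, recall, and F1."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    gold_spans, pred_spans = spans(gold), spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return accuracy, precision, recall, f1


gold = ["B", "E", "S", "B", "E"]   # words of 2 + 1 + 2 characters
pred = ["B", "E", "S", "S", "S"]   # the last word is wrongly split in two
accuracy, precision, recall, f1 = evaluate(gold, pred)
print(accuracy, precision, recall, f1)
```

Here three of the five tags are correct, but only two of the four predicted words match the three gold words, so the word-level scores are lower than the tag accuracy.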
[Yao+, 2016] [^2]
All the models are trained on NVIDIA GTX Geforce 970, it took about 16 to 17 hours to train a model on GPU while more than 4 days to train on CPU, in contrast.
[Yao+, 2016]
There are some differences from the previous work, such as how the embeddings are initialized, but the accuracy and training time look reasonable for a 1-layer BLSTM.
Decoding
hiroki-t:/private/work/blstm-cws$ python app/parse.py
2016-12-03 11:01:13.343 JST 549e15 [info] LOG Start with ACCESSID=[549e15] UNIQUEID=[UNIQID] ACCESSTIME=[2016-12-03 11:01:13.343412 JST]
2016-12-03 11:01:13.343 JST 549e15 [info] *** [START] ***
2016-12-03 11:01:13.344 JST 549e15 [info] initialize preprocessor with /private/work/blstm-cws/app/../data/zhwiki-embeddings-100.txt
2016-12-03 11:01:13.834 JST 549e15 [trace]
2016-12-03 11:01:13.834 JST 549e15 [trace] initialize ...
2016-12-03 11:01:13.834 JST 549e15 [trace]
2016-12-03 11:01:13.914 JST 549e15 [info] loading a model from /private/work/blstm-cws/app/../output/cws.model ...
Input a Chinese sentence! (use 'q' to exit)
The third step of modernization and construction for the completion of the Chinese people's entry.
B E B E B E S S B M E B E B E S B E B E B E S S B E S
The entry of the Chinese people into the modernized construction, the third step strategy, and the progressive new conquest.
-
q
2016-12-03 11:02:08.961 JST 549e15 [info] *** [DONE] ***
2016-12-03 11:02:08.962 JST 549e15 [info] LOG End with ACCESSID=[549e15] UNIQUEID=[UNIQID] ACCESSTIME=[2016-12-03 11:01:13.343412 JST] PROCESSTIME=[55.618552000]
# ^note[gold]The entry of the Chinese people into the modernized construction, the third step strategy, and the progressive new conquest.
When decoding with the trained model, the correct label sequence and word segmentation are returned for the unsegmented input string.
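Turning a predicted label sequence back into words is the inverse of the B/M/E/S labeling. A minimal sketch follows, assuming the labels are given as a list of strings. The function name `bmes_to_words` is hypothetical, and the lenient handling of a sequence that ends in B or M is a choice made for this example, not necessarily what the repository does.

```python
def bmes_to_words(chars, tags):
    """Rebuild words from characters and their predicted B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        # A word ends at E (end of a multi-character word) or S (single character).
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends in B or M
        words.append(current)
    return words


segmented = bmes_to_words("现代化建设", ["B", "M", "E", "B", "E"])
print(segmented)
```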
I trained a sequence labeling model with a bi-directional LSTM built on Chainer's NStep LSTM. Thanks to variable-length mini-batches and cuDNN support, preparing the input data is easier and computation is faster than before. The model implemented here is not limited to Chinese word segmentation and can be used for sequence labeling in general, so it may be interesting to apply it to other tasks such as part-of-speech tagging.
The source code is available on GitHub. https://github.com/chantera/blstm-cws
In addition to the BLSTM introduced above, the repository contains a BLSTM + CRF implementation and code that I actually use with Chainer in my NLP research, so I hope you find it helpful.
- Mini-batching variable-length data with Chainer's where - studylog / North cloud http://studylog.hateblo.jp/entry/2016/02/04/020547
- The beginning of Chainer's cuDNN-RNN (NStepLSTM) - studylog / Northern clouds http://studylog.hateblo.jp/entry/2016/10/03/095406
- Chainer's NStep LSTM predicts comments on Nico Nico Douga. - Monthly Hacker's Blog http://www.monthly-hack.com/entry/2016/10/24/200000
written by chantera at NAIST cllab
[^1]: In [Yao+, 2016], the BLSTM output vector v ∈ R^{2d} is projected back to d dimensions by a matrix W ∈ R^{d×2d}.
[^2]: In [Yao+, 2016], the word embeddings are 200-dimensional, and the dictionary is built from the characters of the training set, without pretraining.