I applied SentencePiece, a tokenizer for neural language processing, to document classification.
The other day I learned about SentencePiece as a word-segmentation option for natural language processing. It reportedly beat conventional word segmentation on machine translation, which made me curious, so I decided to see how it would do on the document classification task I am currently working on.
- SentencePiece (GitHub)
- Article by the author, taku910 (Qiita)
I used the KNB Analyzed Blog Corpus, an annotated blog corpus of 4,186 sentences divided into four categories: "Kyoto sightseeing", "mobile phones", "sports", and "gourmet", with morphological and case annotations. Here I use only the categories and sentences, and the task is to classify which category each sentence belongs to. 10% of the data was held out as test data, and the remaining 90% was used to train both SentencePiece and the neural network.
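For concreteness, the 90/10 split can be sketched as follows. This is my own illustration, not the repository's code, and it assumes the sentences and their category labels have already been loaded into two parallel lists.

```python
# Rough sketch of the held-out split (assumed helper, not the author's code).
import random

def split_train_test(sentences, labels, test_ratio=0.1, seed=0):
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)      # shuffle reproducibly
    n_test = int(len(indices) * test_ratio)   # 10% goes to the test set
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(sentences[i], labels[i]) for i in train_idx]
    test = [(sentences[i], labels[i]) for i in test_idx]
    return train, test
```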
I ran everything with Python 3.6.1 on Bash on Windows. See requirement.txt for the exact versions of the Python modules.
The full code is on GitHub. I am still learning, so I would appreciate it if you could point out any mistakes or offer advice.
SentencePiece seems to be intended mainly for command-line use, but I wanted to call it from Python and use it alongside MeCab, so, although it is not very elegant, I invoked it via subprocess.
```python
# requires: import subprocess
def train_sentencepiece(self, vocab_size):
    # build the spm_train command line and run it as a subprocess
    cmd = "spm_train --input=" + self.native_text + \
          " --model_prefix=" + self.model_path + \
          " --vocab_size=" + str(vocab_size)
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout_data, stderr_data = p.communicate()
```
When segmenting text with the trained model, processing one line at a time was far too slow, so I write everything to a text file once and encode it in a single pass. I also added a character-by-character split function for character-level training, although I did not use it this time.
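As a rough sketch of this "write once, encode in one call" idea (my own reconstruction, not the repository's code; the temporary file name and the assumption that the model file is model_prefix + ".model" are mine):

```python
# Encode all sentences with a single spm_encode invocation (sketch, assumed names).
import subprocess

def encode_with_model(model_path, sentences, tmp_path="sp_input.txt"):
    # dump every sentence to one file, one sentence per line
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sentences) + "\n")
    # one spm_encode call over the whole file instead of per-line calls
    cmd = "spm_encode --model=" + model_path + ".model --output_format=piece < " + tmp_path
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout_data, stderr_data = p.communicate()
    # one line of space-separated pieces per input sentence
    return [line.split() for line in stdout_data.decode("utf-8").splitlines()]
```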
This time I compare SentencePiece with MeCab + NEologd.
I wanted to use an LSTM for the network, but training would take too long, so I went with a CNN this time. The network uses three convolution filters spanning 3, 4, and 5 tokens.
```python
import chainer.functions as F
import chainer.links as L
from chainer import Chain, Variable, initializers, using_config


class CNN(Chain):
    def __init__(self, n_vocab, n_units, n_out, filter_size=(3, 4, 5), stride=1, use_dropout=0.5, ignore_label=-1):
        super(CNN, self).__init__()
        initializer = initializers.HeNormal()
        with self.init_scope():
            # token embedding followed by three parallel convolution branches
            self.word_embed = L.EmbedID(n_vocab, n_units, ignore_label=-1)
            self.conv1 = L.Convolution2D(None, n_units, (filter_size[0], n_units), stride, pad=(filter_size[0], 0), initialW=initializer)
            self.conv2 = L.Convolution2D(None, n_units, (filter_size[1], n_units), stride, pad=(filter_size[1], 0), initialW=initializer)
            self.conv3 = L.Convolution2D(None, n_units, (filter_size[2], n_units), stride, pad=(filter_size[2], 0), initialW=initializer)
            self.norm1 = L.BatchNormalization(n_units)
            self.norm2 = L.BatchNormalization(n_units)
            self.norm3 = L.BatchNormalization(n_units)
            self.l1 = L.Linear(None, n_units)
            self.l2 = L.Linear(None, n_out)
        self.use_dropout = use_dropout
        self.filter_size = filter_size

    def forward(self, x, train):
        with using_config('train', train):
            x = Variable(x)
            x = self.word_embed(x)
            x = F.dropout(x, ratio=self.use_dropout)
            x = F.expand_dims(x, axis=1)
            # convolution -> batch norm -> ReLU -> max pooling for each filter size
            x1 = F.relu(self.norm1(self.conv1(x)))
            x1 = F.max_pooling_2d(x1, self.filter_size[0])
            x2 = F.relu(self.norm2(self.conv2(x)))
            x2 = F.max_pooling_2d(x2, self.filter_size[1])
            x3 = F.relu(self.norm3(self.conv3(x)))
            x3 = F.max_pooling_2d(x3, self.filter_size[2])
            # concatenate the three branches and classify
            x = F.concat((x1, x2, x3), axis=2)
            x = F.dropout(F.relu(self.l1(x)), ratio=self.use_dropout)
            x = self.l2(x)
            return x
```
Other parameters were set as follows.
Number of units | Mini-batch size | Max epochs | WeightDecay | GradientClipping | Optimizer
---|---|---|---|---|---
256 | 32 | 30 | 0.001 | 5.0 | Adam
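As a reference, here is a minimal sketch of how these settings map onto Chainer's API. This is my reconstruction, not the training script from the repository; n_vocab is a placeholder value.

```python
# Optimizer setup matching the table above (sketch; n_vocab is assumed).
from chainer import optimizer, optimizers

n_vocab = 8000  # placeholder: the real value depends on the tokenizer's vocabulary
model = CNN(n_vocab=n_vocab, n_units=256, n_out=4)  # 4 blog categories

opt = optimizers.Adam()
opt.setup(model)
opt.add_hook(optimizer.WeightDecay(0.001))      # WeightDecay 0.001
opt.add_hook(optimizer.GradientClipping(5.0))   # GradientClipping 5.0
# training then iterates over mini-batches of 32 sentences for up to 30 epochs
```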
Tokenizer | mecab+neologd | SentencePiece |
---|---|---|
Best accuracy | 0.68496418 | 0.668257773 |
Hmm... not great. Since only the training text was used to train SentencePiece, maybe the amount of text was simply too small? So I tried training the SentencePiece model on jawiki (the 2017/05/01 dump).
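The training call itself looks much like the earlier spm_train invocation, just pointed at the extracted Wikipedia text. The file name and vocabulary size below are placeholders, not the settings actually used in this experiment.

```python
# Sketch: train SentencePiece on a plain-text extraction of the jawiki dump.
import subprocess

cmd = ("spm_train --input=jawiki_extracted.txt"   # assumed file name
       " --model_prefix=jawiki_sp"                # assumed model prefix
       " --vocab_size=8000")                      # assumed vocabulary size
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout_data, stderr_data = p.communicate()
```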
Tokenizer | mecab+neologd | SentencePiece | SentencePiece (trained on jawiki) |
---|---|---|---|
Best accuracy | 0.68496418 | 0.668257773 | 0.758949876 |
That looks much better.
The accuracy for each epoch is as follows.
Here are some samples of how sentences were actually segmented.
【SentencePiece】
It's too small / te / press / press / spicy /.
How / how much / spear / take / ga / continuation / ku / kana / a / ♪
Another / one / stubborn / tension / ri /, / I think / I think /.
【SentencePiece (trained on jawiki)】
Small / Sa / Too / Te / Button / Press / Spicy / Spicy / of /.
Do / no / ku / rai / ya / ri / take / continue / no / kana / a / ♪
A / and / Already / One / Stubborn / Zhang / Ri /, / Shi / Yo / Ka / To / Think / U /.
【mecab + neologd】
Small / too / te / button / press / spicy / of / is / is /.
How much / how much / exchange / is / continues / / or / hey / ♪
Ah / and another / one / do my best /, / let's / or / think / think /.
Even with the same SentencePiece, the model trained on jawiki seems to segment more finely. MeCab + NEologd feels closest to how a human would split the text, but it is interesting that this does not necessarily mean it gives better results when training a neural network.
This time all hyperparameters, such as the number of units, were fixed, so a proper tuning pass for each tokenizer and a comparison of their respective best results is still needed. Also, as mentioned briefly in the section on segmentation, I would like to try a comparison with character-level training. Finally, since I trained SentencePiece on text different from the training data (jawiki this time), I would like to examine how the choice of that external text affects accuracy.
At first I tried this on CentOS (Docker), but I could not get SentencePiece to install and gave up. It seems, however, that it can be installed on CentOS if you add the following packages on top of the procedure on GitHub.
```
$ yum install protobuf-devel boost-devel gflags-devel lmdb-devel
```