Last time, I covered the basic usage of Chainer, an implementation of an MLP (multilayer perceptron), and the formula needed to compute the number of nodes in the fully connected layer that follows the convolution layer of a CNN.
This time, I will read the actual Twitter data and build a CNN.
I found the explanation of convolutional neural networks here easy to understand.
A filter is applied to the two-dimensional image data to compress its features, and pooling is then applied to extract them further. There is usually not just one filter; a different filter is applied for each feature map (channel) you want to output.
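To make the shapes concrete, here is a tiny toy check of my own (not from the linked explanation), using the same Chainer v1 style API as the rest of this article: three 3×3 filters applied to a 10×10 single-channel image give three 8×8 feature maps, and 2×2 max pooling halves them to 4×4.

import numpy as np
import chainer.functions as F
from chainer import Variable

conv = F.Convolution2D(1, 3, 3)  # in_channels=1, out_channels=3, 3x3 filter
x = Variable(np.random.randn(1, 1, 10, 10).astype(np.float32))  # one 10x10 single-channel "image"
h = conv(x)                 # -> shape (1, 3, 8, 8): 10 - 3 + 1 = 8
p = F.max_pooling_2d(h, 2)  # -> shape (1, 3, 4, 4)
print(h.data.shape, p.data.shape)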
I will refer to this article and this paper, and try to imitate them.
The outline of the process is as follows.
The CNN is defined and trained using the method from the paper, illustrated in the figure below.
The figure depicts the convolution and pooling process for a single sentence: in the large matrix on the far left, $ d $ is the dimension of the word vectors and $ s $ is the number of words in the sentence. The filter is not a square matrix but an asymmetric one of size $ d \times m $.
One thing I did not understand even after reading the paper is that $ s $ differs from sentence to sentence, so I was wondering how to handle it. In this article (/items/93fcb2bc27d7b268cbe6), the maximum number of words over all the tweets is used for every sentence, so I imitate that.
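To summarize the dimensions in my own words (a restatement, not a quote from the paper): convolving the $ d \times s $ sentence matrix with a $ d \times m $ filter yields a feature map of size $ 1 \times (s - m + 1) $ per filter, and max pooling over the full width of that feature map collapses it to a single value per filter. Padding every sentence to the common maximum length simply fixes $ s $ in advance so that all sentences fit into one tensor.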
Get the tweet data from https://raw.githubusercontent.com/satwantrana/CharSCNN/master/tweets_clean.txt. In this data, the first column is a [0,1] flag and the second column is a tweet in English.
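If you want to fetch the file from a script, something like the following works (a minimal sketch; data/tweets_clean.txt is the path the later code assumes, and the data directory must already exist).

import urllib  # Python 2, matching the rest of the code; on Python 3 use urllib.request

url = "https://raw.githubusercontent.com/satwantrana/CharSCNN/master/tweets_clean.txt"
urllib.urlretrieve(url, "data/tweets_clean.txt")  # save to the path used below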
Word Embedding
I want to treat each sentence as a two-dimensional image, so I convert the words to a distributed representation. This time Chainer's built-in EmbedID did not work well for me, so I use Word2Vec from the gensim package.
First, assign an ID to each word in the text data.
# -*- coding: utf-8 -*-

def read(inp_file, num_sent=None):
    f_in = open(inp_file, 'r')
    lines = f_in.readlines()
    words_map = {}
    word_cnt = 0
    k_wrd = 5  # word context window
    y = []
    x_wrd = []
    if num_sent is None:
        num_sent = len(lines)
        max_sen_len = 0
    else:
        max_sen_len, num_sent = 0, num_sent
    words_vocab_mat = []
    token_list = []
    # First pass: build the vocabulary and find the longest sentence
    for line in lines[:num_sent]:
        words = line[:-1].split()
        tokens = words[1:]
        y.append(int(float(words[0])))
        max_sen_len = max(max_sen_len, len(tokens))
        for token in tokens:
            if token not in words_map:
                words_map[token] = word_cnt
                token_list.append(token)
                word_cnt += 1
        words_vocab_mat.append(tokens)
    # Second pass: convert each sentence to a fixed-length row of word ids, padded with -1
    cnt = 0
    for line in lines[:num_sent]:
        words = line[:-1].split()
        cnt += 1
        tokens = words[1:]
        word_mat = [-1] * (max_sen_len + k_wrd - 1)
        for i in range(len(tokens)):
            word_mat[(k_wrd // 2) + i] = words_map[tokens[i]]  # shift by half the window
        x_wrd.append(word_mat)
    max_sen_len += k_wrd - 1
    # num_sent: number of documents (sentences)
    # word_cnt: number of distinct words (vocabulary size)
    # max_sen_len: maximum document length, including the window padding
    # x_wrd: word-id matrix; rows = number of sentences (num_sent), columns = maximum document length (max_sen_len)
    # k_wrd: window size
    # words_map: key = word, value = id
    # y: 1 or 0 (i.e., positive or negative)
    # words_vocab_mat: tokenized sentences; rows = number of sentences, columns = variable (number of words)
    # token_list: list of tokens whose index corresponds to the id
    data = (num_sent, word_cnt, max_sen_len, k_wrd, x_wrd, y, words_map, words_vocab_mat, token_list)
    return data
(num_sent, word_cnt, max_sen_len, k_wrd, x_wrd, y,words_map,sentences,token_list) = load.read("data/tweets_clean.txt",10000)
x_wrd is a matrix of size (number of sentences) × (maximum document length), and each element is the ID of the word at that position. Keep words_map, token_list, and words_vocab_mat as well; they are needed later.
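As a toy illustration of x_wrd (my own example, not taken from the data): with k_wrd = 5, a two-word sentence whose words have ids 0 and 1, in a corpus whose longest sentence has 4 words, becomes

word_mat = [-1, -1, 0, 1, -1, -1, -1, -1]

that is, the ids are shifted right by k_wrd // 2 = 2 positions and every unused slot keeps the value -1, which is later replaced by a zero vector.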
Next, use Word2Vec to obtain a vector representation of each word, and then build what I call a "sentence image matrix" (my own term).
"""Create a word vector space in Word2Vec"""
word_dimension = 200
from gensim.models import Word2Vec
model_w2v = Word2Vec(sentences,seed=123,size=word_dimension,min_count=0,window=5)
sentence_image_matrix = np.zeros((len(sentences),1,word_dimension,max_sen_len)) #Initialization of sentence image matrix for convolution
"""x_Generate a vector for wrd"""
for i in range(0,len(x_wrd)):
tmp_id_list = x_wrd[i,:]
for j in range(0,len(tmp_id_list)):
"""Turn for one line"""
id = tmp_id_list[j]
if id == -1:
"""No information"""
sentence_image_matrix[i,0,:,j] = [0.] * word_dimension #Insert 0 vector
else:
target_word = token_list[id]
sentence_image_matrix[i,0,:,j] = model_w2v[target_word]
sentence_image_matrix is defined as a 4-dimensional tensor with the size (number of sentences, 1, vector dimension = 200, maximum sentence length).
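A quick check of that shape (my addition; the first dimension depends on how many tweets were read):

print(sentence_image_matrix.shape)  # e.g. (10000, 1, 200, max_sen_len)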
I learned for the first time that scikit-learn's train_test_split can also be used on 4D tensors, presumably because it only splits along the first axis.
"""Divide into training data and test data"""
sentence_image_matrix = np.array(sentence_image_matrix,dtype=np.float32)
N = len(sentence_image_matrix)
t_n = 0.33
x_train,x_test,y_train,y_test = train_test_split(sentence_image_matrix,y,test_size=t_n,random_state=123)
The tricky part is the definition of the CNN. The paper uses an asymmetric filter, and the pooling is asymmetric as well, so this has to be taken into account.
Then it looks like this.
# math, numpy (np), six, chainer.functions (F), FunctionSet and Variable are assumed
# to be imported as in the previous article's ChainerClassifier code.
class CNNFiltRow(ChainerClassifier):
    """
    CNN whose filter spans every row of the input and slides only in the column direction
    """

    def _setup_network(self, **params):
        self.input_dim = params["input_dim"]  # number of columns of one image
        self.in_channels = params["in_channels"]  # input channels: default = 1
        self.out_channels = params["out_channels"]  # output channels: arbitrary
        self.row_dim = params["row_dim"]  # number of rows of one image = number of rows of the filter
        self.filt_clm = params["filt_clm"]  # number of filter columns
        self.pooling_row = params["pooling_row"] if "pooling_row" in params else 1  # number of pooling rows: default = 1
        self.pooling_clm = params["pooling_clm"] if "pooling_clm" in params else int(self.input_dim - 2 * math.floor(self.filt_clm / 2.))  # number of pooling columns: default = full width of the convolution output
        self.batch_size = params["batch_size"] if "batch_size" in params else 100
        self.hidden_dim = params["hidden_dim"]
        self.n_classes = params["n_classes"]
        self.conv1_out_dim = int(math.floor((self.input_dim - 2 * math.floor(self.filt_clm / 2.)) / self.pooling_clm))
        network = FunctionSet(
            conv1=F.Convolution2D(self.in_channels, self.out_channels, (self.row_dim, self.filt_clm)),  # asymmetric filter
            l1=F.Linear(self.conv1_out_dim * self.out_channels, self.hidden_dim),
            l2=F.Linear(self.hidden_dim, self.hidden_dim),
            l3=F.Linear(self.hidden_dim, self.n_classes),
        )
        return network

    def forward(self, x, train=True):
        h = F.max_pooling_2d(F.relu(self.network.conv1(x)), (self.pooling_row, self.pooling_clm))  # asymmetric pooling
        h1 = F.dropout(F.relu(self.network.l1(h)), train=train)
        h2 = F.dropout(F.relu(self.network.l2(h1)), train=train)
        y = self.network.l3(h2)
        return y

    def output_func(self, h):
        return F.softmax(h)

    def loss_func(self, y, t):
        return F.softmax_cross_entropy(y, t)

    def fit(self, x_data, y_data):
        batchsize = self.batch_size
        N = len(y_data)
        for loop in range(self.n_iter):
            perm = np.random.permutation(N)
            sum_accuracy = 0
            sum_loss = 0
            for i in six.moves.range(0, N, batchsize):
                x_batch = x_data[perm[i:i + batchsize]]
                y_batch = y_data[perm[i:i + batchsize]]
                x = Variable(x_batch)
                y = Variable(y_batch)
                self.optimizer.zero_grads()
                yp = self.forward(x)
                loss = self.loss_func(yp, y)
                loss.backward()
                self.optimizer.update()
                sum_loss += loss.data * len(y_batch)
                sum_accuracy += F.accuracy(yp, y).data * len(y_batch)
            if self.report > 0 and loop % self.report == 0:
                print('loop={}, train mean loss={} , train mean accuracy={}'.format(loop, sum_loss / N, sum_accuracy / N))
        return self

    def fit_test(self, x_data, y_data, x_test, y_test):
        batchsize = self.batch_size
        N = len(y_data)
        Nt = len(y_test)
        train_ac = []
        test_ac = []
        for loop in range(self.n_iter):
            perm = np.random.permutation(N)
            permt = np.random.permutation(Nt)
            sum_accuracy = 0
            sum_loss = 0
            sum_accuracy_t = 0
            """Training phase"""
            for i in six.moves.range(0, N, batchsize):
                x_batch = x_data[perm[i:i + batchsize]]
                y_batch = y_data[perm[i:i + batchsize]]
                x = Variable(x_batch)
                y = Variable(y_batch)
                self.optimizer.zero_grads()
                yp = self.forward(x)
                loss = self.loss_func(yp, y)
                loss.backward()
                self.optimizer.update()
                sum_loss += loss.data * len(y_batch)
                sum_accuracy += F.accuracy(yp, y).data * len(y_batch)
            """Test phase"""
            for i in six.moves.range(0, Nt, batchsize):
                x_batch = x_test[permt[i:i + batchsize]]
                y_batch = y_test[permt[i:i + batchsize]]
                x = Variable(x_batch)
                y = Variable(y_batch)
                yp = self.forward(x, False)
                sum_accuracy_t += F.accuracy(yp, y).data * len(y_batch)
            if self.report > 0 and loop % self.report == 0:
                print('loop={}, train mean loss={} , train mean accuracy={} , test mean accuracy={}'.format(loop, sum_loss / N, sum_accuracy / N, sum_accuracy_t / Nt))
            train_ac.append(sum_accuracy / N)
            test_ac.append(sum_accuracy_t / Nt)
        return self, train_ac, test_ac
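To spell out the dimensions in _setup_network (my own arithmetic, not from the article): with no padding, the convolution output width is input_dim - 2 * floor(filt_clm / 2), and the default pooling_clm is exactly that width, so max pooling collapses each feature map to a single value and conv1_out_dim = 1. That is why l1 takes conv1_out_dim * out_channels inputs; with the parameters used below (filt_clm = 3, out_channels = 20), the first fully connected layer sees 20 values per sentence.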
Please refer to the previous article for ChainerClassifier.
I also wanted to see the accuracy of the test data, so I added the fit_test method.
"""Learning CNN Filter Row"""
n_iter = 200
report = 5
params = {"input_dim":max_sen_len,"in_channels":1,"out_channels":20,"row_dim":word_dimension,"filt_clm":3,"batch_size":100,"hidden_dim":300,"n_classes":2}
cnn = CNNFiltRow(n_iter=n_iter,report=report,**params)
cnn,train_ac,test_ac = cnn.fit_test(x_train,y_train,x_test,y_test)
Below is a plot of the accuracy during training and the accuracy on the test data. Overfitting seems to start around iteration 100, but my impression is that the generalization performance is still high.
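The plot itself can be produced with matplotlib from the train_ac and test_ac lists returned by fit_test (a minimal sketch; the plotting code is not part of the original article).

import matplotlib.pyplot as plt

plt.plot(train_ac, label="train accuracy")
plt.plot(test_ac, label="test accuracy")
plt.xlabel("iteration")
plt.ylabel("accuracy")
plt.legend(loc="lower right")
plt.show()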
Feeding the test data into the final model and computing each metric gives the following.
[CNN]P AUC: 0.80 Pres: 0.66 Recl: 0.89 Fscr: 0.76
The F score was 0.76 and the AUC was 0.8, which was pretty good.
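For reference, these metrics can be computed with scikit-learn along the following lines (a sketch that assumes ChainerClassifier exposes predict and predict_proba methods, which are not shown in this article).

from sklearn import metrics

y_prob = cnn.predict_proba(x_test)[:, 1]  # probability of the positive class (assumed method)
y_pred = cnn.predict(x_test)              # hard 0/1 predictions (assumed method)
print("AUC : {:.2f}".format(metrics.roc_auc_score(y_test, y_prob)))
print("Pres: {:.2f}".format(metrics.precision_score(y_test, y_pred)))
print("Recl: {:.2f}".format(metrics.recall_score(y_test, y_pred)))
print("Fscr: {:.2f}".format(metrics.f1_score(y_test, y_pred)))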
As benchmarks, I do the same with Random Forest and an MLP (multilayer perceptron). Since these models do not take two-dimensional input, the data is flattened to one dimension, as with MNIST (see the sketch below).
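Flattening here just means reshaping each (1, 200, max_sen_len) "sentence image" into a single row; the variable names below are mine.

x_train_flat = x_train.reshape(len(x_train), -1)  # (n_samples, 1 * 200 * max_sen_len)
x_test_flat = x_test.reshape(len(x_test), -1)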
The metrics on the same test data are as follows.
[RF ]P AUC: 0.71 Pres: 0.65 Recl: 0.60 Fscr: 0.62
[MLP]P AUC: 0.71 Pres: 0.64 Recl: 0.69 Fscr: 0.67
[CNN]P AUC: 0.80 Pres: 0.66 Recl: 0.89 Fscr: 0.76
What an overwhelming performance difference in favor of the CNN ...