[Text classification] I tried using an Attention mechanism with Convolutional Neural Networks.

Introduction

Convolutional Neural Networks (CNNs) are used not only for image processing but also for natural language processing, so why not combine them with an Attention mechanism? I gave it a try.

Attention mechanism

Simply put, it is a mechanism that lets the model focus more strongly on the important parts of the input (here, the sentence).

She is beautiful and has a good style, but she has the worst personality.

For example, when determining the sentiment polarity (positive or negative) of this sentence, a human judges it as negative by focusing on the final clause containing `worst`. In the same way, the Attention mechanism can place more weight on the `worst` part than on the `beautiful` and `good style` parts.

The mechanism was originally proposed for machine translation in Neural Machine Translation by Jointly Learning to Align and Translate [Bahdanau et al., ICLR 2015]. For those who want more detail, there are accessible explanatory articles on the topic.

Attention mechanism on CNN

Since its introduction in machine translation, the Attention mechanism has been applied to a variety of natural language processing tasks. However, most of those applications use RNN-based models such as LSTMs and GRUs. So this time I tried using the Attention mechanism with a CNN on a sentiment polarity classification task. As in the example above, sentiment polarity classification is the task of predicting whether a given input sentence expresses a positive or negative meaning.

Network model

The model is based on Convolutional Neural Networks for Sentence Classification [Kim, EMNLP 2014].

(Figure: the Kim (2014) CNN architecture for sentence classification)

Attention calculation

I referred to Hierarchical Attention Networks for Document Classification (https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf) [Yang et al., NAACL 2016], which introduces an Attention mechanism into a GRU-based RNN for document classification. Convolution produces a feature map $\boldsymbol{c} \in \mathcal{R}^{l-k+1}$:

\boldsymbol{c} = [c_1, c_2,\cdots,c_{l-k+1}]

where $l$ is the sentence length and $k$ is the window size. The importance of each element of this feature map $\boldsymbol{c}$ is then computed; this part is the Attention mechanism.

\begin{align}
p & = \sum_{i} a_i \odot c_i \\
a_i & = \frac{\exp(W^{(C2)} \tanh(W^{(C1)} c_i))} {\sum_{j} \exp(W^{(C2)} \tanh(W^{(C1)} c_j))}
\end{align}

$\odot$ denotes the element-wise product. $W^{(C1)} \in \mathcal{R}^{d \times 1}$ and $W^{(C2)} \in \mathcal{R}^{1 \times d}$ are weight matrices, and $d$ is a hyperparameter (I am not sure whether this weighting has a standard name). Each $a_i$ is a real number between 0 and 1, and the closer $a_i$ is to 1, the more important the corresponding $c_i$ is. One pooling result $p$ is output per feature map. From here on, the model is the same as Kim's CNN introduced above: the multiple $p$ values are concatenated into a vector $v$, which is then classified with a softmax classifier.

v = p^1 \oplus p^2 \oplus \cdots \oplus p^m

where $m$ is the number of feature maps; following Kim, it is set to 100 here.
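To make the calculation concrete, below is a minimal NumPy sketch of attention pooling over a single feature map. The feature-map length, $d$, and the weight values are hypothetical; in the actual model, $W^{(C1)}$ and $W^{(C2)}$ are learned parameters (a1 and a2 in the code further down).

import numpy as np

# Minimal sketch of attention pooling over one feature map (hypothetical sizes/values).
rng = np.random.default_rng(0)
c = rng.normal(size=5)            # feature map c = [c_1, ..., c_{l-k+1}]
d = 20                            # hyperparameter d
W_C1 = rng.normal(size=(d, 1))    # W^{(C1)} in R^{d x 1}
W_C2 = rng.normal(size=(1, d))    # W^{(C2)} in R^{1 x d}

# Score each scalar c_i, then normalize with a softmax to obtain the weights a_i.
scores = np.array([(W_C2 @ np.tanh(W_C1 * c_i)).item() for c_i in c])
a = np.exp(scores) / np.exp(scores).sum()

# Attention pooling: a weighted sum over the feature map gives one value p per map,
# instead of taking the single maximum as in max pooling.
p = (a * c).sum()
print(a.round(3), p)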

In other words, Attention is used in place of max pooling in the CNN's pooling layer. The figure below illustrates this.

(Figure: the CNN model with attention pooling in place of max pooling)

When Attention is used in an RNN, the importance is computed over hidden-layer vectors; here it is computed over the scalars obtained by convolution (which presumably carry n-gram information), so it was unclear whether this would work well.

Data

The Stanford Sentiment Treebank (SST) was used; details are in the experiment section below.

Code (network part)

cnn_attention.py

# Imports assumed by this snippet (Chainer v1-era API, gensim, NumPy).
# `xp` (numpy or cupy) and `args` (parsed command-line arguments such as
# args.gpu and args.classtype) are module-level globals defined elsewhere.
import copy
import numpy as np
import chainer.functions as F
import chainer.links as L
from chainer import Chain, Variable
from gensim.models import word2vec


class CNN_attention(Chain):
    def __init__(self, vocab_size, embedding_size, input_channel, output_channel_1, output_channel_2, output_channel_3, k1size, k2size, k3size, pooling_units, atten_size=20, output_size=args.classtype, train=True):
        super(CNN_attention, self).__init__(
            w2e = L.EmbedID(vocab_size, embedding_size),
            conv1 = L.Convolution2D(input_channel, output_channel_1, (k1size, embedding_size)),
            conv2 = L.Convolution2D(input_channel, output_channel_2, (k2size, embedding_size)),
            conv3 = L.Convolution2D(input_channel, output_channel_3, (k3size, embedding_size)),

            l1 = L.Linear(pooling_units, output_size),
            # Attention parameters: a1 and a2 correspond to W^{(C1)} and W^{(C2)} above
            a1 = L.Linear(1, atten_size),
            a2 = L.Linear(atten_size, 1),
        )
        self.output_size = output_size
        self.train = train
        self.embedding_size = embedding_size
        self.ignore_label = 0
        self.w2e.W.data[self.ignore_label] = 0
        self.w2e.W.data[1] = 0  # non-word token
        self.input_channel = input_channel

    def initialize_embeddings(self, word2id):
        #w_vector = word2vec.Word2Vec.load_word2vec_format('./vector/glove.840B.300d.txt', binary=False)  # GloVe
        w_vector = word2vec.Word2Vec.load_word2vec_format('./vector/GoogleNews-vectors-negative300.bin', binary=True)  # word2vec
        for word, id in sorted(word2id.items(), key=lambda x:x[1])[1:]:
            if word in w_vector:
                self.w2e.W.data[id] = w_vector[word]
            else:
                self.w2e.W.data[id] = np.reshape(np.random.uniform(-0.25,0.25,self.embedding_size),(self.embedding_size,))
    
    def __call__(self, x):
        h_list = list()
        ox = copy.copy(x)
        if args.gpu != -1:
            ox.to_gpu()
        
        x = xp.array(x.data)
        x = F.tanh(self.w2e(x))
        b, max_len, w = x.shape  # batch_size, max_len, embedding_size
        x = F.reshape(x, (b, self.input_channel, max_len, w))

        c1 = self.conv1(x)
        b, outputC, fixed_len, _ = c1.shape
        tf = self.set_tfs(ox, b, outputC, fixed_len)  # True/False mask for padding positions
        h1 = self.attention_pooling(F.relu(c1), b, outputC, fixed_len, tf)
        h1 = F.reshape(h1, (b, outputC))
        h_list.append(h1)

        c2 = self.conv2(x)
        b, outputC, fixed_len, _ = c2.shape
        tf = self.set_tfs(ox, b, outputC, fixed_len)  # True/False mask for padding positions
        h2 = self.attention_pooling(F.relu(c2), b, outputC, fixed_len, tf)
        h2 = F.reshape(h2, (b, outputC))
        h_list.append(h2)

        c3 = self.conv3(x)
        b, outputC, fixed_len, _ = c3.shape
        tf = self.set_tfs(ox, b, outputC, fixed_len)  # True/False mask for padding positions
        h3 = self.attention_pooling(F.relu(c3), b, outputC, fixed_len, tf)
        h3 = F.reshape(h3, (b, outputC))
        h_list.append(h3)

        h4 = F.concat(h_list)
        y = self.l1(F.dropout(h4, train=self.train))
        return y

    def set_tfs(self, x, b, outputC, fixed_len):
        TF = Variable(x[:,:fixed_len].data != 0, volatile='auto')
        TF = F.reshape(TF, (b, 1, fixed_len, 1))
        TF = F.broadcast_to(TF, (b, outputC, fixed_len, 1))
        return TF

    def attention_pooling(self, c, b, outputC, fixed_len, tf):
        # Score every scalar in every feature map: a2(tanh(a1(c_i))).
        reshaped_c = F.reshape(c, (b*outputC*fixed_len, 1))
        scala = self.a2(F.tanh(self.a1(reshaped_c)))
        reshaped_scala = F.reshape(scala, (b, outputC, fixed_len, 1))
        # Mask padding positions with a large negative score so their softmax weight is ~0.
        reshaped_scala = F.where(tf, reshaped_scala, Variable(-10*xp.ones((b, outputC, fixed_len, 1)).astype(xp.float32), volatile='auto'))
        rereshaped_scala = F.reshape(reshaped_scala, (b*outputC, fixed_len))  # reshape for F.softmax
        softmax_scala = F.softmax(rereshaped_scala)
        # Attention weights a_i, then the weighted sum p = sum_i a_i * c_i per feature map.
        atten = F.reshape(softmax_scala, (b*outputC*fixed_len, 1))
        a_h = F.scale(reshaped_c, atten, axis=0)
        reshaped_a_h = F.reshape(a_h, (b, outputC, fixed_len, 1))
        p = F.sum(reshaped_a_h, axis=2)
        return p
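
For reference, here is a hypothetical usage sketch of the class above. The sizes and the names word2id, id_batch, and labels are illustrative placeholders, and it assumes the same Chainer v1-era setup (the xp and args globals) as the snippet itself.

# Hypothetical usage sketch (not part of the original script).
model = CNN_attention(vocab_size=20000, embedding_size=300, input_channel=1,
                      output_channel_1=100, output_channel_2=100, output_channel_3=100,
                      k1size=3, k2size=4, k3size=5, pooling_units=300,
                      output_size=2)                    # SST-2: positive / negative
model.initialize_embeddings(word2id)                    # pretrained word2vec vectors

x = Variable(np.array(id_batch, dtype=np.int32))        # (batch_size, max_len) word IDs
t = Variable(np.array(labels, dtype=np.int32))          # gold labels
y = model(x)                                            # (batch_size, output_size) scores
loss = F.softmax_cross_entropy(y, t)                    # train with a Chainer optimizer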

Experimental setup

I used SST (Stanford Sentiment Treebank) to compare classification accuracy against max pooling. Two tasks were run: SST-5, which classifies sentences into five classes (very negative, negative, neutral, positive, very positive), and SST-2, which classifies positive versus negative with neutral sentences removed.

Experimental results

Method      SST-2          SST-5
max         86.3 (0.27)    46.5 (1.13)
attention   86.0 (0.20)    47.2 (0.37)

Each value is the average over 5 runs, with the standard deviation in parentheses. Attention does better on the 5-class task, while on the binary task the two methods are about the same. Incidentally, the best single run on SST-5 was 48.2% for max pooling and 47.7% for Attention, so max pooling achieved the higher peak; its results simply vary more from run to run.

Discussion

Looking more closely at the attention weights within a feature map, I found that one position was strongly emphasized with a weight of about 0.9 while the others were close to 0, which is similar in effect to max pooling. However, unlike max pooling, the values of the entire feature map are taken into account, so it may be somewhat more robust to mistakes.
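
As a toy illustration of this observation (with made-up scores, not values taken from the experiment), a softmax over a feature map in which one convolution output stands out produces a nearly one-hot weight distribution:

import numpy as np

# Made-up attention scores for one feature map: one position clearly dominates.
scores = np.array([0.0, 0.2, 4.0, 0.5, 0.1])
weights = np.exp(scores) / np.exp(scores).sum()   # softmax
print(weights.round(2))   # roughly [0.02 0.02 0.92 0.03 0.02]: close to one-hot, like max pooling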

Conclusion

Intuitively, I felt that Attention, which looks at the importance of the whole feature map, should be better than max pooling, which uses only the maximum value. Setting the binary classification aside, accuracy did improve on the 5-class task, so the result is not bad. I suspect it depends on the task, so I would like to try other tasks as well.

A related article also introduces text classification using CNNs in an easy-to-understand way.
