Convolutional Neural Networks (CNNs) are used not only for image processing but also for natural language processing, so why not combine them with an Attention mechanism? I gave it a try.
Simply put, the Attention mechanism lets the model focus more on the important parts of the input (here, the sentence).
She is beautiful and has a good style, but she has the worst personality.
For example, when determining the evaluation polarity (positive or negative) of this sentence, a human judges it as negative by looking at the latter clause containing `worst`. Similarly, the Attention mechanism can place more weight on the `worst` part than on the `beautiful` or `good style` parts.
The Attention mechanism was originally proposed for machine translation in Neural Machine Translation by Jointly Learning to Align and Translate [Bahdanau et al., ICLR2015]. For more details, the article here explains it well.
Since its introduction in machine translation, the Attention mechanism has been applied to a variety of natural language processing tasks. Most of that work, however, uses RNN-based models such as LSTMs and GRUs. So this time I tried using the Attention mechanism with a CNN on an evaluation polarity classification task. As in the example above, evaluation polarity classification is the task of predicting whether a given input sentence expresses a positive or negative meaning.
The base model is Convolutional Neural Networks for Sentence Classification [Kim, EMNLP2014].
For the Attention mechanism, I referred to [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf) [Yang et al., NAACL2016], which introduces Attention into a GRU-based RNN for document classification. Consider a feature map $\boldsymbol{c} \in \mathcal{R}^{l-k+1}$:
\boldsymbol{c} = [c_1, c_2,\cdots,c_{l-k+1}]
where $l$ is the sentence length and $k$ is the window size (for example, with $l = 20$ and $k = 3$, each feature map has 18 elements). The importance of each element of this feature map $\boldsymbol{c}$ is computed as follows; this part is the Attention mechanism.
\begin{align}
p & = \sum_{i} a_i \odot c_i \\
a_i & = \frac{\exp(W^{(C2)} \tanh(W^{(C1)} c_i))} {\sum_{j} \exp(W^{(C2)} \tanh(W^{(C1)} c_j))}
\end{align}
$\odot$ denotes element-wise multiplication, $W^{(C1)} \in \mathcal{R}^{d \times 1}$ and $W^{(C2)} \in \mathcal{R}^{1 \times d}$ are weight matrices, and $d$ is a hyperparameter (I am not sure what this one is usually called). Each $a_i$ is a real value between 0 and 1, and the closer $a_i$ is to 1, the more important the corresponding $c_i$. One pooling result $p$ is output per feature map. From here on, the model is the same as Kim's CNN introduced above: the multiple $p$ are concatenated, and the resulting vector $v$ is passed through dropout and a fully connected layer and classified with a softmax classifier.
v = p^1\oplus p^2\oplus \cdots \oplus p^m
where $m$ is the number of feature maps; following Kim, it is set to 100 here.
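To make the equations above concrete, here is a minimal NumPy sketch of attention pooling over toy feature maps (the dimensions, random values, and variable names are made up for illustration; the actual model computes this inside the network, as in the code further below):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20                              # the hyperparameter d
m = 3                               # number of feature maps (100 in the actual model)
l, k = 10, 3                        # sentence length and window size
W_c1 = rng.normal(size=(d, 1))      # W^(C1) in R^{d x 1}
W_c2 = rng.normal(size=(1, d))      # W^(C2) in R^{1 x d}

pooled = []
for _ in range(m):
    c = rng.normal(size=(l - k + 1,))                      # one feature map c
    scores = (W_c2 @ np.tanh(W_c1 @ c[None, :])).ravel()   # unnormalized importance per c_i
    a = np.exp(scores) / np.exp(scores).sum()              # attention weights a_i (softmax)
    p = np.sum(a * c)                                      # pooled scalar p for this feature map
    p_max = c.max()                                        # what plain max pooling would return instead
    pooled.append(p)

v = np.array(pooled)                # v = p^1 ⊕ p^2 ⊕ ... ⊕ p^m
print(a.round(2), v.shape)          # the weights sum to 1; v has m elements
```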
In other words, Attention is used instead of max pooling in the CNN's pooling layer. Drawn as a figure, it looks like this.
When Attention is used with an RNN, the importance is computed over hidden-layer vectors; here it is computed over the scalars obtained by convolution (which should carry the n-gram information), so it was not clear whether this would work.
```python:cnn_attention.py
# Assumed imports; the surrounding script is also expected to define `args`
# (via argparse, e.g. args.gpu and args.classtype) and `xp` (numpy or cupy).
import copy

import numpy as np
from chainer import Chain, Variable
import chainer.functions as F
import chainer.links as L
from gensim.models import word2vec


class CNN_attention(Chain):
    def __init__(self, vocab_size, embedding_size, input_channel, output_channel_1, output_channel_2, output_channel_3, k1size, k2size, k3size, pooling_units, atten_size=20, output_size=args.classtype, train=True):
        super(CNN_attention, self).__init__(
            w2e = L.EmbedID(vocab_size, embedding_size),
            conv1 = L.Convolution2D(input_channel, output_channel_1, (k1size, embedding_size)),
            conv2 = L.Convolution2D(input_channel, output_channel_2, (k2size, embedding_size)),
            conv3 = L.Convolution2D(input_channel, output_channel_3, (k3size, embedding_size)),
            l1 = L.Linear(pooling_units, output_size),
            # Attention
            a1 = L.Linear(1, atten_size),
            a2 = L.Linear(atten_size, 1),
        )
        self.output_size = output_size
        self.train = train
        self.embedding_size = embedding_size
        self.ignore_label = 0
        self.w2e.W.data[self.ignore_label] = 0
        self.w2e.W.data[1] = 0  # non-word (unknown) token
        self.input_channel = input_channel

    def initialize_embeddings(self, word2id):
        #w_vector = word2vec.Word2Vec.load_word2vec_format('./vector/glove.840B.300d.txt', binary=False)  # GloVe
        w_vector = word2vec.Word2Vec.load_word2vec_format('./vector/GoogleNews-vectors-negative300.bin', binary=True)  # word2vec
        for word, id in sorted(word2id.items(), key=lambda x: x[1])[1:]:
            if word in w_vector:
                self.w2e.W.data[id] = w_vector[word]
            else:
                self.w2e.W.data[id] = np.reshape(np.random.uniform(-0.25, 0.25, self.embedding_size), (self.embedding_size,))

    def __call__(self, x):
        h_list = list()
        ox = copy.copy(x)
        if args.gpu != -1:
            ox.to_gpu()
        x = xp.array(x.data)
        x = F.tanh(self.w2e(x))
        b, max_len, w = x.shape  # batch_size, max_len, embedding_size
        x = F.reshape(x, (b, self.input_channel, max_len, w))

        c1 = self.conv1(x)
        b, outputC, fixed_len, _ = c1.shape
        tf = self.set_tfs(ox, b, outputC, fixed_len)  # True/False mask for padding
        h1 = self.attention_pooling(F.relu(c1), b, outputC, fixed_len, tf)
        h1 = F.reshape(h1, (b, outputC))
        h_list.append(h1)

        c2 = self.conv2(x)
        b, outputC, fixed_len, _ = c2.shape
        tf = self.set_tfs(ox, b, outputC, fixed_len)  # True/False mask for padding
        h2 = self.attention_pooling(F.relu(c2), b, outputC, fixed_len, tf)
        h2 = F.reshape(h2, (b, outputC))
        h_list.append(h2)

        c3 = self.conv3(x)
        b, outputC, fixed_len, _ = c3.shape
        tf = self.set_tfs(ox, b, outputC, fixed_len)  # True/False mask for padding
        h3 = self.attention_pooling(F.relu(c3), b, outputC, fixed_len, tf)
        h3 = F.reshape(h3, (b, outputC))
        h_list.append(h3)

        h4 = F.concat(h_list)
        y = self.l1(F.dropout(h4, train=self.train))
        return y

    def set_tfs(self, x, b, outputC, fixed_len):
        # True where the token is not padding (id 0), broadcast over output channels
        TF = Variable(x[:, :fixed_len].data != 0, volatile='auto')
        TF = F.reshape(TF, (b, 1, fixed_len, 1))
        TF = F.broadcast_to(TF, (b, outputC, fixed_len, 1))
        return TF

    def attention_pooling(self, c, b, outputC, fixed_len, tf):
        reshaped_c = F.reshape(c, (b * outputC * fixed_len, 1))
        scala = self.a2(F.tanh(self.a1(reshaped_c)))            # importance score for each c_i
        reshaped_scala = F.reshape(scala, (b, outputC, fixed_len, 1))
        # give padded positions a large negative score so they get ~0 attention
        reshaped_scala = F.where(tf, reshaped_scala, Variable(-10 * xp.ones((b, outputC, fixed_len, 1)).astype(xp.float32), volatile='auto'))
        rereshaped_scala = F.reshape(reshaped_scala, (b * outputC, fixed_len))  # reshape for F.softmax
        softmax_scala = F.softmax(rereshaped_scala)             # attention weights a_i
        atten = F.reshape(softmax_scala, (b * outputC * fixed_len, 1))
        a_h = F.scale(reshaped_c, atten, axis=0)                # a_i * c_i
        reshaped_a_h = F.reshape(a_h, (b, outputC, fixed_len, 1))
        p = F.sum(reshaped_a_h, axis=2)                         # p = sum_i a_i * c_i
        return p
```
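For reference, a hypothetical smoke test of this class, appended to the same script, might look like the sketch below. The hyperparameter values and the dummy batch are made up, and it assumes the script's `args` has `gpu = -1` and `classtype = 2` with `xp` set to NumPy (the actual experiments of course use real SST data and the old Chainer v1 training loop instead):

```python
# Hypothetical check, assuming args.gpu == -1, args.classtype == 2 and xp is numpy.
from chainer import optimizers

model = CNN_attention(vocab_size=10000, embedding_size=300, input_channel=1,
                      output_channel_1=100, output_channel_2=100, output_channel_3=100,
                      k1size=3, k2size=4, k3size=5, pooling_units=300)
optimizer = optimizers.Adam()
optimizer.setup(model)

# dummy batch: 16 sentences, each padded to 40 word ids (0 = padding, 1 = non-word token)
x = Variable(np.random.randint(2, 10000, size=(16, 40)).astype(np.int32))
t = Variable(np.random.randint(0, 2, size=(16,)).astype(np.int32))

y = model(x)                              # shape (16, 2): one score per class
loss = F.softmax_cross_entropy(y, t)
model.zerograds()
loss.backward()
optimizer.update()
```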
I compared classification accuracy against max pooling on the Stanford Sentiment Treebank (SST). I experimented with two tasks: SST-5, which classifies sentences into five classes (very negative, negative, neutral, positive, very positive), and SST-2, which classifies positive versus negative with neutral removed.
Method | SST-2 (%) | SST-5 (%) |
---|---|---|
max pooling | 86.3 (0.27) | 46.5 (1.13) |
attention | 86.0 (0.20) | 47.2 (0.37) |
Each value is the mean over five runs, with the standard deviation in parentheses. Attention does better on the 5-class task, but on the binary task the results are almost unchanged. Incidentally, the best single run on SST-5 was 48.2% for max pooling versus 47.7% for Attention, so max pooling gave the better peak result; the scores just vary a lot from run to run.
Looking more closely at what the Attention weights over a feature map look like, I found that one element was strongly weighted at around 0.9 while the others were nearly 0, which is quite similar to max pooling. Unlike max pooling, however, the values of the entire feature map are taken into account, so perhaps it is a little harder for it to make mistakes.
Intuitively, I felt that Attention, which looks at the importance of the whole feature map, should be better than max pooling, which uses only the maximum value. Since the accuracy is higher on the 5-class task (if not on the binary one), it does not seem to be a bad idea. I suspect it depends on the task, so I would like to try other tasks as well.
The article here also introduces text classification using CNN in an easy-to-understand manner.