Convolutional Neural Networks (CNNs) are used not only for image processing but also for natural language processing, so why not combine them with an Attention mechanism? I gave it a try.
Simply put, the Attention mechanism lets the model focus more on the important parts of the input (here, the sentence).
She is beautiful and has a good style, but she has the worst personality.
For example, when determining the evaluation polarity (positive or negative) of this sentence, a human judges it as negative by looking at the latter clause containing `worst`. Similarly, the Attention mechanism can place more weight on the `worst` part than on the `beautiful` or `good style` parts.
The Attention mechanism was originally proposed for machine translation in Neural Machine Translation by Jointly Learning to Align and Translate [Bahdanau et al., ICLR2015]. For more details, the article here explains it well.
Since its introduction in machine translation, the Attention mechanism has been applied to a variety of natural language processing tasks. Most of that work, however, uses RNN-based models such as LSTMs and GRUs. So this time I tried using the Attention mechanism with a CNN on an evaluation polarity classification task. As in the example above, evaluation polarity classification is the task of predicting whether a given input sentence expresses a positive or negative meaning.
The base model is Convolutional Neural Networks for Sentence Classification [Kim, EMNLP2014].
For the Attention mechanism, I referred to [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf) [Yang et al., NAACL2016], which introduces Attention into a GRU-based RNN for document classification. Consider a feature map $\boldsymbol{c} \in \mathcal{R}^{l-k+1}$:
\boldsymbol{c} = [c_1, c_2,\cdots,c_{l-k+1}]
where $l$ is the sentence length and $k$ is the window size (for example, with $l = 20$ and $k = 3$, each feature map has 18 elements). The importance of each element of this feature map $\boldsymbol{c}$ is computed as follows; this part is the Attention mechanism.
\begin{align}
p & = \sum_{i} a_i \odot c_i \\
a_i & = \frac{\exp(W^{(C2)} \tanh(W^{(C1)} c_i))} {\sum_{j} \exp(W^{(C2)} \tanh(W^{(C1)} c_j))}
\end{align}
$\odot$ denotes element-wise multiplication, $W^{(C1)} \in \mathcal{R}^{d \times 1}$ and $W^{(C2)} \in \mathcal{R}^{1 \times d}$ are weight matrices, and $d$ is a hyperparameter (I am not sure what this one is usually called). Each $a_i$ is a real value between 0 and 1, and the closer $a_i$ is to 1, the more important the corresponding $c_i$. One pooling result $p$ is output per feature map. From here on, the model is the same as Kim's CNN introduced above: the multiple $p$ are concatenated, and the resulting vector $v$ is passed through dropout and a fully connected layer and classified with a softmax classifier.
v = p^1\oplus p^2\oplus \cdots \oplus p^m
where $m$ is the number of feature maps; following Kim, it is set to 100 here.
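To make the equations above concrete, here is a minimal NumPy sketch of attention pooling over toy feature maps (the dimensions, random values, and variable names are made up for illustration; the actual model computes this inside the network, as in the code further below):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20                              # the hyperparameter d
m = 3                               # number of feature maps (100 in the actual model)
l, k = 10, 3                        # sentence length and window size
W_c1 = rng.normal(size=(d, 1))      # W^(C1) in R^{d x 1}
W_c2 = rng.normal(size=(1, d))      # W^(C2) in R^{1 x d}

pooled = []
for _ in range(m):
    c = rng.normal(size=(l - k + 1,))                      # one feature map c
    scores = (W_c2 @ np.tanh(W_c1 @ c[None, :])).ravel()   # unnormalized importance per c_i
    a = np.exp(scores) / np.exp(scores).sum()              # attention weights a_i (softmax)
    p = np.sum(a * c)                                      # pooled scalar p for this feature map
    p_max = c.max()                                        # what plain max pooling would return instead
    pooled.append(p)

v = np.array(pooled)                # v = p^1 ⊕ p^2 ⊕ ... ⊕ p^m
print(a.round(2), v.shape)          # the weights sum to 1; v has m elements
```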
In other words, Attention is used instead of max pooling in the CNN's pooling layer. Drawn as a figure, it looks like this.
When Attention is used with an RNN, the importance is computed over hidden-layer vectors; here it is computed over the scalars obtained by convolution (which should carry the n-gram information), so it was not clear whether this would work.
```python:cnn_attention.py
# Assumed imports; the surrounding script is also expected to define `args`
# (via argparse, e.g. args.gpu and args.classtype) and `xp` (numpy or cupy).
import copy

import numpy as np
from chainer import Chain, Variable
import chainer.functions as F
import chainer.links as L
from gensim.models import word2vec


class CNN_attention(Chain):
    def __init__(self, vocab_size, embedding_size, input_channel, output_channel_1, output_channel_2, output_channel_3, k1size, k2size, k3size, pooling_units, atten_size=20, output_size=args.classtype, train=True):
        super(CNN_attention, self).__init__(
            w2e = L.EmbedID(vocab_size, embedding_size),
            conv1 = L.Convolution2D(input_channel, output_channel_1, (k1size, embedding_size)),
            conv2 = L.Convolution2D(input_channel, output_channel_2, (k2size, embedding_size)),
            conv3 = L.Convolution2D(input_channel, output_channel_3, (k3size, embedding_size)),
            l1 = L.Linear(pooling_units, output_size),
            # Attention
            a1 = L.Linear(1, atten_size),
            a2 = L.Linear(atten_size, 1),
        )
        self.output_size = output_size
        self.train = train
        self.embedding_size = embedding_size
        self.ignore_label = 0
        self.w2e.W.data[self.ignore_label] = 0
        self.w2e.W.data[1] = 0  # non-word (unknown) token
        self.input_channel = input_channel

    def initialize_embeddings(self, word2id):
        #w_vector = word2vec.Word2Vec.load_word2vec_format('./vector/glove.840B.300d.txt', binary=False)  # GloVe
        w_vector = word2vec.Word2Vec.load_word2vec_format('./vector/GoogleNews-vectors-negative300.bin', binary=True)  # word2vec
        for word, id in sorted(word2id.items(), key=lambda x: x[1])[1:]:
            if word in w_vector:
                self.w2e.W.data[id] = w_vector[word]
            else:
                self.w2e.W.data[id] = np.reshape(np.random.uniform(-0.25, 0.25, self.embedding_size), (self.embedding_size,))

    def __call__(self, x):
        h_list = list()
        ox = copy.copy(x)
        if args.gpu != -1:
            ox.to_gpu()
        x = xp.array(x.data)
        x = F.tanh(self.w2e(x))
        b, max_len, w = x.shape  # batch_size, max_len, embedding_size
        x = F.reshape(x, (b, self.input_channel, max_len, w))

        c1 = self.conv1(x)
        b, outputC, fixed_len, _ = c1.shape
        tf = self.set_tfs(ox, b, outputC, fixed_len)  # True/False mask for padding
        h1 = self.attention_pooling(F.relu(c1), b, outputC, fixed_len, tf)
        h1 = F.reshape(h1, (b, outputC))
        h_list.append(h1)

        c2 = self.conv2(x)
        b, outputC, fixed_len, _ = c2.shape
        tf = self.set_tfs(ox, b, outputC, fixed_len)  # True/False mask for padding
        h2 = self.attention_pooling(F.relu(c2), b, outputC, fixed_len, tf)
        h2 = F.reshape(h2, (b, outputC))
        h_list.append(h2)

        c3 = self.conv3(x)
        b, outputC, fixed_len, _ = c3.shape
        tf = self.set_tfs(ox, b, outputC, fixed_len)  # True/False mask for padding
        h3 = self.attention_pooling(F.relu(c3), b, outputC, fixed_len, tf)
        h3 = F.reshape(h3, (b, outputC))
        h_list.append(h3)

        h4 = F.concat(h_list)
        y = self.l1(F.dropout(h4, train=self.train))
        return y

    def set_tfs(self, x, b, outputC, fixed_len):
        # True where the token is not padding (id 0), broadcast over output channels
        TF = Variable(x[:, :fixed_len].data != 0, volatile='auto')
        TF = F.reshape(TF, (b, 1, fixed_len, 1))
        TF = F.broadcast_to(TF, (b, outputC, fixed_len, 1))
        return TF

    def attention_pooling(self, c, b, outputC, fixed_len, tf):
        reshaped_c = F.reshape(c, (b * outputC * fixed_len, 1))
        scala = self.a2(F.tanh(self.a1(reshaped_c)))            # importance score for each c_i
        reshaped_scala = F.reshape(scala, (b, outputC, fixed_len, 1))
        # give padded positions a large negative score so they get ~0 attention
        reshaped_scala = F.where(tf, reshaped_scala, Variable(-10 * xp.ones((b, outputC, fixed_len, 1)).astype(xp.float32), volatile='auto'))
        rereshaped_scala = F.reshape(reshaped_scala, (b * outputC, fixed_len))  # reshape for F.softmax
        softmax_scala = F.softmax(rereshaped_scala)             # attention weights a_i
        atten = F.reshape(softmax_scala, (b * outputC * fixed_len, 1))
        a_h = F.scale(reshaped_c, atten, axis=0)                # a_i * c_i
        reshaped_a_h = F.reshape(a_h, (b, outputC, fixed_len, 1))
        p = F.sum(reshaped_a_h, axis=2)                         # p = sum_i a_i * c_i
        return p
```
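For reference, a hypothetical smoke test of this class, appended to the same script, might look like the sketch below. The hyperparameter values and the dummy batch are made up, and it assumes the script's `args` has `gpu = -1` and `classtype = 2` with `xp` set to NumPy (the actual experiments of course use real SST data and the old Chainer v1 training loop instead):

```python
# Hypothetical check, assuming args.gpu == -1, args.classtype == 2 and xp is numpy.
from chainer import optimizers

model = CNN_attention(vocab_size=10000, embedding_size=300, input_channel=1,
                      output_channel_1=100, output_channel_2=100, output_channel_3=100,
                      k1size=3, k2size=4, k3size=5, pooling_units=300)
optimizer = optimizers.Adam()
optimizer.setup(model)

# dummy batch: 16 sentences, each padded to 40 word ids (0 = padding, 1 = non-word token)
x = Variable(np.random.randint(2, 10000, size=(16, 40)).astype(np.int32))
t = Variable(np.random.randint(0, 2, size=(16,)).astype(np.int32))

y = model(x)                              # shape (16, 2): one score per class
loss = F.softmax_cross_entropy(y, t)
model.zerograds()
loss.backward()
optimizer.update()
```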
I compared classification accuracy against max pooling on the Stanford Sentiment Treebank (SST). I experimented with two tasks: SST-5, which classifies sentences into five classes (very negative, negative, neutral, positive, very positive), and SST-2, which classifies positive versus negative with neutral removed.
Method | SST-2 (%) | SST-5 (%) |
---|---|---|
max pooling | 86.3 (0.27) | 46.5 (1.13) |
attention | 86.0 (0.20) | 47.2 (0.37) |
Each value is the mean over five runs, with the standard deviation in parentheses. Attention does better on the 5-class task, but on the binary task the results are almost unchanged. Incidentally, the best single run on SST-5 was 48.2% for max pooling versus 47.7% for Attention, so max pooling gave the better peak result; the scores just vary a lot from run to run.
Looking more closely at what the Attention weights over a feature map look like, I found that one element was strongly weighted at around 0.9 while the others were nearly 0, which is quite similar to max pooling. Unlike max pooling, however, the values of the entire feature map are taken into account, so perhaps it is a little harder for it to make mistakes.
Intuitively, I felt that Attention, which looks at the importance of the whole feature map, should be better than max pooling, which uses only the maximum value. Since the accuracy is higher on the 5-class task (if not on the binary one), it does not seem to be a bad idea. I suspect it depends on the task, so I would like to try other tasks as well.
The article here also introduces text classification using CNN in an easy-to-understand manner.