Last time, I covered the basic usage of Chainer, an implementation of an MLP (multilayer perceptron), and the formula needed to compute the number of nodes in the fully connected layer that follows the convolution layer of a CNN.
This time, I will read the actual Twitter data and build a CNN.
I found the explanation of convolutional neural networks here easy to understand.
A filter is applied to the two-dimensional image data to compress its features, and pooling is then applied to extract them further. There is usually not just one filter; a different filter is applied for each feature map (channel) you want to output.
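To make the shapes concrete, here is a tiny toy check of my own (not from the linked explanation), using the same Chainer v1 style API as the rest of this article: three 3×3 filters applied to a 10×10 single-channel image give three 8×8 feature maps, and 2×2 max pooling halves them to 4×4.

import numpy as np
import chainer.functions as F
from chainer import Variable

conv = F.Convolution2D(1, 3, 3)  # in_channels=1, out_channels=3, 3x3 filter
x = Variable(np.random.randn(1, 1, 10, 10).astype(np.float32))  # one 10x10 single-channel "image"
h = conv(x)                 # -> shape (1, 3, 8, 8): 10 - 3 + 1 = 8
p = F.max_pooling_2d(h, 2)  # -> shape (1, 3, 4, 4)
print(h.data.shape, p.data.shape)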
I will refer to this article and this paper, and try to imitate them.
The outline of the process is as follows.
The CNN is defined and trained using the method from the paper, illustrated in the figure below.
The figure depicts the convolution and pooling process for a single sentence: in the large matrix on the far left, $ d $ is the dimension of the word vectors and $ s $ is the number of words in the sentence. The filter is not a square matrix but an asymmetric one of size $ d \times m $.
One thing I did not understand even after reading the paper is that $ s $ differs from sentence to sentence, so I was wondering how to handle it. In this article (/items/93fcb2bc27d7b268cbe6), the maximum number of words over all the tweets is used for every sentence, so I imitate that.
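To summarize the dimensions in my own words (a restatement, not a quote from the paper): convolving the $ d \times s $ sentence matrix with a $ d \times m $ filter yields a feature map of size $ 1 \times (s - m + 1) $ per filter, and max pooling over the full width of that feature map collapses it to a single value per filter. Padding every sentence to the common maximum length simply fixes $ s $ in advance so that all sentences fit into one tensor.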
Get the tweet data from https://raw.githubusercontent.com/satwantrana/CharSCNN/master/tweets_clean.txt. In this data, the first column is a [0,1] flag and the second column is a tweet in English.
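If you want to fetch the file from a script, something like the following works (a minimal sketch; data/tweets_clean.txt is the path the later code assumes, and the data directory must already exist).

import urllib  # Python 2, matching the rest of the code; on Python 3 use urllib.request

url = "https://raw.githubusercontent.com/satwantrana/CharSCNN/master/tweets_clean.txt"
urllib.urlretrieve(url, "data/tweets_clean.txt")  # save to the path used below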
Word Embedding
I want to treat each sentence as a two-dimensional image, so I convert the words to a distributed representation. This time Chainer's built-in EmbedID did not work well for me, so I use Word2Vec from the gensim package.
First, assign an ID to each word in the text data.
# -*- coding: utf-8 -*-

def read(inp_file, num_sent=None):
    f_in = open(inp_file, 'r')
    lines = f_in.readlines()
    words_map = {}
    word_cnt = 0
    k_wrd = 5  # word context window
    y = []
    x_wrd = []
    if num_sent is None:
        num_sent = len(lines)
        max_sen_len = 0
    else:
        max_sen_len, num_sent = 0, num_sent
    words_vocab_mat = []
    token_list = []
    # First pass: build the vocabulary and find the longest sentence
    for line in lines[:num_sent]:
        words = line[:-1].split()
        tokens = words[1:]
        y.append(int(float(words[0])))
        max_sen_len = max(max_sen_len, len(tokens))
        for token in tokens:
            if token not in words_map:
                words_map[token] = word_cnt
                token_list.append(token)
                word_cnt += 1
        words_vocab_mat.append(tokens)
    # Second pass: convert each sentence to a fixed-length row of word ids, padded with -1
    cnt = 0
    for line in lines[:num_sent]:
        words = line[:-1].split()
        cnt += 1
        tokens = words[1:]
        word_mat = [-1] * (max_sen_len + k_wrd - 1)
        for i in range(len(tokens)):
            word_mat[(k_wrd // 2) + i] = words_map[tokens[i]]  # shift by half the window
        x_wrd.append(word_mat)
    max_sen_len += k_wrd - 1
    # num_sent: number of documents (sentences)
    # word_cnt: number of distinct words (vocabulary size)
    # max_sen_len: maximum document length, including the window padding
    # x_wrd: word-id matrix; rows = number of sentences (num_sent), columns = maximum document length (max_sen_len)
    # k_wrd: window size
    # words_map: key = word, value = id
    # y: 1 or 0 (i.e., positive or negative)
    # words_vocab_mat: tokenized sentences; rows = number of sentences, columns = variable (number of words)
    # token_list: list of tokens whose index corresponds to the id
    data = (num_sent, word_cnt, max_sen_len, k_wrd, x_wrd, y, words_map, words_vocab_mat, token_list)
    return data
(num_sent, word_cnt, max_sen_len, k_wrd, x_wrd, y,words_map,sentences,token_list) = load.read("data/tweets_clean.txt",10000)
x_wrd is a matrix of size (number of sentences) × (maximum document length), and each element is the ID of the word at that position. Keep words_map, token_list, and words_vocab_mat as well; they are needed later.
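As a toy illustration of x_wrd (my own example, not taken from the data): with k_wrd = 5, a two-word sentence whose words have ids 0 and 1, in a corpus whose longest sentence has 4 words, becomes

word_mat = [-1, -1, 0, 1, -1, -1, -1, -1]

that is, the ids are shifted right by k_wrd // 2 = 2 positions and every unused slot keeps the value -1, which is later replaced by a zero vector.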
Next, use Word2Vec to obtain a vector representation of each word, and then build what I call a "sentence image matrix" (my own term).
"""Create a word vector space in Word2Vec"""
word_dimension = 200
from gensim.models import Word2Vec
model_w2v = Word2Vec(sentences,seed=123,size=word_dimension,min_count=0,window=5)
sentence_image_matrix = np.zeros((len(sentences),1,word_dimension,max_sen_len)) #Initialization of sentence image matrix for convolution
"""x_Generate a vector for wrd"""
for i in range(0,len(x_wrd)):
tmp_id_list = x_wrd[i,:]
for j in range(0,len(tmp_id_list)):
"""Turn for one line"""
id = tmp_id_list[j]
if id == -1:
"""No information"""
sentence_image_matrix[i,0,:,j] = [0.] * word_dimension #Insert 0 vector
else:
target_word = token_list[id]
sentence_image_matrix[i,0,:,j] = model_w2v[target_word]
sentence_image_matrix is defined as a 4-dimensional tensor with the size (number of sentences, 1, vector dimension = 200, maximum sentence length).
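A quick check of that shape (my addition; the first dimension depends on how many tweets were read):

print(sentence_image_matrix.shape)  # e.g. (10000, 1, 200, max_sen_len)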
I learned for the first time that scikit-learn's train_test_split can also be used on 4D tensors, presumably because it only splits along the first axis.
"""Divide into training data and test data"""
sentence_image_matrix = np.array(sentence_image_matrix,dtype=np.float32)
N = len(sentence_image_matrix)
t_n = 0.33
x_train,x_test,y_train,y_test = train_test_split(sentence_image_matrix,y,test_size=t_n,random_state=123)
The tricky part is the definition of the CNN. The paper uses an asymmetric filter, and the pooling is asymmetric as well, so this has to be taken into account.
Then it looks like this.
# math, numpy (np), six, chainer.functions (F), FunctionSet and Variable are assumed
# to be imported as in the previous article's ChainerClassifier code.
class CNNFiltRow(ChainerClassifier):
    """
    CNN whose filter spans every row of the input and slides only in the column direction
    """

    def _setup_network(self, **params):
        self.input_dim = params["input_dim"]  # number of columns of one image
        self.in_channels = params["in_channels"]  # input channels: default = 1
        self.out_channels = params["out_channels"]  # output channels: arbitrary
        self.row_dim = params["row_dim"]  # number of rows of one image = number of rows of the filter
        self.filt_clm = params["filt_clm"]  # number of filter columns
        self.pooling_row = params["pooling_row"] if "pooling_row" in params else 1  # number of pooling rows: default = 1
        self.pooling_clm = params["pooling_clm"] if "pooling_clm" in params else int(self.input_dim - 2 * math.floor(self.filt_clm / 2.))  # number of pooling columns: default = full width of the convolution output
        self.batch_size = params["batch_size"] if "batch_size" in params else 100
        self.hidden_dim = params["hidden_dim"]
        self.n_classes = params["n_classes"]
        self.conv1_out_dim = int(math.floor((self.input_dim - 2 * math.floor(self.filt_clm / 2.)) / self.pooling_clm))
        network = FunctionSet(
            conv1=F.Convolution2D(self.in_channels, self.out_channels, (self.row_dim, self.filt_clm)),  # asymmetric filter
            l1=F.Linear(self.conv1_out_dim * self.out_channels, self.hidden_dim),
            l2=F.Linear(self.hidden_dim, self.hidden_dim),
            l3=F.Linear(self.hidden_dim, self.n_classes),
        )
        return network

    def forward(self, x, train=True):
        h = F.max_pooling_2d(F.relu(self.network.conv1(x)), (self.pooling_row, self.pooling_clm))  # asymmetric pooling
        h1 = F.dropout(F.relu(self.network.l1(h)), train=train)
        h2 = F.dropout(F.relu(self.network.l2(h1)), train=train)
        y = self.network.l3(h2)
        return y

    def output_func(self, h):
        return F.softmax(h)

    def loss_func(self, y, t):
        return F.softmax_cross_entropy(y, t)

    def fit(self, x_data, y_data):
        batchsize = self.batch_size
        N = len(y_data)
        for loop in range(self.n_iter):
            perm = np.random.permutation(N)
            sum_accuracy = 0
            sum_loss = 0
            for i in six.moves.range(0, N, batchsize):
                x_batch = x_data[perm[i:i + batchsize]]
                y_batch = y_data[perm[i:i + batchsize]]
                x = Variable(x_batch)
                y = Variable(y_batch)
                self.optimizer.zero_grads()
                yp = self.forward(x)
                loss = self.loss_func(yp, y)
                loss.backward()
                self.optimizer.update()
                sum_loss += loss.data * len(y_batch)
                sum_accuracy += F.accuracy(yp, y).data * len(y_batch)
            if self.report > 0 and loop % self.report == 0:
                print('loop={}, train mean loss={} , train mean accuracy={}'.format(loop, sum_loss / N, sum_accuracy / N))
        return self

    def fit_test(self, x_data, y_data, x_test, y_test):
        batchsize = self.batch_size
        N = len(y_data)
        Nt = len(y_test)
        train_ac = []
        test_ac = []
        for loop in range(self.n_iter):
            perm = np.random.permutation(N)
            permt = np.random.permutation(Nt)
            sum_accuracy = 0
            sum_loss = 0
            sum_accuracy_t = 0
            """Training phase"""
            for i in six.moves.range(0, N, batchsize):
                x_batch = x_data[perm[i:i + batchsize]]
                y_batch = y_data[perm[i:i + batchsize]]
                x = Variable(x_batch)
                y = Variable(y_batch)
                self.optimizer.zero_grads()
                yp = self.forward(x)
                loss = self.loss_func(yp, y)
                loss.backward()
                self.optimizer.update()
                sum_loss += loss.data * len(y_batch)
                sum_accuracy += F.accuracy(yp, y).data * len(y_batch)
            """Test phase"""
            for i in six.moves.range(0, Nt, batchsize):
                x_batch = x_test[permt[i:i + batchsize]]
                y_batch = y_test[permt[i:i + batchsize]]
                x = Variable(x_batch)
                y = Variable(y_batch)
                yp = self.forward(x, False)
                sum_accuracy_t += F.accuracy(yp, y).data * len(y_batch)
            if self.report > 0 and loop % self.report == 0:
                print('loop={}, train mean loss={} , train mean accuracy={} , test mean accuracy={}'.format(loop, sum_loss / N, sum_accuracy / N, sum_accuracy_t / Nt))
            train_ac.append(sum_accuracy / N)
            test_ac.append(sum_accuracy_t / Nt)
        return self, train_ac, test_ac
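To spell out the dimensions in _setup_network (my own arithmetic, not from the article): with no padding, the convolution output width is input_dim - 2 * floor(filt_clm / 2), and the default pooling_clm is exactly that width, so max pooling collapses each feature map to a single value and conv1_out_dim = 1. That is why l1 takes conv1_out_dim * out_channels inputs; with the parameters used below (filt_clm = 3, out_channels = 20), the first fully connected layer sees 20 values per sentence.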
Please refer to the previous article for ChainerClassifier.
I also wanted to see the accuracy of the test data, so I added the fit_test method.
"""Learning CNN Filter Row"""
n_iter = 200
report = 5
params = {"input_dim":max_sen_len,"in_channels":1,"out_channels":20,"row_dim":word_dimension,"filt_clm":3,"batch_size":100,"hidden_dim":300,"n_classes":2}
cnn = CNNFiltRow(n_iter=n_iter,report=report,**params)
cnn,train_ac,test_ac = cnn.fit_test(x_train,y_train,x_test,y_test)
Below is a plot of the accuracy during training and the accuracy on the test data. Overfitting seems to start around iteration 100, but my impression is that the generalization performance is still high.
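The plot itself can be produced with matplotlib from the train_ac and test_ac lists returned by fit_test (a minimal sketch; the plotting code is not part of the original article).

import matplotlib.pyplot as plt

plt.plot(train_ac, label="train accuracy")
plt.plot(test_ac, label="test accuracy")
plt.xlabel("iteration")
plt.ylabel("accuracy")
plt.legend(loc="lower right")
plt.show()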
Feeding the test data into the final model and computing each metric gives the following.
[CNN]P AUC: 0.80 Pres: 0.66 Recl: 0.89 Fscr: 0.76
The F score was 0.76 and the AUC was 0.8, which was pretty good.
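For reference, these metrics can be computed with scikit-learn along the following lines (a sketch that assumes ChainerClassifier exposes predict and predict_proba methods, which are not shown in this article).

from sklearn import metrics

y_prob = cnn.predict_proba(x_test)[:, 1]  # probability of the positive class (assumed method)
y_pred = cnn.predict(x_test)              # hard 0/1 predictions (assumed method)
print("AUC : {:.2f}".format(metrics.roc_auc_score(y_test, y_prob)))
print("Pres: {:.2f}".format(metrics.precision_score(y_test, y_pred)))
print("Recl: {:.2f}".format(metrics.recall_score(y_test, y_pred)))
print("Fscr: {:.2f}".format(metrics.f1_score(y_test, y_pred)))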
As benchmarks, I do the same with Random Forest and an MLP (multilayer perceptron). Since these models do not take two-dimensional input, the data is flattened to one dimension, as with MNIST (see the sketch below).
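Flattening here just means reshaping each (1, 200, max_sen_len) "sentence image" into a single row; the variable names below are mine.

x_train_flat = x_train.reshape(len(x_train), -1)  # (n_samples, 1 * 200 * max_sen_len)
x_test_flat = x_test.reshape(len(x_test), -1)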
The metrics on the same test data are as follows.
[RF ]P AUC: 0.71 Pres: 0.65 Recl: 0.60 Fscr: 0.62
[MLP]P AUC: 0.71 Pres: 0.64 Recl: 0.69 Fscr: 0.67
[CNN]P AUC: 0.80 Pres: 0.66 Recl: 0.89 Fscr: 0.76
What an overwhelming performance difference in favor of the CNN ...