100 Language Processing Knock 2020 Chapter 9: RNN, CNN

The other day, 100 Language Processing Knock 2020 was released. I myself have only been in natural language processing for a year, and I don't know the details, but I will solve all the problems and publish them in order to improve my technical skills.

All shall be executed on jupyter notebook, and the restrictions of the problem statement may be broken conveniently. The source code is also on github. Yes.

Chapter 8 is here.

The environment is Python 3.8.2 and Ubuntu 18.04.

Chapter 9: RNN, CNN

Same as Chapter 8 and use PyTorch.

80. Conversion to ID number

I want to give a unique ID number to the words in the learning data constructed in question 51. Assign an ID number to the word that appears more than once in the learning data by the method such as 1 for the word that appears most frequently in the learning data, 2 for the word that appears second, and so on. Then, implement a function that returns a sequence of ID numbers for a given word string. However, all ID numbers of words that appear less than twice should be 0.

code


import re
import spacy
import torch

Prepare spacy model and label.

code


nlp = spacy.load('en')
categories = ['b', 't', 'e', 'm']
category_names = ['business', 'science and technology', 'entertainment', 'health']

Read the file and tokenize it with spacy.

code


def tokenize(x):
    x = re.sub(r'\s+', ' ', x)
    x = nlp.make_doc(x)
    x = [d.text for d in x]
    return x

def read_feature_dataset(filename):
    with open(filename) as f:
        dataset = f.read().splitlines()
    dataset = [line.split('\t') for line in dataset]
    dataset_t = [categories.index(line[0]) for line in dataset]
    dataset_x = [tokenize(line[1]) for line in dataset]
    return dataset_x, torch.tensor(dataset_t, dtype=torch.long)

code


train_x, train_t = read_feature_dataset('data/train.txt')
valid_x, valid_t = read_feature_dataset('data/valid.txt')
test_x, test_t = read_feature_dataset('data/test.txt')

Extract the vocabulary. Only words that appear more than once are targeted.

code


from collections import Counter

code


counter = Counter([
    x
    for sent in train_x
    for x in sent
])

vocab_in_train = [
    token
    for token, freq in counter.most_common()
    if freq > 1
]
len(vocab_in_train)

output


9700

Converts a word string into a string of ID numbers.

code


vocab_list = ['[UNK]'] + vocab_in_train
vocab_dict = {x:n for n, x in enumerate(vocab_list)}

code


def sent_to_ids(sent):
    return torch.tensor([vocab_dict[x if x in vocab_dict else '[UNK]'] for x in sent], dtype=torch.long)

Let's tokenize the first sentence of the training data.

code


print(train_x[0])
print(sent_to_ids(train_x[0]))

output


['Kathleen', 'Sebelius', "'", 'LGBT', 'legacy']
tensor([   0,    0,    2, 2648,    0])

Convert it to a column of ID numbers.

code


def dataset_to_ids(dataset):
    return [sent_to_ids(x) for x in dataset]

code


train_s = dataset_to_ids(train_x)
valid_s = dataset_to_ids(valid_x)
test_s = dataset_to_ids(test_x)
train_s[:3]

output


[tensor([   0,    0,    2, 2648,    0]),
 tensor([   9, 6740, 1445, 2076,  583,   10,  547,   32,   51,  873, 6741]),
 tensor([   0,  205, 4198,  315, 1899, 1232,    0])]

81. Forecast by RNN

There is a word string $ \ boldsymbol {x} = (x_1, x_2, \ dots, x_T) $ represented by an ID number. However, $ T $ is the length of the word string, and $ x_t \ in \ mathbb {R} ^ {V} $ is the one-hot notation of the word ID number ($ V $ is the total number of words). Using a recurrent neural network (RNN), implement the following equation as a model for predicting the category $ y $ from the word string $ \ boldsymbol {x} $. $ \overrightarrow h_0 = 0, \ \overrightarrow h_t = {\rm \overrightarrow{RNN}}(\mathrm{emb}(x_t), \overrightarrow h_{t-1}), \ y = {\rm softmax}(W^{(yh)} \overrightarrow h_T + b^{(y)}) $ However, $ \ mathrm {emb} (x) \ in \ mathbb {R} ^ {d_w} $ is word embedding (a function that converts a word from one-hot notation to a word vector), $ \ overridearrow h_t \ in \ mathbb {R} ^ {d_h} $ is the hidden state vector of time $ t $, $ {\ rm \ overridearrow {RNN}} (x, h) $ is from the input $ x $ and the hidden state $ h $ of the previous time The RNN unit that calculates the next state, $ W ^ {(yh)} \ in \ mathbb {R} ^ {L \ times d_h} $ is the matrix for predicting the category from the hidden state vector, $ b ^ {(y) )} \ in \ mathbb {R} ^ {L} $ is a bias term ($ d_w, d_h, L $ are the number of word embedding dimensions, the number of hidden state vectors, and the number of labels, respectively). The RNN unit $ {\ rm \ overrightarrow {RNN}} (x, h) $ can have various configurations, and the following equation is a typical example. $ {\rm \overrightarrow{RNN}}(x,h) = g(W^{(hx)} x + W^{(hh)}h + b^{(h)}) $ However, $ W ^ {(hx)} \ in \ mathbb {R} ^ {d_h \ times d_w}, W ^ {(hh)} \ in \ mathbb {R} ^ {d_h \ times d_h}, b ^ {(h)} \ in \ mathbb {R} ^ {d_h} $ is the parameter of the RNN unit, and $ g $ is the activation function (for example, $ \ tanh $ and ReLU). In this problem, we do not learn the parameters, we just need to calculate $ y $ with the randomly initialized parameters. Hyperparameters such as the number of dimensions should be set to appropriate values such as $ d_w = 300, d_h = 50 $ (the same applies to the following problems).

Unlike Chapter 8, the length of the input data differs depending on the sentence. Use various things in torch.nn.utils.rnn to pad the end of the variable length series so that it can be handled.

code


import random as rd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence as pad
from torch.nn.utils.rnn import pack_padded_sequence as pack
from torch.nn.utils.rnn import pad_packed_sequence as unpack

Create a Dataset class that holds the dataset. In addition to the input statement source and the target label target, it has the input statement lengths lengths as members.

code


class Dataset(torch.utils.data.Dataset):
    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.lengths = torch.tensor([len(x) for x in source])
        self.size = len(source)
    
    def __len__(self):
        return self.size
            
    def __getitem__(self, index):
        return {
            'src':self.source[index],
            'trg':self.target[index],
            'lengths':self.lengths[index],
        }
    
    def collate(self, xs):
        return {
            'src':pad([x['src'] for x in xs]),
            'trg':torch.stack([x['trg'] for x in xs], dim=-1),
            'lengths':torch.stack([x['lengths'] for x in xs], dim=-1)
        }

Prepare the data set.

code


train_dataset = Dataset(train_s, train_t)
valid_dataset = Dataset(valid_s, valid_t)
test_dataset = Dataset(test_s, test_t)

Define the same Sampler as in Chapter 8.

code


class Sampler(torch.utils.data.Sampler):
    def __init__(self, dataset, width, shuffle = False):
        self.dataset = dataset
        self.width = width
        self.shuffle = shuffle
        if not shuffle:
            self.indices = torch.arange(len(dataset))
            
    def __iter__(self):
        if self.shuffle:
            self.indices = torch.randperm(len(self.dataset))
        index = 0
        while index < len(self.dataset):
            yield self.indices[index : index + self.width]
            index += self.width

Since it is convenient to pack padding when the series in the batch are in descending order, we define a sampler that satisfies such a contract.

All you have to do is sort the indexes in descending order of length in advance and load them into batches from the front.

code


class DescendingSampler(Sampler):
    def __init__(self, dataset, width, shuffle = False):
        assert not shuffle
        super().__init__(dataset, width, shuffle)
        self.indices = self.indices[self.dataset.lengths[self.indices].argsort(descending=True)]

Also, during training, the less padding in the batch, the less unnecessary calculations and faster learning, so we will implement such a checkmate method. In the above two examples, the index is separated by the number of batches, but the next child is separated by the maximum number of tokens in the batch, so the number of cases in the batch is not always constant.

code


class MaxTokensSampler(Sampler):
    def __iter__(self):
        self.indices = torch.randperm(len(self.dataset))
        self.indices = self.indices[self.dataset.lengths[self.indices].argsort(descending=True)]
        for batch in self.generate_batches():
            yield batch

    def generate_batches(self):
        batches = []
        batch = []
        acc = 0
        max_len = 0
        for index in self.indices:
            acc += 1
            this_len = self.dataset.lengths[index]
            max_len = max(max_len, this_len)
            if acc * max_len > self.width:
                batches.append(batch)
                batch = [index]
                acc = 1
                max_len = this_len
            else:
                batch.append(index)
        if batch != []:
            batches.append(batch)
        rd.shuffle(batches)
        return batches

Prepare a function to create DataLoader.

code


def gen_loader(dataset, width, sampler=Sampler, shuffle=False, num_workers=8):
    return torch.utils.data.DataLoader(
        dataset, 
        batch_sampler = sampler(dataset, width, shuffle),
        collate_fn = dataset.collate,
        num_workers = num_workers,
    )

def gen_descending_loader(dataset, width, num_workers=0):
    return gen_loader(dataset, width, sampler = DescendingSampler, shuffle = False, num_workers = num_workers)

def gen_maxtokens_loader(dataset, width, num_workers=0):
    return gen_loader(dataset, width, sampler = MaxTokensSampler, shuffle = True, num_workers = num_workers)

Defines a one-layer unidirectional LSTM classifier. The length of each statement is required when packing apaded tensor with the collate of the dataset.

code


class LSTMClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.rnn = nn.LSTM(e_size, h_size, num_layers = 1)
        self.out = nn.Linear(h_size, c_size)
        self.dropout = nn.Dropout(dropout)
        self.embed.weight.data.uniform_(-0.1, 0.1)
        for name, param in self.rnn.named_parameters():
            if 'weight' in name or 'bias' in name:
                param.data.uniform_(-0.1, 0.1)
        self.out.weight.data.uniform_(-0.1, 0.1)
    
    def forward(self, batch, h=None):
        x = self.embed(batch['src'])
        x = pack(x, batch['lengths'])
        x, (h, c) = self.rnn(x, h)
        h = self.out(h)
        return h.squeeze(0)

Predict.

code


model = LSTMClassifier(len(vocab_dict), 300, 50, 4)
loader = gen_loader(test_dataset, 10, DescendingSampler, False)
model(iter(loader).next()).argmax(dim=-1)

output


tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

82. Learning by stochastic gradient descent

Learn the model constructed in Problem 81 using Stochastic Gradient Descent (SGD). Learn the model while displaying the loss and correct answer rate on the training data and the loss and correct answer rate on the evaluation data, and finish with an appropriate standard (for example, 10 epochs).

All you have to do is define Task and Trainer and run the training.

code


class Task:
    def __init__(self):
        self.criterion = nn.CrossEntropyLoss()
    
    def train_step(self, model, batch):
        model.zero_grad()
        loss = self.criterion(model(batch), batch['trg'])
        loss.backward()
        return loss.item()
    
    def valid_step(self, model, batch):
        with torch.no_grad():
            loss = self.criterion(model(batch), batch['trg'])
        return loss.item()

code


class Trainer:
    def __init__(self, model, loaders, task, optimizer, max_iter, device = None):
        self.model = model
        self.model.to(device)
        self.train_loader, self.valid_loader = loaders
        self.task = task
        self.optimizer = optimizer
        self.max_iter = max_iter
        self.device = device
    
    def send(self, batch):
        for key in batch:
            batch[key] = batch[key].to(self.device)
        return batch
        
    def train_epoch(self):
        self.model.train()
        acc = 0
        for n, batch in enumerate(self.train_loader):
            batch = self.send(batch)
            acc += self.task.train_step(self.model, batch)
            self.optimizer.step()
        return acc / n
            
    def valid_epoch(self):
        self.model.eval()
        acc = 0
        for n, batch in enumerate(self.valid_loader):
            batch = self.send(batch)
            acc += self.task.valid_step(self.model, batch)
        return acc / n
    
    def train(self):
        for epoch in range(self.max_iter):
            train_loss = self.train_epoch()
            valid_loss = self.valid_epoch()
            print('epoch {}, train_loss:{:.5f}, valid_loss:{:.5f}'.format(epoch, train_loss, valid_loss))

I will learn.

code


device = torch.device('cuda')
model = LSTMClassifier(len(vocab_dict), 300, 128, 4)
loaders = (
    gen_loader(train_dataset, 1),
    gen_loader(valid_dataset, 1),
)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 3, device)
trainer.train()

I will try to predict.

code


import numpy as np

code


class Predictor:
    def __init__(self, model, loader, device=None):
        self.model = model
        self.loader = loader
        self.device = device
        
    def send(self, batch):
        for key in batch:
            batch[key] = batch[key].to(self.device)
        return batch
        
    def infer(self, batch):
        self.model.eval()
        batch = self.send(batch)
        return self.model(batch).argmax(dim=-1).item()
        
    def predict(self):
        lst = []
        for batch in self.loader:
            lst.append(self.infer(batch))
        return lst

code


def accuracy(true, pred):
    return np.mean([t == p for t, p in zip(true, pred)])

predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

output


Correct answer rate in training data: 0.7592661924372894
Correct answer rate in evaluation data: 0.6384730538922155

If you turn the epoch a little more, the accuracy will improve a little, but I don't care.

83. Mini-batch / Learning on GPU

Modify the code of Problem 82 so that learning can be performed by calculating the loss / gradient for each $ B $ case (choose the value of $ B $ appropriately). Also, execute learning on the GPU.

code


model = LSTMClassifier(len(vocab_dict), 300, 128, 4)
loaders = (
    gen_maxtokens_loader(train_dataset, 4000),
    gen_descending_loader(valid_dataset, 128),
)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.2, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()

output


epoch 0, train_loss:1.22489, valid_loss:1.26302
epoch 1, train_loss:1.11631, valid_loss:1.19404
epoch 2, train_loss:1.07750, valid_loss:1.18451
epoch 3, train_loss:0.96149, valid_loss:1.06748
epoch 4, train_loss:0.81597, valid_loss:0.86547
epoch 5, train_loss:0.74748, valid_loss:0.81049
epoch 6, train_loss:0.80179, valid_loss:0.89621
epoch 7, train_loss:0.60231, valid_loss:0.78494
epoch 8, train_loss:0.52551, valid_loss:0.73272
epoch 9, train_loss:0.97286, valid_loss:1.05034

code


predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

output


Correct answer rate in training data: 0.7202358667165856
Correct answer rate in evaluation data: 0.6773952095808383

For some reason, the loss has increased and the accuracy has decreased, but let's live strongly without worrying about it.

84. Introduction of word vector

Pre-learned word vector (for example, [learned word vector] in Google News dataset (about 100 billion words)](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing )) Initialize the word embedding $ \ mathrm {emb} (x) $ and learn.

code


from gensim.models import KeyedVectors
vectors = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)

Initialize the word embedding before learning.

code


def init_embed(embed):
    for i, token in enumerate(vocab_list):
        if token in vectors:
            embed.weight.data[i] = torch.from_numpy(vectors[token])
    return embed

code


model = LSTMClassifier(len(vocab_dict), 300, 128, 4)
init_embed(model.embed)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()

output


epoch 0, train_loss:1.21390, valid_loss:1.19333
epoch 1, train_loss:0.88751, valid_loss:0.74930
epoch 2, train_loss:0.57240, valid_loss:0.65822
epoch 3, train_loss:0.50240, valid_loss:0.62686
epoch 4, train_loss:0.45800, valid_loss:0.59535
epoch 5, train_loss:0.44051, valid_loss:0.55849
epoch 6, train_loss:0.38251, valid_loss:0.51837
epoch 7, train_loss:0.35731, valid_loss:0.47709
epoch 8, train_loss:0.30278, valid_loss:0.43797
epoch 9, train_loss:0.25518, valid_loss:0.41287

code


predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

output


Correct answer rate in training data: 0.925028079371022
Correct answer rate in evaluation data: 0.8839820359281437

85. Bi-directional RNN / multi-layer

Encode the input text using both forward and reverse RNNs and train the model. $ \overleftarrow h_{T+1} = 0, \ \overleftarrow h_t = {\rm \overleftarrow{RNN}}(\mathrm{emb}(x_t), \overleftarrow h_{t+1}), \ y = {\rm softmax}(W^{(yh)} [\overrightarrow h_T; \overleftarrow h_1] + b^{(y)}) $ However, $ \ overrightarrow h_t \ in \ mathbb {R} ^ {d_h}, \ overleftarrow h_t \ in \ mathbb {R} ^ {d_h} $ are the times $ t obtained by the forward and reverse RNNs, respectively. The hidden state vector of $, $ {\ rm \ overleftarrow {RNN}} (x, h) $ is the RNN unit that calculates the previous state from the input $ x $ and the hidden state $ h $ of the next time, $ W ^ {( yh)} \ in \ mathbb {R} ^ {L \ times 2d_h} $ is a matrix for predicting categories from hidden state vectors, $ b ^ {(y)} \ in \ mathbb {R} ^ {L} $ Is a bias term. Also, $ [a; b] $ represents the concatenation of the vectors $ a $ and $ b $. Furthermore, experiment with bidirectional RNNs in multiple layers.

If you change the parameters of nn.LSTM a little, you can make it multi-layered or bidirectional.

Since the hidden state increases, the last two (hidden states in the forward and reverse directions of the last layer) are acquired.

code


class BiLSTMClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.rnn = nn.LSTM(e_size, h_size, num_layers = 2, bidirectional = True)
        self.out = nn.Linear(h_size * 2, c_size)
        self.dropout = nn.Dropout(dropout)
        nn.init.uniform_(self.embed.weight, -0.1, 0.1)
        for name, param in self.rnn.named_parameters():
            if 'weight' in name or 'bias' in name:
                nn.init.uniform_(param, -0.1, 0.1)
        nn.init.uniform_(self.out.weight, -0.1, 0.1)
    
    def forward(self, batch, h=None):
        x = self.embed(batch['src'])
        x = pack(x, batch['lengths'])
        x, (h, c) = self.rnn(x, h)
        h = h[-2:]
        h = h.transpose(0,1)
        h = h.contiguous().view(-1, h.size(1) * h.size(2))
        h = self.out(h)
        return h

code


model = BiLSTMClassifier(len(vocab_dict), 300, 128, 4)
init_embed(model.embed)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()

code


predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

86. Convolutional Neural Network (CNN)

There is a word string $ \ boldsymbol x = (x_1, x_2, \ dots, x_T) $ represented by an ID number. However, $ T $ is the length of the word string, and $ x_t \ in \ mathbb {R} ^ {V} $ is the one-hot notation of the word ID number ($ V $ is the total number of words). Implement a model that predicts the category $ y $ from the word string $ \ boldsymbol x $ using a convolutional neural network (CNN). However, the configuration of the convolutional neural network is as follows.

  • Word embedding dimensions: $ d_w $

That is, the feature vector $ p_t \ in \ mathbb {R} ^ {d_h} $ at time $ t $ is expressed by the following equation. $ p_t = g(W^{(px)} [\mathrm{emb}(x_{t-1}); \mathrm{emb}(x_t); \mathrm{emb}(x_{t+1})] + b^{(p)}) $ However, $ W ^ {(px)} \ in \ mathbb {R} ^ {d_h \ times 3d_w}, b ^ {(p)} \ in \ mathbb {R} ^ {d_h} $ is a CNN parameter, $ g $ is the activation function (eg $ \ tanh $, ReLU, etc.), and $ [a; b; c] $ is the concatenation of the vectors $ a, b, c $. The number of columns in the matrix $ W ^ {(px)} $ is $ 3d_w $ because the linear transformation is performed on the concatenated word embeddings of three tokens. In maximum value pooling, the maximum value at all times is taken for each dimension of the feature vector, and the feature vector $ c \ in \ mathbb {R} ^ {d_h} $ of the input document is obtained. If $ c [i] $ represents the value of the $ i $ th dimension of the vector $ c $, the maximum value pooling is expressed by the following equation. $ c[i] = \max_{1 \leq t \leq T} p_t[i] $ Finally, the input document feature vector $ c $ with the matrix $ W ^ {(yc)} \ in \ mathbb {R} ^ {L \ times d_h} $ and the bias term $ b ^ {(y)} \ in Apply the linear transformation by \ mathbb {R} ^ {L} $ and the softmax function to predict the category $ y $. $ y = {\rm softmax}(W^{(yc)} c + b^{(y)}) $ Note that this problem does not train the model, it only needs to calculate $ y $ with a randomly initialized weight matrix.

I want to add PAD tokens to both ends of the input data, so I will use such a dataset.

code


cnn_vocab_list = ['[PAD]', '[UNK]'] + vocab_in_train
cnn_vocab_dict = {x:n for n, x in enumerate(cnn_vocab_list)}

def cnn_sent_to_ids(sent):
    return torch.tensor([cnn_vocab_dict[x if x in cnn_vocab_dict else '[UNK]'] for x in sent], dtype=torch.long)

print(train_x[0])
print(cnn_sent_to_ids(train_x[0]))

output


['Kathleen', 'Sebelius', "'", 'LGBT', 'legacy']
tensor([   1,    1,    3, 2649,    1])

Since the window width is 3, I added two ʻEOS`s.

output


def cnn_dataset_to_ids(dataset):
    return [cnn_sent_to_ids(x) for x in dataset]

cnn_train_s = cnn_dataset_to_ids(train_x)
cnn_valid_s = cnn_dataset_to_ids(valid_x)
cnn_test_s = cnn_dataset_to_ids(test_x)

cnn_train_s[:3]

output


[tensor([   1,    1,    3, 2649,    1]),
 tensor([  10, 6741, 1446, 2077,  584,   11,  548,   33,   52,  874, 6742]),
 tensor([   1,  206, 4199,  316, 1900, 1233,    1])]

code


class CNNDataset(Dataset):
    def collate(self, xs):
        max_seq_len = max([x['lengths'] for x in xs])
        src = [torch.cat([x['src'], torch.zeros(max_seq_len - x['lengths'], dtype=torch.long)], dim=-1) for x in xs]
        src = torch.stack(src)
        mask = [[1] * x['lengths'] + [0] * (max_seq_len - x['lengths']) for x in xs]
        mask = torch.tensor(mask, dtype=torch.long)
        return {
            'src':src,
            'trg':torch.tensor([x['trg'] for x in xs]),
            'mask':mask,
        }

cnn_train_dataset = CNNDataset(cnn_train_s, train_t)
cnn_valid_dataset = CNNDataset(cnn_valid_s, valid_t)
cnn_test_dataset = CNNDataset(cnn_test_s, test_t)

Create a CNN model.

code


class CNNClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.conv = nn.Conv1d(e_size, h_size, 3, padding=1)
        self.act = nn.ReLU()
        self.out = nn.Linear(h_size, c_size)
        self.dropout = nn.Dropout(dropout)
        nn.init.normal_(self.embed.weight, 0, 0.1)
        nn.init.kaiming_normal_(self.conv.weight)
        nn.init.constant_(self.conv.bias, 0)
        nn.init.kaiming_normal_(self.out.weight)
        nn.init.constant_(self.out.bias, 0)
    
    def forward(self, batch):
        x = self.embed(batch['src'])
        x = self.dropout(x)
        x = self.conv(x.transpose(-1, -2))
        x = self.act(x)
        x = self.dropout(x)
        x.masked_fill_(batch['mask'].unsqueeze(-2) == 0, -1e4)
        x = F.max_pool1d(x, x.size(-1)).squeeze(-1)
        x = self.out(x)
        return x

Since padding = 1 is specified for nn.Conv1d, one padding token is inserted at each end. This token is 0 by default, but there is no problem because the padding id that matches the length of the series length is also 0. When convolving in the time direction, transpose is done to make the last dimension the time axis. The value of the padding part is set to -1 so that the value is not fetched by the maximum value pooling. When pooling the maximum value with max_pool1d, it is necessary to specify the dimension.

87. Learning CNN by Stochastic Gradient Descent

Learn the model constructed in Problem 86 using Stochastic Gradient Descent (SGD). Learn the model while displaying the loss and correct answer rate on the training data and the loss and correct answer rate on the evaluation data, and finish with an appropriate standard (for example, 10 epochs).

Let me learn.

code


def init_cnn_embed(embed):
    for i, token in enumerate(cnn_vocab_list):
        if token in vectors:
            embed.weight.data[i] = torch.from_numpy(vectors[token])
    return embed

code


model = CNNClassifier(len(cnn_vocab_dict), 300, 128, 4)
init_cnn_embed(model.embed)
loaders = (
    gen_maxtokens_loader(cnn_train_dataset, 4000),
    gen_descending_loader(cnn_valid_dataset, 32),
)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()

output


epoch 0, train_loss:1.03501, valid_loss:0.85454
epoch 1, train_loss:0.68068, valid_loss:0.70825
epoch 2, train_loss:0.56784, valid_loss:0.60257
epoch 3, train_loss:0.50570, valid_loss:0.55611
epoch 4, train_loss:0.45707, valid_loss:0.52386
epoch 5, train_loss:0.42078, valid_loss:0.48479
epoch 6, train_loss:0.38858, valid_loss:0.45933
epoch 7, train_loss:0.36667, valid_loss:0.43547
epoch 8, train_loss:0.34746, valid_loss:0.41509
epoch 9, train_loss:0.32849, valid_loss:0.40350

code


predictor = Predictor(model, gen_loader(cnn_train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(cnn_test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

output


Correct answer rate in training data: 0.9106140022463497
Correct answer rate in evaluation data: 0.8847305389221557

88. Parameter tuning

Build a high-performance category classifier by modifying the code of Problem 85 and Problem 87 and adjusting the shape and hyperparameters of the neural network.

Since it's a big deal, I wrote a model that pools the output of LSTM to the maximum value with CNN.

code


class BiLSTMCNNDataset(Dataset):
    def collate(self, xs):
        max_seq_len = max([x['lengths'] for x in xs])
        mask = [[1] * (x['lengths'] - 2) + [0] * (max_seq_len - x['lengths']) for x in xs]
        mask = torch.tensor(mask, dtype=torch.long)
        return {
            'src':pad([x['src'] for x in xs]),
            'trg':torch.stack([x['trg'] for x in xs], dim=-1),
            'mask':mask,
            'lengths':torch.stack([x['lengths'] for x in xs], dim=-1)
        }

rnncnn_train_dataset = BiLSTMCNNDataset(cnn_train_s, train_t)
rnncnn_valid_dataset = BiLSTMCNNDataset(cnn_valid_s, valid_t)
rnncnn_test_dataset = BiLSTMCNNDataset(cnn_test_s, test_t)

code


class BiLSTMCNNClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.rnn = nn.LSTM(e_size, h_size, bidirectional = True)
        self.conv = nn.Conv1d(h_size* 2, h_size, 3, padding=1)
        self.act = nn.ReLU()
        self.out = nn.Linear(h_size, c_size)
        self.dropout = nn.Dropout(dropout)
        nn.init.uniform_(self.embed.weight, -0.1, 0.1)
        for name, param in self.rnn.named_parameters():
            if 'weight' in name or 'bias' in name:
                nn.init.uniform_(param, -0.1, 0.1)
        nn.init.kaiming_normal_(self.conv.weight)
        nn.init.constant_(self.conv.bias, 0)
        nn.init.kaiming_normal_(self.out.weight)
        nn.init.constant_(self.out.bias, 0)
    
    def forward(self, batch, h=None):
        x = self.embed(batch['src'])
        x = self.dropout(x)
        x = pack(x, batch['lengths'])
        x, (h, c) = self.rnn(x, h)
        x, _ = unpack(x)
        x = self.dropout(x)
        x = self.conv(x.permute(1, 2, 0))
        x = self.act(x)
        x = self.dropout(x)
        x.masked_fill_(batch['mask'].unsqueeze(-2) == 0, -1)
        x = F.max_pool1d(x, x.size(-1)).squeeze(-1)
        x = self.out(x)
        return x

code


loaders = (
    gen_maxtokens_loader(rnncnn_train_dataset, 4000),
    gen_descending_loader(rnncnn_valid_dataset, 32),
)
task = Task()
for h in [32, 64, 128, 256, 512]:
    model = BiLSTMCNNClassifier(len(cnn_vocab_dict), 300, h, 4)
    init_cnn_embed(model.embed)
    optimizer = optim.SGD(model.parameters(), lr=0.02, momentum=0.9, nesterov=True)
    trainer = Trainer(model, loaders, task, optimizer, 10, device)
    trainer.train()
    predictor = Predictor(model, gen_loader(rnncnn_test_dataset, 1), device)
    pred = predictor.predict()
    print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

output


epoch 0, train_loss:1.21905, valid_loss:1.12725
epoch 1, train_loss:0.95913, valid_loss:0.84094
epoch 2, train_loss:0.66851, valid_loss:0.66997
epoch 3, train_loss:0.57141, valid_loss:0.61373
epoch 4, train_loss:0.52795, valid_loss:0.59354
epoch 5, train_loss:0.49844, valid_loss:0.57013
epoch 6, train_loss:0.47408, valid_loss:0.55163
epoch 7, train_loss:0.44922, valid_loss:0.52349
epoch 8, train_loss:0.41864, valid_loss:0.49231
epoch 9, train_loss:0.38975, valid_loss:0.46807
Correct answer rate in evaluation data: 0.8690119760479041
epoch 0, train_loss:1.16516, valid_loss:1.06582
epoch 1, train_loss:0.81246, valid_loss:0.71224
epoch 2, train_loss:0.58068, valid_loss:0.61988
epoch 3, train_loss:0.52451, valid_loss:0.58465
epoch 4, train_loss:0.48807, valid_loss:0.55663
epoch 5, train_loss:0.45712, valid_loss:0.52742
epoch 6, train_loss:0.41639, valid_loss:0.50089
epoch 7, train_loss:0.38595, valid_loss:0.46442
epoch 8, train_loss:0.35262, valid_loss:0.43459
epoch 9, train_loss:0.32527, valid_loss:0.40692
Correct answer rate in evaluation data: 0.8772455089820359
epoch 0, train_loss:1.12191, valid_loss:0.97533
epoch 1, train_loss:0.71378, valid_loss:0.66554
epoch 2, train_loss:0.55280, valid_loss:0.59733
epoch 3, train_loss:0.50526, valid_loss:0.57163
epoch 4, train_loss:0.46889, valid_loss:0.53955
epoch 5, train_loss:0.43500, valid_loss:0.50500
epoch 6, train_loss:0.40006, valid_loss:0.47222
epoch 7, train_loss:0.36444, valid_loss:0.43941
epoch 8, train_loss:0.33329, valid_loss:0.41224
epoch 9, train_loss:0.30588, valid_loss:0.39965
Correct answer rate in evaluation data: 0.8839820359281437
epoch 0, train_loss:1.04536, valid_loss:0.84626
epoch 1, train_loss:0.61410, valid_loss:0.62255
epoch 2, train_loss:0.49830, valid_loss:0.55984
epoch 3, train_loss:0.44190, valid_loss:0.51720
epoch 4, train_loss:0.39713, valid_loss:0.46718
epoch 5, train_loss:0.35052, valid_loss:0.43181
epoch 6, train_loss:0.32145, valid_loss:0.39898
epoch 7, train_loss:0.30279, valid_loss:0.37586
epoch 8, train_loss:0.28171, valid_loss:0.37333
epoch 9, train_loss:0.26904, valid_loss:0.37849
Correct answer rate in evaluation data: 0.8884730538922155
epoch 0, train_loss:0.93974, valid_loss:0.71999
epoch 1, train_loss:0.53687, valid_loss:0.58747
epoch 2, train_loss:0.44848, valid_loss:0.52432
epoch 3, train_loss:0.38761, valid_loss:0.46509
epoch 4, train_loss:0.34431, valid_loss:0.43651
epoch 5, train_loss:0.31699, valid_loss:0.39881
epoch 6, train_loss:0.28963, valid_loss:0.38732
epoch 7, train_loss:0.27550, valid_loss:0.37152
epoch 8, train_loss:0.26003, valid_loss:0.36476
epoch 9, train_loss:0.24991, valid_loss:0.36012
Correct answer rate in evaluation data: 0.8944610778443114

I tried learning while changing the size of the hidden state. The result is that the larger the size of the hidden state, the better.

89. Transfer learning from a pre-trained language model

Build a model that classifies news article headlines into categories, starting from a pre-learned language model (eg BERT).

Use huggingface / transformers.

code


from transformers import *

The BERT input is a wordpiece, so you have to give it through the tokenizer. Prepare a tokenizer.

code


tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

Talkize.

code


def read_for_bert(filename):
    with open(filename) as f:
        dataset = f.read().splitlines()
    dataset = [line.split('\t') for line in dataset]
    dataset_t = [categories.index(line[0]) for line in dataset]
    dataset_x = [torch.tensor(tokenizer.encode(line[1]), dtype=torch.long) for line in dataset]
    return dataset_x, torch.tensor(dataset_t, dtype=torch.long)

bert_train_x, bert_train_t = read_for_bert('data/train.txt')
bert_valid_x, bert_valid_t = read_for_bert('data/valid.txt')
bert_test_x, bert_test_t = read_for_bert('data/test.txt')

Prepare a dataset class for BERT. I do padding and so on. mask is an attention mask. I try not to get attention on the padding token.

code


class BertDataset(Dataset):
    def collate(self, xs):
        max_seq_len = max([x['lengths'] for x in xs])
        src = [torch.cat([x['src'], torch.zeros(max_seq_len - x['lengths'], dtype=torch.long)], dim=-1) for x in xs]
        src = torch.stack(src)
        mask = [[1] * x['lengths'] + [0] * (max_seq_len - x['lengths']) for x in xs]
        mask = torch.tensor(mask, dtype=torch.long)
        return {
            'src':src,
            'trg':torch.tensor([x['trg'] for x in xs]),
            'mask':mask,
        }

code


bert_train_dataset = BertDataset(bert_train_x, bert_train_t)
bert_valid_dataset = BertDataset(bert_valid_x, bert_valid_t)
bert_test_dataset = BertDataset(bert_test_x, bert_test_t)

Load the pre-learning model.

code


class BertClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        config = BertConfig.from_pretrained('bert-base-cased', num_labels=4)
        self.bert = BertForSequenceClassification.from_pretrained('bert-base-cased', config=config)
    
    def forward(self, batch):
        x = self.bert(batch['src'], attention_mask=batch['mask'])
        return x[0]

code


model = BertClassifier()
loaders = (
    gen_maxtokens_loader(bert_train_dataset, 1000),
    gen_descending_loader(bert_valid_dataset, 32),
)
task = Task()
optimizer = optim.AdamW(model.parameters(), lr=1e-5)
trainer = Trainer(model, loaders, task, optimizer, 5, device)
trainer.train()

code


predictor = Predictor(model, gen_loader(bert_train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(bert_test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(test_t, pred))

output


Correct answer rate in training data: 0.9927929614376638
Correct answer rate in training data: 0.9229041916167665

Sounds good.

Next is Chapter 10

Language processing 100 knocks 2020 Chapter 10: Machine translation (90-98)

Recommended Posts

100 Language Processing Knock 2020 Chapter 9: RNN, CNN
[Language processing 100 knocks 2020] Chapter 9: RNN, CNN
100 Language Processing Knock 2020 Chapter 1
100 Language Processing Knock Chapter 1
100 Language Processing Knock 2020 Chapter 3
100 Language Processing Knock 2020 Chapter 2
100 Language Processing Knock Chapter 1 (Python)
100 Language Processing Knock Chapter 2 (Python)
100 Language Processing Knock 2020 Chapter 2: UNIX Commands
100 Language Processing Knock 2015 Chapter 5 Dependency Analysis (40-49)
100 Language Processing Knock with Python (Chapter 1)
100 Language Processing Knock Chapter 1 in Python
100 Language Processing Knock 2020 Chapter 4: Morphological Analysis
100 Language Processing Knock (2020): 28
I tried 100 language processing knock 2020: Chapter 3
100 Language Processing Knock with Python (Chapter 3)
100 Language Processing Knock: Chapter 1 Preparatory Movement
100 Language Processing Knock 2020 Chapter 6: Machine Learning
100 Language Processing Knock Chapter 4: Morphological Analysis
100 Language Processing Knock 2020 Chapter 10: Machine Translation (90-98)
100 Language Processing Knock 2020 Chapter 5: Dependency Analysis
100 Language Processing Knock 2020 Chapter 7: Word Vector
100 Language Processing Knock 2020 Chapter 8: Neural Net
100 Language Processing Knock (2020): 38
I tried 100 language processing knock 2020: Chapter 1
100 language processing knock 00 ~ 02
100 Language Processing Knock 2020 Chapter 1: Preparatory Movement
100 Language Processing Knock Chapter 1 by Python
100 Language Processing Knock 2020 Chapter 3: Regular Expressions
100 Language Processing Knock 2015 Chapter 4 Morphological Analysis (30-39)
I tried 100 language processing knock 2020: Chapter 2
I tried 100 language processing knock 2020: Chapter 4
100 Language Processing Knock with Python (Chapter 2, Part 2)
100 Language Processing Knock with Python (Chapter 2, Part 1)
[Programmer newcomer "100 language processing knock 2020"] Solve Chapter 1
100 language processing knock 2020 [00 ~ 39 answer]
100 language processing knock 2020 [00-79 answer]
100 language processing knock 2020 [00 ~ 69 answer]
100 Amateur Language Processing Knock: 17
100 Language Processing Knock-52: Stemming
100 language processing knocks ~ Chapter 1
100 Amateur Language Processing Knock: 07
100 language processing knocks Chapter 2 (10 ~ 19)
100 Amateur Language Processing Knock: 09
100 Amateur Language Processing Knock: 47
100 Language Processing Knock-53: Tokenization
100 Amateur Language Processing Knock: 97
100 language processing knock 2020 [00 ~ 59 answer]
100 Amateur Language Processing Knock: 67
100 Language Processing Knock UNIX Commands Learned in Chapter 2
100 Language Processing Knock Regular Expressions Learned in Chapter 3
100 Language Processing with Python Knock 2015
100 Language Processing Knock-51: Word Clipping
100 Language Processing Knock-57: Dependency Analysis
100 language processing knock-50: sentence break
100 Language Processing Knock-25: Template Extraction
100 Language Processing Knock-87: Word Similarity
I tried 100 language processing knock 2020
100 language processing knock-56: co-reference analysis
Solving 100 Language Processing Knock 2020 (01. "Patatokukashi")
100 Amateur Language Processing Knock: Summary