[Language processing 100 knocks 2020] Chapter 9: RNN, CNN

Introduction

The 2020 edition of 100 Knocks of Language Processing, a well-known collection of natural language processing exercises, has been released. This article summarizes my solutions to "Chapter 9: RNN, CNN", one of Chapters 1 to 10 listed below.

- Chapter 1: Warm-up
- Chapter 2: UNIX Commands
- Chapter 3: Regular Expressions
- Chapter 4: Morphological Analysis
- Chapter 5: Dependency Analysis
- Chapter 6: Machine Learning
- Chapter 7: Word Vectors
- Chapter 8: Neural Nets
- Chapter 9: RNN, CNN
- Chapter 10: Machine Translation

Advance preparation

Google Colaboratory is used for the answers. For details on how to set up and use Google Colaboratory, see this article. **Since this chapter uses a GPU, change the hardware accelerator to "GPU" from "Runtime" -> "Change runtime type" and save the setting beforehand.** The notebook containing the execution results of the following answers is available on GitHub.

Chapter 9: RNN, CNN

80. Conversion to ID number

I want to assign a unique ID number to each word in the training data constructed in Problem 51. Give the most frequent word in the training data the ID `1`, the second most frequent word the ID `2`, and so on, assigning ID numbers to the words that appear two or more times in the training data. Then implement a function that returns a sequence of ID numbers for a given word sequence. However, the ID number of every word that appears fewer than two times should be `0`.

First, download the specified data and read it into a data frame. Then split it into training data, validation data, and evaluation data. Up to this point the process is exactly the same as Problem 50 in Chapter 6, so you can also simply load the data created there.

#Download data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
!unzip NewsAggregatorDataset.zip

#Replaced double quotes with single quotes to avoid errors when reading
!sed -e 's/"/'\''/g' ./newsCorpora.csv > ./newsCorpora_re.csv

import pandas as pd
from sklearn.model_selection import train_test_split

#Data reading
df = pd.read_csv('./newsCorpora_re.csv', header=None, sep='\t', names=['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP'])

#Data extraction
df = df.loc[df['PUBLISHER'].isin(['Reuters', 'Huffington Post', 'Businessweek', 'Contactmusic.com', 'Daily Mail']), ['TITLE', 'CATEGORY']]

#Data split
train, valid_test = train_test_split(df, test_size=0.2, shuffle=True, random_state=123, stratify=df['CATEGORY'])
valid, test = train_test_split(valid_test, test_size=0.5, shuffle=True, random_state=123, stratify=valid_test['CATEGORY'])

#Confirmation of the number of cases
print('[Learning data]')
print(train['CATEGORY'].value_counts())
print('[Verification data]')
print(valid['CATEGORY'].value_counts())
print('[Evaluation data]')
print(test['CATEGORY'].value_counts())

`output`


[Learning data]
b    4501
e    4235
t    1220
m     728
Name: CATEGORY, dtype: int64
[Verification data]
b    563
e    529
t    153
m     91
Name: CATEGORY, dtype: int64
[Evaluation data]
b    563
e    530
t    152
m     91
Name: CATEGORY, dtype: int64

Next, create a word dictionary. Count the words in the training data, and register each word that appears two or more times as a key, with its frequency rank as the ID.

from collections import defaultdict
import string

#Word frequency aggregation
d = defaultdict(int)
table = str.maketrans(string.punctuation, ' '*len(string.punctuation))  #A table that replaces symbols with spaces
for text in train['TITLE']:
  for word in text.translate(table).split():
    d[word] += 1
d = sorted(d.items(), key=lambda x:x[1], reverse=True)

#Creating a word ID dictionary
word2id = {word: i + 1 for i, (word, cnt) in enumerate(d) if cnt > 1}  #Register words that appear more than once

print(f'Number of IDs: {len(set(word2id.values()))}\n')
print('---Top 20 words by frequency---')
for key in list(word2id)[:20]:
    print(f'{key}: {word2id[key]}')

`output`


Number of IDs: 9405

---Top 20 words by frequency---
to: 1
s: 2
in: 3
on: 4
UPDATE: 5
as: 6
US: 7
for: 8
The: 9
of: 10
1: 11
To: 12
2: 13
the: 14
and: 15
In: 16
Of: 17
a: 18
at: 19
A: 20

Finally, define a function that uses the dictionary to convert a given word sequence into a sequence of ID numbers. Following the problem statement, return `0` for words that are not in the dictionary.

def tokenizer(text, word2id=word2id, unk=0):
  """Split the input text on whitespace and convert it to a sequence of IDs (words not in the dictionary get the number given by unk)"""
  table = str.maketrans(string.punctuation, ' '*len(string.punctuation))
  return [word2id.get(word, unk) for word in text.translate(table).split()]

Check it with the second sentence.

#Verification
text = train.iloc[1, train.columns.get_loc('TITLE')]
print(f'text: {text}')
print(f'ID column: {tokenizer(text)}')

`output`


text: Amazon Plans to Fight FTC Over Mobile-App Purchases
ID column: [169, 539, 1, 683, 1237, 82, 279, 1898, 4199]
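
To make the ID assignment concrete, here is a minimal sketch on a toy corpus. The three titles and the names `toy_titles`, `toy_word2id`, and `toy_tokenizer` are made up for illustration and are not part of the news data. The logic mirrors the steps above: count, rank by frequency, keep words seen at least twice, and map everything else to `0`.

```python
from collections import Counter
import string

# Hypothetical toy corpus (not the news data used in this article)
toy_titles = [
    "Fed to raise rates",
    "Fed rates rise again",
    "Stocks rise as Fed holds rates",
]

# Replace punctuation with spaces, split on whitespace, and count the words
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
counts = Counter(w for t in toy_titles for w in t.translate(table).split())

# Rank by frequency (1 = most frequent) and keep only words seen at least twice
ranked = sorted(counts.items(), key=lambda x: x[1], reverse=True)
toy_word2id = {w: i + 1 for i, (w, c) in enumerate(ranked) if c > 1}

def toy_tokenizer(text, word2id=toy_word2id, unk=0):
    """Convert text to a list of IDs; words outside the dictionary become unk."""
    return [word2id.get(w, unk) for w in text.translate(table).split()]

print(toy_word2id)                        # {'Fed': 1, 'rates': 2, 'rise': 3}
print(toy_tokenizer("Fed to cut rates"))  # [1, 0, 0, 2]
```

"to" appears only once in the toy corpus and "cut" never appears, so both map to `0`.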

81. Prediction with RNN

There is a word sequence $\boldsymbol{x} = (x_1, x_2, \dots, x_T)$ represented by ID numbers. Here, $T$ is the length of the word sequence, and $x_t \in \mathbb{R}^{V}$ is the one-hot representation of a word ID number ($V$ is the total number of words). Using a recurrent neural network (RNN), implement the following equation as a model for predicting the category $y$ from the word sequence $\boldsymbol{x}$.

\overrightarrow h_0 = 0,\ \overrightarrow h_t = {\rm \overrightarrow{RNN}}(\mathrm{emb}(x_t), \overrightarrow h_{t-1}), \ y = {\rm softmax}(W^{(yh)} \overrightarrow h_T + b^{(y)})

Here, $\mathrm{emb}(x) \in \mathbb{R}^{d_w}$ is the word embedding (a function that converts a word from its one-hot representation to a word vector), $\overrightarrow h_t \in \mathbb{R}^{d_h}$ is the hidden state vector at time $t$, ${\rm \overrightarrow{RNN}}(x, h)$ is the RNN unit that computes the next state from the input $x$ and the hidden state $h$ of the previous time step, $W^{(yh)} \in \mathbb{R}^{L \times d_h}$ is the matrix for predicting the category from the hidden state vector, and $b^{(y)} \in \mathbb{R}^{L}$ is the bias term ($d_w, d_h, L$ are the word embedding dimension, the hidden state vector dimension, and the number of labels, respectively). The RNN unit ${\rm \overrightarrow{RNN}}(x, h)$ can take various forms, and the following equation is a typical example.

{\rm \overrightarrow{RNN}}(x,h) = g(W^{(hx)} x + W^{(hh)}h + b^{(h)})

Here, $W^{(hx)} \in \mathbb{R}^{d_h \times d_w}, W^{(hh)} \in \mathbb{R}^{d_h \times d_h}, b^{(h)} \in \mathbb{R}^{d_h}$ are the parameters of the RNN unit, and $g$ is the activation function (for example, $\tanh$ or ReLU).

In this problem the parameters are not trained; it is enough to compute $y$ with randomly initialized parameters. Hyperparameters such as the number of dimensions should be set to suitable values such as $d_w = 300, d_h = 50$ (the same applies to the following problems).
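
As a sanity check on the equations, the recurrence can be written out by hand. The following is a minimal pure-Python sketch, assuming toy sizes ($d_w = 3, d_h = 2, L = 4$), random stand-in embeddings, zero biases, and $g = \tanh$. The helper names (`rand_matrix`, `matvec`, `rnn_unit`, `predict`) are hypothetical, and this is not the implementation used in the answer.

```python
import math
import random

random.seed(0)
d_w, d_h, L = 3, 2, 4   # toy sizes; the problem suggests d_w=300, d_h=50

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Randomly initialized parameters (biases set to zero for simplicity)
W_hx, W_hh, b_h = rand_matrix(d_h, d_w), rand_matrix(d_h, d_h), [0.0] * d_h
W_yh, b_y = rand_matrix(L, d_h), [0.0] * L

def rnn_unit(x, h):
    # g(W_hx x + W_hh h + b_h) with g = tanh
    return [math.tanh(a + b + c)
            for a, b, c in zip(matvec(W_hx, x), matvec(W_hh, h), b_h)]

def softmax(z):
    m = max(z)                          # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(embedded_seq):
    h = [0.0] * d_h                     # h_0 = 0
    for x in embedded_seq:              # h_t = RNN(emb(x_t), h_{t-1})
        h = rnn_unit(x, h)
    return softmax([a + b for a, b in zip(matvec(W_yh, h), b_y)])

seq = [rand_matrix(1, d_w)[0] for _ in range(5)]   # stand-in for emb(x_1..x_5)
y = predict(seq)
print(y)   # four probabilities that sum to 1
```

Only the last hidden state $\overrightarrow h_T$ is passed to the output layer, which is the same readout as in the equation.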

Before getting into the answer, let's review the flow of natural language processing with neural networks, specifically for text classification. Text classification with a neural network mainly consists of the following four steps.

1. Split a sentence into a sequence of tokens (e.g. words)
2. Convert each token to a vector
3. Aggregate the token vectors into a single sentence vector
4. Classify the label with the sentence vector as input

Various methods can be considered for each step. For example, in Chapter 8,

1. Split a sentence into a sequence of tokens (e.g. words) ⇒ **Split on spaces**
2. Convert each token to a vector ⇒ **Convert with pre-trained Word2Vec**
3. Aggregate the token vectors into a single sentence vector ⇒ **Average the token vectors**
4. Classify the label with the sentence vector as input ⇒ **Classify with a fully connected layer**

I implemented this flow and learned the parameters of step 4 (when targeting Japanese documents, step 1 additionally requires the morphological analysis covered in Chapter 4).

On the other hand, in this chapter,

1. Split a sentence into a sequence of tokens (e.g. words) ⇒ Split on spaces
2. Convert each token to a vector ⇒ **Convert with an embedding layer**
3. Aggregate the token vectors into a single sentence vector ⇒ **Aggregate with an RNN or CNN**
4. Classify the label with the sentence vector as input ⇒ Classify with a fully connected layer

we learn the parameters of the network that connects steps 2 to 4. In addition, as in the problems of this chapter, the split tokens are often converted to their corresponding IDs for convenience; this is treated as part of step 1.
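
One reason to replace the averaging of step 3 with an RNN or CNN is that an average ignores word order. Here is a minimal sketch with made-up two-dimensional vectors; `toy_vectors` and `mean_pool` are hypothetical names, and these are not the Word2Vec vectors from Chapter 8.

```python
def mean_pool(token_vectors):
    """Average token vectors element-wise into one sentence vector."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]

# Made-up 2-dimensional word vectors for illustration
toy_vectors = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [2.0, 2.0]}

sent_a = mean_pool([toy_vectors[w] for w in "dog bites man".split()])
sent_b = mean_pool([toy_vectors[w] for w in "man bites dog".split()])

print(sent_a)            # [1.0, 1.0]
print(sent_a == sent_b)  # True: both word orders give the same sentence vector
```

An RNN reads the tokens one after another, so the two sentences generally end in different hidden states.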

Now, let's implement the network for this problem. Use `nn.Embedding` for the embedding layer. Given a word ID, this layer maps it (conceptually via its one-hot vector) to a vector of the specified size (`emb_size`). The RNN part that follows could be realized by recursively passing through a fully connected layer, but it can be written simply with `nn.RNN`. Finally, connect a fully connected layer and you're done.

import torch
from torch import nn

class RNN(nn.Module):
  def __init__(self, vocab_size, emb_size, padding_idx, output_size, hidden_size):
    super().__init__()
    self.hidden_size = hidden_size
    self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
    self.rnn = nn.RNN(emb_size, hidden_size, nonlinearity='tanh', batch_first=True)
    self.fc = nn.Linear(hidden_size, output_size)
    
  def forward(self, x):
    self.batch_size = x.size()[0]
    hidden = self.init_hidden(x.device)  #Create a zero vector of h0 on the same device as the input
    emb = self.emb(x)
    # emb.size() = (batch_size, seq_len, emb_size)
    out, hidden = self.rnn(emb, hidden)
    # out.size() = (batch_size, seq_len, hidden_size)
    out = self.fc(out[:, -1, :])
    # out.size() = (batch_size, output_size)
    return out

  def init_hidden(self, device):
    hidden = torch.zeros(1, self.batch_size, self.hidden_size, device=device)
    return hidden
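
The description of `nn.Embedding` above (word ID, then one-hot vector, then dense vector) can be checked on a toy matrix. Multiplying a one-hot vector by the embedding matrix selects one row, which is why the layer can be implemented as a table lookup. The matrix `E` and the helper names below are made up for illustration.

```python
# Toy embedding matrix E with V=4 rows (one per word ID) and d_w=3 columns
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.0, 0.0, 0.0],   # row reserved for padding in this sketch
]

def one_hot(idx, size):
    return [1.0 if i == idx else 0.0 for i in range(size)]

def emb_via_one_hot(idx):
    """Compute x^T E, where x is the one-hot vector of idx."""
    x = one_hot(idx, len(E))
    return [sum(x[i] * E[i][j] for i in range(len(E))) for j in range(len(E[0]))]

def emb_via_lookup(idx):
    """Pick row idx directly, as a lookup table does."""
    return E[idx]

for word_id in [2, 0, 1, 3]:
    print(word_id, emb_via_one_hot(word_id), emb_via_lookup(word_id))
```

Both functions return the same vector for every ID; the lookup simply skips the multiplications by zero.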

Next, define a class that creates the Dataset, as in the previous chapter. This time it receives the texts and labels, converts each text to IDs with the specified `tokenizer`, and returns both as Tensors.

from torch.utils.data import Dataset

class CreateDataset(Dataset):
  def __init__(self, X, y, tokenizer):
    self.X = X
    self.y = y
    self.tokenizer = tokenizer

  def __len__(self):  # Value returned by len(Dataset)
    return len(self.y)

  def __getitem__(self, index):  # Value returned by Dataset[index]
    text = self.X[index]
    inputs = self.tokenizer(text)

    return {
      'inputs': torch.tensor(inputs, dtype=torch.int64),
      'labels': torch.tensor(self.y[index], dtype=torch.int64)
    }

Create a Dataset using the above. For `tokenizer`, specify the function defined in the previous problem.

#Creating a label vector
category_dict = {'b': 0, 't': 1, 'e':2, 'm':3}
y_train = train['CATEGORY'].map(lambda x: category_dict[x]).values
y_valid = valid['CATEGORY'].map(lambda x: category_dict[x]).values
y_test = test['CATEGORY'].map(lambda x: category_dict[x]).values

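#Creating a Dataset (reset the index so that Dataset[index] looks up rows by position)
dataset_train = CreateDataset(train['TITLE'].reset_index(drop=True), y_train, tokenizer)
dataset_valid = CreateDataset(valid['TITLE'].reset_index(drop=True), y_valid, tokenizer)
dataset_test = CreateDataset(test['TITLE'].reset_index(drop=True), y_test, tokenizer)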

print(f'len(Dataset)Output of: {len(dataset_train)}')
print('Dataset[index]Output of:')
for var in dataset_train[1]:
  print(f'  {var}: {dataset_train[1][var]}')

`output`


len(Dataset)Output of: 10684
Dataset[index]Output of:
  inputs: tensor([ 169,  539,    1,  683, 1237,   82,  279, 1898, 4199])
  labels: 1

Since we do not train in this problem, pass the `inputs` of the `Dataset` to the model and check the output after applying softmax.

#Parameter setting
VOCAB_SIZE = len(set(word2id.values())) + 1  #Number of dictionary IDs+Padding ID
EMB_SIZE = 300
PADDING_IDX = len(set(word2id.values()))
OUTPUT_SIZE = 4
HIDDEN_SIZE = 50

#Model definition
model = RNN(VOCAB_SIZE, EMB_SIZE, PADDING_IDX, OUTPUT_SIZE, HIDDEN_SIZE)

#Get the first 10 predicted values
for i in range(10):
  X = dataset_train[i]['inputs']
  print(torch.softmax(model(X.unsqueeze(0)), dim=-1))

`output`


tensor([[0.2667, 0.2074, 0.2974, 0.2285]], grad_fn=<SoftmaxBackward>)
tensor([[0.1660, 0.3465, 0.2154, 0.2720]], grad_fn=<SoftmaxBackward>)
tensor([[0.2133, 0.2987, 0.3097, 0.1783]], grad_fn=<SoftmaxBackward>)
tensor([[0.2512, 0.4107, 0.1825, 0.1556]], grad_fn=<SoftmaxBackward>)
tensor([[0.2784, 0.1307, 0.3715, 0.2194]], grad_fn=<SoftmaxBackward>)
tensor([[0.2625, 0.1569, 0.2339, 0.3466]], grad_fn=<SoftmaxBackward>)
tensor([[0.1331, 0.5129, 0.2220, 0.1319]], grad_fn=<SoftmaxBackward>)
tensor([[0.2404, 0.1314, 0.2023, 0.4260]], grad_fn=<SoftmaxBackward>)
tensor([[0.1162, 0.4576, 0.2588, 0.1674]], grad_fn=<SoftmaxBackward>)
tensor([[0.4685, 0.1414, 0.2633, 0.1268]], grad_fn=<SoftmaxBackward>)

82. Learning by stochastic gradient descent

Train the model constructed in Problem 81 using stochastic gradient descent (SGD). Train the model while displaying the loss and accuracy on the training data and on the evaluation data, and stop at an appropriate criterion (for example, 10 epochs).

As in the previous chapter, the series of training steps is defined as the `train_model` function.

from torch.utils.data import DataLoader
import time
from torch import optim

def calculate_loss_and_accuracy(model, dataset, device=None, criterion=None):
  """Calculate loss / correct answer rate"""
  dataloader = DataLoader(dataset, batch_size=1, shuffle=False)
  loss = 0.0
  total = 0
  correct = 0
  with torch.no_grad():
    for data in dataloader:
      #Device specification
      inputs = data['inputs'].to(device)
      labels = data['labels'].to(device)

      #Forward propagation
      outputs = model(inputs)

      #Loss calculation
      if criterion is not None:
        loss += criterion(outputs, labels).item()

      #Correct answer rate calculation
      pred = torch.argmax(outputs, dim=-1)
      total += len(inputs)
      correct += (pred == labels).sum().item()
      
  return loss / len(dataset), correct / total
  

def train_model(dataset_train, dataset_valid, batch_size, model, criterion, optimizer, num_epochs, collate_fn=None, device=None):
  """Executes model training and returns a log of loss / correct answer rate"""
  #Device specification
  model.to(device)

  #Creating a dataloader
  dataloader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
  dataloader_valid = DataLoader(dataset_valid, batch_size=1, shuffle=False)

  #Scheduler settings
  scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, num_epochs, eta_min=1e-5, last_epoch=-1)

  #Learning
  log_train = []
  log_valid = []
  for epoch in range(num_epochs):
    #Record start time
    s_time = time.time()

    #Set to training mode
    model.train()
    for data in dataloader_train:
      #Initialize gradient to zero
      optimizer.zero_grad()

      #Forward propagation+Backpropagation of error+Weight update
      inputs = data['inputs'].to(device)
      labels = data['labels'].to(device)
      outputs = model.forward(inputs)
      loss = criterion(outputs, labels)
      loss.backward()
      optimizer.step()
    
    #Set to evaluation mode
    model.eval()

    #Calculation of loss and correct answer rate
    loss_train, acc_train = calculate_loss_and_accuracy(model, dataset_train, device, criterion=criterion)
    loss_valid, acc_valid = calculate_loss_and_accuracy(model, dataset_valid, device, criterion=criterion)
    log_train.append([loss_train, acc_train])
    log_valid.append([loss_valid, acc_valid])

    #Save checkpoint
    torch.save({'epoch': epoch, 'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict()}, f'checkpoint{epoch + 1}.pt')

    #Record end time
    e_time = time.time()

    #Output log
    print(f'epoch: {epoch + 1}, loss_train: {loss_train:.4f}, accuracy_train: {acc_train:.4f}, loss_valid: {loss_valid:.4f}, accuracy_valid: {acc_valid:.4f}, {(e_time - s_time):.4f}sec') 

    #Stop training if the loss on the validation data has not decreased for 3 consecutive epochs
    if epoch > 2 and log_valid[epoch - 3][0] <= log_valid[epoch - 2][0] <= log_valid[epoch - 1][0] <= log_valid[epoch][0]:
      break
      
    #Take the scheduler one step
    scheduler.step()

  return {'train': log_train, 'valid': log_valid}
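
For reference, the learning rate that `CosineAnnealingLR` produces when it is stepped once per epoch follows the closed form given in the PyTorch documentation, $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\pi t / T_{max}))$. The sketch below evaluates that formula directly, without PyTorch. `cosine_annealing_lr` is a hypothetical helper, and the values `1e-3`, `10`, and `1e-5` are the settings used for Problem 82 in this article.

```python
import math

def cosine_annealing_lr(epoch, base_lr, t_max, eta_min):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in range(11):
    print(epoch, f"{cosine_annealing_lr(epoch, 1e-3, 10, 1e-5):.6f}")
```

The rate starts at the optimizer's learning rate, falls slowly at first, then faster around the middle, and flattens out again as it approaches `eta_min`.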

In addition, define a function to visualize the log.

import numpy as np
from matplotlib import pyplot as plt

def visualize_logs(log):
  fig, ax = plt.subplots(1, 2, figsize=(15, 5))
  ax[0].plot(np.array(log['train']).T[0], label='train')
  ax[0].plot(np.array(log['valid']).T[0], label='valid')
  ax[0].set_xlabel('epoch')
  ax[0].set_ylabel('loss')
  ax[0].legend()
  ax[1].plot(np.array(log['train']).T[1], label='train')
  ax[1].plot(np.array(log['valid']).T[1], label='valid')
  ax[1].set_xlabel('epoch')
  ax[1].set_ylabel('accuracy')
  ax[1].legend()
  plt.show()

Set the parameters and train the model.

#Parameter setting
VOCAB_SIZE = len(set(word2id.values())) + 1 
EMB_SIZE = 300
PADDING_IDX = len(set(word2id.values()))
OUTPUT_SIZE = 4
HIDDEN_SIZE = 50
LEARNING_RATE = 1e-3
BATCH_SIZE = 1
NUM_EPOCHS = 10

#Model definition
model = RNN(VOCAB_SIZE, EMB_SIZE, PADDING_IDX, OUTPUT_SIZE, HIDDEN_SIZE)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

#Optimizer definition
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

#Model learning
log = train_model(dataset_train, dataset_valid, BATCH_SIZE, model, criterion, optimizer, NUM_EPOCHS)

`output`


epoch: 1, loss_train: 1.0954, accuracy_train: 0.5356, loss_valid: 1.1334, accuracy_valid: 0.5015, 86.4033sec
epoch: 2, loss_train: 1.0040, accuracy_train: 0.6019, loss_valid: 1.0770, accuracy_valid: 0.5516, 85.2816sec
epoch: 3, loss_train: 0.8813, accuracy_train: 0.6689, loss_valid: 0.9793, accuracy_valid: 0.6287, 78.9026sec
epoch: 4, loss_train: 0.7384, accuracy_train: 0.7364, loss_valid: 0.8498, accuracy_valid: 0.7058, 78.4496sec
epoch: 5, loss_train: 0.6427, accuracy_train: 0.7696, loss_valid: 0.7878, accuracy_valid: 0.7253, 83.4453sec
epoch: 6, loss_train: 0.5730, accuracy_train: 0.7942, loss_valid: 0.7378, accuracy_valid: 0.7470, 79.6968sec
epoch: 7, loss_train: 0.5221, accuracy_train: 0.8064, loss_valid: 0.7058, accuracy_valid: 0.7530, 79.7377sec
epoch: 8, loss_train: 0.4924, accuracy_train: 0.8173, loss_valid: 0.7017, accuracy_valid: 0.7605, 78.2168sec
epoch: 9, loss_train: 0.4800, accuracy_train: 0.8234, loss_valid: 0.7014, accuracy_valid: 0.7575, 77.8689sec
epoch: 10, loss_train: 0.4706, accuracy_train: 0.8253, loss_valid: 0.6889, accuracy_valid: 0.7650, 79.4202sec

#Log visualization
visualize_logs(log)

#Calculation of correct answer rate
_, acc_train = calculate_loss_and_accuracy(model, dataset_train)
_, acc_test = calculate_loss_and_accuracy(model, dataset_test)
print(f'Correct answer rate (learning data):{acc_train:.3f}')
print(f'Correct answer rate (evaluation data):{acc_test:.3f}')

`output`


Correct answer rate (learning data): 0.825
Correct answer rate (evaluation data): 0.773

83. Mini-batch / Learning on GPU

Modify the code of Problem 82 so that training computes the loss and gradient for every $B$ examples (choose the value of $B$ appropriately). Also, run the training on a GPU.

Currently the sequence length differs from sentence to sentence, but the lengths must be aligned in order to group sentences into a mini-batch. Therefore, we define a new `Padsequence` class that pads each batch to the maximum sequence length among its sentences. By passing it as the `collate_fn` argument of `DataLoader`, the sequence lengths are aligned every time a mini-batch is fetched.

class Padsequence():
  """Pad to the maximum sequence length in the batch each time a mini-batch is fetched from the DataLoader"""
  def __init__(self, padding_idx):
    self.padding_idx = padding_idx

  def __call__(self, batch):
    sorted_batch = sorted(batch, key=lambda x: x['inputs'].shape[0], reverse=True)
    sequences = [x['inputs'] for x in sorted_batch]
    sequences_padded = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True, padding_value=self.padding_idx)
    labels = torch.LongTensor([x['labels'] for x in sorted_batch])

    return {'inputs': sequences_padded, 'labels': labels}
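
What `Padsequence` does to a batch can be sketched without PyTorch. The following assumes plain Python lists instead of tensors, a made-up padding ID of `99`, and a hypothetical helper name `pad_batch`.

```python
def pad_batch(sequences, padding_idx):
    """Sort by length (longest first) and right-pad every sequence to the longest length."""
    ordered = sorted(sequences, key=len, reverse=True)
    max_len = len(ordered[0])
    return [seq + [padding_idx] * (max_len - len(seq)) for seq in ordered]

batch = [[5, 2], [7, 1, 3, 9], [4]]
padded = pad_batch(batch, padding_idx=99)
for row in padded:
    print(row)
# [7, 1, 3, 9]
# [5, 2, 99, 99]
# [4, 99, 99, 99]
```

Note that when a padded batch is passed to `nn.RNN` as-is, the recurrence also runs over the padding positions, so `out[:, -1, :]` for a shorter sentence is the state after its padding tokens rather than after its last word. `torch.nn.utils.rnn.pack_padded_sequence` is one way to skip those positions, though it is not used in this article.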

#Parameter setting
VOCAB_SIZE = len(set(word2id.values())) + 1
EMB_SIZE = 300
PADDING_IDX = len(set(word2id.values()))
OUTPUT_SIZE = 4
HIDDEN_SIZE = 50
LEARNING_RATE = 5e-2
BATCH_SIZE = 32
NUM_EPOCHS = 10

#Model definition
model = RNN(VOCAB_SIZE, EMB_SIZE, PADDING_IDX, OUTPUT_SIZE, HIDDEN_SIZE)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

#Optimizer definition
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

#Device specification
device = torch.device('cuda')

#Model learning
log = train_model(dataset_train, dataset_valid, BATCH_SIZE, model, criterion, optimizer, NUM_EPOCHS, collate_fn=Padsequence(PADDING_IDX), device=device)

`output`


epoch: 1, loss_train: 1.2605, accuracy_train: 0.3890, loss_valid: 1.2479, accuracy_valid: 0.4162, 12.1096sec
epoch: 2, loss_train: 1.2492, accuracy_train: 0.4246, loss_valid: 1.2541, accuracy_valid: 0.4424, 12.0607sec
epoch: 3, loss_train: 1.2034, accuracy_train: 0.4795, loss_valid: 1.2220, accuracy_valid: 0.4686, 11.8881sec
epoch: 4, loss_train: 1.1325, accuracy_train: 0.5392, loss_valid: 1.1542, accuracy_valid: 0.5210, 12.2269sec
epoch: 5, loss_train: 1.0543, accuracy_train: 0.6214, loss_valid: 1.0623, accuracy_valid: 0.6175, 11.8767sec
epoch: 6, loss_train: 1.0381, accuracy_train: 0.6316, loss_valid: 1.0556, accuracy_valid: 0.6145, 11.9757sec
epoch: 7, loss_train: 1.0546, accuracy_train: 0.6165, loss_valid: 1.0806, accuracy_valid: 0.5913, 12.0352sec
epoch: 8, loss_train: 0.9924, accuracy_train: 0.6689, loss_valid: 1.0150, accuracy_valid: 0.6587, 11.9090sec
epoch: 9, loss_train: 1.0123, accuracy_train: 0.6517, loss_valid: 1.0482, accuracy_valid: 0.6310, 12.0953sec
epoch: 10, loss_train: 1.0036, accuracy_train: 0.6623, loss_valid: 1.0319, accuracy_valid: 0.6504, 11.9331sec

#Log visualization
visualize_logs(log)

#Calculation of correct answer rate
_, acc_train = calculate_loss_and_accuracy(model, dataset_train, device)
_, acc_test = calculate_loss_and_accuracy(model, dataset_test, device)
print(f'Correct answer rate (learning data):{acc_train:.3f}')
print(f'Correct answer rate (evaluation data):{acc_test:.3f}')

`output`


Correct answer rate (learning data): 0.662
Correct answer rate (evaluation data): 0.649

84. Introduction of word vector

Initialize the word embedding $\mathrm{emb}(x)$ with pre-trained word vectors (for example, the word vectors trained on the Google News dataset (about 100 billion words)) and train the model.

Download the pre-trained word vectors as in the previous chapter.

#Download learned word vector
FILE_ID = "0B7XkCwpI5KDYNlNUTTlSS21pQmM"
FILE_NAME = "GoogleNews-vectors-negative300.bin.gz"
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=$FILE_ID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=$FILE_ID" -O $FILE_NAME && rm -rf /tmp/cookies.txt

There are two ways to use pre-trained word vectors in a model: use all of their words (replacing the dictionary), or keep the dictionary of the data at hand and use the pre-trained vectors only as initial values for those words. This time the latter is adopted, and the word vectors corresponding to the dictionary already created are extracted.

from gensim.models import KeyedVectors

#Loading trained model
model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin.gz', binary=True)

#Get learned word vector
VOCAB_SIZE = len(set(word2id.values())) + 1
EMB_SIZE = 300
weights = np.zeros((VOCAB_SIZE, EMB_SIZE))
words_in_pretrained = 0
for word, idx in word2id.items():  #Row idx of weights must hold the vector of the word whose ID is idx
  try:
    weights[idx] = model[word]
    words_in_pretrained += 1
  except KeyError:
    weights[idx] = np.random.normal(scale=0.4, size=(EMB_SIZE,))
weights = torch.from_numpy(weights.astype((np.float32)))

print(f'Number of words used in learned vector: {words_in_pretrained} / {VOCAB_SIZE}')
print(weights.size())

`output`


Number of words used in learned vector: 9174 / 9406
torch.Size([9406, 300])
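
The idea of keeping the existing dictionary and only borrowing initial values can be sketched on toy data. Everything below is made up for illustration (`toy_word2id`, `toy_pretrained`, and a 3-dimensional embedding); the point is that row `idx` of the weight matrix holds the vector of the word whose ID is `idx`, and words missing from the pre-trained vectors get random values.

```python
import random

random.seed(0)
EMB = 3
toy_word2id = {"to": 1, "in": 2, "UPDATE": 3}    # IDs start at 1; 0 is "unknown"
toy_pretrained = {"to": [0.1, 0.2, 0.3],         # "UPDATE" is deliberately missing
                  "in": [0.4, 0.5, 0.6]}

vocab_size = len(toy_word2id) + 1                # +1 for the row of ID 0
weights = [[0.0] * EMB for _ in range(vocab_size)]
found = 0
for word, idx in toy_word2id.items():
    if word in toy_pretrained:
        weights[idx] = list(toy_pretrained[word])   # copy the pre-trained vector
        found += 1
    else:
        # No pre-trained vector: fall back to random initialization
        weights[idx] = [random.gauss(0.0, 0.4) for _ in range(EMB)]

print(f"{found} / {vocab_size}")   # 2 / 4
print(weights[1])                  # [0.1, 0.2, 0.3]
```

Row 0 stays at zero in this sketch because no word is assigned ID 0.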

Modify the network so that initial values can be set for the embedding layer. Also add settings for bidirectional and multi-layer RNNs for the next problem.

class RNN(nn.Module):
  def __init__(self, vocab_size, emb_size, padding_idx, output_size, hidden_size, num_layers, emb_weights=None, bidirectional=False):
    super().__init__()
    self.hidden_size = hidden_size
    self.num_layers = num_layers
    self.num_directions = bidirectional + 1  #Unidirectional: 1, bidirectional: 2
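    if emb_weights is not None:  #If specified, initialize the weights of the embedding layer with emb_weights
      #Note: from_pretrained defaults to freeze=True (weights stay fixed); pass freeze=False to also fine-tune them
      self.emb = nn.Embedding.from_pretrained(emb_weights, padding_idx=padding_idx)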
    else:
      self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
    self.rnn = nn.RNN(emb_size, hidden_size, num_layers, nonlinearity='tanh', bidirectional=bidirectional, batch_first=True)
    self.fc = nn.Linear(hidden_size * self.num_directions, output_size)
    
  def forward(self, x):
    self.batch_size = x.size()[0]
    hidden = self.init_hidden(x.device)  #Create a zero vector of h0 on the same device as the input
    emb = self.emb(x)
    # emb.size() = (batch_size, seq_len, emb_size)
    out, hidden = self.rnn(emb, hidden)
    # out.size() = (batch_size, seq_len, hidden_size * num_directions)
    out = self.fc(out[:, -1, :])
    # out.size() = (batch_size, output_size)
    return out

  def init_hidden(self, device):
    hidden = torch.zeros(self.num_layers * self.num_directions, self.batch_size, self.hidden_size, device=device)
    return hidden

Train with the initial values of the embedding layer specified.

#Parameter setting
VOCAB_SIZE = len(set(word2id.values())) + 1
EMB_SIZE = 300
PADDING_IDX = len(set(word2id.values()))
OUTPUT_SIZE = 4
HIDDEN_SIZE = 50
NUM_LAYERS = 1
LEARNING_RATE = 5e-2
BATCH_SIZE = 32
NUM_EPOCHS = 10

#Model definition
model = RNN(VOCAB_SIZE, EMB_SIZE, PADDING_IDX, OUTPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS, emb_weights=weights)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

#Optimizer definition
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

#Device specification
device = torch.device('cuda')

#Model learning
log = train_model(dataset_train, dataset_valid, BATCH_SIZE, model, criterion, optimizer, NUM_EPOCHS, collate_fn=Padsequence(PADDING_IDX), device=device)

`output`


epoch: 1, loss_train: 1.1655, accuracy_train: 0.4270, loss_valid: 1.1839, accuracy_valid: 0.4244, 9.7483sec
epoch: 2, loss_train: 1.1555, accuracy_train: 0.4635, loss_valid: 1.1404, accuracy_valid: 0.4865, 9.7553sec
epoch: 3, loss_train: 1.0189, accuracy_train: 0.6263, loss_valid: 1.0551, accuracy_valid: 0.6085, 10.0445sec
epoch: 4, loss_train: 1.0377, accuracy_train: 0.6221, loss_valid: 1.0947, accuracy_valid: 0.5951, 10.1138sec
epoch: 5, loss_train: 1.0392, accuracy_train: 0.6082, loss_valid: 1.0776, accuracy_valid: 0.5921, 9.8540sec
epoch: 6, loss_train: 1.0447, accuracy_train: 0.6087, loss_valid: 1.1020, accuracy_valid: 0.5793, 9.8598sec
epoch: 7, loss_train: 0.9999, accuracy_train: 0.6270, loss_valid: 1.0519, accuracy_valid: 0.6108, 9.7565sec
epoch: 8, loss_train: 0.9539, accuracy_train: 0.6557, loss_valid: 1.0092, accuracy_valid: 0.6385, 9.7457sec
epoch: 9, loss_train: 0.9287, accuracy_train: 0.6674, loss_valid: 0.9806, accuracy_valid: 0.6430, 9.6464sec
epoch: 10, loss_train: 0.9456, accuracy_train: 0.6593, loss_valid: 1.0029, accuracy_valid: 0.6377, 9.6835sec

#Log visualization
visualize_logs(log)

#Calculation of correct answer rate
_, acc_train = calculate_loss_and_accuracy(model, dataset_train, device)
_, acc_test = calculate_loss_and_accuracy(model, dataset_test, device)
print(f'Correct answer rate (learning data):{acc_train:.3f}')
print(f'Correct answer rate (evaluation data):{acc_test:.3f}')

`output`


Correct answer rate (learning data): 0.659
Correct answer rate (evaluation data): 0.645

85. Bi-directional RNN / multi-layer

Encode the input text using both forward and backward RNNs and train the model.

\overleftarrow h_{T+1} = 0, \ \overleftarrow h_t = {\rm \overleftarrow{RNN}}(\mathrm{emb}(x_t), \overleftarrow h_{t+1}), \ y = {\rm softmax}(W^{(yh)} [\overrightarrow h_T; \overleftarrow h_1] + b^{(y)})

Here, $\overrightarrow h_t \in \mathbb{R}^{d_h}, \overleftarrow h_t \in \mathbb{R}^{d_h}$ are the hidden state vectors at time $t$ obtained by the forward and backward RNNs, respectively, ${\rm \overleftarrow{RNN}}(x, h)$ is the RNN unit that computes the previous state from the input $x$ and the hidden state $h$ of the next time step, $W^{(yh)} \in \mathbb{R}^{L \times 2d_h}$ is the matrix for predicting the category from the hidden state vectors, and $b^{(y)} \in \mathbb{R}^{L}$ is the bias term. Also, $[a; b]$ denotes the concatenation of the vectors $a$ and $b$.

Furthermore, experiment with multi-layer bidirectional RNNs.

Set the `bidirectional` argument, which specifies a bidirectional RNN, to `True` and `NUM_LAYERS` to `2`, then run the training.

#Parameter setting
VOCAB_SIZE = len(set(word2id.values())) + 1
EMB_SIZE = 300
PADDING_IDX = len(set(word2id.values()))
OUTPUT_SIZE = 4
HIDDEN_SIZE = 50
NUM_LAYERS = 2
LEARNING_RATE = 5e-2
BATCH_SIZE = 32
NUM_EPOCHS = 10

#Model definition
model = RNN(VOCAB_SIZE, EMB_SIZE, PADDING_IDX, OUTPUT_SIZE, HIDDEN_SIZE, NUM_LAYERS, emb_weights=weights, bidirectional=True)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

#Optimizer definition
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

#Device specification
device = torch.device('cuda')

#Model learning
log = train_model(dataset_train, dataset_valid, BATCH_SIZE, model, criterion, optimizer, NUM_EPOCHS, collate_fn=Padsequence(PADDING_IDX), device=device)

`output`


epoch: 1, loss_train: 1.1731, accuracy_train: 0.4307, loss_valid: 1.1915, accuracy_valid: 0.4274, 19.3181sec
epoch: 2, loss_train: 1.0395, accuracy_train: 0.6116, loss_valid: 1.0555, accuracy_valid: 0.5996, 18.8118sec
epoch: 3, loss_train: 1.0529, accuracy_train: 0.5899, loss_valid: 1.0832, accuracy_valid: 0.5696, 18.9088sec
epoch: 4, loss_train: 0.9831, accuracy_train: 0.6351, loss_valid: 1.0144, accuracy_valid: 0.6235, 18.8913sec
epoch: 5, loss_train: 1.0622, accuracy_train: 0.5797, loss_valid: 1.1142, accuracy_valid: 0.5487, 19.0636sec
epoch: 6, loss_train: 1.0463, accuracy_train: 0.5741, loss_valid: 1.0972, accuracy_valid: 0.5367, 19.0612sec
epoch: 7, loss_train: 1.0056, accuracy_train: 0.6102, loss_valid: 1.0485, accuracy_valid: 0.5898, 19.0420sec
epoch: 8, loss_train: 0.9724, accuracy_train: 0.6294, loss_valid: 1.0278, accuracy_valid: 0.6093, 19.3077sec
epoch: 9, loss_train: 0.9469, accuracy_train: 0.6371, loss_valid: 0.9943, accuracy_valid: 0.6160, 19.2803sec
epoch: 10, loss_train: 0.9343, accuracy_train: 0.6451, loss_valid: 0.9867, accuracy_valid: 0.6235, 19.0755sec

#Log visualization
visualize_logs(log)

#Calculation of correct answer rate
_, acc_train = calculate_loss_and_accuracy(model, dataset_train, device)
_, acc_test = calculate_loss_and_accuracy(model, dataset_test, device)
print(f'Correct answer rate (learning data):{acc_train:.3f}')
print(f'Correct answer rate (evaluation data):{acc_test:.3f}')

`output`


Correct answer rate (learning data): 0.645
Correct answer rate (evaluation data): 0.634

86. Convolutional Neural Network (CNN)

There is a word string $\boldsymbol{x} = (x_1, x_2, \dots, x_T)$ represented by ID numbers. Here, $T$ is the length of the word string, and $x_t \in \mathbb{R}^{V}$ is the one-hot representation of a word ID number ($V$ is the total number of words). Implement a model that predicts the category $y$ from the word string $\boldsymbol{x}$ using a convolutional neural network (CNN).

The convolutional neural network is configured as follows.

Number of dimensions of the word embeddings: $d_w$
Convolution filter size: 3 tokens
Convolution stride: 1 token
Convolution padding: Yes
Number of dimensions of the vector at each time step after the convolution: $d_h$
After the convolution, apply max pooling and represent the input sentence as a $d_h$-dimensional hidden vector. That is, the feature vector $p_t \in \mathbb{R}^{d_h}$ at time $t$ is given by the following equation.

$$
p_t = g(W^{(px)} [\mathrm{emb}(x_{t-1}); \mathrm{emb}(x_t); \mathrm{emb}(x_{t+1})] + b^{(p)})
$$

Here, $W^{(px)} \in \mathbb{R}^{d_h \times 3d_w}$ and $b^{(p)} \in \mathbb{R}^{d_h}$ are the CNN parameters, $g$ is the activation function (e.g. $\tanh$ or ReLU), and $[a; b; c]$ is the concatenation of the vectors $a, b, c$. The matrix $W^{(px)}$ has $3d_w$ columns because the linear transformation is applied to the concatenated word embeddings of three tokens.

In max pooling, the maximum over all time steps is taken for each dimension of the feature vector, which gives the feature vector $c \in \mathbb{R}^{d_h}$ of the input document. If $c[i]$ denotes the value of the $i$-th dimension of the vector $c$, max pooling is expressed by the following equation.

$$
c[i] = \max_{1 \leq t \leq T} p_t[i]
$$

Finally, apply a linear transformation with the matrix $W^{(yc)} \in \mathbb{R}^{L \times d_h}$ and the bias term $b^{(y)} \in \mathbb{R}^{L}$, followed by the softmax function, to the input document feature vector $c$ to predict the category $y$.

$$
y = \mathrm{softmax}(W^{(yc)} c + b^{(y)})
$$

Note that this problem does not require training the model; it is enough to compute $y$ with randomly initialized weight matrices.

Implement the specified network.
After the embedding layer, the convolution is computed with `nn.Conv2d`. `F.max_pool1d` then takes the maximum along the sequence-length direction, which aggregates the vectors into a single vector per sentence.
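
Before turning to the PyTorch code, it may help to trace the three equations of the problem statement by hand on a toy input. The following is only a minimal pure-Python sketch, not the implementation used in this article: the helper names `affine` and `cnn_forward`, the tiny dimensions ($d_w = 2$, $d_h = 2$, $L = 2$) and all weight values are made up for illustration, and $g$ is assumed to be ReLU.

```python
import math


def affine(W, x, b):
    """Return W x + b for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def cnn_forward(embs, W_px, b_p, W_yc, b_y):
    """Trace p_t, c and y for one sentence given as a list of embedding vectors."""
    zero = [0.0] * len(embs[0])
    padded = [zero] + embs + [zero]  # one zero vector of padding on each side
    p = []
    for t in range(len(embs)):
        # [emb(x_{t-1}); emb(x_t); emb(x_{t+1})]
        window = padded[t] + padded[t + 1] + padded[t + 2]
        p.append([max(0.0, a) for a in affine(W_px, window, b_p)])  # g = ReLU
    # c[i] = max over t of p_t[i]
    c = [max(p_t[i] for p_t in p) for i in range(len(b_p))]
    # y = softmax(W_yc c + b_y), computed in a numerically stable way
    z = affine(W_yc, c, b_y)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    y = [e / sum(exps) for e in exps]
    return p, c, y


# Hypothetical toy values: 3 tokens, d_w = 2, d_h = 2, L = 2
embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_px = [[0.0, 0.0, 1.0, 0.0, 0.0, 0.0],   # looks at emb(x_t)[0]
        [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]   # looks at emb(x_{t+1})[1]
b_p = [0.0, -0.5]
W_yc = [[1.0, 0.0], [0.0, 1.0]]
b_y = [0.0, 0.0]

p, c, y = cnn_forward(embs, W_px, b_p, W_yc, b_y)
print(p)  # one d_h-dimensional vector per token
print(c)  # [1.0, 0.5]
print(y)  # probabilities over the L categories
```

Each row of `W_px` has $3 d_w = 6$ entries, which is the reason given in the problem statement for the $3d_w$ columns. The number of vectors in `p` equals the number of tokens because of the padding.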

from torch.nn import functional as F

class CNN(nn.Module):
  def __init__(self, vocab_size, emb_size, padding_idx, output_size, out_channels, kernel_heights, stride, padding, emb_weights=None):
    super().__init__()
    if emb_weights is not None:  #If emb_weights is specified, initialize the embedding layer with it
      self.emb = nn.Embedding.from_pretrained(emb_weights, padding_idx=padding_idx)
    else:
      self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
    self.conv = nn.Conv2d(1, out_channels, (kernel_heights, emb_size), stride, (padding, 0))
    self.drop = nn.Dropout(0.3)
    self.fc = nn.Linear(out_channels, output_size)
    
  def forward(self, x):
    # x.size() = (batch_size, seq_len)
    emb = self.emb(x).unsqueeze(1)
    # emb.size() = (batch_size, 1, seq_len, emb_size)
    conv = self.conv(emb)
    # conv.size() = (batch_size, out_channels, seq_len, 1)
    act = F.relu(conv.squeeze(3))
    # act.size() = (batch_size, out_channels, seq_len)
    max_pool = F.max_pool1d(act, act.size()[2])
    # max_pool.size() = (batch_size, out_channels, 1) -> maximum taken along the seq_len direction
    out = self.fc(self.drop(max_pool.squeeze(2)))
    # out.size() = (batch_size, output_size)
    return out

#Parameter setting
VOCAB_SIZE = len(set(word2id.values())) + 1
EMB_SIZE = 300
PADDING_IDX = len(set(word2id.values()))
OUTPUT_SIZE = 4
OUT_CHANNELS = 100
KERNEL_HEIGHTS = 3
STRIDE = 1
PADDING = 1

#Model definition
model = CNN(VOCAB_SIZE, EMB_SIZE, PADDING_IDX, OUTPUT_SIZE, OUT_CHANNELS, KERNEL_HEIGHTS, STRIDE, PADDING, emb_weights=weights)

#Get the first 10 predicted values
for i in range(10):
  X = dataset_train[i]['inputs']
  print(torch.softmax(model(X.unsqueeze(0)), dim=-1))

`output`


tensor([[0.2607, 0.2267, 0.2121, 0.3006]], grad_fn=<SoftmaxBackward>)
tensor([[0.2349, 0.2660, 0.2462, 0.2529]], grad_fn=<SoftmaxBackward>)
tensor([[0.2305, 0.2649, 0.2099, 0.2948]], grad_fn=<SoftmaxBackward>)
tensor([[0.2569, 0.2409, 0.2418, 0.2604]], grad_fn=<SoftmaxBackward>)
tensor([[0.2610, 0.2149, 0.2355, 0.2886]], grad_fn=<SoftmaxBackward>)
tensor([[0.2627, 0.2363, 0.2388, 0.2622]], grad_fn=<SoftmaxBackward>)
tensor([[0.2694, 0.2434, 0.2224, 0.2648]], grad_fn=<SoftmaxBackward>)
tensor([[0.2423, 0.2465, 0.2365, 0.2747]], grad_fn=<SoftmaxBackward>)
tensor([[0.2591, 0.2695, 0.2468, 0.2246]], grad_fn=<SoftmaxBackward>)
tensor([[0.2794, 0.2465, 0.2234, 0.2507]], grad_fn=<SoftmaxBackward>)

87. Training a CNN with Stochastic Gradient Descent

Train the model constructed in Problem 86 using stochastic gradient descent (SGD). Train the model while displaying the loss and accuracy on the training data and the loss and accuracy on the evaluation data, and stop at an appropriate criterion (for example, 10 epochs).
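
As a reminder of what a single SGD step does, here is a minimal sketch in plain Python. It assumes plain SGD without momentum or weight decay, which is the setting used for this problem, so each parameter is moved against its gradient by the learning rate. The helper `sgd_step` and the toy objective $f(w) = \sum_i w_i^2$ are hypothetical and serve only as an illustration. In the actual training, the gradients come from the cross-entropy loss on a mini-batch.

```python
def sgd_step(params, grads, lr):
    """Return the parameters after one plain SGD update: w <- w - lr * grad."""
    return [w - lr * g for w, g in zip(params, grads)]


# Hypothetical toy objective: f(w) = sum(w_i ** 2), whose gradient is 2 * w_i
LEARNING_RATE = 5e-2
w = [1.0, -2.0]
for _ in range(10):
    grads = [2.0 * w_i for w_i in w]
    w = sgd_step(w, grads, LEARNING_RATE)

print(w)  # each component has been multiplied by 0.9 ten times
```

With this learning rate every step shrinks each component by a factor of $1 - 2 \cdot 0.05 = 0.9$, so the parameters approach the minimum at zero gradually.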

#Parameter setting
VOCAB_SIZE = len(set(word2id.values())) + 1
EMB_SIZE = 300
PADDING_IDX = len(set(word2id.values()))
OUTPUT_SIZE = 4
OUT_CHANNELS = 100
KERNEL_HEIGHTS = 3
STRIDE = 1
PADDING = 1
LEARNING_RATE = 5e-2
BATCH_SIZE = 64
NUM_EPOCHS = 10

#Model definition
model = CNN(VOCAB_SIZE, EMB_SIZE, PADDING_IDX, OUTPUT_SIZE, OUT_CHANNELS, KERNEL_HEIGHTS, STRIDE, PADDING, emb_weights=weights)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

#Optimizer definition
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)

#Device specification
device = torch.device('cuda')

#Model learning
log = train_model(dataset_train, dataset_valid, BATCH_SIZE, model, criterion, optimizer, NUM_EPOCHS, collate_fn=Padsequence(PADDING_IDX), device=device)

`output`


epoch: 1, loss_train: 1.0671, accuracy_train: 0.5543, loss_valid: 1.0744, accuracy_valid: 0.5726, 12.9214sec
epoch: 2, loss_train: 0.9891, accuracy_train: 0.6594, loss_valid: 1.0148, accuracy_valid: 0.6452, 12.6483sec
epoch: 3, loss_train: 0.9098, accuracy_train: 0.6928, loss_valid: 0.9470, accuracy_valid: 0.6729, 12.7305sec
epoch: 4, loss_train: 0.8481, accuracy_train: 0.7139, loss_valid: 0.8956, accuracy_valid: 0.7028, 12.7967sec
epoch: 5, loss_train: 0.8055, accuracy_train: 0.7250, loss_valid: 0.8634, accuracy_valid: 0.7096, 12.6543sec
epoch: 6, loss_train: 0.7728, accuracy_train: 0.7361, loss_valid: 0.8425, accuracy_valid: 0.7141, 12.7423sec
epoch: 7, loss_train: 0.7527, accuracy_train: 0.7396, loss_valid: 0.8307, accuracy_valid: 0.7216, 12.6718sec
epoch: 8, loss_train: 0.7403, accuracy_train: 0.7432, loss_valid: 0.8227, accuracy_valid: 0.7246, 12.5854sec
epoch: 9, loss_train: 0.7346, accuracy_train: 0.7447, loss_valid: 0.8177, accuracy_valid: 0.7216, 12.4846sec
epoch: 10, loss_train: 0.7331, accuracy_train: 0.7448, loss_valid: 0.8167, accuracy_valid: 0.7231, 12.7443sec

#Log visualization
visualize_logs(log)

#Calculation of correct answer rate
_, acc_train = calculate_loss_and_accuracy(model, dataset_train, device)
_, acc_test = calculate_loss_and_accuracy(model, dataset_test, device)
print(f'Correct answer rate (learning data):{acc_train:.3f}')
print(f'Correct answer rate (evaluation data):{acc_test:.3f}')

`output`


Correct answer rate (learning data): 0.745
Correct answer rate (evaluation data): 0.719

88. Parameter tuning

Build a high-performance category classifier by modifying the code of Problem 85 and Problem 87 and adjusting the shape and hyperparameters of the neural network.

Here I try a simplified version of the TextCNN proposed in "Convolutional Neural Networks for Sentence Classification". The CNN in the previous problem learned only filters of width 3, whereas this network uses filters of three widths: 2, 3, and 4.

from torch.nn import functional as F

class textCNN(nn.Module):
  def __init__(self, vocab_size, emb_size, padding_idx, output_size, out_channels, conv_params, drop_rate, emb_weights=None):
    super().__init__()
    if emb_weights is not None:  #If emb_weights is specified, initialize the embedding layer with it
      self.emb = nn.Embedding.from_pretrained(emb_weights, padding_idx=padding_idx)
    else:
      self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
    self.convs = nn.ModuleList([nn.Conv2d(1, out_channels, (kernel_height, emb_size), padding=(padding, 0)) for kernel_height, padding in conv_params])
    self.drop = nn.Dropout(drop_rate)
    self.fc = nn.Linear(len(conv_params) * out_channels, output_size)
    
  def forward(self, x):
    # x.size() = (batch_size, seq_len)
    emb = self.emb(x).unsqueeze(1)
    # emb.size() = (batch_size, 1, seq_len, emb_size)
    conv = [F.relu(layer(emb)).squeeze(3) for layer in self.convs]
    # conv[i].size() = (batch_size, out_channels, seq_len + padding * 2 - kernel_height + 1)
    max_pool = [F.max_pool1d(c, c.size(2)) for c in conv]
    # max_pool[i].size() = (batch_size, out_channels, 1) -> maximum taken along the seq_len direction
    max_pool_cat = torch.cat(max_pool, 1)
    # max_pool_cat.size() = (batch_size, len(conv_params) * out_channels, 1)  ->Combine results by filter
    out = self.fc(self.drop(max_pool_cat.squeeze(2)))
    # out.size() = (batch_size, output_size)
    return out
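
The shape comment on `conv[i]` in `forward` follows from the usual output-length formula of a convolution. The sketch below checks it for the (kernel height, padding) pairs `[2, 0]`, `[3, 1]`, `[4, 2]` that this article passes as `CONV_PARAMS` further down. It assumes stride 1 and dilation 1. The helper `conv_out_len` and the sequence length of 10 are hypothetical, and 100 is the `out_channels` value that appears in the printed model later on.

```python
def conv_out_len(seq_len, kernel_height, padding, stride=1):
    """Output length along the sequence axis of a convolution (dilation 1)."""
    return (seq_len + 2 * padding - kernel_height) // stride + 1


CONV_PARAMS = [[2, 0], [3, 1], [4, 2]]  # (kernel_height, padding) pairs
OUT_CHANNELS = 100
seq_len = 10                            # hypothetical sentence length

lengths = [conv_out_len(seq_len, k, p) for k, p in CONV_PARAMS]
print(lengths)  # [9, 10, 11]

# Max pooling reduces every filter to one value, so the fully connected layer
# receives len(CONV_PARAMS) * OUT_CHANNELS features regardless of seq_len.
fc_in_features = len(CONV_PARAMS) * OUT_CHANNELS
print(fc_in_features)  # 300
```

The three branches produce different lengths, but max pooling over the sequence axis removes that difference, which is why the results can be concatenated. Note also that `conv_out_len(1, 2, 0)` is 0: with these settings the width-2 filter has no valid position on an input of a single token.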

For hyperparameter tuning, I use optuna, as in Chapter 6.

!pip install optuna

import optuna

def objective(trial):
  #Set of parameters to be tuned
  emb_size = int(trial.suggest_discrete_uniform('emb_size', 100, 400, 100))
  out_channels = int(trial.suggest_discrete_uniform('out_channels', 50, 200, 50))
  drop_rate = trial.suggest_discrete_uniform('drop_rate', 0.0, 0.5, 0.1)
  learning_rate = trial.suggest_loguniform('learning_rate', 5e-4, 5e-2)
  momentum = trial.suggest_discrete_uniform('momentum', 0.5, 0.9, 0.1)
  batch_size = int(trial.suggest_discrete_uniform('batch_size', 16, 128, 16))

  #Fixed parameter settings
  VOCAB_SIZE = len(set(word2id.values())) + 1
  PADDING_IDX = len(set(word2id.values()))
  OUTPUT_SIZE = 4
  CONV_PARAMS = [[2, 0], [3, 1], [4, 2]]
  NUM_EPOCHS = 30

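  #Model definition
  #Note: the global EMB_SIZE is passed here, so the sampled emb_size above has no effect on the model
  #(when emb_weights is given, the embedding dimension is determined by the pretrained weights)
  model = textCNN(VOCAB_SIZE, EMB_SIZE, PADDING_IDX, OUTPUT_SIZE, out_channels, CONV_PARAMS, drop_rate, emb_weights=weights)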

  #Definition of loss function
  criterion = nn.CrossEntropyLoss()

  #Optimizer definition
  optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

  #Device specification
  device = torch.device('cuda')

  #Model learning
  log = train_model(dataset_train, dataset_valid, batch_size, model, criterion, optimizer, NUM_EPOCHS, collate_fn=Padsequence(PADDING_IDX), device=device)

  #Loss calculation
  loss_valid, _ = calculate_loss_and_accuracy(model, dataset_valid, device, criterion=criterion) 

  return loss_valid
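
The search space in `objective` mixes two kinds of ranges: stepped ranges (for example `batch_size` from 16 to 128 in steps of 16) and a log-scaled range for `learning_rate`. The sketch below is a rough illustration of what these two kinds of ranges mean. It is not Optuna's implementation and does not use Optuna at all. The helpers `sample_stepped` and `sample_log_uniform` are hypothetical and sample independently at random, whereas Optuna's samplers also take earlier trials into account. As far as I know, Optuna 3.0 and later deprecate `suggest_discrete_uniform` and `suggest_loguniform` in favor of `suggest_float` with `step=` or `log=True`, so the code above may emit warnings on newer versions.

```python
import math
import random


def sample_stepped(rng, low, high, step):
    """Uniformly pick one value from the grid low, low + step, ..., high."""
    n_steps = int(round((high - low) / step))
    return low + step * rng.randint(0, n_steps)


def sample_log_uniform(rng, low, high):
    """Pick a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
batch_sizes = [sample_stepped(rng, 16, 128, 16) for _ in range(1000)]
learning_rates = [sample_log_uniform(rng, 5e-4, 5e-2) for _ in range(1000)]

print(sorted(set(batch_sizes)))
# Share of learning rates below the geometric midpoint 5e-3
print(sum(lr < 5e-3 for lr in learning_rates) / len(learning_rates))
```

Sampling on a log scale spends about as many trials between 5e-4 and 5e-3 as between 5e-3 and 5e-2, which suits a parameter such as the learning rate whose useful values span orders of magnitude.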

Run the parameter search.

#optimisation
study = optuna.create_study()
study.optimize(objective, timeout=7200)

#View results
print('Best trial:')
trial = study.best_trial
print('  Value: {:.3f}'.format(trial.value))
print('  Params: ')
for key, value in trial.params.items():
  print('    {}: {}'.format(key, value))

`output`


Best trial:
  Value: 0.469
  Params: 
    emb_size: 300.0
    out_channels: 100.0
    drop_rate: 0.4
    learning_rate: 0.013345934577557608
    momentum: 0.8
    batch_size: 32.0

Train the model with the parameters found by the search.

#Parameter setting
VOCAB_SIZE = len(set(word2id.values())) + 1
EMB_SIZE = int(trial.params['emb_size'])
PADDING_IDX = len(set(word2id.values()))
OUTPUT_SIZE = 4
OUT_CHANNELS = int(trial.params['out_channels'])
CONV_PARAMS = [[2, 0], [3, 1], [4, 2]]
DROP_RATE = trial.params['drop_rate']
LEARNING_RATE = trial.params['learning_rate']
BATCH_SIZE = int(trial.params['batch_size'])
NUM_EPOCHS = 30

#Model definition
model = textCNN(VOCAB_SIZE, EMB_SIZE, PADDING_IDX, OUTPUT_SIZE, OUT_CHANNELS, CONV_PARAMS, DROP_RATE, emb_weights=weights)
print(model)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

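#Optimizer definition
#Note: momentum is fixed at 0.9 here; the value found by the search is trial.params['momentum'] (0.8 in the run above)
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE, momentum=0.9)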

#Device specification
device = torch.device('cuda')

#Model learning
log = train_model(dataset_train, dataset_valid, BATCH_SIZE, model, criterion, optimizer, NUM_EPOCHS, collate_fn=Padsequence(PADDING_IDX), device=device)

`output`


textCNN(
  (emb): Embedding(9406, 300, padding_idx=9405)
  (convs): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(2, 300), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(3, 300), stride=(1, 1), padding=(1, 0))
    (2): Conv2d(1, 100, kernel_size=(4, 300), stride=(1, 1), padding=(2, 0))
  )
  (drop): Dropout(p=0.4, inplace=False)
  (fc): Linear(in_features=300, out_features=4, bias=True)
)
epoch: 1, loss_train: 0.7908, accuracy_train: 0.7239, loss_valid: 0.8660, accuracy_valid: 0.6901, 12.2279sec
epoch: 2, loss_train: 0.5800, accuracy_train: 0.7944, loss_valid: 0.7384, accuracy_valid: 0.7485, 12.1637sec
epoch: 3, loss_train: 0.3951, accuracy_train: 0.8738, loss_valid: 0.6189, accuracy_valid: 0.7919, 12.1612sec
epoch: 4, loss_train: 0.2713, accuracy_train: 0.9217, loss_valid: 0.5499, accuracy_valid: 0.8136, 12.1877sec
epoch: 5, loss_train: 0.1913, accuracy_train: 0.9593, loss_valid: 0.5176, accuracy_valid: 0.8293, 12.1722sec
epoch: 6, loss_train: 0.1322, accuracy_train: 0.9749, loss_valid: 0.5042, accuracy_valid: 0.8234, 12.4483sec
epoch: 7, loss_train: 0.1033, accuracy_train: 0.9807, loss_valid: 0.4922, accuracy_valid: 0.8323, 12.1556sec
epoch: 8, loss_train: 0.0723, accuracy_train: 0.9943, loss_valid: 0.4900, accuracy_valid: 0.8308, 12.0309sec
epoch: 9, loss_train: 0.0537, accuracy_train: 0.9966, loss_valid: 0.4903, accuracy_valid: 0.8346, 11.9471sec
epoch: 10, loss_train: 0.0414, accuracy_train: 0.9966, loss_valid: 0.4801, accuracy_valid: 0.8421, 11.9275sec
epoch: 11, loss_train: 0.0366, accuracy_train: 0.9978, loss_valid: 0.4943, accuracy_valid: 0.8406, 11.9691sec
epoch: 12, loss_train: 0.0292, accuracy_train: 0.9983, loss_valid: 0.4839, accuracy_valid: 0.8436, 11.9665sec
epoch: 13, loss_train: 0.0271, accuracy_train: 0.9982, loss_valid: 0.5042, accuracy_valid: 0.8421, 11.9634sec
epoch: 14, loss_train: 0.0222, accuracy_train: 0.9986, loss_valid: 0.4912, accuracy_valid: 0.8458, 11.9298sec
epoch: 15, loss_train: 0.0194, accuracy_train: 0.9988, loss_valid: 0.4925, accuracy_valid: 0.8436, 11.9375sec
epoch: 16, loss_train: 0.0176, accuracy_train: 0.9988, loss_valid: 0.5074, accuracy_valid: 0.8451, 11.9333sec
epoch: 17, loss_train: 0.0163, accuracy_train: 0.9991, loss_valid: 0.5124, accuracy_valid: 0.8436, 11.9137sec

#Log visualization
visualize_logs(log)

#Calculation of correct answer rate
_, acc_train = calculate_loss_and_accuracy(model, dataset_train, device)
_, acc_test = calculate_loss_and_accuracy(model, dataset_test, device)
print(f'Correct answer rate (learning data):{acc_train:.3f}')
print(f'Correct answer rate (evaluation data):{acc_test:.3f}')

`output`


Correct answer rate (learning data): 0.999
Correct answer rate (evaluation data): 0.851

89. Transfer learning from a pre-trained language model

Build a model that classifies news article headlines into categories, starting from a pre-trained language model (e.g. BERT).

This problem is covered in a separate article, "[PyTorch] Introduction to document classification using BERT". Only the resulting accuracy is shown here.

Correct answer rate (learning data): 0.993
Correct answer rate (evaluation data): 0.948

In conclusion

100 Language Processing Knock is designed so that you can learn not only natural language processing itself, but also basic data processing and general-purpose machine learning. Even if you are studying machine learning through online courses, it offers very good hands-on practice, so please give it a try.

To answer all 100 questions