[Language processing 100 knocks 2020] Chapter 8: Neural network

Introduction

2020 version of 100 knocks of language processing, which is famous as a collection of problems of natural language processing, has been released. This article summarizes the results of solving "Chapter 8: Neural Nets" from Chapters 1 to 10 below.

-Chapter 1: Preparatory Movement -Chapter 2: UNIX Commands -Chapter 3: Regular Expressions -Chapter 4: Morphological analysis -Chapter 5: Dependency Analysis -Chapter 6: Machine Learning -Chapter 7: Word Vector --Chapter 8: Neural Net --Chapter 9: RNN, CNN --Chapter 10: Machine Translation

Advance preparation

We use Google Colaboratory for answers. For details on how to set up and use Google Colaboratory, see this article. ** Since GPU is used in this chapter, change the hardware accelerator to "GPU" from "Runtime"-> "Change runtime type" and save it in advance. ** ** The notebook containing the execution results of the following answers is available on github.

Chapter 8: Neural Net

Implement a categorization model with a neural network based on the categorization of news articles discussed in Chapter 6. In this chapter, use machine learning platforms such as PyTorch, TensorFlow, and Chainer.

70. Features by sum of word vectors

I want to convert the training data, validation data, and evaluation data constructed in Problem 50 into matrices and vectors. For example, for training data, we would like to create a matrix $ X $ in which the feature vectors $ \ boldsymbol {x} _i $ of all cases $ x_i $ are arranged and a matrix (vector) $ Y $ in which the correct labels are arranged.

X = \begin{pmatrix} \boldsymbol{x}_1 \ \boldsymbol{x}_2 \ \dots \ \boldsymbol{x}_n \ \end{pmatrix} \in \mathbb{R}^{n \times d}, Y = \begin{pmatrix} y_1 \ y_2 \ \dots \ y_n \ \end{pmatrix} \in \mathbb{N}^{n}

>
 > Here, $ n $ is the number of cases of training data, and $ \ boldsymbol x_i \ in \ mathbb {R} ^ d $ and $ y_i \ in \ mathbb N $ are $ i \ in \ {1, respectively. \ dots, n \} Represents the feature vector and correct label of the $ th case.
 > This time, there are four categories, "business", "science and technology", "entertainment", and "health". If $ \ mathbb N_ {<4} $ represents a natural number less than $ 4 $ (including $ 0 $), the correct label $ y_i $ in any case is $ y_i \ in \ mathbb N_ {<4} $ Can be expressed by.
 > In the following, the number of label types is represented by $ L $ ($ L = 4 $ in this classification task).
>
 > The feature vector $ \ boldsymbol x_i $ of the $ i $ th case is calculated by the following equation.
>
> $$\boldsymbol x_i = \frac{1}{T_i} \sum_{t=1}^{T_i} \mathrm{emb}(w_{i,t})$$
>
 > Here, the $ i $ th case is $ T_i $ word strings (in the article headline) $ (w_ {i, 1}, w_ {i, 2}, \ dots, w_ {i, T_i}) $ \\ mathrm {emb} (w) \ in \ mathbb {R} ^ d $ is a word vector (the number of dimensions is $ d $) corresponding to the word $ w $. That is, $ \ boldsymbol x_i $ is the article headline of the $ i $ th case expressed by the average of the vector of words included in the headline. This time, the word vector downloaded in question 60 should be used. Since we used a word vector of $ 300 $ dimension, $ d = 300 $.
 > The label $ y_i $ of the $ i $ th case is defined as follows.
>
>```math
y_i = \begin{cases}
0 & (\mbox{article}\boldsymbol x_i\mbox{If is in the "Business" category}) \\
1 & (\mbox{article}\boldsymbol x_i\mbox{Is in the "Science and Technology" category}) \\
2 & (\mbox{article}\boldsymbol x_i\mbox{If is in the "Entertainment" category}) \\
3 & (\mbox{article}\boldsymbol x_i\mbox{If is in the "health" category}) \\
\end{cases}

If there is a one-to-one correspondence between the category name and the label number, the correspondence does not have to be as shown in the above formula.

Based on the above specifications, create the following matrix / vector and save it in a file.

Feature matrix of training data: $ X_ {\ rm train} \ in \ mathbb {R} ^ {N_t \ times d} $

Training data label vector: $ Y_ {\ rm train} \ in \ mathbb {N} ^ {N_t} $

Feature matrix of validation data: $ X_ {\ rm valid} \ in \ mathbb {R} ^ {N_v \ times d} $

Validation data label vector: $ Y_ {\ rm valid} \ in \ mathbb {N} ^ {N_v} $

Evaluation data feature matrix: $ X_ {\ rm test} \ in \ mathbb {R} ^ {N_e \ times d} $

Evaluation data label vector: $ Y_ {\ rm test} \ in \ mathbb {N} ^ {N_e} $

Note that $ N_t, N_v, and N_e $ are the number of cases of training data, the number of cases of verification data, and the number of cases of evaluation data, respectively.

First, after downloading the specified data, read it as a data frame. Then, it is divided into training data, verification data, and evaluation data and saved. Up to this point, the processing is exactly the same as problem 50, so there is no problem reading the data created there.

#Download data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
!unzip NewsAggregatorDataset.zip

#Replaced double quotes with single quotes to avoid errors when reading
!sed -e 's/"/'\''/g' ./newsCorpora.csv > ./newsCorpora_re.csv

import pandas as pd
from sklearn.model_selection import train_test_split

#Data reading
df = pd.read_csv('./newsCorpora_re.csv', header=None, sep='\t', names=['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP'])

#Data extraction
df = df.loc[df['PUBLISHER'].isin(['Reuters', 'Huffington Post', 'Businessweek', 'Contactmusic.com', 'Daily Mail']), ['TITLE', 'CATEGORY']]

#Data split
train, valid_test = train_test_split(df, test_size=0.2, shuffle=True, random_state=123, stratify=df['CATEGORY'])
valid, test = train_test_split(valid_test, test_size=0.5, shuffle=True, random_state=123, stratify=valid_test['CATEGORY'])

#Data storage
train.to_csv('./train.txt', sep='\t', index=False)
valid.to_csv('./valid.txt', sep='\t', index=False)
test.to_csv('./test.txt', sep='\t', index=False)

#Confirmation of the number of cases
print('[Learning data]')
print(train['CATEGORY'].value_counts())
print('[Verification data]')
print(valid['CATEGORY'].value_counts())
print('[Evaluation data]')
print(test['CATEGORY'].value_counts())

`output`


[Learning data]
b    4501
e    4235
t    1220
m     728
Name: CATEGORY, dtype: int64
[Verification data]
b    563
e    529
t    153
m     91
Name: CATEGORY, dtype: int64
[Evaluation data]
b    563
e    530
t    152
m     91
Name: CATEGORY, dtype: int64

Next, download and load the learned word vector used in question 60.

#Download learned word vector
FILE_ID = "0B7XkCwpI5KDYNlNUTTlSS21pQmM"
FILE_NAME = "GoogleNews-vectors-negative300.bin.gz"
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=$FILE_ID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=$FILE_ID" -O $FILE_NAME && rm -rf /tmp/cookies.txt

from gensim.models import KeyedVectors

#Load learned word vector
model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin.gz', binary=True)

Finally, create and save the feature and label vectors. After that, it is converted to Tensor type for use as input of neural network by PyTorch.

import string
import torch

def transform_w2v(text):
  table = str.maketrans(string.punctuation, ' '*len(string.punctuation))
  words = text.translate(table).split()  #After replacing the symbol with a space, divide it by a space and list it.
  vec = [model[word] for word in words if word in model]  #Vectorized word by word

  return torch.tensor(sum(vec) / len(vec))  #Convert the average vector to Tensor type and output

#Creating a feature vector
X_train = torch.stack([transform_w2v(text) for text in train['TITLE']])
X_valid = torch.stack([transform_w2v(text) for text in valid['TITLE']])
X_test = torch.stack([transform_w2v(text) for text in test['TITLE']])

print(X_train.size())
print(X_train)

`output`


torch.Size([10684, 300])
tensor([[ 0.0837,  0.0056,  0.0068,  ...,  0.0751,  0.0433, -0.0868],
        [ 0.0272,  0.0266, -0.0947,  ..., -0.1046, -0.0489, -0.0092],
        [ 0.0577, -0.0159, -0.0780,  ..., -0.0421,  0.1229,  0.0876],
        ...,
        [ 0.0392, -0.0052,  0.0686,  ..., -0.0175,  0.0061, -0.0224],
        [ 0.0798,  0.1017,  0.1066,  ..., -0.0752,  0.0623,  0.1138],
        [ 0.1664,  0.0451,  0.0508,  ..., -0.0531, -0.0183, -0.0039]])

#Creating a label vector
category_dict = {'b': 0, 't': 1, 'e':2, 'm':3}
y_train = torch.tensor(train['CATEGORY'].map(lambda x: category_dict[x]).values)
y_valid = torch.tensor(valid['CATEGORY'].map(lambda x: category_dict[x]).values)
y_test = torch.tensor(test['CATEGORY'].map(lambda x: category_dict[x]).values)

print(y_train.size())
print(y_train)

`output`


torch.Size([10684])
tensor([0, 1, 3,  ..., 0, 3, 2])

#Save
torch.save(X_train, 'X_train.pt')
torch.save(X_valid, 'X_valid.pt')
torch.save(X_test, 'X_test.pt')
torch.save(y_train, 'y_train.pt')
torch.save(y_valid, 'y_valid.pt')
torch.save(y_test, 'y_test.pt')

71. Prediction by single-layer neural network

Read the matrix saved in question 70 and perform the following calculations on the training data.

\hat{y}1=softmax(x_1W),\\hat{Y}=softmax(X{[1:4]}W)

 However, $ softmax $ is the softmax function, $ X_ {[1: 4]} ∈ \ mathbb {R} ^ {4 × d} $ is the feature vector $ x_1 $, $ x_2 $, $ x_3 $, $ x_4 $ Is a matrix in which

>```math
X_{[1:4]}=\begin{pmatrix}x_1\\x_2\\x_3\\x_4\end{pmatrix}

The matrix $ W \ in \ mathbb {R} ^ {d \ times L} $ is a weight matrix of a single-layer neural network, which can be initialized with a random value (learned in problem 73 and later). .. Note that $ \ hat {\ boldsymbol y_1} \ in \ mathbb {R} ^ L $ is a vector representing the probability of belonging to each category when the case $ x_1 $ is classified by the unlearned matrix $ W $. Similarly, $ \ hat {Y} \ in \ mathbb {R} ^ {n \ times L} $ expresses the probabilities belonging to each category as a matrix for the training data examples $ x_1, x_2, x_3, x_4 $. are doing.

First, define a single-layer neural network called SLPNet. Define the layers that make up the network with `__ init__```, and arrange the layers through which the input data passes in order with the `forward``` method.

from torch import nn

class SLPNet(nn.Module):
  def __init__(self, input_size, output_size):
    super().__init__()
    self.fc = nn.Linear(input_size, output_size, bias=False)
    nn.init.normal_(self.fc.weight, 0.0, 1.0)  #Initialize weights with normal random numbers

  def forward(self, x):
    x = self.fc(x)
    return x

It then initializes the defined model and performs the indicated calculations.

model = SLPNet(300, 4)  #Initialization of single-layer neural network
y_hat_1 = torch.softmax(model.forward(X_train[:1]), dim=-1)
print(y_hat_1)

`output`


tensor([[0.4273, 0.0958, 0.2492, 0.2277]], grad_fn=<SoftmaxBackward>)

Y_hat = torch.softmax(model.forward(X_train[:4]), dim=-1)
print(Y_hat)

`output`


tensor([[0.4273, 0.0958, 0.2492, 0.2277],
        [0.2445, 0.2431, 0.0197, 0.4927],
        [0.7853, 0.1132, 0.0291, 0.0724],
        [0.5279, 0.2319, 0.0873, 0.1529]], grad_fn=<SoftmaxBackward>)

72. Loss and gradient calculations

Calculate the cross-entropy loss and the gradient for the matrix $ W $ for the case $ x_1 $ and the case set $ x_1 $, $ x_2 $, $ x_3 $, $ x_4 $ of the training data. For a certain case $ x_i $, the loss is calculated by the following equation.

l_i = −log [Probability that case x_i is classified as y_i]

However, the cross-entropy loss for the case set is the average of the losses of each case included in the set.

Here, we will use the `CrossEntropyLoss``` of the `nn``` module. By inputting the output vector and label vector of the model, the average loss of the above equation can be calculated.

criterion = nn.CrossEntropyLoss()

l_1 = criterion(model.forward(X_train[:1]), y_train[:1])  #Input vector is the value before softmax
model.zero_grad()  #Initialize gradient to zero
l_1.backward()  #Calculate the gradient
print(f'loss: {l_1:.4f}')
print(f'Slope:\n{model.fc.weight.grad}')

`output`


loss: 2.9706
Slope:
tensor([[-0.0794, -0.0053, -0.0065,  ..., -0.0713, -0.0411,  0.0823],
        [ 0.0022,  0.0001,  0.0002,  ...,  0.0020,  0.0011, -0.0023],
        [ 0.0611,  0.0041,  0.0050,  ...,  0.0549,  0.0316, -0.0634],
        [ 0.0161,  0.0011,  0.0013,  ...,  0.0144,  0.0083, -0.0167]])

l = criterion(model.forward(X_train[:4]), y_train[:4])
model.zero_grad()
l.backward()
print(f'loss: {l:.4f}')
print(f'Slope:\n{model.fc.weight.grad}')

`output`


loss: 3.0799
Slope:
tensor([[-0.0207,  0.0079, -0.0090,  ..., -0.0350, -0.0003,  0.0232],
        [-0.0055, -0.0063,  0.0225,  ...,  0.0252,  0.0166,  0.0039],
        [ 0.0325, -0.0089, -0.0215,  ...,  0.0084,  0.0122, -0.0030],
        [-0.0063,  0.0072,  0.0081,  ...,  0.0014, -0.0285, -0.0241]])

73. Learning by stochastic gradient descent

Learn the matrix $ W $ using Stochastic Gradient Descent (SGD). Learning may be completed according to an appropriate standard (for example, "end at 100 epochs").

For learning, prepare `Dataset``` and `Dataloader. dataset```Is a type that can hold the feature vector and the label vector together, and transforms the original tensor using the following class.

from torch.utils.data import Dataset

class CreateDataset(Dataset):
    def __init__(self, X, y):  #Specify the components of dataset
        self.X = X
        self.y = y

    def __len__(self):  # len(dataset)Specify the value to be returned with
        return len(self.y)

    def __getitem__(self, idx):  # dataset[idx]Specify the value to be returned with
        if isinstance(idx, torch.Tensor):
            idx = idx.tolist()
        return [self.X[idx], self.y[idx]]

After conversion, create `DataLoader```. Dataloader``` can be input to ``` Dataset```, and the data collected in the specified size (`` batch_size) can be retrieved in order. Here, `` `batch_size = 1 is set, so it means to create `Dataloader``` to retrieve data one by one. Note that ``` Dataloader``` can be retrieved in order with the` `for``` statement, or the next chunk can be called with` next (iter (Dataloader)) ``.

from torch.utils.data import DataLoader

dataset_train = CreateDataset(X_train, y_train)
dataset_valid = CreateDataset(X_valid, y_valid)
dataset_test = CreateDataset(X_test, y_test)
dataloader_train = DataLoader(dataset_train, batch_size=1, shuffle=True)
dataloader_valid = DataLoader(dataset_valid, batch_size=len(dataset_valid), shuffle=False)
dataloader_test = DataLoader(dataset_test, batch_size=len(dataset_test), shuffle=False)

Now that the data is ready, let's train the matrix $ W $. The definition of the model and the definition of the loss function are the same as in the previous question. This time we will also update the weights from the calculated gradient, so we will also define an optimizer. Here, SGD is set according to the instructions. When the parts are ready, the learning is executed with the number of epochs set to 10.

#Model definition
model = SLPNet(300, 4)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

#Optimizer definition
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)

#Learning
num_epochs = 10
for epoch in range(num_epochs):
  #Set to training mode
  model.train()
  loss_train = 0.0
  for i, (inputs, labels) in enumerate(dataloader_train):
    #Initialize gradient to zero
    optimizer.zero_grad()

    #Forward propagation+Error back propagation+Weight update
    outputs = model.forward(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

    #Record loss
    loss_train += loss.item()
 
  #Average loss calculation for each batch
  loss_train = loss_train / i

  #Validation data loss calculation
  model.eval() 
  with torch.no_grad():
    inputs, labels = next(iter(dataloader_valid))
    outputs = model.forward(inputs)
    loss_valid = criterion(outputs, labels)

  #Output log
  print(f'epoch: {epoch + 1}, loss_train: {loss_train:.4f}, loss_valid: {loss_valid:.4f}')

`output`


epoch: 1, loss_train: 0.4745, loss_valid: 0.3637
epoch: 2, loss_train: 0.3173, loss_valid: 0.3306
epoch: 3, loss_train: 0.2884, loss_valid: 0.3208
epoch: 4, loss_train: 0.2716, loss_valid: 0.3150
epoch: 5, loss_train: 0.2615, loss_valid: 0.3141
epoch: 6, loss_train: 0.2519, loss_valid: 0.3092
epoch: 7, loss_train: 0.2474, loss_valid: 0.3114
epoch: 8, loss_train: 0.2431, loss_valid: 0.3072
epoch: 9, loss_train: 0.2393, loss_valid: 0.3096
epoch: 10, loss_train: 0.2359, loss_valid: 0.3219

You can see that the loss of training data is gradually decreasing as the epoch progresses.

74. Measurement of correct answer rate

When classifying the cases of training data and evaluation data using the matrix obtained in question 73, find the correct answer rate for each.

With the trained model and Dataloader as input, define a function to calculate the accuracy rate.

def calculate_accuracy(model, loader):
  model.eval()
  total = 0
  correct = 0
  with torch.no_grad():
    for inputs, labels in loader:
      outputs = model(inputs)
      pred = torch.argmax(outputs, dim=-1)
      total += len(inputs)
      correct += (pred == labels).sum().item()
      
  return correct / total

acc_train = calculate_accuracy(model, dataloader_train)
acc_test = calculate_accuracy(model, dataloader_test)
print(f'Correct answer rate (learning data):{acc_train:.3f}')
print(f'Correct answer rate (evaluation data):{acc_test:.3f}')

`output`


Correct answer rate (learning data): 0.920
Correct answer rate (evaluation data): 0.891

75. Loss and accuracy rate plot

By modifying the code of question 73, each time the parameter update of each epoch is completed, the loss in the training data, the correct answer rate, the loss in the verification data, and the correct answer rate can be plotted on a graph to check the progress of learning. Do it.

The function in the previous question is modified so that the loss can also be calculated, and the loss and the correct answer rate are recorded by applying it for each epoch.

def calculate_loss_and_accuracy(model, criterion, loader):
  model.eval()
  loss = 0.0
  total = 0
  correct = 0
  with torch.no_grad():
    for inputs, labels in loader:
      outputs = model(inputs)
      loss += criterion(outputs, labels).item()
      pred = torch.argmax(outputs, dim=-1)
      total += len(inputs)
      correct += (pred == labels).sum().item()
      
  return loss / len(loader), correct / total

#Model definition
model = SLPNet(300, 4)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

#Optimizer definition
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)

#Learning
num_epochs = 30
log_train = []
log_valid = []
for epoch in range(num_epochs):
  #Set to training mode
  model.train()
  for inputs, labels in dataloader_train:
    #Initialize gradient to zero
    optimizer.zero_grad()

    #Forward propagation+Error back propagation+Weight update
    outputs = model.forward(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
 
  #Calculation of loss and correct answer rate
  loss_train, acc_train = calculate_loss_and_accuracy(model, criterion, dataloader_train)
  loss_valid, acc_valid = calculate_loss_and_accuracy(model, criterion, dataloader_valid)
  log_train.append([loss_train, acc_train])
  log_valid.append([loss_valid, acc_valid])

  #Output log
  print(f'epoch: {epoch + 1}, loss_train: {loss_train:.4f}, accuracy_train: {acc_train:.4f}, loss_valid: {loss_valid:.4f}, accuracy_valid: {acc_valid:.4f}')

`output`


epoch: 1, loss_train: 0.3476, accuracy_train: 0.8796, loss_valid: 0.3656, accuracy_valid: 0.8840
epoch: 2, loss_train: 0.2912, accuracy_train: 0.8988, loss_valid: 0.3219, accuracy_valid: 0.8967
・ ・ ・
epoch: 29, loss_train: 0.2102, accuracy_train: 0.9287, loss_valid: 0.3259, accuracy_valid: 0.8930
epoch: 30, loss_train: 0.2119, accuracy_train: 0.9289, loss_valid: 0.3262, accuracy_valid: 0.8945

from matplotlib import pyplot as plt

#Visualization
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].plot(np.array(log_train).T[0], label='train')
ax[0].plot(np.array(log_valid).T[0], label='valid')
ax[0].set_xlabel('epoch')
ax[0].set_ylabel('loss')
ax[0].legend()
ax[1].plot(np.array(log_train).T[1], label='train')
ax[1].plot(np.array(log_valid).T[1], label='valid')
ax[1].set_xlabel('epoch')
ax[1].set_ylabel('accuracy')
ax[1].legend()
plt.show()

76. Checkpoint

Modify the code in question 75 and write the checkpoints (values of parameters in the process of learning (weight matrix, etc.) and internal state of the optimization algorithm) to a file each time the parameter update of each epoch is completed.

The parameters during training can be accessed with `model.state_dict ()`, and the internal state of the optimization algorithm can be accessed with ```optimizer.state_dict ()` ``, so save it together with the number of epochs in each epoch. Add processing. The output is the same as the previous question, so it will be omitted.

#Model definition
model = SLPNet(300, 4)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

#Optimizer definition
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)

#Learning
num_epochs = 10
log_train = []
log_valid = []
for epoch in range(num_epochs):
  #Set to training mode
  model.train()
  for inputs, labels in dataloader_train:
    #Initialize gradient to zero
    optimizer.zero_grad()

    #Forward propagation+Error back propagation+Weight update
    outputs = model.forward(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
 
  #Calculation of loss and correct answer rate
  loss_train, acc_train = calculate_loss_and_accuracy(model, criterion, dataloader_train)
  loss_valid, acc_valid = calculate_loss_and_accuracy(model, criterion, dataloader_valid)
  log_train.append([loss_train, acc_train])
  log_valid.append([loss_valid, acc_valid])

  #Save checkpoint
  torch.save({'epoch': epoch, 'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict()}, f'checkpoint{epoch + 1}.pt')

  #Output log
  print(f'epoch: {epoch + 1}, loss_train: {loss_train:.4f}, accuracy_train: {acc_train:.4f}, loss_valid: {loss_valid:.4f}, accuracy_valid: {acc_valid:.4f}')

77. Mini batch

Modify the code in question 76, calculate the loss / gradient for each $ B $ case, and update the value of the matrix $ W $ (mini-batch). Compare the time required to learn one epoch while changing the value of $ B $ to $ 1,2,4,8,… $.

Since it is difficult to write all the processes every time the batch size is changed, function the processes after the creation of `Dataloader``` as `train_model``` and take some parameters including the batch size as arguments. Set as.

import time

def train_model(dataset_train, dataset_valid, batch_size, model, criterion, optimizer, num_epochs):
  #Creating a dataloader
  dataloader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
  dataloader_valid = DataLoader(dataset_valid, batch_size=len(dataset_valid), shuffle=False)

  #Learning
  log_train = []
  log_valid = []
  for epoch in range(num_epochs):
    #Record start time
    s_time = time.time()

    #Set to training mode
    model.train()
    for inputs, labels in dataloader_train:
      #Initialize gradient to zero
      optimizer.zero_grad()

      #Forward propagation+Error back propagation+Weight update
      outputs = model.forward(inputs)
      loss = criterion(outputs, labels)
      loss.backward()
      optimizer.step()
  
    #Calculation of loss and correct answer rate
    loss_train, acc_train = calculate_loss_and_accuracy(model, criterion, dataloader_train)
    loss_valid, acc_valid = calculate_loss_and_accuracy(model, criterion, dataloader_valid)
    log_train.append([loss_train, acc_train])
    log_valid.append([loss_valid, acc_valid])

    #Save checkpoint
    torch.save({'epoch': epoch, 'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict()}, f'checkpoint{epoch + 1}.pt')

    #Record end time
    e_time = time.time()

    #Output log
    print(f'epoch: {epoch + 1}, loss_train: {loss_train:.4f}, accuracy_train: {acc_train:.4f}, loss_valid: {loss_valid:.4f}, accuracy_valid: {acc_valid:.4f}, {(e_time - s_time):.4f}sec') 

  return {'train': log_train, 'valid': log_valid}

Measure the processing time while changing the batch size.

#Creating dataset
dataset_train = CreateDataset(X_train, y_train)
dataset_valid = CreateDataset(X_valid, y_valid)

#Model definition
model = SLPNet(300, 4)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

#Optimizer definition
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)

#Model learning
for batch_size in [2 ** i for i in range(11)]:
  print(f'Batch size: {batch_size}')
  log = train_model(dataset_train, dataset_valid, batch_size, model, criterion, optimizer, 1)

`output`


Batch size: 1
epoch: 1, loss_train: 0.3237, accuracy_train: 0.8888, loss_valid: 0.3476, accuracy_valid: 0.8817, 5.4416sec
Batch size: 2
epoch: 1, loss_train: 0.2966, accuracy_train: 0.8999, loss_valid: 0.3258, accuracy_valid: 0.8847, 3.0029sec
Batch size: 4
epoch: 1, loss_train: 0.2883, accuracy_train: 0.8999, loss_valid: 0.3222, accuracy_valid: 0.8862, 1.5988sec
Batch size: 8
epoch: 1, loss_train: 0.2835, accuracy_train: 0.9023, loss_valid: 0.3179, accuracy_valid: 0.8907, 0.8732sec
Batch size: 16
epoch: 1, loss_train: 0.2817, accuracy_train: 0.9038, loss_valid: 0.3164, accuracy_valid: 0.8907, 0.5445sec
Batch size: 32
epoch: 1, loss_train: 0.2810, accuracy_train: 0.9038, loss_valid: 0.3159, accuracy_valid: 0.8900, 0.3482sec
Batch size: 64
epoch: 1, loss_train: 0.2806, accuracy_train: 0.9040, loss_valid: 0.3157, accuracy_valid: 0.8900, 0.2580sec
Batch size: 128
epoch: 1, loss_train: 0.2806, accuracy_train: 0.9041, loss_valid: 0.3156, accuracy_valid: 0.8900, 0.1984sec
Batch size: 256
epoch: 1, loss_train: 0.2801, accuracy_train: 0.9039, loss_valid: 0.3155, accuracy_valid: 0.8900, 0.1715sec
Batch size: 512
epoch: 1, loss_train: 0.2802, accuracy_train: 0.9038, loss_valid: 0.3155, accuracy_valid: 0.8900, 0.2177sec
Batch size: 1024
epoch: 1, loss_train: 0.2792, accuracy_train: 0.9038, loss_valid: 0.3155, accuracy_valid: 0.8900, 0.1603sec

In general, you can see that the larger the batch size, the shorter the calculation time.

78. Learning on GPU

Modify the code in question 77 and perform the learning on the GPU.

Add the argument `device``` to specify the GPU to calculate_loss_and_accuracy``` and `` train_model. In each function, you can use the GPU by adding the process of sending the model and input Tensor to the GPU and specifying `` `cuda for `` `device```.

def calculate_loss_and_accuracy(model, criterion, loader, device):
  model.eval()
  loss = 0.0
  total = 0
  correct = 0
  with torch.no_grad():
    for inputs, labels in loader:
      inputs = inputs.to(device)
      labels = labels.to(device)
      outputs = model(inputs)
      loss += criterion(outputs, labels).item()
      pred = torch.argmax(outputs, dim=-1)
      total += len(inputs)
      correct += (pred == labels).sum().item()
      
  return loss / len(loader), correct / total
  

def train_model(dataset_train, dataset_valid, batch_size, model, criterion, optimizer, num_epochs, device=None):
  #Send to GPU
  model.to(device)

  #Creating a dataloader
  dataloader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
  dataloader_valid = DataLoader(dataset_valid, batch_size=len(dataset_valid), shuffle=False)

  #Learning
  log_train = []
  log_valid = []
  for epoch in range(num_epochs):
    #Record start time
    s_time = time.time()

    #Set to training mode
    model.train()
    for inputs, labels in dataloader_train:
      #Initialize gradient to zero
      optimizer.zero_grad()

      #Forward propagation+Error back propagation+Weight update
      inputs = inputs.to(device)
      labels = labels.to(device)
      outputs = model.forward(inputs)
      loss = criterion(outputs, labels)
      loss.backward()
      optimizer.step()
    
    #Calculation of loss and correct answer rate
    loss_train, acc_train = calculate_loss_and_accuracy(model, criterion, dataloader_train, device)
    loss_valid, acc_valid = calculate_loss_and_accuracy(model, criterion, dataloader_valid, device)
    log_train.append([loss_train, acc_train])
    log_valid.append([loss_valid, acc_valid])

    #Save checkpoint
    torch.save({'epoch': epoch, 'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict()}, f'checkpoint{epoch + 1}.pt')

    #Record end time
    e_time = time.time()

    #Output log
    print(f'epoch: {epoch + 1}, loss_train: {loss_train:.4f}, accuracy_train: {acc_train:.4f}, loss_valid: {loss_valid:.4f}, accuracy_valid: {acc_valid:.4f}, {(e_time - s_time):.4f}sec') 

  return {'train': log_train, 'valid': log_valid}

#Creating dataset
dataset_train = CreateDataset(X_train, y_train)
dataset_valid = CreateDataset(X_valid, y_valid)

#Model definition
model = SLPNet(300, 4)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

#Optimizer definition
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)

#Device specification
device = torch.device('cuda')

#Model learning
for batch_size in [2 ** i for i in range(11)]:
  print(f'Batch size: {batch_size}')
  log = train_model(dataset_train, dataset_valid, batch_size, model, criterion, optimizer, 1, device=device)

`output`


Batch size: 1
epoch: 1, loss_train: 0.3300, accuracy_train: 0.8874, loss_valid: 0.3584, accuracy_valid: 0.8772, 9.0342sec
Batch size: 2
epoch: 1, loss_train: 0.3025, accuracy_train: 0.8994, loss_valid: 0.3374, accuracy_valid: 0.8870, 4.6391sec
Batch size: 4
epoch: 1, loss_train: 0.2938, accuracy_train: 0.9005, loss_valid: 0.3321, accuracy_valid: 0.8855, 2.4228sec
Batch size: 8
epoch: 1, loss_train: 0.2894, accuracy_train: 0.9039, loss_valid: 0.3299, accuracy_valid: 0.8855, 1.2517sec
Batch size: 16
epoch: 1, loss_train: 0.2876, accuracy_train: 0.9038, loss_valid: 0.3285, accuracy_valid: 0.8855, 0.7149sec
Batch size: 32
epoch: 1, loss_train: 0.2867, accuracy_train: 0.9050, loss_valid: 0.3280, accuracy_valid: 0.8862, 0.4323sec
Batch size: 64
epoch: 1, loss_train: 0.2863, accuracy_train: 0.9050, loss_valid: 0.3277, accuracy_valid: 0.8862, 0.2834sec
Batch size: 128
epoch: 1, loss_train: 0.2869, accuracy_train: 0.9051, loss_valid: 0.3276, accuracy_valid: 0.8862, 0.2070sec
Batch size: 256
epoch: 1, loss_train: 0.2864, accuracy_train: 0.9054, loss_valid: 0.3275, accuracy_valid: 0.8862, 0.1587sec
Batch size: 512
epoch: 1, loss_train: 0.2859, accuracy_train: 0.9056, loss_valid: 0.3275, accuracy_valid: 0.8862, 0.2016sec
Batch size: 1024
epoch: 1, loss_train: 0.2858, accuracy_train: 0.9056, loss_valid: 0.3275, accuracy_valid: 0.8862, 0.1303sec

While the batch size is small, it seems that the time to send to the GPU for each batch is longer, and the processing time is shorter when using the CPU. On the other hand, as the batch size increases, you can see that it is faster when using the GPU.

79. Multi-layer neural network

Modify the code in question 78 to build a high-performance categorizer while changing the shape of the neural network, such as introducing bias terms and multi-layering.

Define a new multi-layer neural network MLPNet. This network consists of input layer-> middle layer-> output layer, and batch normalization is performed after the middle layer. In addition, `` `train_model``` introduces a new learning termination standard. This time, the rule is simply to stop when the loss of verification data does not decrease for 3 consecutive epochs. In addition, we will add a scheduler that gradually lowers the learning rate, aiming to improve generalization performance.

from torch.nn import functional as F

class MLPNet(nn.Module):
  def __init__(self, input_size, mid_size, output_size, mid_layers):
    super().__init__()
    self.mid_layers = mid_layers
    self.fc = nn.Linear(input_size, mid_size)
    self.fc_mid = nn.Linear(mid_size, mid_size)
    self.fc_out = nn.Linear(mid_size, output_size) 
    self.bn = nn.BatchNorm1d(mid_size)

  def forward(self, x):
    x = F.relu(self.fc(x))
    for _ in range(self.mid_layers):
      x = F.relu(self.bn(self.fc_mid(x)))
    x = F.relu(self.fc_out(x))
 
    return x

from torch import optim

def calculate_loss_and_accuracy(model, criterion, loader, device):
  model.eval()
  loss = 0.0
  total = 0
  correct = 0
  with torch.no_grad():
    for inputs, labels in loader:
      inputs = inputs.to(device)
      labels = labels.to(device)
      outputs = model(inputs)
      loss += criterion(outputs, labels).item()
      pred = torch.argmax(outputs, dim=-1)
      total += len(inputs)
      correct += (pred == labels).sum().item()
      
  return loss / len(loader), correct / total
  

def train_model(dataset_train, dataset_valid, batch_size, model, criterion, optimizer, num_epochs, device=None):
  #Send to GPU
  model.to(device)

  #Creating a dataloader
  dataloader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
  dataloader_valid = DataLoader(dataset_valid, batch_size=len(dataset_valid), shuffle=False)

  #Scheduler settings
  scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, num_epochs, eta_min=1e-5, last_epoch=-1)

  #Learning
  log_train = []
  log_valid = []
  for epoch in range(num_epochs):
    #Record start time
    s_time = time.time()

    #Set to training mode
    model.train()
    for inputs, labels in dataloader_train:
      #Initialize gradient to zero
      optimizer.zero_grad()

      #Forward propagation+Error back propagation+Weight update
      inputs = inputs.to(device)
      labels = labels.to(device)
      outputs = model.forward(inputs)
      loss = criterion(outputs, labels)
      loss.backward()
      optimizer.step()
    
    #Calculation of loss and correct answer rate
    loss_train, acc_train = calculate_loss_and_accuracy(model, criterion, dataloader_train, device)
    loss_valid, acc_valid = calculate_loss_and_accuracy(model, criterion, dataloader_valid, device)
    log_train.append([loss_train, acc_train])
    log_valid.append([loss_valid, acc_valid])

    #Save checkpoint
    torch.save({'epoch': epoch, 'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict()}, f'checkpoint{epoch + 1}.pt')

    #Record end time
    e_time = time.time()

    #Output log
    print(f'epoch: {epoch + 1}, loss_train: {loss_train:.4f}, accuracy_train: {acc_train:.4f}, loss_valid: {loss_valid:.4f}, accuracy_valid: {acc_valid:.4f}, {(e_time - s_time):.4f}sec') 

    #If the loss of verification data does not decrease for 3 consecutive epochs, learning ends.
    if epoch > 2 and log_valid[epoch - 3][0] <= log_valid[epoch - 2][0] <= log_valid[epoch - 1][0] <= log_valid[epoch][0]:
      break

    #Take the scheduler one step
    scheduler.step()

  return {'train': log_train, 'valid': log_valid}

#Creating dataset
dataset_train = CreateDataset(X_train, y_train)
dataset_valid = CreateDataset(X_valid, y_valid)

#Model definition
model = MLPNet(300, 200, 4, 1)

#Definition of loss function
criterion = nn.CrossEntropyLoss()

#Optimizer definition
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

#Device specification
device = torch.device('cuda')

#Model learning
log = train_model(dataset_train, dataset_valid, 64, model, criterion, optimizer, 1000, device)

`output`


epoch: 1, loss_train: 1.1176, accuracy_train: 0.6679, loss_valid: 1.1150, accuracy_valid: 0.6572, 0.4695sec
epoch: 2, loss_train: 0.8050, accuracy_train: 0.7620, loss_valid: 0.8005, accuracy_valid: 0.7687, 0.4521sec
・ ・ ・
epoch: 96, loss_train: 0.1708, accuracy_train: 0.9460, loss_valid: 0.2858, accuracy_valid: 0.9034, 0.4632sec
epoch: 97, loss_train: 0.1702, accuracy_train: 0.9466, loss_valid: 0.2861, accuracy_valid: 0.9034, 0.5373sec

It was discontinued at 97 epochs. Visualize the loss and accuracy rate for each epoch.

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].plot(np.array(log['train']).T[0], label='train')
ax[0].plot(np.array(log['valid']).T[0], label='valid')
ax[0].set_xlabel('epoch')
ax[0].set_ylabel('loss')
ax[0].legend()
ax[1].plot(np.array(log['train']).T[1], label='train')
ax[1].plot(np.array(log['valid']).T[1], label='valid')
ax[1].set_xlabel('epoch')
ax[1].set_ylabel('accuracy')
ax[1].legend()
plt.show()

Check the accuracy rate of the evaluation data.

def calculate_accuracy(model, loader, device):
  model.eval()
  total = 0
  correct = 0
  with torch.no_grad():
    for inputs, labels in loader:
      inputs = inputs.to(device)
      labels = labels.to(device)
      outputs = model(inputs)
      pred = torch.argmax(outputs, dim=-1)
      total += len(inputs)
      correct += (pred == labels).sum().item()
      
  return correct / total

#Confirmation of correct answer rate
acc_train = calculate_accuracy(model, dataloader_train, device)
acc_test = calculate_accuracy(model, dataloader_test, device)
print(f'Correct answer rate (learning data):{acc_train:.3f}')
print(f'Correct answer rate (evaluation data):{acc_test:.3f}')

`output`


Correct answer rate (learning data): 0.947
Correct answer rate (evaluation data): 0.921

In the single-layer neural network, the accuracy rate of the evaluation data was 0.891, but it is improved by 3 points by making it multi-layered.

in conclusion

100 Language Processing Knock is designed so that you can learn not only natural language processing itself, but also basic data processing and general-purpose machine learning. Even those who are studying machine learning in online courses will be able to practice very good output, so please try it.

To all 100 questions