[PyTorch Tutorial ⑥] What is torch.nn really?

Introduction

This is the 6th installment of the PyTorch Official Tutorial series, following on from the previous installment. This time, we will work through "What is torch.nn really?".

What is torch.nn really?

This tutorial describes torch.nn, torch.optim, Dataset, and DataLoader. (torch.nn and torch.optim were also covered last time; the official tutorials are written by different authors, so there is some overlap.)

The dataset used is MNIST, a dataset of handwritten digit images of the numbers 0 to 9. To build understanding, we first build a model without using any of the packages above, then gradually replace the code with torch.nn, torch.optim, Dataset, and DataLoader, one piece at a time.

  1. MNIST data setup

First, download the MNIST dataset (handwritten digit image dataset).

from pathlib import Path
import requests

DATA_PATH = Path("data")
PATH = DATA_PATH / "mnist"

PATH.mkdir(parents=True, exist_ok=True)

URL = "http://deeplearning.net/data/mnist/"
FILENAME = "mnist.pkl.gz"

if not (PATH / FILENAME).exists():
        content = requests.get(URL + FILENAME).content
        (PATH / FILENAME).open("wb").write(content)

This dataset is a numpy array. It is saved in pickle format.

import pickle
import gzip

with gzip.open((PATH / FILENAME).as_posix(), "rb") as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")

Each sample (e.g. x_train[0]) is a 28x28 image, but it is stored as a single row of 784 values. To view it with pyplot.imshow, you need to reshape it to 28x28.

from matplotlib import pyplot
import numpy as np

pyplot.imshow(x_train[0].reshape((28, 28)), cmap="gray")
print(x_train.shape)

out


(50000, 784)

(Output image: the first training sample, a handwritten 5, displayed with pyplot.imshow)

From now on, we will use PyTorch's Tensor. Convert from a numpy array to a Tensor.

import torch

x_train, y_train, x_valid, y_valid = map(
    torch.tensor, (x_train, y_train, x_valid, y_valid)
)
n, c = x_train.shape
x_train, x_train.shape, y_train.min(), y_train.max()
print(x_train, y_train)
print(x_train.shape)
print(y_train.min(), y_train.max())

out


tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]]) tensor([5, 0, 4,  ..., 8, 4, 8])
torch.Size([50000, 784])
tensor(0) tensor(9)

You can see that there are 50,000 training samples (each with 784 features), and that the teacher data (labels) are digits from 0 to 9.

  2. Neural net from scratch (no torch.nn)

First, create a neural network using only tensors, without torch.nn. The model to be created is a simple linear model, $y = w \times x + b$.

Initialize the weights $w$ with PyTorch's randn, which draws standard normal random values (mean 0, standard deviation 1). Since we don't want the initialization itself to be tracked by autograd, we call requires_grad_() only after initialization, which sets requires_grad = True. The weights are scaled following the idea of "Xavier initialization" (although the formula used here seems slightly different from the standard one; see the note below). The bias is initialized to zero.
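As a side note on that formula: the code below scales the weights by $1/\sqrt{784}$, i.e. $\sigma = 1/\sqrt{n_{in}}$, while the standard Xavier (Glorot) initialization, as far as I understand it, also takes the output size into account:

$$
\sigma_{\text{Xavier}} = \sqrt{\frac{2}{n_{in} + n_{out}}} \qquad \text{vs.} \qquad \sigma_{\text{here}} = \frac{1}{\sqrt{n_{in}}}
$$

so the constant differs slightly, but the idea of shrinking the random weights in proportion to the layer size is the same.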

import math

weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()
bias = torch.zeros(10, requires_grad=True)

We also need an activation function, so create a log_softmax function. PyTorch provides many loss and activation functions, but you can also create your own in this way.
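The one-line implementation below follows directly from the definition of log-softmax:

$$
\operatorname{log\_softmax}(x)_j = \log \frac{e^{x_j}}{\sum_k e^{x_k}} = x_j - \log \sum_k e^{x_k}
$$

i.e. subtract the log of the summed exponentials from each element (the unsqueeze in the code just keeps the broadcasting dimensions aligned).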

def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)

def model(xb):
    return log_softmax(xb @ weights + bias)

The @ operator performs matrix multiplication. The model function is called on one mini-batch at a time (64 images here).
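As a quick, purely illustrative shape check (using the weights and bias defined above and a random stand-in for a mini-batch):

# (64, 784) @ (784, 10) -> (64, 10); the bias of shape (10,) is broadcast over the batch
xb_demo = torch.randn(64, 784)
print((xb_demo @ weights + bias).shape)  # torch.Size([64, 10])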

bs = 64  # batch size

xb = x_train[0:bs]  # one mini-batch from the training data
preds = model(xb)   # predictions from the model
preds[0], preds.shape
print(preds[0], preds.shape)

out


tensor([-2.8486, -2.2823, -2.2740, -2.7800, -2.1906, -1.3280, -2.4680, -2.2958,
        -2.8856, -2.8650], grad_fn=<SelectBackward>) torch.Size([64, 10])

If you print the predictions preds, you can see that the tensor carries a gradient function (grad_fn). We will use this later for backpropagation. Next, implement the negative log-likelihood of the predictions and the teacher data as the loss function; combined with the log_softmax output, this is what is commonly called the cross-entropy loss.
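For a batch of $N$ samples, write $\ell_{i,y_i}$ for the log-probability that the model (whose output has already been through log_softmax) assigns to the correct class $y_i$ of sample $i$. The negative log-likelihood is then

$$
\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \ell_{i,\,y_i}
$$

The indexing input[range(target.shape[0]), target] in the code below selects exactly these $\ell_{i,y_i}$ values, one per row, and the mean over the batch supplies the $1/N$ factor.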

def nll(input, target):
    return -input[range(target.shape[0]), target].mean()

loss_func = nll

Calculate the loss from the predictions and the teacher data, so we can compare it with the loss after training.

yb = y_train[0:bs]
print(loss_func(preds, yb))

out


tensor(2.4101, grad_fn=<NegBackward>)

We also implement an evaluation function that calculates the accuracy of the model. Since out holds a value for each of the digits 0 to 9, the index of the maximum value (argmax) is the predicted digit. The accuracy is the fraction of predictions that match the teacher data.

def accuracy(out, yb):
    preds = torch.argmax(out, dim=1)
    return (preds == yb).float().mean()
print(accuracy(preds, yb))

out


tensor(0.0781)

Now we are ready to train. The training loop repeats the following steps:

- Get one mini-batch of training data.
- Use the model to make predictions on that mini-batch.
- Calculate the loss.
- Call loss.backward() and update the model parameters (weights and biases) using the gradients.

After updating the weights and biases, the gradients are reset with grad.zero_(). This is because loss.backward() adds the newly computed gradients to whatever is already stored.
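Here is a minimal illustration of that accumulation behavior, separate from the training loop, using a throwaway scalar tensor:

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
print(w.grad)   # tensor(2.)

(2 * w).backward()
print(w.grad)   # tensor(4.) -- the new gradient was added to the one already stored

w.grad.zero_()  # reset before the next backward pass
print(w.grad)   # tensor(0.)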

from IPython.core.debugger import set_trace

lr = 0.5  # learning rate
epochs = 2  # how many epochs to train for

for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        #set_trace()
        start_i = i * bs
        end_i = start_i + bs
        xb = x_train[start_i:end_i]
        yb = y_train[start_i:end_i]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        with torch.no_grad():
            weights -= weights.grad * lr
            bias -= bias.grad * lr
            weights.grad.zero_()
            bias.grad.zero_()

You can see that the accuracy has improved after learning.

print(loss_func(model(xb), yb), accuracy(model(xb), yb))

out


tensor(0.0822, grad_fn=<NegBackward>) tensor(1.)

Before training, the accuracy on this mini-batch was about 8%; after training, it is 100%.

Now you have a simple neural network built from scratch. This network using the softmax function without a hidden layer is called logistic regression.

  3. Using torch.nn.functional

From here, we'll use PyTorch's nn package to refactor our code. As a first step, let's replace the activation and loss functions. torch.nn.functional provides F.cross_entropy, which combines log_softmax with the negative log-likelihood, so replace the loss function with F.cross_entropy. Since F.cross_entropy already includes log_softmax, the log_softmax(x) function we defined earlier can also be removed.

import torch.nn.functional as F

loss_func = F.cross_entropy

def model(xb):
    return xb @ weights + bias

The log_softmax call inside model is no longer needed (it is included in cross_entropy). Check that the loss and accuracy are the same as before.

print(loss_func(model(xb), yb), accuracy(model(xb), yb))

out


tensor(0.0822, grad_fn=<NllLossBackward>) tensor(1.)

  4. Refactor using nn.Module

Next, we will refactor using nn.Module and nn.Parameter. nn.Module is the base class for PyTorch neural networks. Implement the model as a subclass of nn.Module, define the weight and bias parameters in that subclass, and describe how the input is transformed into the output in the forward method. nn.Module also provides parameters(), which returns the model's parameters.

from torch import nn

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(784, 10) / math.sqrt(784))
        self.bias = nn.Parameter(torch.zeros(10))

    def forward(self, xb):
        return xb @ self.weights + self.bias

Since we are using objects instead of functions, we need to instantiate the model first.

model = Mnist_Logistic()

We can now calculate the loss in the same way as before the refactoring; an nn.Module object can be called and used like a function.

print(loss_func(model(xb), yb))

out


tensor(2.3918, grad_fn=<NllLossBackward>)

In the implementation so far, the weights and biases were updated separately as shown below, and the gradients were zeroed manually.

  with torch.no_grad():
      weights -= weights.grad * lr
      bias -= bias.grad * lr
      weights.grad.zero_()
      bias.grad.zero_()

The weight and bias updates can be simplified by using parameters() and zero_grad(), both defined on nn.Module.

  # Explanatory code only (running it as-is at this point would raise a run-time error)
  with torch.no_grad():
      for p in model.parameters(): p -= p.grad * lr
      model.zero_grad()

Define the learning loop as a fit function so that it can be called.

def fit():
    for epoch in range(epochs):
        for i in range((n - 1) // bs + 1):
            start_i = i * bs
            end_i = start_i + bs
            xb = x_train[start_i:end_i]
            yb = y_train[start_i:end_i]
            pred = model(xb)
            loss = loss_func(pred, yb)

            loss.backward()
            with torch.no_grad():
                for p in model.parameters():
                    p -= p.grad * lr
                model.zero_grad()

fit()

Let's reconfirm that the loss is reduced.

print(loss_func(model(xb), yb))

out


tensor(0.0796, grad_fn=<NllLossBackward>)

  5. Refactor using nn.Linear

So far we defined weights and bias ourselves and implemented the linear function $w \times x + b$ by hand; let's replace this with nn.Linear (a linear layer).

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

    def forward(self, xb):
        return self.lin(xb)

Instantiate the model as before and calculate the loss.

model = Mnist_Logistic()
print(loss_func(model(xb), yb))

out


tensor(2.3661, grad_fn=<NllLossBackward>)

Train by calling the fit function defined above.

fit()
print(loss_func(model(xb), yb))

out


tensor(0.0813, grad_fn=<NllLossBackward>)

The loss value has dropped from 2.3661 to 0.0813, confirming that the model is learning.

  6. Refactor using optim

Next, refactor the optimization step. PyTorch's torch.optim package provides various optimization algorithms. Each optimizer class updates the parameters via its step method, instead of updating them manually.

  with torch.no_grad():
      for p in model.parameters(): p -= p.grad * lr
      model.zero_grad()

You can rewrite the above code as follows.

  # Explanatory code only; opt is created below
  opt.step()
  opt.zero_grad()

from torch import optim

Wrapping the creation of the model and the optimizer in a function keeps the code simple.

def get_model():
    model = Mnist_Logistic()
    return model, optim.SGD(model.parameters(), lr=lr)

model, opt = get_model()
print(loss_func(model(xb), yb))

for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        start_i = i * bs
        end_i = start_i + bs
        xb = x_train[start_i:end_i]
        yb = y_train[start_i:end_i]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

out


tensor(2.3423, grad_fn=<NllLossBackward>)
tensor(0.0819, grad_fn=<NllLossBackward>)

  7. Refactor using Dataset

PyTorch has an abstract Dataset class. A Dataset makes it easier to handle the training data (x_train) and teacher data (y_train) during training. A Dataset must implement a __len__ function that returns the number of elements and a __getitem__ function that returns an element for a given index. TensorDataset is a Dataset that wraps tensors.
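For illustration only, a minimal hand-written Dataset implementing that protocol might look like the hypothetical class below (it is not needed here, because TensorDataset already provides the same behavior for tensors):

from torch.utils.data import Dataset

class MyArrayDataset(Dataset):
    # Illustrative Dataset wrapping paired inputs and labels
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        # number of samples
        return len(self.x)

    def __getitem__(self, idx):
        # return one (input, label) pair by index
        return self.x[idx], self.y[idx]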

from torch.utils.data import TensorDataset

Create a TensorDataset by passing x_train and y_train to it.

train_ds = TensorDataset(x_train, y_train)

Previously, the training data (x_train) and the teacher data (y_train) were iteratively processed separately.

    xb = x_train[start_i:end_i]
    yb = y_train[start_i:end_i]

With TensorDataset, both can be sliced together in a single step.

    xb,yb = train_ds[i*bs : i*bs+bs]

model, opt = get_model()

for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        xb, yb = train_ds[i * bs: i * bs + bs]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

out


tensor(0.0803, grad_fn=<NllLossBackward>)

  8. Refactor using DataLoader

DataLoader can be used to simplify looping with Datasets. Create a DataLoader based on the Dataset.

from torch.utils.data import DataLoader

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs)

In the earlier code, we computed the start position for each batch and sliced the data ourselves.

for i in range((n-1)//bs + 1):
    xb,yb = train_ds[i*bs : i*bs+bs]
    pred = model(xb)

A DataLoader makes the loop simpler, because each (xb, yb) pair is loaded from it automatically.

for xb,yb in train_dl:
    pred = model(xb)

model, opt = get_model()

for epoch in range(epochs):
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print(loss_func(model(xb), yb))

out


tensor(0.0802, grad_fn=<NllLossBackward>)

So far, we have used nn.Module, nn.Parameter, torch.optim, Dataset, and DataLoader, which let us write the training code concisely. Next, let's add the basic features needed to build an effective model.

  9. Add validation

Up to this point, we have trained using only the training data, but in practice a validation set is also used to check whether the model is overfitting and whether training is progressing. Set up the validation dataset below. Note that the training DataLoader shuffles the data, while the validation DataLoader uses a larger batch size, since no backpropagation is performed during validation.

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)

valid_ds = TensorDataset(x_valid, y_valid)
valid_dl = DataLoader(valid_ds, batch_size=bs * 2)

At the end of each epoch, the validation data is used to calculate the loss. Call model.train() before training to switch to training mode, and model.eval() before validation to switch to evaluation mode. This ensures that layers such as nn.Dropout are active only during training.

model, opt = get_model()

for epoch in range(epochs):
    model.train()
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

    model.eval()
    with torch.no_grad():
        valid_loss = sum(loss_func(model(xb), yb) for xb, yb in valid_dl)

    print(epoch, valid_loss / len(valid_dl))

out


0 tensor(0.3679)
1 tensor(0.2997)

  10. Create fit() and get_data()

Next, create a loss_batch function that can be used for both training and validation. When an optimizer is passed to loss_batch, it performs backpropagation and updates the parameters. During validation, no optimizer is passed, so backpropagation is skipped.

def loss_batch(model, loss_func, xb, yb, opt=None):
    loss = loss_func(model(xb), yb)

    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()

    return loss.item(), len(xb)

Define the fit function. The fit function iterates training and validation on each epoch and displays the loss.

import numpy as np

def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_dl:
            loss_batch(model, loss_func, xb, yb, opt)

        model.eval()
        with torch.no_grad():
            losses, nums = zip(
                *[loss_batch(model, loss_func, xb, yb) for xb, yb in valid_dl]
            )
        val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)

        print(epoch, val_loss)

get_data returns a DataLoader for training and validation data.

def get_data(train_ds, valid_ds, bs):
    return (
        DataLoader(train_ds, batch_size=bs, shuffle=True),
        DataLoader(valid_ds, batch_size=bs * 2),
    )

Now you can write the process of getting the DataLoader and performing the learning in three lines of code.

train_dl, valid_dl = get_data(train_ds, valid_ds, bs)
model, opt = get_model()
fit(epochs, model, loss_func, opt, train_dl, valid_dl)

out


0 0.45953697173595426
1 0.3061695278286934

These same three lines of code can be used to train a wide variety of models. Let's see if we can use them to train a convolutional neural network (CNN)!

  11. Switch to CNN

From here, we will build a neural network with three convolutional layers. None of the functions created so far assume anything about the model, so we can switch to a CNN without changing them.

Use PyTorch's Conv2d class for the convolutional layers. The CNN defined below has three convolutional layers, each followed by a ReLU activation, and ends with an average pooling layer. (view is PyTorch's version of numpy's reshape.)

class Mnist_CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1)

    def forward(self, xb):
        xb = xb.view(-1, 1, 28, 28)
        xb = F.relu(self.conv1(xb))
        xb = F.relu(self.conv2(xb))
        xb = F.relu(self.conv3(xb))
        xb = F.avg_pool2d(xb, 4)
        return xb.view(-1, xb.size(1))

lr = 0.1

Momentum is a variation of stochastic gradient descent that also takes previous updates into account, which generally leads to faster training.
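As a rough sketch of the idea (ignoring the extra options such as dampening, weight decay, and Nesterov momentum that the optimizer also supports), each parameter $p$ keeps a velocity $v$ that is updated from the gradient $g_t$:

$$
v_t = \mu \, v_{t-1} + g_t, \qquad p_t = p_{t-1} - \mathrm{lr} \times v_t
$$

with $\mu = 0.9$ in the code below.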

model = Mnist_CNN()
opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)

fit(epochs, model, loss_func, opt, train_dl, valid_dl)

out


0 0.7808464194297791
1 0.6988550303936004

  12. nn.Sequential

torch.nn has another handy class, Sequential, that can be used to simplify the code. A Sequential object runs each of the modules it contains in order, which makes it easy to describe a network.

To take full advantage of Sequential, we sometimes need custom layers. PyTorch does not provide a layer that simply reshapes its input (a view layer), so we create one ourselves. The Lambda class below turns an arbitrary function into a layer that can be used inside Sequential.

class Lambda(nn.Module):
    def __init__(self, func):
        super().__init__()
        self.func = func

    def forward(self, x):
        return self.func(x)


def preprocess(x):
    return x.view(-1, 1, 28, 28)

Sequential makes it easy to describe your network as follows:

model = nn.Sequential(
    Lambda(preprocess),
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AvgPool2d(4),
    Lambda(lambda x: x.view(x.size(0), -1)),
)

opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)

fit(epochs, model, loss_func, opt, train_dl, valid_dl)

out


0 0.4288556560516357
1 0.2115058801174164

  13. Wrapping DataLoader

The CNN we created is fairly concise, but it only works with MNIST data (handwritten digit images) because of the following assumptions:

- The input must be a 28 x 28 image.
- The final feature map is assumed to be 4 x 4 (because 2D average pooling with kernel size 4 is used).

Remove these two assumptions so that the model works with any 2D single-channel image (monochromatic image). First, delete the first Lambda layer and move the data preprocessing to the DataLoader.

def preprocess(x, y):
    return x.view(-1, 1, 28, 28), y


class WrappedDataLoader:
    def __init__(self, dl, func):
        self.dl = dl
        self.func = func

    def __len__(self):
        return len(self.dl)

    def __iter__(self):
        batches = iter(self.dl)
        for b in batches:
            yield (self.func(*b))

train_dl, valid_dl = get_data(train_ds, valid_ds, bs)
train_dl = WrappedDataLoader(train_dl, preprocess)
valid_dl = WrappedDataLoader(valid_dl, preprocess)

Next, replace nn.AvgPool2d with nn.AdaptiveAvgPool2d, which lets you specify the size of the output tensor you want rather than fixing it to a particular input size. As a result, the average pooling layer works with inputs of any size.

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    Lambda(lambda x: x.view(x.size(0), -1)),
)

opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)

Let's try it.

fit(epochs, model, loss_func, opt, train_dl, valid_dl)

out


0 0.3351769802570343
1 0.2583931807518005

  14. Using your GPU

If you have a CUDA-capable GPU available (most cloud providers offer one for about $0.50 per hour), you can speed up training. First, check that PyTorch can see the GPU.

print(torch.cuda.is_available())

out


True

Next, create a device object. The device object is set to "cuda" if the GPU is available and "cpu" if it is not available.

dev = torch.device(
    "cuda") if torch.cuda.is_available() else torch.device("cpu")

Add preprocessing to move the batch to the GPU.

def preprocess(x, y):
    return x.view(-1, 1, 28, 28).to(dev), y.to(dev)

train_dl, valid_dl = get_data(train_ds, valid_ds, bs)
train_dl = WrappedDataLoader(train_dl, preprocess)
valid_dl = WrappedDataLoader(valid_dl, preprocess)

Finally, move the model to the GPU.

model.to(dev)
opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)

You can see that the processing speed has increased.

fit(epochs, model, loss_func, opt, train_dl, valid_dl)

out


0 0.1938392831325531
1 0.18594802458286286

When I tried this on Google Colaboratory, the training above, which took about 15 seconds on the CPU, finished in about 5 seconds on the GPU.

15. Closing thoughts (Summary)

In this tutorial, we built model-independent data processing and a model-independent training loop. There is much more we might want to add, such as data augmentation, hyperparameter tuning, training monitoring, and transfer learning. These features are available in the fastai library, which was developed using the same design approach shown in this tutorial and is a good next step for anyone continuing to learn machine learning.

To summarize, this tutorial used torch.nn (nn.Module, nn.Parameter, nn.functional), torch.optim (optim.SGD), Dataset (TensorDataset), and DataLoader to progressively simplify the training code while keeping it flexible.

16. Finally

That's "What is torch.nn really?" It was similar to the last time, but I was able to deepen my understanding of Pytorch and neural networks. Next time, I would like to proceed with "Visualizing Models, Data, and Training with TensorBoard".

History

2020/10/10 First edition released
