In this article, I will introduce a collection of tips from NVIDIA for accelerating deep learning with PyTorch.
It is based on the presentation 「PyTorch Performance Tuning Guide - Szymon Migacz, NVIDIA」, shared by Arun Mallya of NVIDIA, to which I add explanations and example programs.
The gist of this article is exactly what Andrej Karpathy tweeted:
good quick tutorial on optimizing your PyTorch code ⏲️: https://t.co/7CIDWfrI0J
— Andrej Karpathy (@karpathy) August 30, 2020
quick summary: pic.twitter.com/6J1SJcWJsl
I will explain this in an easy-to-understand manner.
※ Andrej Karpathy earned his PhD under Dr. Fei-Fei Li, who created ImageNet, and currently heads Tesla's AI division. The often-quoted 5% human error rate on ImageNet comes from Karpathy himself, who took on the benchmark as the human representative to obtain that figure.
The effect of each tip in this article depends on the GPU environment you are using, so try them and see which ones work for your environment.
All the programs in this article are publicly available at https://github.com/YutaroOgawa/Qiita/tree/master/pytorch_performance in Jupyter Notebook format.
Throughout this article, each [Performance change] section reports the time taken for one epoch of MNIST training.
PyTorch's DataLoader has two default settings that are not ideal. https://pytorch.org/docs/stable/data.html
1.1 num_workers
First, the argument num_workers defaults to 0, so mini-batches are fetched in a single process.
Setting num_workers = 2 or more enables multi-process data loading, which speeds up processing.
You can check the number of CPU cores below.
#Check the number of CPU cores
import os
os.cpu_count() #Number of cores
In practice, 2 workers are enough for a single GPU.
Create a DataLoader as follows.
#For Data Loader with default settings
train_loader_default = torch.utils.data.DataLoader(dataset1,batch_size=mini_batch_size)
test_loader_default = torch.utils.data.DataLoader(dataset2,batch_size=mini_batch_size)
#Data loader: 2
train_loader_nworker = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, num_workers=2)
test_loader_nworker = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, num_workers=2)
#Data loader: full
train_loader_nworker = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, num_workers=os.cpu_count())
test_loader_nworker = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, num_workers=os.cpu_count())
[Performance change] Check the performance change for one epoch of MNIST training.
You can check which GPU your environment is using with:
#Check GPU
!nvidia-smi
This time, the environments are: ● Case 1: p3.2xlarge (NVIDIA® Volta V100 Tensor Core GPU) ● Case 2: Google Colaboratory (NVIDIA Tesla T4 Tensor Core GPU, Turing architecture)
Note that Google Colaboratory assigns a different GPU type each session.
[For default] ● Case 1: p3.2xlarge: 14.73 seconds ● Case 2: Google Colaboratory: 10.01 seconds
[In the case of num_workers = os.cpu_count ()] ● Case 1: p3.2xlarge: 3.47 seconds ● Case 2: Google Colaboratory: 9.43 seconds
Both cases are faster, and Case 1 dramatically so, dropping to roughly 1/3 of the default time.
Note that p3.2xlarge has 8 CPU cores while Google Colaboratory has 2.
That said, even 2 cores with num_workers = 2 appears to be sufficient.
The original presentation also notes that there is little difference between 2 workers and more.
1.2 pin_memory
PyTorch's DataLoader defaults to pin_memory = False.
Setting pin_memory = True enables **automatic memory pinning**: the CPU memory holding each batch is page-locked (it will not be paged out), which is expected to speed up transfers to the GPU.
(reference) https://pytorch.org/docs/stable/data.html#memory-pinning https://zukaaax.com/archives/301 https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
(Explanation of memory paging) https://wa3.i-3-i.info/word13352.html
The implementation is as follows.
#For Data Loader with default settings
train_loader_default = torch.utils.data.DataLoader(dataset1,batch_size=mini_batch_size)
test_loader_default = torch.utils.data.DataLoader(dataset2,batch_size=mini_batch_size)
#Data loader pin memory
train_loader_pin_memory = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, pin_memory=True)
test_loader_pin_memory = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, pin_memory=True)
As before, check the performance change in MNIST training 1 epoch.
[For default] ● Case 1: p3.2xlarge: 14.73 seconds ● Case 2: Google Colaboratory: 10.01 seconds
[When pin_memory = True] ● Case 1: p3.2xlarge: 13.65 seconds ● Case 2: Google Colaboratory: 9.82 seconds
[In the case of num_workers = os.cpu_count ()] ● Case 1: p3.2xlarge: 3.47 seconds ● Case 2: Google Colaboratory: 9.43 seconds
[When num_workers = os.cpu_count () & pin_memory = True] ● Case 1: p3.2xlarge: 3.50 seconds ● Case 2: Google Colaboratory: 9.35 seconds
Compared to the default settings, you can see that it is faster.
Once num_workers is set, the additional effect of pin_memory is not visible, probably because MNIST is too small a workload.
[1] When creating a DataLoader in PyTorch, set the num_workers and pin_memory arguments as follows.
#Default configuration
train_loader_default = torch.utils.data.DataLoader(dataset1,batch_size=mini_batch_size)
test_loader_default = torch.utils.data.DataLoader(dataset2,batch_size=mini_batch_size)
#Data loader recommended
train_loader_pin_memory = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, num_workers=os.cpu_count(), pin_memory=True)
test_loader_pin_memory = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, num_workers=os.cpu_count(), pin_memory=True)
#Or data loader num_workers=2
train_loader_pin_memory = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, num_workers=2, pin_memory=True)
test_loader_pin_memory = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, num_workers=2, pin_memory=True)
Make sure to set torch.backends.cudnn.benchmark = True when running training.
When the network shape is fixed, this lets cuDNN benchmark the available algorithms and pick the fastest ones, speeding up computation on the GPU.
Set it to True when the input size does not change at the start or partway through training, as in a typical CNN.
Note, however, that exact reproducibility of the computation is lost.
(About PyTorch calculation reproducibility) https://pytorch.org/docs/stable/notes/randomness.html
The implementation is as follows, for example.
def MNIST_train_cudnn_benchmark_True(optimizer, model, device, train_loader, test_loader):
#Training by default
epochs = 1
#add to
torch.backends.cudnn.benchmark = True
#processing
for epoch in range(1, epochs+1):
train(model, device, train_loader, optimizer, epoch)
test(model, device, test_loader)
Here, the function train () has the following form.
def train(model, device, train_loader, optimizer, epoch):
model.train() #In training mode
for batch_idx, (data, target) in enumerate(train_loader):
#Data retrieval
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
#propagation
output = model(data)
#Loss calculation and backpropagation
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()
Compare speeds. Leave the DataLoader at the default setting.
[For default] ● Case 1: p3.2xlarge: 14.73 seconds ● Case 2: Google Colaboratory: 10.01 seconds
[When torch.backends.cudnn.benchmark = True] ● Case 1: p3.2xlarge: 14.47 seconds ● Case 2: Google Colaboratory: 9.66 seconds
Since we are only solving MNIST here and the network is small, the effect is weak, but it is still a little faster.
**Set torch.backends.cudnn.benchmark = True when running your program.**
The larger the mini-batch size, the more stable the learning. Therefore, increase the mini-batch size.
Thanks to PyTorch's **AMP (Automatic Mixed Precision)** feature, a mini-batch size larger than you might expect can actually fit in memory.
AMP mixes numerical precisions: computations that would normally be done in FP32 (32-bit floating point) are done in FP16 (16-bit floating point) where it is safe, which reduces memory usage and improves computation speed without compromising accuracy.
On GPUs with Tensor Cores, the speedup is even larger (NVIDIA cites up to 12x for training and up to 6x for inference).
(reference) https://www.nvidia.com/ja-jp/data-center/tensor-cores/
The Volta V100 carries first-generation Tensor Cores, while the T series carries second-generation Turing Tensor Cores, which are said to be about twice as fast as the first generation.
See the following references for how to use AMP.
(reference) https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/ https://pytorch.org/docs/stable/notes/amp_examples.html
Implement according to the examples above.
Rewrite the earlier train() function.
def train_PyTorchAMP(model, device, train_loader, optimizer, epoch):
model.train() #In training mode
scaler = torch.cuda.amp.GradScaler()
for batch_idx, (data, target) in enumerate(train_loader):
#Data retrieval
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
#propagation
# Runs the forward pass with autocasting.
with torch.cuda.amp.autocast():
output = model(data)
loss = F.nll_loss(output, target)
# Scales loss. Calls backward() on scaled loss to create scaled gradients.
scaler.scale(loss).backward()
# scaler.step() first unscales the gradients of the optimizer's assigned params.
scaler.step(optimizer)
# Updates the scale for next iteration.
scaler.update()
Create a scaler with scaler = torch.cuda.amp.GradScaler(), run the forward pass and loss computation under autocast, and route backpropagation and the optimizer step through the scaler.
Set DataLoader as the default setting and use AMP to compare speeds.
[For default] ● Case 1: p3.2xlarge: 14.73 seconds ● Case 2: Google Colaboratory: 10.01 seconds
[For AMP] ● Case 1: p3.2xlarge: 14.21 seconds ● Case 2: Google Colaboratory: 11.97 seconds
With this MNIST example, the amount of computation per step is small, so I did not feel much of an effect.
By using AMP it becomes possible to increase the mini-batch size beyond what you would expect, but be aware of the following points when you do so (a minimal scheduler sketch for points [3] and [4] follows the list):
[1] Adjust the learning rate
[2] Adjust weight decay (the strength of the optimizer's regularization penalty)
[3] Incorporate warmup: early in training, increase the learning rate linearly from 0 up to the target value
[4] Incorporate learning rate decay: gradually reduce the learning rate toward the end of training
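Here is a minimal sketch of points [3] and [4] using torch.optim.lr_scheduler.LambdaLR. The 5-epoch warmup, the cosine decay, and the placeholder model and SGD settings are illustrative assumptions of mine, not values from the original presentation.

```python
import math
import torch

model = torch.nn.Linear(10, 10)  # placeholder; substitute your own network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

warmup_epochs, total_epochs = 5, 50  # illustrative values

def lr_lambda(epoch):
    if epoch < warmup_epochs:
        # [3] linear warmup: scale the base lr from near 0 up to 1.0
        return (epoch + 1) / warmup_epochs
    # [4] decay: cosine decay from the base lr down toward 0
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(total_epochs):
    # the training loop for one epoch would run here
    scheduler.step()
```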
Also, for large mini-batches, consider an optimizer such as LARS, LAMB, or NVIDIA's LAMB implementation, NVLAMB.
The problem with large mini-batches is the following: **for the same amount of training, the number of parameter updates is smaller than with a small mini-batch size, and if you simply raise the learning rate to compensate, the learning rate becomes too high and training is hard to stabilize.**
LARS (Layer-wise Adaptive Rate Scaling) addresses this by multiplying the learning rate by a per-layer coefficient called the "trust ratio", computed from the gradient.
LAMB (Layer-wise Adaptive Moments optimizer for Batch training) extends LARS by additionally taking the rate of change of each weight parameter into account.
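To make the idea of the trust ratio concrete, here is a rough sketch of how it can be computed for one layer. The exact formula (including the weight-decay term in the denominator) varies between implementations, so treat this as my own illustration rather than the definitive LARS formula.

```python
import torch

def lars_trust_ratio(weight: torch.Tensor, grad: torch.Tensor,
                     weight_decay: float = 1e-4, eps: float = 1e-9) -> torch.Tensor:
    # Ratio of this layer's weight norm to its (weight-decayed) gradient norm.
    w_norm = weight.norm()
    g_norm = (grad + weight_decay * weight).norm()
    return w_norm / (g_norm + eps)

# The effective learning rate for the layer is then base_lr * trust_ratio,
# so layers whose gradients are large relative to their weights take smaller steps.
```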
By using LAMB, BERT training that normally takes 81 hours can be completed in 76 minutes, roughly a 64-fold speedup.
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes https://arxiv.org/abs/1904.00962
(From NVIDIA's A Guide to Optimizer Implementation for BERT at Scale)
(reference) https://medium.com/nvidia-ai/a-guide-to-optimizer-implementation-for-bert-at-scale-8338cc7f45fd https://developer.nvidia.com/blog/pretraining-bert-with-layer-wise-adaptive-learning-rates/ https://postd.cc/optimizing-gradient-descent/ https://towardsdatascience.com/an-intuitive-understanding-of-the-lamb-optimizer-46f8c0ae4866
First, install apex by referring to the following NVIDIA APEX (A PyTorch Extension) page.
https://github.com/NVIDIA/apex https://nvidia.github.io/apex/
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
The implementation is as follows. First, rewrite train ().
from apex import amp
def trainAMP(model, device, train_loader, optimizer, epoch):
model.train() #In training mode
for batch_idx, (data, target) in enumerate(train_loader):
#Data retrieval
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
#propagation
output = model(data)
#Loss calculation and backpropagation
loss = F.nll_loss(output, target)
# AMP Train your model
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
optimizer.step()
Then write a training function using trainAMP ().
def MNIST_trainAMP(optimizer, model, device, train_loader, test_loader):
epochs = 1
start = time.time()
torch.backends.cudnn.benchmark = True
#processing
for epoch in range(1, epochs+1):
trainAMP(model, device, train_loader, optimizer, epoch)
test(model, device, test_loader)
#Time taken
print("=======Time taken========")
print(time.time() - start)
Set the optimizer to apex.optimizers.FusedLAMB. NVIDIA's LAMB implementation is called NVLAMB.
import apex
#Set model, learning rate and optimizer
model = Net().to(device)
lr_rate = 0.1
optimizer = apex.optimizers.FusedLAMB(model.parameters(), lr=lr_rate)
# Initialization
opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
Initialize the model and optimizer with AMP.
Finally, we will carry out training.
MNIST_trainAMP(optimizer, model, device,
train_loader_pin_memory, test_loader_pin_memory)
That is how to use NVIDIA's LAMB optimizer for large mini-batches.
When training with multiple GPUs, use DistributedDataParallel (torch.nn.parallel.DistributedDataParallel), not DataParallel (torch.nn.DataParallel).
As the lecture slide explains, this is because DataParallel uses only one CPU core, whereas DistributedDataParallel allocates one CPU core per GPU.
NVIDIA APEX's apex.parallel.DistributedDataParallel can be used in the same way as torch.nn.parallel.DistributedDataParallel, but it has an extra advantage: it supports **Synchronized Batch Normalization**.
With multiple GPUs, PyTorch's standard batch-normalization layer normalizes only within the chunk of the mini-batch assigned to each GPU: each GPU computes its own mean and standard deviation, and those statistics are then averaged to obtain the learned batch-normalization mean and standard deviation.
Because normalization is performed per GPU, this is called **Asynchronized Batch Normalization**, and its result differs from batch normalization computed over all the data distributed across the GPUs.
In PyTorch you can work around this with torch.nn.SyncBatchNorm, but it is rather troublesome to implement.
With NVIDIA APEX's apex.parallel.DistributedDataParallel, you simply convert the model with
sync_bn_model = apex.parallel.convert_syncbn_model(model)
to get **Synchronized Batch Normalization**.
(reference) https://nvidia.github.io/apex/parallel.html https://github.com/NVIDIA/apex/tree/master/apex/parallel https://github.com/NVIDIA/apex/tree/master/examples/simple/distributed
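For reference, here is a minimal sketch of the PyTorch-native route (torch.nn.parallel.DistributedDataParallel combined with torch.nn.SyncBatchNorm). The single-node torchrun launch, the process-group setup, and the reuse of Net(), dataset1, and mini_batch_size from earlier are my own assumptions, not code from the original presentation.

```python
# Launch with e.g.: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = Net().to(local_rank)                     # the MNIST model used above
# Replace every BatchNorm layer with its synchronized version
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])

# Each process must see a different shard of the data
sampler = torch.utils.data.distributed.DistributedSampler(dataset1)
train_loader = torch.utils.data.DataLoader(
    dataset1, batch_size=mini_batch_size, sampler=sampler,
    num_workers=2, pin_memory=True)
```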
Attach the @torch.jit.script decorator to functions that perform pointwise operations on tensors to compile them with PyTorch JIT (into a fused, C++-executable form) and speed them up.
JIT (Just-In-Time Compiler) compiles code at execution time to improve execution speed.
TensorFlow 1.x and Keras are define-and-run: the computation graph is compiled first and then run (which made the code awkward to write).
PyTorch is define-by-run, building the computation as data flows through it. For functions whose computation is fixed, however, it pays to compile them ahead of time, so we use JIT to turn them into a C++ executable form (still called from Python).
For example, when you want to define the activation function gelu, the normal definition and the definition in JIT are as follows.
def gelu(x):
return x * 0.5 * (1.0 + torch.erf(x / 1.41421))
@torch.jit.script
def fused_gelu(x):
return x * 0.5 * (1.0 + torch.erf(x / 1.41421))
To compile it with PyTorch JIT, attach the @torch.jit.script decorator to the function.
Comparing the execution speed of the two:
import time
x = torch.randn(2000, 3000)
start = time.time()
for i in range(200):
gelu(x)
#Time taken
print("=======Time taken========")
print(time.time() - start)
and
import time
x = torch.randn(2000, 3000)
start = time.time()
for i in range(200):
fused_gelu(x)
#Time taken
print("=======Time taken========")
print(time.time() - start)
the results are:
● Case 1: p3.2xlarge (NVIDIA® Volta V100 Tensor Core GPU): 9.8 seconds → 6.6 seconds
● Case 2: Google Colaboratory (Tesla T4 Tensor Core GPU, Turing): 13.94 seconds → 13.91 seconds
There is almost no change on Google Colaboratory, but on AWS p3.2xlarge the time is shortened to roughly two-thirds.
When part of the model does not need backpropagation, as in GAN training, do not use
model.zero_grad()
but instead clear the gradients by setting them to None:
for param in model.parameters():
    param.grad = None
This is because model.zero_grad() actually writes zeros into the gradient buffers and keeps them allocated, consuming memory, whereas setting them to None releases them.
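As a concrete illustration, here is a hedged sketch of how this looks inside the train() loop from earlier; the zero_grad_to_none helper is my own naming, and the optimizer.zero_grad(set_to_none=True) shortcut mentioned in the comment requires PyTorch 1.7 or later.

```python
def zero_grad_to_none(model):
    # Drop the gradient buffers instead of writing zeros into them
    for param in model.parameters():
        param.grad = None

for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    zero_grad_to_none(model)      # instead of model.zero_grad() / optimizer.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()

# In PyTorch 1.7+ the same effect is available as: optimizer.zero_grad(set_to_none=True)
```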
Batch normalization standardizes its input to zero mean, so any bias term in the preceding layer is immediately cancelled out, and the batch-normalization layer just learns a constant shift to counteract it.
Since this is a waste of computation time and parameters, set bias=False on the layer immediately before batch normalization so that no bias term is used; a minimal sketch follows.
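Here is a minimal sketch of such a block (an illustrative Conv-BN-ReLU stack of my own, not a model from this article):

```python
import torch.nn as nn

# The convolution's bias would be cancelled by the following BatchNorm,
# whose learnable shift (beta) plays that role instead, so bias is disabled.
conv_bn_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```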
As noted earlier, PyTorch's DataLoader defaults to pin_memory = False, but you can enable **automatic memory pinning** by setting pin_memory = True.
The CPU memory holding each batch is then page-locked, which is expected to speed up transfers to the GPU.
The implementation at this time was as follows.
#Data loader recommended
train_loader_pin_memory = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, num_workers=os.cpu_count(), pin_memory=True)
test_loader_pin_memory = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, num_workers=os.cpu_count(), pin_memory=True)
Now, for even more speed, enable asynchronous GPU copies.
The reason is that, by default, **the CPU cannot do other work while data is being transferred from the CPU's pinned memory to the GPU**.
So use the non_blocking = True setting.
The CPU can then keep working while the transfer from pinned memory to the GPU is in flight, which is expected to give a further speedup.
The implementation is simple, rewriting the part that sends data to cuda.
for batch_idx, (data, target) in enumerate(train_loader):
#Data retrieval
data, target = data.to(device), target.to(device)
into
# non_blocking=True
for batch_idx, (data, target) in enumerate(train_loader):
#Data retrieval
data, target = data.to(device, non_blocking=True), target.to(device, non_blocking=True)
That is, pass the argument non_blocking = True to .to().
reference https://stackoverflow.com/questions/55563376/pytorch-how-does-pin-memory-works-in-dataloader https://pytorch.org/docs/stable/notes/cuda.html#use-pinned-memory-buffers https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234 https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
Compare when non_blocking = True and when it is not (pin_memory = True only).
● Case 1: p3.2xlarge (NVIDIA® Volta V100 Tensor Core GPU): 13.126 seconds → 13.125 seconds
● Case 2: Google Colaboratory (Tesla P100-PCIE GPU): 8.370 seconds → 8.298 seconds
*Note: the GPU type differs here because this measurement was redone later as an addendum; it is still a Tesla GPU.
It became slightly faster.
The above was run with the DataLoader's num_workers = 0; running it with num_workers = 2 gives:
● Case 1: p3.2xlarge: 6.843 seconds → 6.776 seconds
● Case 2: Google Colaboratory: 8.059 seconds → 7.935 seconds
This is also slightly faster.
Since the scale is small with MNIST, it is difficult to see the benefits, but it may be significantly effective for complicated processing and large data with a heavy CPU load.
So far, I have introduced the tips for speeding up learning and inference with PyTorch.
For some techniques, on Google Colaboratory I got the impression that something behind the scenes was either preventing them from taking effect or already applying them automatically.
On the other hand, I think that this article can be used in many ways when you normally set up a GPU instance in the cloud and do deep learning with PyTorch.
I hope you will take advantage of it ♪
** [Author] ** Dentsu International Information Services (ISID) AI Transformation Center Development Gr Yutaro Ogawa (main book: "Learn while making! Deep learning by PyTorch", etc.; [self-introduction details](https://github.com/YutaroOgawa/about_me))
【Twitter】 Focusing on IT / AI-related and business / management, I send out articles that I find interesting and impressions of new books that I recently read. If you want to collect information on these fields, please follow us ♪ (There is a lot of overseas information)
** [Other] ** The "AI Transformation Center Development Team" that I lead is looking for members. If you are interested, please visit this page to apply.
** [Sokumen-kun] ** If you want to apply suddenly, we will have a casual interview with "Sokumen-kun". Please use this as well ♪ https://sokumenkun.com/2020/08/17/yutaro-ogawa/
[Disclaimer] The content of this article itself is the opinion / transmission of the author, not the official opinion of the company to which the author belongs.