In this article, I will introduce a collection of tips from NVIDIA for accelerating deep learning with PyTorch.
It is based on the presentation 「PyTorch Performance Tuning Guide - Szymon Migacz, NVIDIA」, shared by Arun Mallya of NVIDIA, to which I add explanations and example programs.
The gist of this article is exactly what Andrej Karpathy tweeted:
good quick tutorial on optimizing your PyTorch code ⏲️: https://t.co/7CIDWfrI0J
— Andrej Karpathy (@karpathy) August 30, 2020
quick summary: pic.twitter.com/6J1SJcWJsl
I will explain this in an easy-to-understand manner.
※ Andrej Karpathy earned his PhD under Dr. Fei-Fei Li, who created ImageNet, and currently heads Tesla's AI division. The often-quoted 5% human error rate on ImageNet comes from Karpathy himself, who took on the benchmark as the human representative to obtain that figure.
The effect of each tip in this article depends on the GPU environment you are using, so try them and see which ones work for your environment.
All the programs in this article are publicly available at https://github.com/YutaroOgawa/Qiita/tree/master/pytorch_performance in Jupyter Notebook format.
Throughout this article, each [Performance change] section reports the time taken for one epoch of MNIST training.
PyTorch's DataLoader has two default settings that are not ideal. https://pytorch.org/docs/stable/data.html
1.1 num_workers
First, the argument num_workers defaults to 0, so mini-batches are fetched in a single process.
Setting num_workers = 2 or more enables multi-process data loading, which speeds up processing.
You can check the number of CPU cores below.
#Check the number of CPU cores
import os
os.cpu_count() #Number of cores
In practice, 2 workers are enough for a single GPU.
Create a DataLoader as follows.
#For Data Loader with default settings
train_loader_default = torch.utils.data.DataLoader(dataset1,batch_size=mini_batch_size)
test_loader_default = torch.utils.data.DataLoader(dataset2,batch_size=mini_batch_size)
#Data loader: 2
train_loader_nworker = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, num_workers=2)
test_loader_nworker = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, num_workers=2)
#Data loader: full
train_loader_nworker = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, num_workers=os.cpu_count())
test_loader_nworker = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, num_workers=os.cpu_count())
[Performance change] Check the performance change for one epoch of MNIST training.
You can check which GPU your environment is using with:
#Check GPU
!nvidia-smi
This time, the environments are: ● Case 1: p3.2xlarge (NVIDIA® Volta V100 Tensor Core GPU) ● Case 2: Google Colaboratory (NVIDIA Tesla T4 Tensor Core GPU, Turing architecture)
Note that Google Colaboratory assigns a different GPU type each session.
[For default] ● Case 1: p3.2xlarge: 14.73 seconds ● Case 2: Google Colaboratory: 10.01 seconds
[In the case of num_workers = os.cpu_count ()] ● Case 1: p3.2xlarge: 3.47 seconds ● Case 2: Google Colaboratory: 9.43 seconds
Both cases are faster, and Case 1 dramatically so, dropping to roughly 1/3 of the default time.
Note that p3.2xlarge has 8 CPU cores while Google Colaboratory has 2.
That said, even 2 cores with num_workers = 2 appears to be sufficient.
The original presentation also notes that there is little difference between 2 workers and more.
1.2 pin_memory
PyTorch's DataLoader defaults to pin_memory = False.
Setting pin_memory = True enables **automatic memory pinning**: the CPU memory holding each batch is page-locked (it will not be paged out), which is expected to speed up transfers to the GPU.
(reference) https://pytorch.org/docs/stable/data.html#memory-pinning https://zukaaax.com/archives/301 https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
(Explanation of memory paging) https://wa3.i-3-i.info/word13352.html
The implementation is as follows.
#For Data Loader with default settings
train_loader_default = torch.utils.data.DataLoader(dataset1,batch_size=mini_batch_size)
test_loader_default = torch.utils.data.DataLoader(dataset2,batch_size=mini_batch_size)
#Data loader pin memory
train_loader_pin_memory = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, pin_memory=True)
test_loader_pin_memory = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, pin_memory=True)
As before, check the performance change in MNIST training 1 epoch.
[For default] ● Case 1: p3.2xlarge: 14.73 seconds ● Case 2: Google Colaboratory: 10.01 seconds
[When pin_memory = True] ● Case 1: p3.2xlarge: 13.65 seconds ● Case 2: Google Colaboratory: 9.82 seconds
[In the case of num_workers = os.cpu_count ()] ● Case 1: p3.2xlarge: 3.47 seconds ● Case 2: Google Colaboratory: 9.43 seconds
[When num_workers = os.cpu_count () & pin_memory = True] ● Case 1: p3.2xlarge: 3.50 seconds ● Case 2: Google Colaboratory: 9.35 seconds
Compared to the default settings, you can see that it is faster.
Once num_workers is set, the additional effect of pin_memory is not visible, probably because MNIST is too small a workload.
[1] When creating a DataLoader in PyTorch, set the num_workers and pin_memory arguments as follows.
#Default configuration
train_loader_default = torch.utils.data.DataLoader(dataset1,batch_size=mini_batch_size)
test_loader_default = torch.utils.data.DataLoader(dataset2,batch_size=mini_batch_size)
#Data loader recommended
train_loader_pin_memory = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, num_workers=os.cpu_count(), pin_memory=True)
test_loader_pin_memory = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, num_workers=os.cpu_count(), pin_memory=True)
#Or data loader num_workers=2
train_loader_pin_memory = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, num_workers=2, pin_memory=True)
test_loader_pin_memory = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, num_workers=2, pin_memory=True)
Make sure to set torch.backends.cudnn.benchmark = True when running training.
When the network shape is fixed, this lets cuDNN benchmark the available algorithms and pick the fastest ones, speeding up computation on the GPU.
Set it to True when the input size does not change at the start or partway through training, as in a typical CNN.
Note, however, that exact reproducibility of the computation is lost.
(About PyTorch calculation reproducibility) https://pytorch.org/docs/stable/notes/randomness.html
The implementation is as follows, for example.
def MNIST_train_cudnn_benchmark_True(optimizer, model, device, train_loader, test_loader):
#Training by default
epochs = 1
#add to
torch.backends.cudnn.benchmark = True
#processing
for epoch in range(1, epochs+1):
train(model, device, train_loader, optimizer, epoch)
test(model, device, test_loader)
Here, the function train () has the following form.
def train(model, device, train_loader, optimizer, epoch):
model.train() #In training mode
for batch_idx, (data, target) in enumerate(train_loader):
#Data retrieval
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
#propagation
output = model(data)
#Loss calculation and backpropagation
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()
Compare speeds. Leave the DataLoader at the default setting.
[For default] ● Case 1: p3.2xlarge: 14.73 seconds ● Case 2: Google Colaboratory: 10.01 seconds
[When torch.backends.cudnn.benchmark = True] ● Case 1: p3.2xlarge: 14.47 seconds ● Case 2: Google Colaboratory: 9.66 seconds
Since we are only solving MNIST here and the network is small, the effect is weak, but it is still a little faster.
**Set torch.backends.cudnn.benchmark = True when running your program.**
The larger the mini-batch size, the more stable the learning. Therefore, increase the mini-batch size.
Thanks to PyTorch's **AMP (Automatic Mixed Precision)** feature, a mini-batch size larger than you might expect can actually fit in memory.
AMP mixes numerical precisions: computations that would normally be done in FP32 (32-bit floating point) are done in FP16 (16-bit floating point) where it is safe, which reduces memory usage and improves computation speed without compromising accuracy.
On GPUs with Tensor Cores, the speedup is even larger (NVIDIA cites up to 12x for training and up to 6x for inference).
(reference) https://www.nvidia.com/ja-jp/data-center/tensor-cores/
The Volta V100 carries first-generation Tensor Cores, while the T series carries second-generation Turing Tensor Cores, which are said to be about twice as fast as the first generation.
See the following references for how to use AMP.
(reference) https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/ https://pytorch.org/docs/stable/notes/amp_examples.html
Implement according to the examples above.
Rewrite the earlier train() function.
def train_PyTorchAMP(model, device, train_loader, optimizer, epoch):
model.train() #In training mode
scaler = torch.cuda.amp.GradScaler()
for batch_idx, (data, target) in enumerate(train_loader):
#Data retrieval
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
#propagation
# Runs the forward pass with autocasting.
with torch.cuda.amp.autocast():
output = model(data)
loss = F.nll_loss(output, target)
# Scales loss. Calls backward() on scaled loss to create scaled gradients.
scaler.scale(loss).backward()
# scaler.step() first unscales the gradients of the optimizer's assigned params.
scaler.step(optimizer)
# Updates the scale for next iteration.
scaler.update()
Create a scaler with scaler = torch.cuda.amp.GradScaler(), run the forward pass and loss computation under autocast, and route backpropagation and the optimizer step through the scaler.
Set DataLoader as the default setting and use AMP to compare speeds.
[For default] ● Case 1: p3.2xlarge: 14.73 seconds ● Case 2: Google Colaboratory: 10.01 seconds
[For AMP] ● Case 1: p3.2xlarge: 14.21 seconds ● Case 2: Google Colaboratory: 11.97 seconds
With this MNIST example, the amount of computation per step is small, so I did not feel much of an effect.
By using AMP it becomes possible to increase the mini-batch size beyond what you would expect, but be aware of the following points when you do so (a minimal scheduler sketch for points [3] and [4] follows the list):
[1] Adjust the learning rate
[2] Adjust weight decay (the strength of the optimizer's regularization penalty)
[3] Incorporate warmup: early in training, increase the learning rate linearly from 0 up to the target value
[4] Incorporate learning rate decay: gradually reduce the learning rate toward the end of training
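Here is a minimal sketch of points [3] and [4] using torch.optim.lr_scheduler.LambdaLR. The 5-epoch warmup, the cosine decay, and the placeholder model and SGD settings are illustrative assumptions of mine, not values from the original presentation.

```python
import math
import torch

model = torch.nn.Linear(10, 10)  # placeholder; substitute your own network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

warmup_epochs, total_epochs = 5, 50  # illustrative values

def lr_lambda(epoch):
    if epoch < warmup_epochs:
        # [3] linear warmup: scale the base lr from near 0 up to 1.0
        return (epoch + 1) / warmup_epochs
    # [4] decay: cosine decay from the base lr down toward 0
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(total_epochs):
    # the training loop for one epoch would run here
    scheduler.step()
```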
Also, for large mini-batches, consider an optimizer such as LARS, LAMB, or NVIDIA's LAMB implementation, NVLAMB.
The problem with large mini-batches is the following: **for the same amount of training, the number of parameter updates is smaller than with a small mini-batch size, and if you simply raise the learning rate to compensate, the learning rate becomes too high and training is hard to stabilize.**
LARS (Layer-wise Adaptive Rate Scaling) addresses this by multiplying the learning rate by a per-layer coefficient called the "trust ratio", computed from the gradient.
LAMB (Layer-wise Adaptive Moments optimizer for Batch training) extends LARS by additionally taking the rate of change of each weight parameter into account.
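To make the idea of the trust ratio concrete, here is a rough sketch of how it can be computed for one layer. The exact formula (including the weight-decay term in the denominator) varies between implementations, so treat this as my own illustration rather than the definitive LARS formula.

```python
import torch

def lars_trust_ratio(weight: torch.Tensor, grad: torch.Tensor,
                     weight_decay: float = 1e-4, eps: float = 1e-9) -> torch.Tensor:
    # Ratio of this layer's weight norm to its (weight-decayed) gradient norm.
    w_norm = weight.norm()
    g_norm = (grad + weight_decay * weight).norm()
    return w_norm / (g_norm + eps)

# The effective learning rate for the layer is then base_lr * trust_ratio,
# so layers whose gradients are large relative to their weights take smaller steps.
```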
By using LAMB, BERT training that normally takes 81 hours can be completed in 76 minutes, roughly a 64-fold speedup.
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes https://arxiv.org/abs/1904.00962
(From NVIDIA's A Guide to Optimizer Implementation for BERT at Scale)
(reference) https://medium.com/nvidia-ai/a-guide-to-optimizer-implementation-for-bert-at-scale-8338cc7f45fd https://developer.nvidia.com/blog/pretraining-bert-with-layer-wise-adaptive-learning-rates/ https://postd.cc/optimizing-gradient-descent/ https://towardsdatascience.com/an-intuitive-understanding-of-the-lamb-optimizer-46f8c0ae4866
First, install apex by referring to the following NVIDIA APEX (A PyTorch Extension) page.
https://github.com/NVIDIA/apex https://nvidia.github.io/apex/
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
The implementation is as follows. First, rewrite train ().
from apex import amp
def trainAMP(model, device, train_loader, optimizer, epoch):
model.train() #In training mode
for batch_idx, (data, target) in enumerate(train_loader):
#Data retrieval
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
#propagation
output = model(data)
#Loss calculation and backpropagation
loss = F.nll_loss(output, target)
# AMP Train your model
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
optimizer.step()
Then write a training function using trainAMP ().
def MNIST_trainAMP(optimizer, model, device, train_loader, test_loader):
epochs = 1
start = time.time()
torch.backends.cudnn.benchmark = True
#processing
for epoch in range(1, epochs+1):
trainAMP(model, device, train_loader, optimizer, epoch)
test(model, device, test_loader)
#Time taken
print("=======Time taken========")
print(time.time() - start)
Set the optimizer to apex.optimizers.FusedLAMB. NVIDIA's LAMB implementation is called NVLAMB.
import apex
#Set model, learning rate and optimizer
model = Net().to(device)
lr_rate = 0.1
optimizer = apex.optimizers.FusedLAMB(model.parameters(), lr=lr_rate)
# Initialization
opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
Initialize the model and optimizer with AMP.
Finally, we will carry out training.
MNIST_trainAMP(optimizer, model, device,
train_loader_pin_memory, test_loader_pin_memory)
That is how to use NVIDIA's LAMB optimizer for large mini-batches.
When training with multiple GPUs, use DistributedDataParallel (torch.nn.parallel.DistributedDataParallel), not DataParallel (torch.nn.DataParallel).
As the lecture slide explains, this is because DataParallel uses only one CPU core, whereas DistributedDataParallel allocates one CPU core per GPU.
NVIDIA APEX's apex.parallel.DistributedDataParallel can be used in the same way as torch.nn.parallel.DistributedDataParallel, but it has an extra advantage: it supports **Synchronized Batch Normalization**.
With multiple GPUs, PyTorch's standard batch-normalization layer normalizes only within the chunk of the mini-batch assigned to each GPU: each GPU computes its own mean and standard deviation, and those statistics are then averaged to obtain the learned batch-normalization mean and standard deviation.
Because normalization is performed per GPU, this is called **Asynchronized Batch Normalization**, and its result differs from batch normalization computed over all the data distributed across the GPUs.
In PyTorch you can work around this with torch.nn.SyncBatchNorm, but it is rather troublesome to implement.
With NVIDIA APEX's apex.parallel.DistributedDataParallel, you simply convert the model with
sync_bn_model = apex.parallel.convert_syncbn_model(model)
to get **Synchronized Batch Normalization**.
(reference) https://nvidia.github.io/apex/parallel.html https://github.com/NVIDIA/apex/tree/master/apex/parallel https://github.com/NVIDIA/apex/tree/master/examples/simple/distributed
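For reference, here is a minimal sketch of the PyTorch-native route (torch.nn.parallel.DistributedDataParallel combined with torch.nn.SyncBatchNorm). The single-node torchrun launch, the process-group setup, and the reuse of Net(), dataset1, and mini_batch_size from earlier are my own assumptions, not code from the original presentation.

```python
# Launch with e.g.: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = Net().to(local_rank)                     # the MNIST model used above
# Replace every BatchNorm layer with its synchronized version
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])

# Each process must see a different shard of the data
sampler = torch.utils.data.distributed.DistributedSampler(dataset1)
train_loader = torch.utils.data.DataLoader(
    dataset1, batch_size=mini_batch_size, sampler=sampler,
    num_workers=2, pin_memory=True)
```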
Attach the @torch.jit.script decorator to functions that perform pointwise operations on tensors to compile them with PyTorch JIT (into a fused, C++-executable form) and speed them up.
JIT (Just-In-Time Compiler) compiles code at execution time to improve execution speed.
TensorFlow 1.x and Keras are define-and-run: the computation graph is compiled first and then run (which made the code awkward to write).
PyTorch is define-by-run, building the computation as data flows through it. For functions whose computation is fixed, however, it pays to compile them ahead of time, so we use JIT to turn them into a C++ executable form (still called from Python).
For example, when you want to define the activation function gelu, the normal definition and the definition in JIT are as follows.
def gelu(x):
return x * 0.5 * (1.0 + torch.erf(x / 1.41421))
@torch.jit.script
def fused_gelu(x):
return x * 0.5 * (1.0 + torch.erf(x / 1.41421))
To compile it with PyTorch JIT, attach the @torch.jit.script decorator to the function.
Comparing the execution speed of the two:
import time
x = torch.randn(2000, 3000)
start = time.time()
for i in range(200):
gelu(x)
#Time taken
print("=======Time taken========")
print(time.time() - start)
and
import time
x = torch.randn(2000, 3000)
start = time.time()
for i in range(200):
fused_gelu(x)
#Time taken
print("=======Time taken========")
print(time.time() - start)
the results are:
● Case 1: p3.2xlarge (NVIDIA® Volta V100 Tensor Core GPU): 9.8 seconds → 6.6 seconds
● Case 2: Google Colaboratory (Tesla T4 Tensor Core GPU, Turing): 13.94 seconds → 13.91 seconds
There is almost no change on Google Colaboratory, but on AWS p3.2xlarge the time is shortened to roughly two-thirds.
When part of the model does not need backpropagation, as in GAN training, do not use
model.zero_grad()
but instead clear the gradients by setting them to None:
for param in model.parameters():
    param.grad = None
This is because model.zero_grad() actually writes zeros into the gradient buffers and keeps them allocated, consuming memory, whereas setting them to None releases them.
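As a concrete illustration, here is a hedged sketch of how this looks inside the train() loop from earlier; the zero_grad_to_none helper is my own naming, and the optimizer.zero_grad(set_to_none=True) shortcut mentioned in the comment requires PyTorch 1.7 or later.

```python
def zero_grad_to_none(model):
    # Drop the gradient buffers instead of writing zeros into them
    for param in model.parameters():
        param.grad = None

for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    zero_grad_to_none(model)      # instead of model.zero_grad() / optimizer.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()

# In PyTorch 1.7+ the same effect is available as: optimizer.zero_grad(set_to_none=True)
```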
Batch normalization standardizes its input to zero mean, so any bias term in the preceding layer is immediately cancelled out, and the batch-normalization layer just learns a constant shift to counteract it.
Since this is a waste of computation time and parameters, set bias=False on the layer immediately before batch normalization so that no bias term is used; a minimal sketch follows.
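Here is a minimal sketch of such a block (an illustrative Conv-BN-ReLU stack of my own, not a model from this article):

```python
import torch.nn as nn

# The convolution's bias would be cancelled by the following BatchNorm,
# whose learnable shift (beta) plays that role instead, so bias is disabled.
conv_bn_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```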
As noted earlier, PyTorch's DataLoader defaults to pin_memory = False, but you can enable **automatic memory pinning** by setting pin_memory = True.
The CPU memory holding each batch is then page-locked, which is expected to speed up transfers to the GPU.
The implementation at this time was as follows.
#Data loader recommended
train_loader_pin_memory = torch.utils.data.DataLoader(
dataset1, batch_size=mini_batch_size, num_workers=os.cpu_count(), pin_memory=True)
test_loader_pin_memory = torch.utils.data.DataLoader(
dataset2, batch_size=mini_batch_size, num_workers=os.cpu_count(), pin_memory=True)
Now, for even more speed, enable asynchronous GPU copies.
The reason is that, by default, **the CPU cannot do other work while data is being transferred from the CPU's pinned memory to the GPU**.
So use the non_blocking = True setting.
The CPU can then keep working while the transfer from pinned memory to the GPU is in flight, which is expected to give a further speedup.
The implementation is simple, rewriting the part that sends data to cuda.
for batch_idx, (data, target) in enumerate(train_loader):
#Data retrieval
data, target = data.to(device), target.to(device)
into
# non_blocking=True
for batch_idx, (data, target) in enumerate(train_loader):
#Data retrieval
data, target = data.to(device, non_blocking=True), target.to(device, non_blocking=True)
That is, pass the argument non_blocking = True to .to().
reference https://stackoverflow.com/questions/55563376/pytorch-how-does-pin-memory-works-in-dataloader https://pytorch.org/docs/stable/notes/cuda.html#use-pinned-memory-buffers https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234 https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
Compare when non_blocking = True and when it is not (pin_memory = True only).
● Case 1: p3.2xlarge (NVIDIA® Volta V100 Tensor Core GPU): 13.126 seconds → 13.125 seconds
● Case 2: Google Colaboratory (Tesla P100-PCIE GPU): 8.370 seconds → 8.298 seconds
*Note: the GPU type differs here because this measurement was redone later as an addendum; it is still a Tesla GPU.
It became slightly faster.
The above was run with the DataLoader's num_workers = 0; running it with num_workers = 2 gives:
● Case 1: p3.2xlarge: 6.843 seconds → 6.776 seconds
● Case 2: Google Colaboratory: 8.059 seconds → 7.935 seconds
This is also slightly faster.
Since the scale is small with MNIST, it is difficult to see the benefits, but it may be significantly effective for complicated processing and large data with a heavy CPU load.
So far, I have introduced the tips for speeding up learning and inference with PyTorch.
For some techniques, on Google Colaboratory I got the impression that something behind the scenes was either preventing them from taking effect or already applying them automatically.
On the other hand, I think that this article can be used in many ways when you normally set up a GPU instance in the cloud and do deep learning with PyTorch.
I hope you will take advantage of it ♪
** [Author] ** Dentsu International Information Services (ISID) AI Transformation Center Development Gr Yutaro Ogawa (main book: "Learn while making! Deep learning by PyTorch", etc.; [self-introduction details](https://github.com/YutaroOgawa/about_me))
【Twitter】 Focusing on IT / AI-related and business / management, I send out articles that I find interesting and impressions of new books that I recently read. If you want to collect information on these fields, please follow us ♪ (There is a lot of overseas information)
** [Other] ** The "AI Transformation Center Development Team" that I lead is looking for members. If you are interested, please visit this page to apply.
** [Sokumen-kun] ** If you want to apply suddenly, we will have a casual interview with "Sokumen-kun". Please use this as well ♪ https://sokumenkun.com/2020/08/17/yutaro-ogawa/
[Disclaimer] The content of this article itself is the opinion / transmission of the author, not the official opinion of the company to which the author belongs.