This is the 8th installment of the PyTorch official tutorial series, following on from last time. This time we will work through the TorchVision Object Detection Finetuning Tutorial.
TorchVision Object Detection Finetuning Tutorial
In this tutorial, we fine-tune a pre-trained Mask R-CNN model to see how fine tuning and transfer learning work. The training data is the Penn-Fudan Database for Pedestrian Detection and Segmentation, which contains 170 images with 345 pedestrian instances.
First, you need to install the pycocotools library. It is used to compute the COCO-style evaluation metrics, which are based on "Intersection over Union" (IoU), one of the standard ways to measure how well a predicted region overlaps the ground truth in object detection. (A short IoU sketch follows the install commands below.)
%%shell
pip install cython
# Install pycocotools. The version installed by default on Colab has a bug that was fixed in https://github.com/cocodataset/cocoapi/pull/354
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
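As a quick aside (not part of the tutorial), here is a small sketch showing how IoU could be computed for two axis-aligned boxes given as [xmin, ymin, xmax, ymax]:
def box_iou(box_a, box_b):
    # Intersection rectangle
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    # Union = sum of the two areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(box_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 = about 0.143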
Next, define the dataset. The dataset needs to return certain fields so that it can be used with the torchvision reference scripts for object detection, instance segmentation, and person keypoint detection.
Concretely, each item should consist of a PIL image and a target dictionary with the following fields (the same ones built in the dataset class below): boxes (the [N, 4] bounding boxes in [xmin, ymin, xmax, ymax] format), labels (the class label of each box), masks (the [N, H, W] instance masks), image_id, area (the box area, used by the COCO evaluation), and iscrowd (instances flagged as crowd are ignored during evaluation).
(Roughly speaking, boxes defines a rectangle containing each object, and masks defines, pixel by pixel, whether a pixel belongs to an object.) If your dataset returns the above, the model works for both training and evaluation, and the evaluation script from pycocotools can be used.
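For reference (not part of the tutorial), here is a minimal sketch of the target dictionary for an image containing a single object; all values are made up purely for illustration:
import torch

target = {
    "boxes": torch.tensor([[50., 60., 120., 200.]]),         # [N, 4] boxes as [xmin, ymin, xmax, ymax]
    "labels": torch.tensor([1], dtype=torch.int64),           # [N] class label of each instance
    "masks": torch.zeros((1, 300, 400), dtype=torch.uint8),   # [N, H, W] binary mask of each instance
    "image_id": torch.tensor([0]),                            # index of the image in the dataset
    "area": torch.tensor([(120. - 50.) * (200. - 60.)]),      # box area, used by the COCO metric
    "iscrowd": torch.zeros((1,), dtype=torch.int64),          # crowd instances are ignored in evaluation
}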
Let's now write a dataset class for the Penn-Fudan data. First, download and extract https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip.
%%shell
# download the Penn-Fudan dataset
wget https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip .
# extract it in the current folder
unzip PennFudanPed.zip
The data has the following structure.
PennFudanPed/
PedMasks/
FudanPed00001_mask.png
FudanPed00002_mask.png
FudanPed00003_mask.png
FudanPed00004_mask.png
...
PNGImages/
FudanPed00001.png
FudanPed00002.png
FudanPed00003.png
FudanPed00004.png
Let's display the first image.
from PIL import Image
Image.open('PennFudanPed/PNGImages/FudanPed00001.png')
(As described in the readme.txt included in the archive, each mask image uses 0 for the background and a label of 1 or more for each pedestrian.)
mask = Image.open('PennFudanPed/PedMasks/FudanPed00001_mask.png')
# Each mask instance has a different value, from 0 to N,
# where N is the number of instances (pedestrians).
# To make this easier to visualize, add a color palette to the mask.
mask.putpalette([
0, 0, 0, # black background
255, 0, 0, # index 1 is red
255, 255, 0, # index 2 is yellow
255, 153, 0, # index 3 is orange
])
mask
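Although not in the tutorial, you can check which instance IDs are encoded in the mask directly with NumPy; note that putpalette only changes how the indices are displayed, not the pixel values themselves:
import numpy as np

print(np.unique(np.array(mask)))  # e.g. [0 1 2]: background plus one ID per pedestrian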
Each image has a corresponding mask, and each label in the mask identifies an individual pedestrian. Let's create a torch.utils.data.Dataset class for this dataset.
import os
import numpy as np
import torch
import torch.utils.data
from PIL import Image
class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        # Load and sort all image and mask files so they stay aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # Load the image and its mask
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # Note that we do not convert the mask to RGB,
        # because each color corresponds to a different instance and 0 is the background
        mask = Image.open(mask_path)
        mask = np.array(mask)
        # Instances are encoded as different colors
        obj_ids = np.unique(mask)
        # The first ID is the background, so remove it
        obj_ids = obj_ids[1:]
        # Split the color-encoded mask into a set of binary masks
        masks = mask == obj_ids[:, None, None]
        # Get the bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # There is only one class (person)
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # Assume all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target

    def __len__(self):
        return len(self.imgs)
That's it for the dataset. Let's see how the output of this dataset is organized.
dataset = PennFudanDataset('PennFudanPed/')
dataset[0]
out
(<PIL.Image.Image image mode=RGB size=559x536 at 0x7FC7AC4B62E8>,
{'area': tensor([35358., 36225.]), 'boxes': tensor([[159., 181., 301., 430.],
[419., 170., 534., 485.]]), 'image_id': tensor([0]), 'iscrowd': tensor([0, 0]), 'labels': tensor([1, 1]), 'masks': tensor([[[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]],
[[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]]], dtype=torch.uint8)})
You can see that the dataset returns a PIL.Image and a dictionary containing fields such as boxes, labels, and masks.
Although not in the tutorial, the following code visualizes the boxes and masks: boxes are rectangles containing the instances (people), and masks mark the instances themselves, pixel by pixel.
import matplotlib.pyplot as plt
import matplotlib.patches as patches
fig, ax = plt.subplots()
target = dataset[0][1]
# Mask of the first instance
masks_0 = target['masks'][0, :, :]
# Box of the first instance
boxes_0 = target['boxes'][0]
# Show the mask
ax.imshow(masks_0)
# Draw the box on top of it
ax.add_patch(
    patches.Rectangle(
        (boxes_0[0], boxes_0[1]), boxes_0[2] - boxes_0[0], boxes_0[3] - boxes_0[1],
        edgecolor='blue',
        facecolor='red',
        fill=True,
        alpha=0.5
    ))
plt.show()
Defining your model
This tutorial uses Mask R-CNN, which is based on Faster R-CNN. Faster R-CNN is an object detection model that predicts both bounding boxes and class scores for potential objects in an image (a rectangle containing the object and what the object is). (The image below shows an example of Faster R-CNN output.)
Mask R-CNN extends Faster R-CNN: in addition to detecting objects with rectangles (boxes), it also segments them at the pixel level (masks). (The image below shows an example of Mask R-CNN output.)
There are two common reasons for customizing a model with torchvision. The first is when you want to start from a pre-trained model and only finetune its last layers. The other is when you want to replace the backbone of the model with a different one (for example, for faster predictions). Let's look at concrete examples.
Here is how to fine-tune a pre-trained model for the classes you want to detect.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
# Load a model pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
# Replace the classifier with a new one that has a user-defined num_classes
num_classes = 2  # 1 class (person) + background
#Gets the number of input features of the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
#Replace the pre-trained HEAD with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
The other case is when you want to replace the backbone of the model with a different one. For example, the current default backbone (ResNet-50) may be too large for some situations, and you may want to use a smaller model. The following shows how to change the backbone with torchvision.
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
# Load a pre-trained classification model and keep only the features
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
# FasterRCNN needs to know the number of output channels in the backbone.
# For mobilenet_v2 it is 1280, so we need to set it here.
backbone.out_channels = 1280
# Let the RPN generate 5 x 3 anchors per spatial position,
# with 5 different sizes and 3 different aspect ratios.
# We have a Tuple[Tuple[int]] because each feature map could
# potentially have different sizes and aspect ratios.
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
# Define the feature maps used to perform the region of interest cropping,
# as well as the size of the crop after rescaling.
# If the backbone returns a Tensor, featmap_names is expected to be [0].
# More generally, the backbone should return an OrderedDict[Tensor],
# and in featmap_names you can choose which feature maps to use.
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                                output_size=7,
                                                sampling_ratio=2)
# Put the pieces together inside a FasterRCNN model
model = FasterRCNN(backbone,
                   num_classes=2,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)
In our case, the dataset is very small, so we will fine-tune a pre-trained model, i.e. follow approach 1 above. We will also use Mask R-CNN so that segmentation masks are computed for each instance (to determine the area of each person at the pixel level).
maskrcnn_resnet50_fpn is described in the official documentation (https://pytorch.org/docs/stable/torchvision/models.html#torchvision.models.detection.maskrcnn_resnet50_fpn); it is a Mask R-CNN model with a ResNet-50-FPN backbone, pre-trained on the COCO train2017 dataset.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
def get_instance_segmentation_model(num_classes):
    # Load an instance segmentation model pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    # Get the number of input features of the box classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # Replace the pre-trained box predictor head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Get the number of input features of the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # Replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                       hidden_layer,
                                                       num_classes)
    return model
You are now ready to train and evaluate your model on this dataset.
(If you compare this model with the original torchvision.models.detection.maskrcnn_resnet50_fpn, you can see that the dimensions of the following parts have changed.)
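Although not shown in the tutorial, a printout like the one below can be produced simply by printing the modified model's roi_heads module (a quick sketch; the actual output is much longer and is abbreviated below):
model = get_instance_segmentation_model(num_classes=2)
# Print only the RoI heads so the replaced box and mask predictors are visible
print(model.roi_heads)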
(roi_heads): RoIHeads(
  ...
  (box_predictor): FastRCNNPredictor(
    (cls_score): Linear(in_features=1024, out_features=2, bias=True)
    (bbox_pred): Linear(in_features=1024, out_features=8, bias=True)
  )
  ...
  (mask_predictor): MaskRCNNPredictor(
    ...
    (mask_fcn_logits): Conv2d(256, 2, kernel_size=(1, 1), stride=(1, 1))
  )
)
The torchvision repository's references/detection/ directory contains a number of helper functions that simplify training and evaluating object detection models. Here we use references/detection/engine.py, references/detection/utils.py, and references/detection/transforms.py.
Copy these files (and the files they depend on) into the working directory.
%%shell
# Download TorchVision repo to use some files from
# references/detection
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.3.0
cp references/detection/utils.py ../
cp references/detection/transforms.py ../
cp references/detection/coco_eval.py ../
cp references/detection/engine.py ../
cp references/detection/coco_utils.py ../
Let's use the copied references/detection helpers to write some functions for data augmentation and transformation.
from engine import train_one_epoch, evaluate
import utils
import transforms as T
def get_transform(train):
    transforms = []
    # Convert the PIL image to a Tensor
    transforms.append(T.ToTensor())
    if train:
        # During training, randomly flip the image and its ground truth horizontally (mirror image)
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)
The code above prepares the data: the image is converted to a Tensor and, for training data, randomly flipped horizontally. No normalization or image rescaling is needed here; the Mask R-CNN model handles both internally.
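Although not in the tutorial, you can confirm that normalization and rescaling are built into the model by printing its transform submodule (a small sketch; it assumes the torchvision version used here exposes the internal GeneralizedRCNNTransform as model.transform):
model = get_instance_segmentation_model(num_classes=2)
# Shows the internal transform with its image_mean, image_std, min_size and max_size
print(model.transform)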
The dataset, model, and data preparation are now ready. Let's instantiate them.
# Use our dataset with the transformations defined above
dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))
# Split the dataset into training and test sets
torch.manual_seed(1)
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])
# Define training and test data loaders
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)
data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)
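The data loaders pass collate_fn=utils.collate_fn because images and targets in detection have different sizes per sample and cannot be stacked into a single tensor. The actual function lives in the copied utils.py; conceptually it does something like this sketch:
def detection_collate(batch):
    # Turn a list of (image, target) pairs into a pair of tuples,
    # (images, targets), without stacking them into one tensor
    return tuple(zip(*batch))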
Instantiate the model.
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# Our dataset has only two classes: background and person
num_classes = 2
# Get the model using the helper function defined above
model = get_instance_segmentation_model(num_classes)
# Move the model to the appropriate device
model.to(device)
# Construct the optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)
# Learning rate scheduler that reduces the learning rate to 1/10 every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=3,
                                               gamma=0.1)
Train for 10 epochs, evaluating with the evaluate function after each epoch. (Training takes about 8 minutes in the GPU environment of Colaboratory; without a GPU, a runtime error occurs.)
# Train for 10 epochs
num_epochs = 10
for epoch in range(num_epochs):
    print(epoch)
    # Train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # Update the learning rate
    lr_scheduler.step()
    # Evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)
out
...
Averaged stats: model_time: 0.1179 (0.1174) evaluator_time: 0.0033 (0.0051)
Accumulating evaluation results...
DONE (t=0.01s).
Accumulating evaluation results...
DONE (t=0.01s).
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.831
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.990
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.955
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.543
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.841
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.386
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.881
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.881
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.787
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.887
IoU metric: segm
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.760
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.990
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.921
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.492
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.771
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.345
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.808
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.808
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.725
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.814
Now that training has finished, let's see what the model actually predicts on the test dataset.
# Pick one image from the test set
img, _ = dataset_test[4]
# Put the model in evaluation mode
model.eval()
with torch.no_grad():
    prediction = model([img.to(device)])
Printing the prediction shows a list of dictionaries, one per input image. Since we passed a single test image, the list below has one element. The dictionary contains the predictions for that image; in this case you can see that it contains boxes, labels, masks, and scores.
prediction
out
[{'boxes': tensor([[173.1167, 27.6446, 240.8375, 313.0114],
[325.5737, 64.3967, 453.1539, 352.3020],
[222.4494, 24.5255, 306.5306, 291.5595],
[296.8205, 21.3736, 379.0592, 263.7513],
[137.4137, 38.1588, 216.4886, 276.1431],
[167.8121, 19.9211, 332.5648, 314.0146]], device='cuda:0'),
'labels': tensor([1, 1, 1, 1, 1, 1], device='cuda:0'),
'masks': tensor([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]],
[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]],
[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]],
[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]],
[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]],
[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]]], device='cuda:0'),
'scores': tensor([0.9965, 0.9964, 0.9942, 0.9696, 0.3053, 0.1552], device='cuda:0')}]
Let's check the image and the prediction result. The image (img) is a Tensor of shape [channels, height, width] with values in the range 0-1, so scale it to 0-255 and rearrange it to [height, width, channels].
Image.fromarray(img.mul(255).permute(1, 2, 0).byte().numpy())
Next, visualize the predicted masks. The masks are predicted with shape [N, 1, H, W], where N is the number of predicted instances (people). Each value of a mask stores, per pixel, the probability (0-1) that the pixel belongs to a person.
Image.fromarray(prediction[0]['masks'][0, 0].mul(255).byte().cpu().numpy())
(The other predicted instances (people) can also be visualized by changing the first index of masks, as shown below.)
Image.fromarray(prediction[0]['masks'][1, 0].mul(255).byte().cpu().numpy())
Image.fromarray(prediction[0]['masks'][2, 0].mul(255).byte().cpu().numpy())
Image.fromarray(prediction[0]['masks'][3, 0].mul(255).byte().cpu().numpy())
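Although not in the tutorial, the probability masks are often thresholded (0.5 here, an arbitrary but common choice) to obtain binary masks, which can then be merged into a single overlay image:
import numpy as np
from PIL import Image

# Threshold each predicted mask and merge them into one binary image
binary_masks = prediction[0]['masks'][:, 0].cpu().numpy() > 0.5
combined = (binary_masks.any(axis=0) * 255).astype(np.uint8)
Image.fromarray(combined)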
The model predicts the pedestrians quite well.
In this tutorial, you learned how to train an object detection model using a dataset you defined yourself.
For the dataset, we wrote a torch.utils.data.Dataset class that returns the boxes and masks needed for object detection.
We then used a Mask R-CNN model pre-trained on COCO train2017 to perform transfer learning on this new dataset.
For a more complete example that includes multi-machine / multi-GPU training, see references/detection/train.py in the torchvision GitHub repo.
In this tutorial, we practiced "transfer learning" and "fine tuning" using a pre-trained model. (Strictly, what we did here is called fine tuning; the difference between transfer learning and fine tuning will be explained next time.) The tutorial used 120 training images and 50 validation images, but even with about 40 training images the model was able to predict fairly accurately. It is impressive that transfer learning can work with such a small amount of training data. Next time, I would like to proceed with the "Transfer Learning for Computer Vision Tutorial".
2020/11/15 First edition released