This article is the day 15 entry in the Kinki University Advent Calendar 2019. It gives a brief overview of object detection, based on the survey paper "Object Detection in 20 Years: A Survey".

1. What is Object Detection?

Object detection is the task of detecting instances of semantic objects of a given class in images or video. In the past it was hard to achieve high accuracy, because a detector has to handle two things at once: classifying what an object is and working out where it is. In recent years, as with other computer vision (CV) tasks, deep learning has kept pushing accuracy up at a dizzying rate.

1.1 What is the latest object detection method?

You can check the current SOTA (state-of-the-art) methods on the "Object Detection on COCO test-dev" leaderboard.

At the time of writing (November 13, 2019), the SOTA was "Cascade Mask R-CNN (Triple-ResNeXt152, multi-scale)" from the paper "CBNet: A Novel Composite Backbone Network Architecture for Object Detection", with an mAP of 53.3 on COCO.

Competition in the CV field is fierce, so the SOTA is typically overtaken within three to six months. M2Det, which was the SOTA around January 2019, had an mAP of 44.2 on COCO, so the top score rose by 9.1 points in less than a year. To learn about the latest deep learning based object detection methods, you can refer to the paper summary by hoya012 (https://github.com/hoya012/deep_learning_object_detection).

2. History of object detection

The history of object detection divides roughly into two periods: "before the deep learning invasion" (2001 to 2013) and "after the deep learning invasion" (2014 onward). Neural networks have boomed in recent years thanks to better hardware and the use of GPUs, and object detection has evolved along with them.

The figure is taken from Object Detection in 20 Years: A Survey.

Before the invasion, building an object detector meant designing how to extract features from the pixel values of an image. After the invasion, the work shifted to designing and tuning the architecture and mechanisms of neural networks. (Of course, understanding how features are extracted is still important.)

As the joke on social media goes, the field used to need people called "feature extraction entertainers"; in the deep learning world it needs people called "hyperparameter tuning entertainers" [citation needed].

The main technologies of each term are briefly explained below.

2.1 Before the deep learning invasion

Before the deep learning invasion, object detection was done by sliding-window detection: a window marking a candidate region is moved across the image, features are extracted from each window, and a classifier decides whether the window contains the object. Typical methods are described below.
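
As a rough illustration of the sliding-window idea on its own, independent of any particular feature, here is a minimal sketch in plain Python. The function names, the toy 6x6 image, and the "mean brightness above a threshold" classifier are all assumptions made for illustration; a real detector would plug in features such as Haar-like or HOG and a trained classifier instead.

```python
def sliding_windows(width, height, win_w, win_h, stride):
    """Yield (x, y, w, h) boxes that scan the image left to right, top to bottom."""
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            yield x, y, win_w, win_h


def mean_brightness(image, box):
    """Toy stand-in for 'extract features': the average pixel value in the box."""
    x, y, w, h = box
    total = sum(sum(row[x:x + w]) for row in image[y:y + h])
    return total / (w * h)


def detect(image, win_w, win_h, stride, threshold):
    """Return every window whose toy feature exceeds the threshold."""
    height, width = len(image), len(image[0])
    return [box for box in sliding_windows(width, height, win_w, win_h, stride)
            if mean_brightness(image, box) > threshold]


# Toy 6x6 grayscale "image": a bright 2x2 patch on a dark background.
image = [[0] * 6 for _ in range(6)]
for yy in (2, 3):
    for xx in (2, 3):
        image[yy][xx] = 255

hits = detect(image, win_w=2, win_h=2, stride=2, threshold=200)
print(hits)
```

In practice the same scan is repeated at several image scales so that objects of different sizes fit the fixed-size window, which is what the scale parameters of the OpenCV detectors control.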

VJ Det (2001)

The Viola-Jones detector (VJ Det) is a real-time detector for human faces. It extracts Haar-like features, which capture differences in brightness, and runs a cascade of classifiers over a sliding window. A Haar-like feature is simply the difference between the pixel sums of adjacent rectangular regions. The OpenCV tutorial "Face Detection using Haar Cascades" describes the face detection method and includes sample code for cascade detection with OpenCV.

In other words, the detector sums pixel values over rectangular regions and checks whether the result matches a learned pattern.
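
To make the "sum of pixels in a region" operation concrete, here is a minimal sketch of a two-rectangle Haar-like feature computed with an integral image (a summed-area table), which is the trick that lets Viola-Jones evaluate rectangle sums in constant time. The function names and the "left half minus right half" sign convention are assumptions for illustration; this is not OpenCV's internal implementation.

```python
def integral_image(image):
    """Return an (h+1) x (w+1) table where ii[y][x] is the sum of image[0:y][0:x]."""
    h, w = len(image), len(image[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += image[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle (x, y, w, h) using four table lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Edge-type feature: left-half sum minus right-half sum (w should be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Toy 4x4 image: bright left half, dark right half (a vertical edge).
image = [[10, 10, 0, 0] for _ in range(4)]
ii = integral_image(image)
print(two_rect_feature(ii, 0, 0, 4, 4))
```

A large absolute value means the window contains a strong brightness edge at that position, while a flat region gives a value near zero.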

You can download trained detectors (XML files) here: https://github.com/opencv/opencv/tree/master/data/haarcascades

If you do not want to build OpenCV from source, you can install it with pip:

pip install opencv-python opencv-contrib-python

The Viola-Jones detector itself can be used like this.

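import cv2

# Load the input image and the trained Haar cascade (an XML file from the link above).
img = cv2.imread("input.jpg")
detector = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
# The cascade works on grayscale images.
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Each detection is an (x, y, width, height) rectangle.
faces = detector.detectMultiScale(gray_img, scaleFactor=1.1, minNeighbors=3, minSize=(30, 30))
for (x, y, w, h) in faces:
    # Draw a red box (BGR order; 8-bit channel values max out at 255).
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 4)
cv2.imwrite("output.jpg", img)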

All you need to do is pass the XML file to cv2.CascadeClassifier.

Here is an example I actually ran: it detects the president's face and applies a mosaic to it! (I am a little worried this could turn into an international incident.)
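
The mosaic step is not shown in this article, so the following is only a minimal sketch of one way to do it, assuming the image is a NumPy array (as returned by cv2.imread) and the rectangle comes from the detector. The function name mosaic and the block parameter are hypothetical; it simply replaces each small tile in the rectangle with its mean color.

```python
import numpy as np


def mosaic(img, x, y, w, h, block=8):
    """Pixelate the rectangle (x, y, w, h) of an image array in place."""
    region = img[y:y + h, x:x + w]  # a view, so writes below modify img
    rh, rw = region.shape[:2]
    for by in range(0, rh, block):
        for bx in range(0, rw, block):
            tile = region[by:by + block, bx:bx + block]
            # Replace every pixel in the tile with the tile's mean color.
            tile[...] = tile.mean(axis=(0, 1)).astype(img.dtype)
    return img


# Toy 4x4 BGR image with one colored pixel in the top-left corner.
demo = np.zeros((4, 4, 3), dtype=np.uint8)
demo[0, 0] = (40, 80, 120)
mosaic(demo, 0, 0, 2, 2, block=2)
print(demo[:2, :2, 0])  # blue channel of the pixelated 2x2 corner
```

With the face detector above, you would call mosaic(img, x, y, w, h) for each rectangle returned by detectMultiScale before saving the image.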

HOG Det (2005)

This detector extracts HOG (Histogram of Oriented Gradients) features, which describe the distribution of brightness gradient directions, and classifies them with an SVM while sliding a window over the image. For human detection, HOG features, which capture contour information, are said to work better than Haar-like features, which capture differences in brightness. In OpenCV it is implemented as cv2.HOGDescriptor. The code looks like this.

import cv2

img = cv2.imread("input.jpg")
hog = cv2.HOGDescriptor()
# Use the pretrained people detector (a linear SVM) that ships with OpenCV.
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
hog_params = {"winStride": (8, 8), "padding": (32, 32), "scale": 1.05}
# detectMultiScale returns the bounding boxes and their confidence weights.
humans, weights = hog.detectMultiScale(img, **hog_params)
for (x, y, w, h) in humans:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 4)
cv2.imshow("HOG default people detector results", img)
cv2.waitKey(0)
cv2.destroyAllWindows()
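
To give a feel for what a HOG feature is, here is a simplified sketch that builds the gradient-orientation histogram for a single cell with NumPy. It is an assumption-laden illustration, not what cv2.HOGDescriptor does internally: the function name cell_histogram is hypothetical, and the real descriptor also interpolates votes between neighboring bins and normalizes histograms over blocks of cells.

```python
import numpy as np


def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient directions (0 to 180 degrees)."""
    cell = cell.astype(np.float64)
    gy, gx = np.gradient(cell)  # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 180.0), weights=magnitude)
    return hist


# Toy 8x8 cell whose brightness increases from left to right.
ramp = np.tile(np.arange(8) * 10.0, (8, 1))
hist = cell_histogram(ramp)
print(hist)
```

For this horizontal ramp all of the gradient energy falls into the first bin (directions near 0 degrees); transposing the cell moves it to the bin around 90 degrees. Concatenating such histograms over all cells in a window gives the vector that the SVM classifies.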

2.2 After the deep learning invasion

After the arrival of deep learning, the accuracy of object detection improved dramatically. There have been several milestones, such as the arrival of CNNs, of VGG, and of ResNet. These techniques basically come from the image classification task, and object detection tends to improve by building on them.

The methods are divided by how they handle the two processes of classification and position estimation: a one-shot detector performs both at the same time, while a two-shot detector performs position estimation first and classification afterwards. (These are also commonly called one-stage and two-stage detectors.)

In general, one-shot detectors excel in detection speed and two-shot detectors excel in detection accuracy. (Honestly, I feel there is not much difference between the latest methods.)

One-shot Detector

A one-shot detector performs image classification and position detection at the same time. Most of them can be grouped into the YOLO family and the SSD family. Among one-shot detectors, which excel in detection speed, it is often said that SSD-type models tend to be faster and YOLO-type models tend to be more accurate.

YOLO
SSD

Two-shot Detector

Two-shot detectors mainly use the line of techniques that started with R-CNN. In recent years some of them also perform semantic segmentation, which classifies each pixel, and my impression is that they often beat one-shot detectors in accuracy. (Of course, this depends on having a labeled dataset that supports it.)

RCNN

3. Deep learning libraries

Deep learning generally relies on GPUs for processing power, so most libraries need to work together with CUDA. (You can, of course, run on the CPU alone.) To be honest, getting all the versions in that stack to match is a real ordeal.
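
As a small aid for the version-matching chore, the following sketch lists which of the libraries discussed below are installed and at what version, using only the Python standard library (Python 3.8 or later). The function name installed_versions is hypothetical, and the package names are assumed to be the usual PyPI distribution names. It does not check the CUDA toolkit or GPU driver, which you still have to compare against each library's compatibility notes yourself.

```python
from importlib import metadata


def installed_versions(packages=("tensorflow", "torch", "chainer", "keras")):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```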

tensorflow

A library developed by Google. New versions come out at a furious pace, which makes matching versions with CUDA difficult. Please stop changing the API so mercilessly (said quietly).

pytorch

A library that I picture as locked in battle with TensorFlow. My impression is that many reproduction implementations of recent top-tier papers use PyTorch. It seems especially popular with younger people (my own biased impression).

chainer

A library that PFN (Preferred Networks) has recently stopped updating. I like it a lot, but I think it is a tough choice for new users.

keras

A library that runs on top of a backend such as TensorFlow. Its level of abstraction is very high, so even deep learning beginners can easily build a CNN for image classification. However, once you try something more advanced than classification, you end up writing TensorFlow code, and defining the loss function becomes troublesome.

4. Tools that support the deep learning environment

Here are some keywords for tools that can help you maintain a deep learning development environment. Personally I recommend nvidia-docker + native Python + pip, and so on.

Docker (kubernetes)

A great tool that lets you save an environment using virtual containers (pardon my limited vocabulary). Since version 19.03 it supports GPUs natively; before that, you could create GPU-enabled containers by installing nvidia-docker. Its main advantages are that matching versions with CUDA becomes easy and that library environments are easy to reproduce. The disadvantage is the cost of learning Docker...

Anaconda

A Python version and library management tool used for scientific computing. It tends to clutter the terminal environment, and some people reject it for what are essentially religious reasons.

pyenv

There are many Python version and library management tools, and users are split into factions. (I do not really understand the difference between virtualenv and pyenv.)

5. In closing

Every task in the CV field is fiercely competitive, so the field is evolving at tremendous speed. Because it is such a crowded area, I would not recommend entering it for basic research, but I think there is still room to take it on in applications that use these technologies. Reproducing a paper's implementation is especially educational, because you have to write processing that the paper does not describe, processing that the paper mentions only briefly, and processing that no library provides in the first place. I recommend it when you have time. Problems that used to exist, such as compute resources and processing time, are being solved, and the tools are becoming very easy to handle, so why not give it a try?

I ran out of steam partway through, so I may add more in the future.

References

Object Detection in 20 Years: A Survey: Describes the history and methods of object detection in detail, so please read it if you are interested.
Deep Learning for Generic Object Detection: A Survey: Mainly describes methods using deep learning.
OpenCV-Python Tutorial
SSD: Single Shot MultiBox Detector
History summary about object detection

A light introduction to object detection