Image alignment: from SIFT to deep learning

Overview

There are few Japanese articles about image alignment (as of January 22, 2020), so I translated and summarized the very easy-to-understand article Image Registration: From SIFT to Deep Learning (-07-16-image-registration-deep-learning). Some parts have been omitted and others added, so if you want the exact text, please read the original article.

Source code

What is image registration?

[Image: alignment example using photos of Muscat bonbons]

Image alignment is the process of correcting the misalignment between two images, and it is used when comparing multiple images of the same scene. It often appears, for example, in satellite image analysis, optical flow, and medical imaging. Let's look at a concrete example. The image above aligns photos of the Muscat bonbons I ate in Shin-Okubo. They were very delicious, but I wouldn't recommend them to anyone without a sweet tooth. In this example, you can see that the second Muscat bonbon from the left has been aligned to the position of the leftmost one while its brightness and so on are preserved. In the following, the untransformed image used as the standard (like the leftmost one) is called the reference image, and the image to be transformed (like the second from the left) is called the floating image. This article describes several techniques for aligning a floating image to a reference image. Note that iterative / intensity-based methods are not very common and are not covered in this article.

Feature-based approach

Since the early 2000s, feature-based approaches have been used for image alignment. This approach consists of three steps: keypoint detection and feature description, feature matching, and image transformation. Simply put, we select points of interest in both images, associate each point of interest in the reference image with its counterpart in the floating image, and then transform the floating image so that the two images are aligned.

Key point detection and feature description

Keypoints define what is important and distinctive in an image (such as corners and edges). Each keypoint is represented by a descriptor: a feature vector containing the keypoint's essential characteristics. A descriptor must be robust to image transformations (translation, scale, brightness, etc.). There are many algorithms for keypoint detection and feature description.

- SIFT (Scale-Invariant Feature Transform) is the original algorithm used for keypoint detection and description, but it is patented and its use is not free. SIFT feature descriptors are invariant to uniform scaling, orientation, and brightness changes, and partially invariant to affine distortion.
- SURF (Speeded Up Robust Features) is a detector and descriptor influenced by SIFT. It is several times faster than SIFT. It is also patented.
- ORB (Oriented FAST and Rotated BRIEF) is a fast binary descriptor based on the combination of the FAST keypoint detector and the BRIEF descriptor. It is rotation invariant and robust to noise. Developed by OpenCV Labs, it is an efficient and free alternative to SIFT.
- AKAZE (Accelerated-KAZE) is a sped-up version of KAZE (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.304.4980&rep=rep1&type=pdf). It provides a fast multiscale feature detection and description approach for nonlinear scale spaces. It is invariant to both scale and rotation, and it is free.

These algorithms are all easy to use with OpenCV. The example below uses the OpenCV implementation of AKAZE. You can try the other algorithms simply by changing the algorithm name.

import cv2 as cv

# Load the image and convert it to grayscale so the keypoints are easy to see
img = cv.imread('img/float.jpg')
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)

# Keypoint detection and feature description
akaze = cv.AKAZE_create()
kp, descriptor = akaze.detectAndCompute(gray, None)

keypoints_img = cv.drawKeypoints(gray, kp, img)
cv.imwrite('keypoints.jpg', keypoints_img)

[Image: detected AKAZE keypoints drawn on the image]
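As mentioned above, trying a different algorithm only requires changing the factory call. A minimal sketch, reusing the gray image from the code above (cv.ORB_create is OpenCV's factory function for ORB):

# Swap in ORB simply by calling a different create() function
orb = cv.ORB_create()
kp, descriptor = orb.detectAndCompute(gray, None)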

For more information on feature detection and descriptors, see OpenCV Tutorial (https://docs.opencv.org/3.4/d7/d66/tutorial_feature_detection.html).

Feature matching

After finding the keypoints in both images, the corresponding keypoints must be associated, or "matched". One method for this is BFMatcher.knnMatch(). It measures the distance between every pair of keypoint descriptors and, for each keypoint, returns the k matches with the smallest distances. A ratio test is then applied to keep only good matches: a match is considered reliable only if the matched keypoint is significantly closer than the closest incorrect match.

import cv2 as cv

float_img = cv.imread('img/float.jpg', cv.IMREAD_GRAYSCALE)
ref_img = cv.imread('img/ref.jpg', cv.IMREAD_GRAYSCALE)

akaze = cv.AKAZE_create()
float_kp, float_des = akaze.detectAndCompute(float_img, None)
ref_kp, ref_des = akaze.detectAndCompute(ref_img, None)

# Feature matching with a brute-force matcher (2 nearest neighbors per descriptor)
bf = cv.BFMatcher()
matches = bf.knnMatch(float_des, ref_des, k=2)

# Keep only good matches (Lowe's ratio test)
good_matches = []
for m, n in matches:
    if m.distance < 0.75 * n.distance:
        good_matches.append([m])

matches_img = cv.drawMatchesKnn(
    float_img,
    float_kp,
    ref_img,
    ref_kp,
    good_matches,
    None,
    flags=cv.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)
cv.imwrite('matches.jpg', matches_img)

[Image: matched keypoints between the floating and reference images]

See the documentation (https://docs.opencv.org/trunk/dc/dc3/tutorial_py_matcher.html) for other feature matching methods implemented in OpenCV.
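As one alternative, the snippet below sketches the FLANN-based matcher in place of BFMatcher, reusing the float_des and ref_des descriptors computed above. The LSH index is needed for binary descriptors such as AKAZE's; the parameter values are the illustrative ones from the OpenCV tutorial, not tuned settings.

# FLANN-based matching with an LSH index (for binary descriptors)
FLANN_INDEX_LSH = 6
index_params = dict(algorithm=FLANN_INDEX_LSH,
                    table_number=6, key_size=12, multi_probe_level=1)
search_params = dict(checks=50)
flann = cv.FlannBasedMatcher(index_params, search_params)
matches = flann.knnMatch(float_des, ref_des, k=2)

# With LSH, some queries can return fewer than k neighbors,
# so guard before applying the ratio test
good_matches = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good_matches.append([pair[0]])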

Image transformation

After matching at least four pairs of keypoints, one image can be transformed relative to the other. This is called image warping. Two images of the same planar surface in space are related by a homography (https://docs.opencv.org/3.4.1/d9/dab/tutorial_homography.html). A homography is a geometric transformation with eight free parameters, represented by a 3x3 matrix. It describes a distortion applied to the entire image (as opposed to local transformations). Therefore, to obtain the transformed floating image, we compute the homography matrix and apply it to the floating image. To ensure an optimal transformation, the RANSAC algorithm is used to detect outlier matches and remove them before determining the final homography. It is built directly into OpenCV's findHomography (https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html?highlight=findhomography#findhomography) method. An alternative to RANSAC is LMEDS, a robust estimation method based on the least median of squares.

import numpy as np
import cv2 as cv

float_img = cv.imread('img/float.jpg', cv.IMREAD_GRAYSCALE)
ref_img = cv.imread('img/ref.jpg', cv.IMREAD_GRAYSCALE)

akaze = cv.AKAZE_create()
float_kp, float_des = akaze.detectAndCompute(float_img, None)
ref_kp, ref_des = akaze.detectAndCompute(ref_img, None)

bf = cv.BFMatcher()
matches = bf.knnMatch(float_des, ref_des, k=2)

good_matches = []
for m, n in matches:
    if m.distance < 0.75 * n.distance:
        good_matches.append([m])

# Coordinates of the matched keypoints in each image
float_matched_kpts = np.float32(
    [float_kp[m[0].queryIdx].pt for m in good_matches]).reshape(-1, 1, 2)
ref_matched_kpts = np.float32(
    [ref_kp[m[0].trainIdx].pt for m in good_matches]).reshape(-1, 1, 2)

# Estimate the homography mapping the floating image onto the
# reference image, using RANSAC to reject outlier matches
H, status = cv.findHomography(
    float_matched_kpts, ref_matched_kpts, cv.RANSAC, 5.0)

# Warp the floating image; the output size matches the reference image
warped_image = cv.warpPerspective(
    float_img, H, (ref_img.shape[1], ref_img.shape[0]))

cv.imwrite('warped.jpg', warped_image)

[Image: the floating image warped onto the reference image]

If you are interested in the details of these three steps, OpenCV has put together a series of useful tutorials (https://docs.opencv.org/3.1.0/db/d27/tutorial_py_table_of_contents_feature2d.html).

Deep learning approach

Most recent research on image alignment involves deep learning. Over the past few years, deep learning has enabled state-of-the-art performance in computer vision tasks such as classification, detection, and segmentation. Image alignment is no exception.

Feature extraction

Deep learning was first used in image alignment for feature extraction. Successive layers of a convolutional neural network (CNN) capture increasingly complex image features and learn task-specific representations. Since 2014, researchers have applied these networks to the feature extraction step instead of SIFT and similar algorithms.

- In 2014, Dosovitskiy et al. proposed training a CNN using only unlabeled data. The generic nature of these features makes them robust to transformations. These features, or descriptors, outperformed SIFT descriptors.
- In 2018, Yang et al. developed a non-rigid alignment method based on the same idea. They used layers of a pre-trained VGG network to generate feature descriptors that preserve both convolutional information and localization accuracy (a rough sketch of this idea follows this list). These descriptors appear to outperform detectors like SIFT, especially in cases where SIFT contains many outliers or cannot match a sufficient number of feature points. The code for the latter paper can be found here. You can try this alignment method on your own images within 15 minutes, but it is about 70 times slower than the SIFT-like method implemented in the first half of this article.
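As a rough illustration of the idea in Yang et al. (using an intermediate layer of a pre-trained CNN as a dense feature descriptor), here is a minimal sketch based on torchvision's VGG16. The layer cut-off and preprocessing are illustrative assumptions, not the exact configuration from the paper.

import torch
from torchvision import models, transforms

# Truncate a pre-trained VGG16 at an intermediate convolutional layer
# (the cut-off index here is an illustrative choice)
vgg_features = models.vgg16(pretrained=True).features[:16].eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def dense_descriptors(img_rgb):
    # img_rgb: H x W x 3 uint8 RGB array -> C x H' x W' feature map
    with torch.no_grad():
        x = preprocess(img_rgb).unsqueeze(0)  # add a batch dimension
        return vgg_features(x).squeeze(0)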

Homography learning

Researchers have not limited the use of deep learning to feature extraction, but have tried to achieve alignment by directly learning geometric transformations using neural networks.

Supervised learning

In 2016, DeTone et al. published Deep Image Homography Estimation (https://arxiv.org/pdf/1606.03798.pdf), describing Regression HomographyNet, a VGG-style model that learns the homography relating two images. This algorithm has the advantage of learning both the homography and the CNN model parameters end-to-end at the same time; no feature extraction or matching step is required.

[Image: Regression HomographyNet architecture (DeTone et al., 2016)]

The network produces eight real numbers as output. Training is supervised with a loss between this output and the ground-truth homography.

[Image: supervised loss computed against the ground-truth homography]
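As a minimal sketch of this supervised objective (assuming the eight outputs are the corner offsets of the paper's 4-point homography parameterization; the function name and the NumPy implementation are illustrative):

import numpy as np

def supervised_loss(pred_offsets, gt_offsets):
    # Euclidean (L2) loss between the network's 8 outputs and the
    # ground-truth parameterization of the homography
    pred = np.asarray(pred_offsets, dtype=np.float32)
    gt = np.asarray(gt_offsets, dtype=np.float32)
    return 0.5 * float(np.sum((pred - gt) ** 2))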

Like other supervised approaches, this homography estimation method requires pairs of labeled data. However, it is not easy to obtain ground-truth homographies for real-world data.

Unsupervised learning

Nguyen et al. presented an unsupervised learning approach to deep image homography estimation. They used the same CNN, but with a loss function suited to unsupervised learning: a photometric loss, which requires no ground-truth labels. It computes the similarity between the reference image and the warped floating image.

\mathbf{L}_{PW} = \frac{1}{|\mathbf{x}_i|} \sum_{\mathbf{x}_i} \left| I^A(\mathscr{H}(\mathbf{x}_i)) - I^B(\mathbf{x}_i) \right|
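A minimal sketch of this photometric loss, assuming I_A is the floating image, I_B is the reference image, and H is the predicted 3x3 homography; OpenCV warping stands in for the paper's differentiable warping layers:

import numpy as np
import cv2 as cv

def photometric_loss(I_A, I_B, H):
    # Warp the floating image with H, i.e. evaluate I_A(H(x_i)), then
    # average the absolute intensity differences against the reference
    h, w = I_B.shape
    warped = cv.warpPerspective(I_A, H, (w, h))
    diff = np.abs(warped.astype(np.float32) - I_B.astype(np.float32))
    return float(diff.mean())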

Their approach introduces two new network structures, the Tensor Direct Linear Transform and the Spatial Transformation Layer.

[Image: network structure of the unsupervised approach (Nguyen et al.)]

The authors argue that this unsupervised method has faster inference speeds, comparable or better accuracy, and robustness to lighting variations compared to traditional feature-based methods. In addition, it is more adaptable and performant than supervised methods.

Other approaches

Reinforcement learning

Deep reinforcement learning is attracting attention as an approach to alignment in medical applications. In contrast to predefined optimization algorithms, this approach uses a trained agent to perform the alignment.

[Image: overview of reinforcement-learning-based alignment]

- In 2016, Liao et al. were the first to use reinforcement learning for image alignment. Their method uses a greedy supervised algorithm for end-to-end training, and the goal is to align the images by finding the optimal sequence of motion actions. This approach outperformed some state-of-the-art methods, but was only applied to rigid transformations.
- Reinforcement learning has also been used for more complex transformations. Krebs et al. performed robust non-rigid alignment through agent-based action learning, applying an artificial agent to optimize the parameters of a deformation model. This method was evaluated on inter-subject alignment of prostate MRI and showed promising results in 2D and 3D.

Complex transformations

A significant proportion of current image alignment research concerns medical imaging. In many cases, the transformation between two medical images cannot simply be described by a homography matrix, because of local deformations of the subject (such as breathing and anatomical changes). More complex transformation models are needed, such as diffeomorphisms, which can be represented by displacement vector fields.

[Image: example of a deformable transformation represented by a displacement vector field]

Researchers have tried to use neural networks to estimate these large transformation models with many parameters.

- The first example is the reinforcement learning method of Krebs et al. above.
- In 2017, de Vos et al. proposed DIRNet, a network that uses a CNN to predict a grid of control points, which is used to generate a displacement vector field that warps the floating image onto the reference image (see the warping sketch after this list).

[Image: DIRNet architecture (de Vos et al., 2017)]

- Quicksilver registration (https://www.researchgate.net/publication/315748621_Quicksilver_Fast_Predictive_Image_Registration_-_a_Deep_Learning_Approach) tackles a similar problem. Quicksilver uses a deep encoder-decoder network to directly predict patch-wise deformations from image appearance.
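These methods all predict some form of dense displacement field. As a minimal sketch of how such a field warps an image (the zero field below is just a placeholder for a predicted one; cv.remap pulls each output pixel from the displaced source location):

import numpy as np
import cv2 as cv

img = cv.imread('img/float.jpg', cv.IMREAD_GRAYSCALE)
h, w = img.shape

# Per-pixel displacement vector field (H x W x 2); a real method would
# predict this, here it is all zeros (the identity) as a placeholder
flow = np.zeros((h, w, 2), dtype=np.float32)

grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
map_x = (grid_x + flow[..., 0]).astype(np.float32)
map_y = (grid_y + flow[..., 1]).astype(np.float32)
warped = cv.remap(img, map_x, map_y, interpolation=cv.INTER_LINEAR)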

Summary

We have introduced several techniques for image alignment. The field is shifting from methods based on feature points to deep learning methods that estimate the transformation directly from the images.
