While skimming papers on Papers with Code, I came across the technique of colorizing black-and-white images, something I had wanted to learn about for a while. I have translated an outline of the paper below, and I hope you find it helpful.
Instance-aware Image Colorization https://paperswithcode.com/paper/instance-aware-image-colorization
A colorization technique for black-and-white images that uses instance segmentation was recently posted on arXiv.
Converting black-and-white images into plausible color images is an active research topic. However, predicting the two missing color channels from a single grayscale channel is an inherently ill-posed problem. In addition, many objects admit several valid colors, so the colorization has multiple plausible interpretations (e.g., a car could be white, black, or red).
Previously reported methods struggle to produce good colorizations when many objects appear against a cluttered background (see the figure below).
To address these problems, this paper proposes a new deep learning framework for colorization that is aware of instance segmentation. In particular, the authors found that **clearly separating objects from the background** is effective in improving colorization performance.
The authors' framework consists of the following three components.
In recent years, automating colorization with machine learning has attracted attention. In existing work, deep convolutional neural networks that learn color prediction from large datasets have become the mainstream approach.
Processing that takes instance segmentation into account makes the separation between figure and ground explicit, which makes the visual appearance of each region easier to learn and manipulate.
The system takes a black-and-white image $X \in \mathbb{R}^{H \times W \times 1}$ as input and predicts the two missing color channels $Y \in \mathbb{R}^{H \times W \times 2}$ in the CIE $L^*a^*b^*$ color space, end to end.
The figure below shows the network configuration. First, a pre-trained object detector is used to obtain object bounding boxes $\{B_i\}_{i=1}^{N}$ ($N$ is the number of instances) from the black-and-white image.
Next, the regions inside the detected bounding boxes are cropped from the black-and-white image and resized, producing instance images $\{X_i\}_{i=1}^{N}$.
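The crop-and-resize step above can be sketched as follows. This is a minimal illustration, not the paper's code: the function name and box format are my own, and I use nearest-neighbour resampling for simplicity where the actual pipeline presumably uses bilinear interpolation.

```python
import numpy as np

def crop_and_resize(gray, boxes, size=32):
    """Crop each bounding box (x1, y1, x2, y2) from a grayscale image
    and resize the patch to size x size (nearest-neighbour, for brevity)."""
    instances = []
    for (x1, y1, x2, y2) in boxes:
        patch = gray[y1:y2, x1:x2]
        h, w = patch.shape
        # index maps that pick the nearest source row/column for each output pixel
        rows = np.arange(size) * h // size
        cols = np.arange(size) * w // size
        instances.append(patch[rows[:, None], cols[None, :]])
    return instances

# toy example: a 100x100 image with two hypothetical detected boxes
img = np.random.rand(100, 100)
crops = crop_and_resize(img, [(10, 10, 50, 60), (0, 0, 100, 100)], size=32)
print([c.shape for c in crops])  # → [(32, 32), (32, 32)]
```

Each crop is resized to a fixed resolution so that every instance image can be fed through the same instance colorization network regardless of the original box size.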
Each instance image $X_i$ and the input grayscale image $X$ are then fed to the instance colorization network and the full-image colorization network, respectively. The feature maps extracted at the $j$-th layer of these networks are denoted $f_j^{X_i}$ and $f_j^{X}$.
Finally, a fusion module fuses the instance features $\{f_j^{X_i}\}_{i=1}^{N}$ with the full-image features $f_j^{X}$ at each layer. The fused full-image feature $f_j^{X}$ is passed on to the $(j+1)$-th layer. Repeating this up to the last layer yields the predicted color image $Y$.
The networks are trained sequentially: first the full-image network, then the instance network, and finally, with those two networks frozen, the fusion module.
The image is colorized using the detected object instances. For this purpose, an off-the-shelf pre-trained Mask R-CNN is used as the object detector.
The fusion module takes two kinds of input: (1) the full-image features $f_j^{X}$, and (2) the instance features $\{f_j^{X_i}\}_{i=1}^{N}$ together with their corresponding bounding boxes. For each type of feature, a small neural network with three convolution layers predicts a full-image weight map $W_F$ and a per-instance weight map $W_I^i$.
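The blending step of the fusion module can be sketched as below. This is a simplified stand-in: the learned three-layer convolutional networks that predict $W_F$ and $W_I^i$ are replaced by precomputed weight maps passed in as arguments, and the instance features are assumed to be already resized to their box dimensions. Only the paste-back and softmax-weighted blend are shown.

```python
import numpy as np

def fuse(full_feat, full_w, inst_feats, inst_ws, boxes):
    """Softmax-blend a full-image feature map (H, W, C) with per-instance
    feature maps pasted back at their box locations (x1, y1, x2, y2).
    full_w is an (H, W) weight map; each inst_ws[i] covers its box only."""
    H, W, C = full_feat.shape
    feats, ws = [full_feat], [full_w]
    for f, w, (x1, y1, x2, y2) in zip(inst_feats, inst_ws, boxes):
        pad_f = np.zeros_like(full_feat)
        pad_w = np.full((H, W), -np.inf)   # instance has no weight outside its box
        pad_f[y1:y2, x1:x2] = f
        pad_w[y1:y2, x1:x2] = w
        feats.append(pad_f)
        ws.append(pad_w)
    ws = np.stack(ws)                      # (N+1, H, W)
    ws = np.exp(ws - ws.max(axis=0))       # numerically stable softmax
    ws = ws / ws.sum(axis=0)
    return np.einsum('nhw,nhwc->hwc', ws, np.stack(feats))

# toy example: one instance occupying the top-left 2x2 corner of a 4x4 map
full = np.ones((4, 4, 2))
fused = fuse(full, np.zeros((4, 4)),
             [3 * np.ones((2, 2, 2))], [np.zeros((2, 2))],
             [(0, 0, 2, 2)])
print(fused[0, 0, 0], fused[3, 3, 0])  # → 2.0 1.0
```

Inside the box, equal weights give a 50/50 blend of the full-image and instance features; outside, the full-image feature passes through unchanged.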
Training proceeds as follows. First, the full-image colorization network is trained, and its learned weights are transferred to the instance colorization network as initialization. Next, the instance colorization network is trained. Finally, the weights of the full-image and instance models are frozen, and the fusion module is trained.
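In PyTorch terms, this staged schedule might look like the sketch below. The three single-convolution "networks" are hypothetical stand-ins for the real colorization backbones, and the actual training loops are elided; only the weight transfer and freezing pattern are shown.

```python
import torch.nn as nn

# hypothetical stand-ins for the three sub-networks
full_net = nn.Conv2d(1, 2, 3, padding=1)
inst_net = nn.Conv2d(1, 2, 3, padding=1)
fusion   = nn.Conv2d(4, 2, 3, padding=1)

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

# stage 1: train the full-image network (training loop elided)
# stage 2: initialise the instance network from it, then train it
inst_net.load_state_dict(full_net.state_dict())
# stage 3: freeze both backbones and train only the fusion module
set_trainable(full_net, False)
set_trainable(inst_net, False)
set_trainable(fusion, True)
```

Freezing via `requires_grad = False` keeps the already-trained colorization weights fixed so the optimizer updates only the fusion module's parameters in the final stage.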
These three training stages were performed on the ImageNet dataset.
Comparison with state-of-the-art methods.
The table above compares quantitative results on the three datasets. The proposed method scores better than previous methods on all metrics.
※ Metrics:
- LPIPS: distance between the original and reconstructed images after projection into a feature (latent) space; lower means the images are more similar.
- SSIM: structural similarity based on luminance, contrast, and structure, computed from local pixel means, variances, and covariances; higher is more similar.
- PSNR: based on the squared differences in pixel intensity at corresponding positions of the two images; higher means better quality.
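As a concrete reference for the simplest of these metrics, PSNR for images normalized to $[0, 1]$ can be computed as below (the function name is my own; LPIPS and SSIM require a learned network and windowed statistics respectively, so they are omitted here).

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((a - b) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 0.1)     # constant per-pixel error of 0.1 → MSE = 0.01
print(round(psnr(a, b), 1))  # → 20.0
```

Identical images give infinite PSNR, and each tenfold reduction in MSE adds 10 dB.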
User study: participants were shown pairs of colorization results and asked their preference (forced-choice comparison). On average, the authors' method was preferred over Zhang et al. (61% vs. 39%) and over DeOldify (72% vs. 28%). Interestingly, although DeOldify does not produce accurate colorizations as measured in the benchmark experiments, its saturated colors may be preferred by users.
The figure above shows two failure cases. The authors' approach can produce visible artifacts such as washed-out colors or colors that bleed across object boundaries.
In this study, images were cropped using an off-the-shelf object detection model, and features were extracted by both an instance branch and a full-image branch. Fusing these with the newly proposed fusion module was confirmed to yield better feature maps. Experiments showed that the proposed method outperforms existing methods on three benchmark datasets.
I learned about a colorization technique that incorporates instance segmentation. I understood the technique itself, but I found it difficult to discuss quantitatively whether a colorized image is plausible. When there are multiple valid choices, such as the color of a car or of vegetation, how do we decide which algorithm's output is the most plausible?
The authors also had human judges evaluate the results, but if an algorithm could handle this multimodal aspect well, the technology would feel much closer to genuine artificial intelligence.