Reconstruction of moving images by Autoencoder using 3D-CNN

Overview

How is everyone doing? The spread of the novel coronavirus seems to have subsided a little, and many people are gradually returning to work or school. This time, I would like to come back to reconstruction tasks that use generative models. In particular, **I will try to reconstruct video.** **(This article covers the experimental results and discussion of an encoder-decoder reconstruction method based on 3D convolution, which can be extended to anomaly detection in video. It does not go into the theoretical background, such as the mathematical formulas.)** The full implementation is available here. It is implemented in PyTorch.

In the previous article, Verification and implementation of video reconstruction method using GRU and Autoencoder, we considered the following model.

図4.png

I considered that model because it lets the latent variables be expressed as sequence data. In other words, 3D convolution encodes an entire video into a single latent variable, and that seemed like too much to ask. That was the motivation. In fact, some papers mention that reconstruction with 3D convolution is difficult. (On the other hand, at recent CVPR conferences, 3D convolution is expected to do well in video recognition when a large amount of data is available, but those are discriminative models, which is a different field from this generation task.)

Still, I wanted to confirm experimentally whether reconstruction with 3D convolution works. **Video recognition works well, but does reconstruction really work?** I somehow understood the argument in theory, but I always wondered. That is why I wrote this article. Let's start with the model right away.

Video reconstruction model

The model to be implemented this time is shown below.

40.png

$\boldsymbol{x_1, x_2, \ldots, x_T}$ denotes a video of length $T$, and $\boldsymbol{x_t}$ is each frame. The 3D-CNN encoder receives the video and maps it to a single point $\boldsymbol{z_T}$ in the latent space. The decoder then maps this point back to the observation space. As a reconstruction task, the parameters are optimized to minimize the difference between input and output.

I think the 3D-CNN approach is very simple and easy to understand. 3D convolution can **extract temporal and spatial features at once**, without going through a time-series model. There are other articles that explain how 3D-CNN works, so I will leave the details to them (lol).
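
To make "temporal and spatial features at once" concrete, here is a minimal sketch of how one 3D convolution shrinks the time, height, and width axes together. The kernel size 4, stride 2, and padding 1 are taken from the network listing later in this article. The clip size (16 frames of 96x96 pixels) and the helper names `conv_out` and `trace_encoder` are assumptions made only for illustration.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output length of one convolution along a single axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_encoder(t, h, w, n_layers=4):
    """Return the (T, H, W) shape after each stride-2 3D convolution."""
    shapes = [(t, h, w)]
    for _ in range(n_layers):
        # The same kernel slides over time, height and width in one operation,
        # so all three axes are halved by every layer.
        t, h, w = conv_out(t), conv_out(h), conv_out(w)
        shapes.append((t, h, w))
    return shapes


if __name__ == "__main__":
    # Hypothetical clip: 16 frames of 96x96 pixels.
    for i, shape in enumerate(trace_encoder(16, 96, 96)):
        print(f"after {i} conv layers: (T, H, W) = {shape}")
    # The last line printed is: after 4 conv layers: (T, H, W) = (1, 6, 6)
```

Under these assumptions the whole clip collapses to a 1x6x6 grid per channel before it is squeezed into the latent vector, which is why the time axis no longer exists as a sequence in the latent space.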

Model training / verification

**The reconstruction workflow is as follows.**

  1. Prepare a human action dataset
  2. Train the 3D-CNN autoencoder
  3. Reconstruct video using the model trained in step 2

1. Human action dataset

I use the familiar human action dataset. It was used for evaluation in a video generation model called MoCoGAN, and as the name suggests, it contains clips of people walking, waving, and so on.

epoch_real_60.png epoch_real_30.png

You can download it from here. (The images above are also taken from the data at this link.)

2. Training the 3D-CNN autoencoder

Next, we train the model on the data above. The loss function is MSE, so training simply minimizes the error between input and output. For more details on the model, please see here. The model implementation is shown below.

network.py



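import torch.nn as nn


class ThreeD_conv(nn.Module):
    """3D-CNN autoencoder: video -> single latent vector z -> video."""

    def __init__(self, opt, ndf=64, ngpu=1):
        super(ThreeD_conv, self).__init__()
        self.ngpu = ngpu
        self.ndf = ndf
        self.z_dim = opt.z_dim
        self.T = opt.T
        self.image_size = opt.image_size
        self.n_channels = opt.n_channels
        # Four stride-2 convolutions shrink T, H and W by a factor of 16,
        # so opt.T and opt.image_size must be multiples of 16.
        self.conv_size = opt.image_size // 16
        self.conv_T = opt.T // 16
        self.flat_dim = (ndf * 8) * self.conv_T * self.conv_size * self.conv_size

        # Encoder: (N, C, T, H, W) -> (N, ndf*8, T/16, H/16, W/16)
        self.encoder = nn.Sequential(
            nn.Conv3d(opt.n_channels, ndf, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf),
            nn.ReLU(inplace=True),
            nn.Conv3d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf * 2),
            nn.ReLU(inplace=True),
            nn.Conv3d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf * 4),
            nn.ReLU(inplace=True),
            nn.Conv3d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf * 8),
            nn.ReLU(inplace=True),
        )
        # Flattened features -> latent vector (spatial part is 6*6 when image_size is 96)
        self.fc1 = nn.Sequential(
            nn.Linear(self.flat_dim, self.z_dim),
            nn.ReLU(inplace=True),
        )
        # Latent vector -> flattened features for the decoder
        self.fc2 = nn.Sequential(
            nn.Linear(self.z_dim, self.flat_dim),
            nn.ReLU(inplace=True),
        )
        # Decoder mirrors the encoder; Tanh keeps the output in [-1, 1]
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(ndf * 8, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf * 4),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(ndf * 4, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf * 2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(ndf * 2, ndf, 4, 2, 1, bias=False),
            nn.BatchNorm3d(ndf),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(ndf, opt.n_channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    # NOTE: forward() is not shown in the original listing. This wiring is an
    # assumption inferred from the layer shapes above; see the linked code.
    def forward(self, x):
        h = self.encoder(x)                  # x: (N, C, T, H, W)
        z = self.fc1(h.flatten(1))           # latent vector: (N, z_dim)
        h = self.fc2(z).view(-1, self.ndf * 8, self.conv_T, self.conv_size, self.conv_size)
        return self.decoder(h)               # same shape as x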
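
As a quick sanity check on the sizes in this listing, the following sketch computes how many features enter `fc1` and confirms that the four transposed convolutions restore the original clip size. It is plain Python arithmetic, not part of the original implementation. The values T=16 and image_size=96 are assumptions (96 is what the `#6*6` comment implies), and the helper names are hypothetical.

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output length of one ConvTranspose3d along a single axis
    (no output_padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel


def bottleneck_features(T, image_size, ndf=64):
    """Number of inputs to fc1: channels * (T/16) * (H/16) * (W/16)."""
    return (ndf * 8) * (T // 16) * (image_size // 16) ** 2


def decoded_shape(T, image_size, n_layers=4):
    """(T, H, W) produced by the decoder from the reshaped fc2 output."""
    t, s = T // 16, image_size // 16
    for _ in range(n_layers):
        t, s = deconv_out(t), deconv_out(s)
    return (t, s, s)


if __name__ == "__main__":
    # Assumed settings: 16 frames, 96x96 pixels, ndf=64.
    print(bottleneck_features(16, 96))  # 18432 = 512 * 1 * 6 * 6
    print(decoded_shape(16, 96))        # (16, 96, 96)
```

With these settings, 18,432 features are compressed into a single `z_dim`-dimensional vector, and each transposed convolution exactly doubles every axis, so the output matches the input shape only when T and image_size are multiples of 16.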

Training ran for 5,000 iterations, and the loss changed as follows. It is close to 0 in the second half and barely changes, so my impression is that it converged without problems.

image.png
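
For readers who want to see what one training iteration with MSE loss looks like, here is a minimal sketch. It is not the article's training script: the tiny stand-in network, the Adam optimizer, the learning rate, and the random input tensor are all assumptions chosen so that the example runs in a moment on a CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in autoencoder (NOT the article's network), same layer types.
model = nn.Sequential(
    nn.Conv3d(1, 4, 4, 2, 1),            # (N, 1, 8, 16, 16) -> (N, 4, 4, 8, 8)
    nn.ReLU(inplace=True),
    nn.ConvTranspose3d(4, 1, 4, 2, 1),   # back to (N, 1, 8, 16, 16)
    nn.Tanh(),                           # outputs in [-1, 1]
)
criterion = nn.MSELoss()
# The optimizer and learning rate are assumptions; the article does not state them.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random tensor standing in for a batch of clips scaled to [-1, 1].
clips = torch.rand(2, 1, 8, 16, 16) * 2 - 1

losses = []
for itr in range(5):
    recon = model(clips)            # reconstruct the input clips
    loss = criterion(recon, clips)  # mean squared error between output and input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses)
```

The input serves as its own target, which is what makes this an autoencoder: no labels are needed, only the clips themselves.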

3. Reconstructing video with the model

Now run inference with the model. In the implementation linked above, the reconstruction results are saved to generated_videos under the logs folder at each checkpoint specified by an argument. As training progressed, the model behaved as follows. For each iteration, the upper row is the input and the lower row is the output.

- Iteration 0: Of course, nothing is reconstructed at all.
  real_itr0_no1.png recon_itr0_no1.png

- Iteration 1,000: It is blurry, but a human shape appears.
  real_itr1000_no0.png recon_itr1000_no0.png

- Iteration 4,000: It is a little clearer, but blurring and smearing remain, and fine details such as the person's hands are not reproduced.
  real_itr4000_no1.png recon_itr4000_no1.png

Next, let's compare these results with GRU-AE. The following are reconstructions by GRU-AE, the method from the previous article, under the same conditions as above. Iteration 0 is omitted.

- Iteration 500: Not too bad, I would say. Is it going well?
  real_itr500_no0.png recon_itr500_no0.png

- Iteration 1,500: Oh, this is looking good.
  real_itr1500_no2.png recon_itr1500_no2.png

- Iteration 4,000: For a moment I could not tell which one was real. If you look closely it may be a little blurry, though (lol).
  real_itr4000_no2.png recon_itr4000_no2.png

Summary

This time, I tried reconstructing video with a 3D-CNN autoencoder (3DCNN-AE). As expected, the generated video is not good. It is not that the motion cannot be reproduced, but each frame is less sharp than with GRU-AE. Many anomaly detection papers point to 3D-CNN as problematic, and this time I was able to see why experimentally. On the other hand, **3D-CNN is a rising star in recognition tasks.** Where a large amount of data can be collected, it seems to be favored for video recognition over 2D approaches such as GRU. But generation tasks are different. Data is scarce, and it seems that "strong features" like those obtained in supervised learning cannot be acquired. The day when 3D convolution becomes the favorite for video anomaly detection still seems some way off. Thank you for reading to the end.
