I am working on **video frame interpolation with deep learning** at university, and I post the implementations I try along the way. I will keep posting follow-ups on video frame interpolation, so please give an LGTM & follow if you like.
What I did this time was build a network that uses actual video frames to generate one intermediate frame from the six surrounding frames (three before and three after).
Google Colab https://colab.research.google.com/notebooks/welcome.ipynb?hl=ja
**Deep learning that generates an intermediate frame from the surrounding frames (three before and three after).** The network is DnCNN [1]. I already had an implementation of this network at hand, so I am reusing it. ([1] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang, “Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising”, https://arxiv.org/abs/1608.03981)
The DnCNN network is as follows. It was originally designed to remove noise from images. The input has a frame size of 160 × 90 and 18 channels (6 frames × RGB). The output has the same frame size and 3 channels.
I adjusted the parameters of the middle layers (shown in blue in the figure): 15 layers, a 3 × 3 kernel size, and 72 channels.
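From those numbers we can estimate the size of the modified network. This is a sketch under my own assumptions (not stated in the post): a plain stack of 3 × 3 convolutions with one input conv (18 → 72 channels), 15 middle convs (72 → 72), and one output conv (72 → 3), all stride 1.

```python
# Hypothetical helper: count parameters and receptive field of the
# modified DnCNN described above (assumed layer layout, see lead-in).

def conv_params(k, c_in, c_out):
    """Weights + biases of one k x k convolution layer."""
    return k * k * c_in * c_out + c_out

def dncnn_stats(k=3, c_in=18, c_mid=72, c_out=3, n_mid=15):
    n_convs = n_mid + 2                      # input conv + 15 middle convs + output conv
    receptive_field = 1 + n_convs * (k - 1)  # each 3x3 conv grows it by 2 pixels
    params = (conv_params(k, c_in, c_mid)
              + n_mid * conv_params(k, c_mid, c_mid)
              + conv_params(k, c_mid, c_out))
    return receptive_field, params

rf, p = dncnn_stats()
print(rf, p)  # receptive field 35 px, ~0.71M parameters
```

A 35-pixel receptive field on 160 × 90 frames means each output pixel sees only a modest neighborhood, which matters when objects move far between the six input frames.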
For data I used MOT17, which consists of footage of city scenes. https://motchallenge.net/ The number of sets is 1320 for training and 1285 for testing.
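As described later, one set consumes 7 consecutive frames. A minimal sketch of how the sets could be cut from a frame sequence, assuming non-overlapping 7-frame windows (the function name and return format are my own, not from the post):

```python
# Hypothetical set extraction: 3 frames before + 3 after as input,
# the middle frame as the ground-truth target.

def make_sets(n_frames, window=7):
    """Return (input_indices, target_index) pairs from frames 0..n_frames-1."""
    sets = []
    for start in range(0, n_frames - window + 1, window):
        frames = list(range(start, start + window))
        mid = window // 2
        inputs = frames[:mid] + frames[mid + 1:]  # the 6 surrounding frames
        target = frames[mid]                      # the intermediate frame
        sets.append((inputs, target))
    return sets

print(make_sets(14))
# -> [([0, 1, 2, 4, 5, 6], 3), ([7, 8, 9, 11, 12, 13], 10)]
```

With overlapping (stride-1) windows instead, the same footage would yield roughly 7× more sets, which may be one way to grow the dataset.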
The first image shows, from top to bottom, the two preceding frames, the generated intermediate frame, and the two following frames. There is actually one more input frame on each side, but I omitted them because the images become too small.
The next image compares the result with the ground-truth intermediate frame.
The result is blurred toward the preceding and following frames and the colors have shifted, so it cannot really be called successful interpolation.
Here is a graph of generalization performance. The training and validation values stay close together, so there seems to be no problem there. Here is the numerical data such as loss values.
The numbers are not that bad. They are close to the loss value and average PSNR from my earlier experiment, in which I cropped a single image and performed pseudo frame interpolation. However, this seems to be because the preceding and following frames are nearly identical. The mid-frame quality is still low, so it needs to be raised.
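For reference, the average PSNR mentioned above is typically computed with the standard formula below; this is the usual definition, not necessarily the exact code used in the experiment.

```python
import numpy as np

def psnr(target, pred, peak=255.0):
    """Peak signal-to-noise ratio in dB between two uint8-range images."""
    mse = np.mean((target.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

# Example: a uniform error of 16 gray levels gives roughly 24 dB.
a = np.full((90, 160, 3), 100, dtype=np.uint8)
b = np.full((90, 160, 3), 116, dtype=np.uint8)
print(round(psnr(a, b), 2))  # ~24.05
```

PSNR rewards pixel-wise closeness, which is exactly why near-identical neighboring frames can inflate the score even when the interpolation itself is poor.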
I think there are three possible reasons why it could not interpolate:

- the dataset is small
- the six input frames are not working well
- problems with the network (DnCNN)
There are about 1300 sets each of training and test data. The number of original frames is large, but it is hard to accumulate sets because each set consumes 7 frames. I am in the process of creating my own dataset, so I want to keep an eye on the number of sets.
What about the six input frames? Every paper I have seen interpolates from the two surrounding frames (one before and one after), so I am not sure the approach holds up with six. I am starting to think I should go back to two frames for comparison.
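Going back to two frames would not require rebuilding the dataset: the existing 18-channel input tensors can simply be sliced down to the two frames nearest the target. This sketch assumes the channels are ordered [t−3, t−2, t−1, t+1, t+2, t+3] with 3 RGB channels per frame; that ordering is my assumption, not stated in the post.

```python
import numpy as np

def nearest_two_frames(x):
    """(H, W, 18) -> (H, W, 6): keep only the RGB of frames t-1 and t+1."""
    # Frames t-1 and t+1 occupy channels 6..11 under the assumed ordering.
    return x[..., 6:12]

x = np.zeros((90, 160, 18), dtype=np.float32)
print(nearest_two_frames(x).shape)  # (90, 160, 6)
```

Only the first convolution layer of the network (18 → 72 channels becoming 6 → 72) would need to change for this comparison.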
- Increase the number of sets with my own dataset.
Thank you for reading to the end. Please feel free to point out anything that could be improved. I will keep posting on this topic, so please give an LGTM & follow if you like!