Do you know this icon?
Yes, it's an icon by the famous artist Melville. Many people on Twitter have Melville draw their favorite characters and use the results as their profile thumbnails, and these icons have gained great popularity. Icons drawn by this artist are often called "Mel Icons" because of their distinctive style. Here are some typical examples of Mel Icons:
(The icons of Yukatayu and Shun, respectively, as of February 19, 2020)
I want an icon like this too!!! So I built a Mel Icon generator with machine learning. In this article, I'd like to briefly introduce the method behind it.
A GAN (Generative Adversarial Network) is used for generation.
This method combines a neural network that generates images (the Generator) with a neural network that judges whether its input is a real Mel Icon (the Discriminator). The Generator tries to produce images that resemble Mel Icons as closely as possible in order to fool the Discriminator, while the Discriminator learns to tell real from fake more and more accurately. As the two networks train against each other, the Generator gradually becomes able to produce images close to real Mel Icons.
For the Generator to learn to produce Mel-Icon-like images, and for the Discriminator to learn to judge whether an input image is a Mel Icon, you need to gather as many real Mel Icons as possible and build a training dataset from them. So I went around Twitter, repeatedly finding and saving Mel Icon thumbnails, and collected more than 100 images. These are used for training.
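As a rough idea of what the data pipeline might look like, here is a minimal sketch of loading such a dataset in PyTorch. The folder path `./mel_icons/` and the use of `ImageFolder` are my assumptions, not details from the original setup:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Resize every collected icon to 64x64 RGB and scale pixels to [-1, 1],
# a common convention when training GANs with a tanh output layer.
transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# "./mel_icons/" is a hypothetical path; ImageFolder expects the images to be
# grouped in at least one subdirectory (the class label is ignored here).
dataset = datasets.ImageFolder(root="./mel_icons/", transform=transform)
dataloader = DataLoader(dataset, batch_size=5, shuffle=True)  # batch size 5, as used later in the article
```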
We show the Generator the Mel Icons prepared above and train it to produce images that look like them. The generated image is 64 x 64 pixels with 3 RGB color channels. If the Generator produced similar data every time, training would not proceed well, so it needs to be able to generate as wide a variety of images as possible. Therefore, a sequence of random numbers is fed into the Generator as the seed for image generation. A process called "transposed convolution" (described below) is applied to this sequence layer by layer, gradually transforming it into a 64 x 64 pixel, 3-channel RGB image.
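As a minimal illustration, the random-number input can be a tensor of standard-normal noise; the dimension of 100 is my assumption, since the article doesn't state the noise size:

```python
import torch

# One noise vector per image, shaped (batch, channels, 1, 1) so that
# transposed convolutions can progressively upsample it to 64x64.
z = torch.randn(5, 100, 1, 1)  # batch of 5; the 100-dimensional noise size is assumed
```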
In an ordinary convolution, as shown below, the kernel slides across the input and at each position the sum of elementwise products is output. In PyTorch this can be implemented with torch.nn.Conv2d, for example:
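A minimal shape check (the channel counts and kernel size here are illustrative, not the article's exact values):

```python
import torch
import torch.nn as nn

# A 4x4 kernel slides over a 64x64 input with stride 2; each output value is
# the sum of elementwise products between the kernel and the patch under it.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=4, stride=2)
x = torch.randn(1, 3, 64, 64)
print(conv(x).shape)  # torch.Size([1, 64, 31, 31]) -- the resolution shrinks
```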
In the transposed convolution used here, on the other hand, each input element is multiplied by the kernel, and the overlapping results are summed. Intuitively, it feels like each element is being expanded. In PyTorch this can be implemented with torch.nn.ConvTranspose2d, for example:
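And the corresponding shape check for a transposed convolution (again with illustrative parameters):

```python
import torch
import torch.nn as nn

# Each input element is multiplied by the whole kernel and the overlapping
# results are summed, so the spatial resolution grows instead of shrinking.
deconv = nn.ConvTranspose2d(in_channels=100, out_channels=64, kernel_size=4, stride=1)
z = torch.randn(1, 100, 1, 1)
print(deconv(z).shape)  # torch.Size([1, 64, 4, 4]) -- a 1x1 input expands to 4x4
```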
These transposed convolution layers are stacked together with self_attention layers (described later), and the last layer has 3 output channels (corresponding to R, G, and B). Putting this together, the outline of the Generator we are building is as shown in the figure below.
This Generator has a total of 5 transposed convolution layers, with a layer called self_attention inserted between the 3rd and 4th layers and between the 4th and 5th layers. By attending to pixels with similar values all at once, self_attention makes it possible to evaluate the whole image with a relatively small amount of computation.
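Below is a minimal sketch of what such a Generator might look like in PyTorch. The channel counts, batch-norm layers, and the internals of the self_attention layer are my assumptions based on typical SAGAN implementations; only the layer counts and their placement come from the description above:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """SAGAN-style self-attention: every position attends to every other
    position, so pixels with similar features are evaluated together."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned weight of the attention branch

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).view(b, -1, h * w).permute(0, 2, 1)  # (b, hw, c/8)
        k = self.key(x).view(b, -1, h * w)                     # (b, c/8, hw)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)          # (b, hw, hw)
        v = self.value(x).view(b, -1, h * w)                   # (b, c, hw)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x  # residual connection

class Generator(nn.Module):
    """Five transposed convolutions; self_attention between the 3rd/4th
    and 4th/5th layers; the last layer outputs 3 channels (RGB)."""
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(),  # 1x1 -> 4x4
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),    # 4x4 -> 8x8
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),    # 8x8 -> 16x16
            SelfAttention(128),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),      # 16x16 -> 32x32
            SelfAttention(64),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),                            # 32x32 -> 64x64
        )

    def forward(self, z):
        return self.net(z)
```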
In its untrained state, the Generator configured this way outputs images like the following, for example. (The result depends on the random-number sequence you feed in.) Since it hasn't learned anything yet, it can only output something like noise. However, by training against the Discriminator, the network that judges whether its input is a Mel Icon (explained next), it becomes able to output images like these.
The Discriminator's job is to look at an image, such as one generated by the Generator above, and judge whether it is a Mel Icon. In essence, it is an image recognizer. The input image is 64 x 64 pixels with 3 RGB channels, and the output is a value (in the range 0 to 1) indicating how strongly the input resembles a Mel Icon. The architecture stacks 5 ordinary convolution layers, with a self_attention layer sandwiched between the 3rd and 4th layers and between the 4th and 5th layers, as shown in the figure below.
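A matching sketch of the Discriminator, reusing the SelfAttention module above. Again the channel counts and activations are my assumptions. Note also that with the hinge loss used below, the raw (unbounded) score is typically used directly, so this sketch omits a final sigmoid even though the output is described above as a 0-to-1 value:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Five ordinary convolutions; self_attention between the 3rd/4th
    and 4th/5th layers; outputs one realness score per image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.1),     # 64x64 -> 32x32
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.1),   # 32x32 -> 16x16
            nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.1),  # 16x16 -> 8x8
            SelfAttention(256),
            nn.Conv2d(256, 512, 4, 2, 1), nn.LeakyReLU(0.1),  # 8x8 -> 4x4
            SelfAttention(512),
            nn.Conv2d(512, 1, 4, 1, 0),                       # 4x4 -> 1x1 score
        )

    def forward(self, x):
        return self.net(x).view(-1)  # one scalar per image
```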
The training procedures for the Discriminator and the Generator are described below.
When given an image, the Discriminator returns a number between 0 and 1 indicating how much it looks like a Mel Icon. First, feed in a real Mel Icon and call the output (a value from 0 to 1) $d_{real}$. Next, feed random numbers into the Generator and have it produce an image. Feeding this image into the Discriminator likewise returns a value between 0 and 1; call it $d_{fake}$. The $d_{real}$ and $d_{fake}$ obtained this way are plugged into the loss function described below to get the value used for backpropagation.
SAGAN, one GAN variant, uses the "hinge version of the adversarial loss" below. In short, let $l_{i}$ and $l_{i}^{\prime}$ be the correct labels, $y_{i}$ and $y_{i}^{\prime}$ the corresponding outputs of the Discriminator, and $M$ the number of data items per mini-batch. The loss is then written as
-\frac{1}{M}\sum_{i=1}^{M}\left(l_{i}\min(0,-1+y_{i})+(1-l_{i}^{\prime})\min(0,-1-y_{i}^{\prime})\right)
[^1] This time we set $y_{i} = d_{real}$, $y_{i}^{\prime} = d_{fake}$, $l_{i} = 1$ (meaning "definitely a Mel Icon"), and $l_{i}^{\prime} = 0$ (meaning "definitely not a Mel Icon"), so the loss becomes
-\frac{1}{M}\sum_{i=1}^{M}\left(\min(0,-1+d_{real})+\min(0,-1-d_{fake})\right)
This is the Discriminator's loss function used this time. Adam was used as the optimization method for backpropagation, with the learning rate set to 0.0004 and Adam's first and second moments (the exponential decay rates used for moment estimation) set to 0.0 and 0.9, respectively.
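Put into code, the Discriminator's hinge loss and its Adam optimizer might look like the following sketch (netD reuses the Discriminator sketched earlier; torch.clamp(..., max=0.0) implements the min(0, ...) terms):

```python
import torch

netD = Discriminator()
optimizerD = torch.optim.Adam(netD.parameters(), lr=0.0004, betas=(0.0, 0.9))

def discriminator_hinge_loss(d_real, d_fake):
    # -(1/M) * sum( min(0, -1 + d_real) + min(0, -1 - d_fake) )
    loss_real = -torch.mean(torch.clamp(d_real - 1.0, max=0.0))   # min(0, -1 + d_real)
    loss_fake = -torch.mean(torch.clamp(-d_fake - 1.0, max=0.0))  # min(0, -1 - d_fake)
    return loss_real + loss_fake
```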
When a sequence of random numbers is input, the Generator produces an image, trying to make it look as much like a Mel Icon as possible. First, feed a random-number sequence $z_{i}$ into the Generator to obtain an image. Then feed that image into the Discriminator, which outputs a value indicating how much it looks like a Mel Icon; call this $r_{i}$.
In SAGAN's "hinge version of the adversarial loss", the Generator's loss function is defined as follows:
-\frac{1}{M}\sum_{i=1}^{M}r_{i}
In SAGAN, this definition is empirically known to work well. [^1] Since $M$ is the number of data items per mini-batch, this amounts to using the Discriminator's judgment as-is. That surprised me a little; what do you think? Adam was used as the optimization method here too, with the learning rate set to 0.0001 and the first and second moments set to 0.0 and 0.9, respectively (the same as the Discriminator except for the learning rate).
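The Generator side is even shorter; as a sketch, with the same assumed names:

```python
import torch

netG = Generator()
optimizerG = torch.optim.Adam(netG.parameters(), lr=0.0001, betas=(0.0, 0.9))

def generator_hinge_loss(d_fake):
    # -(1/M) * sum(r_i): raise the Discriminator's score on generated images
    return -torch.mean(d_fake)
```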
Reprinting the figure introduced above: the Generator and Discriminator built earlier are combined in this way to form the GAN.
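Combining everything above, one training epoch might be sketched like this (all names come from the earlier sketches; the noise dimension of 100 is still an assumption):

```python
import torch

for real_images, _ in dataloader:
    batch = real_images.size(0)

    # --- Discriminator step: score real icons and freshly generated fakes ---
    z = torch.randn(batch, 100, 1, 1)
    d_real = netD(real_images)
    d_fake = netD(netG(z).detach())  # detach so this step doesn't update the Generator
    lossD = discriminator_hinge_loss(d_real, d_fake)
    optimizerD.zero_grad()
    lossD.backward()
    optimizerD.step()

    # --- Generator step: try to raise the Discriminator's score on fakes ---
    z = torch.randn(batch, 100, 1, 1)
    lossG = generator_hinge_loss(netD(netG(z)))
    optimizerG.zero_grad()
    lossG.backward()
    optimizerG.step()
```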
Train on the collected real Mel Icons and have the Generator produce Mel Icons. The number of data items $M$ per mini-batch is kept at 5. The results are as follows. __Awesome!!!__ __I'm so impressed!!!__ For comparison, examples of the training data are shown in the top row, and the actually generated images in the bottom row. Note that the generated results change with every run. Personally, I was quite surprised that this could be done with source code that isn't all that long. GAN is really amazing!!!
I've built something that can do all this, but a few issues remain unsolved.
The code I wrote is in this repository. https://github.com/zassou65535/image_generator
GAN is an insanely powerful technique. Even though mode collapse occurred, it produced something quite close to real Mel Icons from a dataset of only about 100 images. Try generating your own exciting images with GAN too.
As a bonus: if you simply average all the collected Mel Icons, you get the following image.
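For reference, the per-pixel average can be computed with a few lines like these (reusing the dataloader sketched earlier):

```python
import torch

# Sum every collected icon pixel-by-pixel, then divide by the count.
total, count = torch.zeros(3, 64, 64), 0
for images, _ in dataloader:
    total += images.sum(dim=0)
    count += images.size(0)
mean_icon = (total / count + 1) / 2  # undo the [-1, 1] normalization for viewing
```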
[^1]: "Learn While Making! Advanced Deep Learning with PyTorch"