The icons drawn by Melville, known as "Mel Icons", are gaining popularity for their unique style. Above is Melville's own icon. In particular, many people ask Melville to draw an icon for them and then use it as their Twitter icon. Examples of typical Mel Icons:
I want a Mel Icon like this too!!!!! That's why I implemented a Mel Icon generator with machine learning!!!!! ...is a rough outline of the previous work. This time, we overhauled the algorithm, improved many points, and greatly evolved the Mel Icon generator. In this article, I will introduce the methods used to do it.
GAN (Generative Adversarial Networks) is used to generate the images.
This method combines a neural network that generates images (the Generator) with a neural network that identifies whether its input is a real Mel Icon (the Discriminator). The Generator tries to generate images that resemble Mel Icons as closely as possible in order to deceive the Discriminator, while the Discriminator learns to identify images more accurately so as not to be fooled. As the two neural networks train against each other, the Generator becomes able to generate images that are close to real Mel Icons. In short, it's Generator vs. Discriminator.
Progressive GAN

There are many variants that go by the name GAN. This time I am using one of them, **Progressive GAN**. The idea is, for example, to first train convolution layers corresponding to a low resolution of 4x4 for a number of iterations, then add a convolution layer corresponding to 8x8 and train, then add 16x16, and so on: training proceeds while the resolution is gradually increased.
At the beginning of training, the Generator outputs 4x4-resolution images, as shown. The Discriminator likewise takes a 4x4-resolution image as input and outputs a value indicating how much it looks like a Mel Icon.
The Generator generates an image, and the Discriminator receives two kinds of input: generated images and real images (Mel Icons from the training data).
After training at 4x4 resolution for a while, we add a convolution layer corresponding to 8x8 and continue training.
When 8x8 is done we add 16x16, and so on; the final structure looks like this. The goal this time was to output 256x256 images.
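As a minimal illustration, the sequence of resolutions the network grows through (doubling from 4x4 up to the 256x256 goal; the function name is mine) can be written as:

```python
def growth_schedule(start_res=4, final_res=256):
    """Resolutions a Progressive GAN passes through while growing."""
    steps = []
    res = start_res
    while res <= final_res:
        steps.append(res)
        res *= 2  # each new stage doubles the side length
    return steps

print(growth_schedule())  # → [4, 8, 16, 32, 64, 128, 256]
```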
GAN has a weakness: training tends to be unstable when learning relatively high-resolution images. Progressive GAN overcomes this by first capturing the coarse characteristics of the image and only then gradually focusing on small, complex details.
For the Generator to generate images that look like Mel Icons, and for the Discriminator to identify whether an input image is a Mel Icon, we need as many real Mel Icons as possible: a dataset to serve as teacher data for training. This time, Melville provided every Mel Icon drawn so far, 751 in all. (Overwhelming..... Thanks....!!!!!!) From these, we selected the Mel Icons usable for training, excluding the ones that are too irregular. Specifically:
It's like this. In addition, there were some icons that were almost identical, differing only slightly in hair length. Considering their impact on training as a whole, we kept up to 4 similar Mel Icons in the dataset and excluded the rest when there were 5 or more.
The resulting dataset contains about 640 usable icons. Considering that last time there were at most 100, the usable amount has increased more than six-fold. These are used as training data.
The role of the Generator is to take a sequence of random numbers (which we will call noise) as input and generate a Mel-Icon-like image from it. It learns so that when the generated image is fed to the Discriminator, it passes as a real Mel Icon. As its basic operation, the Generator produces an image by repeatedly convolving the input noise.
In the initial state, the neural network that makes up the Generator is as shown in the figure below.
Data is input at the top layer, processed, passed down through the layers in sequence, and the output is obtained from the bottom layer.
The top convolution layer receives the noise input to the Generator (512 channels at 4x4 resolution), applies a convolution, and outputs data with 256 channels at 4x4 resolution. That data is passed to the next convolution layer, and so on; the last layer outputs an image with 3 channels at 4x4 resolution. The 3 output channels of the last layer correspond to (R, G, B), and 4x4 is the resolution of the image the Generator outputs.
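As a rough sketch of this initial stage (the channel counts come from the text, but the kernel sizes, activation, and layer names are my assumptions):

```python
import torch
import torch.nn as nn

# Sketch of the Generator's initial 4x4 stage (details assumed, not Melville's exact code)
block_4x4 = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=3, padding=1),  # 512 -> 256 channels at 4x4
    nn.LeakyReLU(0.2),
)
to_rgb = nn.Conv2d(256, 3, kernel_size=1)  # 3 output channels = (R, G, B)

z = torch.randn(1, 512, 4, 4)  # noise: 512 channels, 4x4 resolution
img = to_rgb(block_4x4(z))
print(img.shape)  # → torch.Size([1, 3, 4, 4])
```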
While training these layers, we "introduce" the layers corresponding to the next resolution, 8x8, little by little. ("Introducing little by little" is described later.) By doing so we aim for the following state.
Here, a layer called Upsample is sandwiched between the 4x4 layers and the 8x8 layers. Given data at 4x4 resolution, this layer converts it to 8x8 resolution by interpolating intermediate pixel values. This bridges the data between the 4x4 and 8x8 layers.
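In PyTorch this kind of bridging can be done with, for example, `F.interpolate` (I use nearest-neighbor here; the interpolation mode is my assumption):

```python
import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)  # 4x4 feature map
y = F.interpolate(x, scale_factor=2, mode="nearest")           # -> 8x8
print(y.shape)  # → torch.Size([1, 1, 8, 8])
```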
It is known that abruptly adding a new layer has a negative effect on training. Therefore, Progressive GAN "introduces" each layer little by little.
For example, when adding an 8x8 layer after a 4x4 layer, the output of the 4x4 path is multiplied by (1-α), the output of the 8x8 layer is multiplied by α, and the two are added to form the output image. α starts at 0 and approaches 1 as training progresses.
When α is 0, the Generator neural network is the same as below.
When α is 1, the Generator neural network is the same as below.
By gradually moving α from 0 to 1, the high-resolution layer is mixed in little by little instead of starting high-resolution training all at once.
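This fade-in amounts to a simple linear blend; a minimal sketch (the function and variable names are mine):

```python
import torch

def fade_in(low_res_path, high_res_path, alpha):
    """Blend the upsampled low-resolution output with the new layer's output."""
    return (1.0 - alpha) * low_res_path + alpha * high_res_path

low = torch.zeros(1, 3, 8, 8)   # e.g. the 4x4 path's output, upsampled to 8x8
high = torch.ones(1, 3, 8, 8)   # output of the newly added 8x8 layer
out = fade_in(low, high, 0.25)  # early in training alpha is still small
```

At α=0 the output is exactly the old path, and at α=1 exactly the new layer, matching the transition described above.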
We do the same for the transition from 8x8 to 16x16, from 16x16 to 32x32, and so on. Ultimately, we train toward a network that can generate Mel Icons with 3 channels, corresponding to (R, G, B), at a resolution of 256x256, as shown below.
The role of the Discriminator is to take image data as input and identify whether it is a real Mel Icon. It learns to improve its accuracy so as not to be fooled by the Generator.
In the initial state, the neural network that makes up the Discriminator is as shown in the figure below. (The red part of the figure, MiniBatchStd, is described later.)
The top convolution layer receives the image input to the Discriminator (3 channels, corresponding to (R, G, B), at 4x4 resolution), applies a convolution, and outputs data with 256 channels at 4x4 resolution, passing it to the next layer. Each layer processes the data and passes it on, and the last layer outputs data with 1 channel at 1x1 resolution. This 1x1x1 output, in short a single value, indicates how much the input image looks like a Mel Icon.
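A rough sketch of this stage (channel counts are from the text; kernel sizes and activation are my assumptions):

```python
import torch
import torch.nn as nn

# Sketch of the Discriminator's 4x4 stage (details assumed)
from_rgb = nn.Conv2d(3, 256, kernel_size=1)        # 3 (R, G, B) -> 256 channels
final_block = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(256, 1, kernel_size=4),              # 4x4 -> a single 1x1 score
)

img = torch.randn(1, 3, 4, 4)
score = final_block(from_rgb(img))
print(score.shape)  # → torch.Size([1, 1, 1, 1])
```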
As with the Generator, while training these layers we "introduce" the layers corresponding to the next resolution, 8x8, little by little, aiming for the following state.
In the Generator, a process called Upsample bridged the data between the layers of each resolution by raising the resolution before passing the data on. The Discriminator inserts a process called Downsample, which works in exactly the opposite way. This makes it possible, for example, to convert 8x8-resolution data to 4x4 and bridge the data from the 8x8 layer to the 4x4 layer. (In PyTorch, the function AdaptiveAvgPool2d is convenient for this.)
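For example, the AdaptiveAvgPool2d mentioned above average-pools its input down to a requested output size:

```python
import torch
import torch.nn as nn

downsample = nn.AdaptiveAvgPool2d((4, 4))  # average-pools any spatial size to 4x4
x = torch.ones(1, 256, 8, 8)
y = downsample(x)
print(y.shape)  # → torch.Size([1, 256, 4, 4])
```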
As with the Generator, α is gradually raised from 0 to 1 so that the new layers are mixed in gradually.
Ultimately, we train toward a network that takes a Mel Icon with 3 channels, corresponding to (R, G, B), at 256x256 resolution as input, as shown below, and judges whether it is genuine or fake.
The "MiniBatchStd" layer, present only in the 4x4 stage, prevents a phenomenon called "mode collapse".
We want the Generator to generate as many kinds of Mel Icons as possible. However, a GAN can fall into a state where, despite being given various random inputs, it generates only images that are barely distinguishable from one another. This phenomenon is called mode collapse.
This is a result from the previous work, but I will use it for the explanation because it is the clearest example.
The upper row shows 5 of the images used for training, and the lower row shows 5 images output by the GAN. You can see the outputs are almost identical even though 5 different random inputs were given.
This phenomenon comes from the Generator "cutting corners". Suppose a generated image successfully tricks the Discriminator. Generating another image almost identical to it will likely deceive the Discriminator again. Repeating this, the Generator ends up able to generate only nearly identical images.
Progressive GAN has a mechanism that prevents the Generator from doing this: the layer called "MiniBatchStd". It computes a statistic called the mini-batch standard deviation and thereby prevents mode collapse.
The Discriminator receives several images at once when identifying them. If, for example, it receives 8 images, it must identify whether those 8 images came from the Generator or are real Mel Icons; across those 8 images, it takes the standard deviation at each pixel.
Having taken the standard deviation at each pixel, it then averages the result over all channels and pixels.
As a result, the MiniBatchStd layer emits a 1-channel feature map with the same resolution as the original image, filled with this value. It is passed to the next 4x4 layer as a set with the original image.
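A minimal implementation of such a layer might look like this (a sketch on my part; real implementations often add a small epsilon and other details):

```python
import torch
import torch.nn as nn

class MiniBatchStd(nn.Module):
    """Append the mini-batch standard deviation as one extra feature map."""
    def forward(self, x):
        n, c, h, w = x.shape
        std = x.std(dim=0, unbiased=False)  # per-channel, per-pixel std over the batch
        mean_std = std.mean()               # average over all channels and pixels
        feat = mean_std.expand(n, 1, h, w)  # one constant channel per image
        return torch.cat([x, feat], dim=1)  # pass on as a set with the original

layer = MiniBatchStd()
batch = torch.randn(8, 256, 4, 4)  # e.g. a mini-batch of 8 feature maps
out = layer(batch)
print(out.shape)  # → torch.Size([8, 257, 4, 4])
```

If every image in the batch is identical, the appended channel is all zeros, which is exactly the "too little diversity" signal described below.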
This value measures how diverse the batch of input images is (think of it as something like a variance). If it seems too small, the Discriminator can conclude that the Generator has started cheating and judge the inputs to be generated images. Since producing only similar images gets the Generator detected, it is forced to generate a variety of images.
The MiniBatchStd layer, which does this, is paired with the 4x4 stage near the end of the Discriminator to stave off mode collapse.
The Generator and Discriminator use **WGAN-GP** as the loss function. The definitions are as follows.
$$-E[d_{fake}]$$

$$E[d_{fake}] - E[d_{real}] + \lambda E_{\substack{\hat{x}\in P_{\hat{x}}}}[(||\nabla_{\hat{x}}D(\hat{x})||_{2}-1)^{2}]$$
I will explain these in order.
Input noise $z$ to the Generator and obtain as many images as the mini-batch size. (Hereafter the mini-batch size is $M$; this time $M = 8$.) Feed them to the Discriminator and get $M$ outputs, one per image, indicating how much each looks like a Mel Icon. Call these $d_{fake}$. Likewise, feed $M$ real Mel Icons to the Discriminator and call its $M$ outputs $d_{real}$.
WGAN-GP computes the loss from these $d_{real}$ and $d_{fake}$.
The Generator tries to generate an image that looks like a mel icon as much as possible in order to deceive the Discriminator when a sequence of random numbers is input.
In WGAN-GP, the generator loss function is defined as follows:
$$-E[d_{fake}]$$
In short, the $M$ images generated by the Generator are judged by the Discriminator, the outputs are averaged, and the sign is flipped. In WGAN-GP this definition is empirically known to work well. Adam was used as the optimizer, with a learning rate of 0.0005 and first- and second-moment coefficients of 0.0 and 0.99, respectively.
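Put as code, the loss and the optimizer settings from the text look like this (a sketch; `generator` is a stand-in name for the actual model):

```python
import torch

def generator_loss(d_fake):
    """WGAN-GP generator loss: the negated mean of the critic's scores."""
    return -d_fake.mean()

# Optimizer settings stated in the text (betas are Adam's moment coefficients):
# optimizer = torch.optim.Adam(generator.parameters(), lr=0.0005, betas=(0.0, 0.99))
print(generator_loss(torch.tensor([1.0, 3.0])).item())  # → -2.0
```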
In addition, only while training the 256x256 stage, the learning rate is reduced to 0.0001 after a certain number of iterations. (Subjectively, this seemed to make Mel Icon generation somewhat more successful... though that may be my imagination. There may be a better way.)
After backpropagating through the Generator, next comes backpropagation through the Discriminator.
In WGAN-GP, the Discriminator loss function is defined as follows.
$$E[d_{fake}] - E[d_{real}] + \lambda E_{\substack{\hat{x}\in P_{\hat{x}}}}[(||\nabla_{\hat{x}}D(\hat{x})||_{2}-1)^{2}]$$
gradient penalty

The definition of the gradient penalty is as follows.
$$\lambda E_{\substack{\hat{x}\in P_{\hat{x}}}}[(||\nabla_{\hat{x}}D(\hat{x})||_{2}-1)^{2}]$$
Here the distributions of the generated images and the real images are written $P_{fake}$ and $P_{real}$, respectively, and we define

$$\epsilon\in U[0,1],\quad x_{fake}\in P_{fake},\quad x_{real}\in P_{real}$$

$$\hat{x}=(1-\epsilon)x_{fake}+\epsilon x_{real}$$
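A sketch of computing this penalty in PyTorch (variable names are mine; λ defaults to the 10.0 used later in the text):

```python
import torch

def gradient_penalty(discriminator, x_real, x_fake, lam=10.0):
    """WGAN-GP gradient penalty on random interpolates x_hat."""
    n = x_real.size(0)
    eps = torch.rand(n, 1, 1, 1)  # epsilon ~ U[0, 1], one per sample
    x_hat = ((1.0 - eps) * x_fake + eps * x_real).requires_grad_(True)
    d_hat = discriminator(x_hat)
    # gradient of the critic's output with respect to x_hat
    grads = torch.autograd.grad(d_hat.sum(), x_hat, create_graph=True)[0]
    norms = grads.view(n, -1).norm(2, dim=1)  # ||grad||_2 per sample
    return lam * ((norms - 1.0) ** 2).mean()
```

`create_graph=True` keeps the gradient computation differentiable, so the penalty itself can be backpropagated.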
Let me give an intuitive picture of this. (It is only an intuition, and a rough one at that.)
Consider the many images $\hat{x}$ obtained by mixing generated and real images at random ratios, and the space formed by the Discriminator's outputs on them. For an optimized Discriminator, it is known that the gradient has norm 1 at almost every point of this space. Presumably a value near 1 is convenient because the gradient neither vanishes nor explodes during backpropagation. So for the Mel Icon generator's Discriminator, too, we train so that this value becomes 1. The term for that is the gradient penalty:
$$\lambda E_{\substack{\hat{x}\in P_{\hat{x}}}}[(||\nabla_{\hat{x}}D(\hat{x})||_{2}-1)^{2}]$$
Also, the constant $\lambda$ is set to 10.0 this time. (The reference material used 10.0, so I followed it.)
The above is the Discriminator loss function in WGAN-GP. To it, we additionally add the term $E[{d_{real}}^2]$, the average of the squares of $d_{real}$, scaled by a constant. This term reduces the negative impact of extreme gradients on training.
$$E[d_{fake}] - E[d_{real}] + \lambda E_{\substack{\hat{x}\in P_{\hat{x}}}}[(||\nabla_{\hat{x}}D(\hat{x})||_{2}-1)^{2}] + \beta E[{d_{real}}^2]$$
The constant $\beta$ is 0.001. (This too follows the reference material.)
The above is the Discriminator loss function used this time. Adam was used as the optimizer, with a learning rate of 0.0005 and first- and second-moment coefficients (the exponential decay rates used in the moment estimates) of 0.0 and 0.99, respectively. Furthermore, only while training the 256x256 stage, the learning rate is reduced to 0.0001 after a certain number of iterations. (Apart from the loss function, this is exactly the same as for the Generator.)
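The full Discriminator loss above can be sketched as follows (a sketch; `gp` is the gradient penalty term computed separately, and `discriminator` is a stand-in name):

```python
import torch

def discriminator_loss(d_fake, d_real, gp, beta=0.001):
    """WGAN-GP critic loss with the small drift term beta * E[d_real^2]."""
    return d_fake.mean() - d_real.mean() + gp + beta * (d_real ** 2).mean()

# Same optimizer settings as the Generator:
# optimizer = torch.optim.Adam(discriminator.parameters(), lr=0.0005, betas=(0.0, 0.99))
```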
The image above is shown again: the Generator and Discriminator built so far are combined into a Progressive GAN. With the following state as the final goal, training proceeds layer by layer from low resolution.
This time, we moved to the next resolution after 7500 iterations with a mini-batch size of 8. We train on the real Mel Icons we received and have the Generator produce Mel Icons.
**Okay!!!!!!!!!!!!!!!!!!**
It now generates a different image every time, and the resolution has improved too. Progressive GAN is seriously great!!!!!!!
The output during learning is as follows.
You can see that learning is progressing for each resolution.
In machine learning, a technique called "data augmentation" is often used to increase the variety of images in a dataset. At each training step, the dataset can be inflated by randomly changing the contrast or hue of an image, flipping it horizontally, rotating it, or distorting it.
However, doing this with the Mel Icon generator is problematic. First, a distinguishing feature of Mel Icons is that the head is drawn growing from the lower left.
For this reason, image distortion, rotation, horizontal flips, and the like are very likely to be learned unintentionally, so it is better to avoid them. I also avoid hue conversion, because it produces icons with eerie colors. Contrast conversion, however, had few negative effects, so we trained with that form of data augmentation.
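Contrast conversion can be sketched as scaling each pixel's deviation from the image mean (a hand-rolled version of my own; in practice something like torchvision's transforms could be used instead):

```python
import torch

def adjust_contrast(img, factor):
    """Scale deviations from the per-image mean; factor 2.0 doubles the contrast."""
    mean = img.mean()
    return ((img - mean) * factor + mean).clamp(0.0, 1.0)

img = torch.tensor([[0.4, 0.6]])
doubled = adjust_contrast(img, 2.0)  # deviations from the mean (0.5) are doubled
```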
The left is the original image, and the right is the image with its contrast doubled. Since this should increase the effective dataset size overwhelmingly compared to last time, I greedily doubled the number of training iterations before running. The result is below.
I can't say it's much better than without augmentation, but the method seems promising.
The Mel Icon generator not only overcame mode collapse with Progressive GAN but also succeeded in raising the resolution. Depending on the method and dataset, Progressive GAN can reportedly generate even full-HD-resolution images. (For a Twitter icon, I think 256x256 is plenty.) There are also active applications in the medical field, and the method seems likely to attract even more attention in the future.
Let's generate heart-pounding images with Progressive GAN.
The code I wrote is in this repository. https://github.com/zassou65535/image_generator_2
Taking the per-pixel average (torch.mean) over the roughly 640 images in this dataset gives the following image.
I tried the same with various other statistics. Below, from left to right, are the standard deviation (torch.std), median (torch.median), and mode (torch.mode).
I also tried the minimum (torch.min) and maximum (torch.max), but they produced nearly all-black and all-white images, respectively.
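All of these per-pixel statistics reduce over the batch dimension of a stacked image tensor. A sketch (random tensors stand in for the real dataset, and I use a small spatial size to keep memory modest):

```python
import torch

imgs = torch.rand(640, 3, 32, 32)       # stand-in for the ~640 dataset images
mean_img = imgs.mean(dim=0)             # torch.mean per pixel
std_img = imgs.std(dim=0)               # torch.std
median_img = imgs.median(dim=0).values  # torch.median
min_img = imgs.min(dim=0).values        # torch.min
max_img = imgs.max(dim=0).values        # torch.max
print(mean_img.shape)  # → torch.Size([3, 32, 32])
```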
By the way, computing the standard deviation (torch.std) over 5 randomly selected images gave this. It might be a little stylish.
Also, taking the minimum (torch.min) over all ~640 images in the dataset only yields near-black images, but with the number reduced to about 7, a Mel Icon like this pops out. Below is the minimum over 7 randomly selected images.
The story of making a mel icon generator
- Practical GAN: Deep Learning with Generative Adversarial Networks (learn while making, development with PyTorch)
- Implement PGGAN with PyTorch
- PGGAN: "Curriculum learning full of kindness"
- [DL Round Reading] Improved Training of Wasserstein GANs
- GAN (4) WGAN