GAN: DCGAN Part2 - Training DCGAN model

Target

This is a continuation of DCGAN using the Microsoft Cognitive Toolkit (CNTK).

In Part2, we train DCGAN with CNTK using the image data prepared in Part1. It is assumed that CNTK and an NVIDIA GPU with CUDA are installed.

Introduction

In GAN: DCGAN Part1 - Scraping Web images, I collected face images of my favorite artist from Bing Image Search.

In Part 2, we will create and train a face generation model using the Deep Convolutional Generative Adversarial Network (DCGAN).

Deep Convolutional Generative Adversarial Network

The DCGAN [1] implemented this time consists of two neural networks, as shown in the figure below. Here, $ x $ is the real image dataset and $ z $ is the latent variable.

dcgan.png

The network structure of Generator and Discriminator was set as follows.

Generator

The Generator uses transposed convolutions with a kernel size of 5 and a stride of 2 (the first layer uses a 4x4 kernel, as listed in the table below).

Each transposed convolution layer is immediately followed by Batch Normalization [2] and then the ReLU activation function. Because Batch Normalization supplies its own shift term, the transposed convolution layers do not use a bias term.

The transposed convolution of the final layer uses a bias term and applies the tanh activation function without Batch Normalization.

Layer Filters Size/Stride Input Output
ConvolutionTranspose2D 1024 4x4/2 100 4x4x1024
ConvolutionTranspose2D 512 5x5/2 4x4x1024 8x8x512
ConvolutionTranspose2D 256 5x5/2 8x8x512 16x16x256
ConvolutionTranspose2D 128 5x5/2 16x16x256 32x32x128
ConvolutionTranspose2D 64 5x5/2 32x32x128 64x64x64
ConvolutionTranspose2D 32 5x5/2 64x64x64 128x128x32
ConvolutionTranspose2D 3 5x5/2 128x128x32 256x256x3
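The output sizes in the table can be checked with the usual transposed-convolution size formula. The following is a minimal sketch in plain Python; the `pad` and `output_pad` values are assumptions chosen so that each stride-2 layer doubles the resolution, and they are not taken from dcgan_training.py.

```python
def conv_transpose_out(size, kernel, stride, pad, output_pad=0):
    """Spatial output size of a transposed convolution along one axis."""
    return (size - 1) * stride + kernel - 2 * pad + output_pad


# (filters, kernel, stride, pad, output_pad) for each layer in the table.
# pad and output_pad are assumed values, not read from the training program.
GENERATOR_LAYERS = [
    (1024, 4, 2, 0, 0),  # 1x1 latent vector -> 4x4
    (512, 5, 2, 2, 1),
    (256, 5, 2, 2, 1),
    (128, 5, 2, 2, 1),
    (64, 5, 2, 2, 1),
    (32, 5, 2, 2, 1),
    (3, 5, 2, 2, 1),
]


def generator_shapes(layers=GENERATOR_LAYERS, size=1):
    """Return the (height, width, channels) output shape of every layer."""
    shapes = []
    for filters, kernel, stride, pad, output_pad in layers:
        size = conv_transpose_out(size, kernel, stride, pad, output_pad)
        shapes.append((size, size, filters))
    return shapes


if __name__ == "__main__":
    for shape in generator_shapes():
        print("x".join(str(v) for v in shape))
```

Under these assumptions the trace ends at 256x256x3, which matches the final row of the table.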

The prior distribution of the latent variable $ z $ is the standard normal distribution $ \mathcal{N}(0, 1) $.

Discriminator

The Discriminator uses convolutions with a kernel size of 3 and a stride of 2.

Each convolution layer is immediately followed by Batch Normalization and then the Leaky ReLU activation function [3], so the convolution layers do not use a bias term. The Leaky ReLU parameter was set to 0.2. Batch Normalization is not applied after the first convolution layer, however.

The convolution of the final layer uses a bias term and applies the sigmoid activation function without Batch Normalization.

Layer Filters Size/Stride Input Output
Convolution2D 32 3x3/2 256x256x3 128x128x32
Convolution2D 64 3x3/2 128x128x32 64x64x64
Convolution2D 128 3x3/2 64x64x64 32x32x128
Convolution2D 256 3x3/2 32x32x128 16x16x256
Convolution2D 512 3x3/2 16x16x256 8x8x512
Convolution2D 1024 3x3/2 8x8x512 4x4x1024
Convolution2D 1 4x4/1 4x4x1024 1
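The downsampling path in the table can be traced the same way with the standard convolution size formula. This is a rough sketch in plain Python; the padding values are assumptions (1 for the 3x3 layers, 0 for the final 4x4 layer) chosen to reproduce the table.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1


# (filters, kernel, stride, pad) for each layer in the table.
# The pad values are assumed, not read from the training program.
DISCRIMINATOR_LAYERS = [
    (32, 3, 2, 1),
    (64, 3, 2, 1),
    (128, 3, 2, 1),
    (256, 3, 2, 1),
    (512, 3, 2, 1),
    (1024, 3, 2, 1),
    (1, 4, 1, 0),  # 4x4 feature map -> single score
]


def discriminator_shapes(layers=DISCRIMINATOR_LAYERS, size=256):
    """Return the (height, width, channels) output shape of every layer."""
    shapes = []
    for filters, kernel, stride, pad in layers:
        size = conv_out(size, kernel, stride, pad)
        shapes.append((size, size, filters))
    return shapes


if __name__ == "__main__":
    for shape in discriminator_shapes():
        print("x".join(str(v) for v in shape))
```

With these settings every stride-2 layer halves the resolution, and the last layer reduces the 4x4x1024 feature map to a single value.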

Settings in training

The parameters of the transposed convolution and convolution layers were initialized from a normal distribution with a standard deviation of 0.02 [1].

The loss function implemented this time is shown in the following equation. [4]

\max_{D} \mathbb{E}_{x \sim p_r(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))]\\
\max_{G} \mathbb{E}_{z \sim p_z(z)} [\log D(G(z))]

Here, $ D $ and $ G $ represent the Discriminator and the Generator, respectively, $ x $ is the input image, $ z $ is the latent variable, $ p_r $ is the distribution of real image data, and $ p_z $ is the prior distribution from which fake image data is generated.
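To make the two objectives concrete, here is a minimal sketch in plain Python that evaluates them on mini-batches of Discriminator outputs. The function names are hypothetical and this is not the CNTK code used in the training program; the expectations are simply approximated by mini-batch means.

```python
import math


def mean(values):
    return sum(values) / len(values)


def discriminator_objective(d_real, d_fake):
    """E[log D(x)] + E[log(1 - D(G(z)))]; the Discriminator maximizes this."""
    real_term = mean([math.log(d) for d in d_real])
    fake_term = mean([math.log(1.0 - d) for d in d_fake])
    return real_term + fake_term


def generator_objective(d_fake):
    """E[log D(G(z))]; the Generator maximizes this."""
    return mean([math.log(d) for d in d_fake])


if __name__ == "__main__":
    d_real = [0.9, 0.8]  # D(x) on real images (made-up values)
    d_fake = [0.2, 0.1]  # D(G(z)) on generated images (made-up values)
    print(discriminator_objective(d_real, d_fake))
    print(generator_objective(d_fake))
```

In a framework that minimizes a loss, each objective would be negated before being passed to the optimizer.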

Adam [5] is used as the optimization algorithm for both the Generator and the Discriminator. The learning rate was set to 1e-4, Adam's hyperparameter $ β_1 $ was set to 0.5, and $ β_2 $ was left at the CNTK default value.

The model was trained for 50,000 iterations with a mini-batch size of 16.

Implementation

Execution environment

Hardware

・ CPU Intel(R) Core(TM) i7-6700K 4.00GHz
・ GPU NVIDIA GeForce GTX 1060 6GB

Software

・ Windows 10 Pro 1909
・ CUDA 10.0
・ cuDNN 7.6
・ Python 3.6.6
・ cntk-gpu 2.7
・ opencv-contrib-python 4.1.1.26
・ numpy 1.17.3
・ pandas 0.25.0

Program to run

The training program is available on GitHub.

dcgan_training.py


Commentary

Here are some of the GAN training techniques introduced in "How to Train a GAN?" that I found particularly helpful.

Input Normalization

Normalize the input images to [-1, 1]. Accordingly, use tanh as the output activation of the Generator.

Latent Distribution

Sample the latent variable from a normal distribution, whose samples are spread spherically, rather than from a uniform distribution, whose samples fill a hypercube.
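As an illustration of this normalization, the sketch below maps 8-bit pixel values to [-1, 1] and maps tanh-range outputs back to pixels. The exact scaling used in the training program is not shown in this article, so the constant 127.5 and the function names are assumptions.

```python
def to_model_range(pixel):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return pixel / 127.5 - 1.0


def to_pixel(value):
    """Map a tanh-range value in [-1, 1] back to an 8-bit pixel value."""
    pixel = int(round((value + 1.0) * 127.5))
    return max(0, min(255, pixel))  # clamp against rounding overshoot


if __name__ == "__main__":
    for p in (0, 64, 255):
        x = to_model_range(p)
        print(p, x, to_pixel(x))
```

Because tanh outputs values in (-1, 1), the Generator output then lies in the same range as the normalized real images seen by the Discriminator.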

dcgan_training.py


z_data = np.ascontiguousarray(np.random.normal(size=(minibatch_size, z_dim)), dtype="float32")

Network Architecture

Since we are dealing with images this time, we adopted DCGAN, which uses convolution layers.

Initialize the convolution layer weights from a normal distribution with a standard deviation of 0.02.

For downsampling, use average pooling or a convolution layer with a stride of 2. For upsampling, use Pixel Shuffle [7] or a transposed convolution layer with a stride of 2.

For both the Discriminator and the Generator, training did not work unless Batch Normalization was applied before the activation function. Batch Normalization stabilizes training, but real and fake data should not be mixed in the same mini-batch.

In the implemented program, the Discriminator is cloned with shared parameters at the following line, so that the clone receives only fake data as input.

dcgan_training.py


D_fake = D_real.clone(method="share", substitutions={x_real.output: G_fake.output})

ReLU suffers from the dying ReLU (dead neuron) problem, so Leaky ReLU is used instead of ReLU to avoid the resulting sparse gradients. Note that ELU [6] may not work well.
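The difference between the two activations shows up in their gradients for negative inputs. The following is a small sketch in plain Python using the slope 0.2 from the Discriminator settings; the function names are for illustration only.

```python
def relu_grad(x):
    """Gradient of ReLU: zero for non-positive inputs, so the unit can 'die'."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, alpha=0.2):
    """Leaky ReLU: identity for positive inputs, slope alpha otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.2):
    """Gradient of Leaky ReLU: never exactly zero when alpha > 0."""
    return 1.0 if x > 0 else alpha


if __name__ == "__main__":
    for x in (-3.0, -0.5, 2.0):
        print(x, leaky_relu(x), relu_grad(x), leaky_relu_grad(x))
```

A unit whose inputs are always negative receives no gradient with ReLU, whereas Leaky ReLU still passes a gradient scaled by 0.2.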

Loss Function

The loss function of the original GAN [4] is as follows.

\max_{D} \mathbb{E}_{x \sim p_r(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))]\\
\min_{G} \mathbb{E}_{z \sim p_z(z)} [\log(1 - D(G(z)))]

However, in the above equations the loss function of the Discriminator is a cross-entropy, so if the Discriminator learns too well, the gradient of the Generator vanishes. For this reason, the loss function of the Generator is changed from a minimization problem to the following maximization problem.

\max_{G} \mathbb{E}_{z \sim p_z(z)} [\log D(G(z))]

That said, if the training goes well, there seems to be no difference in the results with either loss function.
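The vanishing-gradient argument can be checked numerically. Writing the Discriminator output as sigmoid(a) for a logit a (the final layer uses sigmoid), the derivative of log(1 - sigmoid(a)) with respect to a is -sigmoid(a), while the derivative of log(sigmoid(a)) is 1 - sigmoid(a). The sketch below, in plain Python with hypothetical function names, evaluates both when the Discriminator confidently rejects a fake image.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_minimize_form(a):
    """d/da log(1 - sigmoid(a)), the original Generator objective."""
    return -sigmoid(a)


def grad_maximize_form(a):
    """d/da log(sigmoid(a)), the modified Generator objective."""
    return 1.0 - sigmoid(a)


if __name__ == "__main__":
    # Very negative logits mean D(G(z)) is close to 0 (a confident Discriminator).
    for a in (0.0, -5.0, -10.0):
        print(a, grad_minimize_form(a), grad_maximize_form(a))
```

As the logit becomes more negative, the gradient of the original form shrinks toward zero, while the gradient of the modified form approaches 1, so the Generator keeps receiving a learning signal.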

Optimizer

Adam is the recommended optimization algorithm. However, set the momentum ($ β_1 $) to a small value of about 0.5 and the learning rate to a small value of about 1e-4.

I didn't use Cyclical Learning Rate [8] this time because of the instability of training.
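For reference, a single Adam update with these settings can be sketched in plain Python as follows. This follows the update rule of the Adam paper [5] and is not CNTK's implementation; in particular, `beta2=0.999` and `eps=1e-8` are the paper's suggested values and are assumptions here, since the CNTK default for $ β_2 $ may differ.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter.

    m and v are the running first and second moment estimates,
    and t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


if __name__ == "__main__":
    param, m, v = 0.0, 0.0, 0.0
    for t in range(1, 4):
        param, m, v = adam_step(param, grad=1.0, m=m, v=v, t=t)
        print(t, param)
```

A smaller $ β_1 $ shortens the memory of the first moment estimate, so each step depends more on recent gradients.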

Training

It is difficult to balance the training of the Discriminator and the Generator based on loss statistics, so avoid doing so as much as possible.

If you really want to control the balance, train the Discriminator more often.

Result

Discriminator and Generator loss functions

The figure below is a visualization of each loss function during training. The horizontal axis represents the number of repetitions, and the vertical axis represents the value of the loss function.

dcgan_logging.png

Generated images and transitions during training

The figure below shows face images generated by the trained Generator. Many of the images look similar, which suggests that mode collapse is occurring. This is a bad sign.

dcgan_image.png

The figure below is an animation showing how the generated images changed during training. The Generator is trying hard to produce faces, but it is hard to say that it is succeeding.

dcgan.gif

Quantitative evaluation by Inception Score

Quantitative evaluation of GAN is a difficult problem, but Inception Score [9] has been proposed as one of the evaluation indexes.

The Inception Score measures the following two properties of the trained Generator.

・ Image Quality: Can it generate images of specific, recognizable objects?
・ Image Diversity: Is there variation in the generated images (is there mode collapse)?

Inception-v3 [10] was used as the base model for measuring the Inception Score. The result is as follows.

Inception Score 2.61
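The Inception Score [9] is the exponential of the average KL divergence between the class distribution predicted for each generated image, p(y|x), and the marginal class distribution p(y). The sketch below computes it in plain Python from a list of already-computed class probabilities; how the probabilities are obtained from Inception-v3, and how many images were used for the score above, are outside this sketch.

```python
import math


def inception_score(probs):
    """Compute exp(mean_x KL(p(y|x) || p(y))) from per-image class probabilities.

    probs is a list of probability vectors, one per generated image.
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[k] for p in probs) / n for k in range(num_classes)]
    kl_total = 0.0
    for p in probs:
        # Terms with p[k] == 0 contribute nothing to the KL divergence.
        kl_total += sum(
            pk * (math.log(pk) - math.log(marginal[k]))
            for k, pk in enumerate(p)
            if pk > 0
        )
    return math.exp(kl_total / n)


if __name__ == "__main__":
    confident_and_diverse = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    collapsed = [[1, 0, 0, 0]] * 4
    print(inception_score(confident_and_diverse))
    print(inception_score(collapsed))
```

The score ranges from 1, when every image yields the same prediction, up to the number of classes, when predictions are confident and evenly spread across classes.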

Reference

CNTK 206: Part B - Deep Convolutional GAN with MNIST data
How to Implement the Inception Score (IS) for Evaluating GANs

GAN : DCGAN Part1 - Scraping Web images

  1. Alec Radford, Luke Metz, and Soumith Chintala. "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", arXiv preprint arXiv:1511.06434 (2015).
  2. Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", arXiv preprint arXiv:1502.03167 (2015).
  3. Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. "Rectifier Nonlinearities Improve Neural Network Acoustic Models", Proc. ICML. Vol. 30. No. 1. 2013.
  4. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative Adversarial Nets", Advances in neural information processing systems. 2014, pp 2672-2680.
  5. Diederik P. Kingma and Jimmy Lei Ba. "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980 (2014).
  6. Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. "Fast and accurate deep network learning by exponential linear units (ELUs)." arXiv preprint arXiv:1511.07289 (2015).
  7. Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp 1874-1883.
  8. Leslie N. Smith. "Cyclical Learning Rates for Training Neural Networks", 2017 IEEE Winter Conference on Applications of Computer Vision. 2017, pp 464-472.
  9. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. "Improved Techniques for Training GANs", Neural Information Processing Systems. 2016, pp 2234-2242.
  10. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. "Rethinking the Inception Architecture for Computer Vision", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp 2818-2826.
