This article is a continuation of [CGAN (conditional GAN) Generates MNIST (KMNIST)](https://qiita.com/kyamada101/items/5195b1b32c60168b1b2f). It is a record of my attempt to build ACGAN on top of that cGAN.
ACGAN seems like a natural evolution of cGAN, but when I actually implemented it, it turned out to be quite ...
Since I briefly introduced cGAN in the previous article, here is a brief explanation of ACGAN. In a nutshell, ACGAN is **"a cGAN whose Discriminator also performs a classification task"**. The method is meant to make it possible to output images with more variation.
The original paper is "Conditional Image Synthesis With Auxiliary Classifier GANs":
A. Odena, C. Olah, J. Shlens. Conditional Image Synthesis With Auxiliary Classifier GANs. arXiv:1610.09585, 2016 (published at ICML 2017).
Some people have published explanations of the ACGAN paper, and those are helpful.
Reference article:
Explanation of papers on AC-GAN (Conditional Image Synthesis with Auxiliary Classifier GANs)
In cGAN, the Discriminator received the real or fake image together with the label information and output a real/fake decision. In ACGAN, by contrast, the Discriminator takes only the image as input, and in addition to the real/fake decision it also outputs a class prediction that guesses which class the image belongs to. As a diagram, it looks like this.
The class part in the figure is the classification predicted by the Discriminator. Like label, it is a vector whose dimension equals the number of classes.
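For reference, the ACGAN paper expresses these two Discriminator outputs as two log-likelihood terms: a source term $L_S$ for real versus fake, and a class term $L_C$ for the correct class $c$. This is the paper's notation as I understand it, not something taken from the code in this article.

```latex
\begin{aligned}
L_S &= \mathbb{E}\left[\log P(S = \text{real} \mid X_{\text{real}})\right]
     + \mathbb{E}\left[\log P(S = \text{fake} \mid X_{\text{fake}})\right] \\
L_C &= \mathbb{E}\left[\log P(C = c \mid X_{\text{real}})\right]
     + \mathbb{E}\left[\log P(C = c \mid X_{\text{fake}})\right]
\end{aligned}
```

The Discriminator is trained to maximize $L_S + L_C$, and the Generator to maximize $L_C - L_S$. In code, these are usually implemented as losses to minimize, so the signs are flipped.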
There is a PyTorch implementation of ACGAN on GitHub. Using it as a reference, let's modify the cGAN implementation I wrote in the previous article.
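The changes described above (the Discriminator takes only the image as input and gains a class output, together with a classification loss for it) are almost everything that needs to be done. The structure of the Discriminator then looks like this.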
This is the structure diagram of the cGAN Discriminator from the previous article, with the ACGAN changes shown in red.
Here is the implementation of the Discriminator.

```python
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    def __init__(self, num_class):
        super(Discriminator, self).__init__()
        self.num_class = num_class

        self.conv = nn.Sequential(
            # Input has 1 channel (grayscale); 64 filters with a 4x4 kernel
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.BatchNorm2d(128),
        )

        # Shared fully connected trunk
        self.fc = nn.Sequential(
            nn.Linear(128 * 7 * 7, 1024),
            nn.BatchNorm1d(1024),
            nn.LeakyReLU(0.2),
        )

        # Head 1: real/fake probability
        self.fc_TF = nn.Sequential(
            nn.Linear(1024, 1),
            nn.Sigmoid(),
        )

        # Head 2: class log-probabilities (to be used with NLLLoss)
        self.fc_class = nn.Sequential(
            nn.Linear(1024, num_class),
            nn.LogSoftmax(dim=1),
        )

        self.init_weights()

    def init_weights(self):
        for module in self.modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                module.weight.data.normal_(0, 0.02)
                module.bias.data.zero_()
            elif isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
                module.weight.data.normal_(1.0, 0.02)
                module.bias.data.zero_()

    def forward(self, img):
        x = self.conv(img)
        x = x.view(-1, 128 * 7 * 7)  # flatten the feature maps
        x = self.fc(x)
        x_TF = self.fc_TF(x)
        x_class = self.fc_class(x)
        return x_TF, x_class
```
There seem to be various ways to add the classification output. The PyTorch implementation linked earlier splits into two Linear layers at the end, so I implemented it the same way here.
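As a sanity check on the `128 * 7 * 7` passed to `nn.Linear`, here is a small sketch of the standard convolution output-size formula. It assumes 28x28 input images (the MNIST/KMNIST size), and the helper name `conv_out` is made up for illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a square convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


side = 28  # assumption: 28x28 grayscale input image
for _ in range(2):  # the two Conv2d layers in self.conv
    side = conv_out(side, kernel=4, stride=2, padding=1)

# 128 output channels from the second Conv2d, each side x side
flat_features = 128 * side * side
print(side, flat_features)
```

Each stride-2 convolution halves the side length (28, then 14, then 7), which is where the `7 * 7` comes from. With a different input size, both the `nn.Linear` input dimension and the `view` call would need to change.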
With this change, the per-epoch training function looks like this.

```python
import torch


def train_func(D_model, G_model, batch_size, z_dim, num_class, TF_criterion, class_criterion,
               D_optimizer, G_optimizer, data_loader, device):
    # Training mode
    D_model.train()
    G_model.train()
    # The real label is 1
    y_real = torch.ones((batch_size, 1)).to(device)
    D_y_real = (torch.rand((batch_size, 1))/2 + 0.7).to(device)  # noisy real label for D, in [0.7, 1.2)
    # The fake label is 0
    y_fake = torch.zeros((batch_size, 1)).to(device)
    D_y_fake = (torch.rand((batch_size, 1)) * 0.3).to(device)  # noisy fake label for D, in [0, 0.3)

    # Initialize the running losses
    D_running_TF_loss = 0
    G_running_TF_loss = 0
    D_running_class_loss = 0
    D_running_real_class_loss = 0
    D_running_fake_class_loss = 0
    G_running_class_loss = 0

    # Loop over mini-batches
    for batch_idx, (data, labels) in enumerate(data_loader):
        # Stop at a batch smaller than batch_size (the last, incomplete one)
        if data.size()[0] != batch_size:
            break

        # Create noise: normal distribution with mean 0.5 and standard deviation 1
        z = torch.normal(mean = 0.5, std = 1, size = (batch_size, z_dim))

        real_img, label, z = data.to(device), labels.to(device), z.to(device)

        # Discriminator update
        D_optimizer.zero_grad()

        # Forward real images through the Discriminator and compute the losses
        D_real_TF, D_real_class = D_model(real_img)
        D_real_TF_loss = TF_criterion(D_real_TF, D_y_real)
        CEE_label = torch.max(label, 1)[1].to(device)  # one-hot label -> class index
        D_real_class_loss = class_criterion(D_real_class, CEE_label)

        # Forward Generator images through the Discriminator and compute the losses
        fake_img = G_model(z, label)
        # Detach fake_img so that this loss does not backpropagate into the Generator
        D_fake_TF, D_fake_class = D_model(fake_img.detach())
        D_fake_TF_loss = TF_criterion(D_fake_TF, D_y_fake)
        D_fake_class_loss = class_criterion(D_fake_class, CEE_label)
        # Minimize the sum of the real and fake losses
        D_TF_loss = D_real_TF_loss + D_fake_TF_loss
        D_class_loss = D_real_class_loss + D_fake_class_loss

        D_TF_loss.backward(retain_graph=True)
        D_class_loss.backward()
        D_optimizer.step()

        D_running_TF_loss += D_TF_loss.item()
        D_running_class_loss += D_class_loss.item()
        D_running_real_class_loss += D_real_class_loss.item()
        D_running_fake_class_loss += D_fake_class_loss.item()

        # Generator update
        G_optimizer.zero_grad()

        # Forward Generator images through the Discriminator; being detected as fake is the loss
        fake_img_2 = G_model(z, label)
        D_fake_TF_2, D_fake_class_2 = D_model(fake_img_2)

        # Generator loss: negate D's loss against the fake label,
        # so G is rewarded when D fails to flag its images as fake
        G_TF_loss = -TF_criterion(D_fake_TF_2, y_fake)
        # From G's point of view, it is good if D assigns the intended class to the image
        G_class_loss = class_criterion(D_fake_class_2, CEE_label)

        G_TF_loss.backward(retain_graph=True)
        G_class_loss.backward()
        G_optimizer.step()
        G_running_TF_loss += G_TF_loss.item()
        G_running_class_loss -= G_class_loss.item()

    D_running_TF_loss /= len(data_loader)
    D_running_class_loss /= len(data_loader)
    D_running_real_class_loss /= len(data_loader)
    D_running_fake_class_loss /= len(data_loader)
    G_running_TF_loss /= len(data_loader)
    G_running_class_loss /= len(data_loader)

    return D_running_TF_loss, G_running_TF_loss, D_running_class_loss, G_running_class_loss, D_running_real_class_loss, D_running_fake_class_loss
```
In addition to the changes mentioned earlier, I also changed the input noise a little. Last time it was 30-dimensional, drawn from a normal distribution with mean 0.5 and standard deviation 0.2; this time it is 100-dimensional, drawn from a normal distribution with mean 0.5 and standard deviation 1.
The classification loss is `torch.nn.NLLLoss()`. This also follows the implementation linked earlier.
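To make the pairing of `LogSoftmax` in the Discriminator with `NLLLoss` concrete, here is a framework-free sketch for a single sample. The helper names `log_softmax` and `nll` and the toy numbers are my own, for illustration only. It also shows the one-hot to class-index conversion that `torch.max(label, 1)[1]` performs in the training function.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def nll(log_probs, target_index):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target_index]


one_hot = [0.0, 0.0, 1.0, 0.0]  # toy one-hot label with 4 classes
# argmax of the one-hot vector gives the class index
target = max(range(len(one_hot)), key=one_hot.__getitem__)

logits = [0.5, -1.0, 2.0, 0.1]  # toy pre-softmax outputs
loss = nll(log_softmax(logits), target)

# The same value computed directly as cross-entropy: -log(softmax[target])
direct = -math.log(math.exp(logits[target]) / sum(math.exp(v) for v in logits))
print(target, loss, direct)
```

In other words, log-softmax followed by negative log-likelihood is the usual cross-entropy loss, just split into two steps.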
First, the loss graph.
ACGAN has two kinds of loss, the real/fake discrimination loss and the classification loss, and both are propagated to both the Generator and the Discriminator. They are plotted separately in the graph.

T/F_loss is the real/fake discrimination loss (solid line), and class_loss is the classification loss (dotted line).
Judging from this, training looks like it is going well. However...
This is a gif of one generated image per label for each epoch.
I fed in the labels so that the top row reads "A, I, U, ..." from the left and the bottom right ends with "..., N, ゝ", but there is almost no correspondence between the labels and the generated images. Still, rather than completely meaningless images, it seems to be producing characters that belong to other labels.
As with cGAN, I had the Generator produce five images each for "A" through "ゝ" after 100 epochs of training.
Only "ke" seems to match its label.
(Or rather, the mode has collapsed completely ...)
By the way, this is what cGAN generates after 100 epochs of training under the same conditions.
Clearly, cGAN outputs characters that are closer to their labels.
Looking at the output, my first thought was that in ACGAN both the Generator and the Discriminator **believe that a differently shaped character is the character for the label** (for example, both the Discriminator and the Generator treat something shaped like "I" as the "A" label).
This graph splits the Discriminator's classification loss into the part that comes from real images and the part that comes from fake images (images created by the Generator). sum_class_loss is the total (the same as the red dotted line in the previous graph).
According to this graph, the Discriminator misclassifies real images (especially early in training) while classifying fake images correctly.
(Numerically, real_class_loss is about 20 times fake_class_loss at the beginning and about 5 times at the end.)
In other words, I can imagine that **an image created by the Generator with the label "A" is treated as "A" by the Discriminator even if its actual shape is quite different from "A"**.
Ideally, the classification loss should probably be about the same for real and fake images.
As mentioned in the original ACGAN paper, it seems that **with the same network, the quality of the output images deteriorates if there are too many classes**. In the original paper, ImageNet (1000 classes) is split into 100 groups of 10 classes each for the experiments.
So I decided to try this with 5 classes first.
Keeping the network structure the same, let's try to generate the 5 characters from "A" to "O".
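Restricting the data to 5 classes could be sketched roughly like this. This is a framework-free illustration, not the code I actually ran: the helper name `keep_first_classes` and the toy samples are made up, and it assumes integer labels where 0 to 4 correspond to "A" through "O".

```python
def keep_first_classes(samples, num_keep):
    """Keep (image, label) pairs whose integer label is below num_keep,
    and convert each kept label to a one-hot list of length num_keep."""
    subset = []
    for image, label in samples:
        if label < num_keep:
            one_hot = [1.0 if i == label else 0.0 for i in range(num_keep)]
            subset.append((image, one_hot))
    return subset


# Toy stand-ins for (image, integer label) pairs; real data would hold image tensors
toy = [("img_a", 0), ("img_ke", 8), ("img_o", 4), ("img_n", 47)]
subset = keep_first_classes(toy, 5)
print(subset)
```

With the labels re-encoded as 5-dimensional one-hot vectors, `num_class` becomes 5 while the rest of the network can stay as it is.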
The loss graph looks similar. The T/F_loss probably still has room to go down.
There is some unevenness here as well, but the latter half looks quite clean.
Next, let's generate 5 images each after 100 epochs of training.
There appears to be no mode collapse.
Next is the Discriminator's classification loss.
Numerically, there is a difference of about 10 times early on, and the values are almost the same by the end. That is hard to see in this graph, though, so I will show only epoch 3 onward.
Here you can see that real_class_loss and fake_class_loss become fairly close.
I wondered whether there really is a 10 to 20 times difference between the real and fake classification losses as early as the first epoch, so I plotted the loss per iteration (per mini-batch).
At the very beginning real_class_loss and fake_class_loss do have the same value, but you can see that fake_class_loss then drops sharply.
I tried training on real images only for the first few epochs, but it did not make much difference, so I decided to pre-train on the classification task alone.
I take the Discriminator by itself and have it solve only the classification task.
Convergence is pretty fast, so I only ran 20 epochs.
The result is so-so, but for now I will use this Discriminator after 20 epochs of training.
The real/fake loss is almost the same as without pre-training. The classification loss, on the other hand, is quite small from the beginning.
Now, let's look at the classification loss from real images and from fake images.
I trained for up to 300 epochs. Compared with no pre-training, the loss from real images is considerably lower. It is down to about four times the loss from fake images, but the two are still not equal.
Let's take a look at the images generated by ACGAN after these 300 epochs of training.
Hmm...
No effect can be seen. The number of successful characters has not increased, and mode collapse is occurring.
In this dataset, the number of samples per character varies a lot: characters with many samples have 6000, while some have only about 300 to 400. More data per class is presumably better, so I thought it might work for characters that have at least as much data as a CIFAR-10 class, but that was not the case.
My personal guess is that the characters for different labels are close to each other in the latent space. The original paper experiments with CIFAR-10 and with ImageNet in groups of 10 classes, but with kuzushiji (cursive characters), only a little over half of the characters worked with 10 classes, and everything worked only with 5 classes.
In any case, it seems quite difficult to generate all 49 classes on target with ACGAN, so I will give up here ...