We recently released a Python module, kawaii-gen, that generates cute idol face images. In this article, I will explain the StyleGAN model used in this module and the data pipeline implemented to build a dataset for StyleGAN.
StyleGAN is a GAN that can generate photorealistic images at up to 1024x1024 resolution, and its latent space has high linear separability and good interpolation properties. This time, I used StyleGAN at 256x256 and upscaled to 512x512 with super-resolution.
As for the data pipeline: StyleGAN needs a large number of images (the FFHQ dataset, for reference, contains 70,000), so I used the idea of active learning to reduce the amount of manual annotation. With a little over 2,000 annotated images, the pipeline can automatically extract only female faces.
StyleGAN is a GAN with the following features.
Interpolation quality is the property that similar objects lie close to each other in the latent space. For face images, faces line up by degree of smiling or by attributes such as glasses: a face wearing glasses sits close in the space to other faces with glasses.
Better disentanglement means being able to build a space whose axes are nearly independent. The relationship between such a space and one that is nearly linearly separable is not obvious, but intuitively it is easier to build classification logic from independent features, so a space with independent axes (i.e., independent features) seems to have better linear separability.
In StyleGAN, Z itself is not a space that is easily linearly separable (Fig. 2); instead, the mapping network in the architecture (Fig. 1) converts Z into W, a space that is easily linearly separable. A minimal sketch of this mapping-network idea follows the figures.
Fig. 1: The StyleGAN architecture, including the mapping network from Z to W.
Fig. 2: The input latent space Z, which is not easily linearly separable.
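To make the idea concrete, here is a minimal sketch of a StyleGAN-style mapping network in PyTorch (the official implementation is TensorFlow; the 512-d, 8-layer, leaky-ReLU configuration matches the paper's defaults, but treat the details as illustrative):

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Sketch of StyleGAN's Z -> W mapping: an 8-layer MLP on 512-d latents."""
    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Pixel norm: scale each latent vector to unit RMS before the MLP.
        z = z * torch.rsqrt(torch.mean(z ** 2, dim=1, keepdim=True) + 1e-8)
        return self.net(z)

w = MappingNetwork()(torch.randn(4, 512))  # 4 latents z -> 4 style vectors w
```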
I used the code from StyleGAN's official repository almost as-is. It includes scripts that convert various datasets to TFRecord format and a training script whose options are easy to change. All the options are written as Python dictionaries, which turned out to be easier to modify than adding arguments with argparse and the like.
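For illustration, the option section of train.py looks roughly like this (paraphrased; the dataset name is a placeholder for our own TFRecord directory):

```python
from dnnlib import EasyDict  # dict subclass with attribute access, from the repo

desc    = 'sgan'                               # description string for the run dir
dataset = EasyDict(tfrecord_dir='idol-faces')  # placeholder dataset name
train   = EasyDict(mirror_augment=True,        # horizontal-flip augmentation
                   total_kimg=25000)           # training length in thousands of images
```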
See the official README for details, but in short: use the dataset_tool.py create_from_images command to create a TFRecord-format dataset that the training script train.py accepts, then adjust the arguments in train.py and run it.
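The two steps look roughly like this (directory names are placeholders for our own data):

```
# 1. Convert a folder of aligned face images into the TFRecord format train.py reads.
python dataset_tool.py create_from_images datasets/idol-faces ~/data/aligned-faces

# 2. Edit the option dictionaries at the top of train.py (dataset, resolution,
#    number of GPUs), then launch training.
python train.py
```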
The data flow and processing components are as shown in the figure above. At the start of the pipeline, a general-purpose crawler and site-specific crawlers collect face images. After that, faces are detected in the images, male faces are removed, images with various occlusions are removed, and alignment is performed to line up the positions of the eyes and nose; images that cannot be aligned well are discarded. The aligned images are saved as the training dataset.
This time I used dlib's CNN-based detector. This face detection algorithm slides a window over the image and uses a CNN to decide whether the window contains a face. dlib also has a detector that uses HOG features with an SVM for discrimination; the HOG detector can only handle frontal faces, but that may make it the most suitable choice when starting out with face image generation.
Another option is a detector such as SSD or RetinaNet, which could be sped up by running end-to-end in batches on a GPU. That said, dlib's CNN detector is well optimized and fairly fast.
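For reference, the dlib calls look like this (the model weights are dlib's standard pretrained files; the image path is a placeholder):

```python
import dlib

# CNN detector: requires the pretrained weights from dlib's model zoo.
cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
# HOG+SVM detector: faster, CPU-only, limited to near-frontal faces.
hog_detector = dlib.get_frontal_face_detector()

img = dlib.load_rgb_image("sample.jpg")  # placeholder path
for det in cnn_detector(img, 1):         # 1 = upsample once to catch smaller faces
    r = det.rect                         # bounding box of the detected face
    print(r.left(), r.top(), r.right(), r.bottom(), det.confidence)
```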
The training data is normalized to reduce StyleGAN's training load: aligning the position and orientation of each face, using the nose as a reference, reduces the variation the model has to learn.
I used the StyleGAN alignment code almost as-is. It recomputes a crop rectangle and corrects the orientation based on the landmark positions of the eyes and nose. The empty areas created by rotation and translation are padded with white, and images where the padded area is too large are removed, since in those cases too much of the hair or face tends to be cut off.
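A simplified sketch of the alignment idea, assuming eye landmarks are already available (the real FFHQ alignment also uses nose and mouth landmarks and a more careful crop rectangle):

```python
import numpy as np
from PIL import Image

def align_face(img, left_eye, right_eye, out_size=256):
    """Rotate so the eye line is horizontal, then crop around the eyes' midpoint."""
    left_eye = np.asarray(left_eye, dtype=float)
    right_eye = np.asarray(right_eye, dtype=float)
    dx, dy = right_eye - left_eye
    angle = np.degrees(np.arctan2(dy, dx))  # tilt of the eye line in degrees
    cx, cy = (left_eye + right_eye) / 2
    # PIL rotates counter-clockwise; exposed corners are padded with white.
    rotated = img.rotate(angle, center=(cx, cy), fillcolor=(255, 255, 255))
    half = out_size // 2
    return rotated.crop((int(cx) - half, int(cy) - half, int(cx) + half, int(cy) + half))
```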
For landmark detection, I used a landmark detection module that can handle occlusion.
Originally I wanted to build an idol-face classifier, so I annotated about 2,000 face images. I extracted face embeddings with dlib, clustered them in the embedding space, labeled similar clusters in bulk, and then prioritized labeling the samples with low confidence scores.
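A sketch of that loop, assuming dlib's standard pretrained models and scikit-learn for clustering and classification (file names, the cluster count, and the query size are placeholders):

```python
import dlib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Standard pretrained models from dlib's model zoo.
detector = dlib.get_frontal_face_detector()
shape_predictor = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
face_rec = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def embed(path):
    """Return the 128-d dlib descriptor of the first detected face, or None."""
    img = dlib.load_rgb_image(path)
    dets = detector(img, 1)
    if not dets:
        return None
    return np.array(face_rec.compute_face_descriptor(img, shape_predictor(img, dets[0])))

paths = ["face_0001.jpg", "face_0002.jpg"]  # placeholder file list
X = np.array([e for e in map(embed, paths) if e is not None])

# 1. Cluster the embeddings so visually similar faces can be labeled in one pass.
clusters = KMeans(n_clusters=20, n_init=10).fit_predict(X)  # 20 is arbitrary

# 2. After bulk-labeling the clusters, fit a classifier and send the least
#    confident samples back for manual annotation (simple active learning).
y = clusters  # stand-in: replace with the human-corrected labels
clf = LogisticRegression(max_iter=1000).fit(X, y)
uncertainty = 1.0 - clf.predict_proba(X).max(axis=1)
to_annotate = np.argsort(uncertainty)[::-1][:50]  # the 50 hardest samples
```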
Some pictures had microphones, hands, text, food, and so on covering the face, so I wrote a function to detect those. I also wrote a function to filter out faces that are not facing the front. However, since processing hundreds of thousands of images takes a long time and there was no prospect of adding GPUs, these functions are currently turned off. Still, some images have faint, mysterious objects over the face, and I think occlusion handling is essential to exclude them.
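For reference, a crude frontal-face check can be built from landmark symmetry. This is a hypothetical heuristic, not the filter actually used here:

```python
import numpy as np

def is_roughly_frontal(landmarks, tol=0.25):
    """Hypothetical heuristic: compare the nose tip's horizontal distance to each
    eye; large asymmetry suggests the head is turned away from the camera."""
    left_eye, right_eye, nose = (np.asarray(landmarks[k], dtype=float)
                                 for k in ("left_eye", "right_eye", "nose"))
    d_left = abs(nose[0] - left_eye[0])
    d_right = abs(right_eye[0] - nose[0])
    asymmetry = abs(d_left - d_right) / max(d_left + d_right, 1e-6)
    return asymmetry < tol

# Near-symmetric landmarks pass; a strongly turned face would not.
print(is_roughly_frontal({"left_eye": (80, 100), "right_eye": (160, 100), "nose": (121, 140)}))
```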