The world of deep learning, especially Generative Adversarial Networks (GANs), has grown dramatically in recent years, with research progressing in fields such as text-to-image generation, voice conversion, and sound source separation.
In this post, I will give an informal overview of wav2pix, a model that generates face images from speech.
Paper: WAV2PIX: SPEECH-CONDITIONED FACE GENERATION USING GENERATIVE ADVERSARIAL NETWORKS
https://imatge-upc.github.io/wav2pix/
The proposed model consists of the following three modules: a speech encoder, a generator network, and a discriminator network.
I will briefly explain each module.
First, the speech encoder appears to use the decoder of the Speech Enhancement Generative Adversarial Network (SEGAN). SEGAN is an end-to-end speech enhancement model based on GANs. I will omit the detailed explanation here, but please refer to the demo linked here.
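As a rough mental model of such a speech encoder, here is a minimal sketch that stacks strided 1D convolutions over the raw waveform and pools the result into a fixed-size embedding. The layer counts, channel sizes, and kernel settings below are my own assumptions for illustration, not the actual wav2pix configuration.

python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Sketch of a SEGAN-style strided 1D convolutional speech encoder."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        layers, in_ch = [], 1
        # Each strided convolution downsamples the waveform by 4x.
        for out_ch in [16, 32, 64, 128, 256]:
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=31, stride=4, padding=15),
                       nn.LeakyReLU(0.2)]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)
        self.proj = nn.Linear(256, embedding_dim)

    def forward(self, wav):
        # wav: (batch, 1, num_samples) raw audio
        h = self.conv(wav)     # (batch, 256, T')
        h = h.mean(dim=2)      # average pool over time
        return self.proj(h)    # (batch, embedding_dim) conditioning vector

# Example: encode one second of 16 kHz audio.
print(SpeechEncoder()(torch.randn(1, 1, 16000)).shape)  # torch.Size([1, 128])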
Next, the generator and discriminator networks appear to be inspired by Least Squares Generative Adversarial Networks (LSGAN). I deepened my understanding of LSGAN at this site.
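In short, LSGAN replaces the usual sigmoid cross-entropy adversarial loss with a least-squares loss, which gives smoother gradients for samples far from the decision boundary. Below is a minimal sketch of the two loss terms, with real labeled 1 and fake labeled 0; the function and variable names are my own, not from the wav2pix code.

python
import torch

def lsgan_d_loss(d_real, d_fake):
    # Discriminator: push scores for real images toward 1, fake images toward 0.
    return 0.5 * ((d_real - 1) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    # Generator: push the discriminator's scores on fake images toward 1.
    return 0.5 * ((d_fake - 1) ** 2).mean()

# Example with dummy discriminator scores.
d_real, d_fake = torch.rand(8), torch.rand(8)
print(lsgan_d_loss(d_real, d_fake).item(), lsgan_g_loss(d_fake).item())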
Quick Start
From here, I will walk through running the wav2pix sample.
The environment I tried this time is as follows.
OS: Ubuntu 18.04 LTS
CPU: i3-4130 3.40GHz
Memory: 16GB
GPU: GeForce GTX 1660 Ti (6GB)
Docker: version 19.03.8
The imatge-upc/wav2pix repository describes how to run it, but I made my own Dockerfile for those who find it tedious to set up the environment, so this post mainly covers the Docker-based workflow.
First, let's get the Dockerfile I made. ★ Please note the following points!
--The image to be created is about 5.5GB in size
--Building the image takes a fair amount of time
host
$ git clone https://github.com/Nahuel-Mk2/docker-wav2pix.git
$ cd docker-wav2pix/
$ docker build . -t docker-wav2pix
When the build finishes, confirm that the image exists.
host
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker-wav2pix latest 8265bc421f7a 4 hours ago 5.36GB
Let's start Docker.
host
$ docker run -it --rm --gpus all --ipc=host docker-wav2pix
2.2. Train
Before running training, one extra step is needed: overwrite the required paths in the config file and save it. ★ If you skip this, both train and test will throw an error at runtime, so be careful!
container
$ echo -e "# training pickle files path:\n"\
"train_faces_path: /home/user/wav2pix/pickle/faces/train_pickle.pkl\n"\
"train_audios_path: /home/user/wav2pix/pickle/audios/train_pickle.pkl\n"\
"# inference pickle files path:\n"\
"inference_faces_path: /home/user/wav2pix/pickle/faces/test_pickle.pkl\n"\
"inference_audios_path: /home/user/wav2pix/pickle/audios/test_pickle.pkl" > /home/user/wav2pix/config.yaml
Once the above is done, let's run training.
container
$ cd wav2pix
$ python runtime.py
★ Training took about 3 hours in my environment. Either wait patiently, or specify the number of epochs at runtime as shown below to finish earlier. ★ You can safely ignore the Visdom error.
container
$ python runtime.py --epochs 100
--epochs: number of training epochs (default: 200)
2.3. Test
When training is finished, run the test. Since the trained model must be loaded, additional arguments are required compared to running train.
container
$ python runtime.py --pre_trained_disc /home/user/wav2pix/checkpoints/disc_200.pth --pre_trained_gen /home/user/wav2pix/checkpoints/gen_200.pth --inference
--pre_trained_disc: path to the trained Discriminator
--pre_trained_gen: path to the trained Generator
--inference: flag to run inference
When you're done, check the generated image.
host
$ docker cp 89c8d43b0765:/home/user/wav2pix/results/ .
★ If you don't know the CONTAINER ID, run "docker ps" to check it.
host
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
89c8d43b0765 docker-wav2pix "/bin/bash" 4 hours ago Up 4 hours vigilant_murdock
[Figure: two examples; left = generated image, right = real image]
For the two people in this sample dataset, the generated images are recognizably faces, and each person's individuality comes through to some extent. However, as also pointed out in the paper, the images are rough.
From here, I would like to generate anime face images with wav2pix. However, there is no dataset that pairs audio with anime face images, so I had to create my own. I therefore built a Virtual YouTuber (VTuber) dataset, following the approach used for the YouTuber dataset in the paper.
The figure below shows the dataset creation pipeline described in the paper: each YouTuber video is processed separately into video frames and speech, and the results are finally paired up. My only major change is the face-detection cascade file; the cascade file I used is here.
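To give a rough idea of the face-cropping side of this pipeline, here is a hedged sketch using OpenCV with an anime-face cascade. The cascade file name, video file name, sampling interval, and output size are assumptions for illustration; the actual dataset scripts in the repository may differ.

python
import os
import cv2

CASCADE_PATH = "lbpcascade_animeface.xml"  # anime-face cascade (assumed file name)
VIDEO_PATH = "vtuber_clip.mp4"             # hypothetical input video

os.makedirs("faces", exist_ok=True)
cascade = cv2.CascadeClassifier(CASCADE_PATH)
cap = cv2.VideoCapture(VIDEO_PATH)

frame_idx, saved = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Sample roughly one frame per second (assuming ~30 fps) to avoid near-duplicate crops.
    if frame_idx % 30 == 0:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(64, 64))
        for (x, y, w, h) in faces:
            crop = cv2.resize(frame[y:y + h, x:x + w], (128, 128))  # output size is an assumption
            cv2.imwrite(f"faces/face_{saved:05d}.png", crop)
            saved += 1
    frame_idx += 1
cap.release()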
The VTubers targeted this time and the videos used to create the data are listed below. For the audio, only sections without BGM or sound effects were used; a sketch of this trimming step follows the list. (Honorifics omitted.)
--Kizuna AI
[Broadcast accidents will be shown as they are!] Thank-you-for-1-million-subscribers commemorative LIVE stream!!
[Live stream] We all talked about anime!
--Nekomasu
To become a virtual YouTuber [Live008]
--Suisei Hoshimachi
[Official] "Suisei Hoshimachi's MUSIC SPACE" #01 first half (broadcast April 5, 2020)
[Official] "Suisei Hoshimachi's MUSIC SPACE" #01 second half (broadcast April 5, 2020)
[Official] "Suisei Hoshimachi's MUSIC SPACE" #04 first half (broadcast April 26, 2020)
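As mentioned above, only sections without BGM or sound effects were kept on the audio side. Below is a rough sketch of how such a segment could be cut out and resampled with pydub; the file names, time stamps, and 16 kHz mono format are assumptions for illustration, not the exact settings used here.

python
import os
from pydub import AudioSegment

# Hypothetical input clip and segment boundaries (in milliseconds).
audio = AudioSegment.from_file("vtuber_clip.mp4")
segment = audio[60_000:120_000]  # keep 1:00-2:00, a section without BGM or SE

# Convert to mono 16 kHz WAV, a common format for speech models.
segment = segment.set_channels(1).set_frame_rate(16000)
os.makedirs("audios", exist_ok=True)
segment.export("audios/clip_0001.wav", format="wav")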
First, here are the anime face images generated at each epoch.
epoch10
epoch50
epoch100
epoch200
Looking at each epoch, the images generated at epoch 10 all look similar, but as the number of epochs increases, each character's individuality becomes clearly reflected in the generated images. Next, let's compare the following two cases to see how close the generated images are to the real thing.
◆ Comparison between the generated image (epoch200) and the real thing
[part1: generated image (left) vs. real image (right)]
[part2: generated image (left) vs. real image (right)]
From part1, I confirmed that images which solidly capture each VTuber's individuality can be generated from their voice. However, as part2 shows, it should be noted that the model can also generate images that differ from the real person.
In this post, I explained wav2pix, which generates face images from speech, and ran its sample. I also tried generating anime face images by swapping in my own dataset. The anime face results came out better than I expected, so increasing the resolution might be a good next step. It also makes me wonder whether, given a more varied set of face images, it might eventually be possible to generate illustrations from audio.
References
--SPEECH-CONDITIONED FACE GENERATION USING GENERATIVE ADVERSARIAL NETWORKS
--I did machine learning to switch the voices of Kizuna AI and Nekomasu
--Unable to write to file </torch_18692_1954506624>
--Docker run reference