The world of deep learning, especially Generative Adversarial Networks (GANs), has grown dramatically in recent years, with research progressing in fields such as text-to-image generation, voice conversion, and sound source separation.
In this post, I will give an informal overview of wav2pix, a model that generates face images from speech.
Paper: WAV2PIX: SPEECH-CONDITIONED FACE GENERATION USING GENERATIVE ADVERSARIAL NETWORKS
https://imatge-upc.github.io/wav2pix/
The proposed model consists of the following three modules: a speech encoder, a generator network, and a discriminator network.
I will briefly explain each module.
First, the speech encoder appears to use the decoder of the Speech Enhancement Generative Adversarial Network (SEGAN). SEGAN is an end-to-end speech enhancement model based on GANs. I will omit the detailed explanation here, but please refer to the demo linked here.
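As a rough mental model of such a speech encoder, here is a minimal sketch that stacks strided 1D convolutions over the raw waveform and pools the result into a fixed-size embedding. The layer counts, channel sizes, and kernel settings below are my own assumptions for illustration, not the actual wav2pix configuration.

python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Sketch of a SEGAN-style strided 1D convolutional speech encoder."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        layers, in_ch = [], 1
        # Each strided convolution downsamples the waveform by 4x.
        for out_ch in [16, 32, 64, 128, 256]:
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=31, stride=4, padding=15),
                       nn.LeakyReLU(0.2)]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)
        self.proj = nn.Linear(256, embedding_dim)

    def forward(self, wav):
        # wav: (batch, 1, num_samples) raw audio
        h = self.conv(wav)     # (batch, 256, T')
        h = h.mean(dim=2)      # average pool over time
        return self.proj(h)    # (batch, embedding_dim) conditioning vector

# Example: encode one second of 16 kHz audio.
print(SpeechEncoder()(torch.randn(1, 1, 16000)).shape)  # torch.Size([1, 128])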
Next, the generator and discriminator networks appear to be inspired by Least Squares Generative Adversarial Networks (LSGAN). I deepened my understanding of LSGAN at this site.
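In short, LSGAN replaces the usual sigmoid cross-entropy adversarial loss with a least-squares loss, which gives smoother gradients for samples far from the decision boundary. Below is a minimal sketch of the two loss terms, with real labeled 1 and fake labeled 0; the function and variable names are my own, not from the wav2pix code.

python
import torch

def lsgan_d_loss(d_real, d_fake):
    # Discriminator: push scores for real images toward 1, fake images toward 0.
    return 0.5 * ((d_real - 1) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    # Generator: push the discriminator's scores on fake images toward 1.
    return 0.5 * ((d_fake - 1) ** 2).mean()

# Example with dummy discriminator scores.
d_real, d_fake = torch.rand(8), torch.rand(8)
print(lsgan_d_loss(d_real, d_fake).item(), lsgan_g_loss(d_fake).item())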
Quick Start
From here, I will walk through running the wav2pix sample.
The environment I tried this time is as follows.
OS: Ubuntu 18.04 LTS
CPU: i3-4130 3.40GHz
Memory: 16GB
GPU: GeForce GTX 1660 Ti (6GB)
Docker: version 19.03.8
The imatge-upc/wav2pix repository describes how to run it, but I made my own Dockerfile for those who find it tedious to set up the environment, so this post mainly covers the Docker-based workflow.
First, let's get the Dockerfile I made. ★ Please note the following points!
--The image to be created is about 5.5GB in size
--Building the image takes a fair amount of time
host
$ git clone https://github.com/Nahuel-Mk2/docker-wav2pix.git
$ cd docker-wav2pix/
$ docker build . -t docker-wav2pix
When the build finishes, confirm that the image exists.
host
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker-wav2pix latest 8265bc421f7a 4 hours ago 5.36GB
Let's start Docker.
host
$ docker run -it --rm --gpus all --ipc=host docker-wav2pix
2.2. Train
Before running training, one extra step is needed: overwrite the required paths in the config file and save it. ★ If you skip this, both train and test will throw an error at runtime, so be careful!
container
$ echo -e "# training pickle files path:\n"\
"train_faces_path: /home/user/wav2pix/pickle/faces/train_pickle.pkl\n"\
"train_audios_path: /home/user/wav2pix/pickle/audios/train_pickle.pkl\n"\
"# inference pickle files path:\n"\
"inference_faces_path: /home/user/wav2pix/pickle/faces/test_pickle.pkl\n"\
"inference_audios_path: /home/user/wav2pix/pickle/audios/test_pickle.pkl" > /home/user/wav2pix/config.yaml
Once the above is done, let's run training.
container
$ cd wav2pix
$ python runtime.py
★ Training took about 3 hours in my environment. Either wait patiently, or specify the number of epochs at runtime as shown below to finish earlier. ★ You can safely ignore the Visdom error.
container
$ python runtime.py --epochs 100
--epochs: number of training epochs (default: 200)
2.3. Test
When training is finished, run the test. Since the trained model must be loaded, additional arguments are required compared to running train.
container
$ python runtime.py --pre_trained_disc /home/user/wav2pix/checkpoints/disc_200.pth --pre_trained_gen /home/user/wav2pix/checkpoints/gen_200.pth --inference
--pre_trained_disc: path to the trained Discriminator
--pre_trained_gen: path to the trained Generator
--inference: flag to run inference
When you're done, check the generated image.
host
$ docker cp 89c8d43b0765:/home/user/wav2pix/results/ .
★ If you don't know the CONTAINER ID, run "docker ps" to check it.
host
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
89c8d43b0765 docker-wav2pix "/bin/bash" 4 hours ago Up 4 hours vigilant_murdock
[Figure: two examples; left = generated image, right = real image]
For the two people in this sample dataset, the generated images are recognizably faces, and each person's individuality comes through to some extent. However, as also pointed out in the paper, the images are rough.
From here, I would like to generate anime face images with wav2pix. However, there is no dataset that pairs audio with anime face images, so I had to create my own. I therefore built a Virtual YouTuber (VTuber) dataset, following the approach used for the YouTuber dataset in the paper.
The figure below shows the dataset creation pipeline described in the paper: each YouTuber video is processed separately into video frames and speech, and the results are finally paired up. My only major change is the face-detection cascade file; the cascade file I used is here.
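To give a rough idea of the face-cropping side of this pipeline, here is a hedged sketch using OpenCV with an anime-face cascade. The cascade file name, video file name, sampling interval, and output size are assumptions for illustration; the actual dataset scripts in the repository may differ.

python
import os
import cv2

CASCADE_PATH = "lbpcascade_animeface.xml"  # anime-face cascade (assumed file name)
VIDEO_PATH = "vtuber_clip.mp4"             # hypothetical input video

os.makedirs("faces", exist_ok=True)
cascade = cv2.CascadeClassifier(CASCADE_PATH)
cap = cv2.VideoCapture(VIDEO_PATH)

frame_idx, saved = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Sample roughly one frame per second (assuming ~30 fps) to avoid near-duplicate crops.
    if frame_idx % 30 == 0:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(64, 64))
        for (x, y, w, h) in faces:
            crop = cv2.resize(frame[y:y + h, x:x + w], (128, 128))  # output size is an assumption
            cv2.imwrite(f"faces/face_{saved:05d}.png", crop)
            saved += 1
    frame_idx += 1
cap.release()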
The VTubers targeted this time and the videos used to create the data are listed below. For the audio, only sections without BGM or sound effects were used; a sketch of this trimming step follows the list. (Honorifics omitted.)
--Kizuna AI
[Broadcast accidents will be shown as they are!] Thank-you-for-1-million-subscribers commemorative LIVE stream!!
[Live stream] We all talked about anime!
--Nekomasu
To become a virtual YouTuber [Live008]
--Suisei Hoshimachi
[Official] "Suisei Hoshimachi's MUSIC SPACE" #01 first half (broadcast April 5, 2020)
[Official] "Suisei Hoshimachi's MUSIC SPACE" #01 second half (broadcast April 5, 2020)
[Official] "Suisei Hoshimachi's MUSIC SPACE" #04 first half (broadcast April 26, 2020)
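As mentioned above, only sections without BGM or sound effects were kept on the audio side. Below is a rough sketch of how such a segment could be cut out and resampled with pydub; the file names, time stamps, and 16 kHz mono format are assumptions for illustration, not the exact settings used here.

python
import os
from pydub import AudioSegment

# Hypothetical input clip and segment boundaries (in milliseconds).
audio = AudioSegment.from_file("vtuber_clip.mp4")
segment = audio[60_000:120_000]  # keep 1:00-2:00, a section without BGM or SE

# Convert to mono 16 kHz WAV, a common format for speech models.
segment = segment.set_channels(1).set_frame_rate(16000)
os.makedirs("audios", exist_ok=True)
segment.export("audios/clip_0001.wav", format="wav")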
First, here are the anime face images generated at each epoch.
epoch10
epoch50
epoch100
epoch200
Looking at each epoch, the images generated at epoch 10 all look similar, but as the number of epochs increases, each character's individuality becomes clearly reflected in the generated images. Next, let's compare the following two cases to see how close the generated images are to the real thing.
◆ Comparison between the generated image (epoch200) and the real thing
[part1: generated image (left) vs. real image (right)]
[part2: generated image (left) vs. real image (right)]
From part1, I confirmed that images which solidly capture each VTuber's individuality can be generated from their voice. However, as part2 shows, it should be noted that the model can also generate images that differ from the real person.
In this post, I explained wav2pix, which generates face images from speech, and ran its sample. I also tried generating anime face images by swapping in my own dataset. The anime face results came out better than I expected, so increasing the resolution might be a good next step. It also makes me wonder whether, given a more varied set of face images, it might eventually be possible to generate illustrations from audio.
References
--SPEECH-CONDITIONED FACE GENERATION USING GENERATIVE ADVERSARIAL NETWORKS
--I did machine learning to switch the voices of Kizuna AI and Nekomasu
--Unable to write to file </torch_18692_1954506624>
--Docker run reference