Proposal of multi-speaker voice conversion and voice morphing by RelGAN-VM

TL;DR

There is an image conversion model called RelGAN! Apply it to voice conversion! You can do voice morphing!

Introduction

The number of men who want to be beautiful girls is increasing. Conversely, many women want to be beautiful boys. In recent years this tendency has become more prominent, especially with the popularity of virtual YouTubers. For appearance, CG tools such as MMD and Live2D, which can animate illustrations, have emerged, and we are gradually entering an era in which you can transform into what you want to be. Wanting the voice of a cute girl or a cool boy, however, remains one of the big open issues. In this article, I propose voice quality conversion between multiple speakers using a model called RelGAN-VM, together with voice quality morphing that creates an intermediate voice between two speakers.

In reading this article

This article uses a fair amount of technical terminology. I add minimal explanations, but some background knowledge is required. Specifically, it is written on the premise of the following knowledge.

If you search, you will find many articles that explain these topics more clearly than I can, so if you come across a term you do not understand, please look it up (yes, I am passing the buck).

Prior research and related articles

Previous research

CycleGAN-VC converts the voice quality between two speakers, and CycleGAN-VC2 further improves the performance. StarGAN-VC converts voice quality between multiple speakers, and StarGAN-VC2 further improves performance and proposes voice quality morphing.

Related articles (including external links)

The first article uses a CycleGAN-based converter to convert the voice quality of a Virtual YouTuber. Although it is spectrogram-based, careful engineering gives it remarkable conversion performance.

In the second article, the author converts their own voice into the voice quality of "VOICEROID Yuzuki Yukari" using a method based on pix2pix. This also performs well.

Proposed method

This article proposes RelGAN-VM, which is based on RelGAN: Multi-Domain Image-to-Image Translation via Relative Attributes (hereinafter RelGAN) and CycleGAN-VC2. For detailed explanations of RelGAN and CycleGAN-VC2, please refer to the very clear articles that others have written about them (the author of those articles is referred to below as Lento).

Parallel conversion and non-parallel conversion

Parallel conversion

Parallel conversion requires a dataset in which the utterance content, scale (pitch) information, and utterance timing are the same across speakers. In practice these never match exactly, so the recordings must be aligned. Because the source and target data must be identical except for voice quality, building the dataset takes a huge amount of effort, although the amount of data needed is relatively small. "I tried to make a voice related to Yuzuki with the power of deep learning" appears to adopt parallel conversion.

Non-parallel conversion

Non-parallel conversion trains on datasets whose utterance content, scale (pitch) information, and utterance timing differ between speakers. The dataset is relatively easy to build because no alignment is needed and the speaker only has to read text aloud. The methods based on CycleGAN and StarGAN adopt non-parallel conversion, and so do this implementation and the original RelGAN.

Network structure

The Generator and Discriminator of RelGAN-VM are based on CycleGAN-VC2. The Generator takes the input concatenated with the relative attributes. The Discriminator drops the convolution of the final layer and instead branches into three parts: $ D_{real} $, $ D_{interp} $, and $ D_{match} $. See the implementation for details.
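
To make the "input concatenated with relative attributes" idea concrete, here is a minimal NumPy sketch. It assumes that each speaker is a one-hot domain, that the relative attribute is the target one-hot minus the source one-hot (as in RelGAN), and that the vector is tiled along the time axis and appended as extra channels. The function names `relative_attributes` and `concat_attributes` are hypothetical, and the actual implementation may concatenate differently.

```python
import numpy as np

def relative_attributes(src: int, dst: int, n_domains: int, alpha: float = 1.0) -> np.ndarray:
    """Return alpha * (onehot(dst) - onehot(src)); an assumed encoding of v_ab."""
    v = np.zeros(n_domains, dtype=np.float32)
    v[dst] += 1.0
    v[src] -= 1.0
    return alpha * v

def concat_attributes(x: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Tile v over the time axis and append it to x as extra channels.

    x: (batch, channels, frames), v: (n_domains,) or (batch, n_domains)
    """
    batch, _, frames = x.shape
    v = np.broadcast_to(np.atleast_2d(v), (batch, v.shape[-1]))
    v_map = np.repeat(v[:, :, None], frames, axis=2)  # (batch, n_domains, frames)
    return np.concatenate([x, v_map.astype(x.dtype)], axis=1)

# A batch of 8 inputs with 36 MCEP dimensions and 128 frames, 4 speakers.
x = np.zeros((8, 36, 128), dtype=np.float32)
v_ab = relative_attributes(src=0, dst=1, n_domains=4)
g_input = concat_attributes(x, v_ab)
print(g_input.shape)
```

Scaling `v_ab` by an interpolation rate `alpha` between 0 and 1 is what later allows morphing between two speakers.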

Loss function

The loss function of the base RelGAN is as follows. Orthogonal regularization was not used in this implementation.

$ {\min_{D}}L_{D}=-L_{adv}+{\lambda_1}L_{match}^D+{\lambda_2}L_{interp}^D $

$ {\min_{G}}L_{G}=L_{adv}+{\lambda_1}L_{match}^G+{\lambda_2}L_{interp}^G+{\lambda_3}L_{cycle}+{\lambda_4}L_{self} $

When training with this loss function alone, mode collapse occurred at around 30,000 steps, so this implementation adds the following constraints.

Triangle consistency loss

Choose 3 of the N domains and call them A, B, and C. The loss is the difference between the input and the output obtained by converting the domain from A to B, then to C, and back to A. For a conversion from domain A to B, write the input as $ x $, the relative attributes as $ v_{ab} $, and the Generator as $ G(x, v_{ab}) $. The formula is as follows.

$ L_{tri}=||x - G(G(G(x, v_{ab}), v_{bc}), v_{ca})||_1 $

Backward consistency loss

A loss similar to the cycle consistency loss is also taken for outputs converted with an interpolation rate $ \alpha $. The formula is as follows.

$ L_{back}=||x - G(G(x, {\alpha}v_{ab}), -{\alpha}v_{ab})||_1 $
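
The triangle and backward consistency losses can be sketched as below. This is only an illustration under stated assumptions: the L1 norm is computed as a mean absolute error, `toy_G` is a hypothetical stand-in generator (a fixed per-domain offset), and the relative attributes are differences of one-hot vectors. A real Generator is a neural network, for which these losses are generally non-zero.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error, used here as the L1 norm up to a constant factor."""
    return float(np.mean(np.abs(a - b)))

def triangle_loss(G, x, v_ab, v_bc, v_ca):
    """L_tri: convert A -> B -> C -> A and compare with the input."""
    return l1(x, G(G(G(x, v_ab), v_bc), v_ca))

def backward_loss(G, x, v_ab, alpha):
    """L_back: convert by alpha * v_ab, then undo with -alpha * v_ab."""
    return l1(x, G(G(x, alpha * v_ab), -alpha * v_ab))

# Hypothetical toy generator: shifts every feature by a per-domain offset.
offsets = np.array([0.5, -1.0, 2.0])

def toy_G(x, v):
    return x + float(offsets @ v)

eye = np.eye(3)
v_ab, v_bc, v_ca = eye[1] - eye[0], eye[2] - eye[1], eye[0] - eye[2]
x = np.random.default_rng(0).normal(size=(2, 36, 128))

print(triangle_loss(toy_G, x, v_ab, v_bc, v_ca))
print(backward_loss(toy_G, x, v_ab, alpha=0.3))
```

Because the three relative attributes sum to zero, a generator that is perfectly consistent returns to the input, so both printed values are zero up to floating-point rounding for this toy case.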

Mode seeking loss

This method was proposed in Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis. A Japanese commentary is available in "[Latest paper / cGANs] Regularization term that may solve mode collapse", so please refer to it. For $ I_a = G(z_a) $ and $ I_b = G(z_b) $ generated from latent variables $ z_a $ and $ z_b $, the idea is to maximize

$ \frac{d_I(I_a, I_b)}{d_z(z_a, z_b)} $

where $ d_I(I_a, I_b) $ is the distance between the generated images and $ d_z(z_a, z_b) $ is the distance between the latent variables. Based on this, this implementation adds a loss that minimizes the following expression.

$ L_{ms}=\frac{||x_1-x_2||_1}{||G(x_1,v_{ab})-G(x_2,v_{ab})||_1} $
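
As a rough illustration of $ L_{ms} $, the sketch below computes the ratio for two inputs and their converted outputs. The mean absolute error stands in for the L1 norm, and the small `eps` in the denominator is my own addition for numerical safety; neither detail is confirmed by the implementation.

```python
import numpy as np

def mode_seeking_loss(x1, x2, y1, y2, eps=1e-8):
    """L_ms = ||x1 - x2||_1 / ||y1 - y2||_1, where y_i = G(x_i, v_ab).

    eps is an assumed stabilizer, not part of the formula in the article.
    """
    num = np.mean(np.abs(x1 - x2))
    den = np.mean(np.abs(y1 - y2))
    return float(num / (den + eps))

x1 = np.zeros((36, 128))
x2 = np.ones((36, 128))

# A generator that preserves input differences gives a loss near 1.
loss_identity = mode_seeking_loss(x1, x2, x1, x2)

# A collapsed generator maps both inputs to the same output, so the loss explodes.
collapsed = np.full((36, 128), 0.5)
loss_collapsed = mode_seeking_loss(x1, x2, collapsed, collapsed)

print(loss_identity, loss_collapsed)
```

Minimizing this ratio therefore penalizes a Generator whose outputs stop depending on the input, which is the behavior seen in mode collapse.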

As a result of adding these losses, the Loss on the Generator side is as follows. $ {\min_{G}}L_{G}=L_{adv}+{\lambda_1}L_{match}^G+{\lambda_2}L_{interp}^G+{\lambda_3}L_{cycle}+{\lambda_4}L_{self}+{\lambda_5}L_{tri}+{\lambda_6}L_{back}+{\lambda_7}L_{ms} $

Experiment

The implementation is uploaded to GitHub. It is a modification of GAN-Voice-Conversion, njellinas's implementation of CycleGAN-VC2. When rewriting it in the style of RelGAN, I referred to the original paper and Lento's implementation of RelGAN.

Dataset

I used the JVS (Japanese versatile speech) corpus as the dataset. The training data is parallel100 of jvs010, jvs016, jvs042, and jvs054, and about 5 files from each speaker's nonpara30 were used for validation. The parallel100 recordings largely share utterance content, scale information, and utterance timing across speakers (not perfectly), but this time they are treated as non-parallel data. My subjective impressions of each speaker are as follows.

Preprocessing

For feature extraction I use pyWorld, a Python wrapper for World. However, a few extra steps are applied before extracting the features.

Silence removal

If the training data contains a lot of silence, the silence itself may be learned as a feature. In addition, the program sometimes terminated without an error during feature extraction (a division by zero or a divergence to infinity may have occurred somewhere; the cause still needs investigating), so I decided to remove silence. There are various methods for removing silence, but this time I simply used librosa. It is also possible to compute $ f_o $ once and keep only the parts where $ f_o > 0 $.
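
The article does not say which librosa function was used (librosa offers `librosa.effects.trim` and `librosa.effects.split`, both with a `top_db` threshold). As a dependency-light stand-in, the sketch below drops low-energy frames with NumPy only. It is not the author's code: the function name `remove_silence`, the non-overlapping framing, and the 40 dB threshold are all assumptions chosen for illustration.

```python
import numpy as np

def remove_silence(wav, frame_length=1024, top_db=40.0):
    """Drop non-overlapping frames whose RMS is more than top_db below the loudest frame."""
    n_frames = len(wav) // frame_length
    frames = wav[: n_frames * frame_length].reshape(n_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    ref = rms.max() if n_frames > 0 else 0.0
    if ref <= 0.0:
        return wav[:0]  # the whole signal is digital silence
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / ref)
    return frames[db > -top_db].reshape(-1)

# One second of tone, one second of silence, one second of tone.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)
wav = np.concatenate([tone, np.zeros(sr), tone])

trimmed = remove_silence(wav)
print(len(wav), len(trimmed))
```

Note that trailing samples that do not fill a whole frame are discarded in this sketch, and frames that are only partly silent are kept.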

Align audio data length

To create a batch, one audio clip is first selected at random, and a feature segment of the specified length is then cut out of it at random. If the clips differ in length, individual samples are not selected with equal probability. To prevent this, all the audio is concatenated once and then split into clips of equal length.
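
The two ideas in this step, joining everything and re-splitting into equal-length clips, then cropping a random window, can be sketched as follows. The names `equalize_lengths` and `random_crop` are hypothetical, and the chunk length used by the actual implementation is not stated in the article.

```python
import numpy as np

def equalize_lengths(wavs, chunk_len):
    """Concatenate all waveforms and split them into chunks of chunk_len samples.

    Samples left over at the end (fewer than chunk_len) are dropped.
    """
    joined = np.concatenate(wavs)
    n_chunks = len(joined) // chunk_len
    return joined[: n_chunks * chunk_len].reshape(n_chunks, chunk_len)

def random_crop(feature, n_frames, rng):
    """Cut a window of n_frames from a (dims, T) feature at a uniformly random offset."""
    start = rng.integers(0, feature.shape[1] - n_frames + 1)
    return feature[:, start:start + n_frames]

# Three "recordings" of different lengths become three clips of 8 samples each.
wavs = [np.arange(0, 5), np.arange(5, 17), np.arange(17, 24)]
chunks = equalize_lengths(wavs, chunk_len=8)
print(chunks.shape)

# Crop 128 frames from a 36-dimensional feature sequence of 200 frames.
rng = np.random.default_rng(0)
feature = np.arange(36 * 200, dtype=np.float32).reshape(36, 200)
window = random_crop(feature, n_frames=128, rng=rng)
print(window.shape)
```

With equal-length clips, choosing a clip uniformly and then an offset uniformly gives every window the same chance of being selected.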

Extraction of features

Features are extracted with pyWorld from the split audio data. I honestly don't fully understand what is going on here (a big problem), but roughly speaking I obtain $ f_o $ and the MCEPs $ sp $, compute $ \mu_{f_o} $, $ \sigma_{f_o} $, $ \mu_{sp} $, and $ \sigma_{sp} $ from them, and then normalize $ sp $ and save it.
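
Here is a small sketch of the statistics and normalization step, using synthetic arrays in place of real pyWorld output. Several details are assumptions on my part: the statistics of $ f_o $ are taken over voiced frames only ($ f_o > 0 $), the MCEPs are laid out as (frames, dims), and the mean and standard deviation of $ sp $ are computed per dimension. The function names are hypothetical.

```python
import numpy as np

def f0_stats(f0):
    """Mean and standard deviation of f0 over voiced frames (f0 > 0); an assumed choice."""
    voiced = f0[f0 > 0]
    return float(voiced.mean()), float(voiced.std())

def sp_stats(mceps):
    """Per-dimension mean and standard deviation of MCEPs shaped (frames, dims)."""
    return mceps.mean(axis=0), mceps.std(axis=0)

def normalize(mceps, mu, sigma):
    """Standardize MCEPs with the speaker's statistics."""
    return (mceps - mu) / sigma

# Synthetic stand-ins for features that pyWorld would provide.
rng = np.random.default_rng(0)
mceps = rng.normal(loc=3.0, scale=2.0, size=(500, 36))
f0 = np.array([0.0, 100.0, 0.0, 200.0])

mu_f0, sigma_f0 = f0_stats(f0)
mu_sp, sigma_sp = sp_stats(mceps)
sp_norm = normalize(mceps, mu_sp, sigma_sp)
print(mu_f0, sigma_f0, sp_norm.shape)
```

The four statistics are saved per speaker because they are needed again at conversion time, both to normalize the source and to denormalize toward the target.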

Learning

At this point, deep learning can finally begin. The model and network are as described above. The normalized MCEPs have 36 dimensions, and the input is a randomly selected window of 128 frames. Complete silence is also inserted into the batch with a certain probability (0.0005%). The batch size is 8, so the input size is (8, 36, 128) ((1, 36, 128) at inference). For the GAN objective I used LSGAN; I took the well-trodden path because it is also used in CycleGAN-VC2 and StarGAN-VC2. The optimizer is Adam with $ \beta_1 = 0.5 $ and $ \beta_2 = 0.999 $. The learning rate starts at 0.0002 for the Generator and 0.0001 for the Discriminator and is multiplied by 0.9999 at every step. The loss weights are $ \lambda_1 = 1, \lambda_2 = 10, \lambda_3 = 10, \lambda_4 = 10, \lambda_5 = 5, \lambda_6 = 5, \lambda_7 = 1 $. In addition, $ \lambda_5 $ and $ \lambda_6 $ are multiplied by 0.9 every 10,000 steps.

Under these conditions, I trained for 100,000 steps (about 62 hours on an RTX 2070). The conversion did not work well after 80,000 steps, so this article evaluates the model trained for 80,000 steps. The figure below is a graph of the losses in TensorBoard. You can see that the adversarial loss soars on the Discriminator side (and on the Generator side) around 80,000 steps.

Figure: losses.png
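
The schedules described above can be written down as plain functions. This is a framework-free sketch of the numbers in the text, not the training code itself: the function names and the dictionary keys are hypothetical, and how the decay is wired into the optimizer in the actual implementation is not shown here.

```python
def learning_rate(step, base_lr, decay=0.9999):
    """Learning rate after `step` updates, multiplied by `decay` at every step."""
    return base_lr * decay ** step

def loss_weights(step):
    """Lambda weights; lambda_5 (tri) and lambda_6 (back) shrink by 0.9 every 10,000 steps."""
    decay = 0.9 ** (step // 10000)
    return {
        "match": 1.0,        # lambda_1
        "interp": 10.0,      # lambda_2
        "cycle": 10.0,       # lambda_3
        "self": 10.0,        # lambda_4
        "tri": 5.0 * decay,  # lambda_5
        "back": 5.0 * decay, # lambda_6
        "ms": 1.0,           # lambda_7
    }

def generator_loss(losses, weights):
    """L_G = L_adv + sum over the weighted auxiliary losses."""
    return losses["adv"] + sum(weights[k] * losses[k] for k in weights)

# Example: every individual loss equals 1.0 at step 0.
unit_losses = {k: 1.0 for k in ["adv", "match", "interp", "cycle", "self", "tri", "back", "ms"]}
total = generator_loss(unit_losses, loss_weights(0))

print(learning_rate(0, 0.0002), learning_rate(80000, 0.0002))
print(loss_weights(10000)["tri"], total)
```

Printing the Generator learning rate at step 80,000 shows how small it has become by then under a per-step factor of 0.9999, which may be worth keeping in mind when reading the loss curves.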

Voice generation and evaluation

The generated audio is on GitHub. I also made it available on YouTube.

Audio reconstruction

The trained neural network converts only the normalized MCEPs. Using its output together with the other statistics, the speech is resynthesized with pyWorld. The flow for converting from domain A to domain B at an interpolation rate of $ \alpha $ is described below.

  1. Load wav.
  2. Obtain $ f_{o_A} $, $ sp_A $, and $ ap_A $ (the aperiodicity index) from the loaded wav with pyWorld.
  3. Convert $ f_{o_A} $ with the formulas below. $ \mu_{f_{o_{\alpha}}} $ and $ \sigma_{f_{o_{\alpha}}} $ are computed by linear interpolation (whether the standard deviation should first be converted back to a variance needs verification).
     $ \mu_{f_{o_{\alpha}}}=(1-{\alpha})\mu_{f_{o_A}}+{\alpha}\mu_{f_{o_B}} $
     $ {\sigma_{f_{o_{\alpha}}}}=(1-{\alpha})\sigma_{f_{o_A}}+{\alpha}\sigma_{f_{o_B}} $
     $ f_{o_\alpha}=\frac{f_{o_A}-\mu_{f_{o_A}}}{\sigma_{f_{o_A}}}{\sigma_{f_{o_{\alpha}}}}+\mu_{f_{o_{\alpha}}} $
  4. Normalize $ sp_A $.
     $ sp_{A_{norm}}=\frac{sp_A-\mu_{sp_A}}{\sigma_{sp_{A}}} $
  5. Run $ sp_{A_{norm}} $ through the neural network.
     $ sp_{{\alpha}_{norm}}=G({sp}_{A_{norm}}, {\alpha}v_{ab}) $
  6. Denormalize $ sp_{{\alpha}_{norm}} $. As before, linear interpolation is used for the mean and standard deviation.
     $ \mu_{sp_{{\alpha}}}=(1-{\alpha})\mu_{sp_{A}}+{\alpha}\mu_{sp_{B}} $
     $ {\sigma_{sp_{{\alpha}}}}=(1-{\alpha})\sigma_{sp_{A}}+{\alpha}\sigma_{sp_{B}} $
     $ {sp_{\alpha}}={\mu_{sp_{{\alpha}}}}+{\sigma_{sp_{{\alpha}}}}{sp_{{\alpha}_{norm}}} $
  7. Resynthesize the voice with World from the calculated $ f_{o_\alpha} $, $ sp_{\alpha} $, and $ ap_A $. Note that the aperiodicity index $ ap_A $ of the source is used unchanged.
  8. Normalize the volume as needed and export.

That's it.
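
The statistical parts of the flow (the $ f_o $ conversion and the denormalization of $ sp $) can be sketched in NumPy as below. The pyWorld analysis and synthesis and the Generator itself are left out. Keeping unvoiced frames ($ f_o = 0 $) at zero is my own assumption, since the formula does not say how they are handled, and the function names are hypothetical.

```python
import numpy as np

def lerp(a, b, alpha):
    """Linear interpolation between the statistics of speakers A and B."""
    return (1.0 - alpha) * a + alpha * b

def convert_f0(f0_a, stats_a, stats_b, alpha):
    """Shift and scale f0 toward the interpolated (mean, std); unvoiced frames stay 0."""
    mu_a, sd_a = stats_a
    mu_b, sd_b = stats_b
    mu, sd = lerp(mu_a, mu_b, alpha), lerp(sd_a, sd_b, alpha)
    converted = (f0_a - mu_a) / sd_a * sd + mu
    return np.where(f0_a > 0, converted, 0.0)

def denormalize_sp(sp_norm, stats_a, stats_b, alpha):
    """Undo the normalization with interpolated per-dimension (mean, std)."""
    mu = lerp(stats_a[0], stats_b[0], alpha)
    sd = lerp(stats_a[1], stats_b[1], alpha)
    return mu + sd * sp_norm

# Toy statistics: speaker A is low-pitched, speaker B is high-pitched.
f0_a = np.array([0.0, 100.0, 120.0, 0.0, 140.0])
f0_half = convert_f0(f0_a, stats_a=(120.0, 20.0), stats_b=(240.0, 40.0), alpha=0.5)
print(f0_half)

sp_stats_a = (np.zeros(36), np.ones(36))
sp_stats_b = (np.full(36, 2.0), np.full(36, 3.0))
sp_half = denormalize_sp(np.ones((4, 36)), sp_stats_a, sp_stats_b, alpha=0.5)
print(sp_half.shape)
```

With `alpha=0` the source pitch is returned unchanged and with `alpha=1` it takes on the statistics of speaker B, so intermediate values give the morphing behavior.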

Subjective evaluation

I don't know how to make a quantitative evaluation, so I will evaluate at my own discretion (a problem). I think the conversion to the male speakers (jvs042, jvs054) works to some extent from any source speaker. On the other hand, the conversion to the female speakers (jvs010, jvs016) has low accuracy from both same-sex and opposite-sex sources, and this is especially noticeable for conversion to jvs010. Also, the morphing is quite subtle; could it be that only $ f_o $ has changed? Some of the results are as follows. You can hear that morphing between opposite sexes does happen, but I feel it sounds less natural.

Summary

In this article, I proposed RelGAN-VM and ran experiments on voice quality conversion and voice quality morphing. I believe the conversion to male speakers is comparable to existing methods, but the accuracy of conversion from male speakers to female speakers, especially to high-pitched voices, did not turn out very high. Voice morphing had already been proposed in StarGAN-VC2, so this work was a little insufficient for a thesis, but I thought it would be a waste to throw it away, so I decided to post it on Qiita. It's difficult to be a beautiful girl.

Future work

In this article, only the voice quality between speakers was converted. For example, the Voice Actor Statistics Corpus publishes 9 datasets, read by 3 speakers in 3 types of emotion. Training on it might make it possible to morph not only the speaker's voice quality but also the emotion. Also, the network structure, loss function, and hyperparameters of this implementation are not perfect yet, and I think there is room for improvement. I would like to keep exploring higher-performance voice conversion models.

Precautions for voice conversion

This is not a pleasant topic, but a person's voice also carries rights. For example, if you use voice quality conversion to impersonate another person's voice for commercial purposes or misuse it without permission, you can be severely punished for infringing privacy and copyright. Of course, you are free to use this implementation to take on someone else's voice, but when using audio data that is not copyright-free, please keep it to personal use. Note that even non-commercial use becomes a gray area if the result can be accessed by an unspecified number of people. For example, the JVS corpus used in this experiment has the following usage conditions.

The text data comes from the JSUT corpus, and the license information is described in the JSUT corpus. Tag information is licensed under CC-BY-SA 4.0. Audio data can be used only in the following cases: research at academic institutions, non-commercial research (including research in commercial organizations), and personal use (including blogs). If you would like to use it for commercial purposes, please see below. Redistribution of this audio data is not permitted, but it is possible to publish a part of the corpus (for example, about 10 sentences) on your web page or blog.

In addition, the author takes no responsibility for any incident caused by using this implementation to perform voice quality conversion.

Acknowledgments

I am grateful to the people in my laboratory who gave me advice on posting this article. In particular, Rugiu-kun gave me many technical comments about World and pointed out incorrect information, which I was able to fix. I haven't reached the level he expects, but it was a good learning experience. I would also like to thank those who have published clear articles, papers, implementations, and libraries, and the laboratories that have released useful datasets. Any deficiencies in this article are my responsibility.

Reflection

What am I doing instead of writing my thesis orz
