Japanese speech synthesis starting with Tacotron2

Introduction

I think the most interesting thing is what it can actually do, so please listen to the results here first.

This model is trained and run with:

--Transfer learning from a pre-trained model
--About 1 hour of preprocessed data

I'll show how to do this for those who are just starting out.

For background on Tacotron2, see: Research and development of Japanese TTS (Text-to-Speech) using Tacotron2 [Summary]

What to prepare

Audio file

--22050 Hz, 16-bit, monaural wav
--Split into one file per speech segment

Exclude clips that are noisy, contain laughter, or are otherwise hard to transcribe. Clips that are too long can cause memory errors during training, so I only use clips of 10 seconds or less.
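As a concrete illustration, here is a minimal sketch of that preprocessing using librosa and soundfile (tool choices and file paths are my own assumptions, not requirements of Tacotron2):

```python
import librosa
import soundfile as sf

MAX_SECONDS = 10  # longer clips risk memory errors during training

# Resample to 22050 Hz mono; 'in.wav' / 'out.wav' are placeholder paths.
audio, sr = librosa.load('in.wav', sr=22050, mono=True)

# Keep only clips of 10 seconds or less and write them as 16-bit PCM wav.
if len(audio) / sr <= MAX_SECONDS:
    sf.write('out.wav', audio, sr, subtype='PCM_16')
```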

Text

Create train.txt and val.txt.

Following the format of ljs_audio_text_val_filelist.txt, each line is written as FILE PATH|TEXT. I split train and val at a ratio of 9:1. Phoneme balance is not taken into consideration.
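If it helps, here is a minimal sketch of producing that 9:1 split from a single filelist (the file names are placeholders, and the random shuffle is my own choice):

```python
import random

# Read a combined "FILE PATH|TEXT" filelist ('filelist.txt' is a placeholder).
with open('filelist.txt', encoding='utf-8') as f:
    lines = [line for line in f if line.strip()]

random.seed(0)
random.shuffle(lines)

# Roughly 1/10 of the lines go to validation, the rest to training.
n_val = max(1, len(lines) // 10)
with open('val.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines[:n_val])
with open('train.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines[n_val:])
```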

Phoneme notation

TEXT is written in phonemes, with reference to the following: [wiki Japanese phonemes](https://ja.wikipedia.org/wiki/phonemes #Japanese phonemes) and the Voice Actor Statistics Corpus phoneme-balanced sentences.

Only symbols defined in symbols.py can be used.

Note that if you enter koNnichiwa here, Tacotron2 internally converts it to ['k','o','n','n','i','c','h','i','w','a']. If you want ['k','o','N','n','i','ch','i','w','a'], those symbols must be enclosed in {}. However, inside {} only the elements of valid_symbols in cmudict.py can be used, so you need to write ko{N}ni{CH}iwa.

I think notation such as k o {N} n i {CH} i w a, with spaces between phonemes, would also work. I write it as konnnichiwa.
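You can check how a given string will be tokenized by round-tripping it through the repo's text module. A minimal sketch, assuming the NVIDIA/tacotron2 repo is cloned onto sys.path and the default english_cleaners are used:

```python
import sys
sys.path.append('tacotron2')  # path to the cloned NVIDIA/tacotron2 repo (assumption)

from text import text_to_sequence, sequence_to_text

# Without braces every character becomes its own symbol, so N and CH are lost.
seq = text_to_sequence('koNnichiwa', ['english_cleaners'])
print(sequence_to_text(seq))   # e.g. 'konnichiwa'

# Curly braces switch to ARPAbet lookup (valid_symbols in cmudict.py),
# so N and CH survive as single symbols.
seq = text_to_sequence('ko{N}ni{CH}iwa', ['english_cleaners'])
print(sequence_to_text(seq))   # e.g. 'ko{N}ni{CH}iwa'
```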

Add an EOS at the end of each sentence

According to Model can not converge #254, this seems to speed up convergence of the attention during training.
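A minimal sketch of appending an EOS marker to an existing filelist. Which character to use as EOS depends on your symbols.py; here I assume a plain '.' and placeholder file names:

```python
EOS = '.'  # assumption: any symbol defined in symbols.py works if used consistently

with open('train.txt', encoding='utf-8') as f:
    lines = [line.rstrip('\n') for line in f if line.strip()]

with open('train_eos.txt', 'w', encoding='utf-8') as f:
    for line in lines:
        path, text = line.split('|', 1)
        if not text.endswith(EOS):
            text += EOS
        f.write(f'{path}|{text}\n')
```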

Example

Here is what mine looks like.

train.txt


/wav/0126.wav|na&tanndesukedo-.
/wav/0022.wav|biyo-inndake-yoyakuwasimasita.
/wav/0149.wav|tasikani,ari!.
/wav/0092.wav|sositara-.
/wav/0063.wav|teyu-ne.
/wav/0202.wav|donndonn,tama&tekunndesuyo.

Setting

Edit hparams.py
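Which fields need editing depends on your fork, but at minimum the filelist paths, the sampling rate, and anything memory-related should match your data. As a sketch of what I mean (assuming the NVIDIA repo's hparams.py, with example values), these are the defaults I would change, shown here as overrides on create_hparams():

```python
from hparams import create_hparams  # NVIDIA/tacotron2 repo

hparams = create_hparams()

# Point the data loaders at the filelists created above (paths are examples).
hparams.training_files = 'filelists/train.txt'
hparams.validation_files = 'filelists/val.txt'

# Must match the 22050 Hz wav files prepared earlier.
hparams.sampling_rate = 22050

# A smaller batch size helps avoid memory errors on a Colab T4 (example value).
hparams.batch_size = 32
```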

I also added exponential learning rate decay to train.py, following Model can not converge #254.
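Exponential decay only takes a few lines. This is a sketch of the idea; the decay constants are assumptions to be tuned, not values taken from the issue:

```python
def exponential_lr(init_lr, iteration, decay_start=6000,
                   decay_rate=0.5, decay_steps=10000, min_lr=1e-5):
    """Exponentially decay the learning rate after decay_start iterations."""
    if iteration < decay_start:
        return init_lr
    lr = init_lr * decay_rate ** ((iteration - decay_start) / decay_steps)
    return max(lr, min_lr)

# Inside the training loop of train.py, before optimizer.step():
#     learning_rate = exponential_lr(hparams.learning_rate, iteration)
#     for param_group in optimizer.param_groups:
#         param_group['lr'] = learning_rate
```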

Training

Training warm-starts from the pre-trained model. The plots below show the results after 10k iterations; it took about six and a half hours on a Colab T4.
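Transfer learning here means starting from the published pre-trained checkpoint and fine-tuning on the one hour of Japanese data. The sketch below is my paraphrase of what train.py's --warm_start option does; the function name, checkpoint path, and ignore list are assumptions:

```python
import torch

def warm_start(model, checkpoint_path, ignore_layers=('embedding.weight',)):
    """Load a pre-trained Tacotron2 state_dict, skipping layers in ignore_layers."""
    state = torch.load(checkpoint_path, map_location='cpu')['state_dict']
    state = {k: v for k, v in state.items() if k not in ignore_layers}
    merged = model.state_dict()  # keep freshly initialized weights for skipped layers
    merged.update(state)
    model.load_state_dict(merged)
    return model

# Example: warm_start(model, 'tacotron2_statedict.pt')
```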

[Figure: grad.norm (grad.norm.png)]

[Figure: training.loss (training.loss.png)]

Inference

Results for each checkpoint, generated with sigma = 1 and the denoiser unused. A sketch of the inference code is below, followed by the samples.
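For reference, a minimal inference sketch following the repo's inference.ipynb; the checkpoint path, the WaveGlow file name, and the input text are placeholders:

```python
import sys
sys.path.append('tacotron2')           # cloned NVIDIA/tacotron2 repo (assumption)
sys.path.append('tacotron2/waveglow')  # needed so torch.load can unpickle WaveGlow

import numpy as np
import torch

from hparams import create_hparams
from train import load_model
from text import text_to_sequence

hparams = create_hparams()
hparams.sampling_rate = 22050

model = load_model(hparams)
model.load_state_dict(torch.load('checkpoint_10000')['state_dict'])
model.cuda().eval().half()

waveglow = torch.load('waveglow_256channels.pt')['model']
waveglow.cuda().eval().half()
for k in waveglow.convinv:
    k.float()

text = 'ko{N}ni{CH}iwa.'  # phoneme text with EOS, as in the filelists
sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()

with torch.no_grad():
    mel, mel_postnet, _, alignment = model.inference(sequence)
    audio = waveglow.infer(mel_postnet, sigma=1.0)  # sigma = 1, denoiser unused
```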

--"In addition, it is often placed at the center of the principal Myoo, called the Five Great Myoo, as at Toji." 2500|5000|7500|10000
--"New England style is a milk-based white cream soup, also known as Boston clam chowder." 2500|5000|7500|10000
--"Category of people related to computer game makers, industry groups, etc." 2500|5000|7500|10000
