I think the most interesting thing is how much you can do, so first ask here Please give me.
This model is
--Transfer learning using pre-trained model --Preprocessed data for about 1 hour
I am learning and inferring. I'll show you how to do it for those who are just starting out.
Here is a reference for Tacotron 2. Research and development of Japanese TTS (Text-to-Speech) using Tacotron2 [Summary]
--22050Hz 16bit monaural wav --Divided for each audio section
Exclude items that are noisy, laughter, and other items that are difficult to write. If it is too long, a memory error may occur during learning. I only do things within 10 seconds.
Refer to ljs_audio_text_val_filelist.txt
FILE PATH|TEXT
I will write it as.
I have a 9: 1 balance between train and val. Phoneme balance is not taken into consideration.
TEXT is written in phonemes with reference to the following. [wiki Japanese phonemes](https://ja.wikipedia.org/wiki/phonemes #Japanese phonemes) Voice Actor Statistics Corpus Phoneme Balance Sentence
Only the symbols.py element can be used.
Note that if you enter koNnichiwa
at this time, inside Tacotron2,['k','o','n','n','i','c','h','i', Converted to'w','a']
.
If you want ['k','o','N','n','i','ch','i','w','a']
, use {}
Must be enclosed.
However, you can only use the elements in valid_symbols
in cmudict.py.
So you need to say ko {N} ni {CH} iwa
.
I also think that the notation such as k o {N} n i {CH} i w a
may be used. I am konnnichiwa
.
Model can not converge #254 It seems that the convergence of attention will be accelerated during learning.
I am doing this.
train.txt
/wav/0126.wav|na&tanndesukedo-.
/wav/0022.wav|biyo-inndake-yoyakuwasimasita.
/wav/0149.wav|tasikani,ari!.
/wav/0092.wav|sositara-.
/wav/0063.wav|teyu-ne.
/wav/0202.wav|donndonn,tama&tekunndesuyo.
['basic_cleaners']
Here is a reference for transliteration_cleaners.
Uncertainty of Japanese unite code in Tacotron 2 series32
. Looking at issues etc., it seems that there are many people who set it to about 8 ~ 16
. Please consult with the GPU to decide.We will learn using the pre-trained model. The result of 10k iter. It took about 6 and a half hours with Colab T4. grad.norm
training.loss
The result of each checkpoint. sigma = 1, denoiser unused
――In addition, it is often placed in the center of the main Myo, which is called the Five Great Myo, like Toji. 2500|5000|7500|10000 --New England style is a milk-based white cream soup, also known as Boston clam chowder. 2500|5000|7500|10000 --Category of people related to computer game makers, industry groups, etc. 2500|5000|7500|10000
Recommended Posts