- Kaldi is a toolkit that lets you build and customize your own speech recognizer. In this article, I will show how to train Kaldi with the JSUT corpus (Download), a Japanese speech dataset. The JSUT corpus is a roughly 10-hour speech corpus created for research purposes. Note that commercial use requires contacting the authors.
- The text data is licensed under CC-BY-SA 4.0, etc.; see the LICENCE file for details. The audio data may be used only in the following cases: research at academic institutions, non-commercial research (including research inside commercial organizations), and personal use (including blogs). If you would like to use it commercially, contact the authors. Redistribution of the audio data is not permitted, but publishing a small part of the corpus (for example, around 100 sentences) on a web page or blog is allowed. The authors also ask to be contacted, if possible, when you publish papers, blog posts, or other results, since knowing how the corpus is used is very useful information for them.
- In general, if you want to train Kaldi, Julius, etc. to accurately recognize speech such as everyday conversation, you need thousands of hours of speech data. The corpora below are well known as Japanese research speech corpora; among them, CSJ is (probably) the largest, with about 600 hours of data. They are paid corpora and cannot be used casually, so this time I will use the freely distributed JSUT corpus and walk through the basic usage of Kaldi.
  - CSJ (Corpus of Spontaneous Japanese)
  - JNAS (newspaper article read speech corpus)
  - S-JNAS (newspaper article read speech corpus of elderly speakers)
  - CEJC (Corpus of Everyday Japanese Conversation)
- Training Kaldi from scratch with the CSJ corpus is quite difficult. For that reason, collections of the programs needed to use a given corpus, called ***recipes***, are provided. For Japanese, Kaldi currently only has the CSJ recipe (there are many others for English), so this time I will customize the CSJ recipe so that it can use the JSUT corpus.
- Installation takes a while. There are easy-to-follow articles on how to install Kaldi, so please refer to them. I will introduce two articles, both written by Professor Shinozaki of Tokyo Institute of Technology. (I may write up the installation procedure in detail someday.)
  - Installation guide 1
  - Installation guide 2
- ***Note that nothing will work unless the required tools such as SRILM are installed correctly.***
- First, the JSUT corpus has to be arranged so that Kaldi can use it. Once that is done, the rest is trained automatically by the power of the recipe. All you have to do is prepare JSUT in the same form in which CSJ is fed to the recipe.
- Broadly speaking, the following five files need to be prepared:
- wav.scp
- text
- lexicon.txt
- utt2spk
- spk2utt
- I will explain how to create them one by one.
- ***The current directory is always `kaldi/egs/csj/jsut`.***
- The `jsut` directory is a copy of the `s5` directory. ***Note that `s5` contains symbolic links, so copy it in a way that preserves them (`cp -a`).***
- wav.scp: a text file that maps each utterance to the path of its audio file.
  - The JSUT audio is distributed at a 48 kHz sampling rate; because the files are large, they are converted to 16 kHz here. ***Depending on your PC it may work without conversion, but I have only tried 16 kHz, so other rates may cause errors.***
  - The wav.scp creation program is shown below.
  - ***Note that most of these files (wav.scp, text, utt2spk, ...) must be sorted by utterance ID (or speaker ID).***
```python
import os, sys
import glob
from sklearn.model_selection import train_test_split
import subprocess
import numpy as np

np.random.seed(seed=32)

def sort_file(fname):
    # Kaldi expects wav.scp, text, utt2spk, ... to be sorted, so sort the file in place
    subprocess.call(f'sort {fname} > {fname}.sorted', shell=True)
    subprocess.call(f'rm {fname}', shell=True)
    subprocess.call(f'mv {fname}.sorted {fname}', shell=True)

def convert_wav(wav_data_path, out_dir):
    '''
    * sampling frequency must be 16kHz
    * wav files of JSUT are 48kHz, so convert them to 16kHz using sox
      e.g. sox [input_wavfilename] -r 16000 [output_wavfilename]
    '''
    for wav_data in wav_data_path:
        fname = wav_data.split('/')[-1]
        subprocess.call(f'sox {wav_data} -r 16000 {out_dir}/{fname}', shell=True)
        subprocess.call(f'chmod 774 {out_dir}/{fname}', shell=True)

def make_wavscp(wav_data_path_list, out_dir, converted_jsut_data_dir):
    '''
    wav.scp: format -> FILE_ID cat PATH_TO_WAV |
    '''
    out_fname = f'{out_dir}/wav.scp'
    with open(out_fname, 'w') as out:
        for wav_data_path in wav_data_path_list:
            file_id = wav_data_path.split('/')[-1].split('.')[0]
            out.write(f'{file_id} cat {converted_jsut_data_dir}/{file_id}.wav |\n')
    sort_file(out_fname)
# Current directory -> kaldi/egs/csj/jsut
# (jsut is a copy of s5 with only the directory name changed; copy it with "cp -a"
#  so that the symbolic links are preserved)
data_dir = './data'
train_dir = f'{data_dir}/train'
eval_dir = f'{data_dir}/eval'
original_jsut_data_dir = '/path/to/JSUT/corpus'
converted_jsut_data_dir = '/path/to/converted/JSUT/corpus'

# make wav.scp of train and eval
wav_data_path = glob.glob(f'{original_jsut_data_dir}/*/wav/*.wav')

# convert JSUT wav data to 16kHz
convert_wav(wav_data_path, converted_jsut_data_dir)

# split data [train_size = 7196, test_size = 500]
train_wav_data_list, eval_wav_data_list = train_test_split(wav_data_path, test_size=500)

make_wavscp(train_wav_data_list, train_dir, converted_jsut_data_dir)
make_wavscp(eval_wav_data_list, eval_dir, converted_jsut_data_dir)
```
- The sox command is used to downsample the audio to 16 kHz; ffmpeg or similar tools also work.
  - `sox [input wav file] -r 16000 [output wav file]`
- The converted audio files may not have the necessary permissions, so they are adjusted with `chmod`.
- wav.scp format:
  - `[file ID] cat [path to audio file] |`
  - Each line ends with a pipe (`|`); the audio read by `cat` is passed to the command Kaldi runs next.
- Separate wav.scp files are created for training and testing.
  - In the sample program, 7196 utterances for training and 500 for testing are chosen at random.
  - The random seed is fixed with `np.random.seed(seed=32)`.
- text and utt2spk are likewise created separately for training and testing.
- wav.scp should look like this:

```
BASIC5000_0051 cat /home/kaldi/egs/csj/jsut/JSUT/BASIC5000_0051.wav |
BASIC5000_0053 cat /home/kaldi/egs/csj/jsut/JSUT/BASIC5000_0053.wav |
BASIC5000_0082 cat /home/kaldi/egs/csj/jsut/JSUT/BASIC5000_0082.wav |
BASIC5000_0094 cat /home/kaldi/egs/csj/jsut/JSUT/BASIC5000_0094.wav |
BASIC5000_0101 cat /home/kaldi/egs/csj/jsut/JSUT/BASIC5000_0101.wav |
...
```
- text: simply put, the transcription of each utterance, i.e. the spoken words written out as text. The transcriptions are included in the JSUT corpus from the start; any speech corpus contains at least the audio and its transcriptions.
- text format:
  - `[utterance ID] [transcribed text with part-of-speech information]`
  - e.g. `UTT001 明日+名詞 は+助詞 晴れ+名詞 です+助動詞` (each token is word+POS)
  - Punctuation is removed; strictly speaking, it is replaced with the short-pause symbol `<sp>`.
- First, to obtain the part-of-speech information, the transcribed text is morphologically analyzed (the katakana reading of each word is obtained at the same time) and a `transcript` file is created; the `text` file is then built from this `transcript` file. The `transcript` file stores each analyzed word in the form `word+katakana reading+part of speech`.
  - ***MeCab (mecab-ipadic-neologd)*** is used as the morphological analyzer.
  - Change the dictionary path in `chasen_tagger = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")` to match your environment.
  - Note that the results will differ if you use a different morphological analyzer.
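For example, a transcript line and the corresponding text line might look like the following (illustrative only; the exact segmentation, readings, and POS tags depend on the MeCab dictionary used):

```
# data/train/transcript  (word+reading+POS)
BASIC5000_0001 水+ミズ+名詞/一般 を+ヲ+助詞/格助詞 マレーシア+マレーシア+名詞/固有名詞 から+カラ+助詞/格助詞 買わ+カワ+動詞/自立 ...
# data/train/text  (reading removed: word+POS)
BASIC5000_0001 水+名詞/一般 を+助詞/格助詞 マレーシア+名詞/固有名詞 から+助詞/格助詞 買わ+動詞/自立 ...
```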
```python
import MeCab  # the morphological analyzer; glob, subprocess, sort_file, etc. are defined above

def make_transcript(transcript_data_path_list, train_dir, eval_dir, error_dir, eval_wav_data_list):
    '''
    text: format -> UTT_ID TRANSCRIPT
      * UTT_ID == FILE_ID (one wav file <-> one utterance)
    transcript_data_path_list: list of paths to the JSUT transcription files
                               (transcript_utf8.txt; there is one per subdirectory)
    train_dir: output directory for the training set
    eval_dir: output directory for the evaluation set
    '''
    # conversion table from half-width to full-width characters
    ZEN = "".join(chr(0xff01 + i) for i in range(94))
    HAN = "".join(chr(0x21 + i) for i in range(94))
    HAN2ZEN = str.maketrans(HAN, ZEN)

    eval_utt_id_list = []
    for eval_wav_data in eval_wav_data_list:
        eval_utt_id_list.append(eval_wav_data.split('/')[-1].split('.')[0])

    # word -> katakana reading table for words whose reading MeCab cannot resolve
    word_reading_fname = './word_reading.txt'
    word_reading_dict = {}  # {'word':'reading'}
    with open(word_reading_fname, 'r') as f:
        lines = f.readlines()
        for line in lines:
            split_line = line.strip().split('+')
            word_reading_dict[split_line[0]] = split_line[1]

    out_train_fname = f'{train_dir}/transcript'
    out_eval_fname = f'{eval_dir}/transcript'
    out_no_reading_word_fname = f'{error_dir}/no_reading_word.txt'
    no_reading_word_list = []

    chasen_tagger = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")

    with open(out_train_fname, 'w') as out_train, open(out_eval_fname, 'w') as out_eval, \
            open(out_no_reading_word_fname, 'w') as no_reading:
        for transcript_data_path in transcript_data_path_list:
            with open(transcript_data_path, 'r') as trans:
                line = trans.readline()
                while line:
                    split_line = line.strip().split(':')
                    utt_id = split_line[0]
                    transcript = split_line[1].translate(HAN2ZEN)
                    transcript = transcript.replace('・', ' ').replace('-', ' ').replace('』', ' ').replace('『', ' ').replace('」', ' ').replace('「', ' ')
                    node = chasen_tagger.parseToNode(transcript)
                    transcript_line = []
                    while node:
                        feature = node.feature
                        if feature != 'BOS/EOS,*,*,*,*,*,*,*,*':
                            surface = node.surface
                            split_feature = feature.split(',')
                            reading = split_feature[-1]
                            part_of_speech = '/'.join(split_feature[:2]).replace('/*', '')
                            # record words without a reading in error/no_reading_word.txt
                            if reading == '*':
                                if surface not in no_reading_word_list:
                                    no_reading_word_list.append(surface)
                                    no_reading.write(f'{surface}\n')
                            if surface == '、' or surface == '。' or surface == ',' or surface == '.':
                                transcript_line.append('<sp>')
                            elif surface != '-':
                                if reading == '*':
                                    # fall back to the manually prepared reading table
                                    reading = word_reading_dict[surface]
                                transcript_line.append('{}+{}+{}'.format(surface, reading, part_of_speech))
                        node = node.next
                    transcript_line = ' '.join(transcript_line)
                    if utt_id in eval_utt_id_list:
                        out_eval.write(f'{utt_id} {transcript_line}\n')
                    else:
                        out_train.write(f'{utt_id} {transcript_line}\n')
                    line = trans.readline()
    sort_file(out_train_fname)
    sort_file(out_eval_fname)
data_dir = './data'
train_dir = f'{data_dir}/train'
eval_dir = f'{data_dir}/eval'
error_dir = f'{data_dir}/error'  # where no_reading_word.txt goes (assumed location; create it beforehand)
original_jsut_data_dir = '/path/to/JSUT/corpus'

# make text of train and eval
# (the train/eval split from the wav.scp step is reused here, so eval_wav_data_list
#  refers to the same 500 utterances selected above)
transcript_data_path = glob.glob(f'{original_jsut_data_dir}/*/transcript_utf8.txt')
make_transcript(transcript_data_path, train_dir, eval_dir, error_dir, eval_wav_data_list)
make_text(train_dir, eval_dir)
```
- Either half-width or full-width characters would work, but full-width is more convenient here, so everything is converted to full-width.
- Next, the word/part-of-speech pairs are written from the `transcript` file to the `text` file.
  - This is easy: just drop the katakana reading from each token recorded in the `transcript` file.
```python
def make_text(train_dir, eval_dir):
    train_transcript_fname = f'{train_dir}/transcript'
    eval_transcript_fname = f'{eval_dir}/transcript'
    out_train_fname = f'{train_dir}/text'
    out_eval_fname = f'{eval_dir}/text'
    with open(train_transcript_fname, 'r') as train_trans, open(eval_transcript_fname, 'r') as eval_trans, \
            open(out_train_fname, 'w') as out_train, open(out_eval_fname, 'w') as out_eval:
        train_trans_line = train_trans.readline()
        while train_trans_line:
            split_train_trans_line = train_trans_line.strip().split(' ')
            # if <sp> is at the end of the sentence, remove it
            if split_train_trans_line[-1] == "<sp>":
                split_train_trans_line.pop(-1)
            out_train.write(split_train_trans_line[0])  # write utt_id
            for word in split_train_trans_line[1:]:     # skip the utterance ID only
                if word == '<sp>':
                    out_train.write(' <sp>')
                else:
                    # drop the katakana reading: word+reading+POS -> word+POS
                    split_word = word.split('+')
                    out_train.write(' {}+{}'.format(split_word[0], split_word[2]))
            out_train.write('\n')
            train_trans_line = train_trans.readline()

        eval_trans_line = eval_trans.readline()
        while eval_trans_line:
            split_eval_trans_line = eval_trans_line.strip().split(' ')
            # if <sp> is at the end of the sentence, remove it
            if split_eval_trans_line[-1] == "<sp>":
                split_eval_trans_line.pop(-1)
            out_eval.write(split_eval_trans_line[0])  # write utt_id
            for word in split_eval_trans_line[1:]:
                if word == '<sp>':
                    out_eval.write(' <sp>')
                else:
                    split_word = word.split('+')
                    out_eval.write(' {}+{}'.format(split_word[0], split_word[2]))
            out_eval.write('\n')
            eval_trans_line = eval_trans.readline()
    sort_file(out_train_fname)
    sort_file(out_eval_fname)

data_dir = './data'
train_dir = f'{data_dir}/train'
eval_dir = f'{data_dir}/eval'
make_text(train_dir, eval_dir)
```
- lexicon.txt: a text file that acts as a word dictionary, with one `word+part of speech  phonetic symbols` entry per line. The alphabetic phonetic symbols are generated from the katakana reading.
- To create it, every word that appears in the `transcript` file is written to `lexicon.txt` without duplicates, and phonetic symbols are then attached using the `kana2phone` file provided with the CSJ recipe.
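The finished lexicon.txt should contain lines roughly like the following (illustrative only; the actual phone symbols are whatever the CSJ recipe's kana2phone produces):

```
から+助詞/格助詞 k a r a
水+名詞/一般 m i z u
買わ+動詞/自立 k a w a
```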
```python
def make_lexicon(train_dir, lexicon_dir):
    '''
    lexicon: format -> 'word'+'part of speech' phonetic symbols
    '''
    transcript_fname = f'{train_dir}/transcript'
    out_lexicon_fname = f'{lexicon_dir}/lexicon.txt'
    out_lexicon_htk_fname = f'{lexicon_dir}/lexicon_htk.txt'
    with open(transcript_fname, 'r') as trans, open(out_lexicon_fname, 'w') as out:
        trans_line = trans.readline()
        while trans_line:
            # skip the utterance ID and write one word per line
            split_trans_line = trans_line.strip().split(' ')[1:]
            for word in split_trans_line:
                if word != '<sp>':
                    out.write(word + '\n')
            trans_line = trans.readline()
    # remove duplicates, then attach phonetic symbols with the scripts from the CSJ recipe
    subprocess.call(f'sort -u {out_lexicon_fname} > {out_lexicon_htk_fname}', shell=True)
    subprocess.call(f'./local/csj_make_trans/vocab2dic.pl -p local/csj_make_trans/kana2phone -e ./data/lexicon/ERROR_v2d -o {out_lexicon_fname} {out_lexicon_htk_fname}', shell=True)
    subprocess.call(f"cut -d'+' -f1,3- {out_lexicon_fname} >{out_lexicon_htk_fname}", shell=True)
    subprocess.call(f"cut -f1,3- {out_lexicon_htk_fname} | perl -ape 's:\t: :g' >{out_lexicon_fname}", shell=True)

data_dir = './data'
train_dir = f'{data_dir}/train'
lexicon_dir = f'{data_dir}/lexicon'
# make lexicon from data/train/transcript
make_lexicon(train_dir, lexicon_dir)
```
- lexicon.txt is built only from the training text (`jsut/data/train/transcript`); if the test text were also used, the evaluation would no longer be a fair one.
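One consequence is that any eval-set word that never appears in the training text is out of vocabulary. A rough check is sketched below (my own addition, not part of the original procedure; the paths follow the directory layout used above):

```python
# count eval tokens that are missing from the lexicon (OOV)
with open('data/lexicon/lexicon.txt') as f:
    lexicon_entries = {line.split(' ')[0] for line in f}   # first field is 'word+POS'

oov, total = 0, 0
with open('data/eval/text') as f:
    for line in f:
        for token in line.strip().split(' ')[1:]:           # skip the utterance ID
            if token == '<sp>':
                continue
            total += 1
            if token not in lexicon_entries:
                oov += 1
print(f'OOV tokens in eval: {oov}/{total}')
```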
- utt2spk: a text file that stores pairs of utterance ID and speaker ID. The utterance IDs are the same ones used in the `text` and `wav.scp` files. In the JSUT corpus, the utterance ID and the file ID are identical, because each audio file contains exactly one utterance; in a corpus like CSJ, one audio file contains multiple utterances, so utterance ID = file ID does not hold.
- Since JSUT is recorded by a single speaker, there is only one speaker ID. Kaldi prints a warning when there is only one speaker ID, but it can be ignored. That said, Kaldi has built-in speaker adaptation, so it might work better with multiple speaker IDs (for example, artificially assigned ones).
- Creating the file is simply a matter of reading the utterance IDs from `jsut/data/train/text` (and `jsut/data/eval/text`).
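The resulting utt2spk should look like this (utterance IDs are just examples):

```
BASIC5000_0051 jsut_speaker
BASIC5000_0053 jsut_speaker
BASIC5000_0082 jsut_speaker
...
```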
```python
def make_utt2spk(dir):
    '''
    In JSUT corpus, speaker number is one person.
    It is not good for training Acoustic Model.
    '''
    text_fname = f'{dir}/text'
    out_utt2spk_fname = f'{dir}/utt2spk'
    speaker_id = "jsut_speaker"
    with open(text_fname, 'r') as text, open(out_utt2spk_fname, 'w') as out:
        text_line = text.readline()
        while text_line:
            utt_id = text_line.split(' ')[0]
            out.write(f'{utt_id} {speaker_id}\n')
            text_line = text.readline()

data_dir = './data'
train_dir = f'{data_dir}/train'
eval_dir = f'{data_dir}/eval'
# make utt2spk
make_utt2spk(train_dir)
make_utt2spk(eval_dir)
```
- spk2utt: the inverse of utt2spk. It can easily be created from utt2spk.
```python
def make_spk2utt(dir):
    utt2spk_fname = f'{dir}/utt2spk'
    out_spk2utt_fname = f'{dir}/spk2utt'
    with open(utt2spk_fname, 'r') as utt2spk, open(out_spk2utt_fname, 'w') as out:
        speaker_utt_dict = {}  # {'speaker_id': [utt_id, ...]}
        utt2spk_line = utt2spk.readline()
        while utt2spk_line:
            split_utt2spk_line = utt2spk_line.strip().split(' ')
            utt_id = split_utt2spk_line[0]
            spk_id = split_utt2spk_line[1]
            if spk_id in speaker_utt_dict:
                speaker_utt_dict[spk_id].append(utt_id)
            else:
                speaker_utt_dict[spk_id] = [utt_id]
            utt2spk_line = utt2spk.readline()
        # one line per speaker: SPK_ID UTT_ID UTT_ID ...
        for spk_id, utt_id_list in speaker_utt_dict.items():
            out.write(f'{spk_id}')
            for utt_id in utt_id_list:
                out.write(f' {utt_id}')
            out.write('\n')

data_dir = './data'
train_dir = f'{data_dir}/train'
eval_dir = f'{data_dir}/eval'
# make spk2utt
make_spk2utt(train_dir)
make_spk2utt(eval_dir)
```
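spk2utt ends up as a single line that lists every utterance for the one speaker (example IDs only):

```
jsut_speaker BASIC5000_0051 BASIC5000_0053 BASIC5000_0082 ...
```

Before moving on, it is worth sanity-checking the finished directories. This is my own addition rather than part of the original procedure, and it assumes the standard `utils/` symlink that the copied `s5` directory provides:

```python
import subprocess

# validate wav.scp, text, utt2spk and spk2utt (--no-feats because features are not extracted yet);
# utils/fix_data_dir.sh can repair sorting or duplicate problems if any are reported
for subset in ['data/train', 'data/eval']:
    subprocess.call(f'utils/validate_data_dir.sh --no-feats {subset}', shell=True)
```

Kaldi also ships `utils/utt2spk_to_spk2utt.pl` and `utils/spk2utt_to_utt2spk.pl`, which should produce the same utt2spk/spk2utt files as the Python above.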
- Clone Kaldi from GitHub, install the necessary tools, and first create a directory for jsut.

```
# move to the directory that holds the CSJ recipe
cd /home/kaldi/egs/csj
# copy the s5 directory under the name jsut (be sure to use the -a option)
cp -a s5 jsut
```
- Training on the JSUT corpus does not work with the CSJ recipe as-is, so some of the programs must be changed.
- nnet3 TDNN + chain is used for training.
- ***Since the JSUT corpus is used this time, `run.sh` and a few other files have to be modified.***
  - With the CSJ corpus you would simply execute the recipe's `run.sh`, which prepares the data, trains the acoustic model and the language model, and runs the evaluation.
  - Besides run.sh, the following files need to be changed:
    - `jsut/run.sh`
      - remove the CSJ-specific parts
      - add the code that creates wav.scp, text, etc. for JSUT
    - `jsut/local/csj_prepare_dict.sh`
      - remove the CSJ-specific parts
      - add the path to the lexicon built for JSUT
    - `jsut/local/chain/run_tdnn.sh`
      - change the parameters of the parallel-processing parts
    - `jsut/local/nnet3/run_ivector_common.sh`
      - change the parameters of the parallel-processing parts
    - `jsut/steps/online/nnet2/train_ivector_extractor.sh`
      - change the parameters of the parallel-processing parts
  - Because the JSUT corpus has only one speaker, the number of parallel jobs basically cannot be increased; with many speakers you could raise the degree of parallelism.
  - The modified programs are on GitHub for reference. The programs that create `wav.scp` etc. are also included, so place them under `kaldi/egs/csj/jsut`. After changing the five files above, all that is left is to run run.sh.
- Programs
  - Files to place under `kaldi/egs/csj/jsut`:
    - prepare_data.py
    - word_reading.txt
  - A directory called `src/prepared_data` is also included.
- After prepare_data.py runs, a directory `jsut/data/lexicon` is created along with a file called `ERROR_v2d`, which lists the words to which phonemes could not be attached. These words have to be corrected by hand. Just in case, a corrected version is included in `prepare_data`; you can use it by replacing `jsut/data/lexicon/lexicon.txt` with it.
- Parallel processing can speed up training, but training may fail if the CPU does not have enough threads. For now the parameters are set to use as little parallelism as possible, so it takes time, but a reasonably capable PC should be able to finish training.
- Parameter that controls the number of parallel jobs (set in run.sh):
  - `--nj N`
  - change the `N` part
- When training on a GPU, the GPU has to be put into exclusive mode by running `sudo nvidia-smi -c 3`.
- With only about 10 hours of data, the model does not learn well at all.
  - WER = 70.78
  - With the CSJ corpus (about 600 hours), the WER is around `0.09`.