Goal

This article summarizes phoneme prediction using the Microsoft Cognitive Toolkit (CNTK).

In Part 1, we will prepare for phoneme prediction.

The topics are covered in the following order.

Download ATR sample audio dataset
Creating features and saving HTK format files
Creating a file to be read by the built-in reader provided by CNTK

Introduction

Download ATR sample audio dataset

The ATR sample speech dataset [1] is an utterance dataset composed of the phoneme-balanced sentences of the ATR database.

atr_503_v1.0.tar.gz

Download atr_503_v1.0.tar.gz from the link above and extract it. The audio data is in the .ad files under the speech directory, and the phoneme labels used this time are the .lab files in the old directory under label/monophone.

The directory structure this time is as follows.

CTCR
　|―atr_503
　　|―label
　　|―speech
　　|―...
　ctcr_atr503.py
MGCC

Creating features and saving HTK format files

The audio data is stored as big-endian signed 16-bit integers at a sampling frequency of 16,000 Hz, so the values are divided by the maximum value $2^{16}/2 - 1 = 32767$ to normalize them to the range [-1, 1].
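The normalization described above can be sketched as follows. This is a minimal illustration that uses only the standard library and assumes the .ad file is headerless raw PCM; the function name `read_ad_samples` is hypothetical and not taken from the published program.

```python
import struct

def read_ad_samples(raw):
    """Decode big-endian signed 16-bit PCM bytes and scale them to about [-1, 1].

    Assumption: `raw` holds headerless PCM data (no file header to skip).
    """
    n = len(raw) // 2                      # two bytes per sample
    samples = struct.unpack(f">{n}h", raw[: 2 * n])
    # Dividing by 32767 maps the largest positive value to exactly 1.0.
    # The single value -32768 would land marginally below -1.0.
    return [s / 32767.0 for s in samples]

# Example with three hand-made samples instead of a real .ad file.
raw = struct.pack(">3h", 32767, 0, -32767)
print(read_ad_samples(raw))
```

With NumPy, `np.frombuffer(raw, dtype=">i2") / 32767` performs the same decoding in one step.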

This time, Mel-frequency cepstral coefficients (MFCC) are calculated from the audio data. The number of coefficients used is 13.

As preprocessing, high-frequency emphasis (pre-emphasis) is applied to the audio data. The first and second derivatives of the MFCC are also appended, giving 39-dimensional features in total.

The created features are saved as binary files in HTK (Hidden Markov Model Toolkit) format.
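An HTK parameter file is a 12-byte big-endian header followed by the frames as 32-bit floats. The sketch below writes such a file with the standard library. The function name `write_htk` is hypothetical, and two defaults are assumptions rather than values confirmed by the published program: a sample period of 100,000 (10 ms in units of 100 ns) and the parameter kind MFCC_D_A.

```python
import struct

MFCC = 6          # HTK base parameter kind for MFCC
HAS_DELTA = 0o400   # _D qualifier
HAS_ACCEL = 0o1000  # _A qualifier

def write_htk(path, frames, samp_period=100000, parm_kind=MFCC | HAS_DELTA | HAS_ACCEL):
    """Write `frames` (a list of equal-length float lists) as an HTK parameter file."""
    n_frames = len(frames)
    dim = len(frames[0])
    with open(path, "wb") as f:
        # Header: nSamples (int32), sampPeriod in 100 ns units (int32),
        # sampSize in bytes per frame (int16), parmKind (int16).
        f.write(struct.pack(">iihh", n_frames, samp_period, 4 * dim, parm_kind))
        for frame in frames:
            f.write(struct.pack(f">{dim}f", *frame))
```

For 39-dimensional features, each frame occupies 156 bytes, so a file with N frames is 12 + 156 N bytes long.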

Creating a file to be read by the built-in reader provided by CNTK

For this training, we will use HTKDeserializer and HTKMLFDeserializer, two of the built-in readers specialized for speech recognition.

The general processing flow of the program that prepares for phoneme prediction is as follows.

Splitting the data into training and validation sets
Creating a list file of phoneme labels
Generating features, saving them as HTK format files, and writing the frames and phoneme labels
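The first step can be sketched as a shuffled hold-out split. The 10% validation ratio and the fixed seed are assumptions chosen for illustration; the published program may split the data differently (for example with scikit-learn). The helper name `split_train_val` is hypothetical.

```python
import math
import random

def split_train_val(items, val_ratio=0.1, seed=0):
    """Shuffle `items` reproducibly and hold out a fraction for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = math.ceil(len(items) * val_ratio)  # round the hold-out size up
    return items[n_val:], items[:n_val]

# 503 utterance indices stand in for the actual file names.
train, val = split_train_val(range(503))
print(len(train), len(val))  # prints: 452 51
```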

Implementation

Execution environment

hardware

・CPU Intel(R) Core(TM) i7-6700K 4.00GHz

software

・Windows 10 Pro 1909
・Python 3.6.6
・Librosa 0.8.0
・Numpy 1.19.2
・Pandas 1.1.2
・Scikit-learn 0.23.2
・Scipy 1.5.2

Program to run

The implemented program is published on GitHub.

`ctcr_atr503.py`

Commentary

This section supplements the key points of the program.

High frequency enhancement filter

The power of speech attenuates toward higher frequencies, so high-frequency emphasis is applied to compensate. With frequency $f$ and sampling frequency $f_s$, the first-order finite impulse response (FIR) filter $H(z)$ used as the high-pass filter is expressed by the following equation.

$$
H(z) = 1 - \alpha z^{-1}, \qquad z = \exp(j \omega), \quad \omega = 2 \pi f / f_s
$$

Generally, $\alpha = 0.97$ is used.
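In the time domain, this filter is the difference equation $y[n] = x[n] - \alpha x[n-1]$. A minimal sketch follows; passing the first sample through unchanged is an assumption about the boundary handling, and the function name `pre_emphasis` is hypothetical.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] to a list of samples.

    Assumption: the first sample has no predecessor and is kept as is.
    """
    if not signal:
        return []
    return [signal[0]] + [x - alpha * prev for prev, x in zip(signal, signal[1:])]

# A constant input is almost removed; an alternating input is amplified.
print(pre_emphasis([1.0, 1.0, 1.0]))
print(pre_emphasis([1.0, -1.0, 1.0]))
```

This matches the frequency response: a constant input (0 Hz) is scaled by $1 - \alpha = 0.03$, while an input alternating at the Nyquist frequency $f_s / 2$ is scaled by $1 + \alpha = 1.97$.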

Mel frequency cepstrum

The mel-frequency cepstrum is obtained by converting the power spectrum of the mel spectrogram, used in Speech Recognition: Genre Classification Part1 - GTZAN Genre Collections, to decibels and then applying the discrete cosine transform.
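The last two steps, decibel conversion and the discrete cosine transform, can be sketched for a single frame as follows. The mel-band power values are assumed to be computed already, and the helper names are hypothetical. The sketch uses an orthonormal DCT-II and omits refinements such as clipping the dynamic range.

```python
import math

def power_to_db(power, floor=1e-10):
    """Convert power values to decibels, flooring tiny values to avoid log(0)."""
    return [10.0 * math.log10(max(p, floor)) for p in power]

def dct_ii(values, n_coeffs):
    """Orthonormal DCT-II of `values`, keeping the first `n_coeffs` coefficients."""
    n = len(values)
    out = []
    for k in range(n_coeffs):
        s = sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def mfcc_from_mel_power(mel_power, n_mfcc=13):
    """Cepstral coefficients of one frame of mel-band power values."""
    return dct_ii(power_to_db(mel_power), n_mfcc)
```

Because the low-order coefficients describe the smooth envelope of the log spectrum, keeping only the first 13 discards most of the fine structure.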

Cepstrum [2] is an anagram of "spectrum". It separates the fine fluctuations in the spectrum from the gradual ones and can represent the characteristics of the human vocal tract.

To capture how the features change over time, the difference between adjacent frames is also added as a feature. This is called the delta cepstrum [3]; this time, both the first and second derivatives are calculated and used as features.
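A minimal sketch of the delta computation is shown below, using a central difference with the edge frames repeated. This is an illustrative simplification: library routines such as `librosa.feature.delta` use a smoothed estimate over a wider window, and the published program may differ. The function names are hypothetical.

```python
def delta(frames):
    """Central difference over time; `frames` is a list of per-frame vectors."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]       # repeat the first frame at the start
        nxt = frames[min(t + 1, n - 1)]    # repeat the last frame at the end
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def add_deltas(frames):
    """Append first and second derivatives to each frame (13 -> 39 dimensions)."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [c + a + b for c, a, b in zip(frames, d1, d2)]

# Five frames of 13 coefficients that grow linearly over time.
frames = [[float(t)] * 13 for t in range(5)]
features = add_deltas(frames)
print(len(features), len(features[0]))  # prints: 5 39
```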

HTKDeserializer and HTKMLFDeserializer

HTKDeserializer and HTKMLFDeserializer, two of CNTK's built-in readers, require three files: a list file, a script file, and a master label file.

The list file must contain each phoneme label that is used exactly once, one per line, as shown below. In addition, add _ as the blank symbol.

`atr503_mapping.list`


A
E
...
z
_
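Such a list can be built by collecting the unique labels and appending the blank symbol. The sketch below is an assumption about how this might be done; sorting the labels and placing `_` last are choices that happen to match the ordering of the example above, and the helper names are hypothetical.

```python
def build_mapping(label_sequences, blank="_"):
    """Return the unique phoneme labels in sorted order, with the blank symbol last."""
    phonemes = sorted({p for seq in label_sequences for p in seq})
    return phonemes + [blank]

def write_mapping(path, labels):
    """Write one label per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(labels) + "\n")

# Two toy label sequences instead of the real .lab files.
print(build_mapping([["sil", "a", "A"], ["z", "a"]]))
```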

The script file looks as follows. Write the path where the HTK format file is saved on the right side of the equals sign, followed by the frame range in brackets. Note that the frame range must start at 0 and end at the number of frames minus 1.

`train_atr503.scp`


train_atr503/00000.mfc=./train_atr503/00000.htk[0,141]
train_atr503/00001.mfc=./train_atr503/00001.htk[0,258]
...
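One line of the script file can be generated as in the following sketch. The function name `scp_line` is hypothetical; the format itself follows the example above.

```python
def scp_line(utt_id, htk_path, n_frames):
    """Format one script-file line; the frame range is inclusive and zero-based."""
    return f"{utt_id}.mfc={htk_path}[0,{n_frames - 1}]"

# An utterance with 142 frames produces the range [0,141].
print(scp_line("train_atr503/00000", "./train_atr503/00000.htk", 142))
```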

The left side of the equals sign in the script file must correspond to an entry in the master label file. The master label file looks as follows: the first line is the #!MLF!# header, and the frame and phoneme labels start from the second line. Each label must span at least one frame, and frame indices are written with five zeros appended (that is, multiplied by 100,000), as the format requires. The labels of each utterance end with a line containing only a period.

`train_atr503.mlf`


#!MLF!#
"train_atr503/00000.lab"
0 1600000 sil
1600000 1800000 h
...
13600000 14200000 sil
.
"train_atr503/00001.lab"
0 400000 sil
400000 1100000 s
...
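The entries above can be produced from frame-level segments as in the following sketch. The function names are hypothetical, and the segments are assumed to be given as (start frame, end frame, phoneme) with an exclusive end frame.

```python
def mlf_entry(utt_id, segments, frame_unit=100000):
    """Format one utterance; `segments` is a list of (start, end, phoneme) in frames."""
    lines = [f'"{utt_id}.lab"']
    for start, end, phoneme in segments:
        if end - start < 1:
            raise ValueError("each label must span at least one frame")
        # Appending five zeros is the same as multiplying by 100,000.
        lines.append(f"{start * frame_unit} {end * frame_unit} {phoneme}")
    lines.append(".")  # a single period closes the utterance
    return "\n".join(lines)

def mlf_text(entries):
    """Join utterance entries under the #!MLF!# header."""
    return "#!MLF!#\n" + "\n".join(entries) + "\n"

print(mlf_text([mlf_entry("train_atr503/00000", [(0, 16, "sil"), (16, 18, "h")])]))
```

The final end time divided by 100,000 should equal the number of frames of the matching script-file entry; in the examples above, 14200000 corresponds to 142 frames and therefore to the range [0,141].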

result

When you run the program, the features are generated and saved as binary files in HTK format. At the same time, the frames and phoneme labels are written.

Number of labels : 43


Number of samples 452

Number of samples 51

Now that we are ready to train, we will make phoneme predictions in Part 2.

reference

CNTK 208: Training Acoustic Model with Connectionist Temporal Classification (CTC) Criteria

Speech Recognition : Genre Classification Part1 - GTZAN Genre Collections

[1] Yoshiro Yoshida, Takeo Fukurotani, and Toshiyuki Takezawa. "ATR Speech Database", Proceedings of the Japanese Society for Artificial Intelligence National Convention 0 (2002): pp. 189-189.
[2] B. P. Bogert, M. J. R. Healy, and J. W. Tukey. "The quefrency analysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe cracking", in Proceedings of the Symposium on Time Series Analysis, Wiley, pp. 209-243 (1963).
[3] S. Furui. "On the role of dynamic characteristics of speech spectra for syllable perception", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 1, pp. 52-59 (1986).
[4] Koichi Shinoda. "Machine Learning Professional Series Speech Recognition", Kodansha, 2017.

Speech Recognition: Phoneme Prediction Part1 --ATR Speech dataset