In the previous article, vowels were recognized using formant analysis. This article is a continuation of that one: through an internship at Sciseed Inc. I learned about MFCC, which is often used for speech recognition, and verified its classification accuracy, so I would like to summarize the results here.
- Background
- What is MFCC?
- MFCC derivation process
- Implementation of the MFCC derivation program
- About librosa
- Derivation of MFCC
- Classification of phonemes using MFCC
- Implementation of a program for continuous speech recognition
- Regarding dynamic differences
- Derivation of first- and second-order differences
- Classification of phonemes including second-order difference features
- Discussion
- Summary
- References
In the formant analysis of the previous article, the spectral envelope was obtained, and the frequencies emphasized in it, labeled F1 and F2 from the lowest, were used as features for vowel analysis. Japanese has only five vowel phonemes, and formant analysis was effective because they can basically be classified by the shape of the vocal tract. However, phonemes in sentences and consonants cannot be expressed by formant features alone, because of their dynamic changes and the number of and differences between the organs used to produce them. Therefore, this time we classify phonemes using MFCC, which expresses the envelope of the vocal-tract spectrum in more detail, and its dynamic differences.
Formants are the resonance frequencies that are emphasized as speech passes through the vocal tract; they change with the shape of the vocal tract and tongue, which is why they are used for vowel analysis. MFCC transforms the speech according to human auditory characteristics and expresses the spectral envelope with more features than formant analysis, so it can capture phoneme characteristics that formant analysis cannot. More precisely, MFCC (mel-frequency cepstral coefficients) is the cepstrum of the logarithmic power spectrum of the mel filter bank outputs, which is explained in detail below.
The procedure for deriving MFCC is as follows.

- Pre-processing: high-frequency emphasis (pre-emphasis) and windowing
- Compute the power spectrum with the fast Fourier transform
- Apply the mel filter bank and take the logarithm of its outputs
- Apply the inverse discrete cosine transform and keep the low-order coefficients as the MFCC

I will explain each step in detail.
First, regarding pre-processing: the power of speech is attenuated at higher frequencies, so high-frequency emphasis (pre-emphasis) is applied to compensate for this. In addition, since the cut-out data is discontinuous at its edges, a window function is applied so that both ends of the waveform are attenuated. Because the loudness (which frequencies are emphasized) changes with the phoneme, the power spectrum is used (the amplitude spectrum is also a way of expressing sound pressure, so some people may analyze that instead).
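To make the pre-processing concrete, here is a minimal NumPy/SciPy sketch of pre-emphasis, windowing, and the power spectrum for a single frame. It only illustrates the steps described above: the coefficient 0.97 and the 512-point FFT match values used later in this article, and the function name is my own.

```python
import numpy as np
import scipy.signal

def preprocess_power_spectrum(frame, p=0.97, n_fft=512):
    #High-frequency emphasis (pre-emphasis) with the FIR coefficients (1.0, -p)
    emphasized = scipy.signal.lfilter([1.0, -p], 1, frame)
    #Window so that both ends of the cut-out waveform are attenuated
    windowed = emphasized * np.hamming(len(emphasized))
    #Power spectrum via the FFT
    spectrum = np.fft.rfft(windowed, n=n_fft)
    return np.abs(spectrum) ** 2
```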
Human hearing varies with frequency: the higher the frequency, the harder it is to distinguish differences in pitch. The mel scale is an experimentally determined mapping between actual frequency and perceived pitch. Filters are created so that they have an even width when the frequency axis is viewed on the mel scale, one filter per mel filter bank channel. Using these, the power spectrum of the speech waveform obtained by the fast Fourier transform is grouped into as many bands as there are mel filters.
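For reference, a commonly used definition of the mel scale is $ m = 2595 \log_{10}(1 + f/700) $, and librosa can generate the corresponding triangular filters directly. A minimal sketch (the 16 kHz sampling rate, 512-point FFT, and 20 filters are assumptions for this example):

```python
import numpy as np
import librosa

def hz_to_mel(f):
    #A common definition of the mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

#Triangular filters that are evenly spaced on the mel-scale frequency axis
mel_fb = librosa.filters.mel(sr=16000, n_fft=512, n_mels=20)
#Multiplying a power spectrum by mel_fb groups the FFT bins into 20 mel bands
#mel_power = mel_fb @ power_spectrum   # power_spectrum: shape (257,)
```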
There are two reasons for turning the output of the mel filter bank (a power spectrum) into a logarithmic power spectrum. First, the loudness perceived by humans is proportional to the logarithm of the sound pressure: the louder the sound, the harder it is to perceive differences in loudness. The spectrum must be logarithmic to match this auditory characteristic. Second, since speech at a given time is the glottal (source) wave convolved with the impulse response of the vocal tract, taking the logarithmic power spectrum separates it into a linear sum of the glottal and vocal-tract spectra. How it separates into a linear sum is explained below.
Let $ y(n) $ be the speech signal at a certain time, $ v(n) $ the glottal wave at that time, and $ h(n) $ the impulse response of the vocal tract. Since the speech signal is the convolution of the two, taking the Fourier transform of the speech signal gives

\begin{align}
Y(k) = V(k)H(k)
\end{align}

At this time, writing $ V_{k} = |V(k)|^2 $ and $ H_{k} = |H(k)|^2 $, the power spectrum $ S(k) $ of the speech signal is

\begin{align}
S(k) = |Y(k)|^2 = V_{k}H_{k}
\end{align}

so taking the logarithm turns the product into a linear sum:

\begin{align}
\log S(k) = \log V_{k} + \log H_{k}
\end{align}
The cepstrum is obtained by treating this logarithmic power spectrum as if it were a time signal and applying the inverse discrete Fourier transform to it. Let's follow the derivation of the cepstrum using the logarithmic power $ \log S(k) $ from above (the formula for the inverse discrete Fourier transform can be found here). Since the cepstrum $ c(n) $ is the inverse discrete Fourier transform of the logarithmic power,
\begin{align}
c(n) &= \frac{1}{N}\sum_{k=0}^{N-1}\log S_{k}\,e^{i\frac{2\pi kn}{N}}\\
&= \frac{1}{N}\sum_{k=0}^{N-1}\left[\log S_{k} \cos \left(\frac{2\pi kn}{N}\right) + i \log S_{k} \sin \left(\frac{2\pi kn}{N}\right)\right]
\end{align}
In addition, this power spectrum is symmetric about the Nyquist frequency, that is,

\begin{align}
S_{k} = S_{N-k}
\end{align}

so the sine terms cancel out and only the cosine terms remain. Therefore, the cepstrum $ c(n) $ can be expressed by the inverse discrete cosine transform as follows.
\begin{align}
c(n) &= \frac{1}{N}\sum_{k=0}^{N-1}\log S_{k} \cos \left(\frac{2\pi kn}{N}\right)\\
&= \frac{1}{N}\sum_{k=0}^{N-1}\log V_{k} \cos \left(\frac{2\pi kn}{N}\right)
+ \frac{1}{N}\sum_{k=0}^{N-1}\log H_{k} \cos \left(\frac{2\pi kn}{N}\right)
\end{align}
Since $ v(n) $ is the glottal wave, its log spectrum contains many complicated, rapid changes; on the other hand, $ h(n) $ is the impulse response of the vocal tract, so its log spectrum changes smoothly. Therefore, the low-order components of the cepstrum $ c(n) $ represent the vocal-tract spectrum $ H_{k} $.
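Putting the above together, here is a minimal sketch of the whole derivation: the log power of the mel filter bank outputs followed by a DCT (the usual stand-in for the inverse DFT of a symmetric spectrum), keeping the low-order coefficients. The function and variable names and the choice of 13 coefficients are assumptions for this example; it continues from the `preprocess_power_spectrum` and `mel_fb` sketches above.

```python
import numpy as np
import scipy.fft

def simple_mfcc(power_spectrum, mel_fb, n_ceps=13):
    #Mel filter bank output -> logarithmic power spectrum
    log_mel_power = np.log(mel_fb @ power_spectrum + 1e-10)  # offset avoids log(0)
    #DCT of the log spectrum = cepstrum; the low-order terms describe the smooth
    #vocal-tract envelope, the high-order terms the glottal source
    cepstrum = scipy.fft.dct(log_mel_power, type=2, norm="ortho")
    return cepstrum[:n_ceps]

#coeffs = simple_mfcc(preprocess_power_spectrum(frame), mel_fb)
```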
librosa is a Python package for analyzing music and audio. In this program it is used to compute the MFCC and the logarithmic power. For details, please refer to the documentation.
The following packages are used.
#Package to use
import cis
import librosa
import sklearn
import numpy as np
from collections import defaultdict
import scipy.signal
Install librosa in your environment with `pip install --upgrade sklearn librosa`.
In addition, `cis`, which is used in this program, is a package described in the book "Practical image and audio processing learned in Python". Either use the version available here, or substitute another package that can read audio files.
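If you do not have `cis`, one possible substitute (my assumption, not the author's setup) is to read the file with librosa itself; `librosa.load` returns the waveform and the sampling frequency in the same order as `cis.wavread`:

```python
import librosa

#sr=None keeps the file's own sampling rate (16 kHz here) instead of resampling
v, fs = librosa.load("wav/sound-001.wav", sr=None)
```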
The program up to MFCC derivation is as follows.
mfcc.py
#Use the average of the features over each speech segment as one vector
mfcc_data = []
boin_list = ["a","i","u","e","o"]
nobashi_boin = ["a:","i:","u:","e:","o:"]
remove_list = ["silB","silE","sp"]

#High-frequency emphasis (pre-emphasis)
def preEmphasis(wave, p=0.97):
    #FIR filter with coefficients (1.0, -p)
    return scipy.signal.lfilter([1.0, -p], 1, wave)

#MFCC calculation
def mfcc(wave):
    mfccs = librosa.feature.mfcc(wave, sr=fs, n_fft=512)
    mfccs = np.average(mfccs, axis=1)
    #Flatten into a one-dimensional list
    mfccs = mfccs.flatten()
    mfccs = mfccs.tolist()
    #Drop the 0th coefficient and keep only the 1st to 12th dimensions
    mfccs.pop(0)
    mfccs = mfccs[:12]
    #Prepend the phoneme label (set in the loop below)
    mfccs.insert(0, label)
    return mfccs

#Read the data and calculate the MFCC for each phoneme (data from 500 files is used)
for i in range(1, 500, 1):
    data_list = []
    open_file = "wav/sound-"+str(i).zfill(3)+".lab"
    filename = "wav/sound-"+str(i).zfill(3)  #Sampling frequency is 16kHz
    v, fs = cis.wavread(filename+".wav")
    with open(open_file, "r") as f:
        data = f.readline().split()
        while data:
            data_list.append(data)
            data = f.readline().split()
    for j in range(len(data_list)):
        label = data_list[j][2]
        if label in boin_list:
            start = int(fs * float(data_list[j][0]))
            end = int(fs * float(data_list[j][1]))
            voice_data = v[start:end]
            #Skip segments that are too short to analyze properly
            if end - start <= 512:
                continue
            #Hamming window
            hammingWindow = np.hamming(len(voice_data))
            voice_data = voice_data * hammingWindow
            p = 0.97
            voice_data = preEmphasis(voice_data, p)
            mfcc_data.append(mfcc(voice_data))
As explained earlier, the higher-order cepstral components mainly represent the glottal source, so for phoneme recognition usually only about 12 dimensions are used. The 0th dimension is excluded because it represents the DC component (overall power) of the data. In addition, the lab files opened by this program contain the start position, end position, and phoneme type of each phoneme, produced by the julius segmentation-kit. The previous article has an example, so please check it if you are interested.
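For reference, such a lab file has one segment per line in the form `start end phoneme` (times in seconds). A hypothetical example (the times and phonemes here are made up):

```
0.0000000 0.2925000 silB
0.2925000 0.3525000 k
0.3525000 0.4275000 o
0.4275000 0.5100000 N
```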
Next, I would like to classify the phonemes with an SVM (Support Vector Machine) and compare the results.
svm.py
#Required packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
#Data set loading
df = pd.DataFrame(mfcc_data)
x = df.iloc[:,1:]#Features obtained with mfcc
y = df.iloc[:,0]#Vowel label
#Convert the labels to integers
label = set(y)
label_list = list(label)
label_list.sort()
for i in range(len(label_list)):
    y[y == label_list[i]] = i
y = np.array(y, dtype = "int")
#Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state = 1)
#Data standardization
sc = StandardScaler()
sc.fit(x_train)
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)
#Instantiate an SVM
model_linear = SVC(kernel='linear', random_state = 1)
model_poly = SVC(kernel = "poly", random_state = 1)
model_rbf = SVC(kernel = "rbf", random_state =1)
model_linear.fit(x_train_std, y_train)
model_poly.fit(x_train_std, y_train)
model_rbf.fit(x_train_std, y_train)
pred_linear_train = model_linear.predict(x_train_std)
pred_poly_train = model_poly.predict(x_train_std)
pred_rbf_train = model_rbf.predict(x_train_std)
accuracy_linear_train =accuracy_score(y_train, pred_linear_train)
accuracy_poly_train =accuracy_score(y_train, pred_poly_train)
accuracy_rbf_train =accuracy_score(y_train, pred_rbf_train)
print("train_result")
print("Linear : "+str(accuracy_linear_train))
print("Poly : "+str(accuracy_poly_train))
print("RBF : "+ str(accuracy_rbf_train))
pred_linear_test = model_linear.predict(x_test_std)
pred_poly_test = model_poly.predict(x_test_std)
pred_rbf_test = model_rbf.predict(x_test_std)
accuracy_linear_test = accuracy_score(y_test, pred_linear_test)
accuracy_poly_test = accuracy_score(y_test, pred_poly_test)
accuracy_rbf_test = accuracy_score(y_test, pred_rbf_test)
print("-"*40)
print("test_result")
print("Linear : "+str(accuracy_linear_test))
print("Poly : "+str(accuracy_poly_test))
print("RBF : "+ str(accuracy_rbf_test))
librosa's mfcc function splits the input phoneme into multiple frames. This time I compared using only the MFCC of the center frame of the phoneme with using the average of the MFCCs over the whole phoneme label; the average performed better, so those are the results posted in this article. The results of the previous formant analysis are also included for comparison. About 8,200 vowel phonemes are used for classification.
Classification results by formant analysis
train_result
Linear : 0.8109896432681243
Poly : 0.7206559263521288
RBF : 0.8550057537399309
----------------------------------------
test_result
Linear : 0.7825503355704698
Poly : 0.6932885906040268
RBF : 0.8308724832214766
Classification result by MFCC
train_result
Linear : 0.885286271290786
Poly : 0.9113482454340243
RBF : 0.9201723784116561
----------------------------------------
test_result
Linear : 0.8833487226839027
Poly : 0.8913511849799939
RBF : 0.9039704524469068
Compared with formant analysis, recognition accuracy improved by **about 7%**. In addition, thanks to the larger number of features, the linear kernel also achieved reasonable accuracy. Consonants were classified in the same way.
Results of formant analysis
train_result
Linear : 0.2290266367936271
Poly : 0.20114513318396812
RBF : 0.31292008961911877
----------------------------------------
test_result
Linear : 0.22357723577235772
Poly : 0.1991869918699187
RBF : 0.30720092915214864
Results of MFCC
train_result
Linear : 0.5635076681085333
Poly : 0.647463625639009
RBF : 0.679315768777035
----------------------------------------
test_result
Linear : 0.5396638159834857
Poly : 0.5364199351223827
RBF : 0.6004128575641404
The amount of data differs: about 5,000 consonant samples for the formant analysis and about 8,500 for the MFCC results. Compared with formant analysis the recognition accuracy is greatly improved, but it is still not good enough for a practical recognizer.
In actual continuous speech recognition, analysis such as MFCC extraction is performed on each speech frame, and recognition is carried out on that frame. Therefore, MFCC features averaged over each phoneme interval, as used above, cannot actually be used for classification.
In general, speech recognition uses, in addition to the MFCC features, the logarithm of the sum of the mel filter bank outputs (hereafter the logarithmic power), together with the first-order differences (dynamic differences) and second-order differences of both, for a total of 39 dimensions (MLP series: Speech Recognition). Using dynamic differences makes it possible to capture the characteristics of dynamically changing sounds such as consonants and semivowels. As for the power, consonants change more rapidly than vowels, so taking its dynamic difference also improves recognition accuracy.
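As a point of comparison with the hand-rolled difference computation implemented below, librosa also provides `librosa.feature.delta`. Here is a minimal sketch of the commonly used 39-dimensional stack (13 static coefficients plus their Δ and ΔΔ); the file name and frame parameters are assumptions for this example, and the 0th MFCC coefficient stands in for the log power term:

```python
import numpy as np
import librosa

y, sr = librosa.load("wav/sound-001.wav", sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512)  # shape: (13, n_frames)
delta = librosa.feature.delta(mfcc, order=1)    # first-order (dynamic) differences
delta2 = librosa.feature.delta(mfcc, order=2)   # second-order differences
features = np.vstack([mfcc, delta, delta2])     # shape: (39, n_frames)
```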
※ Note: so far, accuracy was verified using features averaged over each phoneme interval, but dynamic differences require following the continuous change of the features. To treat the data as continuous, the per-frame MFCC results are now used directly as features, and the program is modified accordingly. The results below therefore differ from the results above.
Now let's implement the program. First, the functions needed this time for deriving the MFCC, the logarithmic power, and their dynamic differences are summarized below.
function.py
#High-frequency emphasis (pre-emphasis)
def preEmphasis(wave, p=0.97):
    #FIR filter with coefficients (1.0, -p)
    return scipy.signal.lfilter([1.0, -p], 1, wave)

#MFCC calculation
def mfcc(wave):
    mfccs = librosa.feature.mfcc(wave, sr=fs, n_fft=512)
    mfccs = np.average(mfccs, axis=1)
    #Flatten into a one-dimensional list
    mfccs = mfccs.flatten()
    mfccs = mfccs.tolist()
    #Drop the 0th coefficient and keep only the 1st to 12th dimensions
    mfccs.pop(0)
    mfccs = mfccs[:12]
    return mfccs

#Calculation of the logarithmic power
def cal_logpower(wave):
    S = librosa.feature.melspectrogram(wave, sr=fs, n_fft=512)
    S = sum(S)
    PS = librosa.power_to_db(S)
    PS = np.average(PS)
    return PS

#K: how many frames to look at before and after (usually 2 to 5)
#Weights applied to the surrounding frames
def make_scale(K):
    scale = []
    div = sum(2*(i**2) for i in range(1, K+1))
    for i in range(-K, K+1):
        scale.append(i/div)
    return np.array(scale)

#Extraction of the difference (delta) features
def make_delta(K, scale, feature):
    #The K frames before the current position (padded with the first frame)
    before = [feature[0]]*K
    #The K frames after the current position
    after = []
    #List that stores the delta features
    delta = []

    for i in range(K+1):
        after.append(feature[i])

    for j in range(len(feature)):
        if j == 0:
            match = np.array(before + after)
            dif_cal = np.dot(scale, match)
            delta.append(dif_cal)
            after.append(feature[j+K+1])
            after.pop(0)
        #Up to K+1 frames from the end, there are still later frames to refer to
        elif j < (len(feature) - K - 1):
            match = np.array(before + after)
            dif_cal = np.dot(scale, match)
            delta.append(dif_cal)
            before.append(feature[j])
            before.pop(0)
            after.append(feature[j+K+1])
            after.pop(0)
        #For the last K frames no new data can be added to after, so pad with the final frame
        else:
            match = np.array(before + after)
            dif_cal = np.dot(scale, match)
            delta.append(dif_cal)
            before.append(feature[j])
            before.pop(0)
            after.append(feature[len(feature)-1])
            after.pop(0)
    return delta
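For reference, `make_scale` and `make_delta` together implement the standard regression formula for delta features: with window size $K$ and per-frame feature vector $c_{t}$ (padded with the first and last frame at the edges),

\begin{align}
\Delta c_{t} = \frac{\sum_{k=-K}^{K} k\,c_{t+k}}{2\sum_{k=1}^{K} k^{2}}
\end{align}

`make_scale` builds the weights $k / \bigl(2\sum_{k=1}^{K} k^{2}\bigr)$, and `make_delta` takes their dot product with the $2K+1$ frames around each position.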
Next, the program that extracts the MFCC and logarithmic power features together with their first- and second-order differences is shown below.
get_all_feature.py
#Use the per-frame features of each speech segment as vectors
phoneme = []
feature_data = []
delta_list = []
delta_2_list = []
nobashi_boin = ["a:","i:","u:","e:","o:"]
remove_list = ["silB","silE","sp"]

#Read the data and calculate the features for each phoneme (data from 500 files is used)
for i in range(1, 500, 1):
    data_list = []
    open_file = "wav/sound-"+str(i).zfill(3)+".lab"
    filename = "wav/sound-"+str(i).zfill(3)  #Sampling frequency is 16kHz
    v, fs = cis.wavread(filename+".wav")
    with open(open_file, "r") as f:
        data = f.readline().split()
        while data:
            data_list.append(data)
            data = f.readline().split()
    for j in range(len(data_list)):
        label = data_list[j][2]
        if label not in remove_list:
            start = int(fs * float(data_list[j][0]))
            end = int(fs * float(data_list[j][1]))
            #Map long (stretched) vowels to the corresponding short vowel
            if label in nobashi_boin:
                label = label[0]
            voice_data = v[start:end]
            #Hamming window
            hammingWindow = np.hamming(len(voice_data))
            voice_data = voice_data * hammingWindow
            p = 0.97
            voice_data = preEmphasis(voice_data, p)
            mfccs = librosa.feature.mfcc(voice_data, sr=fs, n_fft=512)
            mfccs_T = mfccs.T
            S = librosa.feature.melspectrogram(voice_data, sr=fs, n_fft=512)
            S = sum(S)
            PS = librosa.power_to_db(S)
            for i in range(len(PS)):
                feature = mfccs_T[i][1:13].tolist()
                feature.append(PS[i])
                feature_data.append(feature)
                phoneme.append(label)
            #Delta and delta-delta features for the frames of this segment
            K = 3
            scale = make_scale(K)
            delta = make_delta(K, scale, feature_data[len(delta_list):len(feature_data)])
            delta_list.extend(delta)
            second_delta = make_delta(K, scale, delta)
            delta_2_list.extend(second_delta)
Now that the second-order differences have been created, build a data frame from `phoneme`, `feature_data`, `delta_list`, and `delta_2_list`. If it matches the format expected by the SVM program shown earlier, it can be used for classification as is.
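One possible way to do this (a minimal sketch based on my assumption about the layout, putting the label in column 0 so it matches the SVM script above):

```python
import numpy as np
import pandas as pd

#One row per frame: 13 static features, 13 deltas, 13 delta-deltas = 39 columns
features = np.hstack([np.array(feature_data),
                      np.array(delta_list),
                      np.array(delta_2_list)])
df = pd.DataFrame(features)
df.insert(0, "label", phoneme)  #phoneme label in column 0, as in mfcc_data earlier

#x = df.iloc[:, 1:]   # 39-dimensional features
#y = df.iloc[:, 0]    # phoneme label
```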
The results of classification using an SVM with the difference features added are as follows. In both cases, the recognition accuracy is based on per-frame MFCC features. About 32,500 vowel frames are used for classification.
Results with MFCC (12 dimensions) only
train_result
Linear : 0.8297360883797054
Poly : 0.8647708674304418
RBF : 0.8777618657937807
----------------------------------------
test_result
Linear : 0.8230149597238204
Poly : 0.8420406597621788
RBF : 0.8566168009205984
Results with MFCC + logarithmic power, including up to their second-order differences
train_result
Linear : 0.8631853518821604
Poly : 0.9549918166939444
RBF : 0.9495703764320785
----------------------------------------
test_result
Linear : 0.8588415803605677
Poly : 0.9132336018411967
RBF : 0.9177598772535481
Using the dynamic differences improved recognition accuracy by **about 6%**. Since continuous data is analyzed, this confirms that the change from the preceding and following sounds is an effective feature.
Next, let's classify the consonants as well. About 27,000 consonant frames are used for classification.
Results with MFCC (12 dimensions) only
train_result
Linear : 0.418983174835406
Poly : 0.5338332114118508
RBF : 0.5544989027066569
----------------------------------------
test_result
Linear : 0.4189448660510195
Poly : 0.4599067385937643
RBF : 0.49419402029807075
Results with MFCC + logarithmic power, including up to their second-order differences
train_result
Linear : 0.5945501097293343
Poly : 0.8152889539136796
RBF : 0.8201658132162887
----------------------------------------
test_result
Linear : 0.5684374142817957
Poly : 0.6783395812379994
RBF : 0.7101581786595959
Consonants also showed a large improvement in accuracy (**about 22%**) compared with using the MFCC features (12 dimensions) alone. Since consonant waveforms change more than vowel waveforms, this confirms that dynamic differences are even more effective as features for them. On the other hand, as the number of features increases, the gap between training and test accuracy widens, so a shortage of data appears to be a problem.
Regarding vowels, it was confirmed that using MFCC, which captures the spectral envelope with more features, improves accuracy compared with formant analysis. At the beginning I stated that formant analysis is sufficient for vowels, but that only holds for recordings of isolated vowels. The vowel data used this time, however, was segmented with julius from sentence utterances. Such vowels are affected by the surrounding sounds, and their waveforms may differ from those of isolated vowels. These changes cannot be captured by formant analysis, which is presumably why MFCC improved accuracy. Furthermore, since the waveform is affected by the preceding and following sounds, adding the dynamic differences improved accuracy even further.
Regarding consonants, three things can be said from this result.
This article described the theory behind MFCC, the feature most often used in acoustic models, and verified it through actual classification. Since a library (librosa) is available for computing MFCC and it is easy to implement, I explained the theory and an example of using it in detail. I hope this article helps in understanding how speech recognition works and offers clues for implementing it.
In the future, I would like to cover acoustic models and language models for continuous speech.