The other day, I bought a book called "Use in the field! Introduction to TensorFlow development". It was a very easy-to-understand book that covered a wide range of topics, from image classification to GANs. The scripts are a bit old and written in TensorFlow 1.x, so many parts do not work with the current TensorFlow 2.x. The Keras parts, however, are almost fully compatible, so you can run the code from Chapter 4 onwards with almost no problems.
Now that I have actually read the book and studied, I wanted to produce some kind of output. But it is not much fun to just redo the same examples, so this time I am going to use Keras for something I am personally interested in: voice analysis.
I originally wanted to make something like become-yukarin, which transforms voices, but that is a bit too difficult as a first step. So as an introduction, I settled on the theme "**Let's identify who the voice actor is from the voice**".
Most of the book above was about image analysis, so I decided to implement this by converting the audio into images and training on those. I have not studied voice analysis before, so there may be some strange choices here; if so, please leave a comment.
As I wrote above, this time I will take the approach of converting to an image (= two-dimensional data). Therefore, let's implement it in the following flow.
This time, we will use the dataset from the Japan Voice Actor Statistics Society site.
This is a dataset in which three voice actors, Maki Tsuchiya, Saeko Kamimura, and Chika Fujito, each recorded 100 lines in an anechoic chamber with three emotions: "normal," "joy," and "anger." That makes 900 recordings in total: 100 lines x 3 people x 3 emotions. (The lines are read aloud from Wikipedia, so hearing joy and anger expressed with such unemotional lines really makes you appreciate voice actors.)
This time, let's make a model that can distinguish these three people.
The data is distributed as a *.tar.gz archive, but it can be extracted on Windows with ordinary decompression software.
Now let's convert the data into something called MFCC (Mel-Frequency Cepstral Coefficients). A detailed explanation of MFCC would get long, so here it is enough to think of it as "features of the voice".
To convert the audio data (wav) into MFCC values, we use a library called librosa, which you can install with:

```
pip install librosa
```
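For reference, `librosa.feature.mfcc` (used below) roughly boils down to the following steps. This is only a minimal sketch based on my understanding of librosa's defaults (mel spectrogram, log compression, DCT), not the exact implementation:

```python
import librosa
import scipy.fftpack

def mfcc_sketch(y, sr, n_mfcc=20):
    # 1. Mel-scaled power spectrogram (short-time FFT + mel filterbank)
    S = librosa.feature.melspectrogram(y=y, sr=sr)
    # 2. Convert power to decibels (log compression)
    log_S = librosa.power_to_db(S)
    # 3. DCT along the mel axis, keep the first n_mfcc coefficients
    return scipy.fftpack.dct(log_S, axis=-2, type=2, norm='ortho')[:n_mfcc]
```

In the rest of the article we simply call `librosa.feature.mfcc` directly.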
First, let's read the data. As an example, load `fujitou_normal_001.wav`.
```python
from matplotlib import pyplot as plt
import librosa
import librosa.display
import os

# Path to the wav file
WAV_DATA_PATH = os.path.join("Dataset", "fujitou_normal", "fujitou_normal_001.wav")

x, fs = librosa.load(WAV_DATA_PATH, sr=44100)

# Plot the waveform (note: in librosa 0.10+ this function was renamed to waveshow)
librosa.display.waveplot(x, sr=fs, color='blue');
```
The return value `x` of `librosa.load` is the audio data (a numpy array), and `fs` is the sampling rate. Now the wav data can be loaded.
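A quick sanity check of what was loaded (illustrative; the exact numbers depend on the file):

```python
print(x.shape)      # one-dimensional numpy array of samples
print(x.dtype)      # floating-point amplitudes
print(fs)           # 44100, the sampling rate passed to librosa.load
print(len(x) / fs)  # duration of the clip in seconds
```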
Let's convert this data to MFCC.
```python
mfccs = librosa.feature.mfcc(x, sr=fs)
librosa.display.specshow(mfccs, sr=fs, x_axis='time')
plt.colorbar();
```
In the MFCC plot, the horizontal axis is time and the vertical axis is the 20 coefficient dimensions.
```python
print(mfccs.shape) # -> (20, 630)
```
Let's exclude the first-dimension data (the row at the bottom of the graph). I will not go into detail here either, but as you can see from the graph, the first dimension has a much wider value range than the others, which makes it hard to see the features in the remaining dimensions.
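To confirm this, you can compare the value ranges directly (a quick check, not from the original article):

```python
# The first coefficient spans a much wider range than the rest
print(mfccs[0].min(), mfccs[0].max())
print(mfccs[1:].min(), mfccs[1:].max())
```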
```python
mfccs = mfccs[1:]
librosa.display.specshow(mfccs, sr=fs, x_axis='time')
plt.colorbar();
```
Furthermore, as you can see from the waveform, there is almost no sound after about 7.5 seconds. This silent part only gets in the way, so let's cut it off.
```python
import numpy as np

def cut_silence(wavdata, eps=0.01):
    """Cut leading and trailing parts whose amplitude never exceeds eps * max."""
    st = 0
    gl = len(wavdata)
    data = np.abs(wavdata)
    threshold = np.max(data) * eps
    # First sample above the threshold
    for i, a in enumerate(data):
        if a > threshold:
            st = i - 1
            break
    # Last sample above the threshold
    for i, a in reversed(list(enumerate(data))):
        if a > threshold:
            gl = i
            break
    return wavdata[st:gl]
```
Here, everything before the point where the amplitude first exceeds 1% of the maximum value, and everything after the point where it last exceeds 1%, is cut off.
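As an aside, librosa also ships a built-in trimming function, `librosa.effects.trim`, which uses a decibel threshold instead of an amplitude ratio. The hand-rolled `cut_silence` above is kept to match the original approach; a minimal sketch of the alternative (the `top_db` value here is an assumption):

```python
# Trim leading/trailing parts quieter than top_db below the peak
x_trimmed, _ = librosa.effects.trim(x, top_db=30)
```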
```python
x = cut_silence(x)
librosa.display.waveplot(x, sr=fs, color='blue');
```
This makes the graph look like this:
Converting to mfccs again gives:
```python
mfccs = librosa.feature.mfcc(x, sr=fs)
mfccs = mfccs[1:]
librosa.display.specshow(mfccs, sr=fs, x_axis='time')
plt.colorbar();
```
With this, we have data that can be treated like an image.
Since computing MFCCs takes a fair amount of time, I think it is easier to save the results as numpy data (*.npy) right after calling `librosa.feature.mfcc`.
Let's convert all the data and save it in numpy data.
The following directory structure is assumed.
```
.
|-Dataset
| |-fujitou_angry
| | |-fujitou_angry_001.wav
| | |-fujitou_angry_002.wav
| | |-fujitou_angry_003.wav
| | |-...
| |
| |-fujitou_happy
| | |-fujitou_happy_001.wav
| | |-...
| |
| |-...
|
|-ImageData
| |-fujitou_angry
| | |-fujitou_angry_001.npy
| | |-fujitou_angry_002.npy
| | |-fujitou_angry_003.npy
| | |-...
| |
| |-fujitou_happy
| | |-fujitou_happy_001.npy
| | |-...
| |
| |-...
|
```
First, get the paths of all the data using `os.listdir`.
```python
import os
import random

DATASET_DIR = "Dataset"

wavdatas = []

dirlist = os.listdir(DATASET_DIR)
for d in dirlist:
    d = os.path.join(DATASET_DIR, d)
    datalist = os.listdir(d)
    # Determine the correct answer data (voice actor name, emotion) from the directory name
    y = [d[d.find("\\")+1:d.find("_")], d[d.find("_") + 1:]]
    datalist = [[os.path.join(d, x), y] for x in datalist]
    wavdatas.extend(datalist)
```
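Note that the label extraction above relies on Windows-style "\\" path separators. A more portable (hypothetical) helper using `os.path` would look like the sketch below; the rest of the article sticks with the original parsing:

```python
def label_from_dirname(dirpath):
    # e.g. "Dataset/fujitou_normal" -> ["fujitou", "normal"]
    name = os.path.basename(dirpath)
    actor, emotion = name.split("_", 1)
    return [actor, emotion]
```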
Next, create a directory to place numpy data.
IMAGE_DATA = "ImageData"
dirlist = os.listdir(DATASET_DIR)
for d in dirlist:
os.makedirs(os.path.join(IMAGE_DATA, d), exist_ok=True)
Then convert all the data and save it with `np.save`.
```python
def get_mfcc(datadir):
    x, fs = librosa.load(datadir, sr=44100)
    x = cut_silence(x)
    mfccs = librosa.feature.mfcc(x, sr=fs)
    mfccs = mfccs[1:]  # drop the first coefficient
    return mfccs, x, fs

nn = len(wavdatas)
for i, data in enumerate(wavdatas):
    # Build the output path: Dataset\...\xxx.wav -> ImageData\...\xxx.npy
    path_list = data[0].split("\\")
    path_list[0] = IMAGE_DATA
    path_list[2] = path_list[2].replace(".wav", ".npy")
    image_path = "\\".join(path_list)
    mfcc, x, fs = get_mfcc(data[0])
    if i % 10 == 0:
        print(i, "/", nn)
    np.save(image_path, mfcc)
```
You should now have the data, as shown in the image below.
With this, all the data has been converted into two-dimensional data. Next, let's format this two-dimensional data so that it is easier to handle.
First, let's load the numpy data saved above.
IMAGE_DATA = "ImageData"
numpy_datas = []
dirlist = os.listdir(IMAGE_DATA)
for d in dirlist:
d = os.path.join(IMAGE_DATA, d)
datalist = os.listdir(d)
datalist = [[np.load(os.path.join(d,x)), os.path.join(d,x)] for x in datalist]
numpy_datas.extend(datalist)
First, the above data is in the range of -200 to 100. Let's normalize this to the range 0 ~ 1.
```python
# Get the maximum and minimum values over the entire dataset
data = numpy_datas[0][0]
maximum = np.max(data)
minimum = np.min(data)
for i, data in enumerate(numpy_datas):
    M = np.max(data[0])
    m = np.min(data[0])
    if maximum < M:
        maximum = M
    if minimum > m:
        minimum = m

# Scale everything into the range 0 to 1
normalize = lambda x: (x - minimum) / (maximum - minimum)
for i, data in enumerate(numpy_datas):
    numpy_datas[i][0] = normalize(data[0])
```
The data created in Section 2 has size $19 \times T$, where $T$ is the time dimension (more precisely, it grows with seconds × sampling rate). To feed the data straight into a neural network, it is easiest if every sample has the same size.
```python
from PIL import Image
import numpy as np

img_datas = []

for i, data in enumerate(numpy_datas):
    imgdata = Image.fromarray(data[0])
    imgdata = imgdata.resize((512, 19))
    numpy_datas[i][0] = np.array(imgdata)
```
Here, all the data was converted to 512 × 19 data.
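Resizing with PIL stretches or squeezes the time axis. An alternative, not used in the rest of the article, would be to pad or truncate the time axis to a fixed length instead; a sketch (the fixed length of 512 is an arbitrary choice for this example):

```python
MAX_LEN = 512

def pad_or_truncate(mfcc, max_len=MAX_LEN):
    # Make every MFCC array exactly (19, max_len) by zero-padding or cutting the time axis
    if mfcc.shape[1] >= max_len:
        return mfcc[:, :max_len]
    pad_width = max_len - mfcc.shape[1]
    return np.pad(mfcc, ((0, 0), (0, pad_width)), mode="constant")
```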
Save the data as in Section 2.
First, create a directory.
NORMALIZE_DATA = "NormalizeData"
dirlist = os.listdir(DATASET_DIR)
for d in dirlist:
os.makedirs(os.path.join(NORMALIZE_DATA, d), exist_ok=True)
And save.
```python
for i, data in enumerate(numpy_datas):
    path_list = data[1].split("\\")
    path_list[0] = NORMALIZE_DATA
    image_path = "\\".join(path_list)
    np.save(image_path, data[0])
```
# Training with deep learning

Let's split the 900 samples into training data, test data, and validation data.
```python
import numpy as np
import random, os

NORMALIZE_DATA = "NormalizeData"

N_TRAIN = 0.8
N_TEST = 0.1
N_VALID = 0.1

train_data = []
test_data = []
valid_data = []

dirlist = os.listdir(NORMALIZE_DATA)
for d in dirlist:
    d = os.path.join(NORMALIZE_DATA, d)
    datalist = os.listdir(d)
    # Determine the correct answer data (voice actor name, emotion) from the directory name
    y = [d[d.find("\\")+1:d.find("_")], d[d.find("_") + 1:]]
    datalist = [[np.load(os.path.join(d, x)), y, os.path.join(d, x)] for x in datalist]
    random.shuffle(datalist)
    # Split each directory 0.8 : 0.1 : 0.1
    train_data.extend(datalist[:int(len(datalist)*N_TRAIN)])
    test_data.extend(datalist[int(len(datalist)*N_TRAIN): int(len(datalist)*N_TRAIN) + int(len(datalist)*N_TEST)])
    valid_data.extend(datalist[int(len(datalist)*N_TRAIN) + int(len(datalist)*N_TEST):])

random.shuffle(train_data)
random.shuffle(test_data)
random.shuffle(valid_data)
```
All data was split into training : test : validation = 0.8 : 0.1 : 0.1; with 900 samples, that is 720 : 90 : 90. The data is also shuffled twice so that it is well mixed: once within each directory right after reading it, and once more after everything has been combined.
The resulting data structure is:

- Input data: `train_data[i][0]`
- Correct answer data: `train_data[i][1]`
- Voice actor name: `train_data[i][1][0]`
- Emotion: `train_data[i][1][1]`
Before training with TensorFlow's Keras, let's convert the input and correct answer data into a form that is easy to use. This time, `train_data[i][1][0]`, the voice actor name, is the correct answer data for classification.
```python
# Convert the input data to numpy arrays
input_data_train = np.array([train_data[i][0] for i in range(len(train_data))])
input_data_test = np.array([test_data[i][0] for i in range(len(test_data))])
input_data_valid = np.array([valid_data[i][0] for i in range(len(valid_data))])

# List the correct answer data (voice actor names)
label_data_train = [train_data[i][1][0] for i in range(len(train_data))]
label_data_test = [test_data[i][1][0] for i in range(len(test_data))]
label_data_valid = [valid_data[i][1][0] for i in range(len(valid_data))]
```
Let's convert the correct answer data to one-hot vectors.
```python
from tensorflow.keras.utils import to_categorical

label_dict = {"tsuchiya": 0, "fujitou": 1, "uemura": 2}

label_no_data_train = np.array([label_dict[label] for label in label_data_train])
label_no_data_test = np.array([label_dict[label] for label in label_data_test])
label_no_data_valid = np.array([label_dict[label] for label in label_data_valid])

label_no_data_train = to_categorical(label_no_data_train, 3)
label_no_data_test = to_categorical(label_no_data_test, 3)
label_no_data_valid = to_categorical(label_no_data_valid, 3)
```
Now let's build the neural network model. I don't know any hard-and-fast rules here, so for now I simply repeat convolution and batch normalization. ReLU is used as the activation function.
```python
from tensorflow.keras.layers import Input, Conv2D, Conv2DTranspose,\
                                    BatchNormalization, Dense, Activation,\
                                    Flatten, Reshape, Dropout
from tensorflow.keras.models import Model

inputs = Input((19, 512))
x = Reshape((19, 512, 1), input_shape=(19, 512))(inputs)

x = Conv2D(12, (1, 4), strides=(1, 2), padding="same")(x)
x = BatchNormalization()(x)
x = Activation("relu")(x)

x = Conv2D(12, (1, 4), strides=(1, 2), padding="same")(x)
x = BatchNormalization()(x)
x = Activation("relu")(x)

x = Conv2D(12, (2, 2), strides=(1, 1), padding="same")(x)
x = BatchNormalization()(x)
x = Activation("relu")(x)

x = Conv2D(12, (2, 2), strides=(1, 1), padding="same")(x)
x = BatchNormalization()(x)
x = Activation("relu")(x)

x = Flatten()(x)
x = Dense(3)(x)
output = Activation("softmax")(x)

model = Model(inputs=inputs, outputs=output)
model.summary()
```
The result of `model.summary()` is as follows.

```
Model: "functional_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 19, 512)] 0
_________________________________________________________________
reshape (Reshape) (None, 19, 512, 1) 0
_________________________________________________________________
conv2d (Conv2D) (None, 19, 256, 12) 60
_________________________________________________________________
batch_normalization (BatchNo (None, 19, 256, 12) 48
_________________________________________________________________
activation (Activation) (None, 19, 256, 12) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 19, 128, 12) 588
_________________________________________________________________
batch_normalization_1 (Batch (None, 19, 128, 12) 48
_________________________________________________________________
activation_1 (Activation) (None, 19, 128, 12) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 19, 128, 12) 588
_________________________________________________________________
batch_normalization_2 (Batch (None, 19, 128, 12) 48
_________________________________________________________________
activation_2 (Activation) (None, 19, 128, 12) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 19, 128, 12) 588
_________________________________________________________________
batch_normalization_3 (Batch (None, 19, 128, 12) 48
_________________________________________________________________
activation_3 (Activation) (None, 19, 128, 12) 0
_________________________________________________________________
flatten (Flatten) (None, 29184) 0
_________________________________________________________________
dense (Dense) (None, 3) 87555
_________________________________________________________________
activation_4 (Activation) (None, 3) 0
=================================================================
Total params: 89,571
Trainable params: 89,475
Non-trainable params: 96
_________________________________________________________________
```
Now let's train the model.
```python
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy"
)

model.fit(
    input_data_train,
    label_no_data_train,
    batch_size=30,
    epochs=50,
    validation_data=(input_data_valid, label_no_data_valid)
)
```
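As compiled above, only the loss is reported during training. If you also want to see accuracy for each epoch, you could pass the standard Keras `metrics` argument (this is not part of the original setup):

```python
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"]  # report accuracy alongside the loss each epoch
)
```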
Finally, let's make a prediction with the test data created first.
```python
out = model.predict(input_data_test)
predict = np.argmax(out, axis=1)
answer = np.argmax(label_no_data_test, axis=1)

print("correct:", np.sum(predict == answer), "/", len(predict))
print("rate:", np.sum(predict == answer) / len(predict) * 100, "%")
```
```
correct: 90 / 90
rate: 100.0 %
```
It classified every test sample correctly. Great.
I tried this several times, and the accuracy came out around 90% or higher each time. With a little more tuning of the model, I think it could reliably reach 100%.
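To see which speakers get confused when mistakes do happen, a per-class breakdown helps. A small sketch using scikit-learn's `confusion_matrix` (assuming scikit-learn is installed; not part of the original code):

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes,
# in the order tsuchiya (0), fujitou (1), uemura (2)
print(confusion_matrix(answer, predict))
```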
So, let's try to classify not only the voice actor name but also the emotion.
To repost the data structure:

- Input data: `train_data[i][0]`
- Correct answer data: `train_data[i][1]`
- Voice actor name: `train_data[i][1][0]`
- Emotion: `train_data[i][1][1]`

This time, let's use `train_data[i][1][1]`, the emotion, as the correct answer data.
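Concretely, only the label preparation changes. A sketch, assuming the emotion strings in the directory names are "normal", "happy", and "angry" as in the tree above:

```python
emotion_dict = {"normal": 0, "happy": 1, "angry": 2}

label_no_data_train = to_categorical(
    np.array([emotion_dict[train_data[i][1][1]] for i in range(len(train_data))]), 3)
label_no_data_test = to_categorical(
    np.array([emotion_dict[test_data[i][1][1]] for i in range(len(test_data))]), 3)
label_no_data_valid = to_categorical(
    np.array([emotion_dict[valid_data[i][1][1]] for i in range(len(valid_data))]), 3)
```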
The result was:

```
correct: 88 / 90
rate: 97.77777777777777 %
```

so a similarly good result was obtained.
# Summary

This time, in order to classify voice actors from their voices, we converted the audio into image-like data and trained a neural network with convolution layers. With an accuracy of 90% or better, I think it worked out well.
In the future, I would like to try distinguishing the voice of Yui Ogura using audio from anime and radio.