I tried to identify the language using a CNN + mel spectrograms

What is language identification?

In a nutshell, it means identifying which language a piece of voice data is spoken in. For example, given the audio "Good morning, good weather", the system should say "this audio is Japanese!", and given the audio "Buenas tardes", it should say "this audio is Spanish!".

The intended use case is translation for someone when you do not know what language they are speaking. Existing automatic translators apparently need to be told the source language ("English", "Spanish", and so on) in advance, so they cannot translate for a speaker whose language is unknown. (I think.)

So, for a speaker whose language you do not know, the idea is: identify the language with language identification → then translate. (I think.)

Why I used a CNN

There are various methods for language identification, but this time I tried a CNN. The reason is that I happened to find an easy-to-understand article in English (http://yerevann.github.io/2015/10/11/spoken-language-identification-with-deep-convolutional-networks/).

The approach in that article apparently placed 10th in a language identification contest held by Topcoder in 2015, so I studied it and gave it a try.

Data set used

In the above article, the problem was to classify 66,176 10-second MP3 files prepared in advance into 176 languages.

This time, however, I used WAV-format audio files in English, French, and Spanish obtained from VoxForge (http://www.voxforge.org/). Each language can be downloaded from the URLs below. Because there is a large amount of audio data, I fetched it with the wget command.

http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/16kHz_16bit/
http://www.repository.voxforge1.org/downloads/fr/Trunk/Audio/Main/16kHz_16bit/
http://www.repository.voxforge1.org/downloads/es/Trunk/Audio/Main/16kHz_16bit/

Preprocessing

Since a CNN is used, the WAV-format audio files of each language obtained above are first converted into images.

This time I used a "mel spectrogram", in which the horizontal axis represents time, the vertical axis represents frequency, and the shade of the image represents intensity. A mel spectrogram can easily be obtained from a WAV file using a library called librosa.

Below is the code that converts a WAV file into a mel spectrogram image. (I think the same code also works for MP3 files.)

#Input: path to an audio file
#Output: mel spectrogram image of the audio data (192×192)

import librosa as lr

def wav_to_img(path, height=192, width=192):
    signal, sr = lr.load(path, res_type='kaiser_fast')
    if signal.shape[0] < sr * 3:  #Skip files shorter than 3 seconds
        return False, False
    else:
        signal = signal[:sr * 3]  #Use only the first 3 seconds
        hl = signal.shape[0] // (width * 1.1)  #Hop length so the spectrogram comes out slightly wider than `width`
        spec = lr.feature.melspectrogram(y=signal, sr=sr, n_mels=height, hop_length=int(hl))
        img = lr.amplitude_to_db(spec) ** 2
        start = (img.shape[1] - width) // 2  #Center-crop to `width` frames
        return True, img[:, start:start + width]
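
For example, a single file can be converted like this (the file path here is just an illustration):

#Example usage (the file path is hypothetical)
ok, img = wav_to_img('data/voxforge/english/sample.wav')
if ok:
    print(img.shape)  #-> (192, 192)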

The dataset mentioned above contains clips ranging from a few seconds to a few tens of seconds. This time, I used only the first 3 seconds of each clip that is at least 3 seconds long; clips shorter than 3 seconds are not used.

The following function converts all audio data in the specified folder to mel spectrogram images and saves them.

#Converts all audio files in the specified folder to spectrogram images and saves them in the specified folder

import os
import glob
import imageio

def process_audio(in_folder, out_folder):
    os.makedirs(out_folder, exist_ok=True)
    files = glob.glob(in_folder)
    for file in files:
        ok, img = wav_to_img(file)
        if ok:
            #Save under the original file name, with the extension changed to .jpg
            name = os.path.splitext(os.path.basename(file))[0]
            imageio.imwrite(os.path.join(out_folder, name + '.jpg'), img)

Run it as shown below, passing the glob pattern for the audio data of each language as the first argument and the output folder as the second. Do this for every language so that all audio files are converted to mel spectrograms (the calls for the other languages follow the code).

#Specify the audio file pattern as the first argument and the output folder as the second
process_audio('data/voxforge/english/*wav', 'data/voxforge/english_3s_img/')
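
The French and Spanish data are processed the same way. The input folder names below are just my assumption about where the downloaded files were unpacked; the output folders match the paths read in the next step.

#Assumed input folders; the output folders match the paths read below
process_audio('data/voxforge/french/*wav', 'data/voxforge/french_3s_img/')
process_audio('data/voxforge/spanish/*wav', 'data/voxforge/spanish_3s_img/')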

Data split

For convenience, the mel spectrogram images of each language obtained above are gathered into a single HDF5 file. It is a bit clunky, but the images of each language are saved in HDF5 format at the following path. Destination path: 'data/voxforge/3sImg.h5'

import dask.array.image
import h5py
dask.array.image.imread('data/voxforge/english_3s_img/*.jpg').to_hdf5('data/voxforge/3sImg.h5', 'english')
dask.array.image.imread('data/voxforge/french_3s_img/*.jpg').to_hdf5('data/voxforge/3sImg.h5', 'french')
dask.array.image.imread('data/voxforge/spanish_3s_img/*.jpg').to_hdf5('data/voxforge/3sImg.h5', 'spanish')

Divide into training data, validation data, and test data.

import h5py
import numpy as np
import dask.array as da

#Decide the data sizes yourself based on how many mel spectrogram images you obtained
data_size = 60000
tr_size = 50000
va_size = 5000
te_size = 5000

x_english = h5py.File('data/voxforge/3sImg.h5', 'r')['english']
x_french = h5py.File('data/voxforge/3sImg.h5', 'r')['french']
x_spanish = h5py.File('data/voxforge/3sImg.h5', 'r')['spanish']

x = np.vstack((x_english[:20000], x_french[:20000], x_spanish[:20000]))

del x_french
del x_english
del x_spanish

x = da.from_array(x, chunks=1000)

#Prepare the label (teacher) data
y = np.zeros(data_size)
#Labels: 0 for English, 1 for French, 2 for Spanish
y[0:20000] = 0
y[20000:40000] = 1
y[40000:60000] = 2

#Shuffle and split the data

shfl = np.random.permutation(data_size)
training_size = tr_size
validation_size = va_size
test_size = te_size

#Split the randomly shuffled indices shfl into training, validation, and test portions
train_idx = shfl[:training_size] 
validation_idx = shfl[training_size:training_size+validation_size] 
test_idx = shfl[training_size+validation_size:] 

#Build the training, validation, and test sets from the assigned indices
x_train = x[train_idx]
y_train = y[train_idx]
x_vali = x[validation_idx]
y_vali = y[validation_idx]
x_test = x[test_idx]
y_test = y[test_idx]

#Image normalization
x_train = x_train/255
x_vali = x_vali/255
x_test = x_test/255

#Reshape for training (add a channel dimension)
x_train = x_train.reshape(tr_size, 192, 192, 1)
x_vali = x_vali.reshape(va_size, 192, 192, 1)
x_test = x_test.reshape(te_size, 192, 192, 1)

#Convert the labels to one-hot vectors (required by categorical_crossentropy)
y_train = np.eye(3)[y_train.astype(int)]
y_vali = np.eye(3)[y_vali.astype(int)]
y_test = np.eye(3)[y_test.astype(int)]

With the above processing, the data is split into training, validation, and test sets.

Structure of the CNN used

The network structure used is as follows. Feel free to change it. The framework is Keras.

import tensorflow as tf
from tensorflow.python import keras
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.models import Model, Sequential, load_model
from tensorflow.python.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout, Input, BatchNormalization,  Activation
from tensorflow.python.keras.preprocessing.image import load_img, img_to_array, array_to_img, ImageDataGenerator

i = Input(shape=(192,192,1))
m = Conv2D(16, (7, 7), activation='relu', padding='same', strides=1)(i)
m = MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(m)
m = BatchNormalization()(m)

m = Conv2D(32, (5, 5), activation='relu', padding='same', strides=1)(m)
m = MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(m)
m = BatchNormalization()(m)

m = Conv2D(64, (3, 3), activation='relu', padding='same', strides=1)(m)
m = MaxPooling2D()(m)
m = BatchNormalization()(m)

m = Conv2D(128, (3, 3), activation='relu', padding='same', strides=1)(m)
m = MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(m)
m = BatchNormalization()(m)

m = Conv2D(128, (3, 3), activation='relu', padding='same', strides=1)(m)
m = MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(m)
m = BatchNormalization()(m)

m = Conv2D(256, (3, 3), activation='relu', padding='same', strides=1)(m)
m = MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(m)
m = BatchNormalization()(m)

m = Flatten()(m)
m = Activation('relu')(m)
m = BatchNormalization()(m)
m = Dropout(0.5)(m)

m = Dense(512, activation='relu')(m)

m = BatchNormalization()(m)
m = Dropout(0.5)(m)

o = Dense(3, activation='softmax')(m)

model = Model(inputs=i, outputs=o)
model.summary()

Training was done as follows. Perhaps because the amount of training data is small, or because the model is not very good, it tends to overfit almost immediately, so about 5 epochs is enough. I'm sorry I haven't been able to look into this properly.

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=5, verbose=1, validation_data=(x_vali, y_vali), shuffle=True)
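
Since the model overfits quickly, one option (not used in the run above) would be to stop training automatically once the validation loss stops improving, using Keras's EarlyStopping callback. A minimal sketch:

#Minimal sketch (not used above): stop when validation loss stops improving
from tensorflow.python.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2)
model.fit(x_train, y_train, batch_size=32, epochs=20, verbose=1,
          validation_data=(x_vali, y_vali), shuffle=True, callbacks=[early_stop])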

Results

Here is the result on the test data.

model.evaluate(x_test, y_test)

[0.2763474455833435, 0.8972]

So the model predicts correctly about 90% of the time on the test data.

However, the training data and the test data actually contain many recordings from the same speakers, so I think this inflates the accuracy.
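
One way to mitigate this (not done here) would be to split by speaker rather than by clip, so that no speaker appears in both the training and test sets. A rough sketch, assuming a hypothetical helper speaker_id() that extracts a speaker identifier from a file path:

#Rough sketch (assumption): speaker_id(path) returns a speaker identifier for a file
import random

def split_by_speaker(files, speaker_id, test_ratio=0.1):
    speakers = sorted({speaker_id(f) for f in files})
    random.shuffle(speakers)
    test_speakers = set(speakers[:int(len(speakers) * test_ratio)])
    train = [f for f in files if speaker_id(f) not in test_speakers]
    test = [f for f in files if speaker_id(f) in test_speakers]
    return train, test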

If you want to check the accuracy more rigorously, you can evaluate on the data that was not included before shuffling; in this example, that means the images beyond the first 20,000 per language:

x_english = h5py.File('data/voxforge/3sImg.h5', 'r')['english']
x_french = h5py.File('data/voxforge/3sImg.h5', 'r')['french']
x_spanish = h5py.File('data/voxforge/3sImg.h5', 'r')['spanish']

x = np.vstack((x_english[20000:], x_french[20000:], x_spanish[20000:]))

Evaluating on this data should give a more honest estimate of the accuracy. For reference, the per-language accuracy measured this way was as follows.

English: 0.8414201183431953
French: 0.7460106382978723
Spanish: 0.8948035487959443

French, in particular, is not identified very well.
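
For reference, below is a minimal sketch of how the per-language accuracy above could be computed from the held-out data loaded here, assuming the images are normalized and reshaped exactly like the training data:

#Minimal sketch: per-language accuracy on the held-out images
#Assumption: the images are preprocessed the same way as the training data
def language_accuracy(model, x_lang, label):
    x_lang = (x_lang / 255).reshape(len(x_lang), 192, 192, 1)
    pred = np.argmax(model.predict(x_lang), axis=1)
    return np.mean(pred == label)

print('English:', language_accuracy(model, x_english[20000:], 0))
print('French:', language_accuracy(model, x_french[20000:], 1))
print('Spanish:', language_accuracy(model, x_spanish[20000:], 2))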

Summary

This time, after some study, I tried language identification from audio using a CNN. I'm sorry if the explanation is lacking in places, as I wrote this in something of a rush. The basic idea was introduced at the beginning; it comes from the article below, so if you are comfortable with English you may want to read it directly. (http://yerevann.github.io/2015/10/11/spoken-language-identification-with-deep-convolutional-networks/)

Thank you for reading to the end.
