I tried to classify the voices of voice actors

As a machine learning exercise, I tried to identify voice actors by their voices. This post is a memo of the procedure.

What I want to do

First, I want to train on the voice actors' natural voices (from radio shows and the like) and check how accurately they can be distinguished, using held-out natural voices as test data. Next, I want to test whether their anime character voices can be predicted from their natural voices.

1. First, collect the data

This time, I downloaded radio videos featuring the five Lady Go voice actors from Nico Nico Douga. The downloaded files are .mp4, so first convert them to wav.

ffmpeg -i hoge.mp4 -map 0:1 hoge.wav 

This alone does the trick! (-map 0:1 selects the second stream of the input, which in these files is the audio track.)

Then I split each wav file into 30-second chunks (the trim 0 30 : newfile : restart effects keep producing new 30-second files until the input runs out).

sox hoge.wav hogehoge.wav trim 0 30 : newfile : restart

It's not over yet. From here, I manually deleted the corner-switching scenes, the corners where music was played, and the parts containing other people's voices. One thing I can't ignore, though: quiet background music plays throughout. How can I minimize or eliminate the influence of this background sound?
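One idea I have not tried: if the BGM has energy below the voice band, sox's built-in filter effects could at least attenuate it. A sketch, with an untuned cutoff frequency:

sox hogehoge.wav filtered.wav highpass 100

This would only remove low-frequency content, so BGM that overlaps the speech band would remain.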

For the time being, data collection is complete!

2. Extract features from data

My first thought was to use frequency intensities as features: fast Fourier transform! However, O'Reilly's "Practical Machine Learning Systems" says Mel Frequency Cepstrum Coefficients (MFCC) are the better choice in practice, so that is what I use this time.

From what I've read, MFCC is a standard feature in modern speech recognition and is designed around how humans perceive speech. However, MFCC does not seem to retain pitch information. Since the cepstrum can extract pitch components, it might be an even better feature, but I decided to go with MFCC for now.

I use a library called Talkbox Scikit (scikits.talkbox) to compute the MFCC, and save the computed result as a cache for reuse.

from scikits.talkbox.features import mfcc
from scipy.io import wavfile
import glob
import numpy as np
import os

def write_ceps(ceps, fn):
    # Cache the MFCC array next to the wav file (np.save appends ".npy")
    base_fn, ext = os.path.splitext(fn)
    data_fn = base_fn + ".ceps"
    np.save(data_fn, ceps)

def create_ceps(fn):
    sample_rate, X = wavfile.read(fn)
    # ceps holds the MFCCs per frame; mspec and spec are unused here
    ceps, mspec, spec = mfcc(X)
    # Files that fail to load cleanly yield NaN coefficients; skip them
    if not np.any(np.isnan(ceps)):
        write_ceps(ceps, fn)

For reasons I couldn't pin down, some wav files could not be read properly by wavfile.read(fn). Computing the MFCC on such a file yields NaN, so I simply skip those files.
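For reference, generating the caches might look like the loop below (a sketch; it assumes the per-actor directory layout and the name_list / BASE_DIR definitions described next):

for name in name_list:
    for fn in glob.glob(os.path.join(BASE_DIR, name, "*.wav")):
        create_ceps(fn)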

The code to read back the saved data is below. name_list is a list of directory names; this time I created one directory per Lady Go voice actor and saved each actor's data there.

BASE_DIR = "/path/to/data"  # root directory, one subdirectory per voice actor (placeholder path)

def read_ceps(name_list, base_dir=BASE_DIR):
    X, y = [], []
    for label, name in enumerate(name_list):
        for fn in glob.glob(os.path.join(base_dir, name, "*.ceps.npy")):
            ceps = np.load(fn)
            # Average each coefficient over all frames: one
            # 13-dimensional feature vector per 30-second clip
            X.append(np.mean(ceps, axis=0))
            y.append(label)
    return np.array(X), np.array(y)

3. Learn and predict

I trained on the features created in step 2; this time I classified with a support vector machine (SVM). The code is below.

import MFCC  # the module written in step 2
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.utils import resample
from matplotlib import pylab  # used for plotting in step 4
import numpy as np

name_list = ["Uesaka_Sumire","Komatsu_Mikako","Okubo_Rumi","Takamori_Natsumi","Mikami_Shiori"]

x, y = MFCC.read_ceps(name_list)
svc = LinearSVC(C=1.0)
# Shuffle the samples, then hold out the first 150 for testing
x, y = resample(x, y, n_samples=len(y))
svc.fit(x[150:], y[150:])
prediction = svc.predict(x[:150])
cm = confusion_matrix(y[:150], prediction)

After reading the data, I shuffled it with resample(), then used the samples from the 151st onward as training data and the first 150 as test data.
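As an aside, scikit-learn can do this shuffle-and-split in one call; a sketch of the equivalent (in older scikit-learn versions train_test_split lives in sklearn.cross_validation instead of sklearn.model_selection):

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=150)
svc.fit(x_train, y_train)
cm = confusion_matrix(y_test, svc.predict(x_test))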

confusion_matrix computes a confusion matrix: for each true label, how the predictions on the test data were distributed. Let's use it to plot a graph.

4. Plot the graph

Let's plot the accuracy using the confusion matrix of prediction results obtained in step 3. Normalizing each row of the confusion matrix turns the diagonal into per-class accuracy.

def normalisation(cm):
    # Divide each row by its sum so the diagonal reads as per-class accuracy
    new_cm = []
    for line in cm:
        sum_val = sum(line)
        new_array = [float(num) / float(sum_val) for num in line]
        new_cm.append(new_array)
    return new_cm
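The same row normalization can also be written in one line with NumPy broadcasting (equivalent to the function above, since confusion_matrix returns an ndarray):

new_cm = cm.astype(float) / cm.sum(axis=1, keepdims=True)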

Plot the graph.

def plot_confusion_matrix(cm, name_list, name, title):
    pylab.clf()
    # Draw the normalized matrix as a heatmap of rates in [0, 1]
    pylab.matshow(cm, fignum=False, cmap='Blues', vmin=0, vmax=1.0)
    ax = pylab.axes()
    ax.set_xticks(range(len(name_list)))
    ax.set_xticklabels(name_list)
    ax.xaxis.set_ticks_position("bottom")
    ax.set_yticks(range(len(name_list)))
    ax.set_yticklabels(name_list)
    pylab.title(title)
    pylab.colorbar()
    pylab.grid(False)
    pylab.xlabel('Predicted class')
    pylab.ylabel('True class')
    pylab.show()
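Putting steps 3 and 4 together, the plot can be produced like this (a sketch; the name and title arguments here are my own choices):

cm_normalized = np.array(normalisation(cm))
plot_confusion_matrix(cm_normalized, name_list, "radio", "Confusion matrix (radio voices)")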

The result:

[Figure: confusion matrix for the radio-voice test data]

The accuracy was 100%, and it stayed at essentially 100% no matter how many times I reran it.

Next, I tried anime character voices as test data. Collecting that data was hard, though; I could only gather one or two characters per voice actor, so the dataset is small (about 90 samples across the five actors).

The result is below.

[Figure: confusion matrix, predicting anime voices from the natural voices]

It didn't work at all!

Conversely, I also tried the reverse: using the anime voices as training data and testing on the natural voices.

[Figure: confusion matrix, predicting natural voices from the anime voices]

Naturally, this didn't work either.

5. Consideration and verification

The first suspect for the collapse in accuracy on anime voices is overfitting. By default, the mfcc() function produces 13 coefficients per audio frame; since the number of frames is very large, I used the mean of each coefficient over all frames as the feature. The feature vector therefore has just 13 dimensions, so it is unlikely that overfitting came from the feature dimensionality.

Instead, I suspect that bias in the data reduced the generalization ability. All of the training data is radio speech, so the samples for each voice actor vary little in their frequency content, and that may be what hurt generalization.

So I moved some of the anime voices into the training data and tested on the remaining anime voices. The training data is now roughly 2/3 radio voice and 1/3 anime voice.

First, I tested on the radio voices.

[Figure: radio-voice test results after mixing anime voices into the training data]

Recognition accuracy is still nearly 100%, as before, and stays there across reruns.

Next, let's test the anime voices.

[Figure: anime-voice test results after mixing anime voices into the training data]

Recognition accuracy improved dramatically! It really does seem that training only on the radio voices was what crushed the generalization ability.

6. Conclusion

The takeaway: in machine learning you must always keep the danger of overfitting in mind, and on top of that you need to measure generalization ability accurately. In that respect, evaluating this classifier by accuracy alone is not good enough; to evaluate a classifier properly, use a precision-recall curve or an ROC curve (as the book says).
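For instance, a one-vs-rest ROC curve per voice actor could be computed along these lines (a sketch reusing svc, x, y, and name_list from step 3; roc_curve, auc, and label_binarize are scikit-learn functions):

from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

scores = svc.decision_function(x[:150])  # one score per class for each test sample
y_bin = label_binarize(y[:150], classes=range(len(name_list)))
for i, name in enumerate(name_list):
    fpr, tpr, _ = roc_curve(y_bin[:, i], scores[:, i])
    print(name, "AUC =", auc(fpr, tpr))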

Reference

O'Reilly Japan, "Practical Machine Learning Systems" (Building Machine Learning Systems with Python)
