2020/12/26: Fixed a mistake in preprocessing and, at the same time, removed the silent portions of the dataset in advance; val_acc improved slightly to 91%.
2021/01/01: Changed the convolutions from 2D to 1D and added BN and Dropout layers; val_acc reached 96%.
In the "Visualize long meetings with Python" series, the digital voiceprint known as the x-vector needs to be introduced **to further improve the accuracy of speaker diarization**.
As covered in a previous article (https://qiita.com/toast-uz/items/44c6a12dbf10cb3055ca), the x-vector is obtained as the output of the final hidden layer of a model trained for supervised speaker identification.
Therefore, as a prerequisite, it is necessary to **obtain a reasonably accurate speaker identification model through supervised learning**. For that, PyTorch, which I studied the other day, seems like a good fit.
**PyTorch meets this goal well**, as the rest of this article shows.
This time, I used the Japanese portion of mozilla.org's Common Voice 5.1 dataset, which is also supported directly by PyTorch (torchaudio). The validated set contains 6,158 Japanese audio clips from 170 speakers. Of these, a randomly selected 80% is used as training data and the remaining 20% as validation data.
In addition, since each audio clip in the dataset has short silent segments at the beginning and end, the silent parts were removed in advance. This is because the ultimate goal, speaker diarization, requires x-vectors computed on silence-removed data. The code for removing silence is introduced at the end of this article.
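For reference, here is a minimal sketch of one way such trimming can be done with torchaudio's sox effects. It only illustrates the idea and is not necessarily the script used for this dataset; the helper name and thresholds are assumptions:

```python
import torchaudio

torchaudio.set_audio_backend('sox_io')

def trim_silence(in_path, out_path, threshold='1%', min_sound='0.1'):
    """Remove leading and trailing silence from one audio clip (illustrative helper)."""
    waveform, sample_rate = torchaudio.load(in_path)
    # 'silence' trims from the front; reversing twice lets the same effect trim the tail.
    effects = [
        ['silence', '1', min_sound, threshold],
        ['reverse'],
        ['silence', '1', min_sound, threshold],
        ['reverse'],
    ]
    trimmed, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate, effects)
    torchaudio.save(out_path, trimmed, sample_rate)  # save the trimmed clip, e.g. as .wav
```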
The preprocessing for the training data is as follows:
1. Resample each clip to 16 kHz and convert it to a 40-dimensional MFCC (log-mel) sequence.
2. Circularly pad or crop the MFCC sequence to a fixed 10-second length (800 frames).
3. Randomly crop a segment of random length (160 to 320 frames) as data augmentation.
4. Resize the cropped segment to a fixed 3 seconds (240 frames).
Of the above, only the 10-second pad/crop step had no suitable library function, so I wrote it myself. For everything else, **torchaudio and torchvision can be used as-is, which shows how efficient development with PyTorch is**.
In addition, since the validation data needs no augmentation, its preprocessing simply cuts out the first 3 seconds (240 frames).
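As a shape-level illustration, the sketch below shows how the custom pad/crop behaves, assuming the CircularPad1dCrop class defined in the full source further down (800 frames corresponds to roughly 10 seconds of MFCC frames at the default hop length, and 240 frames to roughly 3 seconds):

```python
import torch

# Assumes the CircularPad1dCrop class defined in the full source below.
x_short = torch.randn(1, 40, 300)   # MFCC of a clip shorter than 10 seconds
x_long = torch.randn(1, 40, 1200)   # MFCC of a clip longer than 10 seconds

pad10s = CircularPad1dCrop(800)     # training: tile short clips circularly, crop long ones
pad3s = CircularPad1dCrop(240)      # validation: keep only the first 3 seconds

print(pad10s(x_short).shape)  # torch.Size([1, 40, 800]) - clip repeated to fill 800 frames
print(pad10s(x_long).shape)   # torch.Size([1, 40, 800]) - clip cropped to 800 frames
print(pad3s(x_long).shape)    # torch.Size([1, 40, 240])
```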
The deep learning model is a relatively simple CNN: four 1D convolutional layers suited to audio, two fully connected layers, Batch Normalization in each layer, and Dropout at key points. There are various articles claiming that BN and Dropout should not be combined, but it seems to depend on the case.
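As a quick sanity check of the layer sizes described above, a dummy batch can be pushed through the network. This is a minimal sketch assuming the SpeechNet class from the full source below; the batch size of 8 is arbitrary:

```python
import torch

# Assumes the SpeechNet class defined in the full source below.
model = SpeechNet(n_classes=170)   # 170 speakers in the validated Japanese set
dummy = torch.randn(8, 40, 240)    # (batch, MFCC coefficients, frames) after preprocessing
print(model(dummy).shape)          # torch.Size([8, 170])
```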
Below are the training results. Since the full log is long, only epochs 0 to 10, every tenth epoch after that, and the best epoch (71) are shown.
epoch:0, loss:2.350, acc:0.535, val_loss:1.492, val_acc:0.685, can_save:False
epoch:1, loss:1.217, acc:0.738, val_loss:0.996, val_acc:0.772, can_save:False
epoch:2, loss:0.827, acc:0.807, val_loss:0.736, val_acc:0.824, can_save:False
epoch:3, loss:0.625, acc:0.844, val_loss:0.599, val_acc:0.847, can_save:False
epoch:4, loss:0.475, acc:0.875, val_loss:0.552, val_acc:0.870, can_save:False
epoch:5, loss:0.412, acc:0.891, val_loss:0.445, val_acc:0.884, can_save:False
epoch:6, loss:0.345, acc:0.907, val_loss:0.384, val_acc:0.901, can_save:True
epoch:7, loss:0.302, acc:0.912, val_loss:0.401, val_acc:0.897, can_save:False
epoch:8, loss:0.267, acc:0.923, val_loss:0.449, val_acc:0.893, can_save:False
epoch:9, loss:0.242, acc:0.933, val_loss:0.382, val_acc:0.895, can_save:False
epoch:10, loss:0.237, acc:0.931, val_loss:0.323, val_acc:0.915, can_save:True
epoch:20, loss:0.147, acc:0.955, val_loss:0.251, val_acc:0.942, can_save:True
epoch:30, loss:0.084, acc:0.975, val_loss:0.262, val_acc:0.946, can_save:False
epoch:40, loss:0.076, acc:0.976, val_loss:0.321, val_acc:0.927, can_save:False
epoch:50, loss:0.060, acc:0.984, val_loss:0.304, val_acc:0.942, can_save:False
epoch:60, loss:0.070, acc:0.979, val_loss:0.264, val_acc:0.935, can_save:False
epoch:70, loss:0.045, acc:0.986, val_loss:0.235, val_acc:0.954, can_save:False
epoch:71, loss:0.040, acc:0.988, val_loss:0.197, val_acc:0.960, can_save:True
epoch:80, loss:0.049, acc:0.986, val_loss:0.341, val_acc:0.940, can_save:False
epoch:90, loss:0.047, acc:0.986, val_loss:0.297, val_acc:0.939, can_save:False
The best result, at epoch 71, is val_loss = 0.197 and val_acc = 0.960. The datasets and conditions differ, but judging from Table 7 of "Biometric Recognition Using Deep Learning: A Survey", an EER of around 5% appears to have been roughly state of the art a few years ago.
Below is a graph of the learning results.
Below is the source code. It is generalized so that the output of the final hidden layer, i.e. the x-vector, can also be obtained.
speech.py
import random
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset, random_split
import torchaudio
from torchvision import transforms
torchaudio.set_audio_backend('sox_io')
#A dataset that can be transformed based on Common Voice
#Download Common Voice in advance
# https://commonvoice.mozilla.org/ja/datasets
class SpeechDataset(Dataset):
    sample_rate = 16000

    def __init__(self, train=True, transform=None, split_rate=0.8):
        tsv = './CommonVoice/cv-corpus-5.1-2020-06-22/ja/validated.tsv'
        # Verify dataset uniqueness and enumerate the correct labels (speaker IDs)
        import pandas as pd
        df = pd.read_table(tsv)
        assert not df.path.duplicated().any()
        self.classes = df.client_id.drop_duplicates().tolist()
        self.n_classes = len(self.classes)
        # Prepare the dataset
        self.transform = transform
        data_dirs = tsv.split('/')
        dataset = torchaudio.datasets.COMMONVOICE(
            '/'.join(data_dirs[:-4]), tsv=data_dirs[-1],
            url='japanese', version=data_dirs[-3])
        # Split the dataset into train / validation
        n_train = int(len(dataset) * split_rate)
        n_val = len(dataset) - n_train
        torch.manual_seed(torch.initial_seed())  # needed to get the same split every time
        train_dataset, val_dataset = random_split(dataset, [n_train, n_val])
        self.dataset = train_dataset if train else val_dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        x, sample_rate, dictionary = self.dataset[idx]
        # Resample so that every clip shares a common sample_rate before transforming
        if sample_rate != self.sample_rate:
            x = torchaudio.transforms.Resample(sample_rate, self.sample_rate)(x)
        # Other transforms (cropping, resizing, etc.) are supplied externally via self.transform,
        # but MFCC is applied here first so that training and inference match
        x = torchaudio.transforms.MFCC(log_mels=True)(x)
        # Finally align the size of x
        if self.transform:
            x = self.transform(x)
        # Feature: audio tensor, label: index of the speaker ID
        return x, self.classes.index(dictionary['client_id'])
# Learning model
class SpeechNet(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm1d(40),
            nn.Conv1d(40, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(128, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            nn.Dropout(),
        )
        self.fc = nn.Sequential(
            nn.Linear(30*64, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(1024, n_classes),
        )

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
# Crop to the specified size along the last dimension; if the input is too short, pad it circularly.
# A transform component used to align the length of audio data in the time direction.
class CircularPad1dCrop:
    def __init__(self, size):
        self.size = size

    def __call__(self, x):
        n_repeat = self.size // x.size()[-1] + 1
        repeat_sizes = ((1,) * (x.dim() - 1)) + (n_repeat,)
        out = x.repeat(*repeat_sizes).clone()
        return out.narrow(-1, 0, self.size)
def SpeechML(train_dataset=None, val_test_dataset=None, *,
             n_classes=None, n_epochs=15,
             load_pretrained_state=None, test_last_hidden_layer=False,
             show_progress=True, show_chart=False, save_state=False):
    '''
    Preprocess, train, validate and infer.
    train_dataset: dataset for training
    val_test_dataset: dataset for validation / testing
        (to test on different data, train once while saving the state,
         then reload that state and re-run with only the test dataset;
         if there are no correct labels, validation is skipped)
    n_classes: number of classes (if None, taken from train_dataset)
    n_epochs: number of training epochs
    load_pretrained_state: path of a .pth file when using pretrained weights
    test_last_hidden_layer: return the final hidden layer output as the test result
    show_progress: print the training status of each epoch
    show_chart: plot the results as graphs
    save_state: save the model state whenever val_acc > 0.9 and val_loss hits a new minimum
        (the saved file can later be passed as load_pretrained_state)
    Returns: the inference result for the test data
    '''
    # Prepare the model
    if not n_classes:
        assert train_dataset, 'train_dataset or n_classes must be valid.'
        n_classes = train_dataset.n_classes
    model = SpeechNet(n_classes)
    if load_pretrained_state:
        model.load_state_dict(torch.load(load_pretrained_state))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())
    # Define the preprocessing transforms
    Squeeze2dTo1d = lambda x: torch.squeeze(x, -3)
    train_transform = transforms.Compose([
        CircularPad1dCrop(800),
        transforms.RandomCrop((40, random.randint(160, 320))),
        transforms.Resize((40, 240)),
        Squeeze2dTo1d,
    ])
    test_transform = transforms.Compose([
        CircularPad1dCrop(240),
        Squeeze2dTo1d
    ])
    # Prepare the training and test data
    batch_size = 32
    if train_dataset:
        train_dataset.transform = train_transform
        train_dataloader = DataLoader(
            train_dataset, batch_size=batch_size, shuffle=True)
    else:
        n_epochs = 0  # no epochs can be run without training data
    if val_test_dataset:
        val_test_dataset.transform = test_transform
        val_test_dataloader = DataLoader(
            val_test_dataset, batch_size=batch_size)
    # Training
    losses = []
    accs = []
    val_losses = []
    val_accs = []
    for epoch in range(n_epochs):
        # Training loop
        running_loss = 0.0
        running_acc = 0.0
        for x_train, y_train in train_dataloader:
            optimizer.zero_grad()
            y_pred = model(x_train)
            loss = criterion(y_pred, y_train)
            loss.backward()
            running_loss += loss.item()
            pred = torch.argmax(y_pred, dim=1)
            running_acc += torch.mean(pred.eq(y_train).float())
            optimizer.step()
        running_loss /= len(train_dataloader)
        running_acc /= len(train_dataloader)
        losses.append(running_loss)
        accs.append(running_acc)
        # Validation loop
        val_running_loss = 0.0
        val_running_acc = 0.0
        for val_test in val_test_dataloader:
            if not (type(val_test) is list and len(val_test) == 2):
                break
            x_val, y_val = val_test
            y_pred = model(x_val)
            val_loss = criterion(y_pred, y_val)
            val_running_loss += val_loss.item()
            pred = torch.argmax(y_pred, dim=1)
            val_running_acc += torch.mean(pred.eq(y_val).float())
        val_running_loss /= len(val_test_dataloader)
        val_running_acc /= len(val_test_dataloader)
        can_save = (val_running_acc > 0.9 and
                    val_running_loss < min(val_losses, default=float('inf')))
        val_losses.append(val_running_loss)
        val_accs.append(val_running_acc)
        if show_progress:
            print(f'epoch:{epoch}, loss:{running_loss:.3f}, '
                  f'acc:{running_acc:.3f}, val_loss:{val_running_loss:.3f}, '
                  f'val_acc:{val_running_acc:.3f}, can_save:{can_save}')
        if save_state and can_save:  # create the model folder in advance
            torch.save(model.state_dict(), f'model/0001-epoch{epoch:02}.pth')
    # Graph drawing
    if n_epochs > 0 and show_chart:
        fig, ax = plt.subplots(2)
        ax[0].plot(losses, label='train loss')
        ax[0].plot(val_losses, label='val loss')
        ax[0].legend()
        ax[1].plot(accs, label='train acc')
        ax[1].plot(val_accs, label='val acc')
        ax[1].legend()
        plt.show()
    # Inference
    if not val_test_dataset:
        return
    if test_last_hidden_layer:
        model.fc = model.fc[:-1]  # output the last hidden layer
    y_preds = torch.Tensor()
    for val_test in val_test_dataloader:
        x_test = val_test[0] if type(val_test) is list else val_test
        y_pred = model.eval()(x_test)
        if not test_last_hidden_layer:
            y_pred = torch.argmax(y_pred, dim=1)
        y_preds = torch.cat([y_preds, y_pred])
    return y_preds.detach()
# Call sample
if __name__ == '__main__':
    train_dataset = SpeechDataset(train=True)
    val_dataset = SpeechDataset(train=False)
    result = SpeechML(train_dataset, val_dataset, n_epochs=100,
                      show_chart=True, save_state=True)
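As a usage note, once a state file has been saved, x-vectors can be extracted by reloading the weights and requesting the final-hidden-layer output. Below is a minimal sketch, assuming the module above is available and that the saved file name follows the pattern used in SpeechML's save branch (the exact path is an assumption):

```python
# Extract x-vectors (final-hidden-layer outputs, 1024-dimensional) for the validation clips,
# reusing weights saved during training. The .pth path is an example following the
# f'model/0001-epoch{epoch:02}.pth' naming used above.
val_dataset = SpeechDataset(train=False)
x_vectors = SpeechML(None, val_dataset, n_classes=val_dataset.n_classes,
                     load_pretrained_state='model/0001-epoch71.pth',
                     test_last_hidden_layer=True)
print(x_vectors.shape)  # (number of validation clips, 1024)
```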