2020/12/26: Fixed a mistake in preprocessing and, at the same time, removed the silent portions of the dataset in advance; val_acc improved slightly to 91%.
2021/01/01: Changed the convolutions from 2D to 1D and added BN and Dropout layers; val_acc reached 96%.
In the "Visualize long meetings with Python" series, the digital voiceprint known as the x-vector needs to be introduced **to further improve the accuracy of speaker diarization**.
As covered in a previous article (https://qiita.com/toast-uz/items/44c6a12dbf10cb3055ca), the x-vector is obtained as the output of the final hidden layer of a model trained for supervised speaker identification.
Therefore, as a prerequisite, it is necessary to **obtain a reasonably accurate speaker identification model through supervised learning**. For that, PyTorch, which I studied the other day, seems like a good fit.
**PyTorch meets this goal well**, as the rest of this article shows.
This time, I used the Japanese portion of mozilla.org's Common Voice 5.1 dataset, which is also supported directly by PyTorch (torchaudio). The validated set contains 6,158 Japanese audio clips from 170 speakers. Of these, a randomly selected 80% is used as training data and the remaining 20% as validation data.
In addition, since each audio clip in the dataset has short silent segments at the beginning and end, the silent parts were removed in advance. This is because the ultimate goal, speaker diarization, requires x-vectors computed on silence-removed data. The code for removing silence is introduced at the end of this article.
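For reference, here is a minimal sketch of one way such trimming can be done with torchaudio's sox effects. It only illustrates the idea and is not necessarily the script used for this dataset; the helper name and thresholds are assumptions:

```python
import torchaudio

torchaudio.set_audio_backend('sox_io')

def trim_silence(in_path, out_path, threshold='1%', min_sound='0.1'):
    """Remove leading and trailing silence from one audio clip (illustrative helper)."""
    waveform, sample_rate = torchaudio.load(in_path)
    # 'silence' trims from the front; reversing twice lets the same effect trim the tail.
    effects = [
        ['silence', '1', min_sound, threshold],
        ['reverse'],
        ['silence', '1', min_sound, threshold],
        ['reverse'],
    ]
    trimmed, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate, effects)
    torchaudio.save(out_path, trimmed, sample_rate)  # save the trimmed clip, e.g. as .wav
```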
The preprocessing for the training data is as follows:
1. Resample each clip to 16 kHz and convert it to a 40-dimensional MFCC (log-mel) sequence.
2. Circularly pad or crop the MFCC sequence to a fixed 10-second length (800 frames).
3. Randomly crop a segment of random length (160 to 320 frames) as data augmentation.
4. Resize the cropped segment to a fixed 3 seconds (240 frames).
Of the above, only the 10-second pad/crop step had no suitable library function, so I wrote it myself. For everything else, **torchaudio and torchvision can be used as-is, which shows how efficient development with PyTorch is**.
In addition, since the validation data needs no augmentation, its preprocessing simply cuts out the first 3 seconds (240 frames).
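As a shape-level illustration, the sketch below shows how the custom pad/crop behaves, assuming the CircularPad1dCrop class defined in the full source further down (800 frames corresponds to roughly 10 seconds of MFCC frames at the default hop length, and 240 frames to roughly 3 seconds):

```python
import torch

# Assumes the CircularPad1dCrop class defined in the full source below.
x_short = torch.randn(1, 40, 300)   # MFCC of a clip shorter than 10 seconds
x_long = torch.randn(1, 40, 1200)   # MFCC of a clip longer than 10 seconds

pad10s = CircularPad1dCrop(800)     # training: tile short clips circularly, crop long ones
pad3s = CircularPad1dCrop(240)      # validation: keep only the first 3 seconds

print(pad10s(x_short).shape)  # torch.Size([1, 40, 800]) - clip repeated to fill 800 frames
print(pad10s(x_long).shape)   # torch.Size([1, 40, 800]) - clip cropped to 800 frames
print(pad3s(x_long).shape)    # torch.Size([1, 40, 240])
```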
The deep learning model is a relatively simple CNN: four 1D convolutional layers suited to audio, two fully connected layers, Batch Normalization in each layer, and Dropout at key points. There are various articles claiming that BN and Dropout should not be combined, but it seems to depend on the case.
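As a quick sanity check of the layer sizes described above, a dummy batch can be pushed through the network. This is a minimal sketch assuming the SpeechNet class from the full source below; the batch size of 8 is arbitrary:

```python
import torch

# Assumes the SpeechNet class defined in the full source below.
model = SpeechNet(n_classes=170)   # 170 speakers in the validated Japanese set
dummy = torch.randn(8, 40, 240)    # (batch, MFCC coefficients, frames) after preprocessing
print(model(dummy).shape)          # torch.Size([8, 170])
```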
Below are the training results. Since the full log is long, only epochs 0 to 10, every tenth epoch after that, and the best epoch (71) are shown.
epoch:0, loss:2.350, acc:0.535, val_loss:1.492, val_acc:0.685, can_save:False
epoch:1, loss:1.217, acc:0.738, val_loss:0.996, val_acc:0.772, can_save:False
epoch:2, loss:0.827, acc:0.807, val_loss:0.736, val_acc:0.824, can_save:False
epoch:3, loss:0.625, acc:0.844, val_loss:0.599, val_acc:0.847, can_save:False
epoch:4, loss:0.475, acc:0.875, val_loss:0.552, val_acc:0.870, can_save:False
epoch:5, loss:0.412, acc:0.891, val_loss:0.445, val_acc:0.884, can_save:False
epoch:6, loss:0.345, acc:0.907, val_loss:0.384, val_acc:0.901, can_save:True
epoch:7, loss:0.302, acc:0.912, val_loss:0.401, val_acc:0.897, can_save:False
epoch:8, loss:0.267, acc:0.923, val_loss:0.449, val_acc:0.893, can_save:False
epoch:9, loss:0.242, acc:0.933, val_loss:0.382, val_acc:0.895, can_save:False
epoch:10, loss:0.237, acc:0.931, val_loss:0.323, val_acc:0.915, can_save:True
epoch:20, loss:0.147, acc:0.955, val_loss:0.251, val_acc:0.942, can_save:True
epoch:30, loss:0.084, acc:0.975, val_loss:0.262, val_acc:0.946, can_save:False
epoch:40, loss:0.076, acc:0.976, val_loss:0.321, val_acc:0.927, can_save:False
epoch:50, loss:0.060, acc:0.984, val_loss:0.304, val_acc:0.942, can_save:False
epoch:60, loss:0.070, acc:0.979, val_loss:0.264, val_acc:0.935, can_save:False
epoch:70, loss:0.045, acc:0.986, val_loss:0.235, val_acc:0.954, can_save:False
epoch:71, loss:0.040, acc:0.988, val_loss:0.197, val_acc:0.960, can_save:True
epoch:80, loss:0.049, acc:0.986, val_loss:0.341, val_acc:0.940, can_save:False
epoch:90, loss:0.047, acc:0.986, val_loss:0.297, val_acc:0.939, can_save:False
The best result, at epoch 71, is val_loss = 0.197 and val_acc = 0.960. The datasets and conditions differ, but judging from Table 7 of "Biometric Recognition Using Deep Learning: A Survey", an EER of around 5% appears to have been roughly state of the art a few years ago.
Below is a graph of the learning results.
Below is the source code. It is generalized so that the output of the final hidden layer, i.e. the x-vector, can also be obtained.
speech.py
import random
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset, random_split
import torchaudio
from torchvision import transforms
torchaudio.set_audio_backend('sox_io')
#A dataset that can be transformed based on Common Voice
#Download Common Voice in advance
# https://commonvoice.mozilla.org/ja/datasets
class SpeechDataset(Dataset):
    sample_rate = 16000

    def __init__(self, train=True, transform=None, split_rate=0.8):
        tsv = './CommonVoice/cv-corpus-5.1-2020-06-22/ja/validated.tsv'
        # Verify dataset uniqueness and enumerate the correct labels (speaker IDs)
        import pandas as pd
        df = pd.read_table(tsv)
        assert not df.path.duplicated().any()
        self.classes = df.client_id.drop_duplicates().tolist()
        self.n_classes = len(self.classes)
        # Prepare the dataset
        self.transform = transform
        data_dirs = tsv.split('/')
        dataset = torchaudio.datasets.COMMONVOICE(
            '/'.join(data_dirs[:-4]), tsv=data_dirs[-1],
            url='japanese', version=data_dirs[-3])
        # Split the dataset into train / validation
        n_train = int(len(dataset) * split_rate)
        n_val = len(dataset) - n_train
        torch.manual_seed(torch.initial_seed())  # needed to get the same split every time
        train_dataset, val_dataset = random_split(dataset, [n_train, n_val])
        self.dataset = train_dataset if train else val_dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        x, sample_rate, dictionary = self.dataset[idx]
        # Resample so that every clip shares a common sample_rate before transforming
        if sample_rate != self.sample_rate:
            x = torchaudio.transforms.Resample(sample_rate, self.sample_rate)(x)
        # Other transforms (cropping, resizing, etc.) are supplied externally via self.transform,
        # but MFCC is applied here first so that training and inference match
        x = torchaudio.transforms.MFCC(log_mels=True)(x)
        # Finally align the size of x
        if self.transform:
            x = self.transform(x)
        # Feature: audio tensor, label: index of the speaker ID
        return x, self.classes.index(dictionary['client_id'])
# Learning model
class SpeechNet(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm1d(40),
            nn.Conv1d(40, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool1d(kernel_size=2),
            nn.Conv1d(128, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            nn.Dropout(),
        )
        self.fc = nn.Sequential(
            nn.Linear(30*64, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(1024, n_classes),
        )

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
# Crop to the specified size along the last dimension; if the input is too short, pad it circularly.
# A transform component used to align the length of audio data in the time direction.
class CircularPad1dCrop:
    def __init__(self, size):
        self.size = size

    def __call__(self, x):
        n_repeat = self.size // x.size()[-1] + 1
        repeat_sizes = ((1,) * (x.dim() - 1)) + (n_repeat,)
        out = x.repeat(*repeat_sizes).clone()
        return out.narrow(-1, 0, self.size)
def SpeechML(train_dataset=None, val_test_dataset=None, *,
             n_classes=None, n_epochs=15,
             load_pretrained_state=None, test_last_hidden_layer=False,
             show_progress=True, show_chart=False, save_state=False):
    '''
    Preprocess, train, validate and infer.
    train_dataset: dataset for training
    val_test_dataset: dataset for validation / testing
        (to test on different data, train once while saving the state,
         then reload that state and re-run with only the test dataset;
         if there are no correct labels, validation is skipped)
    n_classes: number of classes (if None, taken from train_dataset)
    n_epochs: number of training epochs
    load_pretrained_state: path of a .pth file when using pretrained weights
    test_last_hidden_layer: return the final hidden layer output as the test result
    show_progress: print the training status of each epoch
    show_chart: plot the results as graphs
    save_state: save the model state whenever val_acc > 0.9 and val_loss hits a new minimum
        (the saved file can later be passed as load_pretrained_state)
    Returns: the inference result for the test data
    '''
    # Prepare the model
    if not n_classes:
        assert train_dataset, 'train_dataset or n_classes must be valid.'
        n_classes = train_dataset.n_classes
    model = SpeechNet(n_classes)
    if load_pretrained_state:
        model.load_state_dict(torch.load(load_pretrained_state))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())
    # Define the preprocessing transforms
    Squeeze2dTo1d = lambda x: torch.squeeze(x, -3)
    train_transform = transforms.Compose([
        CircularPad1dCrop(800),
        transforms.RandomCrop((40, random.randint(160, 320))),
        transforms.Resize((40, 240)),
        Squeeze2dTo1d,
    ])
    test_transform = transforms.Compose([
        CircularPad1dCrop(240),
        Squeeze2dTo1d
    ])
    # Prepare the training and test data
    batch_size = 32
    if train_dataset:
        train_dataset.transform = train_transform
        train_dataloader = DataLoader(
            train_dataset, batch_size=batch_size, shuffle=True)
    else:
        n_epochs = 0  # no epochs can be run without training data
    if val_test_dataset:
        val_test_dataset.transform = test_transform
        val_test_dataloader = DataLoader(
            val_test_dataset, batch_size=batch_size)
    # Training
    losses = []
    accs = []
    val_losses = []
    val_accs = []
    for epoch in range(n_epochs):
        # Training loop
        running_loss = 0.0
        running_acc = 0.0
        for x_train, y_train in train_dataloader:
            optimizer.zero_grad()
            y_pred = model(x_train)
            loss = criterion(y_pred, y_train)
            loss.backward()
            running_loss += loss.item()
            pred = torch.argmax(y_pred, dim=1)
            running_acc += torch.mean(pred.eq(y_train).float())
            optimizer.step()
        running_loss /= len(train_dataloader)
        running_acc /= len(train_dataloader)
        losses.append(running_loss)
        accs.append(running_acc)
        # Validation loop
        val_running_loss = 0.0
        val_running_acc = 0.0
        for val_test in val_test_dataloader:
            if not (type(val_test) is list and len(val_test) == 2):
                break
            x_val, y_val = val_test
            y_pred = model(x_val)
            val_loss = criterion(y_pred, y_val)
            val_running_loss += val_loss.item()
            pred = torch.argmax(y_pred, dim=1)
            val_running_acc += torch.mean(pred.eq(y_val).float())
        val_running_loss /= len(val_test_dataloader)
        val_running_acc /= len(val_test_dataloader)
        can_save = (val_running_acc > 0.9 and
                    val_running_loss < min(val_losses, default=float('inf')))
        val_losses.append(val_running_loss)
        val_accs.append(val_running_acc)
        if show_progress:
            print(f'epoch:{epoch}, loss:{running_loss:.3f}, '
                  f'acc:{running_acc:.3f}, val_loss:{val_running_loss:.3f}, '
                  f'val_acc:{val_running_acc:.3f}, can_save:{can_save}')
        if save_state and can_save:  # create the model folder in advance
            torch.save(model.state_dict(), f'model/0001-epoch{epoch:02}.pth')
    # Graph drawing
    if n_epochs > 0 and show_chart:
        fig, ax = plt.subplots(2)
        ax[0].plot(losses, label='train loss')
        ax[0].plot(val_losses, label='val loss')
        ax[0].legend()
        ax[1].plot(accs, label='train acc')
        ax[1].plot(val_accs, label='val acc')
        ax[1].legend()
        plt.show()
    # Inference
    if not val_test_dataset:
        return
    if test_last_hidden_layer:
        model.fc = model.fc[:-1]  # output the last hidden layer
    y_preds = torch.Tensor()
    for val_test in val_test_dataloader:
        x_test = val_test[0] if type(val_test) is list else val_test
        y_pred = model.eval()(x_test)
        if not test_last_hidden_layer:
            y_pred = torch.argmax(y_pred, dim=1)
        y_preds = torch.cat([y_preds, y_pred])
    return y_preds.detach()
# Call sample
if __name__ == '__main__':
    train_dataset = SpeechDataset(train=True)
    val_dataset = SpeechDataset(train=False)
    result = SpeechML(train_dataset, val_dataset, n_epochs=100,
                      show_chart=True, save_state=True)
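As a usage note, once a state file has been saved, x-vectors can be extracted by reloading the weights and requesting the final-hidden-layer output. Below is a minimal sketch, assuming the module above is available and that the saved file name follows the pattern used in SpeechML's save branch (the exact path is an assumption):

```python
# Extract x-vectors (final-hidden-layer outputs, 1024-dimensional) for the validation clips,
# reusing weights saved during training. The .pth path is an example following the
# f'model/0001-epoch{epoch:02}.pth' naming used above.
val_dataset = SpeechDataset(train=False)
x_vectors = SpeechML(None, val_dataset, n_classes=val_dataset.n_classes,
                     load_pretrained_state='model/0001-epoch71.pth',
                     test_last_hidden_layer=True)
print(x_vectors.shape)  # (number of validation clips, 1024)
```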