End-to-End single channel sound source separation with Google Colaboratory

Since it's summer vacation, I implemented a program that trains and runs sound source separation with a deep neural network. The DNN can be trained on Google Colaboratory, so that as many people as possible can try it.

End-to-End sound source separation

Separating the original signals A and B from the mixture A + B of speaker A's voice and speaker B's voice is called sound source separation. Speech has traditionally been analyzed using a spectrogram, obtained by applying the Fourier transform to the waveform, as the feature representation. In recent years, however, methods have been devised that estimate the waveforms of the separated signals directly from the waveform of the mixed signal; these are called end-to-end (E2E) sound source separation.

Dual-path RNN TasNet

Dual-path RNN TasNet is one of the deep-learning methods for end-to-end sound source separation. It consists of three parts: an Encoder, a Separator, and a Decoder.

(Figure: DPRNN_architecture.png, the Dual-path RNN TasNet architecture)

The Encoder maps the waveform into a latent space, and the Decoder maps latent features back to a waveform. The Separator masks the features in the latent space to separate them for each source. Several DNN-based E2E separation methods with this Encoder-Separator-Decoder configuration have been proposed; Dual-path RNN TasNet in particular uses a module called the Dual-path RNN for the Separator. The Dual-path RNN runs RNNs along both a global time axis and a local time axis. This widens the receptive field of the DNN with a small number of parameters and achieves high-quality separation.
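To make the Encoder-Separator-Decoder structure concrete, here is a minimal PyTorch sketch. It is illustrative only: the class name and sizes are made up, and a single pointwise convolution stands in for the Dual-path RNN in the Separator.

import torch
import torch.nn as nn

class EncoderSeparatorDecoder(nn.Module):
    """Minimal Encoder-Separator-Decoder sketch (illustrative, not the repository's implementation)."""
    def __init__(self, n_filters=64, kernel_size=16, n_sources=2):
        super().__init__()
        stride = kernel_size // 2
        # Encoder: 1-D convolution mapping the waveform into a latent space
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
        # Separator: estimates one sigmoid mask per source in the latent space
        # (Dual-path RNN TasNet uses a Dual-path RNN here; a pointwise
        # convolution stands in for it in this sketch)
        self.separator = nn.Conv1d(n_filters, n_filters * n_sources, 1)
        # Decoder: transposed convolution mapping latent features back to a waveform
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride)
        self.n_filters, self.n_sources = n_filters, n_sources

    def forward(self, mixture):                        # mixture: (batch, 1, time)
        latent = torch.relu(self.encoder(mixture))     # (batch, N, frames)
        masks = torch.sigmoid(self.separator(latent))  # (batch, N * S, frames)
        masks = masks.view(-1, self.n_sources, self.n_filters, masks.size(-1))
        # Mask the latent features per source, then decode each back to a waveform
        estimates = [self.decoder(latent * masks[:, s]) for s in range(self.n_sources)]
        return torch.stack(estimates, dim=1)           # (batch, S, 1, time)

mixture = torch.randn(1, 1, 16000)                     # 1 second of audio at 16 kHz
print(EncoderSeparatorDecoder()(mixture).shape)        # torch.Size([1, 2, 1, 16000])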

What you need

- A Google account (to use Google Colaboratory)
- Time (training the DNN takes a while)
- Friends (to record mixed voices)
- A Mac

Execution method

The code is on GitHub: https://github.com/tky1117/DNN-based_source_separation. The DNN is trained in a notebook on Google Colaboratory, and the actual separation is run on a local PC.

1. Training the DNN

1.1. Preparation

Clone the code and install the required library.

train_dprnn-tasnet.ipynb


!git clone https://github.com/tky1117/DNN-based_source_separation.git
!pip install soundfile

If you want to keep the trained model data on Google Drive, mount it.

train_dprnn-tasnet.ipynb


from google.colab import drive
drive.mount("/content/drive")
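Files written under /content disappear when the Colaboratory session ends, so it can be convenient to copy the model directory to the mounted Drive afterwards. A minimal example (both paths are illustrative; adjust them to your own <OUT_DIR> and Drive folder):

import shutil

# Illustrative paths; copytree fails if the destination already exists
shutil.copytree("/content/DNN-based_source_separation/egs/librispeech/dprnn_tasnet/exp",
                "/content/drive/My Drive/dprnn_tasnet/exp")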

This time we will use a speech dataset called LibriSpeech. Create a two-speaker mixture dataset as follows:

train_dprnn-tasnet.ipynb


%cd "/content/DNN-based_source_separation/egs/librispeech/common"
!. ./prepare.sh "../../../dataset" 2 # 2 is number of speakers
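prepare.sh takes care of the dataset creation, but conceptually a two-speaker mixture is just the sample-wise sum of two utterances. A minimal sketch with soundfile (the file names are made up):

import soundfile as sf

# Made-up file names: one utterance from each speaker, same sampling rate
s1, sr = sf.read("speaker_a.wav")
s2, _ = sf.read("speaker_b.wav")

# Trim to a common length and mix by sample-wise addition
length = min(len(s1), len(s2))
sf.write("mixture.wav", s1[:length] + s2[:length], sr)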

1.2. Training the DNN

Now move to the working directory and train the separation model; all this does is run train.sh. PyTorch is used as the deep learning framework. Dual-path RNN TasNet has several hyperparameters, but because of Google Colaboratory's time limits, the default settings are smaller than those in the original paper.

train_dprnn-tasnet.ipynb


%cd "/content/DNN-based_source_separation/egs/librispeech/dprnn_tasnet"
!. ./train.sh <OUT_DIR>

Models and related files are saved under <OUT_DIR>. The directory name is generated from the hyperparameters, so it gets quite long. With the default settings it is:

train_dprnn-tasnet.ipynb


<OUT_DIR>/2mix/trainable-trainable/sisdr/N64_L16_H256_K100_P50_B3/enc-relu_dilated1_separable1_causal0_norm1_mask-sigmoid/b4_e100_adam-lr1e-3-decay0_clip5/seed111/model/

Models are saved in this directory as best.pth and last.pth: best.pth is the model from the epoch with the best (lowest) validation loss, and last.pth is the model from the final epoch of training. If Google Colaboratory's 12-hour session limit is reached, run the following again:

train_dprnn-tasnet.ipynb


%cd "/content/DNN-based_source_separation/egs/librispeech/dprnn_tasnet"
!. ./train.sh <OUT_DIR> <MODEL_PATH>

and training resumes from where it left off. With the default settings, <MODEL_PATH> is:

<OUT_DIR>/2mix/trainable-trainable/sisdr/N64_L16_H256_K100_P50_B3/enc-relu_dilated1_separable1_causal0_norm1_mask-sigmoid/b4_e100_adam-lr1e-3-decay0_clip5/seed111/model/last.pth

Note, however, that training is stopped early if the validation loss does not improve for 5 consecutive epochs; once that happens, resuming will not train any further.
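The behavior described above (last.pth and best.pth, resuming, early stopping) can be pictured roughly as follows. This is a simplified sketch, not the actual contents of train.sh; run_one_epoch and validate are hypothetical helpers, and the checkpoint key names are assumptions.

import torch

def train(model, optimizer, n_epochs, resume_path=None, patience=5):
    start_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    if resume_path is not None:                     # e.g. <MODEL_PATH> = .../model/last.pth
        checkpoint = torch.load(resume_path)
        model.load_state_dict(checkpoint["model"])  # key names are assumptions
        optimizer.load_state_dict(checkpoint["optimizer"])
        start_epoch = checkpoint["epoch"] + 1
        best_loss = checkpoint["best_loss"]
    for epoch in range(start_epoch, n_epochs):
        run_one_epoch(model, optimizer)             # hypothetical training helper
        val_loss = validate(model)                  # hypothetical validation helper
        state = {"model": model.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "epoch": epoch,
                 "best_loss": min(best_loss, val_loss)}
        torch.save(state, "last.pth")               # always the latest epoch
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
            torch.save(state, "best.pth")           # the lowest-validation-loss epoch
        else:
            bad_epochs += 1
            if bad_epochs >= patience:              # stop after 5 epochs without improvement
                break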

2. Running sound source separation on a PC

After training on Google Colaboratory, try separating sources on a local machine; a Mac is assumed. From here on, the commands are entered in a terminal. On Google Colaboratory everything works after simply installing soundfile (as of September 4, 2020), but locally you need to install the required libraries yourself. In particular, the PyTorch version should match the one in the Google Colaboratory environment.

2.1. Preparation

First, clone the code and move to the working directory.

git clone https://github.com/tky1117/DNN-based_source_separation.git
cd "<WORK_DIR>/DNN-based_source_separation/egs/librispeech/dprnn_tasnet"

Place the trained model under the working directory. If everything so far used the default settings, best.pth goes under:

<WORK_DIR>/DNN-based_source_separation/egs/librispeech/dprnn_tasnet/exp/2mix/trainable-trainable/sisdr/N64_L16_H256_K100_P50_B3/enc-relu_dilated1_separable1_causal0_norm1_mask-sigmoid/b4_e100_adam-lr1e-3-decay0_clip5/seed111/model/

To actually record and separate speech, gather as many people as the number of speakers you specified in 1.2.

2.2. Running the separation

Once the preparation is complete, all that remains is to run it:

. ./demo.sh

A 5-second recording starts. If everything is working, you will see a progress bar with the text "Now recording ...". When the recording finishes, separation by the DNN begins. The recorded mixture and the separated results are saved as wav files under:

<WORK_DIR>/DNN-based_source_separation/egs/librispeech/dprnn_tasnet/results
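Internally, demo.sh records a mixture and feeds it to the trained model. The separation step looks roughly like this (build_model, the checkpoint path, and the checkpoint key are assumptions, not the repository's actual API):

import soundfile as sf
import torch

model = build_model()                     # hypothetical: however the network is constructed
checkpoint = torch.load("exp/.../model/best.pth", map_location="cpu")
model.load_state_dict(checkpoint["model"])  # key name is an assumption
model.eval()

mixture, sr = sf.read("recording.wav")    # the 5-second recording, assumed mono
with torch.no_grad():
    x = torch.from_numpy(mixture).float().view(1, 1, -1)
    estimates = model(x)                  # (1, n_sources, 1, time)

# Write one wav file per separated source
for i in range(estimates.size(1)):
    sf.write("results/source{}.wav".format(i), estimates[0, i, 0].numpy(), sr)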

Summary

We created a program to train and run Dual-path RNN TasNet, one of the end-to-end single-channel sound source separation methods. Training will probably take about 12 hours × 4. If there is little background noise, I think it works reasonably well on a real machine. Networks other than Dual-path RNN TasNet are also implemented, so please give them a try.
