Since it is summer vacation, I implemented a program to train and run sound source separation with a deep neural network. The DNN can be trained on Google Colaboratory so that as many people as possible can try it.
Sound source separation is the technique of recovering the original signals A and B from the signal A + B, a mixture of speaker A's voice and speaker B's voice. Speech has long been analyzed using, as a feature, the spectrogram obtained by Fourier transforming the waveform. In recent years, however, methods have been devised that estimate the waveforms of the separated signals directly from the waveform of the mixed signal; these are called end-to-end (E2E) sound source separation.
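To make the contrast concrete, here is a minimal sketch of the two representations, assuming numpy, scipy, and soundfile are installed; speech_a.wav and speech_b.wav are hypothetical mono recordings of the two speakers with the same length and sampling rate.

import numpy as np
import soundfile as sf
from scipy import signal

# Hypothetical single-speaker recordings (same length, same sampling rate).
a, sr = sf.read("speech_a.wav")
b, _ = sf.read("speech_b.wav")
mixture = a + b  # the observed signal A + B

# Classical approach: analyze the magnitude of the short-time Fourier transform.
_, _, stft = signal.stft(mixture, fs=sr, nperseg=512)
spectrogram = np.abs(stft)

# E2E approach: a DNN maps the raw mixture waveform directly to the two
# separated waveforms, without an explicit Fourier transform.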
Dual-path RNN TasNet is one of the end-to-end sound source separation methods based on deep learning. It consists of three parts: Encoder, Separator, and Decoder.
The Encoder maps the waveform into a latent space, and the Decoder maps the latent space back to a waveform. The Separator estimates a mask for each sound source in that latent space and separates the features. E2E sound source separation with a DNN in this Encoder-Separator-Decoder configuration had been proposed before, but Dual-path RNN TasNet in particular uses a module called Dual-path RNN for the Separator. A Dual-path RNN runs RNNs along both the global time axis and the local time axis. As a result, the receptive field of the DNN can be expanded with a small number of parameters, realizing high-quality sound source separation.
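To make the structure concrete, here is a conceptual PyTorch sketch of the Encoder-Separator-Decoder layout. It is not the repository's implementation: the layer sizes are illustrative, and a plain LSTM stands in for the Dual-path RNN.

import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    def __init__(self, n_basis=64, kernel_size=16, n_sources=2):
        super().__init__()
        stride = kernel_size // 2
        self.n_basis = n_basis
        self.n_sources = n_sources
        # Encoder: 1-D convolution that maps the waveform into a latent space.
        self.encoder = nn.Conv1d(1, n_basis, kernel_size, stride=stride)
        # Separator: a plain LSTM stands in for the Dual-path RNN, which would
        # instead alternate RNNs over local chunks and across chunks.
        self.separator = nn.LSTM(n_basis, n_basis * n_sources, batch_first=True)
        self.mask = nn.Sigmoid()
        # Decoder: transposed convolution that maps masked features back to a waveform.
        self.decoder = nn.ConvTranspose1d(n_basis, 1, kernel_size, stride=stride)

    def forward(self, mixture):                    # mixture: (batch, 1, time)
        latent = self.encoder(mixture)             # (batch, n_basis, frames)
        h, _ = self.separator(latent.transpose(1, 2).contiguous())
        masks = self.mask(h)                       # (batch, frames, n_basis * n_sources)
        masks = masks.reshape(masks.size(0), -1, self.n_sources, self.n_basis)
        masks = masks.permute(0, 2, 3, 1)          # (batch, n_sources, n_basis, frames)
        estimates = [self.decoder(latent * masks[:, i]) for i in range(self.n_sources)]
        return torch.stack(estimates, dim=1)       # (batch, n_sources, 1, time)

estimates = TinyTasNet()(torch.randn(1, 1, 16000))  # one second of audio at 16 kHz
print(estimates.shape)  # torch.Size([1, 2, 1, 16000])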
- Google account (to use Google Colaboratory)
- Time (because training the DNN takes time)
- Friends (to make mixed voices)
- Mac PC
The GitHub link is here. The DNN is trained in a notebook on Google Colaboratory, and the actual separation is performed on a PC.
Clone the code and install the required libraries.
train_dprnn-tasnet.ipynb
!git clone https://github.com/tky1117/DNN-based_source_separation.git
!pip install soundfile
If you want to keep the DNN model data on Google Drive, mount it.
train_dprnn-tasnet.ipynb
from google.colab import drive
drive.mount("/content/drive")
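One optional step (my own suggestion, not part of the original procedure) is to create a folder on the mounted Drive and pass it later as <OUT_DIR>, so that checkpoints survive a session reset; dprnn_tasnet_exp is an arbitrary folder name.

# Create a folder on the mounted Drive to use as <OUT_DIR> later.
import os

out_dir = "/content/drive/My Drive/dprnn_tasnet_exp"
os.makedirs(out_dir, exist_ok=True)
print(out_dir)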
This time, we will use a voice data set called LibriSpeech. Create a two-speaker mixed audio dataset by doing the following:
train_dprnn-tasnet.ipynb
%cd "/content/DNN-based_source_separation/egs/librispeech/common"
!. ./prepare.sh "../../../dataset" 2 # 2 is number of speakers
Now, move to the working directory and train the sound source separation model. All this does is run train.sh. PyTorch is used as the deep learning library.
Dual-path RNN TasNet has several hyperparameters, but because of Google Colaboratory's time limit, the default settings are smaller than in the original paper.
train_dprnn-tasnet.ipynb
%cd "/content/DNN-based_source_separation/egs/librispeech/dprnn_tasnet"
!. ./train.sh <OUT_DIR>
Models and other outputs are saved under <OUT_DIR>. Since the directory name is built from the hyperparameters, it gets quite long. With the default settings it is:
train_dprnn-tasnet.ipynb
<OUT_DIR>/2mix/trainable-trainable/sisdr/N64_L16_H256_K100_P50_B3/enc-relu_dilated1_separable1_causal0_norm1_mask-sigmoid/b4_e100_adam-lr1e-3-decay0_clip5/seed111/model/
The models are saved in that directory as best.pth and last.pth. best.pth is the model from the epoch with the best validation loss, and last.pth is the model from the final epoch of training.
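If you want to inspect a saved checkpoint, a minimal sketch is the following, assuming the file was written with torch.save; the exact structure of the object depends on the repository's training script.

import torch

# Load the checkpoint on the CPU and look at what it contains.
checkpoint = torch.load("best.pth", map_location="cpu")
print(type(checkpoint))
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))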
If the 12-hour time limit is reached, run the following again:
train_dprnn-tasnet.ipynb
%cd "/content/DNN-based_source_separation/egs/librispeech/dprnn_tasnet"
!. ./train.sh <OUT_DIR> <MODEL_PATH>
By running this, training resumes from where it left off. With the default settings, <MODEL_PATH> is
<OUT_DIR>/2mix/trainable-trainable/sisdr/N64_L16_H256_K100_P50_B3/enc-relu_dilated1_separable1_causal0_norm1_mask-sigmoid/b4_e100_adam-lr1e-3-decay0_clip5/seed111/model/last.pth
However, if the loss does not decrease for 5 consecutive epochs, training of the DNN is terminated early, so training does not proceed any further.
After training on Google Colaboratory, try separating sound sources on your local machine. A Mac is assumed.
From here on, commands are entered in the terminal. On Google Colaboratory everything works just by installing soundfile (as of September 4, 2020), but locally you need to install the required libraries yourself. In particular, the PyTorch version depends on the Google Colaboratory environment.
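One way to match the versions (an assumption on my part, not a step from the original procedure) is to print them in the Colab notebook first and install the same versions locally.

# Run this in the Colab notebook to see which versions to reproduce locally.
import torch
import soundfile

print(torch.__version__)
print(soundfile.__version__)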
First, clone the code and move to the working directory.
git clone https://github.com/tky1117/DNN-based_source_separation.git
cd "<WORK_DIR>/DNN-based_source_separation/egs/librispeech/dprnn_tasnet"
Place the trained model under the working directory. If everything up to this point was done with the default settings, place best.pth under:
<WORK_DIR>/DNN-based_source_separation/egs/librispeech/dprnn_tasnet/exp/2mix/trainable-trainable/sisdr/N64_L16_H256_K100_P50_B3/enc-relu_dilated1_separable1_causal0_norm1_mask-sigmoid/b4_e100_adam-lr1e-3-decay0_clip5/seed111/model/
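For example, here is a small sketch that creates the expected directory and copies the model into place, run from the dprnn_tasnet working directory; ~/Downloads/best.pth is a hypothetical location of the file downloaded from Google Drive.

import os
import shutil

# Directory layout produced by the default settings (see the path above).
model_dir = os.path.join(
    "exp/2mix/trainable-trainable/sisdr/N64_L16_H256_K100_P50_B3",
    "enc-relu_dilated1_separable1_causal0_norm1_mask-sigmoid",
    "b4_e100_adam-lr1e-3-decay0_clip5/seed111/model",
)
os.makedirs(model_dir, exist_ok=True)
shutil.copy(os.path.expanduser("~/Downloads/best.pth"), model_dir)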
To actually record and separate voices, gather as many people as the number of speakers specified in 1.2.
Once the preparation is complete, all you have to do is execute it.
. ./demo.sh
Recording runs for 5 seconds. If it is working properly, you will see a progress bar with the message "Now recording ...".
When the recording is finished, separation by the DNN starts. The recording and the separation results are saved as wav files under:
<WORK_DIR>/DNN-based_source_separation/egs/librispeech/dprnn_tasnet/results
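To check the output, here is a minimal sketch that lists the wav files and their lengths, assuming they appear under the results directory (the exact file names depend on demo.sh).

import glob
import soundfile as sf

# Print each recorded/separated file with its shape and duration.
for path in sorted(glob.glob("results/*.wav")):
    audio, sr = sf.read(path)
    print(path, audio.shape, f"{len(audio) / sr:.1f} s")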
We have created a program to train and run Dual-path RNN TasNet, one of the end-to-end single-channel sound source separation methods. Training will probably take about 12 hours x 4. If there is little noise, I think it works reasonably well on a local machine. Networks other than Dual-path RNN TasNet are also implemented, so please feel free to contact us.