Speech Recognition: Genre Classification Part2 - Music Genre Classification CNN

Target

This is a continuation of music genre classification using the Microsoft Cognitive Toolkit (CNTK).

In Part2, music genres are classified from the log-mel spectrogram images prepared in Part1. It is assumed that you have an NVIDIA GPU with CUDA installed.

Introduction

In Speech Recognition: Genre Classification Part1 - GTZAN Genre Collections, we prepared the training data and the validation data.

In Part2, we will classify music genres using a convolutional neural network (CNN).

Convolutional neural network in speech recognition

Audio is one-dimensional waveform data, so a one-dimensional convolutional neural network is the first thing that comes to mind. Here, however, we treat the log-mel spectrogram as a grayscale image, with time on the horizontal axis and frequency on the vertical axis, and use a two-dimensional convolutional neural network. [1]

The structure of the convolutional neural network has been simplified as follows. [2]

| Layer | Filters | Size/Stride | Input | Output |
|---|---|---|---|---|
| Convolution2D | 64 | 3x3/1 | 1x128x128 | 64x128x128 |
| MaxPooling2D | | 3x3/2 | 64x128x128 | 64x64x64 |
| Convolution2D | 128 | 3x3/1 | 64x64x64 | 128x64x64 |
| MaxPooling2D | | 3x3/2 | 128x64x64 | 128x32x32 |
| Convolution2D | 256 | 3x3/1 | 128x32x32 | 256x32x32 |
| Dense | 512 | | 256x32x32 | 512 |
| Dense | 512 | | 512 | 512 |
| Dense | 10 | | 512 | 10 |
| Softmax | | | 10 | 10 |
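The shapes in the table can be checked with a little arithmetic. The sketch below assumes "same" padding for every convolution and pooling layer. The source does not state the padding, but this is the setting under which the listed shapes are consistent: a layer with stride s maps a spatial size n to ceil(n / s). All function names are hypothetical and only serve this illustration.

```python
import math

def same_pad_out(size, stride):
    """Spatial output size under 'same' padding: ceil(size / stride)."""
    return math.ceil(size / stride)

def conv2d_shape(shape, filters, stride=1):
    """Output shape (channels, height, width) of a same-padded convolution."""
    _, h, w = shape
    return (filters, same_pad_out(h, stride), same_pad_out(w, stride))

def maxpool2d_shape(shape, stride=2):
    """Output shape of a same-padded max pooling; channels are unchanged."""
    c, h, w = shape
    return (c, same_pad_out(h, stride), same_pad_out(w, stride))

def trace_shapes(input_shape=(1, 128, 128)):
    """Walk the convolutional part of the table and collect each output shape."""
    layers = [("conv", 64), ("pool", None),
              ("conv", 128), ("pool", None),
              ("conv", 256)]
    shapes = []
    x = input_shape
    for kind, filters in layers:
        x = conv2d_shape(x, filters) if kind == "conv" else maxpool2d_shape(x)
        shapes.append(x)
    return shapes

if __name__ == "__main__":
    for shape in trace_shapes():
        print("x".join(str(d) for d in shape))
```

Under this assumption the last convolution outputs 256x32x32, so the first fully connected layer receives 256 * 32 * 32 = 262144 values per sample.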

Settings in training

The parameters were initialized with the He normal distribution [3] for the convolution layers and the Glorot uniform distribution [4] for the fully connected layers.
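As a rough illustration of the two initializers, the sketch below follows the formulas in the cited papers: He normal draws from a zero-mean Gaussian with variance 2 / fan_in, and Glorot uniform draws from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)). It is written in plain Python rather than with the framework's own initializers, which may apply an additional scale factor or truncation; the function names are hypothetical.

```python
import math
import random

def he_normal(fan_in, n, rng):
    """Draw n weights from N(0, 2 / fan_in), following He et al. [3]."""
    std = math.sqrt(2.0 / fan_in)
    return [rng.gauss(0.0, std) for _ in range(n)]

def glorot_uniform(fan_in, fan_out, n, rng):
    """Draw n weights from U(-a, a), a = sqrt(6 / (fan_in + fan_out)) [4]."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [rng.uniform(-limit, limit) for _ in range(n)]

if __name__ == "__main__":
    rng = random.Random(0)
    # First convolution: 1 input channel, 3x3 kernel, 64 filters.
    conv_w = he_normal(fan_in=1 * 3 * 3, n=64 * 1 * 3 * 3, rng=rng)
    # Last dense layer: 512 inputs, 10 outputs.
    dense_w = glorot_uniform(fan_in=512, fan_out=10, n=512 * 10, rng=rng)
    print(len(conv_w), len(dense_w))
```

For a convolution, fan_in is the number of input channels times the kernel area, which is why the first layer uses 1 * 3 * 3 here.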

Cross-entropy error was used as the loss function.
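For reference, the cross-entropy error of a single sample with softmax outputs can be sketched as follows. This is a minimal plain-Python illustration, not the framework's implementation, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy loss -log p[label] for one sample with an integer label."""
    return -math.log(softmax(logits)[label])

if __name__ == "__main__":
    # Ten equal scores give a uniform prediction over the 10 genres.
    print(cross_entropy([0.0] * 10, label=3))
```

With 10 genres, a model that predicts a uniform distribution has a loss of ln 10, roughly 2.30, which is a useful baseline when reading the training log.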

We adopted Stochastic Gradient Descent (SGD) with momentum as the optimization algorithm. The momentum was fixed at 0.9 and the L2 regularization coefficient was set to 0.0005.
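The update rule can be sketched as below. This is the textbook form of SGD with momentum, with L2 regularization added to the gradient as l2 * w; frameworks differ in details such as whether the gradient is scaled by (1 - momentum), so treat it as an illustration rather than the exact update used in training. The function name and the learning rate of 0.1 in the usage are hypothetical.

```python
def momentum_sgd_step(w, grad, velocity, lr, momentum=0.9, l2=0.0005):
    """One SGD-with-momentum update on lists of weights, gradients, velocities."""
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, grad, velocity):
        g = gi + l2 * wi           # L2 regularization adds l2 * w to the gradient
        v = momentum * vi + g      # accumulate the velocity
        new_v.append(v)
        new_w.append(wi - lr * v)  # step against the velocity
    return new_w, new_v

if __name__ == "__main__":
    w, v = [1.0], [0.0]
    for _ in range(2):
        w, v = momentum_sgd_step(w, grad=[0.5], velocity=v, lr=0.1)
    print(w, v)
```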

The learning rate follows the Cyclical Learning Rate (CLR) [5] schedule, with a maximum learning rate of 1e-3, a base learning rate of 1e-5, and a step size of 10 times the number of epochs. The policy was set to triangular2.
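The triangular2 policy from [5] can be sketched as a function of the iteration count: the learning rate rises linearly from the base to the maximum over one step size, falls back over the next, and the amplitude is halved after every full cycle. The function name is hypothetical, and the step size of 100 iterations is a placeholder chosen for the example, not the value used in training.

```python
import math

def clr_triangular2(iteration, base_lr=1e-5, max_lr=1e-3, step_size=100):
    """Learning rate at a given iteration under the triangular2 CLR policy."""
    cycle = math.floor(1 + iteration / (2 * step_size))  # 1-based cycle index
    x = abs(iteration / step_size - 2 * cycle + 1)       # position in the triangle
    scale = 1.0 / (2 ** (cycle - 1))                     # halve amplitude each cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale

if __name__ == "__main__":
    for it in (0, 50, 100, 200, 300, 400):
        print(it, clr_triangular2(it))
```

With these settings the rate peaks at 1e-3 in the first cycle and at about 5.05e-4 in the second, returning to 1e-5 at the end of each cycle.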

As a countermeasure against overfitting, Dropout [6] with a rate of 0.5 was applied between the fully connected layers.
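A minimal sketch of dropout on a vector of activations is shown below. It uses the common "inverted" formulation, which rescales the kept units during training so that nothing changes at inference time; the paper [6] instead rescales the weights at test time, and the source does not say which variant the framework uses. The function name is hypothetical.

```python
import random

def dropout(x, rate=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(x)  # identity at inference time
    keep = 1.0 - rate
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

if __name__ == "__main__":
    rng = random.Random(0)
    print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng))
    print(dropout([1.0, 2.0, 3.0, 4.0], training=False))
```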

The model was trained for 25 epochs with a mini-batch size of 32.

Implementation

Execution environment

Hardware

・ CPU Intel(R) Core(TM) i7-6700K 4.00GHz
・ GPU NVIDIA GeForce GTX 1060 6GB

Software

・ Windows 10 Pro 1909
・ CUDA 10.0
・ cuDNN 7.6
・ Python 3.6.6
・ Cntk-gpu 2.7
・ Cntkx 0.1.53
・ Matplotlib 3.3.1
・ Numpy 1.19.2
・ Pandas 1.1.2
・ Scikit-learn 0.23.2

Program to run

The training program is available on GitHub.

mgcc_training.py


Result

Training loss and error

The figure below visualizes the loss and the error rate logged during training. The left graph shows the loss and the right graph shows the error rate; in both, the horizontal axis is the number of epochs and the vertical axis is the corresponding value.

gtzan_logging.png

Validation accuracy and confusion matrix

Evaluating the model on the validation data that was held out when preparing the data in Part 1 gave the following result.

Validation Accuracy 69.00%

The figure below visualizes the confusion matrix for the validation data. Rows are the correct labels and columns are the predictions.

confusion_matrix.png
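To make the layout of the confusion matrix concrete, the sketch below builds one with rows as correct labels and columns as predictions, and derives the accuracy from its diagonal. The labels in the usage are made-up toy data for three classes, not results from this experiment, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix with rows = correct label and columns = predicted label."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

if __name__ == "__main__":
    # Toy labels for illustration only.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    cm = confusion_matrix(y_true, y_pred, num_classes=3)
    for row in cm:
        print(row)
    print(accuracy(cm))
```

Off-diagonal entries show which classes are confused with each other: entry (i, j) counts samples of class i that were predicted as class j.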

Reference

Speech Recognition : Genre Classification Part1 - GTZAN Genre Collections

  1. Tom L. H. Li, Antoni B. Chan, and Andy H. W. Chun. "Automatic Musical Pattern Feature Extraction Using Convolutional Neural Network", Genre 10 (2010).
  2. Weibin Zhang, Wenkang Lei, Xiangmin Xu, and Xiaofeng Xing. "Improved Music Genre Classification with Convolutional Neural Networks", Interspeech. 2016, p. 3304-3308.
  3. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", The IEEE International Conference on Computer Vision (ICCV). 2015, p. 1026-1034.
  4. Xavier Glorot and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks", Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010, p. 249-256.
  5. Leslie N. Smith. "Cyclical Learning Rates for Training Neural Networks", 2017 IEEE Winter Conference on Applications of Computer Vision. 2017, p. 464-472.
  6. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", The Journal of Machine Learning Research 15.1 (2014) p. 1929-1958.
