Speech Recognition: Genre Classification Part1 --GTZAN Genre Collections

Target

This series summarizes music genre classification using the Microsoft Cognitive Toolkit (CNTK).

In Part 1, we will prepare the data for music genre classification.

The topics are covered in the following order.

  1. Download GTZAN dataset
  2. Creating a logarithmic mel spectrogram image
  3. Preparation of training data and validation data

Introduction

Download GTZAN dataset

GTZAN Genre Collections

・ Blues ・ Classical ・ Country ・ Disco ・ Hiphop ・ Jazz ・ Metal ・ Pop ・ Reggae ・ Rock

The dataset contains these 10 music genres. Each genre has 100 audio clips, and each clip is 30 seconds long.

GTZAN Genre Collections

Download and unzip genres.tar.gz from the link above.

The directory structure for this part is as follows.

MGCC
  |―gtzan
    |―...
  mgcc_gtzan.py
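The unzip step can be sketched with Python's standard tarfile module. This is a minimal sketch, assuming genres.tar.gz has already been downloaded by hand; the helper name extract_archive and the paths in the usage comment are hypothetical and are not taken from the published program.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and list its top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


# Hypothetical usage, assuming the archive sits next to the script:
# extract_archive("genres.tar.gz", "gtzan")
```

Only extract archives from a source you trust, since extractall writes whatever paths the archive contains.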

Creating a logarithmic mel spectrogram image

Audio is recorded as waveform data, but speech recognition generally works with frequency data obtained with the Fourier transform rather than with the raw waveform.

This time, a logarithmic mel spectrogram image is created from the audio waveform by the following procedure.

  1. Generate a spectrogram with the short-time Fourier transform
  2. Convert it to a mel spectrogram with a mel scale filter bank
  3. Apply the logarithm, then split the result into 128x128 images
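The last step of the procedure can be sketched with NumPy alone. This is a minimal sketch under stated assumptions: the mel spectrogram has 128 mel bands, tiles do not overlap, and trailing frames that do not fill a whole tile are dropped. The helper names, the amin floor, and the random placeholder input are illustrative and may differ from the published program.

```python
import numpy as np


def power_to_db(power, amin=1e-10):
    """Convert a power spectrogram to decibels, flooring tiny values at amin."""
    return 10.0 * np.log10(np.maximum(power, amin))


def split_into_tiles(spec, width=128):
    """Cut a (n_mels, n_frames) array into non-overlapping tiles along time."""
    n_tiles = spec.shape[1] // width  # leftover frames are dropped (assumption)
    return [spec[:, i * width:(i + 1) * width] for i in range(n_tiles)]


# Placeholder for a real mel power spectrogram: 128 mel bands, 300 frames.
mel_power = np.random.default_rng(0).random((128, 300))
log_mel = power_to_db(mel_power)
tiles = split_into_tiles(log_mel)
print(len(tiles), tiles[0].shape)
```

Dropping the remainder keeps every image the same size; padding the last tile instead would be an equally reasonable choice.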

Preparation of training data and validation data

I used scikit-learn's train_test_split function to split it into training and validation data. The argument test_size was 0.2.
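The split can be sketched as follows. Only test_size=0.2 comes from the text above; the file names, the label layout (10 genres with 100 items each), and random_state are assumptions made so the example is self-contained and reproducible.

```python
from sklearn.model_selection import train_test_split

# Hypothetical image paths and genre labels standing in for the real data.
paths = ["image_{:04d}.png".format(i) for i in range(1000)]
labels = [i // 100 for i in range(1000)]  # genre index 0..9

train_paths, val_paths, train_labels, val_labels = train_test_split(
    paths, labels, test_size=0.2, random_state=0  # random_state is an assumption
)
print(len(train_paths), len(val_paths))
```

If a balanced genre ratio in both sets matters, train_test_split also accepts stratify=labels; whether the original program does this is not stated.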

Implementation

Execution environment

Hardware

・ CPU Intel(R) Core(TM) i7-6700K 4.00GHz

Software

・ Windows 10 Pro 1909 ・ Python 3.6.6 ・ Matplotlib 3.1.1 ・ Numpy 1.19.2 ・ Librosa 0.8.0 ・ Scikit-learn 0.23.2

Program to run

The implemented program is published on GitHub.

mgcc_genre.py


Commentary

This section supplements the key points of the program.

Fourier transform

As shown in the lower left of the figure below, audio is obtained as waveform data with time on the horizontal axis and amplitude on the vertical axis. This waveform is a superposition of waves at multiple frequencies, as shown in the upper right. The fast Fourier transform (FFT) therefore lets you check which frequencies the waveform contains, as shown in the lower right.

time_fft_frequency.png
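The idea can be checked on a synthetic waveform with NumPy. This is a minimal sketch; the sampling rate and the two tone frequencies are arbitrary illustrative values, not settings from the program.

```python
import numpy as np

sr = 1000                      # sampling rate in Hz (illustrative)
t = np.arange(sr) / sr         # one second of sample times
# Synthetic waveform: a 50 Hz tone plus a weaker 120 Hz tone.
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(x))            # amplitude spectrum
freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)  # frequency of each bin in Hz

# The two strongest bins correspond to the two tones in the waveform.
peaks = np.sort(freqs[np.argsort(spectrum)[-2:]])
print(peaks)
```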

Short-time Fourier transform

The short-time Fourier transform (STFT) divides the waveform data into short sections and applies the fast Fourier transform to each one. Treating each section as one frame makes it possible to see how the frequency content changes over time.

In practice, as shown in the figure below, the sections are cut out so that they overlap, a window function is applied to each section, and then the Fourier transform is performed.

stft.png

I used a Hamming window for the window function. The Hamming window is expressed by the following formula, where N is the window length and n = 0, 1, ..., N-1.

W_{hamming}(n) = 0.54 - 0.46 \cos \left( \frac{2 \pi n}{N-1} \right)
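The windowing and framing described above can be sketched in NumPy. This is a simplified sketch, not the program's implementation: it does not pad the signal at the edges, and the frame length, hop length, and test tone are assumed values chosen for illustration.

```python
import numpy as np


def hamming_window(N):
    """Hamming window of length N, written out from the formula above."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))


def stft(x, frame_length, hop_length):
    """Overlapping frames -> window -> FFT. Returns (n_freqs, n_frames)."""
    window = hamming_window(frame_length)
    n_frames = 1 + (len(x) - frame_length) // hop_length
    frames = np.stack([
        x[i * hop_length:i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1).T


sr = 8000                                 # sampling rate in Hz (illustrative)
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)         # one second of a 440 Hz tone

S = stft(x, frame_length=512, hop_length=256)   # 50% overlap between frames
amplitude = np.abs(S)
peak_hz = amplitude.mean(axis=1).argmax() * sr / 512
print(S.shape, peak_hz)
```

The strongest frequency bin lands within one bin width (sr / 512 Hz) of the 440 Hz tone, which is the resolution limit of a 512-sample frame.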

As shown in the figure below, the results obtained by the short-time Fourier transform can be viewed as an image with time on the horizontal axis and frequency on the vertical axis. Such an image is called a spectrogram, where each pixel represents the intensity of the amplitude spectrum, which we have converted to dB.

spectrogram.png

Mel scale filter bank

Human hearing has lower frequency resolution at higher frequencies. The mel scale [2] is a scale that reflects this. There are several mel scale conversion formulas, and librosa uses the Slaney formula by default.

The spectrogram is converted to the mel frequency scale by taking the dot product of the filter bank shown below and the power spectrum from the short-time Fourier transform.

mel_filterbank.png
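A triangular mel filter bank and the dot product can be sketched in NumPy. Note the swap: for brevity this sketch uses the HTK-style formula mel = 2595 log10(1 + f / 700), not the Slaney formula that librosa uses by default, and it does not normalize the filter areas. The sampling rate, FFT size, number of mel bands, and the random power spectrum are assumed placeholder values.

```python
import numpy as np


def hz_to_mel(f):
    """HTK-style Hz -> mel conversion (not the Slaney variant)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lower, center, upper = hz_points[i:i + 3]
        rising = (fft_freqs - lower) / (center - lower)
        falling = (upper - fft_freqs) / (upper - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


sr, n_fft, n_mels = 22050, 2048, 128      # illustrative settings
fb = mel_filterbank(sr, n_fft, n_mels)

# Placeholder power spectrum with shape (n_fft // 2 + 1, n_frames).
power_spec = np.random.default_rng(0).random((n_fft // 2 + 1, 10))
mel_spec = fb @ power_spec                # dot product -> (n_mels, n_frames)
print(fb.shape, mel_spec.shape)
```

Because the filter edges are evenly spaced in mel, the filters get wider in Hz as frequency increases, which mirrors the lower resolution of hearing at high frequencies.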

Result

When you run the program, it creates and saves logarithmic mel spectrogram images for the training and validation data.

The figure below shows a logarithmic mel spectrogram image for each genre. The horizontal axis is time, the vertical axis is mel frequency, and each pixel represents logarithmic intensity.

log-mel_spectrogram.png

Now that the data is ready for training, Part 2 will perform music genre classification.

Reference

GTZAN Genre Collections

  1. Bob L. Sturm. "The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use", arXiv preprint arXiv:1306.1461 (2013).
  2. Stanley Smith Stevens, John Volkman, and Edwin Newman. "A scale for the measurement of the psychological magnitude pitch", Journal of the Acoustical Society of America, vol. 8, no. 3, pp. 185-190 (1937).
  3. Koichi Shinoda. "Machine Learning Professional Series Speech Recognition", Kodansha, 2017.
