Conventional music recommendation systems often rely on collaborative filtering, but collaborative filtering has the drawback that it cannot handle works that have not yet accumulated user ratings, such as minor songs and new releases.
Another approach to music recommendation, extracting features from the audio itself and using them for recommendation, can avoid this problem.
So I studied the paper 'End-to-end learning for music audio' [^1], but I could not find its source code, so I decided to implement the tag prediction task myself. (Note that some parts of the setup have been changed, so this is not a complete reproduction.)
I could not find many articles about processing music with Python and deep learning, so I hope this helps.
I use the MagnaTagATune dataset [^2]: 25,863 clips of about 29 seconds each, annotated with 188 tags.
#Obtaining MP3 data
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.001
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.002
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.003
#Concatenate and unzip the split zip files
$ cat mp3.zip* > ~/music.zip
$ unzip music.zip
#Obtaining tag data
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/annotations_final.csv
The audio features normally used in speech recognition and MIR (music information retrieval) are things like mel-frequency cepstral coefficients (MFCCs), obtained by applying feature extraction to the raw waveform. In this paper, however, the raw waveform is fed in as-is. As with images, the appeal is throwing the raw data into a deep network and letting it learn the features automatically.
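For contrast, a typical hand-crafted feature pipeline might look like the sketch below. It is not used in this article; it assumes the librosa package and a placeholder file path, just to show what the raw-waveform approach replaces.
import librosa

# Load a clip as a waveform (librosa resamples to the given rate)
waveform, sr = librosa.load('some_clip.mp3', sr=16000)
# Hand-crafted features: 13 mel-frequency cepstral coefficients per frame
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)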
To convert MP3 to raw audio I used a package called pydub. You also need libav or ffmpeg (which handle the audio encoding and decoding). For details, see the official GitHub page.
$ pip install pydub
#For mac
$ brew install libav --with-libvorbis --with-sdl --with-theora
#For linux
$ apt-get install libav-tools libavcodec-extra-53
Also, the official installation method did not work in my Ubuntu environment, so I worked around it by following another article.
Let's define the following function, which takes the path of an MP3 file and returns an ndarray.
import numpy as np
from pydub import AudioSegment

def mp3_to_array(file):
    #Convert MP3 to raw audio
    song = AudioSegment.from_mp3(file)
    #Get the raw audio data as a bytestring
    song_data = song._data
    #Convert the bytestring to a NumPy array of 16-bit samples
    song_arr = np.fromstring(song_data, np.int16)
    return song_arr
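A quick usage check might look like this (the file path is a placeholder; substitute any clip from the extracted archive):
# Placeholder path; use any MP3 from the extracted MagnaTagATune archive
arr = mp3_to_array('sample_clip.mp3')
print(arr.dtype, arr.shape)  # int16, roughly 16,000 samples per second of audio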
Next, let's read the tag data downloaded earlier. Note the following two points.
- Tags are limited to the 50 most frequently used ones.
- Samples are limited to 3,000 because the full set does not fit in memory.
import pandas as pd
tags_df = pd.read_csv('annotations_final.csv', delim_whitespace=True)
tags_df = tags_df.sample(frac=1)
tags_df = tags_df[:3000]
top50_tags = tags_df.iloc[:, 1:189].sum().sort_values(ascending=False).index[:50].tolist()
y = tags_df[top50_tags].values
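As a sanity check (the printed values below are illustrative, not exact):
print(y.shape)         # (3000, 50)
print(top50_tags[:5])  # the five most frequent tags, e.g. ['guitar', 'classical', ...]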
- tags_df is used because it contains the path to each MP3 file.
- X is reshaped to [samples (number of songs), features, channels (1 here)].
- Since the raw audio is sampled at 16 kHz, there are 16,000 features per second, or 465,984 features for roughly 30 seconds.
- In the original paper each clip is split into 3-second excerpts for training, but for simplicity I feed in the full ~30 seconds (a rough sketch of such a split appears after the loading code below).
files = tags_df.mp3_path.values
X = np.array([ mp3_to_array(file) for file in files ])
X = X.reshape(X.shape[0], X.shape[1], 1)
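For reference, a minimal sketch of the 3-second split used in the original paper could look like the following; the window length at 16 kHz and the dropping of the remainder are my assumptions, and this article does not use it.
# Split one clip's waveform into non-overlapping 3-second windows.
# 16 kHz * 3 s = 48,000 samples per window; the leftover tail is dropped.
def split_into_windows(arr, window=16000 * 3):
    n_windows = len(arr) // window
    return arr[:n_windows * window].reshape(n_windows, window, 1)
Each window would then inherit the tags of its parent clip, and predictions could be averaged over windows at test time.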
from sklearn.model_selection import train_test_split
random_state = 42
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=random_state)
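A quick shape check after the split (values shown are what I expect with 3,000 samples, an 80/20 split, and equal-length clips):
print(train_X.shape, test_X.shape)  # (2400, 465984, 1) (600, 465984, 1)
print(train_y.shape, test_y.shape)  # (2400, 50) (600, 50)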
I built the model with Keras. Unlike the original paper, the input length is as large as 465,984, so I stack the layers a little deeper.
import keras
from keras.models import Model
from keras.layers import Dense, Flatten, Input
from keras.layers import Conv1D, MaxPooling1D
features = train_X.shape[1]
x_inputs = Input(shape=(features, 1), name='x_inputs') # (number of features, number of channels)
x = Conv1D(128, 256, strides=256,
           padding='valid', activation='relu') (x_inputs)
x = Conv1D(32, 8, activation='relu') (x) # (number of filters, filter length)
x = MaxPooling1D(4) (x) # (pool size)
x = Conv1D(32, 8, activation='relu') (x)
x = MaxPooling1D(4) (x)
x = Conv1D(32, 8, activation='relu') (x)
x = MaxPooling1D(4) (x)
x = Conv1D(32, 8, activation='relu') (x)
x = MaxPooling1D(4) (x)
x = Flatten() (x)
x = Dense(100, activation='relu') (x) #(Number of units)
x_outputs = Dense(50, activation='sigmoid', name='x_outputs') (x)
model = Model(inputs=x_inputs, outputs=x_outputs)
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
model.fit(train_X, train_y, batch_size=600, epochs=50)
'''Output to png'''
from keras.utils import plot_model
plot_model(model, to_file="music_only.png", show_shapes=True)
'''Visualize interactively'''
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))
In the original paper the AUC was about 0.87, whereas this experiment reached only about 0.66. With a sample size less than a fifth of the full dataset the score is bound to be lower, but it shows that tags can be predicted to some extent by feeding in the raw (and, here, full ~30-second) audio as-is.
from sklearn.metrics import roc_auc_score
pred_y_x1 = model.predict(test_X, batch_size=50)
print(roc_auc_score(test_y, pred_y_x1)) # => 0.668582599155
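Since roc_auc_score on the full matrix averages over tags, a rough per-tag breakdown can show which tags are easier to predict. This is my own sketch, and it assumes every tag has at least one positive example in test_y (otherwise roc_auc_score raises an error):
# AUC for each of the 50 tags, sorted best-first
per_tag_auc = [roc_auc_score(test_y[:, i], pred_y_x1[:, i])
               for i in range(test_y.shape[1])]
for tag, auc in sorted(zip(top50_tags, per_tag_auc), key=lambda t: -t[1])[:10]:
    print(tag, round(auc, 3))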
- I was able to convert audio files to ndarrays.
- I was able to predict tags by feeding in the raw audio without hand-crafted feature extraction.
- My research environment will be ready soon, so I would like to increase the number of samples and try again.