Conventional music recommendation systems often rely on collaborative filtering, but collaborative filtering has the drawback that it cannot handle works that have not yet accumulated user ratings, such as minor songs and new releases.
Another approach to music recommendation, extracting features from the audio itself and using them for recommendation, can avoid this problem.
So I studied the paper 'End-to-end learning for music audio' [^1], but I could not find its source code, so I decided to implement the tag prediction task myself. (Note that some parts of the setup have been changed, so this is not a complete reproduction.)
I could not find many articles about processing music with Python and deep learning, so I hope this helps.
I use the MagnaTagATune dataset [^2]: 25,863 clips of about 29 seconds each, annotated with 188 tags.
#Obtaining MP3 data
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.001
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.002
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.003
#Concatenate and unzip the split zip files
$ cat mp3.zip* > ~/music.zip
$ unzip music.zip
#Obtaining tag data
$ wget http://mi.soi.city.ac.uk/datasets/magnatagatune/annotations_final.csv
The audio features normally used in speech recognition and MIR (music information retrieval) are things like mel-frequency cepstral coefficients (MFCCs), obtained by applying feature extraction to the raw waveform. In this paper, however, the raw waveform is fed in as-is. As with images, the appeal is throwing the raw data into a deep network and letting it learn the features automatically.
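For contrast, a typical hand-crafted feature pipeline might look like the sketch below. It is not used in this article; it assumes the librosa package and a placeholder file path, just to show what the raw-waveform approach replaces.
import librosa

# Load a clip as a waveform (librosa resamples to the given rate)
waveform, sr = librosa.load('some_clip.mp3', sr=16000)
# Hand-crafted features: 13 mel-frequency cepstral coefficients per frame
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)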
To convert MP3 to raw audio I used a package called pydub. You also need libav or ffmpeg (which handle the audio encoding and decoding). For details, see the official GitHub page.
$ pip install pydub
#For mac
$ brew install libav --with-libvorbis --with-sdl --with-theora
#For linux
$ apt-get install libav-tools libavcodec-extra-53
Also, the official installation method did not work in my Ubuntu environment, so I worked around it by following another article.
Let's define the following function, which takes the path of an MP3 file and returns an ndarray.
import numpy as np
from pydub import AudioSegment

def mp3_to_array(file):
    #Convert MP3 to raw audio
    song = AudioSegment.from_mp3(file)
    #Get the raw audio data as a bytestring
    song_data = song._data
    #Convert the bytestring to a NumPy array of 16-bit samples
    song_arr = np.fromstring(song_data, np.int16)
    return song_arr
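A quick usage check might look like this (the file path is a placeholder; substitute any clip from the extracted archive):
# Placeholder path; use any MP3 from the extracted MagnaTagATune archive
arr = mp3_to_array('sample_clip.mp3')
print(arr.dtype, arr.shape)  # int16, roughly 16,000 samples per second of audio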
Next, let's read the tag data downloaded earlier. Note the following two points.
- Tags are limited to the 50 most frequently used ones.
- Samples are limited to 3,000 because the full set does not fit in memory.
import pandas as pd
tags_df = pd.read_csv('annotations_final.csv', delim_whitespace=True)
tags_df = tags_df.sample(frac=1)
tags_df = tags_df[:3000]
top50_tags = tags_df.iloc[:, 1:189].sum().sort_values(ascending=False).index[:50].tolist()
y = tags_df[top50_tags].values
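As a sanity check (the printed values below are illustrative, not exact):
print(y.shape)         # (3000, 50)
print(top50_tags[:5])  # the five most frequent tags, e.g. ['guitar', 'classical', ...]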
- tags_df is used because it contains the path to each MP3 file.
- X is reshaped to [samples (number of songs), features, channels (1 here)].
- Since the raw audio is sampled at 16 kHz, there are 16,000 features per second, or 465,984 features for roughly 30 seconds.
- In the original paper each clip is split into 3-second excerpts for training, but for simplicity I feed in the full ~30 seconds (a rough sketch of such a split appears after the loading code below).
files = tags_df.mp3_path.values
X = np.array([ mp3_to_array(file) for file in files ])
X = X.reshape(X.shape[0], X.shape[1], 1)
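For reference, a minimal sketch of the 3-second split used in the original paper could look like the following; the window length at 16 kHz and the dropping of the remainder are my assumptions, and this article does not use it.
# Split one clip's waveform into non-overlapping 3-second windows.
# 16 kHz * 3 s = 48,000 samples per window; the leftover tail is dropped.
def split_into_windows(arr, window=16000 * 3):
    n_windows = len(arr) // window
    return arr[:n_windows * window].reshape(n_windows, window, 1)
Each window would then inherit the tags of its parent clip, and predictions could be averaged over windows at test time.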
from sklearn.model_selection import train_test_split
random_state = 42
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=random_state)
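A quick shape check after the split (values shown are what I expect with 3,000 samples, an 80/20 split, and equal-length clips):
print(train_X.shape, test_X.shape)  # (2400, 465984, 1) (600, 465984, 1)
print(train_y.shape, test_y.shape)  # (2400, 50) (600, 50)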
I built the model with Keras. Unlike the original paper, the input length is as large as 465,984, so I stack the layers a little deeper.
import keras
from keras.models import Model
from keras.layers import Dense, Flatten, Input
from keras.layers import Conv1D, MaxPooling1D
features = train_X.shape[1]
x_inputs = Input(shape=(features, 1), name='x_inputs') # (number of features, number of channels)
x = Conv1D(128, 256, strides=256,
           padding='valid', activation='relu') (x_inputs)
x = Conv1D(32, 8, activation='relu') (x) # (number of filters, filter length)
x = MaxPooling1D(4) (x) # (pool size)
x = Conv1D(32, 8, activation='relu') (x)
x = MaxPooling1D(4) (x)
x = Conv1D(32, 8, activation='relu') (x)
x = MaxPooling1D(4) (x)
x = Conv1D(32, 8, activation='relu') (x)
x = MaxPooling1D(4) (x)
x = Flatten() (x)
x = Dense(100, activation='relu') (x) #(Number of units)
x_outputs = Dense(50, activation='sigmoid', name='x_outputs') (x)
model = Model(inputs=x_inputs, outputs=x_outputs)
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
model.fit(train_X, train_y, batch_size=600, epochs=50)
'''Output to png'''
from keras.utils import plot_model
plot_model(model, to_file="music_only.png", show_shapes=True)
'''Visualize interactively'''
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))
In the original paper the AUC was about 0.87, whereas this experiment reached only about 0.66. With a sample size less than a fifth of the full dataset the score is bound to be lower, but it shows that tags can be predicted to some extent by feeding in the raw (and, here, full ~30-second) audio as-is.
from sklearn.metrics import roc_auc_score
pred_y_x1 = model.predict(test_X, batch_size=50)
print(roc_auc_score(test_y, pred_y_x1)) # => 0.668582599155
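Since roc_auc_score on the full matrix averages over tags, a rough per-tag breakdown can show which tags are easier to predict. This is my own sketch, and it assumes every tag has at least one positive example in test_y (otherwise roc_auc_score raises an error):
# AUC for each of the 50 tags, sorted best-first
per_tag_auc = [roc_auc_score(test_y[:, i], pred_y_x1[:, i])
               for i in range(test_y.shape[1])]
for tag, auc in sorted(zip(top50_tags, per_tag_auc), key=lambda t: -t[1])[:10]:
    print(tag, round(auc, 3))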
- I was able to convert audio files to ndarrays.
- I was able to predict tags by feeding in the raw audio without hand-crafted feature extraction.
- My research environment will be ready soon, so I would like to increase the number of samples and try again.