2nd Neural Network Console Challenge: Realizing Similar Song Search

1. Introduction

The **Neural Network Console Challenge** (NNC Challenge) is an **AI development contest** in which participants tackle deep learning tasks, using data provided by supporting companies, with **Neural Network Console**, an AI development tool from **SONY**. This was the second time the contest has been held.

The sponsoring company this time was **Audiostock**, which sells BGM and sound effects for video production, events, and other sound work. The data provided was **more than 10,000 BGM tracks**, and the contest ran from 2020.09.16 to 2020.10.19.

Participants chose between two themes: **A) Create an automatic classification algorithm for Audiostock's BGM search**, or **B) Analyze the audio data with free ideas (free theme)**. I took part as well, so I will describe my challenge below.

**Training data provided by: Audiostock**

2. What does Audiostock do?

Audiostock registers works such as **BGM and sound effects** from **music creators** and sells them online as **royalty-free music**. **Buyers** are people engaged in **video production, events, sound effect work**, and so on. When a sale is made, Audiostock receives the **payment** from the buyer and pays a **reward** to the music creator who made the work. *(Figure: Audiostock's business model)*

Buyers can easily purchase the music they need **without worrying about copyright law**, and music creators get an efficient channel for selling their work. **It benefits both sides.**

Currently there are 16,000 contracted music creators and 600,000 sound sources on sale, with about 10,000 new sources added every month.

Let's look at the actual online sales web page. To find a song you like, you specify **tags** in categories such as **"length", "use", "music genre", "image", "main instrument", "tempo", and "file format"** to narrow down the candidates. There are other search methods, but this seems to be the main one.

*(Screenshot: the Audiostock tag search page)*

Clicking a category displays a list of **tags** (for music genres there are 21, from "pops" to "classical"); you check the tags you like and press the **Apply** button to narrow down the results.

After narrowing down by tag, you listen to the songs one by one and make your choice. The listening flow is well designed: you can move smoothly from song to song just by pressing the up and down arrow keys, and you can start playback from any position with the mouse.

3. Data provided this time

This time, Audiostock provided the following three sets of data.

**02_rawdata** is the full-size BGM data: 10,802 tracks at a 44.1 kHz sample rate. The sound quality makes it a pleasure to listen to, but the track lengths vary and the total size exceeds 200 GB, which makes it hard to handle as-is.

**01_processed data** is a compact version of 02_rawdata, also 10,802 tracks. Each track is trimmed to the first 24 seconds and downsampled to an 8 kHz sample rate, keeping the total size down to about 3 GB. The sound quality is not great, but it is a good fit for deep learning.
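Incidentally, this kind of trimming and downsampling is easy to reproduce with librosa. A minimal sketch, with a hypothetical file name (my own reconstruction, not Audiostock's actual preprocessing):

import librosa
import soundfile as sf

# Load only the first 24 seconds, resampled to 8 kHz (mono)
y, sr = librosa.load('audiostock_xxxxx.wav', sr=8000, duration=24.0)
sf.write('audiostock_xxxxx_8k.wav', y, sr)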

**BGM data list** is a list of one-line comments and tags for each BGM track, and looks like this. *(Figure: excerpt of the BGM data list)*

Unlike with images, annotating music from scratch takes a considerable amount of time, so the **BGM data list** is extremely valuable data. Let's look at how the major tag categories are used. First, the **"use" tags**. *(Figure: Pareto chart of the "use" tags)* First place is **jingle**, followed by **CM, events, games, and movies**. A jingle is a short piece inserted at a program transition, so if you are looking for a jingle, this tag narrows things down effectively. Beyond that, though, the narrowing gets ambiguous: "movies", for example, covers many kinds of music, so it is doubtful whether this tag can efficiently narrow down the songs you want. The number of tags used is lower than I expected at 6,730, which covers at most 62% of the 10,802 songs even assuming no song carries more than one tag in this category.

Next, the **"music genre" tags**. *(Figure: Pareto chart of the "music genre" tags)* First place is **rock**, followed by **ballads, techno, pops, and world**. "World", in fifth place, is a general term for folk music from each region of the world. Since genre alone narrows down only the style of a song, apart from filtering for a specialized genre I think it works merely as a secondary filter. The number of tags used is small at 5,185 (at most a 48% usage rate, assuming no duplicates).

Then there are the **"image" tags**. *(Figure: Pareto chart of the "image" tags)* First place is **light**, followed by **fun, bright, gentle, and exciting**. It is easy to put the image of the song you are looking for into your own words, so these tags can be effective when they match. But an image is subjective; whether a song feels **fun** or **bright**, for example, will differ from person to person.

Finally, the **"main instrument" tags**. *(Figure: Pareto chart of the "main instrument" tags)* First place is **synthesizer**, followed by **piano, electric guitar, strings, and percussion**. I consider these tags objective, because they state plainly which instrument is used. The total number of tags used is 3,429 (at most a 32% usage rate, assuming no duplicates), the lowest of the four categories.

To summarize the tags: the **"use" and "music genre" tags have little narrowing power, and the "image" tags vary with personal subjectivity**, while the **"main instrument" tags are objective**.

Also, even though a song without the right tag drops out of the results when you narrow down, **the overall tag usage rate is low**: even assuming no duplicate use, it is at most 62% for the "use" tags, 48% for the "music genre" tags, and 32% for the "main instrument" tags.

4. Theme selection

I registered with Audiostock as a buyer (membership is free) and actually tried selecting a song. The scenario: choosing background music for my company's promotional video. Setting the "use" tags to **VP, corporate VP**, the "music genre" tag to **pops**, the "image" tags to **driving, stylish, light**, and the "main instrument" tag to **synthesizer**, then narrowing down, gives a list of 730 songs.

*(Screenshot: the narrowed-down song list)*

From here you listen to the songs one by one, but listening to all 730 is not realistic: even at an average of 10 seconds per song, selection would take about two hours. Listening to a song takes far longer than glancing at an image.

With that in mind, I thought: after narrowing down by tag and listening until you come across a song that roughly matches your image, it would be great to be able to run a **similar song search** over the narrowed-down set.

Also, given the **low overall tag usage rate** found earlier in the BGM data list analysis, there may well be songs you would like among those the tag filter missed. In that case, running a **similar song search** against songs that fell out of the list but roughly match your image should be an effective approach.

So I decided to make the theme of this challenge **"similar song search"**.

5. Make a rough sketch

Now that the theme is decided, let's sketch out how to realize a similar song search.

**1) Convert waveform data to images** BGM waveform data could be fed to deep learning as-is, but time-series data such as raw waveforms tends to take a long time to process, so we convert the characteristics of the waveform into image data instead. Specifically, we convert the BGM waveform into a **mel frequency spectrogram**. *(Figure: example of a mel frequency spectrogram)* The vertical axis is frequency (Hz), the horizontal axis is elapsed time (sec), and the color is sound pressure level (dB). The **mel frequency** scale is a frequency scale adjusted to human hearing characteristics. In other words, the spectrogram is an **image that shows how the sound changes over time**.
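For reference, a common definition of the mel scale (the HTK-style formula; I am not assuming this is exactly the variant librosa uses by default) converts a frequency $f$ in Hz to mel as follows:

$$
m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)
$$

The scale is roughly linear below 1 kHz and logarithmic above, matching the way the pitch resolution of human hearing falls off at high frequencies.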

**2) Create a BGM classifier** We input the labeled BGM images and let a CNN learn its weights. At first it outputs random classes, but through error backpropagation the network weights gradually learn to output the correct class. *(Figure: training the CNN classifier)*

The result is a **BGM classifier** that can properly classify BGM.

**3) Get the BGM feature vectors** When you input a BGM image into the BGM classifier, it outputs a class, but the class is the final result after many processing stages and carries little information. The fully connected layer (Affine) just before it is overwhelmingly richer in information, so we extract a **feature vector** from there. *(Figure: extracting the feature vector from the Affine layer)* Doing this for the images of all the BGM gives us a feature vector for every track.
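NNC builds networks in a GUI, so there is no contest code to show here. Purely as an illustration of the idea, here is a minimal PyTorch sketch of a classifier whose 100-dimensional fully connected ("Affine") layer doubles as the feature extractor; every layer size is my own assumption, not the actual contest model:

import torch
import torch.nn as nn

class BGMClassifier(nn.Module):
    # Toy CNN: 224x224 spectrogram image -> 4 instrument classes
    def __init__(self, n_classes=4, feat_dim=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
        )
        # Penultimate fully connected layer ("Affine" in NNC terms)
        self.affine = nn.Linear(32 * 14 * 14, feat_dim)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, x, return_features=False):
        h = torch.relu(self.affine(self.features(x)))
        # Return the 100-dim feature vector instead of the class scores
        return h if return_features else self.classifier(h)

model = BGMClassifier()
img = torch.randn(1, 3, 224, 224)       # a dummy spectrogram image
vec = model(img, return_features=True)  # 100-dimensional feature vector
print(vec.shape)                        # torch.Size([1, 100])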

**4) Search for similar songs from feature vectors** How strongly two vectors point in the same direction can be measured by their **cosine similarity**:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}
$$

Cosine similarity ranges from -1 to +1 and is +1 when the two vectors point in exactly the same direction. So by computing the cosine similarity between the feature vector of the query song and the feature vectors of all other songs and sorting the results, we can list songs in descending order of similarity.
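As a quick sanity check with toy vectors of my own choosing, cosine similarity behaves as described:

import numpy as np

def cos_sim(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 0.0, 1.0])
b = np.array([0.9, 0.1, 1.1])    # nearly the same direction as a
c = np.array([-1.0, 0.0, -1.0])  # exactly the opposite direction

print(cos_sim(a, b))  # about 0.99
print(cos_sim(a, c))  # exactly -1.0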

Now that we have a rough sketch, let's proceed concretely along it.

6. Convert waveform data to image

Convert the BGM waveform data into spectrogram images using **librosa**, a Python package for music and audio analysis.

import os
import numpy as np
import scipy.io.wavfile
import librosa
import librosa.display
import matplotlib.pyplot as plt

def save_png(filename, soundpath, savepath):

    # Read the audio file
    wav_filename = soundpath + filename
    rate, data = scipy.io.wavfile.read(wav_filename)

    # Normalize the 16-bit audio data to the range -1 to 1
    data = data / 32768

    # Frame length
    fft_size = 1024
    # Frame shift length
    hop_length = int(fft_size / 4)

    # Run the short-time Fourier transform
    amplitude = np.abs(librosa.stft(data, n_fft=fft_size, hop_length=hop_length))

    # Convert the amplitude to decibels
    log_power = librosa.amplitude_to_db(amplitude)

    # Draw and save the spectrogram image
    plt.figure(figsize=(4, 4))
    librosa.display.specshow(log_power, sr=rate, hop_length=hop_length)
    plt.savefig(savepath + filename + '.png', dpi=200)

soundpath = './input/'
savepath = './output/'
cnt = 0

for filename in os.listdir(soundpath):
    cnt += 1
    print(cnt, 'processed:', filename)
    save_png(filename, soundpath, savepath)
    plt.close()

*(Figure: example of a generated spectrogram PNG)*
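Note that the code above produces a linear-frequency log-power spectrogram. If you want the mel frequency spectrogram described in section 5, librosa provides that as well; a minimal sketch of the mel variant (my own alternative, with a hypothetical file name, not the code I actually ran above):

import numpy as np
import scipy.io.wavfile
import librosa

rate, data = scipy.io.wavfile.read('./input/audiostock_xxxxx.wav')  # hypothetical file
data = (data / 32768).astype(np.float32)

# Mel power spectrogram, converted to decibels
mel = librosa.feature.melspectrogram(y=data, sr=rate, n_fft=1024,
                                     hop_length=256, n_mels=128)
log_mel = librosa.power_to_db(mel)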

Running the save_png code yields a PNG image of 800 x 800 pixels. Since there is a margin around the plot, I crop a 600 x 600 region from the center and resize it to 224 x 224.
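The cropping code is not shown above; a minimal PIL sketch of that step might look like this (crop_resize is a hypothetical helper of my own, with the sizes taken from the text):

from PIL import Image

def crop_resize(pngpath):
    # Center-crop the 800x800 spectrogram PNG to 600x600, then resize to 224x224
    img = Image.open(pngpath)
    w, h = img.size
    left, top = (w - 600) // 2, (h - 600) // 2
    img = img.crop((left, top, left + 600, top + 600))
    img = img.resize((224, 224), Image.BILINEAR)
    img.save(pngpath)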

7. Create a BGM classifier

As for creating the dataset, classifying BGM from scratch would be far too laborious, so I wanted to make use of the tags in the BGM data list. Among the tags, "use" and "music genre" are ambiguous and "image" is subjective; the objective one is "main instrument", so I decided to classify based on the **"main instrument" tags**.

*(Figure: main instrument tags, with the adopted tags framed in red)*

Focusing on the frequently used tags, I excluded tags that do not carry the main sound, such as "percussion", and tags that cover too many different sounds, such as "ethnic", and extracted only the songs with the red-framed tags. This picked up 716 synthesizer songs, 596 piano, 419 electric guitar, 282 strings, 215 acoustic guitar, 177 brass section, and 127 synth lead. I then listened to the picked-up songs and sorted them.

For example, even when a song carried the synthesizer tag, I dropped cases where the synthesizer only plays backing while another instrument such as a saxophone carries the melody, or where a synthesizer is certainly used but only briefly (there were quite a lot of these). In the end, after also consolidating some tags, I selected a total of 300 songs, 75 each, in 4 classes: **"acoustic guitar", "electric guitar", "piano", and "synthesizer"**.

The waveform data of the 300 songs was then converted to images by the method explained earlier, and the dataset was created in NNC.

*(Figure: the dataset in NNC)*

The 300 items were split training : evaluation = 7 : 3, giving 210 training samples and 90 test samples.
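NNC loads datasets from CSV files that pair an image path with a label. A sketch of how the split could be generated (the directory layout and label numbering are my own assumptions):

import csv
import random

labels = {'acoustic_guitar': 0, 'electric_guitar': 1, 'piano': 2, 'synthesizer': 3}

rows = []
for name, y in labels.items():
    for i in range(75):  # 75 songs per class, 300 in total
        rows.append(['./images/{}/{:03d}.png'.format(name, i), y])

random.seed(0)
random.shuffle(rows)
split = int(len(rows) * 0.7)  # 7:3 split -> 210 training / 90 test

for path, subset in [('train.csv', rows[:split]), ('test.csv', rows[split:])]:
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['x:image', 'y:label'])  # NNC-style dataset header
        writer.writerows(subset)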

Regarding the neural network design: as a perk for contest participants, the secretariat granted 10,000 yen worth of GPU time on the NNC cloud version, so I could experiment without hesitation. **The GPU trains about 20 times faster than a CPU, so I was very grateful to be able to try many things almost stress-free.**

After various trials, a model with a relatively small number of parameters gave the better results, probably because the training set was as small as 210 samples.

*(Figure: the network model)* This model's CostParameter is 1,128,700, a level that runs practically on a CPU.

Here is the learning curve. *(Figure: learning curve)* With the subsequent processing in mind, training was done with the NNC Windows version (CPU); 100 epochs completed in 48 minutes. The best validation error (VALIDATION ERROR) was 0.530326, at epoch 60.

The confusion matrix and precision. *(Figure: confusion matrix)* **Accuracy** is **84.44%**, with no large variation in recall across labels, which is reasonable classification accuracy.

8. Get the BGM feature vector

Now that we have a BGM classifier, we use it to obtain the feature vectors of all the BGM. At first I wondered whether this would require something quite involved, but the Neural Network Console (Windows version) is well made: its documentation includes a tutorial, [**"Analyzing the intermediate output of a trained neural network"**](https://support.dl.sony.com/docs-ja/), that covers exactly this.

First, convert all the BGM waveforms into images to create a test dataset (the correct-answer labels can be anything), and swap it in as the test data of the earlier training project.

Next, create a network for intermediate result output.

*(Figure: the feature extraction network)*

It is basically a copy of the training network, but **a second branch from Affine to Identity is added** so the feature vector can be taken out.

Register this network as an **additional network** on the **EDIT tab** under the name ActivationMonitor, change the **Executor** setting in **CONFIG** to this ActivationMonitor, and set **Max Epoch** in **Global Config** to **0**.

Load the trained network with **Open in EDIT Tab with Weight** and start training: since Max Epoch is 0, training completes without doing anything. Then run the evaluation, and the feature vector of every image registered in the test data is output to a CSV file (output_result.csv). It's impressive that the tool can be bent to tricks like this!

Here is what the CSV file looks like. Since the Affine layer is 100-dimensional, each feature vector is also 100-dimensional. *(Figure: contents of output_result.csv)*

9. Search for similar songs from the feature vector

Now that we have a CSV file of every BGM's feature vector, we use Python to search for the top 5 most similar songs.

import csv
import numpy as np

# Initial settings
N = 500  # Row index of the query song (here, audiostock_43054)

# Cosine similarity function
def cos_sim(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Load the feature vectors into an array
with open('./output_result.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # Skip the header row
    for i, line in enumerate(reader):
        if i == 0:
            array = np.array(line[1:], dtype=float)
            index = [line[0]]
        else:
            array = np.vstack([array, np.array(line[1:], dtype=float)])
            index.append(line[0])
print('Feature vectors loaded', array.shape)
print('Index loaded', len(index))

# Compute the cosine similarity between the query song and every song
for i in range(len(array)):
    x = cos_sim(array[N], array[i])
    if i == 0:
        score = np.array([x])
    else:
        score = np.hstack([score, x])
print('Cosine similarity computed', score.shape)

# Find the TOP5 indices
for j in range(1, 6):
    b = np.where(score == np.sort(score)[-j])
    n = int(b[0][0])
    print('TOP' + str(j), index[n][:16], 'score = ' + str(score[n]))
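As an aside, the TOP5 search at the end could be written more compactly with np.argsort; an equivalent alternative to the loop above:

# Equivalent TOP5 search using argsort
top5 = np.argsort(score)[::-1][:5]  # indices of the 5 highest similarities
for rank, n in enumerate(top5, start=1):
    print('TOP' + str(rank), index[n][:16], 'score = ' + str(score[n]))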

*(Figure: TOP5 output for audiostock_43054)* Running the code lists the song titles in descending order of similarity. TOP1, `audiostock_43054`, is the query song itself (hence its cosine similarity is the maximum, 1). Looking at TOP2 onward, the most similar song is `audiostock_52175`, with a similarity of 0.9530.

Now, let's run similar song searches for various songs. *(Figure: similar-song search results for four query songs)*

I picked four query songs, checked whether TOP2 through TOP5 contained songs similar to TOP1, and marked the applicable songs in yellow.

As a result, 11 of the 16 candidate songs could be judged similar, meaning similar songs could be found with **an average accuracy of about 70%**. Moreover, 9 of the 11 similar songs were not in the training dataset, so the **similar song search generalizes effectively**.

In particular, I thought the starred TOP5 for the light acoustic-guitar query and the TOP3 for the rock query were exactly the kind of songs I was aiming for.

10. Summary

Through this challenge, I found that **similar song search using a BGM classifier is quite usable**. Adding this method on top of the current tag-based search would let you find a song you like efficiently from a huge amount of BGM.

I also learned for the first time that the Neural Network Console has a handy function for analyzing the intermediate output of a trained neural network. The Neural Network Console is an easy-to-use AI tool that scratches exactly where it itches, and I recommend it.
