I tried classifying music as major / minor with a neural network

The other day I studied the code of a simple Recurrent Neural Network, and wanting to do something with it, I came up with the idea of classifying music as major or minor. That said, I am not well versed in music, so I started by looking up major and minor keys on Wikipedia.

Key is a musical term. When a melody or chords are organized around a central note, the music is said to have tonality. In a narrow sense, traditional Western music knows two keys built from the notes of the diatonic scale, the major key and the minor key, whose central notes are the "do" and the "la" of the diatonic scale, respectively.

That is the basic definition. As I learned in elementary and junior high school, a major key sounds "bright and energetic", while a minor key sounds "dark and heavy". I investigated whether the two can be told apart programmatically.

A little more explanation of major key and minor key

The central note is also called the tonic (hereinafter referred to as the Base Key). The best-known key is "C major", with "C" (do) as the Base Key, the familiar "do re mi fa sol la ti do". (In Japan it is also called "ha-chōchō".) The minor key starting on la is "A minor" ("i-tanchō" in Japanese). I quote both scales from Wikipedia.

Fig. C major scale

Fig. A minor scale

Did you notice? Both scales are written on a plain staff with neither a sharp nor a flat to the right of the treble clef. In other words, C major and A minor consist of the same constituent notes. (The elements are the same.) The difference between the two is whether the Base Key is "C" or "A". (The Base Key of A minor lies lower.) Incidentally, two scales that share the same constituent notes are called "relative keys" in English; the Japanese term translates literally as "parallel key".

Now, when dealing with the scale of music, we assign numbers to each note.

Fig. Key mapping

The figure above shows one octave of the keyboard. Integers 3, 4, 5, ... are assigned in order starting from the leftmost "C" (do). Black keys get numbers too, so the white keys alone form a slightly irregular sequence, but this is the numbering used in the program.
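As a small illustration of this numbering, the sketch below maps note names to integers. The names `NOTE_NAMES`, `C_OFFSET` and `note_number` are hypothetical and not part of the original program; the only facts taken from the text are that the leftmost C is 3 and that every key, black or white, advances the number by 1.

```python
# Hypothetical helper illustrating the key numbering described above.
# Assumption: the leftmost C is 3 and each semitone (black keys included) adds 1.
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
C_OFFSET = 3  # number assigned to the leftmost C


def note_number(name, octave=0):
    """Return the integer for a note name; octave=1 means one octave higher."""
    return C_OFFSET + NOTE_NAMES.index(name) + 12 * octave


# White keys only: the "slightly irregular" sequence mentioned in the text.
white_keys = [note_number(n) for n in ['C', 'D', 'E', 'F', 'G', 'A', 'B']]
print(white_keys)  # [3, 5, 7, 8, 10, 12, 14]
```

The gaps of 1 between E and F (7, 8) and between B and the next C (14, 15) are the places where the keyboard has no black key.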

Problem setting and dataset generation

I mentioned "C major" and "A minor" as typical major and minor keys, but those two alone would make a boring classification problem. So I handle 5 major and 5 minor scales, 10 scales in total, and set the problem as a coarse split of them into major and minor.

I prepared the following 10 scales, 5 major and 5 minor, as five pairs of relative keys: C major and A minor, G major and E minor, D major and B minor, A major and F sharp minor, E major and C sharp minor.

A program picks one of these 10 scales at random and generates a song (a sequence of notes). First, I prepared constants that follow the rules of each key.

scale_names = ['Cmj', 'Gmj', 'Dmj', 'Amj', 'Emj', 'Amn', 'Emn', 'Bmn', 'Fsmn', 'Csmn']

cmj_set = [3, 5, 7, 8, 10, 12, 14]
cmj_base = [3, 15]
amn_base = [12, 24]

gmj_set = [3, 5, 7, 9, 10, 12, 14]
gmj_base = [10, 22]
emn_base = [7, 19]

dmj_set = [4, 5, 7, 9, 10, 12, 14]
dmj_base = [5, 17]
bmn_base = [14, 26]

amj_set = [4, 5, 7, 9, 11, 12, 14] 
amj_base = [12, 24]
fsmn_base = [9, 21]

emj_set = [4, 6, 7, 9, 11, 12, 14] 
emj_base = [7, 19]
csmn_base = [4, 16]

scale_names is the list of scale labels. Next come the lists of constituent notes, seven integers each (one octave). For example, the constituent notes "do re mi fa sol la ti" of C major are [3, 5, 7, 8, 10, 12, 14], following the key mapping in the figure above. Since this covers one octave, the notes one octave higher are obtained by adding 12 to each element: [15, 17, 19, 20, 22, 24, 26]. The Base Key is also defined for each scale. For C major it is "do" and the "do" one octave higher: cmj_base = [3, 15].
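The octave extension can be sketched in a few lines. The generation pseudocode later in this article calls a helper named `prep_2x_scale` without showing its body, so the implementation below is only my assumption of what it does (append the same notes shifted up by 12).

```python
# Sketch of the octave extension described above.
# Assumption: prep_2x_scale simply appends the notes shifted up by 12 semitones;
# the original body of this helper is not shown in the article.
cmj_set = [3, 5, 7, 8, 10, 12, 14]


def prep_2x_scale(scale):
    """Return the scale followed by the same notes one octave (12 semitones) higher."""
    return list(scale) + [note + 12 for note in scale]


myscale2 = prep_2x_scale(cmj_set)
print(myscale2)  # [3, 5, 7, 8, 10, 12, 14, 15, 17, 19, 20, 22, 24, 26]
```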

Next, the minor keys. As mentioned above, C major and A minor are relative keys (the "parallel key" relationship), so their constituent notes are the same. For A minor only the Base Key is defined, amn_base = [12, 24] (the note "la"). Later, when a song (sequence) in A minor is generated, the constituent note list cmj_set of C major is used. To exploit this relationship, the following scale definition table (a list of lists) is prepared in advance.

scale_db = [
    [cmj_set, cmj_base, amn_base],
    [gmj_set, gmj_base, emn_base],
    [dmj_set, dmj_base, bmn_base],
    [amj_set, amj_base, fsmn_base],
    [emj_set, emj_base, csmn_base]
    ]
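The constants above can be cross-checked against the usual step pattern of a major scale (2, 2, 1, 2, 2, 2 semitones between successive notes) and against the fact that the relative minor starts on the sixth degree, 9 semitones above the major tonic. The sketch below is not part of the original program: `fold`, `major_set` and `relative_minor_base` are hypothetical helpers, and it assumes the same numbering (C = 3) folded into the range 3..14.

```python
MAJOR_STEPS = [2, 2, 1, 2, 2, 2]  # semitones between successive notes of a major scale


def fold(note):
    """Fold a note number into the one-octave range 3..14 used in the article."""
    return (note - 3) % 12 + 3


def major_set(tonic):
    """Constituent notes of the major scale on `tonic`, folded and sorted."""
    notes = [tonic]
    for step in MAJOR_STEPS:
        notes.append(notes[-1] + step)
    return sorted(fold(n) for n in notes)


def relative_minor_base(major_tonic):
    """Base Key list of the relative minor (9 semitones above the major tonic)."""
    low = fold(major_tonic + 9)
    return [low, low + 12]


# (major tonic, constituent notes, minor Base Key) exactly as listed in the article
article = [
    (3,  [3, 5, 7, 8, 10, 12, 14], [12, 24]),   # C major / A minor
    (10, [3, 5, 7, 9, 10, 12, 14], [7, 19]),    # G major / E minor
    (5,  [4, 5, 7, 9, 10, 12, 14], [14, 26]),   # D major / B minor
    (12, [4, 5, 7, 9, 11, 12, 14], [9, 21]),    # A major / F sharp minor
    (7,  [4, 6, 7, 9, 11, 12, 14], [4, 16]),    # E major / C sharp minor
]
for tonic, note_set, minor_base in article:
    # Each line prints: True True
    print(major_set(tonic) == note_set, relative_minor_base(tonic) == minor_base)
```

All five rows agree with the hand-written constants, which is a quick way to catch a typo in a note list.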

Next is the generation of the data set, expressed in pseudocode as follows.

# Begin
    # Draw a random integer in 0..9 to decide the key.
    key_index = np.random.randint(10)

    # From the chosen key, extract the one-octave list of constituent notes and the Base Key list.
    myset, mybase = (scale_db[][], scale_db[][])
    # Extend the scale to 2 octaves.
    myscale2 = prep_2x_scale(myset)

    # Loop over the sequence length.
    for i in range(m_len):
        if i == 0:    # The first note is the Base Key (one octave higher).
            cur_key = mybase[1]
        else:         # The second and subsequent notes choose a direction at random.
            direct = np.random.randint(7)
            if direct < 3:
                # Select the note one step lower in the scale list than the previous note.
            elif direct < 4:
                # Select the same note as before.
            else:
                # Select the note one step higher in the scale list than the previous note.

    # Check how the sequence ends.
    if last_key in mybase:    # Is the last note a Base Key?
        proper = True
        # Adopt the sequence as data.
    else:
        proper = False
        # Discard the sequence because it does not end well.

# End
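For readers who want something executable, here is a minimal runnable sketch of the same procedure. Several details are my assumptions rather than the original code: the indexing into `scale_db` (row = relative-key pair, column 1 = major Base Key, column 2 = minor Base Key), the body of `prep_2x_scale`, the clamping at both ends of the two-octave scale, and the use of the standard `random` module in place of `np.random`. The function name `generate_sequence` is hypothetical.

```python
import random

# Scale constants copied from the article.
scale_names = ['Cmj', 'Gmj', 'Dmj', 'Amj', 'Emj', 'Amn', 'Emn', 'Bmn', 'Fsmn', 'Csmn']
cmj_set, cmj_base, amn_base = [3, 5, 7, 8, 10, 12, 14], [3, 15], [12, 24]
gmj_set, gmj_base, emn_base = [3, 5, 7, 9, 10, 12, 14], [10, 22], [7, 19]
dmj_set, dmj_base, bmn_base = [4, 5, 7, 9, 10, 12, 14], [5, 17], [14, 26]
amj_set, amj_base, fsmn_base = [4, 5, 7, 9, 11, 12, 14], [12, 24], [9, 21]
emj_set, emj_base, csmn_base = [4, 6, 7, 9, 11, 12, 14], [7, 19], [4, 16]
scale_db = [
    [cmj_set, cmj_base, amn_base],
    [gmj_set, gmj_base, emn_base],
    [dmj_set, dmj_base, bmn_base],
    [amj_set, amj_base, fsmn_base],
    [emj_set, emj_base, csmn_base],
]


def prep_2x_scale(scale):
    # Assumed helper: extend one octave to two by adding 12 semitones.
    return list(scale) + [n + 12 for n in scale]


def generate_sequence(m_len, rng):
    """Return (notes, label), or None when the walk does not end on a Base Key."""
    key_index = rng.randrange(10)
    # Assumed indexing: majors are 0..4, minors are 5..9 (same order as scale_names).
    row, col = key_index % 5, 1 + key_index // 5
    myset, mybase = scale_db[row][0], scale_db[row][col]
    myscale2 = prep_2x_scale(myset)

    pos = myscale2.index(mybase[1])          # first note: the upper Base Key
    notes = [myscale2[pos]]
    for _ in range(1, m_len):
        direct = rng.randrange(7)
        if direct < 3:
            pos -= 1                         # one scale step down
        elif direct >= 4:
            pos += 1                         # one scale step up
        # direct == 3: repeat the previous note
        pos = max(0, min(pos, len(myscale2) - 1))   # assumption: clamp at the ends
        notes.append(myscale2[pos])

    if notes[-1] not in mybase:
        return None                          # does not end on a Base Key: discard
    return notes, scale_names[key_index]


rng = random.Random(0)
dataset = []
while len(dataset) < 5:                      # keep drawing until 5 sequences are accepted
    sample = generate_sequence(20, rng)
    if sample is not None:
        dataset.append(sample)
for notes, label in dataset:
    print(', '.join(str(n) for n in notes) + ', ' + label)
```

Rejected sequences are simply redrawn, so the accepted data always starts on the upper Base Key and ends on one of the two Base Keys.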

In this way, random numbers generate a sequence of integers representing the pitch of each note. Sequences of any length can be generated in any quantity; an output example follows. (Here the sequence length is 20.)

21, 19, 17, 19, 21, 19, 17, 16, 14, 16, 17, 19, 21, 19, 21, 23, 21, 21, 19, 21, Fsmn
16, 14, 16, 14, 16, 14, 12, 11, 12, 14, 12, 14, 12, 14, 16, 14, 12, 14, 16, 16, Csmn
26, 24, 24, 24, 22, 22, 24, 22, 24, 22, 24, 22, 22, 21, 22, 24, 24, 22, 24, 26, Bmn
21, 23, 21, 19, 21, 23, 24, 24, 26, 26, 24, 23, 21, 19, 21, 19, 21, 23, 23, 21, Fsmn
24, 26, 26, 24, 22, 20, 22, 20, 19, 19, 17, 15, 14, 15, 17, 15, 17, 15, 14, 12, Amn
 ...

Each line is an integer sequence followed by its key label (a character string). You can check that the sequence on the first line starts on 21, the Base Key of F sharp minor, and ends on the same note 21. The A minor sequence on the fifth line starts on the Base Key 24 and ends on the Base Key 12, one octave below.

(Supplement)

The problem here is to classify major (bright feeling) versus minor (dark feeling), so the surest way to check whether the data generated by the program is valid would be to actually play it and listen. I took out my iPad and tapped the keyboard in the GarageBand app, but playing well enough to hear the mood (bright / dark) was quite difficult, so I gave up quickly. (Maybe I should have used the standard MIDI format for automatic playback, but I have neither the skill nor the stamina for that. For the 10 scales used here, though, I did play each scale on the keyboard and confirmed, within those limits, that it sounded bright or dark.)

Neural Network model (preliminary examination with MLP)

Before trying a Recurrent Neural Network (RNN), I first checked what happens with a Multi-layer Perceptron (MLP). This model treats the notes of a fixed-length sequence as independent numbers and feeds them to the network's input units, one note per unit, to obtain the output. Since the two kinds of key (major and minor) can be built from the same constituent notes, I expected that classification accuracy would not improve with this model, which does not treat the input as a sequence.

The configuration of this model layer is as follows.

class HiddenLayer(object):

(Omitted)

class ReadOutLayer(object):

(Omitted)

h_layer1 = HiddenLayer(input=x, n_in=seq_len, n_out=40)            #Hidden layer 1
h_layer2 = HiddenLayer(input=h_layer1.output(), n_in=40, n_out=40) #Hidden layer 2
o_layer = ReadOutLayer(input=h_layer2.output(), n_in=40, n_out=1)  #Output layer

This is an MLP with two hidden layers followed by an output layer, three layers in total. (The sequence length is seq_len = 20.) The figure below shows how training progressed with this model.
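To make the layer sizes concrete, here is a forward pass of a 20-40-40-1 network in plain NumPy. It is only a sketch. The bodies of HiddenLayer and ReadOutLayer are omitted above, so the tanh hidden activation and the initialization of the hidden layers are my assumptions; the sigmoid output and the 0.05 weight scale follow the ReadOutLayer code shown later in this article. `init_layer` and `mlp_forward` are hypothetical names.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_layer(rng, n_in, n_out):
    # Small random weights and zero biases (the 0.05 scale mirrors ReadOutLayer).
    return 0.05 * rng.standard_normal((n_in, n_out)), np.zeros(n_out)


def mlp_forward(x, params):
    """x: (batch, seq_len) note numbers -> (batch,) outputs in (0, 1)."""
    (w1, b1), (w2, b2), (wo, bo) = params
    h1 = np.tanh(x @ w1 + b1)      # hidden layer 1 (tanh is an assumption)
    h2 = np.tanh(h1 @ w2 + b2)     # hidden layer 2 (tanh is an assumption)
    return sigmoid(h2 @ wo + bo).ravel()


seq_len = 20
rng = np.random.default_rng(0)
params = [init_layer(rng, seq_len, 40),
          init_layer(rng, 40, 40),
          init_layer(rng, 40, 1)]
x = rng.integers(3, 27, size=(4, seq_len)).astype(float)  # 4 dummy note sequences
y_hypo = mlp_forward(x, params)
prediction = y_hypo > 0.5
print(y_hypo.shape, prediction.shape)  # (4,) (4,)
```

Each of the 20 input units receives one note of the sequence, which is why this model has no explicit notion of order beyond the position of each input.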

Fig. Loss & Accuracy by MLP model (RMSProp)

The red line is the cost and the blue line is the classification accuracy on the training data. The curves oscillate, probably because the hyperparameter settings (or the regularization) were not appropriate, and the final accuracy is 0.65. This is a binary classification problem, so classifying by a roll of the dice or a coin toss gives an accuracy of 0.50; the result is a modest improvement over that baseline. My impression is that it was not as bad as I had expected.

I was concerned about the stagnation of loss and accuracy at the beginning of training and about the oscillation, so I changed the optimizer. The result is shown in the figure below.

Fig. Loss & Accuracy by MLP model (Gradient Descent)

The oscillation has disappeared, but the stagnation at the beginning of training remains. The accuracy improved slightly, to about 0.67.

I calculated it with RNN (Elman Net), but ...

Next I ran the calculation with the model I was really after, the Elman Net, a simple RNN (Recurrent Neural Network). The main part of the model is the following code.

class simpleRNN(object):
    #   members:  slen  : state length
    #             w_h   : weight of input-->hidden layer
    #             w_rec : weight of recurrence
    #             w_o   : weight of hidden-->output layer
    def __init__(self, slen, nx, nrec, ny):
        self.len = slen
        self.w_h = theano.shared(
            np.asarray(np.random.uniform(-.1, .1, (nx)),
            dtype=theano.config.floatX)
        )
        self.w_rec = theano.shared(
            np.asarray(np.random.uniform(-.1, .1, (nrec)),
            dtype=theano.config.floatX)
        )
        self.w_o = theano.shared(
            np.asarray(np.random.uniform(-1., .1, (ny)),
            dtype=theano.config.floatX)
        )
        self.b_h = theano.shared(
            np.asarray(0., dtype=theano.config.floatX)            
        )
        self.b_o = theano.shared(
            np.asarray(0., dtype=theano.config.floatX)
        )
    
    def state_update(self, x_t, s0):
        # this is the network updater for simpleRNN
        def inner_fn(xv, s_tm1, wx, wr, wo, bh, bo):
            s_t = xv * wx + s_tm1 * wr + bh
            y_t = T.nnet.sigmoid(s_t * wo + bo)
            
            return [s_t, y_t]
        
        w_h_vec = self.w_h[0]
        w_rec_vec = self.w_rec[0]
        w_o = self.w_o[0]
        b_h = self.b_h
        b_o = self.b_o
        
        [s_t, y_t], updates = theano.scan(fn=inner_fn,
                        sequences=[x_t],
                        outputs_info=[s0, None],
                        non_sequences=[w_h_vec, w_rec_vec, w_o, b_h, b_o]
        )
        return y_t
(Omitted)

    net = simpleRNN(seq_len, 1, 1, 1)
    y_t = net.state_update(x_t, s0)
    y_hypo = y_t[-1]
    prediction = y_hypo > 0.5
    
    cross_entropy = T.nnet.binary_crossentropy(y_hypo, y_)

The figure below explains the structure.

Fig. Simple RNN structure

The figure shows the network unrolled in time, as assumed by the BPTT method (Backpropagation Through Time). The note sequence is fed to the model as [X1, X2, X3, ..., Xn]. Each input is weighted and passed to the hidden state S, the recursion is computed, and the series [Y1, Y2, Y3, ..., Yn] is output. The last element Yn of this series, which has already gone through the sigmoid activation, is compared with 0.5 to obtain a binary value (0 or 1).
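Because every weight in this model is a scalar, the recursion can be traced in plain Python. The sketch below mirrors inner_fn from the code above (a linear state update followed by a sigmoid output). The weight values are arbitrary illustration values, not trained ones, the target label 1 is arbitrary, and `rnn_forward` is a hypothetical name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_forward(xs, s0, w_h, w_rec, w_o, b_h, b_o):
    """Scalar recursion mirroring inner_fn; returns the series [Y1, ..., Yn]."""
    s, ys = s0, []
    for x in xs:
        s = x * w_h + s * w_rec + b_h        # state update (linear, as in inner_fn)
        ys.append(sigmoid(s * w_o + b_o))    # per-step output
    return ys


def binary_crossentropy(y_hypo, y_true):
    return -(y_true * math.log(y_hypo) + (1 - y_true) * math.log(1 - y_hypo))


xs = [21, 19, 17, 19, 21]    # first notes of the F sharp minor example above
# Arbitrary illustration weights (assumption), not values from a trained model.
ys = rnn_forward(xs, s0=0.0, w_h=0.05, w_rec=0.05, w_o=-0.5, b_h=0.0, b_o=0.0)
y_hypo = ys[-1]              # only the last output Yn is used by this model
prediction = y_hypo > 0.5
loss = binary_crossentropy(y_hypo, 1)   # 1 is an arbitrary target label here
print(len(ys), prediction)   # 5 False
```

With a single scalar state, everything the classifier knows about the melody has to be squeezed into one number by the time Yn is read.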

I ran the calculation with high hopes, but the result was disappointing.

Fig. Loss & Accuracy by 1st RNN model (RMSProp)

Training made little progress, and the final accuracy was 0.58, not much different from the chance level of 0.5. (Changing the optimizer or tweaking the hyperparameters did not help.)

I suspected the cause was that only [Yn] of the output sequence was used and the remaining information [Y1 .. Yn-1] was discarded. So I looked into improving the model.

RNN improved model (added output layer)

To make use of all the output values of the sequence [Y1, Y2, ..., Yn], I decided to combine them with weights into a single signal for classification.

Fig. Simple RNN + Read-out Layer structure

The code is built by appending the output layer of the MLP model.

class simpleRNN(object):
    #   members:  slen  : state length
    #             w_h   : weight of input-->hidden layer
    #             w_rec : weight of recurrence
    #             w_o   : weight of hidden-->output layer
    def __init__(self, slen, nx, nrec, ny):

(Omitted)
    
    def state_update(self, x_t, s0):
    
(Omitted)

class ReadOutLayer(object):                 # <====Additional class
    def __init__(self, input, n_in, n_out):
        self.input = input
        
        w_o_np = 0.05 * (np.random.standard_normal([n_in,n_out]))
        w_o = theano.shared(np.asarray(w_o_np, dtype=theano.config.floatX))
        b_o = theano.shared(
            np.asarray(np.zeros(n_out, dtype=theano.config.floatX))
        )
       
        self.w = w_o
        self.b = b_o
        self.params = [self.w, self.b]
    
    def output(self):
        linarg = T.dot(self.input, self.w) + self.b
        # Return the result directly; assigning it to self.output would
        # overwrite this method on the instance.
        return T.nnet.sigmoid(linarg)
        
(Omitted)

    net = simpleRNN(seq_len, 1, 1, 1)
    y_t = net.state_update(x_t, s0)
    y_tt = T.transpose(y_t)
    ro_layer = ReadOutLayer(input=y_tt, n_in=seq_len, n_out=1)  # <==== added
    
    y_hypo = (ro_layer.output()).flatten()
    prediction = y_hypo > 0.5
    
    cross_entropy = T.nnet.binary_crossentropy(y_hypo, y_)
    
(Omitted)
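The read-out step itself is just a weighted sum of all step outputs followed by a sigmoid. Here is a minimal plain-Python sketch of that step, together with a count of trainable parameters before and after the change. The Y values are random stand-ins rather than real RNN outputs, and `readout` is a hypothetical name; the parameter counts follow from the shapes in the code above (scalar RNN weights, seq_len = 20, one output unit).

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def readout(ys, w, b):
    """Weighted sum of all step outputs followed by a sigmoid (one output unit)."""
    return sigmoid(sum(wi * yi for wi, yi in zip(w, ys)) + b)


seq_len = 20
rng = random.Random(0)
ys = [rng.random() for _ in range(seq_len)]                # stand-in for [Y1, ..., Yn]
w = [0.05 * rng.gauss(0.0, 1.0) for _ in range(seq_len)]   # same scale as w_o_np
b = 0.0
y_hypo = readout(ys, w, b)
prediction = y_hypo > 0.5

n_params_rnn = 5                              # w_h, w_rec, w_o, b_h, b_o (all scalars)
n_params_total = n_params_rnn + seq_len + 1   # plus read-out weights and bias
print(n_params_total)  # 26
```

The first model had only 5 trainable numbers; the read-out layer adds 21 more, one weight per time step plus a bias.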

Training progressed as follows.

Fig. Loss & Accuracy by 2nd RNN model (RMSProp)

As training progressed, the final accuracy improved to 0.73. I think there are two reasons: the information in the output sequence was extracted as intended, and the additional weights gave the network more degrees of freedom, which made it more flexible in fitting the data.

However, an accuracy of 0.73 is below what I initially expected. (My target was a classification accuracy of 0.9 or higher.) Investigating the behavior of each weight might allow further improvement, but I will stop here for now.

This time I used artificial music data generated by a program with random numbers, and I think this may also contribute to the low accuracy. (Doesn't real music follow more complicated rules?) If I can obtain such data, I would like to classify melodies written by humans into major / minor. (I may need to study a little more music theory.)

References (web site)

- Key - Wikipedia
- C major - Wikipedia
- A minor - Wikipedia
- [pdf] Artificial Neural Networks that Classify Music Chords
- Theano scan - Looping in Theano http://deeplearning.net/software/theano/library/scan.html
- Theano optimizers - Gist / kastnerkyle/opimizers.py https://gist.github.com/kastnerkyle/816134462577399ee8b2
- Deep Learning, Kodansha Machine Learning Professional Series