Hello everybody
Google has released an automatic composition learner called Magenta, so I thought I could churn out as many songs as I liked, but it turned out not to be that easy. The training source files have to be MIDI, so apparently you can't train it on an arbitrary recorded file.
So I experimented with various things to see whether I could build my own automatic composition learner that works on wav files. The last few articles were the experiments leading up to this one.
TL;DR
The repository is here: https://github.com/niisan-tokyo/music_generator
The actual training uses this file: https://github.com/niisan-tokyo/music_generator/blob/master/src/stateful_learn.py
Google has released Magenta, an automatic composition tool built on TensorFlow. https://magenta.tensorflow.org/ It's a great tool, but the target format is MIDI, which I'm not very familiar with. (There used to be plenty of MIDI around, but these days it's mostly mp3...)
To handle recorded files easily, it seems best to work with raw waveform data such as wav. Moreover, Python can handle wav natively, so I felt I had to give it a try.
I first thought about feeding the raw waveform data directly into the learner, but that was completely useless, so I came up with the following plan instead.
In other words, I thought I could build something like this: once you have a time series of frequency distributions, you can turn it back into music by applying an inverse Fourier transform to it.
In a previous article, I confirmed that a wav file reconstructed by inverse-transforming 256-frame Fourier transforms could be listened to without any problem, so I think this is a sufficient feature representation of the sound itself. http://qiita.com/niisan-tokyo/items/764acfeec77d8092eb73
Applying the fast Fourier transform (FFT) from the numpy library to, say, a 256-frame interval gives you a 256-component frequency distribution. The frame rate (frames per second) of an audio file is normally 44100, so you get one frequency distribution every 256 / 44100 ≈ 5.8 ms. The idea is that if we can build a learner that automatically generates this time evolution, one step every 5.8 ms, music will be created automatically.
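As a rough sketch of this step (this is not the exact code from the repository; the file path and variable names are placeholders), chunking a stereo wav and taking the FFT of each chunk could look like this:

import numpy as np
import wave

N = 256        # frames per chunk
rate = 44100   # typical wav frame rate

# Read a 16-bit stereo wav file (the path is a placeholder)
with wave.open('/data/source/sample.wav', 'rb') as w:
    raw = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
left, right = raw[0::2], raw[1::2]   # de-interleave the two channels

# One FFT per 256-frame chunk => one frequency distribution every N / rate ≈ 5.8 ms
chunks = len(left) // N
K_left = np.fft.fft(left[:chunks * N].reshape(chunks, N), axis=1)
print(K_left.shape)   # (chunks, 256) complex spectra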
stateful RNN
An RNN (Recurrent Neural Network) is a type of network that handles sequential state: when computing the output for an input, it refers to what it has computed previously. http://qiita.com/kiminaka/items/87afd4a433dc655d8cfd
When working with RNNs in Keras, you usually feed several consecutive timesteps as one input and produce an output, and the internal state is reset for each new input. A stateful RNN, on the other hand, keeps the state left over from the previous batch when processing the next one. The expectation is that this allows complex sequence processing over irregular spans.
This time, I figured that by sequentially training a stateful RNN with the frequency distribution at the current moment as input and the frequency distribution at the next moment as output, I could create a generator that captures the flow of a song.
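As a tiny illustration of that behavior (this is not the article's code; the layer sizes and shapes here are arbitrary), Keras keeps the internal state of a stateful LSTM across calls until you clear it explicitly:

from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

m = Sequential()
m.add(LSTM(8, batch_input_shape=(1, 1, 4), stateful=True))  # batch size must be fixed
m.add(Dense(4))
m.compile(loss='mse', optimizer='adam')

x = np.random.rand(1, 1, 4)
m.predict(x)        # the LSTM state after this call is kept ...
m.predict(x)        # ... and is used as the starting state of this call
m.reset_states()    # only this clears the carried-over state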
If you have m4a or mp3 files, you can convert them to wav with ffmpeg. http://qiita.com/niisan-tokyo/items/135824905e4a3021d358 I recorded my favorite game music on my Mac and exported it to wav.
The dataset is created by Fourier-transforming the wav, so basically you can just refer to the code, but there are a few caveats.
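For reference, a minimal way to drive that conversion from Python (assuming ffmpeg is installed; the file names are placeholders):

import subprocess

# Convert an m4a (or mp3) recording into a wav that the wave module can read
subprocess.run(['ffmpeg', '-i', 'recording.m4a', 'recording.wav'], check=True)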
def create_test_data(left, right):
    arr = []
    for i in range(0, len(right)-1):
        # Complex number vectorization
        temp = np.array([])
        temp = np.append(temp, left[i].real)
        temp = np.append(temp, left[i].imag)
        temp = np.append(temp, right[i].real)
        temp = np.append(temp, right[i].imag)
        arr.append(temp)
    return np.array(arr)
This is the part that combines the frequency distributions of the Fourier-transformed stereo source into the input data. The real and imaginary parts of the complex frequency components are stored in separate elements of the vector, because computing directly with complex numbers would drop the imaginary part. As a result, although one interval is 256 frames, the actual input dimension is 4 × 256 = 1024 (real and imaginary parts for each of the two channels).
The model is simple: three stacked LSTMs followed by a fully connected layer.
model = Sequential()
model.add(LSTM(256,
               input_shape=(1, dims),
               batch_size=samples,
               return_sequences=True,
               activation='tanh',
               stateful=True))
model.add(Dropout(0.5))
model.add(LSTM(256, stateful=True, return_sequences=True, activation='tanh'))
model.add(Dropout(0.3))
model.add(LSTM(256, stateful=True, return_sequences=False, activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(dims))
model.compile(loss='mse', optimizer='adam')
Now, there are some conditions when using a stateful RNN.
First, you have to specify the input shape per batch, i.e., fix the batch size in advance.
Second, each sample in one batch and the corresponding sample in the next batch must be consecutive as a series.
As a concrete example, if there is a first batch $X_1$ and a second batch $X_2$, the $i$-th samples $X_1[i]$ and $X_2[i]$ of the two batches must be related, that is, they must be consecutive pieces of the same series.
This time, we assume the generator has the form $x_{n+1} = RNN(x_n)$ and produce, from a block of consecutive states, the block of states that follows it. In other words, $X_2[i]$ is always ahead of $X_1[i]$ by the number of samples per batch. It may be a bit crude, but the machine takes one block of 32 chunks (the number of samples) and produces the next block of 32 as its prediction.
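As a small sketch of this pairing (the variable names follow the training code below, but `spectra` standing in for the array of per-chunk frequency vectors is an assumption):

import numpy as np

samples = 32          # batch size: number of chunks per batch
dims = 4 * 256        # real + imaginary parts of a 256-point FFT, two channels

spectra = np.random.rand(16384 + samples, dims)   # placeholder for the real data

# Each input chunk is paired with the chunk `samples` steps ahead, so one batch of
# 32 consecutive chunks is trained to produce the block of 32 chunks that follows it.
in_data = np.reshape(spectra[:-samples], (-1, 1, dims))   # shape (T, 1, dims)
out_data = spectra[samples:]                              # shape (T, dims)

# With batch_size=samples and shuffle=False, sample i of one batch and sample i of
# the next batch are consecutive blocks of the same series, which is exactly the
# continuity that stateful=True relies on.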
Now that you're ready, let's start learning.
for num in range(0, epochs):
    print(num + 1, '/', epochs, ' start')
    for one_data in test:
        in_data = one_data[:-samples]
        out_data = np.reshape(one_data[samples:], (batch_num, dims))
        model.fit(in_data, out_data, epochs=1, shuffle=False, batch_size=samples)
        model.reset_states()
    print(num + 1, '/', epochs, ' epoch is done!')
model.save('/data/model/mcreator')
Since the order of the batches matters during training, the samples are not shuffled (shuffle=False). Training is also done one wav file at a time, and the internal state is reset with reset_states() after each file has been fit.
When I actually run the training, the fit does seem to be making progress.
1 / 10 start
Epoch 1/1
16384/16384 [==============================] - 87s - loss: 1.9879e-04
Epoch 1/1
16384/16384 [==============================] - 84s - loss: 1.9823e-04
Epoch 1/1
16384/16384 [==============================] - 75s - loss: 1.1921e-04
Epoch 1/1
16384/16384 [==============================] - 82s - loss: 2.3389e-04
Epoch 1/1
16384/16384 [==============================] - 80s - loss: 3.7428e-04
Epoch 1/1
16384/16384 [==============================] - 90s - loss: 3.3968e-04
Epoch 1/1
16384/16384 [==============================] - 87s - loss: 5.0188e-04
Epoch 1/1
16384/16384 [==============================] - 76s - loss: 4.9725e-04
Epoch 1/1
16384/16384 [==============================] - 74s - loss: 3.7447e-04
Epoch 1/1
16384/16384 [==============================] - 87s - loss: 4.1855e-04
1 / 10 epoch is done!
2 / 10 start
Epoch 1/1
16384/16384 [==============================] - 82s - loss: 1.9742e-04
Epoch 1/1
16384/16384 [==============================] - 85s - loss: 1.9718e-04
Epoch 1/1
16384/16384 [==============================] - 90s - loss: 1.1876e-04
Epoch 1/1
16384/16384 [==============================] - 104s - loss: 2.3144e-04
Epoch 1/1
16384/16384 [==============================] - 97s - loss: 3.7368e-04
Epoch 1/1
16384/16384 [==============================] - 78s - loss: 3.3906e-04
Epoch 1/1
16384/16384 [==============================] - 87s - loss: 5.0128e-04
Epoch 1/1
16384/16384 [==============================] - 79s - loss: 4.9627e-04
Epoch 1/1
16384/16384 [==============================] - 82s - loss: 3.7420e-04
Epoch 1/1
16384/16384 [==============================] - 90s - loss: 4.1857e-04
2 / 10 epoch is done!
...
Let's generate sound using the trained model. I'll leave the detailed code to stateful_use.py in the repository and only outline the general flow here; the generator part looks like this.
# Fourier transform of the seed file
Kl = fourier(left, N, samples * steps)
Kr = fourier(right, N, samples * steps)
sample = create_test_data(Kl, Kr)
sample = np.reshape(sample, (samples * steps, 4 * N))
music = []

# Feed the seed data into the model => build up ("foster") its internal state
for i in range(steps):
    in_data = np.reshape(sample[i * samples:(i + 1) * samples], (samples, 1, 4 * N))
    model.predict(np.reshape(in_data, (samples, 1, 4 * N)))

# Self-generate the music by repeatedly feeding the last output back into the
# model whose state has been primed with the seed data
for i in range(0, frames):
    if i % 50 == 0:
        print('progress: ', i, '/', frames)
    music_data = model.predict(np.reshape(in_data, (samples, 1, 4 * N)))
    music.append(np.reshape(music_data, (samples, 4 * N)))
    in_data = music_data
music = np.array(music)
The data obtained this way is run through an inverse Fourier transform back into real space and then written out as a wav file (a rough sketch of this step follows the next paragraph). Looking at the waveform in real space, it looks like the following.
(The same waveform over a slightly longer span.) When I listen to this as wav, it is in the surreal state of playing a buzzer-like tone of a fixed pitch, a constant "boo", the whole time... Far from being music, it has become a mysterious machine that only produces a steady drone. No matter what music file I feed in as the seed, it makes the same sound.
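For reference, here is a rough sketch of that reconstruction step. It is not the exact code from stateful_use.py: `music`, `N`, and the 44100 frame rate come from the snippets above, while the function name, output path, and int16 scaling are my own assumptions.

import numpy as np
import wave

def spectra_to_wav(music, N, path='/tmp/generated.wav', rate=44100):
    blocks = music.reshape(-1, 4 * N)                    # one row per generated chunk
    left = blocks[:, 0:N] + 1j * blocks[:, N:2 * N]      # rebuild the complex spectra
    right = blocks[:, 2 * N:3 * N] + 1j * blocks[:, 3 * N:4 * N]
    # Inverse FFT per chunk, then concatenate the chunks into one long signal
    l = np.fft.ifft(left, axis=1).real.flatten()
    r = np.fft.ifft(right, axis=1).real.flatten()
    # Interleave the two channels and scale to 16-bit integers
    stereo = np.empty(l.size * 2)
    stereo[0::2], stereo[1::2] = l, r
    data = (stereo / max(np.max(np.abs(stereo)), 1e-9) * 32767).astype(np.int16)
    with wave.open(path, 'wb') as w:
        w.setnchannels(2)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(data.tobytes())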
My guess as to why it didn't work: the sound transitions are so intense that learning progressed toward outputting a constant that minimizes the error. That would also explain why the loss fluctuates so little despite the system being complicated. Doing it stateless instead doesn't look easier either: you don't know how many timesteps the sequence should span, and the input dimension grows with the sequence length, so it would be hard to learn. It could also be that there were simply too few training iterations, but since the loss is already very small, the problem probably lies elsewhere.
Either way, we need to improve more.
Of course, if making songs were that easy, it wouldn't be much of a challenge. It didn't work very well, but I feel I've learned a fair amount about using Python, and numpy in particular, so that much was worthwhile.
Also, somewhat incidentally, I ended up getting a lot of practice with numpy's reshape.
That's about it for this time.
Update 2017/06/18: I made the following changes and got the waveform below. The background noise remains as before, but it now sounds like it has a steady rhythm. The resulting frequency distribution (its real part) is also shown: ten distributions, taken every 5 frames, overlaid on one another.
The settings compared are N = 256 with 256 LSTM neurons and N = 1024 with 512 LSTM neurons; with N = 1024 but only 256 neurons, the result was the same as with N = 256.
The factor applied during the Fourier transform was also changed, which of course increases the loss: since the loss is MSE, scaling the values by some factor scales the loss by its square. This makes changes in the smaller components visible to the optimizer, which may be what improved the accuracy. Increasing the number of neurons also increased the model's expressive power.
As further improvements, one could raise the factor applied during the Fourier transform even more, or add more neurons. The number of epochs (10) and the number of source songs (9) could also be changed, but since the loss changes so little I couldn't tell whether leaving them was actually hurting, so I've set that aside for now.
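To make the scaling explicit: if every component is multiplied by a constant factor $a$, the mean squared error picks up a factor of $a^2$:

$$\frac{1}{n}\sum_{i}\left(a\,y_i - a\,\hat{y}_i\right)^2 = a^2 \cdot \frac{1}{n}\sum_{i}\left(y_i - \hat{y}_i\right)^2$$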