Automatic composition by deep learning (Stacked LSTM edition) [DW Day 6]

0. Roughly speaking

--I wrote a Stacked LSTM in Chainer
--I tried automatic composition with it
--This was the result → Playback (Caution! Sound will be played immediately)

1. What is LSTM?

See below.

2. What is Stacked LSTM?

A neural network with multiple stacked LSTM layers. Stacking layers is expected to let the layers learn correlations at different time scales, long-range in some and short-range in others. Incidentally, there is also a network called Grid LSTM, which is made multidimensional by connecting LSTMs in both the vertical and horizontal directions. It reportedly performs well on a Wikipedia character prediction task and a Chinese translation task.

3. Chainer code

I made a neural network like the one below.

The input layer and output layer are One-hot vectors. The four intermediate layers (orange) are the LSTM layers.

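import numpy as np
from chainer import Chain, Variable
import chainer.functions as F
import chainer.links as L

# Four-layer stacked LSTM. This code uses the Chainer v1-style API
# (F.dropout(..., train=...) and Variable(..., volatile=...)).
class rnn(Chain):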
  state = {}

  def __init__(self, n_vocab, n_units):
    print n_vocab, n_units
    super(rnn, self).__init__(
      l_embed = L.EmbedID(n_vocab, n_units),
      l1_x = L.Linear(n_units, 4 * n_units),
      l1_h = L.Linear(n_units, 4 * n_units),
      l2_x = L.Linear(n_units, 4 * n_units),
      l2_h = L.Linear(n_units, 4 * n_units),
      l3_x=L.Linear(n_units, 4 * n_units),
      l3_h=L.Linear(n_units, 4 * n_units),
      l4_x=L.Linear(n_units, 4 * n_units),
      l4_h=L.Linear(n_units, 4 * n_units),
      l_umembed = L.Linear(n_units, n_vocab)
    )

  def forward(self, x, t, train=True, dropout_ratio=0.5):
    h0 = self.l_embed(x)
    c1, h1 = F.lstm(
      self.state['c1'],
      F.dropout( self.l1_x(h0), ratio=dropout_ratio, train=train ) + self.l1_h(self.state['h1'])
    )
    c2, h2 = F.lstm(
      self.state['c2'],
      F.dropout( self.l2_x(h1), ratio=dropout_ratio, train=train ) + self.l2_h(self.state['h2'])
    )
    c3, h3 = F.lstm(
      self.state['c3'],
      F.dropout( self.l3_x(h2), ratio=dropout_ratio, train=train ) + self.l3_h(self.state['h3'])
    )
    c4, h4 = F.lstm(
      self.state['c4'],
      F.dropout( self.l4_x(h3), ratio=dropout_ratio, train=train ) + self.l4_h(self.state['h4'])
    )
    y = self.l_umembed(h4)
    self.state = {'c1': c1, 'h1': h1, 'c2': c2, 'h2': h2, 'c3': c3, 'h3': h3, 'c4': c4, 'h4': h4}
    if train:
      return F.softmax_cross_entropy(y, t), F.accuracy(y, t)
    else:
      return F.softmax(y), y.data

  def initialize_state(self, n_units, batchsize=1, train=True):
    for name in ('c1', 'h1', 'c2', 'h2', 'c3', 'h3', 'c4', 'h4'):
      self.state[name] = Variable(np.zeros((batchsize, n_units), dtype=np.float32), volatile=not train)

4. Experiment (I tried to improve the performance of automatic music generation)

I wanted to improve on the automatic composition from the previous article (RNN + LSTM automatic composition [DW Day 1]). Last time I used a 2-layer LSTM; this time I replaced it with the 4-layer LSTM described above and retrained.

Training data

I used the same MIDI data as last time. However, the MIDI file needs to be converted to text format first. The conversion code is listed below. Running `python midi2text.py --midi foo.midi` extracts a single track and converts it to text. The generated text looks like `0_80_40_00 0_90_4f_64 0_90_43_64 120_80_4f_00 ...`: each item (called a "chunk") is the delta time, the status byte, and the 1 or 2 data bytes following the status byte, joined with underscores, and the chunks are separated by single spaces. This text was used as training data for the LSTM described above.
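As a rough illustration of the text format, the sketch below parses chunks and assigns integer IDs to them. It assumes that each distinct chunk is treated as one vocabulary entry (the training script is not shown in this article, so this is a guess), and `parse_chunk` and `build_vocab` are hypothetical names. It is written for Python 3, unlike the scripts in this article.

```python
# Hypothetical helpers; these names do not appear in the original scripts.
def parse_chunk(chunk):
    """Split 'delta_status_data1[_data2]' into (delta, status, data bytes)."""
    fields = chunk.split('_')
    delta = int(fields[0])                    # delta time is written in decimal
    status = int(fields[1], 16)               # status byte is written in hex
    data = [int(f, 16) for f in fields[2:]]   # data bytes are written in hex
    return delta, status, data

def build_vocab(text):
    """Assign an integer ID to each distinct chunk, in order of first appearance."""
    vocab = {}
    ids = []
    for chunk in text.split():
        if chunk not in vocab:
            vocab[chunk] = len(vocab)
        ids.append(vocab[chunk])
    return vocab, ids

text = '0_80_40_00 0_90_4f_64 0_90_43_64 120_80_4f_00 0_90_4f_64'
vocab, ids = build_vocab(text)
print(parse_chunk('120_80_4f_00'))   # (120, 128, [79, 0])
print(ids)                           # [0, 1, 2, 3, 1]
```

With this kind of mapping, `len(vocab)` would play the role of `n_vocab` in the network, and each ID corresponds to one position of the one-hot input and output vectors.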

Learning curve

Song generation

After training, I generated a sequence of chunks using the same network. The generated chunk sequence was then converted to a MIDI file using the code listed below.
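The generation script is not included in this article, so the following is only a sketch of one common way to do it: repeatedly draw the next chunk ID from the probabilities the network outputs. Here `next_probs` is a hypothetical stand-in for the trained network (in the real model, the LSTM state would also carry over between steps), and `sample_index` and `generate` are assumed names. Whether the author sampled or simply took the most probable chunk is not stated.

```python
import random

def sample_index(probs, rng=random):
    """Draw an index i with probability probs[i] (probs should sum to about 1)."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding

def generate(next_probs, id_to_chunk, first_id, length, rng=random):
    """Generate `length` chunks; next_probs(prev_id) returns a probability list."""
    ids = [first_id]
    for _ in range(length - 1):
        ids.append(sample_index(next_probs(ids[-1]), rng))
    return ' '.join(id_to_chunk[i] for i in ids)

# Toy stand-in model that deterministically alternates between two chunks
id_to_chunk = ['0_90_4f_64', '120_80_4f_00']
toy = lambda prev: [0.0, 1.0] if prev == 0 else [1.0, 0.0]
print(generate(toy, id_to_chunk, 0, 4))
# 0_90_4f_64 120_80_4f_00 0_90_4f_64 120_80_4f_00
```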

This was the result → Playback (Caution! Audio will be played immediately)

Impressions

The song was pretty good. However, the influence of the source MIDI seems to remain quite strongly. I had also expected the Stacked LSTM to give the song more structure, but it turned out more monotonous than I had imagined. To give the song more movement, I think it would be better to deliberately inject irregular noise (a sound that shifts the sequence away from what has continued until then) when outputting the sequence, rather than generating everything automatically. From the point of view of composition, there are other issues besides note selection: adjusting the tone, adding effects, and playing sessions with multiple instruments also remain to be addressed.
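The noise idea was not implemented in this article. As one possible reading of it, the sketch below occasionally replaces the chunk ID chosen by the model with a uniformly random one. The function name `pick_with_noise` and the parameter `noise_rate` are assumptions made for illustration.

```python
import random

def pick_with_noise(model_choice, n_vocab, noise_rate, rng=random):
    """Return model_choice, or a uniformly random ID with probability noise_rate."""
    if rng.random() < noise_rate:
        return rng.randrange(n_vocab)   # irregular "noise" chunk
    return model_choice

rng = random.Random(0)
print(pick_with_noise(3, 10, 0.0, rng))   # 3 (noise disabled)
```

If the noisy ID is fed back into the network as the next input, the following output would also be pushed away from the sequence so far, which seems to be the intent described above.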

Code

--midi → text (in the format `delta time_status byte_data bytes following the status byte`)

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import os
import struct
from binascii import *
from types import *
reload(sys)
sys.setdefaultencoding('utf-8')

def is_eq_0x2f(b):
  return int(b2a_hex(b), 16) == int('2f', 16)

def is_gte_0x80(b):
  return int(b2a_hex(b), 16) >= int('80', 16)

def is_eq_0xff(b):
  return int(b2a_hex(b), 16) == int('ff', 16)

def is_eq_0xf0(b):
  return int(b2a_hex(b), 16) == int('f0', 16)

def is_eq_0xf7(b):
  return int(b2a_hex(b), 16) == int('f7', 16)

def is_eq_0x8n(b):
  return int(b2a_hex(b), 16) >= int('80', 16) and int(b2a_hex(b), 16) <= int('8f', 16)

def is_eq_0x9n(b):
  return int(b2a_hex(b), 16) >= int('90', 16) and int(b2a_hex(b), 16) <= int('9f', 16)

def is_eq_0xan(b): # An: 3byte
  return int(b2a_hex(b), 16) >= int('a0', 16) and int(b2a_hex(b), 16) <= int('af', 16)

def is_eq_0xbn(b): # Bn: 3byte
  return int(b2a_hex(b), 16) >= int('b0', 16) and int(b2a_hex(b), 16) <= int('bf', 16)

def is_eq_0xcn(b): # Cn: 2byte
  return int(b2a_hex(b), 16) >= int('c0', 16) and int(b2a_hex(b), 16) <= int('cf', 16)

def is_eq_0xdn(b): # Dn: 2byte
  return int(b2a_hex(b), 16) >= int('d0', 16) and int(b2a_hex(b), 16) <= int('df', 16)

def is_eq_0xen(b): # En: 3byte
  return int(b2a_hex(b), 16) >= int('e0', 16) and int(b2a_hex(b), 16) <= int('ef', 16)

def is_eq_0xfn(b):
  return int(b2a_hex(b), 16) >= int('f0', 16) and int(b2a_hex(b), 16) <= int('ff', 16)

def mutable_lengths_to_int(bs):
  length = 0
  for i, b in enumerate(bs):
    if is_gte_0x80(b):
      length += ( int(b2a_hex(b), 16) - int('80', 16) ) * pow(int('80', 16), len(bs) - i - 1)
    else:
      length += int(b2a_hex(b), 16)
  return length

def int_to_mutable_lengths(length):
  length = int(length)
  bs = []
  append_flag = False
  for i in range(3, -1, -1):
    a = length // pow(int('80', 16), i)  # explicit integer division
    length -= a * pow(int('80', 16), i)
    if a > 0:
      append_flag = True
    if append_flag:
      if i > 0:
        bs.append(hex(a + int('80', 16))[2:].zfill(2))
      else:
        bs.append(hex(a)[2:].zfill(2))
  return bs if len(bs) > 0 else ['00']

def read_midi(path_to_midi):
  midi = open(path_to_midi, 'rb')
  data = {'header': [], 'tracks': []}
  track = {'header': [], 'chunks': []}
  chunk = {'delta': [], 'status': [], 'meta': [], 'length': [], 'body': []}
  current_status = None

  """
  Load data.header
  """
  bs = midi.read(14)
  data['header'] = [b for b in bs]

  while 1:
    """
    Load data.tracks[0].header
    """
    if len(track['header']) == 0:
      bs = midi.read(8)
      if bs == '':
        break
      track['header'] = [b for b in bs]

    """
    Load data.tracks[0].chunks[0]
    """
    # delta time
    # ----------
    b = midi.read(1)
    while 1:
      chunk['delta'].append(b)
      if is_gte_0x80(b):
        b = midi.read(1)
      else:
        break

    # status
    # ------
    b = midi.read(1)
    if is_gte_0x80(b):
      chunk['status'].append(b)
      current_status = b
    else:
      midi.seek(-1, os.SEEK_CUR)
      chunk['status'].append(current_status)

    # meta and length
    # ---------------
    if is_eq_0xff(current_status): # meta event
      b = midi.read(1)
      chunk['meta'].append(b)
      b = midi.read(1)
      while 1:
        chunk['length'].append(b)
        if is_gte_0x80(b):
          b = midi.read(1)
        else:
          break
      length = mutable_lengths_to_int(chunk['length'])
    elif is_eq_0xf0(current_status) or is_eq_0xf7(current_status): # sysex event
      b = midi.read(1)
      while 1:
        chunk['length'].append(b)
        if is_gte_0x80(b):
          b = midi.read(1)
        else:
          break
      length = mutable_lengths_to_int(chunk['length'])
    else: # midi event
      if is_eq_0xcn(current_status) or is_eq_0xdn(current_status):
        length = 1
      else:
        length = 2

    # body
    # ----
    for i in range(0, length):
      b = midi.read(1)
      chunk['body'].append(b)

    track['chunks'].append(chunk)


    if is_eq_0xff(chunk['status'][0]) and is_eq_0x2f(chunk['meta'][0]):
      data['tracks'].append(track)
      track = {'header': [], 'chunks': []}
    chunk = {'delta': [], 'status': [], 'meta': [], 'length': [], 'body': []}

  return data

def write_text(tracks):
  # Write every chunk of every track to out.txt, separated by spaces
  with open('out.txt', 'w') as out:
    for track in tracks:
      for chunk in track:
        out.write('{} '.format(chunk))

if __name__ == '__main__':
  from argparse import ArgumentParser
  parser = ArgumentParser(description='audio RNN')
  parser.add_argument('--midi', type=unicode, default='', help='path to the MIDI file')
  args = parser.parse_args()
  
  data = read_midi(args.midi)

  # extract midi track
  track_list = [1] # ← Track number you want to extract

  tracks = []
  for n in track_list:
    raw_data = []
    chunks = data['tracks'][n]['chunks']
    for i in range(0, len(chunks)):
      chunk = chunks[i]
      if is_eq_0xff(chunk['status'][0]) or \
         is_eq_0xf0(chunk['status'][0]) or \
         is_eq_0xf7(chunk['status'][0]) :
        continue
      raw_data.append('_'.join(
        [str(mutable_lengths_to_int(chunk['delta']))] +
        [str(b2a_hex(chunk['status'][0]))] +
        [str(b2a_hex(body)) for body in chunk['body']]
      ))
    tracks.append(raw_data)

  write_text(tracks)
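MIDI delta times are stored as variable-length quantities: each byte carries 7 bits of the value, and the top bit is set on every byte except the last. That is the format `mutable_lengths_to_int` and `int_to_mutable_lengths` handle in the script above. The following is a compact Python 3 sketch of the same encoding; `encode_vlq` and `decode_vlq` are hypothetical names and not part of the original scripts.

```python
def encode_vlq(value):
    """Encode a non-negative integer as a MIDI variable-length quantity."""
    out = [value & 0x7f]                     # last byte: continuation bit clear
    value >>= 7
    while value > 0:
        out.append((value & 0x7f) | 0x80)    # earlier bytes: continuation bit set
        value >>= 7
    return bytes(reversed(out))

def decode_vlq(data):
    """Decode one variable-length quantity from the start of a bytes object."""
    value = 0
    for b in data:
        value = (value << 7) | (b & 0x7f)
        if b < 0x80:                         # continuation bit clear: last byte
            break
    return value

print(encode_vlq(120).hex())              # 78
print(encode_vlq(480).hex())              # 8360
print(decode_vlq(bytes([0x83, 0x60])))    # 480
```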

--text (in the format `delta time_status byte_data bytes following the status byte`) → midi

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import os
import struct
from binascii import *
from types import *
reload(sys)
sys.setdefaultencoding('utf-8')

def int_to_mutable_lengths(length):
  length = int(length)
  bs = []
  append_flag = False
  for i in range(3, -1, -1):
    a = length // pow(int('80', 16), i)  # explicit integer division
    length -= a * pow(int('80', 16), i)
    if a > 0:
      append_flag = True
    if append_flag:
      if i > 0:
        bs.append(hex(a + int('80', 16))[2:].zfill(2))
      else:
        bs.append(hex(a)[2:].zfill(2))
  return bs if len(bs) > 0 else ['00']

def write_midi(tracks):
  print len(tracks)
  midi = open('out.midi', 'wb')

  """
  MIDI Header
  """
  header_bary = bytearray([])
  header_bary.extend([0x4d, 0x54, 0x68, 0x64, 0x00, 0x00, 0x00, 0x06, 0x00, 0x00])
  header_bary.extend([int(hex(len(tracks))[2:].zfill(4)[i:i+2], 16) for i in range(0, 4, 2)])
  header_bary.extend([0x01, 0xe0])
  midi.write(header_bary)

  for track in tracks:
    track_bary = bytearray([])
    for chunk in track:
      # It is assumed that each chunk consists of just 4 elements
      if len(chunk.split('_')) != 4:
        continue
      int_delta, status, data1, data2 = chunk.split('_')

      if status[0] == '8' or status[0] == '9' or status[0] == 'a' or status[0] == 'b' or status[0] == 'e': # 3byte
        delta = int_to_mutable_lengths(int_delta)
        track_bary.extend([int(d, 16) for d in delta])
        track_bary.extend([int(status, 16)]) 
        track_bary.extend([int(data1, 16)])  
        track_bary.extend([int(data2, 16)])  
      elif status[0] == 'c' or status[0] == 'd':
        delta = int_to_mutable_lengths(int_delta)
        track_bary.extend([int(d, 16) for d in delta])
        track_bary.extend([int(status, 16)]) 
        track_bary.extend([int(data1, 16)])  
      else:
        print status[0]

    """
    Track header
    """
    header_bary = bytearray([])
    header_bary.extend([0x4d, 0x54, 0x72, 0x6b])
    header_bary.extend([int(hex(len(track_bary)+4)[2:].zfill(8)[i:i+2], 16) for i in range(0, 8, 2)])
    midi.write(header_bary)

    """
    Track body
    """
    print len(track_bary)
    midi.write(track_bary)

    """
    Track footer
    """
    footer_bary = bytearray([])
    footer_bary.extend([0x00, 0xff, 0x2f, 0x00])
    midi.write(footer_bary)

  # Close the file so that everything is flushed to disk
  midi.close()

if __name__ == '__main__':

  # ↓ Chunks in the format "delta time_status byte_data bytes following the status byte", separated by spaces.
  # This does not work well if running status is included...
  txt = '0_80_40_00 0_90_4f_64 0_90_43_64 120_80_4f_00 0_80_43_00 0_90_51_64 0_90_45_64 480_80_51_00 0_80_45_00 0_90_4c_64 0_90_44_64 120_80_4c_00 0_80_44_00 0_90_4f_64 0_90_43_64 60_80_4f_00 0_80_43_00 0_90_4d_64 0_90_41_64 120_80_4d_00'
  tracks = [txt.split(' ')]
  write_midi(tracks)
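The header bytes that `write_midi` assembles by hand can also be produced with the standard `struct` module. The sketch below is a Python 3 equivalent under the same assumptions as the script (format 0, 480 ticks per quarter note, an end-of-track meta event appended to each track); `midi_header` and `track_chunk` are hypothetical names.

```python
import struct

def midi_header(n_tracks, division=480, fmt=0):
    """'MThd', header length 6, format, track count, ticks per quarter note."""
    return struct.pack('>4sIHHH', b'MThd', 6, fmt, n_tracks, division)

def track_chunk(events):
    """Append the end-of-track meta event, then prefix 'MTrk' and the body length."""
    body = bytes(events) + bytes([0x00, 0xff, 0x2f, 0x00])
    return struct.pack('>4sI', b'MTrk', len(body)) + body

print(midi_header(1).hex())   # 4d546864000000060000000101e0
```

The length field in the track header counts the end-of-track event as well, which is why the original script writes `len(track_bary)+4`.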

Link

Study Deep Learning Thoroughly [DW Day 0]
