Detect General MIDI data from large amounts of MIDI

TL;DR

What is General MIDI?

General MIDI is a unified MIDI standard that defines a basic tone map and control changes. It is abbreviated as GM. (From Wikipedia: https://ja.wikipedia.org/wiki/General_MIDI)

Roland then extended this standard, while staying compatible with it, as GS, and Yamaha extended it as XG. At the time, Roland's SC series was the best-selling sound module of this kind, so a great deal of GS data was distributed on the Nifty forums and elsewhere. Nostalgic...

I want MIDI files like that in 2020

So I wrote a script that searches for GM, GS, and XG data among the 130,000 songs linked from this reddit page (https://www.reddit.com/r/WeAreTheMusicMakers/comments/3ajwe4/the_largest_midi_collection_on_the_internet/).

How to find it?

GM, GS, and XG data often contains SysEx messages that configure the device (at least if the data was authored properly for those devices), so I built the script on the policy of judging each MIDI file by looking at its SysEx.

What is SysEx?

SysEx is an abbreviation for System Exclusive. This page (https://www.g200kg.com/jp/docs/dic/systemexclusive.html) explains it as follows.

It is one type of MIDI message: not a function common to all MIDI devices, but a message used to control functions specific to a particular sound module model, such as effects.

Working on the assumption that MIDI data using any GM, GS, or XG function should always contain the corresponding SysEx, I looked up each company's SysEx and found the following.

MIDI data standard            Start of System Exclusive
GM (General MIDI)             F0 7E xx 09
GS (Roland's GM extension)    F0 41 xx 42
XG (Yamaha's GM extension)    F0 43 xx 4C

If a SysEx message starts with one of these sequences, the data can be said to target that standard. In the table above, F0 marks the beginning of SysEx. The next byte is the manufacturer ID (41 for Roland, 43 for Yamaha; 7E is the universal non-real-time ID used by GM), xx is the device ID (an ID specific to the individual device), and the last byte is the model ID (for Roland's GS it is 42, which identifies a GS sound module; for GM, 09 is the General MIDI sub-ID rather than a model ID). The script uses these bytes for detection.
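As a rough illustration of the header check described above, here is a minimal sketch that classifies a single SysEx message given as a list of byte values. The names `SYSEX_HEADERS` and `classify_sysex` are made up for this example and are not taken from the actual script; the only facts used are the three headers in the table.

```python
from typing import Optional, Sequence

# Header patterns from the table above. None stands for the "xx" device ID,
# which may take any value. (Names here are hypothetical, for illustration.)
SYSEX_HEADERS = {
    "GM": (0xF0, 0x7E, None, 0x09),
    "GS": (0xF0, 0x41, None, 0x42),
    "XG": (0xF0, 0x43, None, 0x4C),
}


def classify_sysex(message: Sequence[int]) -> Optional[str]:
    """Return "GM", "GS" or "XG" if the message starts with a known header."""
    for name, header in SYSEX_HEADERS.items():
        if len(message) < len(header):
            continue  # too short to contain this header
        if all(h is None or h == b for h, b in zip(header, message)):
            return name
    return None


# GM System On, a commonly seen GM SysEx message
print(classify_sysex([0xF0, 0x7E, 0x7F, 0x09, 0x01, 0xF7]))  # GM
```

Only the first four bytes are compared, so the rest of the message (addresses, data, checksum) does not affect the result.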

About scripts

For handling MIDI in Python there is a library called mido (https://mido.readthedocs.io/en/latest/#), and that is what I use. The repository name is also a nod to mastering it. Mido is well maintained (important) and can be used in many ways, so I think it is a great fit for handling MIDI.

The run results are in a notebook called GMMidiCheck.ipynb. While I was writing it, a friend told me, "If you write it as .py, you can run tests in CI," which made sense to me, so in the end all the functions went into midi_utill.py (though I haven't written any tests yet...). Therefore, each cell of the .ipynb runs:

importlib.reload(midi_utill)

This reloads midi_utill.py so that edits to it are picked up. The core of the processing is comparing the SysEx in each MIDI file, so every file is first converted to hex with the following function and then compared.

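import mido


def getMidiHexData(midifilename):
    """Return every message in the MIDI file as a hex string, one per message."""
    midi = mido.MidiFile(midifilename)
    MidiData = []
    # Walk all tracks in order and collect the hex form of each message
    for track in midi.tracks:
        for msg in track:
            MidiData.append(msg.hex())

    return MidiData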
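To show how the hex list might then be compared against the SysEx headers, here is a minimal sketch. It assumes each string is made of space-separated two-digit hex bytes (which I believe is the default output of `msg.hex()`, but check it on your mido version). `HEX_PATTERNS` and `detect_standards` are hypothetical names for this example, not functions from midi_utill.py.

```python
import re

# One pattern per standard; "[0-9A-F]{2}" stands for the "xx" device ID byte.
# Assumes space-separated, two-digit hex bytes (an assumption about the input).
HEX_PATTERNS = {
    "GM": re.compile(r"F0 7E [0-9A-F]{2} 09\b"),
    "GS": re.compile(r"F0 41 [0-9A-F]{2} 42\b"),
    "XG": re.compile(r"F0 43 [0-9A-F]{2} 4C\b"),
}


def detect_standards(hex_messages):
    """Return the set of standards whose SysEx header starts any message."""
    found = set()
    for text in hex_messages:
        normalized = text.strip().upper()
        for name, pattern in HEX_PATTERNS.items():
            # match() only succeeds at the start of the string
            if pattern.match(normalized):
                found.add(name)
    return found


messages = ["90 3C 40", "F0 41 10 42 12 40 00 7F 00 41 F7"]
print(detect_standards(messages))  # {'GS'}
```

A file can then be treated as a hit when the returned set is not empty, for example `bool(detect_standards(getMidiHexData(path)))`.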

I also learned a lot about file and directory handling along the way, but that is just ordinary Python, so I will skip it here. If you are interested, please read the script.

By the way, please set the following two variables in GMMidiCheck.ipynb.

For now, the reddit data is assumed to be in the same directory as GMMidiCheck.ipynb. Likewise, the data judged to be GM compliant is written out to a directory created in that same location.
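The file handling is not shown in this post, so the following is only a sketch of what "scan a directory and write the hits to another directory" could look like. `copy_matching_midi`, `source_dir`, `output_dir`, and `is_match` are all hypothetical names; they are not the two notebook variables mentioned above.

```python
import shutil
from pathlib import Path


def copy_matching_midi(source_dir, output_dir, is_match):
    """Copy every MIDI file under source_dir for which is_match(path) is true.

    All names here are illustrative. output_dir should live outside
    source_dir, otherwise a second run would rescan the copies.
    """
    source = Path(source_dir)
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".mid", ".midi"}:
            continue
        try:
            matched = is_match(path)
        except Exception:
            # Large collections tend to contain broken files; skip them
            continue
        if matched:
            # Files with the same name in different folders overwrite each other
            target = output / path.name
            shutil.copy2(path, target)
            copied.append(target)
    return copied
```

Passing the check as a function keeps the directory walk separate from the SysEx logic, so the walk can be tested without any real MIDI files.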

What were the results?

Of the 130,000 songs, 33,000 were flagged (by this script) as looking like GM, GS, or XG data.

That should be plenty as training data for machine learning. Moreover, since the GM tone map is public, it should also be possible to extract "drums and percussion", "monophonic instruments", "polyphonic instruments", and so on from the tracks (and because the data is GM compliant, this should be reliably extractable from the control change information. It should be. I'm fairly sure).

By the way, checking all 130,000 songs takes about 5 hours, so plan your runs accordingly.
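As a minimal sketch of that kind of extraction, the function below labels a part by its GM instrument family. It assumes the part is identified by its channel and the program number selected on it, both counted from 0 (which I believe matches mido's numbering). The family names follow the GM Level 1 instrument groups; `gm_family` is a hypothetical helper, not part of the script.

```python
# GM Level 1 groups its 128 programs into 16 families of 8 programs each.
GM_FAMILIES = [
    "Piano", "Chromatic Percussion", "Organ", "Guitar",
    "Bass", "Strings", "Ensemble", "Brass",
    "Reed", "Pipe", "Synth Lead", "Synth Pad",
    "Synth Effects", "Ethnic", "Percussive", "Sound Effects",
]

# GM reserves channel 10 for percussion; counted from 0 that is channel 9.
GM_DRUM_CHANNEL = 9


def gm_family(channel, program):
    """Return the GM family for a zero-based channel and program number."""
    if channel == GM_DRUM_CHANNEL:
        return "Drums"
    if not 0 <= program <= 127:
        raise ValueError("program must be in the range 0-127")
    return GM_FAMILIES[program // 8]


print(gm_family(9, 0))   # Drums
print(gm_family(0, 0))   # Piano
print(gm_family(1, 33))  # Bass
```

GS and XG data may also switch banks or drum channels, so this only covers plain GM.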
