Introduction

I have no background in audio signal processing or speech recognition, so this article is not recommended for professionals in the field (;´･ω･) By the way, I plan to write this as a series: beginner, intermediate, and advanced.

Motivation

At work, the topic of "Let's recommend music!" came up.

Is music recommendation a kind of speech recognition?

The answer is no. Speech recognition is the process by which a machine converts human speech into text, so music recommendation is not speech recognition. (This site was very easy to understand.) Music recommendation seems to belong to a research field called MIR, with audio signal processing at its core.

What is MIR

MIR is an abbreviation for Music Information Retrieval. Ordinary music search takes text such as an artist name or song title as input, whereas MIR takes the audio waveform itself as input.

Below are some specific examples of MIR tasks:

--Recommending music that suits the listener
--Instrument separation and instrument recognition
--Automatic transcription (no more transcribing by ear?)
--Automatic classification (genre labeling, etc.)
--Music generation, etc.

Convenient tools and libraries for audio signal processing

SPTK -- Provides commands for speech analysis such as resampling and the Fourier transform. It seems to be quite well known in speech signal processing and speech recognition.
librosa -- Python package for music analysis. Released in 2015.
SoX -- Audio file format conversion software
LAME -- MP3 encoder, used here for audio file format conversion

I tried the tools above, and for me, a beginner in audio signal processing, librosa was easier to work with than SPTK. (Setting up the environment for SPTK was troublesome...) I also recommend librosa to people who want to study audio signal processing while studying machine learning with Python. (Of course, SPTK can also be used from Python.)

The introduction has become long, but that is why this article introduces librosa.

(By the way, this article about building a similar-music search system with SPTK is excellent: http://aidiary.hatenablog.com/entry/20121014/1350211413)

Installing librosa

I got quite flustered when "jupyter notebook" stopped working while I was setting up the environment, so I will summarize the procedure here.

I am a Windows user. I also wanted to work in an Anaconda environment, so if you use plain Python, I think you can start from step 2.
pip install didn't work (probably because there was no C++ compiler).

Procedure

1. Reinstall Anaconda (probably not needed on Mac or Linux; I think it is also not needed on Windows if you have the latest versions of Anaconda and Python)
2. Download resampy
3. Download librosa
4. Install Microsoft Visual C++ Compiler for Python 2.7
5. Open the Visual C++ 2008 64-bit Command Prompt and run the following commands in both the resampy and librosa directories

    python setup.py build
    python setup.py install

Then, in Python, run

    import librosa

If the import succeeds without an error, the installation is OK.

Old environment: Python 2.7.11, Anaconda2-4.0.7
New environment: Python 2.7.12, Anaconda2-4.2.0

Before touching librosa

Here is a summary of what I looked up when I started with audio signal processing.

--Three elements of sound
  --Loudness: Corresponds to the amplitude of the wave. The louder the sound, the larger the amplitude.
  --Pitch: Corresponds to the frequency and period of the wave. The higher the sound, the higher the frequency and the shorter the period.
  --Timbre: Corresponds to the shape of the wave.
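
To make the relation between loudness and amplitude, and between pitch and frequency, concrete, here is a minimal sketch using only the Python standard library. The function names `sine_wave` and `count_cycles` and the 8000 Hz sampling rate are my own choices for illustration and do not come from librosa or the original notebook. It is written for Python 3.

```python
import math


def sine_wave(freq_hz, amplitude, duration_s, sample_rate=8000):
    """Return the samples of a pure sine tone as a list of floats."""
    n_samples = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def count_cycles(samples):
    """Estimate the number of cycles by counting upward zero crossings."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)


# A quiet, low tone and a loud, high tone, one second each.
quiet_low = sine_wave(freq_hz=220, amplitude=0.2, duration_s=1.0)
loud_high = sine_wave(freq_hz=440, amplitude=0.8, duration_s=1.0)

# Loudness: the louder tone has the larger peak amplitude.
print("peak amplitude:", max(quiet_low), max(loud_high))

# Pitch: the higher tone completes about twice as many cycles per second,
# so its period (1 / frequency) is about half as long.
print("cycles in 1 s:", count_cycles(quiet_low), count_cycles(loud_high))
```

Timbre is not covered here: both tones are pure sines, so they share the same wave shape and differ only in amplitude and frequency.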

--Sampling frequency (unit: Hz)
  --The number of samples taken per unit time
  --The sampling frequency used for music CDs is 44.1 kHz
--Number of frames (≈ amount of data)
--Number of channels: the number of audio signals when different data are output at the same time. 1 for monaural, 2 for stereo.
--Quantization bit depth
  --The number of bits used to represent each sample when converting analog data to digital data
  --The larger the number, the larger the amount of data
  --It seems that 16 bits or more is common for audio, 8 bits for telephone speech, and 8-10 bits for video signals.
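
These parameters can all be read from the header of a WAV file with Python's standard `wave` module. The sketch below writes a short test tone to a temporary file first, so it does not depend on any particular audio file. The helper names `write_tone` and `describe` and the dictionary keys are hypothetical, chosen only for this example. It is written for Python 3.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 44100  # sampling frequency in Hz (same as a music CD)
N_CHANNELS = 1       # 1 = monaural, 2 = stereo
SAMPLE_WIDTH = 2     # bytes per sample; 2 bytes = 16-bit quantization
DURATION_S = 0.5


def write_tone(path, freq_hz=440.0):
    """Write a short 16-bit mono sine tone to `path`."""
    n_frames = int(SAMPLE_RATE * DURATION_S)
    peak = 2 ** (8 * SAMPLE_WIDTH - 1) - 1  # 32767 for 16-bit audio
    samples = [int(0.5 * peak * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
               for n in range(n_frames)]
    with wave.open(path, "wb") as wf:
        wf.setnchannels(N_CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        # "<h" = little-endian signed 16-bit integers, one per sample
        wf.writeframes(struct.pack("<%dh" % n_frames, *samples))


def describe(path):
    """Read the header of a WAV file and return its basic parameters."""
    with wave.open(path, "rb") as wf:
        info = {
            "sampling_frequency_hz": wf.getframerate(),
            "n_frames": wf.getnframes(),
            "n_channels": wf.getnchannels(),
            "bits_per_sample": 8 * wf.getsampwidth(),
        }
    # Duration and raw data size follow directly from the four values above.
    info["duration_s"] = info["n_frames"] / info["sampling_frequency_hz"]
    info["data_bytes"] = (info["n_frames"] * info["n_channels"]
                          * info["bits_per_sample"] // 8)
    return info


tmp_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_tone(tmp_path)
info = describe(tmp_path)
print(info)
```

The last two entries show why the number of frames only approximates the amount of data: the size in bytes also scales with the number of channels and the quantization bit depth.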

Finally the main subject

librosa is a Python package for music analysis. Modules for MIR are provided.

What I did while following the librosa tutorial:

--Visualize the waveform
  --Note: I tried this with librosa, but in the end I used wave from the Python standard library.
--Beat tracker
--Audio playback
--Split the original audio into percussion / treble / chords

I will share the ipynb later.

Next steps

--Collect "training data (music) that is as unbiased as possible".
  --Reference URL: https://kodack64.gitbooks.io/toho_mir_ml/content/1-0.html
--Study a little more about audio analysis (Fourier transform, windowing, pre-emphasis filter, etc.)
--Intermediate plan: Acquire knowledge about music features and how to extract them
  --Chord progression, HVL, BPM, MBL, MSL, ASL, MFCC, local features (the so-called chorus, or "sabi"), etc.
--Advanced plan: Find the best features for searching for similar songs
  --Try training on combinations of features
--Build and evaluate a similar-music search system. (I also have to think about the evaluation method.)

Impressions

--I tried to dig into the world of audio signal processing armed with the weapon called machine learning, but I do not have enough knowledge yet, so I will study more.
--Personally, I found that my motivation to study rose considerably once the input data for machine learning became audio. That was actually the biggest discovery this time.

Reference URL summary

http://recognition.web.fc2.com/
http://hhsprings.pinoko.jp/site-hhs/2015/02/microsoft-visual-c-compiler-for-python-2-7%E3%81%AF%E3%81%B2%E3%81%A8%E3%81%AE%E3%81%9F%E3%82%81%E3%81%AA%E3%82%89%E3%81%9A/
http://np2lkoo.hatenablog.com/entry/2016/09/22/052354

Thank you very much for reading. Please look forward to the next article!

Try audio signal processing with librosa-Beginner