The author has no background in audio signal processing or speech recognition, so this article is not recommended for professionals in the field (; ´ ・ ω ・) By the way, I plan to write this as a series: beginner, intermediate, and advanced.
At work, the idea of "Let's recommend music!" came up. My first thought was: is music recommendation a kind of speech recognition?
The answer is no. Speech recognition is the process by which a machine converts human speech into text, so music recommendation does not count as speech recognition. (The site I referred to explained this very clearly.) Music recommendation belongs to a research field called MIR, and audio signal processing appears to be at its core.
MIR is an abbreviation for Music Information Retrieval. Ordinary music search takes text such as an artist name or song title as input, whereas MIR takes the audio waveform itself as input.
Below are some specific examples of MIR tasks:
- Recommending music that suits the listener
- Instrument separation and instrument recognition
- Automatic transcription (no more transcribing by ear?)
- Automatic classification (genre labeling, etc.)
- Music generation, etc.
I tried the three above, and for a beginner in audio signal processing like me, librosa was easier to work with than SPTK. (Setting up the environment for SPTK was a hassle...) I also recommend it for people who want to study audio signal processing while studying machine learning in Python. (Of course, SPTK can also be called from Python.)
The introduction got long, but in this article I will introduce librosa.
(By the way, this article on building a similar-music search system with SPTK was excellent: http://aidiary.hatenablog.com/entry/20121014/1350211413)
I got quite flustered when "jupyter notebook" would not run while I was setting up the environment, so I will summarize the procedure here.
Procedure
1. Download resampy.
2. Install Microsoft Visual C++ Compiler for Python 2.7.
3. Open the Visual C++ 2008 64-bit Command Prompt and run the following commands in the resampy directory and then in the librosa directory:
python setup.py build
python setup.py install
4. In Python, run:
import librosa
If the import succeeds, the installation is OK.
Old environment: Python 2.7.11 (Anaconda2-4.0.7)
New environment: Python 2.7.12 (Anaconda2-4.2.0)
Here is a summary of what I looked up when I started with audio signal processing.
- Three elements of sound
  - Loudness: corresponds to the amplitude of the wave. The louder the sound, the larger the amplitude.
  - Pitch: corresponds to the frequency and period of the wave. The higher the pitch, the higher the frequency and the shorter the period.
  - Timbre: corresponds to the shape of the waveform.
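To make the three elements concrete, here is a minimal sketch using only the Python standard library. The helper names (`tone`, `peak`, `period_s`) and the chosen frequencies and amplitudes are assumptions for illustration, not something from librosa or from my actual experiments.

```python
import math

SR = 44100  # samples per second (assumed for this sketch)


def tone(freq_hz, amplitude, harmonics=(1.0,), duration_s=0.01):
    """Sum of sine partials; harmonics[k] scales the (k+1)-th harmonic."""
    n = int(round(SR * duration_s))
    return [
        amplitude * sum(
            weight * math.sin(2 * math.pi * freq_hz * (k + 1) * i / SR)
            for k, weight in enumerate(harmonics)
        )
        for i in range(n)
    ]


def peak(samples):
    """Largest absolute sample value, a rough stand-in for loudness."""
    return max(abs(s) for s in samples)


def period_s(freq_hz):
    """Length of one cycle in seconds."""
    return 1.0 / freq_hz


quiet = tone(440.0, 0.2)   # loudness: smaller amplitude
loud = tone(440.0, 0.8)    # loudness: larger amplitude
low = tone(220.0, 0.8)     # pitch: lower frequency, longer period
# Timbre: same fundamental and scale, but an added third harmonic
# changes the shape of the waveform.
bright = tone(440.0, 0.8, harmonics=(1.0, 0.0, 0.3))

print("peak quiet / loud:", peak(quiet), peak(loud))
print("period 220 Hz / 440 Hz:", period_s(220.0), period_s(440.0))
print("waveform shape differs:", bright != loud)
```

Only the amplitude changes between `quiet` and `loud`, only the frequency between `low` and `loud`, and only the harmonic content between `loud` and `bright`.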
- Sampling frequency (unit: Hz)
  - The number of samples taken per unit time
  - The sampling frequency used for music CDs is 44.1 kHz
- Number of frames (≈ amount of data)
- Number of channels: the number of signals output at the same time. 1 for mono, 2 for stereo.
- Quantization bit depth
  - How many bits are used for each sample when converting analog data to digital data
  - The larger the number, the larger the amount of data
  - It seems that 16 bits or more is often used for audio, 8 bits for telephone speech, and 8 to 10 bits for video signals.
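These parameters can be checked with the Python standard library `wave` module. The following is a minimal sketch that writes one second of a 440 Hz tone to an in-memory WAV and reads the header back; the tone, the constants, and the in-memory buffer are assumptions chosen so that no audio file is needed.

```python
import io
import math
import struct
import wave

SR = 44100          # sampling frequency in Hz (same as a music CD)
CHANNELS = 1        # 1 = mono, 2 = stereo
SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16-bit quantization
DURATION_S = 1.0

# Build one second of 16-bit PCM samples for a 440 Hz sine wave.
n_frames = int(SR * DURATION_S)
pcm = b"".join(
    struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * 440.0 * i / SR)))
    for i in range(n_frames)
)

# Write the WAV into memory instead of onto disk.
buf = io.BytesIO()
with wave.open(buf, "wb") as writer:
    writer.setnchannels(CHANNELS)
    writer.setsampwidth(SAMPLE_WIDTH)
    writer.setframerate(SR)
    writer.writeframes(pcm)

# Read the header back and compute the size of the raw audio data.
buf.seek(0)
with wave.open(buf, "rb") as reader:
    params = reader.getparams()
    data_bytes = reader.getnframes() * reader.getnchannels() * reader.getsampwidth()

print(params)
print("bytes of audio data:", data_bytes)
```

The data size is frames × channels × bytes per sample, which is why more channels or a larger bit depth increases the amount of data.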
librosa is a Python package for music analysis. Modules for MIR are provided.
What I did while referring to the librosa tutorial
- Visualize the waveform
  - Note: I tried this with librosa, but in the end I used the Python standard library wave.
- Beat tracker
- Audio playback
- Split the original audio into percussion / treble / chord components
- Collect training data (music) that is as unbiased as possible.
  - Reference URL: https://kodack64.gitbooks.io/toho_mir_ml/content/1-0.html
- Study a little more about audio analysis (Fourier transform, windowing, pre-emphasis filter, etc.)
- Plan for the intermediate article: acquire knowledge about music features and how to extract them
  - Chord progression, HVL, BPM, MBL, MSL, ASL, MFCC, local features (the so-called chorus or hook), etc.
- Plan for the advanced article: find the best features for searching for similar songs
  - Try training with combinations of features
- Build and evaluate a similar-music search system. (The evaluation method also needs some thought.)
- I stepped into the world of audio signal processing intending to use machine learning as my weapon, but I do not have enough knowledge yet, so I will study more.
- Personally, I found that my motivation to study went up considerably once the input data for machine learning became audio. That was actually the biggest discovery this time.
Thank you very much for reading. Please look forward to the next article!