An introduction to voice analysis for music apps

This is a summary of the Coursera course Audio Signal Processing for Music Applications, which I took. I think it will serve as a gateway when you want to go one step further than simply playing back a track.

Purpose of voice analysis

Roughly speaking, voice analysis clarifies what frequencies the sound you hear is composed of. This is called spectrum analysis. Once this is revealed, you can, for example, extract, filter, or transform specific frequency components.

As mentioned above, if the decomposition (analysis) succeeds, the reverse, synthesis, also becomes possible. In other words, music analysis as a whole is a process of **decomposition -> analysis -> (transformation / filtering) -> reconstruction** (you could call it Fullmetal Alchemist).

Expressing sound as a function

The first step in audio analysis is to express the sound as a function.

x[n] = A\cos(\omega nT + \phi) = A\cos(2\pi fnT + \phi)

You may remember from school that the graph of a trigonometric function resembles a sound wave; this is what we use to express sound as a function. The angular frequency $\omega$ is, as the name implies, the angle traveled per unit time (1 second), measured in radians. One full revolution, i.e. one cycle, is $2\pi$ radians, so dividing $\omega$ by $2\pi$ gives the number of cycles per second $f$, i.e. the frequency in Hz.

The point to note is $n$, the sample index in $x[n]$. Each increment of $n$ by 1 advances time by $T$ seconds; $T$ is the sampling interval, the reciprocal of the sampling frequency. The sampling frequency represents how often the sound is sampled, and the higher it is, the higher the frequencies that can be handled. By the sampling theorem, the sampling frequency must be at least twice the highest frequency you want to capture; since the human audible range goes up to about 20,000 Hz, CDs and the like adopt 44,100 Hz, slightly more than twice that.
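A minimal sketch of this sampling process in numpy; the amplitude, frequency, phase, and duration below are illustrative choices, not values from the course.

```python
import numpy as np

A, f, phi = 0.8, 1000.0, 0.0    # amplitude, frequency (Hz), initial phase
fs = 44100.0                    # sampling frequency (Hz), as on a CD
T = 1.0 / fs                    # sampling interval (s)
n = np.arange(int(0.01 * fs))   # sample indices covering 10 ms of sound
x = A * np.cos(2 * np.pi * f * n * T + phi)  # x[n] = A cos(2 pi f n T + phi)
```

Writing `x` to a sound file or playing it back yields exactly the hearing-test-like tone described below.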

The introduction has grown long, but the figure below plots this function with actual parameters. Heard as a sound, it is like the tone in a hearing test.

sineWave.PNG

CC by MTG (Basic mathematics - Sine wave plot)

However, this form is hard to work with as it is, so we use Euler's formula to convert it to an exponential representation.

\bar{x}[n] = Ae^{j(\omega nT + \phi)} = A\cos(\omega nT + \phi) + jA\sin(\omega nT + \phi)

Euler's formula relates the complex exponential to the trigonometric functions, $e^{j\phi} = \cos\phi + j\sin\phi$ (see the figure below), and applying it converts the trigonometric representation into a complex exponential one.

euler.PNG CC by MTG (Basic mathematics - Euler's formula)

The merit of the complex exponential notation is that the amplitude $A$ can be read off as the absolute value of the complex number, and the initial phase $\phi$ as its argument.

complex_representation.PNG CC by MTG (Basic mathematics - Complex numbers)
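A quick numerical check of this property (the amplitude and phase values are illustrative):

```python
import numpy as np

A, phi = 0.8, np.pi / 4          # illustrative amplitude and initial phase
omega = 2 * np.pi * 440.0        # angular frequency of a 440 Hz tone
T = 1.0 / 44100.0
n = 0                            # look at the very first sample

xbar = A * np.exp(1j * (omega * n * T + phi))  # complex representation
amplitude = np.abs(xbar)         # |z| recovers A
init_phase = np.angle(xbar)      # arg(z) recovers phi (at n = 0)
```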

Now that sound can be expressed as a function, I will explain how to decompose and reconstruct it.

Audio decomposition / reconstruction

Discrete Fourier Transform (DFT)

To state the conclusion first: the Fourier transform decomposes the audio, and the inverse Fourier transform reconstructs it. Since the time handled by a computer is discrete, as explained above, they are called the discrete Fourier transform and inverse discrete Fourier transform.

First, let's see how the Discrete Fourier Transform (DFT) that performs decomposition is defined.

X[k] = \sum^{N-1}_{n=0} x[n] e^{-j2\pi kn/N}

The essence of the Fourier transform is to examine the breakdown $X[k]$: how much of each frequency $k$ is contained in the $N$ samples $x[n]$. Note, however, that $k$ is not a frequency in Hz. The upper limit of the observable frequency is determined by the sampling frequency $f_s$, but with only $N$ samples the spectrum can be examined only in steps of $f_s/N$ (i.e. discretely). This $f_s/N$ is the unit of $k$, so the relationship between $k$ and frequency is $f_k = f_s k / N$. The point is that the fineness of the frequencies examined depends on the number of samples.

If you do not keep these points in mind, you will be confused by unit conversions. The $e^{-j2\pi kn/N}$ factor in the DFT has the effect of cancelling everything other than the desired frequency $k$ in $x[n]$ (see The Discrete Fourier Transform 2 of 2 and [generalizations of Euler's identity](http://en.wikipedia.org/wiki/Euler%27s_identity)), so multiplying $x[n]$ by it leaves only the $k$ component. Computing this for each $n$ and taking the sum gives the total $k$ component contained in the signal, i.e. how much of frequency $k$ the sound as a whole contains.
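To see this cancelling behaviour concretely, here is the DFT computed directly from its definition, applied to a sinusoid placed exactly on bin k = 5 (an illustrative choice):

```python
import numpy as np

N = 64
n = np.arange(N)
k0 = 5                                  # sinusoid placed exactly on bin 5
x = np.cos(2 * np.pi * k0 * n / N)

# DFT by the definition: X[k] = sum_n x[n] e^{-j 2 pi k n / N}
X = np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])
mag = np.abs(X)
# everything except bin k0 (and its mirror N - k0) cancels to ~0
```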

The figure below shows what was actually calculated and plotted.

DFT.PNG CC by MTG (The Discrete Fourier Transform 1 of 2)

The first row shows the actual sound; the second and third rows show the DFT result, the second row being the amplitude and the third the initial phase. In the second row, the vertical axis is the amplitude converted to decibels and the horizontal axis is Hz, showing which frequencies are present and how strongly (this is called the magnitude spectrum). In the third row, the vertical axis is the initial phase and the horizontal axis is Hz, indicating at what timing each frequency starts sounding (this is called the phase spectrum).

Here, I will summarize a few related matters.

magnitude_symmetry.PNG CC by MTG (The Discrete Fourier Transform 2 of 2, DFT of real sinusoid)

Since the spectrum of a real signal is symmetric and we don't need the negative side, we usually deal only with the non-negative half.

Parseval's theorem: the signal's energy is preserved between the time domain and the frequency domain.

\sum_{n=-N/2}^{N/2-1}|x[n]|^2 = \frac{1}{N}\sum_{k=-N/2}^{N/2-1}|X[k]|^2

The magnitude is usually plotted in decibels:

dB = 20\log_{10}(|X[k]|)
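A quick numerical check of the energy identity above (Parseval's theorem) and the decibel conversion, using numpy's FFT on an arbitrary test signal:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
x = rng.standard_normal(N)       # arbitrary real test signal
X = np.fft.fft(x)

time_energy = np.sum(np.abs(x) ** 2)
freq_energy = np.sum(np.abs(X) ** 2) / N             # note the 1/N factor
db = 20 * np.log10(np.abs(X) + np.finfo(float).eps)  # magnitude in decibels
```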

phase_unwrap.PNG CC by MTG (Fourier Transform properties 2 of 2, Phase unwrapping)

zero_padding.PNG CC by MTG (Fourier Transform properties 2 of 2, Zero-padding)

Performing the DFT and obtaining the magnitude and phase is the process corresponding to "sound decomposition".

Reconstruction (IDFT: Inverse Discrete Fourier Transform)

The Inverse Discrete Fourier Transform (IDFT) reconstructs the original sound from the magnitude and phase obtained by the DFT.

x[n] = \frac{1}{N}\sum^{N-1}_{k=0}X[k]e^{j2\pi kn/N}

This is the reverse operation of the DFT and resynthesizes the audio.
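A minimal round-trip sketch: compute the DFT by its definition, then apply the IDFT formula above and confirm the original samples come back (the test signal is an arbitrary choice):

```python
import numpy as np

N = 32
n = np.arange(N)
# arbitrary test signal: two sinusoids landing exactly on bins 3 and 7
x = np.cos(2 * np.pi * 3 * n / N) + 0.5 * np.cos(2 * np.pi * 7 * n / N)

# analysis (DFT): X[k] = sum_n x[n] e^{-j 2 pi k n / N}
k = np.arange(N)
X = np.array([np.sum(x * np.exp(-2j * np.pi * kk * n / N)) for kk in k])

# synthesis (IDFT): x[n] = (1/N) sum_k X[k] e^{j 2 pi k n / N}
x_rec = np.array([np.sum(X * np.exp(2j * np.pi * k * nn / N)) for nn in n]) / N
```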

Speeding up

The actual calculation uses the Fast Fourier Transform / Inverse Transform (FFT / IFFT), which is a faster version of DFT / IDFT. To use the FFT, do the following:

fft.PNG

The FFT / IFFT are implemented in scipy and can be used easily.
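A minimal usage sketch with scipy's `fft`/`ifft` (the 1 kHz tone and length 512 are arbitrary test values):

```python
import numpy as np
from scipy.fft import fft, ifft

fs = 44100.0
n = np.arange(512)
x = np.cos(2 * np.pi * 1000.0 * n / fs)   # a 1 kHz test tone

X = fft(x)               # decomposition, O(N log N) instead of O(N^2)
x_back = ifft(X).real    # reconstruction recovers the original samples
```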

The above explanation is quicker to read in code, so please refer to it as well.

MTG/sms-tools/software/models/dftModel.py

You should now be able to decompose and reconstruct audio. As it stands, however, we are simply reconstructing the original sound, so I would like to go a little further into how to analyze the decomposition results, and finally look at the parts related to transformation and filtering.

Analysis of speech decomposition results

Analysis of time series changes (STFT: Short-Time Fourier Transform)

A typical song is several minutes long, and if the whole thing is fed to the FFT, the sounds ringing at various timings in the song get mixed together, the magnitude and phase become cluttered, and the characteristics become hard to grasp. Therefore, the song is divided into fixed-length segments and the discrete Fourier transform is performed on each. Arranging the results lets you grasp how the magnitude and phase change over time. This technique is called the Short-Time Fourier Transform (STFT).

X_l[k] = \sum^{N/2-1}_{n=-N/2}w[n]x[n+lH]e^{-j2\pi kn/N}

The idea is to shift an analysis window of size N along the signal in hops of H (see the figure below).

stft_image.PNG CC by MTG (The Short-Time Fourier Transform (1 of 2))
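The shift-and-transform loop can be sketched as follows; the Hanning window of 512 samples, hop of 256, and 440 Hz test tone are assumptions of mine, not values from the course.

```python
import numpy as np

def stft_mag(x, w, H):
    """Magnitude spectra of N-sample frames taken every H samples."""
    N = len(w)
    frames = []
    for start in range(0, len(x) - N + 1, H):
        frame = x[start:start + N] * w            # window the excerpt
        frames.append(np.abs(np.fft.fft(frame)))  # DFT of one frame
    return np.array(frames)

fs = 8000
t = np.arange(fs) / fs                 # 1 second of samples
x = np.cos(2 * np.pi * 440.0 * t)      # steady 440 Hz test tone
w = np.hanning(512)                    # window function (see below)
S = stft_mag(x, w, H=256)              # one row per time frame
```

Each row of `S` is one frame's magnitude spectrum, so plotting `S` over time gives the familiar spectrogram view.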

$w$ is called a window function. The reason for applying it is that the DFT assumes the excerpt of size N is a "periodic signal". In reality it is not, so we multiply by a function that converges to 0 at both ends, making the excerpt behave like one period of a periodic signal. The function used for this purpose is the window function.

You may wonder whether it is really okay to apply such a thing, but on the contrary, if you do not apply a window function, the "periodic signal" assumption breaks down and the reconstructed sound will contain noise.

Concretely, window functions look like the following (the one shown is the simple rectangular window).

window_function.PNG CC by MTG (The Short-Time Fourier Transform (1 of 2), Analysis window)

A window function takes the value 1 at its peak and attenuates around it in a single-peaked shape. Taking its magnitude spectrum gives the figure above. The central peak is called the main lobe, and the peaks beside it are called side lobes. Since a window function should ideally pass only the target frequency, the narrower the main lobe, the higher the frequency resolution, and the lower the side lobes, the smaller the effect on non-target (quieter) sounds. Main-lobe narrowness and side-lobe lowness are a trade-off, so windows must be chosen to suit the situation. In general, choose one with low side lobes when both loud and soft sounds are present, and one with a narrow main lobe, prioritizing resolution, when the sounds are of similar volume (the former often uses blackman / blackman-harris, the latter hamming / hanning).
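The main-lobe/side-lobe trade-off can be inspected numerically. The sketch below uses `scipy.signal.get_window` identifiers; the side-lobe levels in the closing comment are the commonly cited textbook figures, not something this code asserts.

```python
import numpy as np
from scipy.signal import get_window

M = 511
spectra = {}
for name in ("boxcar", "hamming", "blackman", "blackmanharris"):
    w = get_window(name, M)
    W = np.abs(np.fft.fft(w, 4096))                     # zero-padded spectrum
    spectra[name] = 20 * np.log10(W / W.max() + 1e-12)  # normalize peak to 0 dB

# Commonly cited first side-lobe levels: rectangular (boxcar) around -13 dB,
# hamming around -43 dB, blackman around -58 dB; the price is a wider main lobe.
```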


Reference
Window function
Windowing
Understanding FFT Windows
Choosing a Windowing Function

Sinusoidal model

The sinusoidal model treats a complex sound as a combination of simple sinusoids of specific frequencies. In the model below, the sound at time n is represented by R sinusoids of specific frequencies.

y[n] = \sum^R_{r=1} A_r[n]cos(2\pi f_r [n]n)

If a sound can be fitted to this model, complex sounds can be reconstructed from simple ones. So how do you find out which frequencies a sound is made up of? The clue, once again, is the spectrum.

Spectral analysis shows that sounds of different frequencies are detected as peaks in the magnitude, as shown below.

image Week 5: Sinusoidal model, Theory lecture 1: Sinusoidal model 1

Therefore, it seems that the frequencies that make up the sound can be identified well by following the steps below.

  1. Create a magnitude spectrum so that peaks can be detected well
  2. Identify the location of the peak

First, step "1. Create a magnitude spectrum so that peaks can be detected well". The key here is the window size: it must be set so that the individual frequency components can be resolved within the window.

image Week 5: Sinusoidal model, Theory lecture 1: Sinusoidal model 1

As shown above, detectable frequencies come in units of $f_s/N$. For two components to be separated, this unit must be smaller than the difference between the two frequencies (otherwise they cannot be resolved), which first gives $\frac{f_s}{|f_{k+1} - f_k|}$ as a lower bound on the size. In addition, the main lobe of the window function spans a fixed number of bins, $B_s$. The appropriate window size can therefore be derived as $M \geq B_s \frac{f_s}{|f_{k+1} - f_k|}$.
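The bound can be computed directly. Note that $B_s = 4$ (the main-lobe width in bins for a Hamming window) is my assumption here; it gives a conservative lower bound, whereas the course example settles on its M empirically.

```python
import numpy as np

fs = 44100.0           # sampling frequency
f1, f2 = 440.0, 490.0  # the two partials we want to resolve
Bs = 4                 # assumed main-lobe width in bins (Hamming window)

# minimum window size so that each partial gets its own main lobe
M_min = int(np.ceil(Bs * fs / abs(f2 - f1)))  # -> 3528
```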

An example with the 440/490 Hz mixture mentioned above is as follows.

image Week 5: Sinusoidal model, Theory lecture 1: Sinusoidal model 1

Well, in practice you don't know in advance what frequencies the sound contains (440 and 490!), so sweep over roughly 100 to 2000 Hz and choose an M that works well across the entire range (like this: the horizontal axis is frequency and the vertical axis is k, with M = k * 100 + 1; since k = 21 is stable at every frequency, M = 2101 is used).
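Putting the two steps together, here is a minimal peak-detection sketch; the window size, dB threshold, and 440/490 Hz test signal are illustrative choices of mine, not the course's code.

```python
import numpy as np

fs, N = 44100, 4096
n = np.arange(N)
x = (np.cos(2 * np.pi * 440.0 * n / fs)
     + 0.5 * np.cos(2 * np.pi * 490.0 * n / fs))

# Step 1: windowed magnitude spectrum (in dB), positive half only
w = np.hamming(N)
mX = 20 * np.log10(np.abs(np.fft.fft(x * w))[:N // 2] + 1e-12)

# Step 2: local maxima above a threshold are sinusoid candidates
t = 40.0  # dB threshold, an illustrative choice
peaks = [k for k in range(1, len(mX) - 1)
         if mX[k] > t and mX[k] > mX[k - 1] and mX[k] >= mX[k + 1]]
freqs = [k * fs / N for k in peaks]   # convert bin index to Hz
```

The detected frequencies land on the bin grid (units of fs/N ≈ 10.8 Hz), so they sit near, not exactly at, 440 and 490 Hz; sms-tools refines them with parabolic interpolation.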

Harmonic model

Sinusoidal plus residual model (Stochastic model)

Conversion / filtering

Filtering

**Quoted from A3: Fourier Properties Part-4: Suppressing frequency components using DFT model**

Implementation

Conversion

Week8 - Sound transformations

Audio classification

Week9 - Sound and music description

To the world of speech analysis research

Week10 - Concluding topics
