This is a summary of the Coursera course Audio Signal Processing for Music Applications, which I took. I hope it serves as a gateway for anyone who wants to go one step beyond simply playing back a track.
Roughly speaking, audio analysis clarifies what frequencies the sound you hear is composed of. This is called spectral analysis. Once this is revealed, you will be able to:
As mentioned above, once the decomposition (analysis) succeeds, it becomes possible to create sound by running the process in reverse, i.e. synthesis. In other words, music analysis as a whole is a pipeline of **decomposition -> analysis -> (transformation / filtering) -> reconstruction** (you could call it Fullmetal Alchemist).
The first step in audio analysis is to express audio as a function.
You may remember that the graph of a trigonometric function resembles a sound wave. We use this to express sound as a function. The angular velocity $\omega$ is, as the name implies, the angle traversed per unit time (one second). The unit of angle is the radian (if you need a refresher on radians, see here). One full turn, that is, one cycle, is $2\pi$ radians, so dividing $\omega$ by $2\pi$ gives the number of cycles per second $f$, in Hz.
The point to note is $n$: since a computer handles time discretely, $n$ is not continuous time but a sample index.
The introduction has gotten long, but the figure below plots this function with actual parameter values. Heard as audio, it sounds like the tone used in a hearing test.
CC by MTG (Basic mathematics - Sine wave plot)
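In code, such a sinusoid can be generated directly from these parameters; the amplitude, frequency, phase, and sampling rate below are arbitrary example values, not those of the course figure:

```python
import numpy as np

# x[n] = A * cos(2*pi*f*(n/fs) + phi), sampled at fs Hz
A, f, phi = 0.8, 1000.0, 0.0   # amplitude, frequency in Hz, initial phase in radians
fs = 44100                     # sampling frequency in Hz
omega = 2 * np.pi * f          # angular velocity: f cycles/s * 2*pi radians/cycle
t = np.arange(fs) / fs         # one second of sample times n/fs
x = A * np.cos(omega * t + phi)
```

Writing such an array to a WAV file and playing it produces the hearing-test-like tone described above.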
However, this form is hard to work with, so we use Euler's formula to convert it to an exponential representation.
Euler's formula, $e^{j\phi} = \cos\phi + j\sin\phi$, relates the complex exponential to the trigonometric functions (see the figure below); applying it converts the trigonometric representation into a complex exponential one.
CC by MTG (Basic mathematics - Euler's formula)
The merit of the complex exponential notation is that the amplitude $A$ is simply the absolute value of the complex number, and the initial phase $\phi$ is simply its argument.
CC by MTG (Basic mathematics - Complex numbers)
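As a small check (the values are arbitrary), numpy recovers $A$ and $\phi$ directly from the complex number:

```python
import numpy as np

A, phi = 0.8, np.pi / 4              # arbitrary amplitude and initial phase
c = A * np.exp(1j * phi)             # the complex representation A * e^{j phi}

# Euler's formula: e^{j phi} = cos(phi) + j sin(phi)
assert np.isclose(c.real, A * np.cos(phi))
assert np.isclose(c.imag, A * np.sin(phi))

# amplitude = absolute value, initial phase = argument
assert np.isclose(np.abs(c), A)
assert np.isclose(np.angle(c), phi)
```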
The story so far can be summarized as follows.
Now that sound can be expressed as a function, let me explain how to decompose and reconstruct it.
In short, the Fourier transform performs the decomposition and the inverse Fourier transform performs the reconstruction. However, since the time a computer handles is discrete, as explained above, we use the discrete Fourier transform and its inverse.
First, let's see how the Discrete Fourier Transform (DFT), which performs the decomposition, is defined.
The essence of the Fourier transform is to examine the breakdown $X[k]$: how much of each frequency bin $k$ is contained in the $N$ samples $x[n]$. Note, however, that $k$ is not a frequency in Hz. The upper limit of the detectable frequency is determined by the sampling frequency $f_s$, and with only $N$ samples the spectrum is examined in steps of $f_s/N$ (i.e. discretely). Since $f_s/N$ is the unit of $k$, the relation between $k$ and frequency is $f_k = f_s k / N$. The point is that the frequency resolution depends on the number of samples.
If you do not keep these two points in mind, you will get confused by unit conversions. The $e^{-j2\pi kn/N}$ factor in the DFT cancels everything in $x[n]$ except the desired frequency bin $k$ (see The Discrete Fourier Transform 2 of 2 and [Euler's identity generalizations](http://en.wikipedia.org/wiki/Euler's_identity)), so multiplying $x[n]$ by it extracts only the $k$ component. Computing this for each $n$ and summing gives the total contribution of bin $k$ across the whole signal, that is, how much of frequency $k$ the sound contains.
The figure below shows the result of actually computing and plotting this.
CC by MTG (The Discrete Fourier Transform 1 of 2)
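For reference, the DFT sum can be written out exactly as defined and checked against numpy's optimized FFT; a minimal sketch using an arbitrary 1 kHz test tone:

```python
import numpy as np

def dft(x):
    """X[k] = sum_n x[n] * e^{-j 2 pi k n / N}, computed directly from the definition."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

fs, N = 44100, 512
x = np.cos(2 * np.pi * 1000.0 * np.arange(N) / fs)   # arbitrary 1 kHz test tone
X = dft(x)
assert np.allclose(X, np.fft.fft(x))   # the direct sum matches the optimized FFT

# bin k corresponds to frequency f_k = fs * k / N (here fs/N is about 86 Hz)
k_peak = np.argmax(np.abs(X[:N // 2]))
f_peak = k_peak * fs / N               # lands within one bin of 1000 Hz
```

Note how the detected peak is only accurate to the nearest bin: with $N = 512$ samples the bins are about 86 Hz apart, illustrating the resolution point above.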
The first row shows the actual sound; the second and third rows show the DFT results, the second being the amplitude and the third the initial phase. In the second row, the vertical axis is the amplitude converted to decibels and the horizontal axis is Hz, showing how strong each pitch is (this is called the magnitude spectrum). In the third row, the vertical axis is the initial phase and the horizontal axis is Hz, indicating at what timing each pitch starts to sound (this is called the phase spectrum).
Here, I will summarize a few related matters.
CC by MTG (The Discrete Fourier Transform 2 of 2 - DFT of real sinusoid)
Since the negative-frequency side is not needed, we usually deal only with the non-negative part (the positive half).
CC by MTG (Fourier Transform properties 2 of 2 - Phase unwrapping)
CC by MTG (Fourier Transform properties 2 of 2 - Zero-padding)
Performing the DFT to obtain the magnitude and phase is the process that corresponds to "decomposing the sound".
The Inverse Discrete Fourier Transform (IDFT) reconstructs the original sound from the magnitude and phase obtained by the DFT. It is the reverse operation of the DFT and resynthesizes the audio.
In practice, the Fast Fourier Transform and its inverse (FFT/IFFT), faster implementations of the DFT/IDFT, are used. To use the FFT, do the following:
The FFT/IFFT are implemented in scipy and can be used easily from there.
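A minimal round-trip sketch using scipy (the input here is just random noise standing in for one frame of audio):

```python
import numpy as np
from scipy.fft import fft, ifft

rng = np.random.default_rng(0)
x = rng.standard_normal(512)     # stand-in for one frame of audio samples

X = fft(x)                       # decomposition: complex spectrum
mX = np.abs(X)                   # magnitude
pX = np.angle(X)                 # phase

# reconstruction: rebuild the complex spectrum from magnitude and phase, then invert
y = ifft(mX * np.exp(1j * pX)).real
assert np.allclose(x, y)         # the original samples come back
```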
The explanation above is quicker to grasp in code, so please refer to the following as well.
MTG/sms-tools/software/models/dftModel.py
You should now be able to decompose and reconstruct audio. As it stands, however, we are simply reconstructing the original sound, so from here I would like to go a little further into how to analyze the decomposition results, and then finally look at transformation and filtering.
A typical song is several minutes long; applying the FFT to the whole thing mixes together sounds ringing at various timings, so the magnitude and phase become cluttered and the features become hard to grasp. Therefore the song is divided into fixed-length segments, and a discrete Fourier transform is performed on each. By lining these up, you can follow how the magnitude and phase change over time. This technique is called the Short-Time Fourier Transform (STFT).
The idea is to slide an analysis window of size N along the signal in hops of H (see the figure below).
CC by MTG (The Short-Time Fourier Transform (1 of 2))
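The frame-by-frame procedure can be sketched as follows; the frame size N, hop H, window choice, and 440 Hz test tone are arbitrary illustration values:

```python
import numpy as np
from scipy.signal import get_window

def stft_frames(x, w, N, H):
    """Slide a window of size N over x in hops of H and take the DFT of each frame."""
    frames = []
    for start in range(0, len(x) - N + 1, H):
        frame = x[start:start + N] * w         # cut out one segment and apply the window
        frames.append(np.fft.rfft(frame, N))   # positive-frequency spectrum of the frame
    return np.array(frames)                    # shape: (num_frames, N//2 + 1)

fs, N, H = 44100, 1024, 256
w = get_window('hann', N)
x = np.sin(2 * np.pi * 440.0 * np.arange(fs) / fs)  # one second of A4
X = stft_frames(x, w, N, H)
```

Each row of `X` is one spectrum snapshot; stacking their magnitudes over time gives the familiar spectrogram picture.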
$w$ is called a window function. The reason for applying it is that the DFT assumes the cut-out segment of size N is a "periodic signal". In reality this is not the case, so we multiply the segment by a function that tapers to 0 at the edges to make it look like a periodic signal. A function used for this purpose is called a window function.
You may wonder whether it is okay to multiply by such a thing, but on the contrary, if you do not apply a window function the "periodic signal" assumption breaks down, and the reconstructed sound will contain noise.
Specifically, a window function looks like the following (the one below is a simple rectangular window).
CC by MTG (The Short-Time Fourier Transform (1 of 2) - Analysis window)
It takes the value 1 at its peak and has a single-peaked shape that attenuates toward the edges. Taking the magnitude of its spectrum gives the figure above. The central peak is called the main lobe, and the peaks beside it are called side lobes. Since a window function should ideally pass only the target frequency, the narrower the main lobe the higher the frequency resolution, and the lower the side lobes the smaller the influence on other (quieter) sounds. There is a trade-off between a narrow main lobe and low side lobes, so you need to choose according to the situation. In general, choose a window with low side lobes when both loud and soft sounds are present, and choose one with a narrow main lobe, prioritizing resolution, when the sounds are at similar volumes (the former often uses blackman / blackman-harris, the latter hamming / hanning).
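This trade-off can be checked numerically. The sketch below estimates each window's highest side-lobe level relative to its main lobe (the window length and FFT size are arbitrary choices):

```python
import numpy as np
from scipy.signal import get_window

def peak_sidelobe_db(name, M=64, nfft=4096):
    """Highest side-lobe level of a window, in dB relative to its main lobe."""
    w = get_window(name, M)
    W = np.abs(np.fft.rfft(w, nfft))                 # zero-padded spectrum of the window
    W_db = 20 * np.log10(np.maximum(W / W.max(), 1e-12))
    i = 1
    while i < len(W_db) - 1 and W_db[i] <= W_db[i - 1]:
        i += 1                      # walk down the main lobe to its first minimum
    return W_db[i:].max()           # the tallest remaining peak is the side-lobe level

# rectangular has the narrowest main lobe but the highest side lobes;
# blackman trades a wider main lobe for much lower side lobes
for name in ('boxcar', 'hamming', 'blackman'):
    print(name, peak_sidelobe_db(name))
```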
Reference
Window function Windowing
Understanding FFT Windows Choosing a Windowing Function
Sinusoidal model
The sinusoidal model treats a complex sound as a sum of simple sounds (sinusoids) at specific frequencies. In the model below, the sound at time $n$ is represented by $R$ sinusoids at specific frequencies.
If a sound fits this model, complex sounds can be reconstructed from simple ones. So how do you find out which frequencies a sound is made of? The clue, again, is the spectrum.
Spectral analysis reveals sounds of different frequencies as peaks in the magnitude, as shown below.
Week 5: Sinusoidal model, Theory lecture 1: Sinusoidal model 1
Therefore, the frequencies that make up a sound can be identified by following the steps below.
First, step 1: "create a magnitude spectrum in which the peaks can be detected cleanly". The key here is the window size, which must be chosen so that the frequency components in question can be separated within the window.
Week 5: Sinusoidal model, Theory lecture 1: Sinusoidal model 1
As shown above, the detectable frequencies come in units of $f_s/N$. Therefore this must be smaller than the difference between the two frequencies to be detected (otherwise the two frequencies cannot be resolved).
First of all, here is an example of a mixture of the 440 Hz and 490 Hz tones mentioned above.
Week 5: Sinusoidal model, Theory lecture 1: Sinusoidal model 1
In practice, of course, we don't know in advance what frequencies the sound contains (440 and 490!), so we sweep roughly 100 to 2000 Hz and choose an M that works well across that entire range (like this: the horizontal axis is frequency and the vertical axis is k, with M = k * 100 + 1; since k = 21 is stable at every frequency, M = 2101 is used).
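To make this concrete, here is a sketch that resolves a 440 Hz + 490 Hz mixture with a Hamming window; the window size M, FFT size N, and the -20 dB peak threshold are my own illustrative choices, not the M = 2101 from the analysis above:

```python
import numpy as np
from scipy.signal import get_window, find_peaks

fs, M, N = 44100, 8001, 16384
n = np.arange(M)
x = np.sin(2 * np.pi * 440.0 * n / fs) + np.sin(2 * np.pi * 490.0 * n / fs)

w = get_window('hamming', M)
X = np.fft.rfft(x * w, N)                         # zero-padded to N for a smoother spectrum
mX = 20 * np.log10(np.maximum(np.abs(X), 1e-12))  # magnitude spectrum in dB

# local maxima within 20 dB of the strongest bin are treated as sinusoid peaks
peaks, _ = find_peaks(mX, height=mX.max() - 20)
freqs = peaks * fs / N                            # bin index -> Hz (f_k = fs * k / N)
print(freqs)                                      # two peaks, near 440 and 490
```

With this M, the Hamming main lobe (about 4 bins of $f_s/M$, roughly 22 Hz here) is narrower than the 50 Hz separation, so the two peaks come out cleanly; halve M a couple of times and they merge into one.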
Harmonic model

Sinusoidal plus residual model (Stochastic model)
**Quoted from A3: Fourier Properties Part-4: Suppressing frequency components using DFT model**
Week8 - Sound transformations
Week9 - Sound and music description
Week10 - Concluding topics