This is a summary of the Coursera course Audio Signal Processing for Music Applications, which I took. I hope it serves as a gateway for anyone who wants to go one step beyond simply playing back a track.
Roughly speaking, audio analysis clarifies what frequencies the sound you hear is composed of. This is called spectral analysis. Once this is revealed, you will be able to:
As mentioned above, once the decomposition (analysis) succeeds, it becomes possible to create sound by running the process in reverse, i.e. synthesis. In other words, music analysis as a whole is a pipeline of **decomposition -> analysis -> (transformation / filtering) -> reconstruction** (you could call it Fullmetal Alchemist).
The first step in audio analysis is to express audio as a function.
You may remember that the graph of a trigonometric function resembles a sound wave. We use this to express sound as a function. The angular velocity $\omega$ is, as the name implies, the angle traversed per unit time (one second). The unit of angle is the radian (if you need a refresher on radians, see here). One full turn, that is, one cycle, is $2\pi$ radians, so dividing $\omega$ by $2\pi$ gives the number of cycles per second $f$, in Hz.
The point to note is $n$: since a computer handles time discretely, $n$ is not continuous time but a sample index.
The introduction has gotten long, but the figure below plots this function with actual parameter values. Heard as audio, it sounds like the tone used in a hearing test.
CC by MTG (Basic mathematics - Sine wave plot)
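In code, such a sinusoid can be generated directly from these parameters; the amplitude, frequency, phase, and sampling rate below are arbitrary example values, not those of the course figure:

```python
import numpy as np

# x[n] = A * cos(2*pi*f*(n/fs) + phi), sampled at fs Hz
A, f, phi = 0.8, 1000.0, 0.0   # amplitude, frequency in Hz, initial phase in radians
fs = 44100                     # sampling frequency in Hz
omega = 2 * np.pi * f          # angular velocity: f cycles/s * 2*pi radians/cycle
t = np.arange(fs) / fs         # one second of sample times n/fs
x = A * np.cos(omega * t + phi)
```

Writing such an array to a WAV file and playing it produces the hearing-test-like tone described above.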
However, this form is hard to work with, so we use Euler's formula to convert it to an exponential representation.
Euler's formula, $e^{j\phi} = \cos\phi + j\sin\phi$, relates the complex exponential to the trigonometric functions (see the figure below); applying it converts the trigonometric representation into a complex exponential one.
CC by MTG (Basic mathematics - Euler's formula)
The merit of the complex exponential notation is that the amplitude $A$ is simply the absolute value of the complex number, and the initial phase $\phi$ is simply its argument.
CC by MTG (Basic mathematics - Complex numbers)
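As a small check (the values are arbitrary), numpy recovers $A$ and $\phi$ directly from the complex number:

```python
import numpy as np

A, phi = 0.8, np.pi / 4              # arbitrary amplitude and initial phase
c = A * np.exp(1j * phi)             # the complex representation A * e^{j phi}

# Euler's formula: e^{j phi} = cos(phi) + j sin(phi)
assert np.isclose(c.real, A * np.cos(phi))
assert np.isclose(c.imag, A * np.sin(phi))

# amplitude = absolute value, initial phase = argument
assert np.isclose(np.abs(c), A)
assert np.isclose(np.angle(c), phi)
```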
The story so far can be summarized as follows.
Now that sound can be expressed as a function, let me explain how to decompose and reconstruct it.
In short, the Fourier transform performs the decomposition and the inverse Fourier transform performs the reconstruction. However, since the time a computer handles is discrete, as explained above, we use the discrete Fourier transform and its inverse.
First, let's see how the Discrete Fourier Transform (DFT), which performs the decomposition, is defined.
The essence of the Fourier transform is to examine the breakdown $X[k]$: how much of each frequency bin $k$ is contained in the $N$ samples $x[n]$. Note, however, that $k$ is not a frequency in Hz. The upper limit of the detectable frequency is determined by the sampling frequency $f_s$, and with only $N$ samples the spectrum is examined in steps of $f_s/N$ (i.e. discretely). Since $f_s/N$ is the unit of $k$, the relation between $k$ and frequency is $f_k = f_s k / N$. The point is that the frequency resolution depends on the number of samples.
If you do not keep these two points in mind, you will get confused by unit conversions. The $e^{-j2\pi kn/N}$ factor in the DFT cancels everything in $x[n]$ except the desired frequency bin $k$ (see The Discrete Fourier Transform 2 of 2 and [Euler's identity generalizations](http://en.wikipedia.org/wiki/Euler's_identity)), so multiplying $x[n]$ by it extracts only the $k$ component. Computing this for each $n$ and summing gives the total contribution of bin $k$ across the whole signal, that is, how much of frequency $k$ the sound contains.
The figure below shows the result of actually computing and plotting this.
CC by MTG (The Discrete Fourier Transform 1 of 2)
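For reference, the DFT sum can be written out exactly as defined and checked against numpy's optimized FFT; a minimal sketch using an arbitrary 1 kHz test tone:

```python
import numpy as np

def dft(x):
    """X[k] = sum_n x[n] * e^{-j 2 pi k n / N}, computed directly from the definition."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

fs, N = 44100, 512
x = np.cos(2 * np.pi * 1000.0 * np.arange(N) / fs)   # arbitrary 1 kHz test tone
X = dft(x)
assert np.allclose(X, np.fft.fft(x))   # the direct sum matches the optimized FFT

# bin k corresponds to frequency f_k = fs * k / N (here fs/N is about 86 Hz)
k_peak = np.argmax(np.abs(X[:N // 2]))
f_peak = k_peak * fs / N               # lands within one bin of 1000 Hz
```

Note how the detected peak is only accurate to the nearest bin: with $N = 512$ samples the bins are about 86 Hz apart, illustrating the resolution point above.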
The first row shows the actual sound; the second and third rows show the DFT results, the second being the amplitude and the third the initial phase. In the second row, the vertical axis is the amplitude converted to decibels and the horizontal axis is Hz, showing how strong each pitch is (this is called the magnitude spectrum). In the third row, the vertical axis is the initial phase and the horizontal axis is Hz, indicating at what timing each pitch starts to sound (this is called the phase spectrum).
Here, I will summarize a few related matters.
CC by MTG (The Discrete Fourier Transform 2 of 2 - DFT of real sinusoid)
Since the negative-frequency side is not needed, we usually deal only with the non-negative part (the positive half).
CC by MTG (Fourier Transform properties 2 of 2 - Phase unwrapping)
CC by MTG (Fourier Transform properties 2 of 2 - Zero-padding)
Performing the DFT to obtain the magnitude and phase is the process that corresponds to "decomposing the sound".
The Inverse Discrete Fourier Transform (IDFT) reconstructs the original sound from the magnitude and phase obtained by the DFT. It is the reverse operation of the DFT and resynthesizes the audio.
In practice, the Fast Fourier Transform and its inverse (FFT/IFFT), faster implementations of the DFT/IDFT, are used. To use the FFT, do the following:
The FFT/IFFT are implemented in scipy and can be used easily from there.
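A minimal round-trip sketch using scipy (the input here is just random noise standing in for one frame of audio):

```python
import numpy as np
from scipy.fft import fft, ifft

rng = np.random.default_rng(0)
x = rng.standard_normal(512)     # stand-in for one frame of audio samples

X = fft(x)                       # decomposition: complex spectrum
mX = np.abs(X)                   # magnitude
pX = np.angle(X)                 # phase

# reconstruction: rebuild the complex spectrum from magnitude and phase, then invert
y = ifft(mX * np.exp(1j * pX)).real
assert np.allclose(x, y)         # the original samples come back
```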
The explanation above is quicker to grasp in code, so please refer to the following as well.
MTG/sms-tools/software/models/dftModel.py
You should now be able to decompose and reconstruct audio. As it stands, however, we are simply reconstructing the original sound, so from here I would like to go a little further into how to analyze the decomposition results, and then finally look at transformation and filtering.
A typical song is several minutes long; applying the FFT to the whole thing mixes together sounds ringing at various timings, so the magnitude and phase become cluttered and the features become hard to grasp. Therefore the song is divided into fixed-length segments, and a discrete Fourier transform is performed on each. By lining these up, you can follow how the magnitude and phase change over time. This technique is called the Short-Time Fourier Transform (STFT).
The idea is to slide an analysis window of size N along the signal in hops of H (see the figure below).
CC by MTG (The Short-Time Fourier Transform (1 of 2))
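The frame-by-frame procedure can be sketched as follows; the frame size N, hop H, window choice, and 440 Hz test tone are arbitrary illustration values:

```python
import numpy as np
from scipy.signal import get_window

def stft_frames(x, w, N, H):
    """Slide a window of size N over x in hops of H and take the DFT of each frame."""
    frames = []
    for start in range(0, len(x) - N + 1, H):
        frame = x[start:start + N] * w         # cut out one segment and apply the window
        frames.append(np.fft.rfft(frame, N))   # positive-frequency spectrum of the frame
    return np.array(frames)                    # shape: (num_frames, N//2 + 1)

fs, N, H = 44100, 1024, 256
w = get_window('hann', N)
x = np.sin(2 * np.pi * 440.0 * np.arange(fs) / fs)  # one second of A4
X = stft_frames(x, w, N, H)
```

Each row of `X` is one spectrum snapshot; stacking their magnitudes over time gives the familiar spectrogram picture.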
$w$ is called a window function. The reason for applying it is that the DFT assumes the cut-out segment of size N is a "periodic signal". In reality this is not the case, so we multiply the segment by a function that tapers to 0 at the edges to make it look like a periodic signal. A function used for this purpose is called a window function.
You may wonder whether it is okay to multiply by such a thing, but on the contrary, if you do not apply a window function the "periodic signal" assumption breaks down, and the reconstructed sound will contain noise.
Specifically, a window function looks like the following (the one below is a simple rectangular window).
CC by MTG (The Short-Time Fourier Transform (1 of 2) - Analysis window)
It takes the value 1 at its peak and has a single-peaked shape that attenuates toward the edges. Taking the magnitude of its spectrum gives the figure above. The central peak is called the main lobe, and the peaks beside it are called side lobes. Since a window function should ideally pass only the target frequency, the narrower the main lobe the higher the frequency resolution, and the lower the side lobes the smaller the influence on other (quieter) sounds. There is a trade-off between a narrow main lobe and low side lobes, so you need to choose according to the situation. In general, choose a window with low side lobes when both loud and soft sounds are present, and choose one with a narrow main lobe, prioritizing resolution, when the sounds are at similar volumes (the former often uses blackman / blackman-harris, the latter hamming / hanning).
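This trade-off can be checked numerically. The sketch below estimates each window's highest side-lobe level relative to its main lobe (the window length and FFT size are arbitrary choices):

```python
import numpy as np
from scipy.signal import get_window

def peak_sidelobe_db(name, M=64, nfft=4096):
    """Highest side-lobe level of a window, in dB relative to its main lobe."""
    w = get_window(name, M)
    W = np.abs(np.fft.rfft(w, nfft))                 # zero-padded spectrum of the window
    W_db = 20 * np.log10(np.maximum(W / W.max(), 1e-12))
    i = 1
    while i < len(W_db) - 1 and W_db[i] <= W_db[i - 1]:
        i += 1                      # walk down the main lobe to its first minimum
    return W_db[i:].max()           # the tallest remaining peak is the side-lobe level

# rectangular has the narrowest main lobe but the highest side lobes;
# blackman trades a wider main lobe for much lower side lobes
for name in ('boxcar', 'hamming', 'blackman'):
    print(name, peak_sidelobe_db(name))
```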
Reference
Window function Windowing
Understanding FFT Windows Choosing a Windowing Function
Sinusoidal model
The sinusoidal model treats a complex sound as a sum of simple sounds (sinusoids) at specific frequencies. In the model below, the sound at time $n$ is represented by $R$ sinusoids at specific frequencies.
If a sound fits this model, complex sounds can be reconstructed from simple ones. So how do you find out which frequencies a sound is made of? The clue, again, is the spectrum.
Spectral analysis reveals sounds of different frequencies as peaks in the magnitude, as shown below.
Week 5: Sinusoidal model, Theory lecture 1: Sinusoidal model 1
Therefore, the frequencies that make up a sound can be identified by following the steps below.
First, step 1: "create a magnitude spectrum in which the peaks can be detected cleanly". The key here is the window size, which must be chosen so that the frequency components in question can be separated within the window.
Week 5: Sinusoidal model, Theory lecture 1: Sinusoidal model 1
As shown above, the detectable frequencies come in units of $f_s/N$. Therefore this must be smaller than the difference between the two frequencies to be detected (otherwise the two frequencies cannot be resolved).
First of all, here is an example of a mixture of the 440 Hz and 490 Hz tones mentioned above.
Week 5: Sinusoidal model, Theory lecture 1: Sinusoidal model 1
In practice, of course, we don't know in advance what frequencies the sound contains (440 and 490!), so we sweep roughly 100 to 2000 Hz and choose an M that works well across that entire range (like this: the horizontal axis is frequency and the vertical axis is k, with M = k * 100 + 1; since k = 21 is stable at every frequency, M = 2101 is used).
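To make this concrete, here is a sketch that resolves a 440 Hz + 490 Hz mixture with a Hamming window; the window size M, FFT size N, and the -20 dB peak threshold are my own illustrative choices, not the M = 2101 from the analysis above:

```python
import numpy as np
from scipy.signal import get_window, find_peaks

fs, M, N = 44100, 8001, 16384
n = np.arange(M)
x = np.sin(2 * np.pi * 440.0 * n / fs) + np.sin(2 * np.pi * 490.0 * n / fs)

w = get_window('hamming', M)
X = np.fft.rfft(x * w, N)                         # zero-padded to N for a smoother spectrum
mX = 20 * np.log10(np.maximum(np.abs(X), 1e-12))  # magnitude spectrum in dB

# local maxima within 20 dB of the strongest bin are treated as sinusoid peaks
peaks, _ = find_peaks(mX, height=mX.max() - 20)
freqs = peaks * fs / N                            # bin index -> Hz (f_k = fs * k / N)
print(freqs)                                      # two peaks, near 440 and 490
```

With this M, the Hamming main lobe (about 4 bins of $f_s/M$, roughly 22 Hz here) is narrower than the 50 Hz separation, so the two peaks come out cleanly; halve M a couple of times and they merge into one.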
Harmonic model

Sinusoidal plus residual model (Stochastic model)
**Quoted from A3: Fourier Properties Part-4: Suppressing frequency components using DFT model**
Week8 - Sound transformations
Week9 - Sound and music description
Week10 - Concluding topics