[Introduction to PyTorch] Audio I/O and preprocessing with torchaudio (> <;)

As the title suggests, I tried running the code from the reference pages below. It's hard to claim this is especially novel, but it worked, so I'll summarize it here.

【reference】
① AUDIO I/O AND PRE-PROCESSING WITH TORCHAUDIO
② TORCHAUDIO.TRANSFORMS
③ SOURCE CODE FOR TORCHAUDIO.TRANSFORMS

The pitch of this section is as follows: "Significant effort in solving machine learning problems goes into data preparation. torchaudio leverages PyTorch's GPU support, and provides many tools to make data loading easy and more readable. In this tutorial, we will see how to load and preprocess data from a simple dataset. For more information, see Audio I/O and Pre-Processing with torchaudio (https://pytorch.org/tutorials/beginner/audio_preprocessing_tutorial.html). For this tutorial, please make sure the matplotlib package is installed for easier visualization."

What I did

・ Preparation
・ Opening a file
・ Transformations
・ Functional
・ Migrating to torchaudio from Kaldi
・ Available Datasets

・ Preparation

# Uncomment the following line to run in Google Colab
# !pip install torchaudio
import torch
import torchaudio
import requests
import matplotlib.pyplot as plt

When you run it for the first time, you need to install the following.

pip install torchaudio

Also, when I first tried to read a file, it failed with the error 'No audio backend is available.' (cannot import torchaudio). So, following the link above, install a backend. On Windows:

pip install PySoundFile

On Linux:

pip install sox
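Once a backend is installed, you can check what torchaudio sees. This is a quick sketch; torchaudio.list_audio_backends and torchaudio.set_audio_backend existed in the torchaudio versions of this era (they were deprecated in much later releases):

import torchaudio

# Lists the usable backends, e.g. ['soundfile'] on Windows or ['sox_io'] on Linux
print(torchaudio.list_audio_backends())
# torchaudio.set_audio_backend("soundfile")  # uncomment to pin a backend explicitly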

・ Opening a file

"Torchaudio also supports loading wav and mp3 format sound files. Waveforms are called raw audio signals." The following code reads the wav file that exists in the url with r = requests.get (url) into r and then stores it locally as'steam-train-whistle-daniel_simon-converted-from-mp3.wav' I will.

url = "https://pytorch.org/tutorials/_static/img/steam-train-whistle-daniel_simon-converted-from-mp3.wav"
r = requests.get(url)

with open('steam-train-whistle-daniel_simon-converted-from-mp3.wav', 'wb') as f:
    f.write(r.content)

filename = "steam-train-whistle-daniel_simon-converted-from-mp3.wav"
waveform, sample_rate = torchaudio.load(filename)

print("Shape of waveform: {}".format(waveform.size()))
print("Sample rate of waveform: {}".format(sample_rate))

plt.figure()
plt.plot(waveform.t().numpy())

The waveform is drawn with plt.plot(waveform.t().numpy()), giving the figure below. (fig_1__.png)
At first I wondered why two curves overlapped, but the standard output explains it:

Shape of waveform: torch.Size([2, 276858])
Sample rate of waveform: 44100

The first dimension of torch.Size is 2, so there are two series of 276858 samples each; in other words, this is 2-channel (stereo) data. Splitting the drawing per channel gives the figure below, as in the sketch that follows. (fig_2_.png)
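Plotting each channel on its own axis makes the stereo structure obvious. A minimal sketch (my own plotting code, not from the tutorial):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.plot(waveform[0].numpy())  # channel 0
ax2.plot(waveform[1].numpy())  # channel 1
plt.show()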

・ Transformations

"Torchaudio is still growing, but it supports conversions like the ones listed below."

Transform: Description
Resample: Resample waveform to a different sample rate.
Spectrogram: Create a spectrogram from a waveform.
GriffinLim: Compute waveform from a linear scale magnitude spectrogram using the Griffin-Lim transformation.
ComputeDeltas: Compute delta coefficients of a tensor, usually a spectrogram.
ComplexNorm: Compute the norm of a complex tensor.
MelScale: This turns a normal STFT into a Mel-frequency STFT, using a conversion matrix.
AmplitudeToDB: This turns a spectrogram from the power/amplitude scale to the decibel scale.
MFCC: Create the Mel-frequency cepstrum coefficients from a waveform.
MelSpectrogram: Create MEL Spectrograms from a waveform using the STFT function in PyTorch.
MuLawEncoding: Encode waveform based on mu-law companding.
MuLawDecoding: Decode mu-law encoded waveform.
TimeStretch: Stretch a spectrogram in time without modifying pitch for a given rate.
FrequencyMasking: Apply masking to a spectrogram in the frequency domain.
TimeMasking: Apply masking to a spectrogram in the time domain.

"Each transform supports batching: you can perform a transform on a single raw audio signal or spectrogram, or many of the same shape. All transforms are nn.Modules or jit.ScriptModules, so they can be used as part of a neural network at any point."
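Since every transform is an nn.Module, you can chain transforms like ordinary layers. A minimal sketch of such a pipeline (my own composition, not from the tutorial):

# Waveform -> mel spectrogram -> decibel scale, as one module
pipeline = torch.nn.Sequential(
    torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_fft=1024),
    torchaudio.transforms.AmplitudeToDB(),
)
features = pipeline(waveform)  # works on (channel, time) tensors
print(features.size())         # e.g. torch.Size([2, 128, n_frames])

Because it is a regular nn.Module, pipeline.to('cuda') would move the whole preprocessing chain onto the GPU, which is exactly the selling point quoted above.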

View the spectrogram on a log scale

specgram = torchaudio.transforms.Spectrogram()(waveform)
print("Shape of spectrogram: {}".format(specgram.size()))
plt.figure()
plt.imshow(specgram.log2()[0,:,:].numpy(), cmap='gray')

The signature of torchaudio.transforms.Spectrogram is as follows.

torchaudio.transforms.Spectrogram(n_fft: int = 400, win_length: Optional[int] = None, hop_length: Optional[int] = None, pad: int = 0, window_fn: Callable[[...], torch.Tensor] = <built-in method hann_window of type object>, power: Optional[float] = 2.0, normalized: bool = False, wkwargs: Optional[dict] = None)
Parameter: Description
n_fft (int, optional) – Size of FFT creates n_fft // 2 + 1 bins. (Default: 400)
win_length (int or None, optional) – Window size. (Default: n_fft)
hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
pad (int, optional) – Two sided padding of signal. (Default: 0)
window_fn (Callable[.., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
power (float or None, optional) – Exponent for the magnitude spectrogram, (must be > 0) e.g., 1 for energy, 2 for power, etc. If None, then the complex spectrum is returned instead. (Default: 2)
normalized (bool, optional) – Whether to normalize by magnitude after stft. (Default: False)
wkwargs (dict or None, optional) – Arguments for window function. (Default: None)
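As a quick check of the defaults in this table: n_fft=400 gives 400 // 2 + 1 = 201 frequency bins, and hop_length defaults to win_length // 2 = 200, so the 276858-sample train-whistle clip yields 276858 // 200 + 1 = 1385 frames. A sketch:

specgram = torchaudio.transforms.Spectrogram()(waveform)                 # defaults: n_fft=400, hop=200
print(specgram.size())                                                   # torch.Size([2, 201, 1385])

specgram_fine = torchaudio.transforms.Spectrogram(n_fft=1024)(waveform)  # hop becomes 512
print(specgram_fine.size())                                              # torch.Size([2, 513, 541])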

From the table above, the following code outputs a decent spectrogram.

filename = "10ohayo0hirakegoma_out.wav" 
waveform, sample_rate = torchaudio.load(filename)
print("Shape of waveform: {}".format(waveform.size()))
print("Sample rate of waveform: {}".format(sample_rate))

sk = "waveform"
fig, (ax1,ax2,ax3) = plt.subplots(3,1,figsize=(1.6180 * 4, 4*2))
lns1=ax1.plot(waveform.t().numpy(),"red",label = "waveform[0]")
lns2=ax2.plot(waveform.t().numpy(),"red",label = "waveform[0]")
lns3=ax3.plot(waveform.t().numpy(),"blue",label = "waveform[0]")
ax1.legend(loc=0)
ax2.legend(loc=0)
ax3.legend(loc=0)
ax1.set_title(sk)
ax2.set_xlim(50000,50000+44100*0.0625) #0,44100*0.25
ax3.set_xlim(3*44100,44100*3.0625)
plt.pause(1)
plt.savefig('./fig/fig_{}_double_.png'.format(sk)) 
plt.close()

specgram = torchaudio.transforms.Spectrogram(n_fft=1024)(waveform)
print("Shape of spectrogram: {}".format(specgram.size()))

sk = "specgram"
fig, (ax1,ax2,ax3) = plt.subplots(3,1,figsize=(1.6180 * 4, 4*2))
lns1=ax1.imshow(specgram.log2()[0,:,:].numpy(), cmap='gray')  # full range: 513 frequency bins
lns2=ax2.imshow(specgram.log2()[0,:,:].numpy(), cmap='hsv')
lns3=ax3.imshow(specgram.log2()[0,:,:].numpy(), cmap='hsv')

ax2.set_ylim(250,0)  # zoom in on the lowest 250 bins
ax3.set_ylim(125,0)  # zoom in further on the lowest 125 bins
ax1.set_title(sk)

plt.pause(1)
plt.savefig('./fig/fig_{}_double_.png'.format(sk))
plt.close()
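One caveat with specgram.log2(): bins whose power is exactly zero become -inf, which matplotlib tolerates but downstream math may not. A common workaround (my addition, not from the tutorial) is to add a small epsilon before the log:

eps = 1e-10
log_spec = (specgram + eps).log2()  # avoids -inf where the power is exactly 0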

View the mel spectrogram on a logarithmic scale

Next, let's display the mel spectrogram. What is a MelSpectrogram? Internally, the spectrogram is multiplied by what is called a mel filter bank: an amplitude-weighting filter that emphasizes the low-frequency region, somewhat like a brightness adjustment for an image.
【reference】
④ Understanding the Mel filter bank
⑤ Mel scale @wikipedia
The mel scale is as follows (note the log2; as a sanity check, f = 1000 Hz maps to exactly 1000 mel).

m = 1000\log _2\left(\frac{f}{1000\,\mathrm{Hz}}+1\right)

specgram = torchaudio.transforms.MelSpectrogram()(waveform)
print("Shape of spectrogram: {}".format(specgram.size()))
plt.figure()
p = plt.imshow(specgram.log2()[0,:,:].detach().numpy(), cmap='gray')

Shape of spectrogram: torch.Size([2, 128, 1385])

A picture similar to the Spectrogram above comes out, but with the conversion above applied. (fig_3__.png)
So let's check the signature the same way as for Spectrogram. It looks like the following.

torchaudio.transforms.MelSpectrogram(sample_rate: int = 16000, n_fft: int = 400, win_length: Optional[int] = None, hop_length: Optional[int] = None, f_min: float = 0.0, f_max: Optional[float] = None, pad: int = 0, n_mels: int = 128, window_fn: Callable[[...], torch.Tensor] = <built-in method hann_window of type object>, power: Optional[float] = 2.0, normalized: bool = False, wkwargs: Optional[dict] = None)

The meaning of each parameter is almost the same as for Spectrogram above; in addition, you can specify sample_rate.

Parameter: Description
sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)
win_length (int or None, optional) – Window size. (Default: n_fft)
hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)
f_min (float, optional) – Minimum frequency. (Default: 0.)
f_max (float or None, optional) – Maximum frequency. (Default: None)
pad (int, optional) – Two sided padding of signal. (Default: 0)
n_mels (int, optional) – Number of mel filterbanks. (Default: 128)
window_fn (Callable[.., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
wkwargs (Dict[.., ..] or None, optional) – Arguments for window function. (Default: None)

Based on the table above, I produce the output with the following code.

specgram = torchaudio.transforms.MelSpectrogram(sample_rate=44100,n_fft=2048)(waveform)
print("Shape of MelSpectrogram: {}".format(specgram.size()))

The result looks similar to the spectrogram above, but with a smaller vertical scale (128 mel bins), and the resolution looks better. (fig_MekSpecgram_double_.png)
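To see the low-frequency emphasis of the mel filter bank directly, you can build and plot the conversion matrix itself. A sketch assuming torchaudio.functional.melscale_fbanks (newer versions; in the torchaudio of this article's era the corresponding function was create_fb_matrix):

import torchaudio.functional as F

# (n_freqs, n_mels) matrix mapping 2048 // 2 + 1 = 1025 linear bins onto 128 mel bins
fb = F.melscale_fbanks(n_freqs=1025, f_min=0.0, f_max=22050.0,
                       n_mels=128, sample_rate=44100)
print("Shape of filter bank: {}".format(fb.size()))  # torch.Size([1025, 128])

plt.figure()
plt.plot(fb.numpy())  # triangular filters: narrow at low frequencies, wide at high ones
plt.show()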

Resampling

You can resample the waveform, one channel at a time. Here new_sample_rate is set to 1/10 of the original, so the horizontal axis of the result is 1/10 as long as the original waveform's.

new_sample_rate = sample_rate/10
# Since Resample applies to a single channel, we resample first channel here
channel = 0
transformed = torchaudio.transforms.Resample(sample_rate, new_sample_rate)(waveform[channel,:].view(1,-1))
print("Shape of transformed waveform: {}".format(transformed.size()))
plt.figure()
plt.plot(transformed[0,:].numpy())

Shape of transformed waveform: torch.Size([1, 27686])
(fig_4__.png)
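Since Resample here applies to one channel at a time, a simple way to handle the stereo clip is to loop over the channels (a sketch; newer torchaudio versions resample (channel, time) tensors directly):

resampler = torchaudio.transforms.Resample(sample_rate, new_sample_rate)
resampled = torch.cat([resampler(waveform[c, :].view(1, -1))
                       for c in range(waveform.size(0))])
print("Shape: {}".format(resampled.size()))  # torch.Size([2, 27686])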

μ-Law algorithm

"As another example of a transformation, we can encode the signal based on Mu-Law encoding. But to do so, we need the signal to be between -1 and 1. Since the tensor is just a regular PyTorch tensor, we can apply standard operators on it."

# Let's check if the tensor is in the interval [-1,1]
print("Min of waveform: {}\nMax of waveform: {}\nMean of waveform: {}".format(waveform.min(), waveform.max(), waveform.mean()))

Min of waveform: -0.572845458984375
Max of waveform: 0.575958251953125
Mean of waveform: 9.293758921558037e-05

"The waveform is already between -1 and 1, so there is no need to normalize." The following is a standardization function.

def normalize(tensor):
    # Subtract the mean, and scale to the interval [-1,1]
    tensor_minusmean = tensor - tensor.mean()
    return tensor_minusmean/tensor_minusmean.abs().max()
# Let's normalize to the full interval [-1,1]
# waveform = normalize(waveform)

If you uncomment the last line, the waveform is normalized to the full interval [-1, 1].
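As a quick check of normalize (my example, not from the tutorial), a tensor that is neither centered nor scaled comes out centered and spanning [-1, 1]:

t = torch.tensor([0.0, 0.5, 1.0])
print(normalize(t))  # tensor([-1., 0., 1.]): mean removed, then divided by the max absolute value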

"Let's apply the waveform encoding."

transformed = torchaudio.transforms.MuLawEncoding()(waveform)
print("Shape of transformed waveform: {}".format(transformed.size()))
plt.figure()
plt.plot(transformed[0,:].numpy())

Shape of transformed waveform: torch.Size([2, 276858])
(fig_5__.png)
"Now let's decode."

reconstructed = torchaudio.transforms.MuLawDecoding()(transformed)
print("Shape of recovered waveform: {}".format(reconstructed.size()))
plt.figure()
plt.plot(reconstructed[0,:].numpy())

Shape of recovered waveform: torch.Size([2, 276858])
(fig_6__.png)
"Finally, we can compare the original waveform with its reconstructed version."

# Compute median relative difference
err = ((waveform-reconstructed).abs() / waveform.abs()).median()
print("Median relative difference between original and MuLaw reconstucted signals: {:.2%}".format(err))

Median relative difference between original and MuLaw reconstructed signals: 1.28%

In other words, the signal was compressed by mu-law encoding, decoded again, and the median relative error was 1.28% (a median is used because dividing by near-zero samples would blow up a mean).

・ Functional

"The transformations above rely on lower level stateless functions for their computations. These functions are available under torchaudio.functional. The complete list is available here" (broken link) "and includes the following."
【reference】 The broken link appears to have moved to the following page:
⑥ TORCHAUDIO.FUNCTIONAL
Unfortunately, that torchaudio.functional page has changed and no longer includes stft, etc.; stft turned out to live at torch.stft, covered below.

Function: Description
istft: Inverse short time Fourier Transform.
stft: Short time Fourier Transform.
gain: Applies amplification or attenuation to the whole waveform.
dither: Increases the perceived dynamic range of audio stored at a particular bit-depth.
compute_deltas: Compute delta coefficients of a tensor.
equalizer_biquad: Design biquad peaking equalizer filter and perform filtering.
lowpass_biquad: Design biquad lowpass filter and perform filtering.
highpass_biquad: Design biquad highpass filter and perform filtering.
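For example, the stateless counterparts of the MuLaw transforms used above can be called directly; the number of quantization channels is passed explicitly (a sketch using torchaudio.functional.mu_law_encoding / mu_law_decoding):

import torchaudio.functional as F

encoded = F.mu_law_encoding(waveform, quantization_channels=256)  # same as MuLawEncoding()(waveform)
decoded = F.mu_law_decoding(encoded, quantization_channels=256)
print(encoded.size(), decoded.size())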

STFT

As a bonus, I also tried STFT. First, the signature is as follows.

torch.stft(input: torch.Tensor, n_fft: int, hop_length: Optional[int] = None, win_length: Optional[int] = None, window: Optional[torch.Tensor] = None, center: bool = True, pad_mode: str = 'reflect', normalized: bool = False, onesided: Optional[bool] = None, return_complex: Optional[bool] = None) → torch.Tensor

The code is below. Here input is the waveform, whose shape is torch.Size([1, 220160]). Since input must be a 1D tensor or a 2D (batch, waveform) tensor, I reshaped it to 1D with waveform.reshape(220160). Drawing is simplified by extracting only one of the two output components (the real part), as in lns1=ax1.imshow(specgram.log2().numpy()[:,:,0], cmap='gray'). The result, tentatively (I am not sure it is correct), is the figure below. (fig_stft_double.png)

sk = "stft"
specgram = torch.stft(input = torch.tensor(waveform.reshape(220160)) ,n_fft=1024) #(waveform)
print("Shape of stftSpectrogram: {}".format(specgram.size()))
fig, (ax1,ax2,ax3) = plt.subplots(3,1,figsize=(1.6180 * 4, 4*2))
lns1=ax1.imshow(specgram.log2().numpy()[:,:,0], cmap='gray')
lns2=ax2.imshow(specgram.log2().numpy()[:,:,0], cmap='hsv')
lns3=ax3.imshow(specgram.log2().numpy()[:,:,0], cmap='hsv')
ax2.set_ylim(250,0)
ax3.set_ylim(125,0)
ax1.set_title(sk)
plt.pause(1)
plt.savefig('./fig/fig_{}_double_.png'.format(sk)) 
plt.close()
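Note that newer PyTorch versions warn unless torch.stft is given an explicit window and return_complex=True, in which case it returns a complex tensor. A sketch of the modern call (shapes assume the 220160-sample mono clip above; the default hop_length is n_fft // 4 = 256):

spec = torch.stft(waveform.reshape(-1), n_fft=1024,
                  window=torch.hann_window(1024), return_complex=True)
print(spec.size())      # torch.Size([513, 861]): n_fft // 2 + 1 bins x frames
magnitude = spec.abs()  # magnitude, instead of indexing the real part with [:, :, 0]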
From the docs: "The STFT computes the Fourier transform of short overlapping windows of the input, giving frequency components of the signal as they change over time. The interface of this function is modeled after librosa's stft function." (https://librosa.org/doc/latest/generated/librosa.stft.html)

In other words, the design was borrowed from librosa.stft. So let's go straight to the original. This one displays correctly.

# Feature extraction example
import numpy as np
import librosa
import librosa.display

y, sr = librosa.load('10ohayo0hirakegoma_out.wav')  # (the librosa docs example uses librosa.ex('trumpet'))

S = np.abs(librosa.stft(y))

S_left = librosa.stft(y, center=False)

D_short = librosa.stft(y, hop_length=64)

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
img = librosa.display.specshow(librosa.amplitude_to_db(S,
                                                       ref=np.max),
                               y_axis='log', x_axis='time', ax=ax)
sk = 'Power spectrogram'
ax.set_title(sk)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
plt.pause(1)
plt.savefig('./fig/fig_{}_librose_.png'.format(sk)) 
plt.close()

Here both the vertical and horizontal axes are labeled reliably, as shown below. This has nothing to do with PyTorch and cannot use the GPU, which was the original motivation, but it is still useful for preprocessing (for example, preprocessing STFT images for speech recognition). Having seen this figure, I also realized that the torch.stft plot above is flipped top-to-bottom and its axes are simply element counts. (fig_Power spectrogram_librose_.png)

The tutorial has further sections, but I'll pass on them this time:
・ mu_law_encoding functional
・ Visualize a waveform with the highpass biquad filter
・ Migrating to torchaudio from Kaldi
・ Create mel frequency cepstral coefficients from a raw audio signal
Also, the provided datasets seem newer at the link below.
・ Available Datasets ⇒ TORCHAUDIO.DATASETS

Looking at the timestamp, it said: © Copyright 2017, PyTorch. A bit stale, isn't it? lol

It may take some time before this becomes second nature (that is, before you know where to find the information you need)...

Summary

・ I tried running torchaudio.
・ I expected to be able to compute on the GPU, but not on my machine.
・ Since the timestamps are old, I recommend referring to the newest pages whenever possible.

By the way, the following pages are from 2017-2018:
・ TORCHAUDIO: © Copyright 2018, Torchaudio Contributors.
・ [TORCHAUDIO.FUNCTIONAL](https://pytorch.org/audio/stable/functional.html): © Copyright 2018, Torchaudio Contributors.
・ SPEECH COMMAND RECOGNITION WITH TORCHAUDIO: © Copyright 2017, PyTorch.
