This post covers Jukebox, a model that can generate music containing singing voice using a VQ-VAE (vector-quantized variational autoencoder). I translated the paper for my own study and summarized its contents here. The implementation has also been published on GitHub, and I am trying it out; I hope to write a simpler summary of that next time.
**I would be grateful if you could point out any passages that are not translated properly.**
Jukebox: A Generative Model for Music (Dhariwal et al., 2020) https://arxiv.org/abs/2005.00341
Abstract
This paper introduces Jukebox, a model that generates music with singing. It shows that a **multi-scale VQ-VAE** can be used to generate diverse, high-fidelity songs that stay coherent for up to several minutes. **By conditioning on artist and genre, the style of the music and vocals can be controlled, and by conditioning on unaligned lyrics, the singing becomes easier to control as well.**
1.Introduction
The authors' model can generate songs across a wide range of genres, such as rock, hip hop, and jazz. It captures melody, rhythm, long-range composition, and the timbre of various instruments, as well as the styles and voices of the singers to be produced with the music, and it can also produce new completions of existing songs.
The authors' approach uses a **hierarchical VQ-VAE architecture (Razavi et al., 2019) to compress audio into a discrete space, with a loss function designed to retain as much musical information as possible at increasing levels of compression.** An autoregressive Sparse Transformer (Child et al., 2019; Vaswani et al., 2017) is trained with maximum-likelihood estimation over this compressed space, and autoregressive upsamplers are trained to recreate the information lost at each level of compression.
**New completions of existing songs are also possible.** The approach also allows the generation process to be influenced: by swapping the top-level prior for a conditional prior, the model can be conditioned on lyrics to tell the singer what to sing, or on MIDI to control the composition.
2.Background
Music is represented as a continuous waveform $x \in [-1, 1]^T$, where the number of samples $T$ is the product of the audio duration and the sampling rate, which typically ranges from 16 kHz to 48 kHz. CD-quality audio is typically **sampled at 44.1 kHz and stored with 16-bit precision**.
**Learning to generate music requires far more computation than image generation.** For example, a 4-minute song at CD quality (44.1 kHz, 16-bit) has an input length of over 10 million timesteps, whereas a high-quality RGB image has on the order of 3 million inputs at 24 bits per pixel. To address this, **a VQ-VAE is used to map the raw audio into a lower-dimensional space.** It consists of an encoder $E(x)$ that maps the input to a sequence of embedding vectors, a bottleneck that quantizes each embedding to its nearest vector in a codebook $C = \{\mathbf{e}_k\}_{k=1}^K$, and a decoder $D(e)$ that maps back to the input space.
The objective function for learning is as follows:
\mathcal{L} = \mathcal{L}_{\mathrm{recons}} + \mathcal{L}_{\mathrm{codebook}} + \beta \mathcal{L}_{\mathrm{commit}}
\mathcal{L}_{\mathrm{recons}} = \frac{1}{T}\sum_t \|x_t - D(e_{z_t})\|_2^2, \quad \mathcal{L}_{\mathrm{codebook}} = \frac{1}{S}\sum_s \|\mathrm{sg}[h_s] - e_{z_s}\|_2^2, \quad \mathcal{L}_{\mathrm{commit}} = \frac{1}{S}\sum_s \|h_s - \mathrm{sg}[e_{z_s}]\|_2^2
Note that $sg$ stands for stop-gradient, meaning no gradient is propagated through that term. This trains a single encoder and decoder. To build a hierarchy, the latent sequence $h$ is divided into a multi-level representation $[h^{(1)}, ..., h^{(L)}]$ with decreasing sequence lengths, each level learning its own codebook $C^{(l)}$. Non-autoregressive encoders and decoders are used, and all levels are trained jointly with a simple mean-squared loss.
VQ-VAE stands for Vector Quantized Variational Autoencoder. **It addresses posterior collapse*, a problem specific to VAEs, by introducing vector quantization.**
"Quantize" means converting to discrete values. Vector quantization approximates continuous quantities, such as analog signals, **by discrete values such as integers or points in a discrete latent space.** The latent space used here is expressed as follows (and is called a codebook):
e = [e_1, e_2, \dots, e_K] \in \mathbb{R}^{D \times K}
The size of this space is $K$, and each point $e_i$ is a $D$-dimensional real vector. The codebook itself is learned jointly with the rest of the model.
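As a minimal illustration of this lookup (a sketch in NumPy, not the paper's code; sizes and names are assumptions), quantization simply maps each latent vector to the index of its nearest codebook entry:

```python
import numpy as np

K, D = 2048, 64                         # illustrative codebook size and latent dimension
codebook = np.random.randn(K, D)        # e = [e_1, ..., e_K]; learned jointly in practice

def quantize(h):
    """Map each D-dimensional latent h_t to its nearest codebook vector."""
    # h: (T, D) sequence of encoder outputs; distances: (T, K)
    dists = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    z = dists.argmin(axis=1)            # discrete codes z_t
    return z, codebook[z]               # indices and quantized vectors e_{z_t}

z, e_z = quantize(np.random.randn(100, D))
print(z.shape, e_z.shape)               # (100,) (100, 64)
```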
*Posterior collapse is a phenomenon in which the latent variables are ignored when a powerful decoder such as PixelCNN is used.
For training, a plain autoencoder would define the reconstruction loss as follows (the second line is the log-likelihood form), where $E(x)$ is the encoder, $Q(x)$ is the quantization function, and $D(x)$ is the decoder.
L = ||x - D(Q(E(x)))||^2
L = log p(x|D(Q(E(x))))
**VQ-VAE adds the following terms to this, which update the codebook $e$ and constrain the encoder:**
L = ||sg[E(x)] - Q(E(x))||^2 + \beta ||E(x) - sg[Q(E(x))]||^2
$\beta$ is a hyperparameter; values in the range 0.1 to 2 are reportedly sufficient.
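Putting the pieces together, here is a minimal PyTorch sketch of this objective (not the authors' implementation; the toy encoder/decoder at the end exist only to check shapes). `detach()` plays the role of $sg$, and the straight-through trick passes gradients through the quantization step:

```python
import torch
import torch.nn.functional as F

def vq_vae_loss(x, encoder, decoder, codebook, beta=0.2):
    """Sketch of the objective: reconstruction + codebook + beta * commitment."""
    h = encoder(x)                                       # E(x): (batch, T, D)
    # squared distances from each latent to every codebook vector: (batch, T, K)
    dists = (h.pow(2).sum(-1, keepdim=True)
             - 2 * h @ codebook.t()
             + codebook.pow(2).sum(-1))
    z = dists.argmin(-1)                                 # discrete codes z_t
    e_z = codebook[z]                                    # Q(E(x)): quantized latents

    # straight-through estimator: forward uses e_z, backward copies gradients to h
    x_hat = decoder(h + (e_z - h).detach())              # D(Q(E(x)))

    recon = F.mse_loss(x_hat, x)                         # ||x - D(Q(E(x)))||^2
    codebook_loss = F.mse_loss(e_z, h.detach())          # ||sg[E(x)] - Q(E(x))||^2
    commit_loss = F.mse_loss(h, e_z.detach())            # ||E(x) - sg[Q(E(x))]||^2
    return recon + codebook_loss + beta * commit_loss

# toy shapes only, to check that the pieces fit together
enc, dec = torch.nn.Linear(1, 64), torch.nn.Linear(64, 1)
codebook = torch.randn(2048, 64)
x = torch.randn(2, 100, 1)                               # (batch, time, channels) dummy audio
print(vq_vae_loss(x, enc, dec, codebook).item())
```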
Reference material
Journey through deep generative models (2): VAE https://qiita.com/shionhonda/items/e2cf9fe93ae1034dd771
3.Music VQ-VAE
Inspired by the results of applying the hierarchical VQ-VAE to images (link below), the authors considered applying the same technology to music.
Generating Diverse High-Fidelity Images with VQ-VAE-2 https://arxiv.org/abs/1906.00446
First, three separate VQ-VAE models with different temporal resolutions are trained. At each of the three levels, the audio input is encoded into a sequence of latent vectors $h_t$, each of which is quantized to its nearest codebook vector $e_{z_t}$. The codes $z_t$ form a discrete representation of the audio, and the decoder takes the sequence of codebook vectors and reconstructs the music.
(Figure 1 in the paper illustrates this encode-quantize-decode pipeline at the three levels.)
3.1.Random restarts for embeddings
A known problem when training VQ-VAEs is **codebook collapse**: all encodings get mapped to a single vector or a small handful of embedding vectors while the rest of the codebook goes unused, which reduces the information capacity of the bottleneck. Random restarts are used to prevent this.
When the mean usage of a codebook vector falls below a threshold, it is randomly reset to one of the encoder outputs in the current batch. This ensures that all vectors in the codebook are being used, maintaining a gradient to learn from and mitigating codebook collapse.
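A sketch of that restart rule (names, threshold, and bookkeeping are assumptions for illustration; the released code may differ): keep a running estimate of how often each code is used and re-initialize under-used entries from encoder outputs in the current batch.

```python
import torch

def restart_dead_codes(codebook, usage_ema, encoder_outputs, threshold=1.0):
    """Re-initialize rarely used codebook vectors from random encoder outputs.

    codebook:        (K, D) tensor of embedding vectors
    usage_ema:       (K,) moving average of how often each code was selected
    encoder_outputs: (N, D) flattened encoder outputs from the current batch
    """
    dead = usage_ema < threshold                       # codes whose average usage fell below the threshold
    n_dead = int(dead.sum())
    if n_dead > 0:
        # pick random encoder outputs to replace the dead codebook entries
        idx = torch.randint(0, encoder_outputs.shape[0], (n_dead,))
        codebook.data[dead] = encoder_outputs[idx].detach()
        usage_ema[dead] = threshold                    # so they are not immediately restarted again
    return n_dead
```

In practice `usage_ema` would be updated after every quantization step from the counts of selected codes.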
3.2.Separated Autoencoders
When the hierarchical VQ-VAE was used for music, the top level of the bottleneck was barely used, and sometimes collapsed completely, because the model could push all of the information down to the lower levels where the bottleneck is less severe. To maximize the amount of information stored at each level, the authors instead trained separate autoencoders with different hop lengths. The discrete codes at each level can then be treated as independent encodings of the input at different levels of compression.
3.3.Spectral Loss
When only a sample-level reconstruction loss is used, the model learns to reconstruct mainly the low frequencies. To also capture the mid-to-high frequency range, a spectral loss is added, defined as:
\mathcal{L}_{\mathrm{spec}} = \| |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \|_2
This lets the model match the spectral components without having to attend to the phase, which is harder to learn.
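A sketch of a spectral loss of this form in PyTorch (the STFT window and hop sizes here are illustrative choices, not the paper's settings):

```python
import torch

def spectral_loss(x, x_hat, n_fft=1024, hop_length=256):
    """L_spec = || |STFT(x)| - |STFT(x_hat)| ||_2, computed on magnitudes only (phase ignored)."""
    window = torch.hann_window(n_fft)
    spec = lambda a: torch.stft(a, n_fft=n_fft, hop_length=hop_length,
                                window=window, return_complex=True).abs()
    return torch.norm(spec(x) - spec(x_hat), p=2)

x = torch.randn(44100)        # one second of dummy 44.1 kHz audio
x_hat = torch.randn(44100)
print(spectral_loss(x, x_hat).item())
```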
4.Music Priors and Upsamplers
After training the VQ-VAE, a prior over the compressed code space is learned: a top-level prior over the most compressed codes, plus upsamplers that model the middle- and bottom-level codes conditioned on the codes of the level above. Each of these is an autoregressive modeling problem over the discrete token space produced by the VQ-VAE.
4.1.Artist, Genre, and Timing Conditioning
The authors' generative models can be made more controllable by adding conditioning signals during **training**. First, each song is given an artist label and a genre label. This has two advantages: it reduces the entropy of the audio prediction, so the model achieves better quality in any particular style, and at generation time the model can be steered to generate in a style of your choosing. In addition, a timing signal is attached to each training segment. This signal includes the total duration of the song, the start time of the sampled segment, and the fraction of the song that has elapsed, which lets the model learn musical patterns that depend on the overall structure.
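A sketch of how such conditioning signals might be wired in (an illustration with assumed names and sizes, not the paper's implementation): learn embeddings for the artist and genre labels, project the timing information, and add the result to the inputs of the autoregressive prior.

```python
import torch
import torch.nn as nn

class ConditionEmbedding(nn.Module):
    """Sketch: combine artist, genre, and timing signals into one conditioning vector."""
    def __init__(self, n_artists, n_genres, d_model):
        super().__init__()
        self.artist_emb = nn.Embedding(n_artists, d_model)
        self.genre_emb = nn.Embedding(n_genres, d_model)
        # timing: (total duration, segment start time, fraction of song elapsed)
        self.timing_proj = nn.Linear(3, d_model)

    def forward(self, artist_id, genre_id, timing):
        return (self.artist_emb(artist_id)
                + self.genre_emb(genre_id)
                + self.timing_proj(timing))

cond = ConditionEmbedding(n_artists=5000, n_genres=300, d_model=64)
c = cond(torch.tensor([3]), torch.tensor([7]),
         torch.tensor([[240.0, 30.0, 0.125]]))   # a 4-minute song, segment starting at 30 s
print(c.shape)                                   # torch.Size([1, 64])
```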
4.2. Lyrics Conditioning
The conditional model above can generate songs in a variety of genres and artistic styles, and the singing it produces often follows a compelling melody. **However, the vocals mostly consist of babble-like sounds and rarely form recognizable English words.**
Therefore, to make the generative model controllable with lyrics, the model is conditioned on the lyrics corresponding to each audio segment. This provides more context at training time and makes it possible to generate singing together with the music.
4.3. Decoder Pretraining
To reduce the computation required to train the lyrics-conditioned model, the authors reused a pretrained unconditional top-level prior as the decoder and grafted a lyrics encoder onto it using model surgery. At initialization the model behaves exactly like the pretrained decoder, but gradients still flow to the encoder states and parameters, which allows the model to learn to use the encoder.
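One common way to get this behaviour (an assumption on my part; the paper's exact surgery procedure may differ) is to zero-initialize the output projection of the newly added cross-attention, so the pretrained decoder's outputs are unchanged at initialization while gradients still reach the new encoder. A sketch:

```python
import torch.nn as nn

class LyricsCrossAttention(nn.Module):
    """Sketch: a cross-attention block grafted onto a pretrained decoder layer."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, d_model)
        nn.init.zeros_(self.out.weight)   # zero output projection:
        nn.init.zeros_(self.out.bias)     # the block contributes nothing at initialization

    def forward(self, decoder_states, lyrics_states):
        attended, _ = self.attn(decoder_states, lyrics_states, lyrics_states)
        # residual connection: identical to the pretrained decoder at init,
        # but gradients still flow into self.attn and the lyrics encoder
        return decoder_states + self.out(attended)
```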
4.4.Sampling
Ancestral sampling
Each model takes conditioning information such as genre, artist, timing, and lyrics, and the upsampler models additionally require the codes from the level above. To generate music, the VQ-VAE codes are sampled level by level from top to bottom using this conditioning information, and the VQ-VAE decoder then converts the bottom-level codes to audio.
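Conceptually, the procedure can be summarized by the following sketch (all function and attribute names are placeholders, not the released API):

```python
def ancestral_sample(top_prior, upsamplers, vqvae_decoder, conditioning):
    """Sketch: sample codes top-down, then decode the bottom-level codes to audio."""
    # 1. sample top-level codes from the conditioned prior
    codes = top_prior.sample(conditioning)
    # 2. upsample level by level, conditioning each upsampler on the codes above it
    for upsampler in upsamplers:              # e.g. middle level, then bottom level
        codes = upsampler.sample(conditioning, codes_above=codes)
    # 3. decode the bottom-level codes back to a waveform
    return vqvae_decoder(codes)
```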
Windowed sampling
To generate music longer than the model's context length (12 tokens in the figure), a window that overlaps with the previously generated codes is used as context, and the continuation is sampled repeatedly at each level. The amount of overlap is a hyperparameter; the figure shows an example with 75% overlap and a hop length of 3.
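A sketch of that loop with placeholder names (`sample_window` is hypothetical, not the released API); `context_len=12` and `hop=3` mirror the figure's 75%-overlap example:

```python
def windowed_sample(model, conditioning, total_len, context_len=12, hop=3):
    """Sketch: generate sequences longer than the model's context by sliding a window."""
    tokens = model.sample_window(conditioning, prefix=[], n_new=context_len)
    while len(tokens) < total_len:
        prefix = tokens[-(context_len - hop):]   # overlapping context from the previous window
        new = model.sample_window(conditioning, prefix=prefix, n_new=hop)
        tokens.extend(new)
    return tokens[:total_len]
```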
Primed sampling: A continuation of an existing audio signal can be generated by converting the existing audio to VQ-VAE codes and then sampling the subsequent codes at each level.
5.Experiments
5.1. Dataset
The authors scraped a new dataset of **1.2 million songs (600,000 of them in English), paired with** lyrics and metadata from LyricWiki. The metadata includes artist, album, genre, year of release, and common moods and playlist keywords associated with each song. Training uses 32-bit, 44.1 kHz raw audio, and the data is augmented by randomly downmixing the left and right channels to produce mono audio.
5.2. Training Details
The music VQ-VAE **compresses 44 kHz audio with three levels of bottlenecks at 8x, 32x, and 128x, each with a codebook size of 2048.** The VQ-VAE has 2 million parameters and was trained on 9-second audio clips for 3 days on 256 V100s.
The upsamplers have 1 billion parameters and were trained for 2 weeks on 128 V100s each, and the top-level prior has 5 billion parameters and was trained for 4 weeks on 512 V100s. Adam is used with a learning rate of 0.00015 and weight decay of 0.002. For lyrics conditioning, the pretrained prior was reused with a small encoder added, and training continued for two weeks on 512 V100s.
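For intuition, the 8x/32x/128x bottlenecks above translate into the following number of codes per second at each level (a back-of-the-envelope calculation, not figures quoted from the paper):

```python
sample_rate = 44100          # Hz
for name, hop in [("bottom", 8), ("middle", 32), ("top", 128)]:
    codes_per_second = sample_rate / hop
    print(f"{name:>6} level: {hop:>3}x compression -> "
          f"{codes_per_second:7.1f} codes/s, {codes_per_second * 60:9.0f} codes/min")
# bottom: ~5512 codes/s, middle: ~1378 codes/s, top: ~345 codes/s
```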
5.3.Samples
The authors trained a sequence of models while progressively improving sample quality. **The first model was trained on the MAESTRO dataset using 22 kHz VQ-VAE codes and relatively small prior models.** They found this could generate high-fidelity classical music samples including piano and violin. They then collected a larger and more diverse set of songs with genre and artist labels. Applying the same models to this new dataset, they were able to generate a wide variety of non-classical samples, demonstrating musicality and coherence over more than a minute.
Coherence
The samples stay musically very coherent throughout the length of the top-level prior's context (about 24 seconds). Sliding the window to generate longer samples was also found to maintain similar harmonies and textures.
Musicality
The samples imitate familiar musical harmonies, and the lyrics are usually set in a very natural way. The highest and longest notes of a melody often coincide with the words a human singer would emphasize, and the lyrics are almost always rendered in a way that captures the prosody of the phrase.
Novel styles
The authors generate songs in genres that are unusual for a given artist. In general, it can be quite difficult to generalize to a new singing style while keeping the artist's voice. However, conditioning country singer Alan Jackson on unusual genres such as hip hop and punk did not produce samples that departed from the country style.
Novel riffs
Another useful application of Jukebox is exploring different continuations of an incomplete recorded idea. The authors took novel riffs recorded by a musician and used them to prime the model during sampling. Sample 6 begins in a musical style that is rarely used in Elton John's songs, and the model takes the song a step further in that style.
5.4. VQ-VAE Ablations
**The referenced figure compares reconstructions from different VQ-VAEs, with time on the x-axis and frequency on the y-axis.** The columns, from left to right, show reconstructions from the bottom, middle, and top levels with hop lengths 8, 32, and 128, each visualized as a mel spectrogram. The third row shows that removing the spectral loss causes high-frequency information to be lost at the middle and top levels. The fourth row uses a hierarchical VQ-VAE (Razavi et al., 2019) instead of separate autoencoders (Figure 1). Finally, the fifth row shows a baseline using the Opus codec, which encodes the audio at a constant bit rate comparable to the VQ-VAE; it also fails to capture high frequencies and adds noticeable artifacts at the highest level of compression.
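For reference, a mel spectrogram like the ones compared in that figure can be computed with librosa roughly as follows (the file name and parameters are illustrative, not the paper's settings):

```python
import numpy as np
import librosa

y, sr = librosa.load("reconstruction.wav", sr=44100)         # hypothetical file name
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)                # log scale: x-axis time, y-axis mel frequency
print(mel_db.shape)                                          # (n_mels, n_frames)
```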
6.Related Work
Generative models for music: The history of generative models for symbolic music goes back more than half a century. Early approaches include rule-based systems (Moorer, 1972), chaos and self-similarity (Pressing, 1988), cellular automata (Beyls, 1989), concatenative synthesis (Jehan, 2005), and constraint programming (Anders & Miranda, 2011). More recent data-driven approaches include DeepBach (Hadjeres et al., 2017) and Coconet (Huang et al., 2017), which use Gibbs sampling to generate notes in the style of Bach chorales, and MidiNet (Yang et al., 2017) and MuseGAN (Dong et al., 2018), which use generative adversarial networks.
There are also many approaches that synthesize music conditioned on symbolic music information, such as NSynth (Engel et al., 2017), which uses a WaveNet-style autoencoder, Mel2Mel (Kim et al., 2019), and Wave2Midi2Wave (Hawthorne et al., 2019).
Sample-level generation of audio:
In recent years, a variety of audio generation models have been proposed. WaveNet (Oord et al., 2016) uses a series of dilated convolutions to grow the context length exponentially, and performs autoregressive, sample-by-sample probabilistic modeling of the raw waveform. This makes it possible to generate realistic audio either unconditionally or conditioned on acoustic features or spectrograms.
Parallel WaveNet (Oord et al., 2018) improves on this by instead using a mixture-of-logistics distribution, a continuous probability distribution, and by performing probability density distillation to learn a parallel feedforward network from a pretrained autoregressive model, which enables fast sampling of high-fidelity audio.
WaveGlow (Prenger et al., 2019) is a flow-based model for parallel sample-level audio synthesis. It can be trained with straightforward maximum-likelihood estimation, which is an advantage over the two-stage training process required by distillation-based approaches.
VQ-VAE: Oord et al. (2017) introduced VQ-VAE, an approach that uses vector quantization to downsample extremely long context inputs into a shorter sequence of discrete latent codes. They showed that it can generate high-quality images and audio and learn unsupervised representations of phonemes. Razavi et al. (2019) extended this model by introducing a hierarchy of discrete representations for images, and showed that the resulting model can capture local features such as texture in the lower levels with smaller receptive fields, while separating high-level semantics into the top level of discrete codes with the largest receptive field.
Speech synthesis: Generating natural human voice requires understanding linguistic features, mapping them to sounds, and controlling expression. Many text-to-speech (TTS) systems rely on highly engineered features (Klatt, 1980), carefully curated sound segments (Hunt & Black, 1996), statistical parametric modeling (Zen et al., 2009), and complex multi-stage pipelines (Arık et al., 2017).
Recent works such as Deep Voice 3 (Ping et al., 2018), Tacotron 2 (Shen et al., 2018), and Char2Wav (Sotelo et al., 2017) learn speech synthesis end-to-end using sequence-to-sequence architectures (Sutskever et al., 2014). Although the design space is vast, a typical approach consists of a bidirectional encoder over text representations, a decoder that produces audio features, and a vocoder that builds the final raw waveform.
7.Future work
The authors' approach advances the ability to produce long, coherent music samples, but they recognize several directions for future work. Great music has to be high quality at every timescale. The authors believe the current model is strongest at the medium-range timescale: it often produces locally very good samples with a variety of interesting harmonies, rhythms, instruments, and voices.
The authors were impressed by how well the generated melodies and rhythms fit the given lyrics. However, while the samples stay coherent over long timescales, they lack the conventional larger musical structures, such as repeating choruses or call-and-response melodies. At the smallest timescales, noise and scratchiness can also be heard.
With the current model, it also takes about an hour to generate **one minute's worth of top-level tokens.** The upsampling process is even more time-consuming because samples are processed sequentially: it currently takes about 8 hours to upsample one minute of top-level tokens.
8.Conclusion
Jukebox is a model that generates music imitating a variety of styles and artists. It can be conditioned on a specific artist or genre, and the lyrics of a sample can be specified. The authors trained a hierarchical VQ-VAE and laid out the details needed to compress music effectively into tokens. Previous work on raw-audio music generation reached the 20-30 second range, whereas this model makes it possible to generate songs several minutes long with natural-sounding, recognizable singing.
If I find any mistakes in my understanding of the paper while working through the implementation, I will correct them.