OpenAI has announced the release of Jukebox, a new neural network model that generates realistic music in a wide variety of genres and styles.
Automatic music generation is a long-standing challenge, and many past attempts have produced only limited audio that could loosely be regarded as "music". In their recent work, researchers and engineers at OpenAI model music directly as raw audio, an even harder problem because raw audio signals contain extremely long-range dependencies.
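For a sense of the sequence lengths involved, a quick back-of-the-envelope calculation (the numbers below are illustrative, not taken from the paper) shows why raw audio is so difficult to model directly:

```python
# Rough scale of the raw-audio modeling problem (illustrative numbers).
sample_rate = 44_100               # CD-quality samples per second
song_seconds = 4 * 60              # a typical 4-minute song
num_samples = sample_rate * song_seconds
print(f"{num_samples:,} raw audio samples per song")   # 10,584,000
```

A model generating a song one sample at a time would have to keep musical structure coherent across roughly ten million timesteps, which motivates compressing the audio into a much shorter discrete sequence first.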
To address the problem, the researchers propose an autoencoder model called Jukebox that compresses audio signals into a discrete space, building on previous advances in variational autoencoders such as VQ-VAE. VQ-VAE, or Vector Quantised Variational Autoencoder, proposed by Aaron van den Oord and colleagues, is a generative model that avoids the posterior collapse common in VAEs by having the encoder output discrete codes and by learning the prior rather than keeping it static. OpenAI's researchers drew inspiration from VQ-VAE (and its variants) and applied it to music generation directly in the raw audio domain.
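As a rough illustration of the vector-quantization idea, here is a minimal PyTorch sketch of a VQ bottleneck (this is not OpenAI's implementation; the class name, codebook size, and dimensions are hypothetical choices for the example):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck: maps continuous encoder outputs to the
    nearest vectors in a learned codebook, with straight-through gradients."""
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) continuous encoder output
        flat = z_e.reshape(-1, z_e.shape[-1])
        # squared L2 distance from each latent to every codebook vector
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        codes = dists.argmin(dim=1)                 # discrete token ids
        z_q = self.codebook(codes).view_as(z_e)     # quantized vectors
        # straight-through estimator: copy gradients from z_q back to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes.view(z_e.shape[:-1])

vq = VectorQuantizer()
z_e = torch.randn(2, 100, 64)     # fake encoder output for 2 audio clips
z_q, codes = vq(z_e)              # codes are the discrete "music tokens"
print(z_q.shape, codes.shape)     # torch.Size([2, 100, 64]) torch.Size([2, 100])
```

The discrete `codes` sequence is what makes the rest of the pipeline tractable: a powerful autoregressive model can be trained over these tokens instead of over millions of raw samples.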
The proposed model compresses an input audio signal at three levels: 8x, 32x, and 128x. A cascade of transformers then models the discrete codes at each level, and the generated codes are decoded back into raw audio, allowing the system both to reconstruct inputs and to generate novel audio. To train the model, the researchers collected a new dataset of more than 1.2 million songs, and they extended the model to support conditioning on genre, artist, and lyrics.
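To make the three levels concrete, here is a minimal sketch of how stacks of strided 1-D convolutions produce the 8x, 32x, and 128x compressed sequences (layer sizes and names here are hypothetical; Jukebox's real encoders are deeper residual networks):

```python
import torch
import torch.nn as nn

def downsampler(factor_log2, channels=64):
    """Stack of stride-2 convolutions: factor_log2 layers give 2**factor_log2
    total downsampling (e.g. three layers -> 8x compression)."""
    layers, in_ch = [], 1
    for _ in range(factor_log2):
        layers += [nn.Conv1d(in_ch, channels, kernel_size=4, stride=2, padding=1),
                   nn.ReLU()]
        in_ch = channels
    return nn.Sequential(*layers)

audio = torch.randn(1, 1, 44_100)     # one second of fake mono audio
for name, f in [("bottom", 3), ("middle", 5), ("top", 7)]:   # 8x, 32x, 128x
    z = downsampler(f)(audio)
    print(f"{name:>6}: {audio.shape[-1]} samples -> {z.shape[-1]} latents "
          f"({2**f}x compression)")
```

The top (most compressed) level captures long-range structure such as melody with very few tokens per second, while the bottom level retains enough detail for high-quality audio, which is why the transformers are arranged as a cascade from coarse to fine.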
The implementation of the method, along with pre-trained model weights, has been open-sourced and can be found on GitHub. More details about the method and the experiments can be found in OpenAI's official blog post.