MusicGen: Open Source Neural Network for Generating Music in Any Genre

musicgen

MusicGen is a neural network that generates music from textual descriptions and melody examples, providing more precise control over the generated output. The researchers conducted extensive empirical evaluation to demonstrate the superiority of the proposed approach over existing methods on standard text-to-music benchmarks. You can try generating music yourself in the demo available on Hugging Face, and the full model code is published in the GitHub repository.

The method is based on a language model that operates over several parallel streams of compressed discrete music representations (tokens). A distinctive feature of MusicGen is its use of efficient token interleaving patterns, which removes the need to cascade several models, for example hierarchically or with upsampling. MusicGen is not the first neural network to generate music: Google AI, for instance, published its MusicLM method in January 2023, but did not release the code.

Examples of Generated Music

Example 1

Prompt: 80s electronic track with melodic synthesizers, catchy beat, and groovy bass

Example 2

Prompt: smooth jazz, with a saxophone solo, piano chords, and snare full drums

Example 3 – Style Transfer Based on the Source Sample

Source Melody (reference):

Prompt: 90s rock song with electric guitar and heavy drums

Result:

Example 4 – Long Fragment

Prompt: lofi slow bpm electro chill with organic samples

Method

The MusicGen method is built around a transformer-based autoregressive decoder. It models music over quantized units produced by the EnCodec audio tokenizer. Audio is compressed into parallel token streams using residual vector quantization with several trained codebooks.
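To illustrate the idea, below is a minimal NumPy sketch of residual vector quantization with several codebooks. This is not EnCodec's implementation; the codebook count, size, and dimensionality are made-up values chosen for readability.

```python
# Minimal sketch of residual vector quantization (RVQ) with several codebooks.
# Codebook count, size, and dimensionality are illustrative, not EnCodec's real values.
import numpy as np

rng = np.random.default_rng(0)
num_codebooks, codebook_size, dim = 4, 16, 8

# Randomly initialized codebooks stand in for the trained ones.
codebooks = rng.normal(size=(num_codebooks, codebook_size, dim))

def rvq_encode(frame, codebooks):
    """Quantize one latent frame into one token index per codebook.

    Each codebook quantizes the residual left over by the previous one."""
    residual = frame.copy()
    indices = []
    for cb in codebooks:
        dists = np.linalg.norm(cb - residual, axis=1)  # distance to every codebook entry
        k = int(np.argmin(dists))                      # nearest entry
        indices.append(k)
        residual = residual - cb[k]                    # pass the residual to the next codebook
    return indices

frame = rng.normal(size=dim)           # one latent frame from a (hypothetical) encoder
tokens = rvq_encode(frame, codebooks)  # e.g. [3, 11, 7, 0]: 4 parallel token streams
print(tokens)
```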

musicgen method
Each time step (t1, t2, …, tn) consists of 4 quantized values (k1, k2, k3, k4). For autoregressive modeling, these values can be flattened or interleaved in various ways, producing a new sequence with 4 parallel streams and steps (s1, s2, …, sm). The total number of steps in the sequence, M, depends on the pattern and on the original number of steps, N. Token 0 marks empty positions in the pattern.
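To make the patterns concrete, here is a small self-contained sketch (not the repository's actual pattern code) of two possible interleaving schemes over a K × T grid of codebook tokens: "flattening", which turns every codebook value into its own decoding step, and a "delay"-style pattern, which offsets each stream by one step. Grid shapes and token ids are purely illustrative.

```python
# Illustrative sketch of two codebook interleaving patterns over a K x T token grid.
# 0 plays the role of the "empty position" token from the figure; real ids differ.
import numpy as np

K, T = 4, 3                                   # 4 codebooks, 3 time steps
grid = np.arange(1, K * T + 1).reshape(K, T)  # fake token ids 1..12, grid[k, t]

def flatten_pattern(grid):
    # "Flattening": every codebook value gets its own step -> K*T steps in total,
    # with only one stream active per step and the rest left empty (0).
    K, T = grid.shape
    steps = np.zeros((K, K * T), dtype=int)
    for t in range(T):
        for k in range(K):
            steps[k, t * K + k] = grid[k, t]
    return steps

def delay_pattern(grid):
    # "Delay": codebook k is shifted k steps to the right, so all streams are
    # decoded in parallel with a fixed offset -> T + K - 1 steps in total.
    K, T = grid.shape
    steps = np.zeros((K, T + K - 1), dtype=int)
    for k in range(K):
        steps[k, k:k + T] = grid[k]
    return steps

print(flatten_pattern(grid))  # 12 decoding steps
print(delay_pattern(grid))    # 6 decoding steps
```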

The researchers have released four pre-trained models (a short usage sketch follows the list):

  1. small: a model with 300 million parameters that works only with text;
  2. medium: a model with 1.5 billion parameters that works only with text;
  3. melody: a model with 1.5 billion parameters that works with both text and melody references;
  4. large: a model with 3.3 billion parameters that works only with text.
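The sketch below follows the interface documented in the project's GitHub repository (the audiocraft package); the checkpoint name, generation parameters, and helper functions shown here are assumptions that may differ between releases.

```python
# Minimal generation sketch using the audiocraft package from the MusicGen repository.
# Checkpoint names and API details follow the repository's README and may change between releases.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-small')  # text-only model, 300M parameters
model.set_generation_params(duration=10)                    # seconds of audio to generate

descriptions = ['80s electronic track with melodic synthesizers, catchy beat, and groovy bass']
wav = model.generate(descriptions)                           # batch of generated waveforms

for idx, one_wav in enumerate(wav):
    # Writes sample_0.wav with loudness normalization.
    audio_write(f'sample_{idx}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```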

Dataset

The researchers used 20,000 hours of licensed music for training MusicGen. They compiled an internal dataset of 10,000 high-quality music tracks and also utilized music collections from ShutterStock and Pond5, consisting of 25,000 and 365,000 instrumental samples, respectively. To evaluate the method, they used the MusicCaps benchmark, which includes 5,500 expert-prepared music samples and a balanced subset of 1,000 samples from various genres.

Comparison with Other Neural Networks for Music Generation

The authors compared MusicGen with other state-of-the-art models: MusicLM, Riffusion, and Mousai. Subjectively, MusicLM produces results comparable to MusicGen, while the other two models lag behind.

musicgen comparison
FAD scores on the MusicCaps dataset for the Noise2Music and MusicLM models

In comparison with other neural networks for music generation, MusicGen demonstrated superiority on objective metrics. The researchers also studied how different codebook interleaving patterns affect the quality of the generated samples and found that the best results are achieved with the "flattening" pattern.
