Sing a Tune, Generate an Accompaniment

SingSong, a tool that generates instrumental music for unaccompanied input vocals

Published: Jan 17, 2024

A neural network makes music for unaccompanied vocal tracks.

What's new: Chris Donahue, Antoine Caillon, Adam Roberts, and colleagues at Google proposed SingSong, a system that generates musical accompaniments for sung melodies. You can listen to its output here.

Key insight: To train a machine learning model on the relationship between singers’ voices and the accompanying instruments, you need a dataset of music recordings with corresponding isolated voices and instrumental accompaniments. Neural demixing tools can separate vocals from music, but they tend to leave remnants of instruments in the resulting vocal track. A model trained on such tracks may learn to generate an accompaniment based on the remnants, not the voice. Then, given a pure vocal track, it can’t produce a coherent accompaniment. One way to address this issue is to add noise to the isolated voices. The noise drowns out the instrumental remnants and forces the model to learn from the voices.
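
The noise trick can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes Gaussian white noise and a hypothetical signal-to-noise ratio of 20 dB, neither of which is specified above, and it uses plain lists of samples to stay dependency-free.

```python
import math
import random


def add_white_noise(vocals, snr_db, seed=0):
    """Return the vocal samples plus Gaussian white noise at the given SNR.

    vocals: list of float samples (for example, a demixed vocal stem).
    snr_db: assumed noise level; the value used by the authors is not stated here.
    """
    rng = random.Random(seed)
    signal_power = sum(s * s for s in vocals) / len(vocals)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise_std = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, noise_std) for s in vocals]


# Toy "vocal" stem: one second of a 220 Hz sine at a 16 kHz sample rate.
sample_rate = 16_000
vocals = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
noisy = add_white_noise(vocals, snr_db=20)
```

Quiet instrumental remnants that sit below the noise floor become hard to distinguish from the noise, while the much louder voice remains audible, so the voice is the only reliable signal left for the model to learn from.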

How it works: The authors based their approach on AudioLM, a system that generates audio by attending to both small- and large-scale features.

The authors built a dataset of 1 million recordings that totaled 46,000 hours of music. They separated the recordings into voices and instrumental accompaniments using a pretrained MDXNet and divided the recordings into 10-second clips of matching isolated vocal and instrumental tracks. They added noise to the vocal tracks.
Following AudioLM and its successor MusicLM, the authors tokenized the instrumental tracks at two time scales to represent large-scale compositional features and moment-to-moment details. For the large-scale features, a w2v-BERT model pretrained on speech plus the authors’ initial dataset produced 25 tokens per second. For the fine details, a SoundStream audio encoder-decoder pretrained on speech, music, and the authors’ initial dataset produced 200 tokens per second.
To represent the noisy vocal tracks, they produced 25 tokens per second using the same w2v-BERT model.
They trained a T5 transformer to generate the corresponding instrumental tokens, given the vocal tokens.
Given the instrumental tokens, a separate transformer learned to generate tokens for SoundStream’s decoder to reconstruct the instrumental audio.
To generate an instrumental track at inference, the authors fed the tokens produced by the second transformer to SoundStream’s decoder.
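
To get a feel for the sequence lengths involved, the clip length and token rates above can be turned into a quick calculation. The sketch below assumes non-overlapping clips and drops any leftover audio at the end of a song; the article does not say how the authors handled either detail, and the function names are hypothetical.

```python
CLIP_SECONDS = 10     # clip length stated above
SEMANTIC_RATE = 25    # w2v-BERT tokens per second
ACOUSTIC_RATE = 200   # SoundStream tokens per second


def clip_bounds(duration_s, clip_s=CLIP_SECONDS):
    """Yield (start, end) times in seconds of non-overlapping, full-length clips."""
    for i in range(int(duration_s // clip_s)):
        yield (i * clip_s, (i + 1) * clip_s)


def tokens_per_clip(clip_s=CLIP_SECONDS):
    """Token sequence lengths for one clip of vocals plus accompaniment."""
    return {
        "vocal_semantic": clip_s * SEMANTIC_RATE,
        "instrumental_semantic": clip_s * SEMANTIC_RATE,
        "instrumental_acoustic": clip_s * ACOUSTIC_RATE,
    }


clips = list(clip_bounds(185))  # a hypothetical song lasting 3 minutes 5 seconds
print(len(clips), tokens_per_clip())
```

Under these assumptions, each 10-second clip corresponds to 250 vocal tokens and 250 instrumental tokens at the coarse time scale, plus 2,000 instrumental tokens at the fine time scale.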

Results: Listeners compared 10-second clips from the test set of MUSDB18, a dataset that contains 10 hours of isolated vocal and instrumental tracks. Each clip came in multiple versions that paired the original vocal with accompaniment supplied by (i) SingSong, (ii) a random instrumental track from MUSDB18’s training set, (iii) the instrumental track from MUSDB18’s training set most similar to the vocal in key and tempo according to tools in the Madmom library, and (iv) the original instrumental track. The listeners preferred SingSong to the random accompaniment 74 percent of the time, to the most similar accompaniment 66 percent of the time, and to the original instrumental track 34 percent of the time.
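
The percentages above are pairwise preference rates. A minimal sketch of how such a rate could be tallied follows; the votes are invented for illustration and are not data from the study, and the function name is hypothetical.

```python
from collections import Counter


def preference_rate(votes, system="SingSong"):
    """Fraction of pairwise comparisons in which listeners chose `system`."""
    counts = Counter(votes)
    return counts[system] / len(votes)


# Toy votes, invented for illustration only.
toy_votes = ["SingSong", "baseline", "SingSong", "SingSong"]
rate = preference_rate(toy_votes)
print(f"{rate:.0%}")  # 75%
```

Assuming listeners had to pick one of the two options, a rate below 50 percent, such as the 34 percent against the original instrumental tracks, means they chose the alternative more often than SingSong.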

Why it matters: The authors used data augmentation in an unusual way that enabled them to build a training dataset for a novel, valuable task. Typically, machine learning practitioners add noise to training data to stop a model from memorizing individual examples. In this case, the noise stopped the model from learning from artifacts in the data.

We’re thinking: Did you always want to sing but had no one to play along with you? Now you can duet yourself.
