As human beings, we can digest a video with sound almost effortlessly. Detecting and tracking objects within the frames of a video gives us contextual information and a rough understanding of what is going on. At the same time, we process audio information that is tightly coupled with those frames and gain an even better understanding of the action happening in the video.
Previous work
A considerable amount of research has explored the relationship between vision and sound. A number of problems arise from this relationship, and a number of methods have been proposed to address them. In recent years in particular, researchers have tackled problems such as sound localization in videos, sound generation for silent videos, and self-supervision in videos using the audio signal.
State-of-the-art idea
Recent work by researchers from the Massachusetts Institute of Technology, the MIT-IBM Watson AI Lab, and Columbia University explores the relationship between vision and sound in a different way. The researchers propose a novel self-supervised method that learns to locate the image regions that produce sound and to separate the input audio into a set of components representing the sound coming from each pixel.
Method
Their approach leverages the natural synchronization of visual and audio information to learn, in a self-supervised manner, to separate and locate sound components within a video. They introduce a system called PixelPlayer that learns to recognize and localize objects in images and to separate the audio component coming from each object. Additionally, the researchers introduce a new musical-instrument video dataset, called MUSIC, built for the purpose of this work.
As mentioned before, the proposed method localizes sound sources in a video and separates the audio into its components without supervision. It is composed of three major modules: a video analysis network, an audio analysis network, and an audio synthesizer network. This architecture allows the system to extract visual and audio features and use them jointly for audio-visual source separation and localization.
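To make the data flow concrete, here is a minimal sketch of how the three modules could be wired together, assuming PyTorch-style modules; the names, the shapes in the comments, and the per-pixel loop are illustrative rather than the authors' exact implementation (concrete module sketches follow in the sections below).

```python
def pixelplayer_forward(frames, audio_spec, video_net, audio_net, synthesizer):
    """High-level data flow of the three modules (sketch, illustrative shapes)."""
    pixel_feats = video_net(frames)      # (K, H', W'): one K-dim feature per pixel
    audio_maps = audio_net(audio_spec)   # (K, F, N):  K spectrogram-sized feature maps
    # The synthesizer turns (visual feature at a pixel, audio feature maps)
    # into a mask that selects that pixel's sound from the input spectrogram.
    masks = {
        (x, y): synthesizer(pixel_feats[:, x, y], audio_maps)
        for x in range(pixel_feats.shape[1])
        for y in range(pixel_feats.shape[2])
    }
    return masks                         # one (F, N) mask per pixel
```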
Video Analysis Network
The video analysis network extracts visual features from each frame of the video. To do so, the researchers employ a variant of the popular ResNet-18 network with dilated convolutions. Temporal pooling is then applied to the per-frame features to obtain a visual feature vector for each pixel.
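A rough sketch of such a network in PyTorch is shown below. It keeps the stock torchvision ResNet-18 trunk (the dilation modification is omitted for brevity, so the feature map is coarser than in the paper), projects the per-frame features to K channels, and applies temporal max pooling; all hyperparameters here are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VideoAnalysisNet(nn.Module):
    """Per-frame CNN features followed by temporal pooling (sketch)."""
    def __init__(self, k_components=16):
        super().__init__()
        trunk = resnet18()
        # Drop the global average pooling and classifier, keep the conv trunk.
        self.features = nn.Sequential(*list(trunk.children())[:-2])
        # Project the 512-d ResNet features down to K channels per pixel.
        self.project = nn.Conv2d(512, k_components, kernel_size=1)

    def forward(self, frames):
        # frames: (T, 3, H, W) -- T sampled frames from one video clip
        per_frame = self.project(self.features(frames))   # (T, K, H', W')
        # Temporal max pooling -> one K-dim visual feature per pixel.
        pixel_feat, _ = per_frame.max(dim=0)               # (K, H', W')
        return pixel_feat

# Example: 3 frames at 224x224 -> a 16-channel feature map of size 7x7.
feats = VideoAnalysisNet()(torch.randn(3, 3, 224, 224))
print(feats.shape)   # torch.Size([16, 7, 7])
```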
Audio Analysis Network
In parallel with the visual feature extraction, the audio analysis network splits the input sound of the video into K components. Rather than working on raw sound waveforms, the researchers first apply the Short-Time Fourier Transform (STFT) to obtain a sound spectrogram, which is used as input to the network. They then employ a convolutional encoder-decoder architecture that has proven successful with audio data in the past, namely Audio U-Net. Using this architecture, they extract K feature maps from the spectrogram, each containing features of a different component of the input sound.
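The following sketch illustrates the idea with a deliberately small encoder-decoder standing in for the Audio U-Net; the STFT parameters, depth, and channel counts are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AudioAnalysisNet(nn.Module):
    """Small encoder-decoder over the spectrogram, standing in for the
    deeper Audio U-Net used in the paper (sketch)."""
    def __init__(self, k_components=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec1 = nn.Sequential(nn.Conv2d(64 + 32, 32, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(32, k_components, 1)

    def forward(self, spec):
        # spec: (1, 1, F, N) log-magnitude spectrogram
        e1 = self.enc1(spec)                        # (1, 32, F, N)
        e2 = self.enc2(self.pool(e1))               # (1, 64, F/2, N/2)
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return self.out(d1)                         # (1, K, F, N) feature maps

# STFT of a short mono clip (length chosen so the spectrogram dims divide evenly).
wave = torch.randn(1, 65280)
spec = torch.stft(wave, n_fft=1022, hop_length=256,
                  window=torch.hann_window(1022), return_complex=True)
log_mag = torch.log1p(spec.abs())                   # (1, 512, 256)
audio_maps = AudioAnalysisNet()(log_mag.unsqueeze(1))
print(audio_maps.shape)                             # torch.Size([1, 16, 512, 256])
```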
Audio Synthesizer Network
The core module of the proposed method is the synthesizer network, which takes the outputs of both the video analysis network and the audio analysis network. More precisely, it combines a pixel-level visual feature with the audio features and outputs a mask that separates the sound of that pixel from the input spectrogram. The sound of each pixel is then obtained by multiplying the corresponding mask with the input spectrogram, and the inverse STFT is applied to recover the waveform.
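One plausible reading of the synthesizer is a linear combination of the K audio feature maps, weighted by the pixel's K-dimensional visual feature and passed through a sigmoid. The sketch below follows that reading and then applies the mask to the mixture spectrogram, reusing the mixture's phase for the inverse STFT (a common choice, assumed here rather than taken from the paper).

```python
import torch
import torch.nn as nn

class AudioSynthesizer(nn.Module):
    """Combines one pixel's visual feature with the K audio feature maps into
    a separation mask (linear combination + sigmoid; one plausible reading)."""
    def __init__(self, k_components=16):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(k_components))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, pixel_feat, audio_maps):
        # pixel_feat: (K,) visual feature of one pixel
        # audio_maps: (K, F, N) audio feature maps
        logits = (self.weight * pixel_feat).view(-1, 1, 1) * audio_maps
        return torch.sigmoid(logits.sum(dim=0) + self.bias)   # (F, N) mask

# Recover the waveform of one pixel: mask the mixture magnitude, keep the
# mixture phase, and invert the STFT.
window = torch.hann_window(1022)
mix = torch.randn(65280)
spec = torch.stft(mix, n_fft=1022, hop_length=256, window=window,
                  return_complex=True)               # (512, 256), complex
mask = AudioSynthesizer()(torch.rand(16), torch.randn(16, 512, 256))
masked = torch.polar(spec.abs() * mask, spec.angle())
pixel_wave = torch.istft(masked, n_fft=1022, hop_length=256, window=window,
                         length=mix.shape[-1])
```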
In order to train such an architecture in an unsupervised (or self-supervised) manner, the researchers propose a training framework they call Mix-and-Separate. The idea builds on the assumption that sounds are approximately additive: they mix the sounds of different videos to generate a complex audio input signal, and the learning objective is then to separate a sound source of interest conditioned on the visual input associated with it.
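Below is a sketch of one Mix-and-Separate training step, reusing the module interfaces from the earlier sketches: two solo tracks are mixed, a mask is predicted for each source conditioned on its own video's visual feature, and the predictions are supervised with binary masks derived from the known solo spectrograms. The binary targets and per-bin cross-entropy loss are one reasonable instantiation, not necessarily the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def mix_and_separate_step(wave_a, wave_b, feat_a, feat_b,
                          audio_net, synthesizer, n_fft=1022, hop=256):
    """One Mix-and-Separate training step (sketch).

    wave_a / wave_b: solo audio tracks from two different videos
    feat_a / feat_b: K-dim visual features conditioning each prediction
    """
    window = torch.hann_window(n_fft)

    def mag(w):
        return torch.stft(w, n_fft, hop, window=window, return_complex=True).abs()

    spec_a, spec_b = mag(wave_a), mag(wave_b)
    mix = mag(wave_a + wave_b)                        # sounds are ~additive

    # Ground-truth binary masks: which source dominates each time-frequency bin.
    target_a = (spec_a >= spec_b).float()
    target_b = 1.0 - target_a

    # Predict a mask for each source, conditioned on its own visual feature.
    audio_maps = audio_net(torch.log1p(mix)[None, None])[0]   # (K, F, N)
    pred_a = synthesizer(feat_a, audio_maps)
    pred_b = synthesizer(feat_b, audio_maps)

    return F.binary_cross_entropy(pred_a, target_a) + \
           F.binary_cross_entropy(pred_b, target_b)
```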
Results
For an initial evaluation, the authors again rely on their Mix-and-Separate framework: they create a validation set of synthetic audio mixtures and evaluate separation on this set. Since the goal is to predict spectrograms, the qualitative evaluation compares the ground-truth and estimated spectrograms. For the quantitative evaluation, they report the Normalized Signal-to-Distortion Ratio (NSDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifact Ratio (SAR) on the validation set of synthetic mixtures.
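These metrics can be computed with the mir_eval toolbox, with NSDR commonly taken as the SDR of the separated estimate minus the SDR obtained when the raw mixture itself is used as the estimate. A sketch, assuming mono waveforms of equal length:

```python
import numpy as np
import mir_eval

def separation_metrics(references, estimates, mixture):
    """NSDR / SIR / SAR for one synthetic mixture (sketch).

    references: (n_sources, n_samples) ground-truth solo waveforms
    estimates:  (n_sources, n_samples) separated waveforms from the model
    mixture:    (n_samples,) the mixed input audio
    """
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(references, estimates)
    # Baseline SDR: score the unprocessed mixture against each reference.
    mix_stack = np.tile(mixture, (references.shape[0], 1))
    sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(references, mix_stack)
    nsdr = sdr - sdr_mix          # improvement over doing nothing
    return nsdr, sir, sar
```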
Conclusion
Overall, the proposed method is interesting from several points of view. First, it shows that self-supervised learning can be applied to this kind of problem. Second, the method jointly performs several tasks: locating the image regions that produce sound and separating the input sound into a set of components representing the sound from each pixel. Third, it is one of the first studies to explore the correspondence between individual pixels and sound in videos.