Neural Network Generates Video of Person Speaking Given Only Audio

A group of researchers from the Technical University of Munich has published a new method that, given an audio sequence, generates a realistic video of a person speaking in sync with the audio signal.

The method, named Neural Voice Puppetry, is based on deep neural networks and achieves state-of-the-art results for audio-visual sync in facial reenactment. The researchers use an architecture that learns a latent 3D face model representation from audio features extracted by a DeepSpeech recurrent neural network.

The proposed method takes an audio signal as input and feeds it into a deep recurrent neural network that extracts rich features. These features are then converted into coefficients of an expression basis, which are used to update a generic 3D model of a person’s face. Finally, a neural renderer renders the 3D model and outputs a video of the person speaking, synchronized with the input audio.
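The data flow described above can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: the function names, the windowed mean/energy features (a stand-in for the DeepSpeech features), the linear mapping, and the tiny three-coordinate "mesh" are all assumptions made for illustration, and the neural rendering step is left out entirely.

```python
from typing import List

Vector = List[float]


def window_features(samples: Vector, window: int) -> List[Vector]:
    """Hypothetical stand-in for the recurrent feature extractor:
    returns one [mean, energy] pair per non-overlapping window."""
    feats = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        mean = sum(chunk) / window
        energy = sum(x * x for x in chunk) / window
        feats.append([mean, energy])
    return feats


def to_expression(feature: Vector, weights: List[Vector]) -> Vector:
    """Assumed linear map from one feature vector to expression
    coefficients (one weight row per coefficient)."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weights]


def apply_expression(neutral: Vector, basis: List[Vector], coeffs: Vector) -> Vector:
    """Update the generic face: neutral + sum_k coeffs[k] * basis[k],
    operating on flattened vertex coordinates."""
    out = list(neutral)
    for c, direction in zip(coeffs, basis):
        for i, d in enumerate(direction):
            out[i] += c * d
    return out


def audio_to_meshes(samples: Vector, window: int, weights: List[Vector],
                    neutral: Vector, basis: List[Vector]) -> List[Vector]:
    """One deformed mesh per audio window; a renderer would consume these."""
    return [
        apply_expression(neutral, basis, to_expression(f, weights))
        for f in window_features(samples, window)
    ]
```

For example, with a single basis vector that moves one coordinate downward (think of a jaw opening) and a weight row that reads only the energy feature, a silent window leaves the neutral face unchanged while a loud window displaces it.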

The researchers performed extensive evaluations comparing their method with existing state-of-the-art methods. They also conducted a user study in which the method was assessed qualitatively, based on how participants perceived the generated videos. According to the researchers, the method shows better visual and lip-sync quality than other methods for audio-driven facial reenactment. However, they note in their paper that the method has limitations, especially when the input audio contains multiple voices.

More details about the new method can be found in the paper published on arXiv.