Google’s Live Caption Uses Deep Neural Networks to Transcribe Video and Audio in Real-time

Google has announced Live Caption, a new feature for Google Pixel and Android. The feature automatically captions media playing on the phone in real time. It leverages advances in deep learning to make real-time captioning possible on mobile devices.

Live Caption runs entirely on-device and needs no network connection, which keeps latency low and lets captions appear immediately. It uses three on-device neural network models that work together to produce a single caption track. A recurrent neural network transducer (RNN-T), a sequence transduction model, handles speech recognition. A second, text-based recurrent neural network adds punctuation that is not spoken aloud, and a convolutional neural network (CNN) classifies sound events.
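To make the idea of three models feeding one caption track more concrete, here is a minimal sketch in Python. It is not Google's implementation: the `punctuate` function is a trivial stand-in for the punctuation network, the inputs stand in for the outputs of the speech recognizer and the sound event classifier, and the bracketed, upper-case rendering of sound labels is an assumption made for illustration.

```python
from typing import List, Tuple


def punctuate(words: List[str]) -> str:
    """Hypothetical stand-in for the punctuation model:
    capitalizes the first word and appends a period."""
    sentence = " ".join(words)
    return sentence[:1].upper() + sentence[1:] + "."


def build_caption_track(
    speech: List[Tuple[float, List[str]]],  # (start time, recognized words)
    sounds: List[Tuple[float, str]],        # (start time, sound event label)
) -> List[str]:
    """Merge recognized speech and sound events into one time-ordered track."""
    entries = [(start, punctuate(words)) for start, words in speech]
    # Assumed display format for non-speech events, e.g. "[MUSIC]".
    entries += [(start, f"[{label.upper()}]") for start, label in sounds]
    entries.sort(key=lambda entry: entry[0])
    return [text for _, text in entries]


track = build_caption_track(
    speech=[(1.0, ["hello", "world"])],
    sounds=[(0.0, "music"), (2.5, "applause")],
)
print(track)
```

The point of the sketch is only the merge step: each model contributes its own kind of entry, and the entries are interleaved by time so the viewer sees a single stream.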

All of the models are quantized and serialized for the TensorFlow Lite runtime and optimized to run on a mobile device. For example, the speech recognition model runs only during periods of speech, and it is loaded into and unloaded from memory as needed.
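The load-on-demand behavior described above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the actual Live Caption code: `GatedRecognizer`, `fake_loader`, and the `idle_limit` parameter are hypothetical names, the speech/non-speech flag is assumed to come from some external detector, and the "model" is a plain function rather than a real TensorFlow Lite interpreter.

```python
from typing import Callable, List, Optional, Tuple

Model = Callable[[str], str]


class GatedRecognizer:
    """Keeps an expensive model in memory only while speech is present."""

    def __init__(self, load_model: Callable[[], Model], idle_limit: int = 2):
        self._load_model = load_model
        self._idle_limit = idle_limit  # non-speech frames tolerated before unloading
        self._model: Optional[Model] = None
        self._idle_frames = 0
        self.load_count = 0  # how many times the model was loaded

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def process(self, frame: str, is_speech: bool) -> Optional[str]:
        if is_speech:
            if self._model is None:
                # Load lazily, only when speech actually starts.
                self._model = self._load_model()
                self.load_count += 1
            self._idle_frames = 0
            return self._model(frame)
        if self._model is not None:
            self._idle_frames += 1
            if self._idle_frames >= self._idle_limit:
                # Drop the reference so the memory can be reclaimed.
                self._model = None
        return None


def fake_loader() -> Model:
    """Hypothetical loader; a real one would read a model file from disk."""
    return lambda frame: frame.upper()


def run(frames: List[Tuple[str, bool]]) -> Tuple[List[str], GatedRecognizer]:
    recognizer = GatedRecognizer(fake_loader, idle_limit=2)
    captions = []
    for frame, is_speech in frames:
        text = recognizer.process(frame, is_speech)
        if text is not None:
            captions.append(text)
    return captions, recognizer


captions, recognizer = run([("hi", True), ("", False), ("", False), ("there", True)])
print(captions, recognizer.load_count)
```

In this sketch the model is unloaded after two consecutive non-speech frames and reloaded when speech resumes, trading a reload cost for lower memory use during silence or music.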

How the new feature Live Caption works.

Live Caption is currently available only on Pixel 4 and Pixel 4 XL phones and will become available for Pixel 3 models later this year. According to Google, Live Caption will also come to other Android devices soon.

