• AV-HuBERT: speech recognition by lips

    Meta introduced AV-HuBERT, a speech recognition framework that uses both the sound of speech and the movement of the speaker's lips. AV-HuBERT's recognition accuracy is reported to be 75% higher than that of state-of-the-art models trained on the same number of transcriptions.

    People perceive speech both by listening to it and by watching the movement of the speaker's lips. Research shows that, when learning a language, lip movement is even more important than the sound of speech. Speech recognition systems, however, typically work with audio only. Training them requires large datasets, usually tens of thousands of hours of audio recordings.

    AV-HuBERT surpasses the previous best audiovisual speech recognition system while using only one tenth of the labeled data, which makes it potentially useful for languages with little available audio data. Meta believes that in the future AI frameworks such as AV-HuBERT can be used to improve speech recognition in noisy environments, for example at a party or on a busy street. In particular, smartphones, augmented reality glasses and other devices with a camera could help people communicate in such situations.

    Meta is not the first company to apply artificial intelligence to lip reading. In 2016, researchers from Oxford University created a system that in some tests was almost twice as accurate as humans and could process video in near real time. In 2017, Alphabet-owned DeepMind trained a system on thousands of hours of TV shows to read lips 5 times more accurately than lip-reading experts. These models, however, are limited in the range of vocabulary they can recognize.

    AV-HuBERT is a multimodal model: it combines lip movement with sound information and captures the relationships between the two. The model was trained on 2442 hours of English-language celebrity videos uploaded to YouTube.
