• FAIR introduced a self-supervised neural network for speech recognition

    wav2vec-U is a framework for creating speech recognition systems that do not require training on annotated datasets. The algorithm presented by FAIR enables speech recognition in rare languages and dialects.

    To date, speech recognition technology is available for only a small number of languages, because training speech recognition networks requires thousands of hours of transcribed audio, and such data does not exist for most languages and dialects. Wav2vec Unsupervised (wav2vec-U) is a method for building speech recognition systems that need no transcribed data at all. Its accuracy is comparable to that of neural networks trained on 960 hours of transcribed speech (Fig. 1).

    Figure 1. Comparison of wav2vec-U performance on the Librispeech benchmark with neural networks trained on annotated datasets.

    The algorithm works as follows. The system first learns the structure of speech from unlabeled audio using FAIR's self-supervised wav2vec 2.0 model. A k-means clustering step then segments each recording into speech units that correspond, to a first approximation, to individual sounds (for example, the word "cat" is decomposed into three sounds: "/K/", "/AE/" and "/T/"). To learn to recognize words, a generative adversarial network (GAN) consisting of a generator and a discriminator is used. For each sound segment, the generator predicts the corresponding phoneme of the language. It learns by trying to fool the discriminator, which judges whether the predicted phoneme sequences look realistic. The discriminator is itself a neural network. It is trained on the output of the generator and on real text that has been converted into phonemes in advance.
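
    The segmentation step can be illustrated with a minimal sketch. It assumes that each audio frame is already represented by a feature vector (here, toy 2-D points stand in for wav2vec 2.0 features) and that a segment boundary is placed wherever the k-means cluster id changes between neighboring frames. The function names `kmeans` and `merge_into_segments`, the farthest-point initialization and the toy data are illustrative assumptions, not FAIR's implementation.

```python
import math
import random


def kmeans(frames, k, n_iter=10):
    """Plain k-means over frame vectors; returns one cluster id per frame."""
    # Farthest-point initialisation (an assumption made to keep the toy deterministic).
    centroids = [list(frames[0])]
    while len(centroids) < k:
        farthest = max(frames, key=lambda f: min(math.dist(f, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [0] * len(frames)
    for _ in range(n_iter):
        # Assign every frame to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(f, centroids[c])) for f in frames]
        # Move every centroid to the mean of the frames assigned to it.
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def merge_into_segments(labels):
    """Collapse runs of equal cluster ids into (start, end, cluster_id), end exclusive."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments


# Toy "recording": three sounds lasting 4, 6 and 3 frames, each around its own centre.
rng = random.Random(0)
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
frames_per_sound = [4, 6, 3]
frames = [
    (cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1))
    for (cx, cy), n in zip(centers, frames_per_sound)
    for _ in range(n)
]

segments = merge_into_segments(kmeans(frames, k=3))
for start, end, cluster_id in segments:
    print(f"frames {start}..{end - 1}: unit {cluster_id}")
```

    In this toy setting the three recovered segments line up with the three sounds. In a real system each such segment would then be passed to the generator, which predicts a phoneme for it.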

    Figure 2. wav2vec-U error rate for speech recognition in different languages.

    wav2vec-U was evaluated on the TIMIT benchmark, where it reduced the error rate by 57% compared with the previous best unsupervised method. Unsupervised speech recognition models matter most for languages that have practically no annotated datasets. wav2vec-U has also been tested on languages such as Swahili and Tatar, which currently lack high-quality speech recognition models (Fig. 2).
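
    To make the error-rate figures concrete, here is a small sketch of how a phoneme error rate and a relative error reduction are commonly computed. It assumes the error rate is the edit distance between the reference and predicted phoneme sequences divided by the reference length. The function names and the numbers in the usage lines are hypothetical and are not results reported by FAIR.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phoneme
                curr[j - 1] + 1,         # insertion of an extra phoneme
                prev[j - 1] + (r != h),  # substitution, free if phonemes match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by the length of the reference sequence."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_rate, new_rate):
    """Fraction by which the new error rate is lower than the baseline."""
    return (baseline_rate - new_rate) / baseline_rate


# Hypothetical example: one of the three phonemes in "cat" is misrecognised.
per = phoneme_error_rate(["K", "AE", "T"], ["K", "AH", "T"])

# Hypothetical error rates chosen only to illustrate what a 57% relative reduction means.
reduction = relative_reduction(0.30, 0.129)
```

    Read this way, a 57% reduction is relative to the earlier method's error rate, not a drop of 57 percentage points.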
