GSLM is a natural language processing model from FAIR that does not require training on text datasets. Its key advantages are that it can be trained on any language and that the speech it generates is emotionally expressive.
Text language models such as BERT, RoBERTa, and GPT-3 are used in a wide range of natural language processing applications, including sentiment analysis, translation, information retrieval, inference, and summarization. However, their use is limited to languages for which large text datasets exist.
The Generative Spoken Language Model (GSLM) does not require text-based training datasets. Built on representation learning, GSLM works only with raw audio signals, which makes it potentially applicable to any language. The developers of GSLM argue that this should be possible because children learn language solely from raw sounds.
GSLM consists of three components:
- an encoder that converts speech into discrete units representing frequently recurring sounds in spoken language;
- an autoregressive language model trained to predict the next discrete unit based on already available data;
- a decoder that converts discrete units into speech.
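The three-component pipeline above can be sketched in code. The sketch below is purely illustrative: all function names and sizes are hypothetical, the encoder is a nearest-neighbor quantizer, the language model is a toy bigram table standing in for the autoregressive neural LM, and the decoder simply looks up codebook vectors rather than running a neural vocoder as the real system does.

```python
import numpy as np

def encode(frames, codebook):
    """Encoder: map each audio feature frame to its nearest codebook
    entry, producing a sequence of discrete unit ids."""
    # frames: (T, D), codebook: (K, D) -> (T,) unit ids
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def train_unit_lm(unit_seqs, num_units):
    """Autoregressive LM over discrete units: bigram counts with
    add-one smoothing (a toy stand-in for a neural LM)."""
    counts = np.ones((num_units, num_units))
    for seq in unit_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def generate(lm, start_unit, length, rng):
    """Sample a continuation: each next unit is predicted from the
    previous one, mirroring next-unit prediction in GSLM."""
    seq = [int(start_unit)]
    for _ in range(length - 1):
        seq.append(int(rng.choice(len(lm), p=lm[seq[-1]])))
    return seq

def decode(units, codebook):
    """Decoder: map discrete units back to feature frames (the real
    decoder synthesizes a waveform from the units)."""
    return codebook[np.asarray(units)]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # K=8 units, D=4 features (illustrative)
frames = codebook[rng.integers(0, 8, 50)] + 0.01 * rng.normal(size=(50, 4))
units = encode(frames, codebook)
lm = train_unit_lm([units], num_units=8)
continuation = generate(lm, units[-1], 10, rng)
audio_features = decode(continuation, codebook)
print(audio_features.shape)
```

Note that text never appears anywhere in this loop: the model learns and generates entirely in the space of discrete acoustic units, which is what frees GSLM from the need for text datasets.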
In addition to being usable for any language, GSLM has three advantages over text-based models. First, it captures the greater expressiveness of speech, including intonation, irony, anger, uncertainty, and laughter. Second, GSLM can be trained on new speech sources such as podcasts and radio. Third, the model will allow developmental psychologists and speech-language clinicians to investigate how language differences affect infants' ability to learn to speak and understand speech.