MASSIVE: Amazon dataset for multilingual model training

Amazon has released MASSIVE, an open-source dataset of parallel texts in 51 languages. The dataset is intended for building natural language processing models that generalize easily to new languages.

MASSIVE is built around the idea of multilingual natural language understanding, in which a single machine learning model can parse and understand input from many typologically diverse languages. By learning a shared data representation that spans languages, the model can transfer knowledge from languages with plenty of training data to those where training data is scarce. The dataset consists of 1 million labeled utterances, and it is accompanied by source code with examples of how to perform massively multilingual modeling.

MASSIVE is a parallel dataset, which means that each utterance is given in all 51 languages. This lets models learn shared representations of utterances that carry the same intent, which facilitates cross-lingual training on natural language understanding tasks. It also makes the dataset adaptable to other NLP tasks, such as machine translation, multilingual paraphrasing, and linguistic analysis of imperative morphology.
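To make the idea of a parallel dataset concrete, here is a minimal sketch of how utterances that share an identifier across languages could be grouped and turned into translation pairs. The sample records, the field names (`id`, `locale`, `utt`, `intent`), and the helper functions `group_parallel` and `translation_pairs` are assumptions made for illustration; they are not taken from the article and may differ from the released files.

```python
from collections import defaultdict

# Hypothetical sample records. The field names and values are
# assumptions for illustration, not the dataset's confirmed schema.
records = [
    {"id": "1", "locale": "en-US", "utt": "wake me up at nine am on friday", "intent": "alarm_set"},
    {"id": "1", "locale": "de-DE", "utt": "wecke mich am freitag um neun uhr", "intent": "alarm_set"},
    {"id": "1", "locale": "fr-FR", "utt": "réveille-moi vendredi à neuf heures", "intent": "alarm_set"},
    {"id": "2", "locale": "en-US", "utt": "play some jazz", "intent": "play_music"},
    {"id": "2", "locale": "de-DE", "utt": "spiele etwas jazz", "intent": "play_music"},
]


def group_parallel(rows):
    """Map each utterance id to a {locale: utterance} dictionary."""
    parallel = defaultdict(dict)
    for row in rows:
        parallel[row["id"]][row["locale"]] = row["utt"]
    return dict(parallel)


def translation_pairs(parallel, src, tgt):
    """Return (source, target) pairs for ids present in both locales."""
    return [
        (by_locale[src], by_locale[tgt])
        for by_locale in parallel.values()
        if src in by_locale and tgt in by_locale
    ]


parallel = group_parallel(records)
pairs = translation_pairs(parallel, "en-US", "de-DE")

for source, target in pairs:
    print(f"{source} -> {target}")
```

Because every utterance exists in every language, the same grouping yields training pairs for any language combination, and the shared intent label can supervise a single classifier across all locales.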

The data covers 18 domains, 60 intents, and 55 slot types. The translations were produced by professional translators.