FLORES-101 is a dataset from Facebook AI Research (FAIR) for evaluating multilingual translation models. It contains 3,000 sentences from Wikipedia, translated into 101 languages by professional translators, which makes it possible to evaluate 10,100 translation directions (101 source languages × 100 target languages).
FLORES-101 enables researchers to quickly test and improve multilingual translation models, such as FAIR's M2M-100. The dataset focuses on languages such as Amharic, Mongolian, and Urdu, for which extensive natural language processing datasets do not yet exist. Because FLORES-101 contains the same set of sentences in every language, researchers can evaluate translation quality in any direction. Each sentence is first translated by a professional translator and manually checked by an editor. It then goes through spelling, grammar, and punctuation checks and is compared against the output of commercial translation engines. Finally, a separate group of translators carries out an additional quality assessment.
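To illustrate why a shared set of sentences makes every direction testable, here is a minimal Python sketch. It assumes the corpus is held as a dictionary that maps a language code to a list of sentences in the same order for every language. The `corpus` layout, the language codes, the placeholder strings, and the `make_pairs` helper are all hypothetical and are not part of the FLORES-101 release.

```python
from typing import Dict, List, Tuple

# Toy stand-in for an aligned corpus: every language holds the same
# sentences in the same order. These are placeholder strings, not
# real FLORES-101 data.
corpus: Dict[str, List[str]] = {
    "amh": ["<amh sentence 0>", "<amh sentence 1>"],
    "mon": ["<mon sentence 0>", "<mon sentence 1>"],
    "urd": ["<urd sentence 0>", "<urd sentence 1>"],
}


def make_pairs(
    corpus: Dict[str, List[str]], src: str, tgt: str
) -> List[Tuple[str, str]]:
    """Return (source, reference) sentence pairs for one translation direction."""
    if src == tgt:
        raise ValueError("source and target languages must differ")
    source, target = corpus[src], corpus[tgt]
    if len(source) != len(target):
        raise ValueError("languages are not aligned sentence by sentence")
    # Alignment by position: sentence i in src corresponds to sentence i in tgt.
    return list(zip(source, target))


# Any ordered pair of languages gives a usable evaluation set.
pairs = make_pairs(corpus, "amh", "urd")
```

A model's output for the source side of `pairs` could then be scored against the reference side with whatever metric the evaluation uses.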
When developing the dataset, FAIR pursued the following goals:
- Emphasis on low-resource languages. Unlike most existing datasets, more than 80% of the languages in FLORES-101 are low-resource, meaning that very little data is available for training models.
- Enabling a large number of translation directions. Since the dataset has the same set of sentences translated into all languages, it can be used to evaluate the effectiveness of any of the 10,100 different translation directions.
- Diversity of sentence domains. Many current datasets contain only one type of text, such as news. FLORES-101 contains texts from a variety of domains, including news, travel guides, and books.
- Context-sensitive translation. The dataset supports translating several consecutive sentences together, so it can be used to evaluate whether document-level context improves a model's translation quality.
- Extensive metadata. In FLORES-101, each sentence comes with metadata such as a link to the source, associated images, and the topic of the text.
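The figure of 10,100 translation directions in the list above follows from simple counting: each of the 101 languages can be a source, and each of the remaining 100 can be a target. The short Python sketch below checks this. The function name and the placeholder language labels are illustrative only.

```python
from itertools import permutations


def count_directions(num_languages: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)


# Placeholder labels standing in for the 101 languages.
languages = [f"lang_{i}" for i in range(101)]

# Enumerate every ordered pair explicitly and compare with the formula.
directions = list(permutations(languages, 2))
assert len(directions) == count_directions(101)

print(count_directions(101))  # 10100
```

Direction matters here: translating from Amharic into Urdu and from Urdu into Amharic count as two separate directions.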