Visual Genome: labeled images dataset

25 November 2021

Visual Genome is a dataset of more than 100,000 images with descriptions of all the objects in them. The dataset is intended for use in object search and recognition tasks. Visual…

Commonsense-Dialogues: Amazon dataset of everyday dialogues

12 November 2021

Commonsense-Dialogues is an Amazon dataset containing 11,000 dialogues from everyday life. The dataset is aimed at teaching models to understand the implicit meaning of utterances. To date, AI assistants do…

Go Emotions: Google AI dataset for sentiment analysis

31 October 2021

The Google AI GoEmotions dataset consists of Reddit comments labeled with the emotions they express. GoEmotions is designed to train neural networks to perform deep analysis of the…

ORBIT: Microsoft dataset of images of household items

20 October 2021

ORBIT is a Microsoft dataset for training models to recognize objects from multiple images. ORBIT includes between 1 and 10 videos for each of 468 everyday objects. Usually object recognition…

Fake It Till You Make It: Microsoft synthetic face images dataset

9 October 2021

Microsoft has introduced Fake It Till You Make It, a dataset of synthetic facial images. The dataset is aimed at pre-training facial recognition algorithms before they are used in real-world scenarios.…

OpenRooms: manipulating objects in 3D scenes

15 September 2021

OpenRooms is an open-source dataset and a set of tools for managing objects, materials, lighting and other parameters of 3D indoor scenes. The dataset is intended for use…

CO3D: dataset with three-dimensional object reconstructions

5 September 2021

FAIR presented CO3D, a dataset containing accurate three-dimensional reconstructions of 19,000 real objects. The dataset is intended for use in augmented reality tasks and in game development. Common Objects in…

7 sites with publicly available datasets

2 September 2021

The article provides an overview of sites that host tens of thousands of publicly available datasets. The datasets presented on these resources cover such areas as healthcare, geography, sociology,…

RADIATE: road traffic dataset in bad weather

14 August 2021

RADIATE contains data on the movement of 200,000 cars and pedestrians registered using radars, cameras, lidars and GPS in adverse weather conditions. The dataset is aimed at improving the models…

Hypersim: Apple’s synthetic dataset with interior images

5 August 2021

Apple has introduced Hypersim, a synthetic dataset of photorealistic images of rooms and interiors. Hypersim consists of 77,400 images of 461 scenes and provides semantic segmentation. The main limitation of…

OpenBuildings: Google AI dataset with building annotations

30 July 2021

Google AI has introduced Open Buildings, an open-source dataset containing information about the location and area of 500 million buildings in Africa. Open Buildings will allow solving practical, scientific and…

Habitat 2.0: FAIR platform for robot training

6 July 2021

FAIR has introduced Habitat 2.0, a platform that allows you to train robots to navigate in virtual three-dimensional spaces and interact with objects in the same way as they would…

FLORES-101: FAIR dataset with translations of texts into rare languages

16 June 2021

FLORES-101 is a FAIR dataset for evaluating and testing multilingual translation models. The dataset contains 3,000 sentences from Wikipedia, translated into 101 languages by professional translators, and allows you to…

FAIR1M: High-resolution satellite image dataset

9 June 2021

The FAIR1M dataset, developed at the Chinese Academy of Sciences, contains more than 15,000 satellite images with 1,000,000 detailed annotations, including specific aircraft models, ship types, and vehicles. The images…

ABCD: dataset for increasing the quality of customer service

2 June 2021

Asapp, a company that uses artificial intelligence to improve communication with customers, has introduced ABCD, a dataset designed for the development of dialog systems. ABCD includes more than 10,000 dialogues…

CodeNet: IBM dataset for neural networks that generate and analyze code

27 May 2021

At the Think conference, IBM presented Project CodeNet, the largest open-source dataset for training neural networks in programming. The dataset consists of 14 million code examples written in 55…

MLS: FAIR’s Multilingual Speech Recognition Dataset

4 March 2021

Facebook AI has published a multilingual dataset for training speech recognition models. Multilingual LibriSpeech (MLS) contains 50 thousand hours of audio with people speaking in 8 languages: English, German, Spanish,…

Twitter Opens Tweet Archive for Scientific Researchers

20 February 2021

Twitter has opened an archive of tweets for scientific researchers. In this way the company supports research on online discourse and trends on the platform. More data and access to them…

DAF:re – new public dataset for recognizing anime characters

20 February 2021

DAF:re is a public dataset for recognizing anime characters. The dataset consists of 500 thousand images across 3,000 classes. The data is not evenly distributed across classes. Besides, the researchers…

TracIn: a way to evaluate the impact of specific data on model predictions

10 February 2021

TracIn is a scalable method for assessing the impact of individual features in data on predictions. The idea behind TracIn is to track the learning process of the model to…

Pile: 825-gigabyte open-source dataset for language model training

28 January 2021

Pile is an 825-gigabyte dataset for training language models. The dataset consists of 22 smaller datasets, which are combined into one. In addition to the dataset, the creators published…