LAION-5B: the largest dataset of image-text pairs

28 May 2022

LAION-5B is a dataset of image-text pairs collected from the Internet. It contains more than 5 billion pairs, which makes it the largest among similar datasets. LAION-5B was assembled by parsing…

MASSIVE: Amazon dataset for multilingual model training

29 April 2022

Amazon has introduced the MASSIVE open-source dataset with translations of texts into 51 languages. The dataset is aimed at creating natural language processing models that can be easily generalized to…

SORDI: dataset of synthetic images of production resource

20 April 2022

BMW Group presented SORDI, the largest open-source dataset of annotated photorealistic images of factories and other industrial settings. SORDI contains more than 800,000 images in 80 categories and is aimed at…

Datasets for music generation and analysis

27 February 2022

The article provides an overview of datasets of musical compositions. The datasets are designed to train models for music generation, recognition and analysis. NSynth: the largest dataset, consisting of 305,979 musical…

Intel announced the largest datasets for speech recognition

9 December 2021

Intel introduced the People’s Speech and MSWC datasets, aimed at recognizing and transcribing speech. Both datasets are among the largest in their class and include audio recordings in 59 languages. The…

Visual Genome: labeled images dataset

25 November 2021

Visual Genome is a dataset with more than 100,000 images and descriptions of all the objects in them. The dataset is intended for use in object search and recognition tasks. Visual…

Commonsense-Dialogues: Amazon dataset of everyday dialogues

12 November 2021

Commonsense-Dialogues is an Amazon dataset containing 11,000 dialogues from everyday life. The dataset is aimed at teaching models to understand the implied meanings of utterances. To date, AI assistants do…

Go Emotions: Google AI dataset for sentiment analysis

31 October 2021

The Google AI GoEmotions dataset consists of comments from Reddit users labeled with their emotional tone. GoEmotions is designed to train neural networks to perform deep analysis of the…

ORBIT: Microsoft dataset of images of household items

20 October 2021

ORBIT is a Microsoft dataset for training models to recognize objects from multiple images. ORBIT includes from 1 to 10 videos of 468 objects from everyday life. Usually object recognition…

Fake It Till You Make It: Microsoft synthetic face images dataset

9 October 2021

Microsoft has introduced Fake It Till You Make It, a dataset of synthetic facial images. The dataset is aimed at pre-training facial recognition algorithms before they are used in real-world scenarios.…

OpenRooms: manipulating objects in 3D scenes

15 September 2021

OpenRooms is an open-source dataset and a set of tools for managing objects, materials, lighting and other parameters of indoor 3D scenes. The dataset is intended for use…

7 sites with publicly available datasets

2 September 2021

The article provides an overview of sites containing tens of thousands of datasets in the public domain. The datasets presented on these resources cover such areas as healthcare, geography, sociology,…

RADIATE: road traffic dataset in bad weather

14 August 2021

RADIATE contains data on the movement of 200,000 cars and pedestrians recorded with radars, cameras, lidars and GPS in adverse weather conditions. The dataset is aimed at improving the models…

Hypersim: Apple’s synthetic dataset with interior images

5 August 2021

Apple has introduced Hypersim, a synthetic dataset of photorealistic images of rooms and interiors. Hypersim consists of 77,400 images of 461 scenes and provides semantic segmentation. The main limitation of…

OpenBuildings: Google AI dataset with building annotations

30 July 2021

Google AI has introduced Open Buildings, an open-source dataset containing information about the location and area of 500 million buildings in Africa. Open Buildings will help solve practical, scientific and…

ABCD: dataset for increasing the quality of customer service

2 June 2021

Asapp, a company that uses artificial intelligence to improve communication with customers, has introduced ABCD, a dataset designed for the development of dialog systems. ABCD includes more than 10,000 dialogues…

CodeNet: IBM dataset for neural networks that generate and analyze code

27 May 2021

At the Think conference, IBM presented Project CodeNet – the largest open-source dataset for training neural networks in programming. The dataset consists of 14 million code examples written in 55…

MLS: FAIR’s Multilingual Speech Recognition Dataset

4 March 2021

Facebook AI published a multilingual dataset used to train speech recognition models. Multilingual LibriSpeech (MLS) contains 50 thousand hours of audio with people speaking in 8 languages: English, German, Spanish,…

Twitter Opens Tweet Archive for Scientific Researchers

20 February 2021

Twitter has opened an archive of tweets to scientific researchers. In this way, the company supports research on online discourse and trends on the platform. More data and access to it…

DAF:re – new public dataset for recognizing anime characters

20 February 2021

DAF:re is a public dataset for recognizing anime characters. The dataset consists of 500 thousand images in 3,000 classes. The data is not evenly distributed across classes. Besides, the researchers…

TracIn: a way to evaluate the impact of specific data on model predictions

10 February 2021

TracIn is a scalable method for assessing the impact of individual features in data on predictions. The idea behind TracIn is to track the learning process of the model to…