Google VRDU: Advancing Document Content Understanding with Dataset and Benchmark

27 August 2023

Google has publicly released VRDU, a dataset and benchmark for training models to understand document content. VRDU aims to accelerate the development of models capable of processing complex documents…

Prithvi: NASA Model and Dataset for Ecological Phenomena Analysis

6 August 2023

NASA and IBM have unveiled the open-source model Prithvi, designed to empower scientists in tracking the consequences of climate change, monitoring deforestation, predicting agricultural crop yields, and analyzing greenhouse gas…

Stability AI Publishes PickScore Dataset and Evaluation Function for Generative Models

6 June 2023

Stability AI, in collaboration with Tel Aviv University, has released the PickScore dataset, a collection of over 500,000 images accompanied by user ratings. They have also introduced the PickScore evaluation…

LAION-5B: the largest dataset of image-text pairs

28 May 2022

LAION-5B is a dataset of image-text pairs collected from the Internet. It contains more than 5 billion pairs, which makes it the largest among similar datasets. LAION-5B was assembled by parsing…

MASSIVE: Amazon dataset for multilingual model training

29 April 2022

Amazon has introduced the MASSIVE open-source dataset with translations of texts into 51 languages. The dataset is aimed at creating natural language processing models that can be easily generalized to…

SORDI: dataset of synthetic images of production resources

20 April 2022

BMW Group has presented SORDI, the largest open-source dataset of annotated photorealistic images of factories and other industrial settings. SORDI contains more than 800,000 images in 80 categories and is aimed at…

Datasets for music generation and analysis

27 February 2022

The article provides an overview of music datasets designed for training models that generate, recognize, and analyze music. NSynth: the largest dataset consists of 305,979 musical notes, including…

Intel announced the largest datasets for speech recognition

9 December 2021

Intel introduced the People’s Speech and MSWC datasets, aimed at speech recognition and transcription. Both datasets are among the largest in their class and include audio recordings in 59 languages. The…

Visual Genome: labeled images dataset

25 November 2021

Visual Genome is a dataset of more than 100,000 images with descriptions of all the objects in them. The dataset is intended for use in object search and recognition tasks. Visual…

Commonsense-Dialogues: Amazon dataset of everyday dialogues

12 November 2021

Commonsense-Dialogues is an Amazon dataset containing 11,000 dialogues from everyday life. The dataset is aimed at teaching models to understand the implicit meaning of utterances. To date, AI assistants do…

GoEmotions: Google AI dataset for sentiment analysis

31 October 2021

The GoEmotions dataset consists of Reddit comments labeled with the emotions they express. GoEmotions is designed to train neural networks to perform in-depth analysis of the sentiment of…

ORBIT: Microsoft dataset of images of household items

20 October 2021

ORBIT is a Microsoft dataset for training models to recognize objects from multiple images. ORBIT includes 1 to 10 videos for each of 468 everyday objects. Usually, object recognition…

Fake It Till You Make It: Microsoft synthetic face images dataset

9 October 2021

Microsoft has introduced Fake It Till You Make It, a dataset of synthetic facial images. The dataset is aimed at pre-training facial recognition algorithms before they are used in real-world scenarios.…

OpenRooms: manipulating objects in 3D scenes

15 September 2021

OpenRooms is an open-source dataset and a set of tools for managing objects, materials, lighting and other parameters of 3D indoor scenes. The dataset is intended for use…

7 Websites with Publicly Available Datasets

2 September 2021

The article provides an overview of sites containing tens of thousands of datasets in the public domain. The datasets presented on these resources cover such areas as healthcare, geography, sociology,…

RADIATE: road traffic dataset in bad weather

14 August 2021

RADIATE contains data on the movement of 200,000 cars and pedestrians, recorded using radar, cameras, lidar and GPS in adverse weather conditions. The dataset is aimed at improving the models…

Apple’s Hypersim synthetic dataset with interior images

5 August 2021

Apple has introduced Hypersim, a synthetic dataset of photorealistic images of rooms and interiors. Hypersim consists of 77,400 images of 461 scenes and provides semantic segmentation. The main limitation of…

OpenBuildings: Google AI dataset with building annotations

30 July 2021

Google AI has introduced Open Buildings, an open-source dataset containing information about the location and area of 500 million buildings in Africa. Open Buildings will help solve practical, scientific and…

ABCD: dataset for increasing the quality of customer service

2 June 2021

Asapp, a company that uses artificial intelligence to improve communication with customers, has introduced ABCD, a dataset designed for the development of dialog systems. ABCD includes more than 10,000 dialogues…

CodeNet: IBM dataset for code generation and analysis

27 May 2021

At the Think conference, IBM presented Project CodeNet, the largest open-source dataset for training neural networks in programming. The dataset consists of 14 million code examples written in 55…

MLS: FAIR’s Multilingual Speech Recognition Dataset

4 March 2021

Facebook AI has published a multilingual dataset for training speech recognition models. Multilingual LibriSpeech (MLS) contains 50,000 hours of audio of people speaking in 8 languages: English, German, Spanish,…