Datasets for Machine Learning Tasks

VBVR: 2 Million Videos for Reasoning Training — an Open Dataset That Changes the Rules

26 February 2026

A team of more than 50 researchers from around the world — from Berkeley, Stanford, CMU, Oxford and other universities — has published Very Big Video Reasoning (VBVR), a massive…

From Millions Spent on “Thank You” to Efficient Inference: Boilerplate Detection in a Single Token

31 October 2025

Researchers from JFrog published a study demonstrating a method for early detection of boilerplate responses in large language models after generating just a single token. The method enables computational cost…

DeepMath-103K: Advancing AI Reasoning Through Challenge

21 April 2025

Mathematical reasoning stands as a crucial benchmark for artificial intelligence systems, requiring logical deduction, symbolic manipulation, and multi-step problem-solving. Recent breakthroughs in AI reasoning have been significantly driven by reinforcement…

Zyda: 1.3T Dataset for Open Language Modeling

12 June 2024

Zyda is a 1.3 trillion-token open-source dataset designed for open language modeling. Zyda integrates a range of high-quality open datasets, including RefinedWeb, Starcoder, C4 and Pile, enhancing them through comprehensive filtering…

Google Gecko: Benchmark for Text-to-Image Models

6 May 2024

Google DeepMind has developed Gecko, a benchmark that ensures more accurate and reliable testing and comparison of text-to-image models than existing approaches. A study by Google DeepMind has identified hidden…

Gretel: The Largest Open Text-to-SQL Dataset

7 April 2024

Gretel, a startup specializing in generating high-quality synthetic data, has announced the creation of the largest open text-to-SQL dataset aimed at accelerating the development of no-code analytics tools. The dataset…

SCIN: Dataset of Dermatological Disease Images

25 March 2024

Google, in collaboration with Stanford Medicine, has introduced SCIN – an open dataset comprising 10,000 images of dermatological diseases. Models trained on this dataset will be able to remotely diagnose…

Google VRDU: Advancing Document Content Understanding with Dataset and Benchmark

27 August 2023

Google has publicly released VRDU, a dataset and benchmark designed for training models to understand document content. VRDU aims to accelerate the development of models capable of processing complex documents…

Prithvi: NASA Model and Dataset for Ecological Phenomena Analysis

6 August 2023

NASA and IBM have unveiled the open-source model Prithvi, designed to empower scientists in tracking the consequences of climate change, monitoring deforestation, predicting agricultural crop yields, and analyzing greenhouse gas…

Stability AI Publishes PickScore Dataset and Evaluation Function for Generative Models

6 June 2023

Stability AI, in collaboration with Tel Aviv University, has released the PickScore dataset, a collection of over 500,000 images accompanied by user ratings. They have also introduced the PickScore evaluation…

LAION-5B: the largest dataset of image-text pairs

28 May 2022

LAION-5B is a dataset of image-text pairs collected from the Internet. LAION-5B contains more than 5 billion pairs, which makes it the largest among similar datasets. LAION-5B was assembled by parsing…

MASSIVE: Amazon dataset for multilingual model training

29 April 2022

Amazon has introduced the MASSIVE open-source dataset with translations of texts into 51 languages. The dataset is aimed at creating natural language processing models that can be easily generalized to…

SORDI: dataset of synthetic images of production resource

20 April 2022

BMW Group presented SORDI, the largest open-source dataset of annotated photorealistic images of factories and other industrial settings. SORDI contains more than 800,000 images in 80 categories and is aimed at…

Datasets for music generation and analysis

27 February 2022

The article provides an overview of music datasets designed to train models for music generation, recognition, and analysis. NSynth: The largest dataset consists of 305,979 musical notes, including…

Intel announced the largest datasets for speech recognition

9 December 2021

Intel introduced the People’s Speech and MSWC datasets, aimed at speech recognition and transcription. Both datasets are among the largest in their class and include audio recordings in 59 languages. The…

Visual Genome: labeled images dataset

25 November 2021

Visual Genome is a dataset of more than 100,000 images with descriptions of all the objects in them. The dataset is intended for use in object search and recognition tasks. Visual…

Commonsense-Dialogues: Amazon dataset of everyday dialogues

12 November 2021

Commonsense-Dialogues is an Amazon dataset containing 11,000 dialogues from everyday life. The dataset is aimed at teaching models to understand the implied meanings of utterances. To date, AI assistants do…

GoEmotions: Google AI dataset for sentiment analysis

31 October 2021

The GoEmotions dataset consists of comments from Reddit users labeled by the emotions they express. GoEmotions is designed to train neural networks to perform fine-grained analysis of the sentiment of…

ORBIT: Microsoft dataset of images of household items

20 October 2021

ORBIT is a Microsoft dataset for training models to recognize objects from multiple images. ORBIT includes between 1 and 10 videos of 468 objects from everyday life. Usually object recognition…

Fake It Till You Make It: Microsoft synthetic face images dataset

9 October 2021

Microsoft has introduced Fake It Till You Make It, a dataset of synthetic facial images. The dataset is aimed at pre-training facial recognition algorithms before they are used in real-world scenarios.…

OpenRooms: manipulating objects in 3D scenes

15 September 2021

OpenRooms is an open-source dataset and a set of tools for managing objects, materials, lighting and other parameters of 3D indoor scenes. The dataset is intended for use…