Researchers Expanded the Popular MNIST Dataset With 50 000 New Images

Researchers from New York University and Facebook AI Research have restored and expanded the widely known MNIST dataset. Their idea was to recover the original MNIST test data (part of which was assumed to be lost forever) by reconstructing the missing portion.

MNIST is one of the most popular and most used datasets for building and testing image processing systems. A lot of research work in the past decades has developed methods using MNIST and the dataset itself has become a baseline for many image processing problems.

Arguing that the official MNIST test set, with only 10 000 images, is too small to provide meaningful confidence intervals, they set out to recreate the MNIST preprocessing algorithms.
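
To give a feel for the confidence-interval argument, here is a small sketch that compares the width of a 95% interval around a measured error rate for test sets of 10 000 and 60 000 images. The 0.5% error rate and the use of the normal-approximation (Wald) interval are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def wald_half_width(error_rate, n, z=1.96):
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    error_rate: observed fraction of misclassified test images (assumed value)
    n:          number of test images
    z:          normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    return z * math.sqrt(error_rate * (1.0 - error_rate) / n)

if __name__ == "__main__":
    assumed_error = 0.005  # hypothetical 0.5% test error, for illustration only
    for n in (10_000, 60_000):
        hw = wald_half_width(assumed_error, n)
        print(f"n={n}: {assumed_error:.2%} +/- {hw:.3%}")
```

Under these assumptions the interval shrinks by a factor of about the square root of 6 when the test set grows from 10 000 to 60 000 images, which makes small differences between classifiers easier to tell apart.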

Through an iterative process, the researchers set out to generate an additional 50 000 images of MNIST-like data. They started with the reconstruction process described in the paper that introduced MNIST and used the Hungarian algorithm to find the best matches between the original MNIST samples and their reconstructed samples.
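
The matching step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes squared Euclidean distance between flattened images as the matching cost (the article does not say which distance was used), and the function name `match_images` is hypothetical. The assignment itself is solved with SciPy's `linear_sum_assignment`, which solves the same minimum-cost assignment problem as the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_images(original, reconstructed):
    """Pair each original image with one reconstructed image at minimal total cost.

    Both inputs are arrays of shape (n, height, width). The cost of a pair is
    the squared Euclidean distance between the flattened images (an assumption).
    Returns (cols, pair_costs), where cols[i] is the index of the reconstructed
    image assigned to original image i.
    """
    a = original.reshape(len(original), -1).astype(np.float64)
    b = reconstructed.reshape(len(reconstructed), -1).astype(np.float64)
    # Pairwise squared distances: |a|^2 + |b|^2 - 2 a.b
    cost = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    rows, cols = linear_sum_assignment(cost)  # minimum-cost one-to-one assignment
    return cols, cost[rows, cols]

if __name__ == "__main__":
    # Synthetic stand-in data: shuffled, slightly noisy copies of random "images".
    rng = np.random.default_rng(0)
    original = rng.random((20, 28, 28))
    perm = rng.permutation(20)
    reconstructed = original[perm] + 0.01 * rng.standard_normal((20, 28, 28))
    cols, pair_costs = match_images(original, reconstructed)
    print("recovered shuffle:", np.array_equal(perm[cols], np.arange(20)))
```

In an iterative setting like the one described, the per-pair costs give a natural way to check whether a change to the reconstruction code brought the generated digits closer to the originals.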

After many rounds of refining the reconstruction algorithm to obtain closer matches between the generated and the original samples, the researchers produced a dataset of an additional 50 000 digit images.

The expanded MNIST dataset will make it possible to re-examine existing methods and investigate how well they generalize, since many of them may have overfit to the small official MNIST test set.

The dataset, as well as a detailed explanation of the reconstruction process, can be found on GitHub. The preprint is available on arXiv.