WebDataset: A New Python Library For Dealing with Large Datasets

WebDataset is a Python library that provides efficient access to very large datasets and addresses many of the challenges of working with them.

The library was developed to address problems that arise when dealing with large datasets, such as dataset size, data ranges, and scalability. The main idea behind WebDataset was to give PyTorch a counterpart to TensorFlow’s TFRecord/tf.Example format.

The engineers behind WebDataset went a step further than tf.Example by making the library work with a standard data format: the POSIX tar archive. Because tar is a widely supported format, including in the Python community, this choice seems the most practical one, especially since the files stored in an archive keep their original formats and need no data conversion.

The library provides significant speedups across use cases, from a single local desktop machine to large deep learning experiments on clusters of GPUs. It achieves this by storing datasets as POSIX tar archives. Each training sample obtained from WebDataset consists of adjacent files in the archive that share the same basename.
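The grouping rule described above can be illustrated with the Python standard library alone. The following is a minimal sketch, not WebDataset's own implementation: the helper names write_shard, split_name, and iter_samples are hypothetical, and splitting the filename at its first dot is an assumption made for this example.

```python
import io
import tarfile
from itertools import groupby


def write_shard(files):
    """Pack (name, bytes) pairs into an in-memory tar archive, keeping their order."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def split_name(name):
    """Split 'dir/0001.cls' into the basename 'dir/0001' and the extension 'cls'.

    Assumption for this sketch: the extension starts at the first dot of the filename.
    """
    directory, _, filename = name.rpartition("/")
    base, _, extension = filename.partition(".")
    key = f"{directory}/{base}" if directory else base
    return key, extension


def iter_samples(shard):
    """Yield (basename, {extension: bytes}) for each run of adjacent files sharing a basename."""
    # Mode "r|" reads the archive as a forward-only stream, so each member
    # is read immediately, before the iterator advances to the next one.
    with tarfile.open(fileobj=shard, mode="r|") as archive:
        members = (
            (split_name(member.name), archive.extractfile(member).read())
            for member in archive
            if member.isfile()
        )
        for key, group in groupby(members, key=lambda item: item[0][0]):
            yield key, {extension: payload for (_, extension), payload in group}


# Hypothetical shard with two samples, each made of an image file and a label file.
files = [
    ("0001.jpg", b"image bytes 1"),
    ("0001.cls", b"0"),
    ("0002.jpg", b"image bytes 2"),
    ("0002.cls", b"1"),
]
for key, fields in iter_samples(write_shard(files)):
    print(key, sorted(fields))
```

Because the grouping only compares neighbouring entries, the files that belong to one sample must be written next to each other in the archive. In return, the whole dataset can be consumed in a single sequential pass.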

The library is available as a standalone Python package and will soon be incorporated into PyTorch. The implementation is open source and available on GitHub. More details on how to use the library, code examples, and performance measurements can be found in the blog post and in the official repository.