Amazon S3 plugin for PyTorch gives direct access to S3 datasets

Amazon has released an open-source plugin for PyTorch that provides access to datasets stored in Amazon Simple Storage Service (S3). The plugin lets you stream datasets of any size, so the data does not have to be copied to local storage first.

With the plugin, users can read data directly through standard PyTorch APIs such as Dataset and DataLoader. Because the plugin is implemented on top of PyTorch's internal interfaces, using it does not require changes to existing code.

The plugin works with files of any format and presents S3 objects as binary blobs. Users can apply their own transformations to the data received from S3 and extend the plugin's functionality to load and process data as needed. The plugin also supports shuffling the data to reduce variance.
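The idea of treating each object as an opaque blob, decoding it with a user-supplied transform, and randomly reordering (shuffling) a stream can be illustrated with a minimal sketch. This is not the plugin's API: the in-memory dictionary FAKE_BUCKET stands in for an S3 bucket, and the names stream_objects, decode_record, shuffle_buffer, and load_samples are hypothetical helpers invented for illustration. The buffer-based shuffle is one common way to shuffle streamed data; the article does not say which method the plugin uses.

```python
import random
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for an S3 bucket: object key -> raw bytes (a "blob").
FAKE_BUCKET: Dict[str, bytes] = {
    "s3://example-bucket/train/0.txt": b"1,cat",
    "s3://example-bucket/train/1.txt": b"0,dog",
    "s3://example-bucket/train/2.txt": b"1,lynx",
    "s3://example-bucket/train/3.txt": b"0,wolf",
}


def stream_objects(bucket: Dict[str, bytes]) -> Iterator[bytes]:
    """Yield one blob at a time, so the whole dataset is never held at once."""
    for key in sorted(bucket):
        yield bucket[key]


def decode_record(blob: bytes) -> Tuple[int, str]:
    """User-defined transform: turn an opaque blob into a (label, text) sample."""
    label, text = blob.decode("utf-8").split(",", 1)
    return int(label), text


def shuffle_buffer(
    items: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size in-memory buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buf: List[T] = []
    for item in items:
        if len(buf) < buffer_size:
            buf.append(item)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buf[idx]  # emit a random buffered item...
        buf[idx] = item  # ...and put the new item in its slot
    rng.shuffle(buf)  # drain whatever is left in random order
    yield from buf


def load_samples(
    bucket: Dict[str, bytes],
    transform: Callable[[bytes], Tuple[int, str]] = decode_record,
    buffer_size: int = 2,
    seed: Optional[int] = None,
) -> List[Tuple[int, str]]:
    """Stream blobs, shuffle them, and apply the transform to each one."""
    blobs = shuffle_buffer(stream_objects(bucket), buffer_size, seed)
    return [transform(blob) for blob in blobs]


if __name__ == "__main__":
    for sample in load_samples(FAKE_BUCKET, seed=0):
        print(sample)
```

A larger buffer_size gives a ordering closer to a full shuffle at the cost of more memory; with a real bucket, stream_objects would be replaced by whatever reader the plugin provides.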

The plugin is designed for high performance and significantly reduces the time needed to load data from S3 into deep learning models.