LAION-5B: the largest dataset of image-text pairs

LAION-5B — dataset of image-text pairs collected on the Internet. LAION-5B contains more than 5 billion pairs, which makes it the largest among similar datasets.

AION-5B was assembled by parsing the Common Crawl dataset to search for images with a description. The images were uploaded and filtered using CLIP to leave only those images whose content matches their textual description.

In total, the dataset contains 2.32 billion images with text in English, 2.26 billion with text in other languages and 1.27 billion, the language of the text of which could not be determined unambiguously. Image labels also includes several nearest neighbor indexes.

A web demonstration of semantic search and playback of a clip trained on the basis of data has been developed for the dataset.

The purpose of the dataset development is to democratize multimodal research in the field of artificial intelligence. Anological large-scale datasets, in particular, the OpenAI dataset with 400 million pairs, are not publicly available.