While big corporations like Google and Facebook have built even larger labeled datasets (JFT-300M from Google and Facebook’s Instagram dataset), they keep them proprietary and internal. Until now, the largest publicly available dataset was Google’s Open Images, containing around 9 million labeled images.
Now, Tencent’s ML Images has taken the lead as the largest openly available multi-label image dataset, with 17,609,752 training and 88,739 validation image URLs. It was built to serve the research community as well as small and medium-sized enterprises (SMEs) that can only afford to use open-source datasets.
ML Images was built by collecting images from two existing image datasets, Open Images and ImageNet. The class vocabularies of both datasets were merged into one unified vocabulary and organized into a semantic hierarchy using WordNet. Redundant classes were removed, leaving the final dataset with more than 11,000 categories.
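The vocabulary merge described above can be illustrated with a small sketch. It assumes each class is keyed by a WordNet-style identifier, so a class that appears in both source vocabularies collapses into a single entry, and that the hierarchy is stored as a child-to-parent map. The identifiers, class names and helper functions here are hypothetical toy values, not taken from the released code.

```python
from typing import Dict, List, Optional


def merge_vocabularies(vocab_a: Dict[str, str], vocab_b: Dict[str, str]) -> Dict[str, str]:
    """Merge two {class_id: class_name} maps; shared IDs collapse into one entry."""
    merged = dict(vocab_a)
    for class_id, name in vocab_b.items():
        # Keep the first name seen for an ID, so duplicates are dropped.
        merged.setdefault(class_id, name)
    return merged


def ancestors(class_id: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Walk up the hierarchy from class_id to the root, nearest ancestor first."""
    chain: List[str] = []
    current = parent_of.get(class_id)
    while current is not None and current not in chain:  # guard against cycles
        chain.append(current)
        current = parent_of.get(current)
    return chain


# Toy vocabularies (hypothetical IDs): "w_dog" appears in both sources.
vocab_one = {"w_dog": "dog", "w_cat": "cat"}
vocab_two = {"w_dog": "domestic dog", "w_bird": "bird"}

# Toy hierarchy (hypothetical): every class points to its parent, the root to None.
parent_of = {
    "w_dog": "w_animal",
    "w_cat": "w_animal",
    "w_bird": "w_animal",
    "w_animal": "w_entity",
    "w_entity": None,
}

merged = merge_vocabularies(vocab_one, vocab_two)
dog_ancestors = ancestors("w_dog", parent_of)
```

With a hierarchy like this, an image tagged with a specific class can also be associated with every ancestor of that class, which is one common way a semantic hierarchy is used in multi-label annotation.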
To verify the quality of the merged image dataset, Tencent conducted representation learning experiments with a popular deep neural network model, ResNet-101. The researchers showed that ResNet-101 could be trained efficiently on the new dataset. To achieve this, they also proposed a new loss function that accounts for the severe class imbalance in the dataset.
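To give a sense of how a loss function can account for class imbalance in a multi-label setting, here is a minimal sketch of a generic weighted binary cross-entropy. This is not the loss function from the Tencent paper; it is a common, simpler technique in which the positive term of each class is scaled by the ratio of negative to positive examples, so that rare classes contribute more. The function names and the toy labels are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def positive_weights(label_rows: Sequence[Sequence[int]], num_classes: int) -> List[float]:
    """Per-class weight = negatives / positives, so rare classes weigh more."""
    total = len(label_rows)
    positives = [0] * num_classes
    for row in label_rows:
        for class_index, label in enumerate(row):
            positives[class_index] += label
    # Fall back to 1.0 for classes that never occur, to avoid division by zero.
    return [(total - p) / p if p > 0 else 1.0 for p in positives]


def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 pos_weight: Sequence[float], eps: float = 1e-7) -> float:
    """Mean weighted binary cross-entropy over the classes of one sample."""
    loss = 0.0
    for prob, label, weight in zip(probs, labels, pos_weight):
        prob = min(max(prob, eps), 1.0 - eps)  # clamp to keep log() finite
        loss -= weight * label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return loss / len(labels)


# Toy multi-label data: class 0 is frequent, class 1 is rare.
toy_labels = [[1, 0], [1, 0], [1, 0], [0, 1]]
weights = positive_weights(toy_labels, num_classes=2)

# Loss for one sample where the model is undecided (0.5) on both classes.
sample_loss = weighted_bce([0.5, 0.5], [0, 1], weights)
```

In this toy example the rare class receives a positive weight of 3, so a missed positive on it is penalized more heavily than one on the frequent class.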
This represents an important contribution to the machine learning community, and it is expected to foster the development of new and improved methods in computer vision. Tencent has released the new Tencent ML Images database along with trained ResNet-101 checkpoints, as well as the complete code for data preparation, pre-training, fine-tuning and feature extraction. The GitHub repository contains the procedure for downloading the dataset, the models and all the code.
More details on the creation of Tencent ML Images, the largest multi-label image dataset, and on the ResNet-101 experiments can be found in the published paper.