UVO (Unidentified Video Objects) is a benchmark from FAIR for models that detect, segment, and track objects in video. UVO contains more than 10,000 videos with annotations of all objects in each frame and is aimed at studying how well segmentation models handle previously unseen objects.
In applications such as embodied artificial intelligence or augmented reality, there is a wide class of objects that typically do not appear in training data, so models cannot segment them. People, by contrast, can detect unfamiliar objects, such as new musical instruments or unknown sports equipment, without any prior knowledge of them. To investigate whether models can likewise segment objects they have never seen, FAIR developed the UVO benchmark, in which models must detect and segment every object that appears in a frame, regardless of whether it was included in the training dataset.
UVO contains videos from Kinetics, a popular action-recognition benchmark, with an average of 13.5 unique objects in each frame. FAIR used crowdsourcing to annotate all visible objects in every frame of each video, including fast-moving, overlapping, and blurred objects. A distinctive feature of the benchmark is the heterogeneity of the object classes that appear in the videos: 57% of the annotated objects do not belong to any category in the COCO dataset.
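Because many objects fall outside any known category, open-world benchmarks of this kind are typically scored class-agnostically: a predicted mask is matched to a ground-truth mask by overlap alone, with no category labels involved. A minimal sketch of the underlying mask-IoU computation (the `mask_iou` helper is illustrative and not part of UVO's released tooling):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 0.0

# Toy 4x4 masks: the prediction partially overlaps the ground truth.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True          # 4-pixel ground-truth object
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True        # 6-pixel prediction, 4 pixels of overlap
print(mask_iou(pred, gt))    # intersection 4, union 6 -> ~0.667
```

Note that the category of either mask never enters the computation, which is what lets the metric score objects outside any predefined taxonomy.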
The benchmark is available here.