FAIR has presented Ego4D, a dataset of first-person video aimed at teaching computer vision systems to perceive actions the way a person does.
Modern computer vision systems are typically trained on images and videos shot from a third-person perspective, with the camera acting as an observer. Ego4D instead targets problems of first-person perception. Researchers from 13 universities in nine countries collected the dataset, recording more than 2,200 hours of first-person video.
The partner universities distributed head-mounted cameras to more than 700 participants, who recorded everyday scenarios such as grocery shopping, cooking, and chatting with friends. These videos capture what a person chooses to look at in a given environment, what they do with their hands and the objects in front of them, and how they interact with other people. In hours of footage, Ego4D is 20 times larger than any other first-person perception dataset.
The videos were extensively annotated: dense textual narrations of the camera wearer's actions, spatial and temporal labels for objects and actions, and dialogue transcriptions were prepared for them. Alongside the data, FAIR developed a set of benchmarks for studying episodic memory, action recognition and anticipation, and the analysis of speech and social interactions in models trained on Ego4D.
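To make the annotation types concrete, here is a minimal sketch of how such a record could be represented in code. This is a hypothetical illustration, not the official Ego4D schema: the class names, fields, and values below are invented for the example.

```python
# Hypothetical sketch of a first-person video annotation record.
# NOT the official Ego4D schema; all names and values are illustrative.
from dataclasses import dataclass, field


@dataclass
class ObjectLabel:
    name: str          # object category
    frame: int         # temporal label: frame index in the clip
    bbox: tuple        # spatial label: (x, y, w, h) in pixels


@dataclass
class AnnotatedClip:
    video_id: str
    narration: str                 # dense text description of the wearer's action
    objects: list = field(default_factory=list)
    transcript: str = ""           # dialogue transcription, if any


# Build one annotated clip with an object label and a transcript.
clip = AnnotatedClip(
    video_id="kitchen_0001",
    narration="The camera wearer chops vegetables on a cutting board.",
    transcript="Could you pass me the salt?",
)
clip.objects.append(ObjectLabel(name="knife", frame=120, bbox=(640, 410, 80, 35)))

print(clip.narration)
print(len(clip.objects))
```

A structure like this combines all three annotation types the article mentions: free-text narration, spatial and temporal object labels, and speech transcriptions.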
The dataset will be open to everyone in November 2021, subject to signing a data use agreement.