New Datasets for Object Tracking

8 November 2018

Object tracking in the wild is far from being solved. Existing object trackers do quite a good job on established datasets (e.g., VOT, OTB), but these datasets are relatively small and do not fully represent the challenges of real-life tracking tasks. Deep learning is at the core of most state-of-the-art trackers today, yet a dedicated large-scale dataset for training deep trackers is still lacking.

In this article, we discuss three recently introduced datasets for object tracking. They differ in scale, annotations and other characteristics, but each of them contributes to solving the object tracking problem: TrackingNet is the first large-scale dataset for object tracking in the wild, MOT17 is a benchmark for multiple object tracking, and Need for Speed is the first higher-frame-rate video dataset.

TrackingNet

Number of videos: 30,132 (train) + 511 (test)

Number of annotations: 14,205,677 (train) + 225,589 (test)

Year: 2018

Examples from TrackingNet test set

TrackingNet is the first large-scale dataset for object tracking in the wild. It includes over 30K videos with an average duration of 16.6 s and more than 14M dense bounding box annotations. The dataset is not limited to a specific context but instead covers a wide selection of object classes in broad and diverse contexts. TrackingNet has a number of notable advantages:

  • the large scale of the dataset enables the development of deep architectures designed specifically for tracking;
  • by being created specifically for object tracking, the dataset enables model architectures to focus on the temporal context between consecutive frames;
  • the dataset was sampled from YouTube videos and thus represents real-world scenarios, with a large variety of frame rates, resolutions, contexts and object classes.

The TrackingNet training set was derived from YouTube-BoundingBoxes (YT-BB), a large-scale dataset for object detection with roughly 300K video segments annotated every second with upright bounding boxes. To build TrackingNet, the researchers filtered out 90% of the videos, keeping only those that a) are longer than 15 seconds; b) contain bounding boxes that cover less than 50% of the frame; and c) contain a reasonable amount of motion between bounding boxes.
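
As a minimal sketch (not the authors' actual implementation), these selection criteria could be expressed as follows; the Segment structure, the normalized box coordinates, and the motion threshold are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    """Hypothetical, simplified view of a YT-BB video segment: duration in seconds
    and one upright box per annotated second, in normalized [0, 1] coordinates."""
    duration_s: float
    boxes: List[Tuple[float, float, float, float]]  # (x_min, y_min, x_max, y_max)

def box_area(box):
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def box_center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def keep_segment(seg, min_duration_s=15.0, max_coverage=0.5, min_mean_motion=0.01):
    """Return True if the segment passes all three selection criteria."""
    # a) longer than 15 seconds
    if seg.duration_s <= min_duration_s:
        return False
    # b) every bounding box covers less than half of the frame
    if any(box_area(b) >= max_coverage for b in seg.boxes):
        return False
    # c) a reasonable amount of motion between consecutive boxes
    centers = [box_center(b) for b in seg.boxes]
    shifts = [abs(x2 - x1) + abs(y2 - y1)
              for (x1, y1), (x2, y2) in zip(centers, centers[1:])]
    mean_shift = sum(shifts) / max(len(shifts), 1)
    return mean_shift >= min_mean_motion
```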

To increase the 1 fps annotation density provided by YT-BB, the creators of TrackingNet rely on a mixture of state-of-the-art trackers. They argue that any reasonable tracker is reliable over a short interval of one second, so they densely annotated the 30,132 videos using a weighted average between a forward and a backward pass of the DCF tracker. Furthermore, the code for automatically downloading the videos from YouTube and extracting the annotated frames is also available.
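
The general idea of densifying keyframe annotations by blending a forward and a backward tracking pass can be sketched as follows; the linear weighting scheme and the run_tracker interface are assumptions for illustration, not the exact procedure used for TrackingNet:

```python
import numpy as np

def densify_annotations(frames, box_start, box_end, run_tracker):
    """
    Interpolate boxes between two 1 fps keyframes by blending a forward and a
    backward tracker pass, weighted by temporal distance to each keyframe.

    frames      : list of frames covering one 1-second interval (keyframes included, len >= 2)
    box_start   : (x, y, w, h) annotation on frames[0]
    box_end     : (x, y, w, h) annotation on frames[-1]
    run_tracker : callable(frames, init_box) -> list of boxes, one per frame
                  (any single-object tracker, e.g. a DCF-style correlation filter)
    """
    n = len(frames)
    fwd = run_tracker(frames, box_start)            # track forward from the first keyframe
    bwd = run_tracker(frames[::-1], box_end)[::-1]  # track backward from the last keyframe

    blended = []
    for t in range(n):
        w_fwd = (n - 1 - t) / (n - 1)               # trust the forward pass near the start...
        w_bwd = 1.0 - w_fwd                         # ...and the backward pass near the end
        blended.append(tuple(w_fwd * np.array(fwd[t]) + w_bwd * np.array(bwd[t])))
    return blended
```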

Comparison of tracking datasets across the number of videos, the average length of the videos, and the number of annotated bounding boxes (reflected with the circle’s size)

Finally, the TrackingNet dataset comes with a new benchmark composed of 511 novel YouTube videos under a Creative Commons license, referred to as YT-CC. These videos have the same object class distribution as the training set and were annotated with the help of Amazon Mechanical Turk workers. With tight supervision in the loop, the TrackingNet team ensured the quality of the annotations after a few iterations, discouraged bad annotators and incentivized the good ones.

Thus, by sequestering the annotation of the test set and maintaining an online evaluation server, the researchers behind TrackingNet provide a fair benchmark for the development of object trackers.

MOT17

Number of videos: 21 (train) + 21 (test)

Number of annotations: 564,228

Year: 2017

Examples from the MOT17 dataset

MOT17 (Multiple Object Tracking) is an extended version of the MOT16 dataset with new and more accurate ground truth. As its name suggests, the specific focus of this dataset is multi-target tracking. It should also be noted that the context of the MOTChallenge datasets, including MOT17, is limited to street scenes.

The new MOT17 benchmark includes a set of 42 sequences with crowded scenarios, camera motion, and varying weather conditions. The annotations for all sequences have been carried out from scratch by qualified researchers following a strict protocol, and to ensure the highest annotation accuracy, all annotations were double-checked. Another thing that distinguishes this dataset from earlier MOTChallenge datasets is that not only pedestrians are annotated, but also vehicles, sitting people, occluding objects, and other significant object classes.

An overview of annotated classes and example of an annotated frame

The researchers have defined some classes as targets (depicted in orange in the image above); these are the central classes to evaluate on. The red classes cover ambiguous cases: neither recovering nor missing them is penalized in the evaluation. Finally, the classes in green are annotated for training purposes and for computing the occlusion level of all pedestrians.

The example of an annotated frame demonstrates how partially cropped objects are also marked outside of the frame. Also, note that the bounding box encloses the entire person but not the pedestrian's white bag.

The rich ground-truth information provided with the MOT17 dataset can be very useful for developing more accurate tracking methods and advancing the field further.

NfS

Number of videos: 100

Number of annotations: 383,000

Year: 2017

The effect of tracking higher frame rate videos

NfS (Need for Speed) is the first higher-frame-rate video dataset and benchmark for visual object tracking. It includes 100 videos comprising 380K frames, captured with 240 FPS cameras, which are now often used in real-world scenarios.

In particular, 75 videos were captured using the iPhone 6 (and above) and the iPad Pro, while 25 videos were taken from YouTube. The tracking targets include vehicles, humans, faces, animals, aircraft, boats and generic objects such as sports balls, cups, and bags.

All frames in the NfS dataset are annotated with axis-aligned bounding boxes using the VATIC toolbox. Moreover, all videos are manually labeled with nine visual attributes: occlusion, illumination variation, scale variation, object deformation, fast motion, viewpoint change, out of view, background clutter, and low resolution.

Comparing lower frame rate (green boxes) to higher frame rate (red boxes) tracking. Ground truth is shown by blue boxes

The NfS benchmark provides a great opportunity to evaluate state-of-the-art trackers on higher-frame-rate sequences. In fact, some surprising results have already been revealed thanks to this dataset: at higher frame rates, simple trackers such as correlation filters outperform complex deep learning algorithms.

Bottom Line

The scarcity of dedicated large-scale tracking datasets forces object trackers based on deep learning to rely on object detection datasets instead of dedicated object tracking ones, which limits advances in the field. Fortunately, the recently introduced object tracking datasets, especially the large-scale TrackingNet dataset, provide data-hungry trackers with great opportunities for significant performance gains.

New Datasets for 3D Human Pose Estimation

8 November 2018

Human pose estimation is a fundamental problem in computer vision. A computer's ability to recognize and understand humans in images and videos is crucial for multiple tasks, including autonomous driving, action recognition, human-computer interaction, augmented reality and robotic vision.

In recent years, significant progress has been achieved in 2D human pose estimation. The crucial factor behind this success is the availability of large-scale annotated human pose datasets that allow training networks for 2D human pose estimation. At the same time, advances in 3D human pose estimation remain limited because obtaining ground-truth information on dense correspondences, depth, motion, body-part segmentation, and occlusions is a very challenging task.

In this article, we present several recently created datasets that attempt to address the shortage of annotated datasets for 3D human pose estimation.

DensePose

Number of images: 50K

Number of annotated correspondences: 5M

Year: 2018

DensePose is a large-scale ground-truth dataset with image-to-surface correspondences manually annotated on 50K COCO images. To build this dataset, the Facebook AI Research team involved human annotators, who established dense correspondences from 2D images to surface-based representations of the human body using a specially developed annotation pipeline.

As shown below, in the first stage annotators delineate regions corresponding to visible, semantically defined body parts. In the second stage, every part region is sampled with a set of roughly equidistant points, and annotators are asked to bring these points into correspondence with the surface. The researchers wanted to avoid manual rotation of the surface; for this purpose, they provide annotators with six pre-rendered views of the same body part and allow them to place landmarks on any of these views.

Annotation pipeline

Below are visualizations of annotations on images from the validation set: Image (left), U (middle) and V (right) values for the collected points.

Visualization of annotations

DensePose is the first manually-collected ground truth dataset for the task of dense human pose estimation.

SURREAL

Number of frames: 6.5M

Number of subjects: 145

Year: 2017

Generating photorealistic synthetic images

SURREAL (Synthetic hUmans foR REAL tasks) is a new large-scale dataset with synthetically generated but realistic images of people rendered from 3D sequences of human motion capture data. It includes over 6 million frames accompanied by ground-truth poses, depth maps, and segmentation masks.

As described in the original research paper, images in SURREAL are rendered from 3D sequences of MoCap data. Since the realism of synthetic data is usually limited, the researchers created synthetic bodies using the SMPL body model, whose parameters are fit by the MoSh method given raw 3D MoCap marker data. Moreover, the creators of the SURREAL dataset ensured a large variety of viewpoints, clothing, and lighting.

The pipeline for generating synthetic humans is demonstrated below:

  • a 3D human body model is posed using motion capture data;
  • a frame is rendered using a background image, a texture map on the body, lighting, and a camera position;
  • all these “ingredients” are randomly sampled to increase the diversity of the data;
  • the generated RGB images are accompanied by 2D/3D poses, surface normals, optical flow, depth images, and body-part segmentation maps.
Pipeline for generating synthetic data
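
A minimal sketch of the random "ingredient" sampling step is shown below; the asset pools, file names and parameter ranges are made-up placeholders, since the actual pipeline draws the body model, textures, backgrounds and lighting from the sources described in the SURREAL paper:

```python
import random

# Hypothetical pools of rendering "ingredients".
BACKGROUNDS = ["kitchen_01.jpg", "office_03.jpg", "street_12.jpg"]
BODY_TEXTURES = ["tex_female_017.png", "tex_male_102.png"]
MOCAP_SEQUENCES = ["walk_01", "jump_03", "sit_07"]

def sample_render_config(rng: random.Random) -> dict:
    """Randomly sample one configuration for rendering a synthetic clip."""
    return {
        "mocap_sequence": rng.choice(MOCAP_SEQUENCES),            # pose for the body model
        "body_shape": [rng.gauss(0.0, 1.0) for _ in range(10)],   # e.g., SMPL shape coefficients
        "texture": rng.choice(BODY_TEXTURES),
        "background": rng.choice(BACKGROUNDS),
        "light_intensity": rng.uniform(0.5, 1.5),
        "camera_distance_m": rng.uniform(3.0, 8.0),
        "camera_yaw_deg": rng.uniform(0.0, 360.0),
    }

rng = random.Random(0)
configs = [sample_render_config(rng) for _ in range(5)]  # one config per clip to render
```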

The resulting dataset contains 145 subjects, more than 67.5K clips and over 6.5M frames.

Even though SURREAL contains synthetic images, the researchers behind this dataset demonstrate that CNNs trained on SURREAL allow for accurate human depth estimation and human part segmentation in real RGB images. Hence, this dataset provides new possibilities for advancing 3D human pose estimation using cheap and large-scale synthetic data.

UP-3D

Number of subjects: 5,569

Number of images: 5,569 training images and 1,208 test images

Year: 2017

Bottom: Validated 3D body model fits on various datasets constitute the initial UP-3D dataset. Top: improved 3D fits can extend the initial dataset

UP-3D is a dataset that “Unites the People” of several existing datasets for multiple tasks. In particular, using the recently introduced SMPLify method, the researchers obtain high-quality 3D body model fits for several human pose datasets; human annotators only sort the fits into good and bad ones.

This dataset combines the two LSP datasets (11,000 training images and 1,000 test images) and the single-person part of the MPII Human Pose dataset (13,030 training images and 2,622 test images). While it was possible to use an automatic segmentation method to provide foreground silhouettes, the researchers decided to involve human annotators for reliability. They built an interactive annotation tool on top of the Opensurfaces package to work with Amazon Mechanical Turk (AMT) and used the interactive GrabCut algorithm to obtain image-consistent silhouette borders.

So, the annotators were asked to evaluate fits for:

  • foreground silhouettes;
  • six-body-part segmentation.

While the average foreground labeling task was solved in 108 s on LSP and 168 s on MPII, annotating the segmentation into six body parts took on average more than twice as long as annotating the foreground segmentation: 236 s.

The annotators sorted the fits into good and bad ones, and the percentage of accepted fits varies per dataset.

Thus, the validated fits formed the initial UP-3D dataset with 5,569 training images and 1,208 test images. After experiments on semantic body-part segmentation, pose estimation and 3D fitting, the improved 3D fits can be used to extend the initial dataset.

Results from various methods trained on labels generated from the UP-3D dataset

The presented dataset allows for a holistic view of human-related prediction tasks. It sets a new mark in terms of level of detail by including high-fidelity semantic body-part segmentation into 31 parts and 91-landmark human pose annotations. It was also demonstrated that training a pose estimator on the full 91-keypoint dataset helps to improve the state of the art in 3D human pose estimation on two popular benchmarks, HumanEva and Human3.6M.

Bottom Line

As you can see, there are many possible approaches to building a dataset for 3D human pose estimation. The datasets presented here focus on different aspects of recognizing and understanding humans in images, yet all of them can be handy for estimating human poses in real-life applications.

New Datasets for 3D Object Recognition

6 November 2018

Robotics, augmented reality, autonomous driving – all these scenarios rely on recognizing 3D properties of objects from 2D images. This makes 3D object recognition one of the central problems in computer vision.

Remarkable progress has been achieved in this field after the introduction of several databases that provide 3D annotations to 2D objects (e.g., IKEA, Pascal3D+). However, these datasets are limited in scale and include only about a dozen object categories.

This is not even close to large-scale image datasets such as ImageNet or Microsoft COCO, the huge datasets behind the significant progress in image classification in recent years. Consequently, large-scale datasets with 3D annotations are likely to significantly benefit 3D object recognition.

In this article, we present one large-scale dataset, ObjectNet3D, along with several specialized datasets for 3D object recognition: MVTec ITODD and T-LESS for industrial settings, and the Falling Things dataset for object recognition in the context of robotics.

ObjectNet3D

Number of images: 90,127

Number of objects: 201,888

Number of categories: 100

Number of 3D shapes: 44,147

Year: 2016

An example image from ObjectNet3D with 2D objects aligned with 3D shapes

ObjectNet3D is a large-scale database in which objects in the images are aligned with 3D shapes, and the alignment provides both an accurate 3D pose annotation and the closest 3D shape annotation for each 2D object. The scale of this dataset enables significant progress on computer vision tasks such as recognizing the 3D pose and 3D shape of objects from 2D images.

Examples of 3D shape retrieval. Green boxes indicate the selected shape. Bottom row illustrates two cases where a similar shape was not found among the top 5 shapes

To construct this database, researchers from Stanford University drew on images from existing image repositories and proposed an approach for aligning 3D shapes (available from existing 3D shape repositories) with the objects in these images.

In their work, the researchers consider only rigid object categories, for which they can collect a large number of 3D shapes from the web. Here is the full list of categories:

Object categories in ObjectNet3D

2D images were collected from the ImageNet dataset and, additionally, through Google Image Search for categories that are not sufficiently covered by ImageNet. 3D shapes were acquired from the Trimble 3D Warehouse and the ShapeNet repository. Then, objects in the images were aligned with the 3D shapes using a camera model, which is described in detail in the corresponding paper. Finally, 3D annotations were provided for the objects in the 2D images.

The resulting dataset can be used for object proposal generation, 2D object detection, joint 2D detection and 3D object pose estimation, and image-based 3D shape retrieval.

MVTec ITODD

Number of scenes: 800

Number of objects: 28

Number of 3D transformations: 3,500

Year: 2017

Example scene of the dataset from all sensors. Top row: grayscale cameras. Bottom row: Z and grayscale image of the High-Quality (left) and Low-Quality (right) 3D sensor

MVTec ITODD is a dataset for 3D object detection and pose estimation with a strong focus on industrial settings and applications. It contains 28 objects arranged in over 800 scenes and labeled with their rigid 3D transformations as ground truth. The scenes are observed by two industrial 3D sensors and three grayscale cameras, allowing the evaluation of methods that work on 3D, image, or combined modalities. The dataset's creators from MVTec Software GmbH chose grayscale cameras because they are much more common in industrial setups.

As mentioned in the dataset description, the objects were selected to cover a range of different values with respect to surface reflectance, symmetry, complexity, flatness, detail, compactness, and size. Here are the images of all objects included in MVTec ITODD along with their names:

Images of 28 objects used in the dataset

For each object, scenes with only a single instance and scenes with multiple instances (e.g., to simulate bin picking) are available. Each scene was acquired once with each of the 3D sensors, and twice with each of the grayscale cameras: once with and once without a randomly projected pattern.

Finally, for all objects, manually created CAD models are available for training the detection methods. The ground truth was labeled using a semi-manual approach based on the 3D data of the high-quality 3D sensor.

This dataset provides a great benchmark for the detection and pose estimation of 3D objects in industrial scenarios.

T-LESS

Number of images: 39K training + 10K test images from each of three sensors

Number of objects: 30

Year: 2017

Examples of T-LESS test images (left) overlaid with colored 3D object models at the ground-truth 6D poses (right). Instances of the same object have the same color

T-LESS is a new public dataset for estimating the 6D pose, i.e. translation and rotation, of texture-less rigid objects. This dataset includes 30 industry-relevant objects with no significant texture and no discriminative color or reflectance properties. Another unique property of this dataset is that some of the objects are parts of others.

The researchers behind T-LESS chose different approaches for the training and test images: training images depict individual objects against a black background, while test images originate from twenty scenes with varying degrees of complexity. Here are examples of training and test images:

Top: training images and 3D models of 30 objects. Bottom: test images of 20 scenes overlaid with colored 3D object models at the ground truth poses

All the training and test images were captured with three synchronized sensors, including a structured-light and a time-of-flight RGB-D sensor and a high-resolution RGB camera.

Finally, two types of 3D models are provided for each object: a manually created CAD model and a semi-automatically reconstructed one.

This dataset can be very useful for evaluating approaches to 6D object pose estimation, 2D object detection and segmentation, and 3D object reconstruction. Considering the availability of images from three sensors, it is also possible to study the importance of different input modalities for a given problem.

Falling Things

Number of images: 61,500

Number of objects: 21 household objects

Year: 2018

A sample from the FAT dataset

The Falling Things (FAT) dataset is a synthetic dataset for 3D object detection and pose estimation created by an NVIDIA team. It was generated by placing 3D household object models (e.g., a mustard bottle, a soup can, a gelatin box, etc.) in virtual environments.

Each snapshot in this dataset includes per-pixel class segmentation, 2D/3D bounding box coordinates for all objects, mono and stereo RGB images, dense depth images, and, of course, 3D poses. Most of these elements are illustrated in the image above.

Sample images from the FAT dataset

The FAT dataset includes a wide variety of object poses, backgrounds, compositions, and lighting conditions. See some examples below:

For more details on the process of building the FAT dataset, check our article dedicated entirely to this dataset.

The Falling Things dataset provides a great opportunity to accelerate research in object detection and pose estimation, as well as in segmentation, depth estimation, and the study of different sensor modalities.

Bottom Line

3D object recognition has multiple important applications, but progress in this field is limited by the available datasets. Fortunately, several new 3D object recognition datasets have been introduced in recent years. While they differ in scale, focus and characteristics, each of them makes a significant contribution to improving current 3D object recognition systems.

New Datasets for Action Recognition

22 October 2018

Action recognition is vital for many real-life applications, including video surveillance, healthcare, and human-computer interaction. What do we need to do to classify video clips based on the actions being performed in these videos?

We need to identify different actions from video clips, where the action may or may not be performed throughout the entire duration of the video. This looks similar to the image classification problem, but here the task is extended to multiple frames, with the predictions from each frame then aggregated. We know that, since the introduction of the ImageNet dataset, deep learning algorithms have been doing a pretty good job at image classification. But do we observe the same progress in video classification or action recognition?
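
As a minimal illustration of that per-frame formulation, the sketch below averages per-frame class probabilities from an image classifier; the frame_classifier callable is a placeholder, and real action recognition models usually model temporal context more explicitly:

```python
import numpy as np

def classify_video(frames, frame_classifier, num_classes):
    """
    Naive video classification baseline: run an image classifier on every frame
    and average the per-frame class probabilities.

    frame_classifier : callable(frame) -> np.ndarray of shape (num_classes,)
                       with class probabilities (e.g., a CNN fine-tuned on action labels)
    """
    probs = np.zeros(num_classes)
    for frame in frames:
        probs += frame_classifier(frame)
    probs /= max(len(frames), 1)
    return int(np.argmax(probs)), probs  # predicted class and averaged probabilities
```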

Actually, a number of things turn action recognition into a much more challenging task: huge computational cost, the need to capture long temporal context, and, of course, the need for good datasets.

A good dataset for action recognition should have a number of frames comparable to ImageNet and a diversity of action classes that allows the trained architectures to generalize to many different tasks.

Fortunately, several such datasets were presented during the last year. Let’s have a look.

Kinetics-600

Number of videos: 500,000

Number of action classes: 600

Year: 2018

Samples from the Kinetics-600 dataset

We start with the dataset introduced by Google's DeepMind team: the Kinetics dataset, a large-scale, high-quality dataset of YouTube URLs created to advance models for human action recognition. Its latest version, Kinetics-600, includes around 500,000 video clips that cover 600 human action classes, with at least 600 video clips for each action class.

Each clip in Kinetics-600 is taken from a unique YouTube video, lasts around 10 seconds and is labeled with a single class. The clips have been through multiple rounds of human annotation. A single-page web application was built for the labeling task, and you can see the labeling interface below.

Labeling Interface

If a worker responded “Yes” to the initial question “Can you see a human performing the action class-name?”, they were also asked the follow-up question “Does the action last for the whole clip?”, so that this signal could be used later during model training.

The creators of Kinetics-600 also checked whether the dataset is gender balanced and discovered that approximately 15% of action classes are imbalanced, but that this does not lead to biased performance.

The actions cover a broad range of classes including human-object interactions such as playing instruments, arranging flowers, mowing a lawn, scrambling eggs and so on.

Moments in Time

Number of videos: 1,000,000

Number of action classes: 339

Year: 2018

Samples from the Moments in Time dataset

Moments in Time is another large-scale dataset, developed by the MIT-IBM Watson AI Lab. With a collection of one million labeled 3-second videos, it is not restricted to human actions and includes people, animals, objects and natural phenomena that capture the gist of a dynamic scene.

The dataset has a significant intra-class variation among the categories. For instance, video clips labeled with the action “opening” include people opening doors, gates, drawers, curtains and presents, animals and humans opening eyes, mouths and arms, and even a flower opening its petals.

It is natural for humans to recognize that all of the above scenarios belong to the same category “opening”, even though visually they look very different from each other. So, as pointed out by the researchers, the challenge is to develop deep learning algorithms that can discriminate between different actions, yet generalize to other agents and settings within the same action.

The action classes in the Moments in Time dataset were chosen to include the most commonly used verbs in the English language, covering a wide and diverse semantic space. There are 339 different action classes in the dataset with 1,757 labeled videos per class on average; each video is labeled with only one action class.

Labeling interface

As you can see from the image, the annotation process was very straightforward: workers were presented with video-verb pairs and asked to press a Yes or No key indicating whether the action is happening in the scene. For the training set, the researchers ran each video through annotation at least 3 times and required a human consensus of at least 75%. For the validation and test sets, they increased the minimum number of annotation rounds to 4, with a human consensus of at least 85%.
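
A tiny sketch of such a consensus rule is shown below; the vote representation is an assumption, and only the thresholds come from the description above:

```python
def accept_label(votes, min_rounds, min_consensus):
    """
    votes : list of booleans, one per annotation round (True = 'Yes', the action is present).
    Returns True if the video-verb pair passes the consensus requirement.
    """
    if len(votes) < min_rounds:
        return False
    return sum(votes) / len(votes) >= min_consensus

# Training set: at least 3 rounds with >= 75% agreement;
# validation/test sets: at least 4 rounds with >= 85% agreement.
assert accept_label([True, True, False, True], min_rounds=3, min_consensus=0.75)
assert not accept_label([True, True, False, True], min_rounds=4, min_consensus=0.85)
```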

SLAC

Number of videos: 520,000 videos, yielding 1.75M 2-second clips

Number of action classes: 200

Year: 2017

Data collection procedure

The Sparsely Labeled ACtions dataset (SLAC) was introduced by a group of researchers from MIT and Facebook. The dataset is focused on human actions, similarly to Kinetics, and includes over 520K untrimmed videos retrieved from YouTube with an average length of 2.6 minutes. 2-second clips were sampled from the videos with a novel active sampling approach, resulting in 1.75M clips, including 755K positive and 993K negative samples annotated by a team of 70 professional annotators.

As you can see, the distinctive feature of this dataset is the presence of negative samples, illustrated below.

Negative samples from the SLAC dataset

The dataset includes 200 action classes taken from the ActivityNet dataset.

Please note that even though the paper introducing this dataset was released in December 2017, the dataset is still not available for download. Hopefully, this will change very soon.

VLOG

Number of videos: 114,000

Year: 2017

Samples from the VLOG dataset

The VLOG dataset differs from the previous datasets in the way it was collected. The traditional approach to gathering data starts with a laundry list of action classes and then searches for videos tagged with the corresponding labels.

However, such an approach runs into trouble because everyday interactions are not likely to be tagged on the Internet. Could you imagine uploading and tagging a video of yourself opening a microwave, opening a fridge, or getting out of bed? People tend to tag unusual things, such as jumping into a pool, presenting the weather, or playing the harp. As a result, available datasets are often imbalanced, with more data featuring unusual events and less data on day-to-day activities.

To solve this issue, the researchers from the University of California suggest starting with a superset of what is actually needed, namely interaction-rich video data, and then annotating and analyzing it after the fact. They start data collection from lifestyle VLOGs, an immensely popular genre of video that people publicly upload to YouTube to document their lives.

Illustration of the automatic gathering process

As the data was gathered implicitly, it presents certain challenges for annotation. The researchers decided to focus on the crucial part of the interaction, the hands, and how they interact with semantic objects at the frame level. Thus, this dataset can also make headway on the difficult problem of understanding hands in action.

Bottom Line

The action recognition problem requires huge computational resources and lots of data. Fortunately, several very good datasets have appeared during the last year. Together with the previously available benchmarks (ActivityNet, UCF101, HMDB), they build a great foundation for significant improvements in the performance of action recognition systems.

New Datasets for Disguised Face Recognition

9 October 2018
Face recognition is a common task in deep learning, and convolutional neural networks (CNNs) are doing a pretty good job here. Facebook, for instance, usually gets it right when recognizing you and your friends in uploaded images.

But is this really a solved problem? What if the picture is obfuscated? What if the person impersonates somebody else? Can heavy makeup trick the neural network? How easy is it to recognize a person who wears glasses?

Disguised face recognition is still quite a challenging task for neural networks, primarily due to the lack of corresponding datasets. In this article, we feature several recently presented face datasets. Each of them reflects different aspects of face obfuscation, but their goal is the same: to help developers create better models for disguised face recognition.

Disguised Faces in the Wild

Number of images: 11,157

Number of subjects: 1,000

Year: 2018

Sample genuine, cross-subject impostor, impersonator, and obfuscated face images for a single subject

We will start with the most recent dataset, presented earlier this year: Disguised Faces in the Wild (DFW). It primarily contains images of celebrities of Indian or Caucasian origin and focuses on the specific challenge of face recognition under the disguise covariate.

According to the DFW description, it covers disguise variations in hairstyles, beards, mustaches, glasses, make-up, caps, hats, turbans, veils, masquerades and ball masks, coupled with other variations in pose, lighting, expression, background, ethnicity, age, gender, clothing, and camera quality.

There are four types of images in the dataset:

  • Normal Face Image: each subject has a non-disguised frontal face image.
  • Validation Face Image: 903 subjects have an image, which corresponds to a non-disguised frontal face image and can be used for generating a non-disguised pair within a subject.
  • Disguised Face Image: each subject has 1 to 12 face images with an intentional or unintentional disguise.
  • Impersonator Face Image: 874 subjects have 1 to 21 images of impersonators. An impersonator image of a subject is a picture of any other person (intentionally or unintentionally) pretending to be that subject.
Sample images of 3 subjects from the DFW dataset. Each row corresponds to one subject, containing the normal (gray), validation (yellow), disguised (green), and impersonated (blue) images.

In total, the DFW dataset contains 1,000 normal face images, 903 validation face images, 4,814 disguised face images, and 4,440 impersonator images.

Makeup Induced Face Spoofing

Number of images: 642

Number of subjects: 107 + 107 target subjects

Year: 2017

An attempt of one subject to spoof multiple identities

The Makeup Induced Face Spoofing (MIFS) dataset is also about impersonation, but with a specific focus on makeup. The researchers extracted images from YouTube videos in which female subjects applied makeup to transform their appearance to resemble celebrities. It should be noted, though, that the subjects were not deliberately trying to deceive an automated face recognition system; rather, they intended to impersonate a target celebrity from a human vision perspective.

The dataset consists of 107 makeup transformations, with two before-makeup and two after-makeup images per subject. Additionally, two face images of the target identity were taken from the Internet and included in the dataset. It is important to point out, however, that the target images are not necessarily those used by the spoofer as a reference during the makeup transformation. Celebrities sometimes change their facial appearance drastically, so the researchers tried to select target identity images that most resembled the after-makeup image.

Finally, all the acquired images were cropped to the face region, which eliminates hair and accessories. Examples of the cropped images are provided below.

Examples of images in the MIFS dataset after cropping: before makeup – after makeup – target identity

So, in total, the MIFS dataset contains 214 images of subjects before makeup, 214 images of the same subjects after makeup applied with the intention of spoofing, and 214 images of the target subjects being spoofed (107 transformations with two images in each group). Note that some subjects attempt to spoof multiple target identities, resulting in duplicate subject identities, and multiple subjects sometimes attempt to spoof the same target identity, resulting in duplicate target identities.

Specs on Faces dataset

Number of images: 42,592

Number of subjects: 112

Year: 2017

Samples from the SoF dataset: metadata for each image includes 17 facial landmarks, a glasses rectangle, and a face rectangle

Glasses, as a natural occlusion, threaten the performance of many face detectors and facial recognition systems. That's why a dataset in which all subjects wear glasses is of particular importance. The Specs on Faces (SoF) dataset comprises 2,662 original images of size 640 × 480 pixels of 112 persons (66 males and 46 females) of different ages. Glasses are the common natural occlusion in all images of the dataset. This original set of images consists of two parts:

  • 757 unconstrained face images in the wild that were captured over a long period in several locations under indoor and outdoor illumination;
  • 1,905 images specifically dedicated to challenging, harsh illumination changes: 12 persons were filmed under a single lamp placed at arbitrary locations so as to emit light in random directions.
Images captured under different lighting directions

Then, for each image of the original set, there are:

  • 6 extra pictures generated by synthetic occlusions: nose and mouth occlusion using a white block;
  • 9 additional pictures produced with image filters: Gaussian noise, Gaussian blur, and image posterization using fuzzy logic.

So, in total, the SoF dataset includes 42,592 images of 112 persons (2,662 original images, each with 6 occluded and 9 filtered variants: 2,662 × 16 = 42,592), plus a huge bonus: handcrafted metadata containing the subject ID, a view (frontal/near-frontal) label, 17 facial feature points, face and glasses rectangles, gender and age labels, illumination quality, and facial emotion for each subject.
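
For illustration, here is a rough sketch of how occluded and filtered variants of a face image could be generated; the block positions, noise levels, blur radii, and the plain posterize call are assumptions and do not reproduce the exact 6 + 9 variants or the fuzzy-logic posterization used in SoF:

```python
from PIL import Image, ImageDraw, ImageFilter, ImageOps
import numpy as np

def occlude(img: Image.Image, box) -> Image.Image:
    """Synthetic occlusion: paint a white block over a facial region (e.g., nose or mouth)."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, fill="white")
    return out

def add_gaussian_noise(img: Image.Image, sigma: float) -> Image.Image:
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def sof_style_variants(img: Image.Image, nose_box, mouth_box):
    """Produce occluded and filtered variants of one original image (illustrative parameters)."""
    variants = [occlude(img, nose_box), occlude(img, mouth_box)]
    for sigma in (5, 15, 30):                         # noise levels are assumptions
        variants.append(add_gaussian_noise(img, sigma))
    for radius in (1, 2, 4):                          # blur radii are assumptions
        variants.append(img.filter(ImageFilter.GaussianBlur(radius)))
    variants.append(ImageOps.posterize(img, bits=3))  # stand-in for fuzzy-logic posterization
    return variants
```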

Large Age-Gap Face Verification

Number of images: 3,828

Number of subjects: 1,010 celebrities

Year: 2017

Examples of face crops for matching pairs in the LAG dataset

Another challenge is a large age gap. Can an algorithm recognize a person based on a picture from their early childhood? The Large Age-Gap (LAG) dataset was created to help developers solve this challenging task.

The dataset is constructed with photos of celebrities found through Google Image Search and in YouTube videos. The notion of a large age gap has two interpretations: on the one hand, it refers to images with an extreme difference in age (e.g., 0 to 80 years old); on the other hand, it also refers to a significant difference in appearance due to the aging process. For instance, as the dataset authors point out, “0 to 15 years old is a relatively small difference in age but has a large change in appearance”.

The LAG dataset reflects both aspects of the large age-gap concept. It contains 3,828 images of 1,010 celebrities. For each identity, at least one child/young image and one adult/old image are present. Starting from the collected images, a total of 5,051 matching pairs has been generated.

Additional examples of matching pairs in the LAG dataset

Bottom Line

The face recognition problem is still topical. There are many challenging conditions that significantly threaten the performance of current facial recognition systems; it turns out that even glasses are a big problem. Fortunately, new face image datasets appear regularly. While each of them focuses on different aspects of the problem, together they build a great foundation for significant improvements in the performance of facial recognition systems.

Pushing the Limits of Unconstrained Face Detection: a New Challenge Dataset

30 August 2018
If your application performs landmark detection, face alignment, face recognition or face analysis, the first step is always face detection. And in fact, face detection has progressed tremendously in the last few years. Existing algorithms successfully address challenges such as large variations in scale, pose or appearance. However, there are still some issues that are not specifically captured by existing approaches and face detection datasets.

A group of researchers headed by Hajime Nada from Fujitsu identified a new set of challenges for face detection and collected a dataset of face images exhibiting these issues. In particular, their dataset includes images with rain, snow, haze, illumination variations, motion and focus blur, and lens impediments. It also contains a set of distractors: images that don't include human faces but do include objects that can easily be mistaken for faces.

Let’s now discover how existing state-of-the-art approaches to face detection perform on this new challenging dataset. Is there a gap between their performance and real-world requirements? We’ll find out right away!

Face detection datasets

Several datasets have been created specifically for face detection. The table below summarizes information on the most widely used datasets.

Let’s briefly discuss some pros and cons of these datasets:

  • AFW includes 205 images collected from Flickr. It has 473 face annotations as well as facial landmark and pose labels for each face. Variations in face appearance are very limited.
  • PASCAL FACE has a total of 851 images with 1,341 annotations. It also has limited variations in facial appearance.
  • FDDB has 2,845 images with 5,171 annotations. The authors of this dataset attempted to capture a wide range of difficulties. However, the images were collected from Yahoo! and mainly picture celebrities, making this dataset inherently biased.
  • MALF is a large dataset with 5,250 images and 11,900 annotations. It is constructed explicitly for fine-grained evaluations.
  • IJB-C is a massive dataset containing 138,000 face images, 11,000 face videos, and 10,000 non-face images. It was explicitly constructed for face detection and recognition.
  • WIDER FACE is a recently introduced dataset with over 32,300 images. It includes large variations in scale, pose, and occlusion but doesn’t focus on specifically capturing weather-based degradations.
  • UCCS dataset contains some weather-based degradations. However, the images were collected from a single location using a surveillance camera. Hence, this dataset lacks diversity.

As you can see, even though there are some huge datasets with large variations in face appearance, there is still a lack of datasets that capture weather-based degradations and other challenging conditions with a large set of images in each condition.

Here is where the proposed dataset comes in!

UFDD Dataset

The Unconstrained Face Detection Dataset (UFDD) includes 6,424 images with 10,895 annotations. It captures variations in weather conditions (rain, snow, haze), motion and focus blur, illumination variations, and lens impediments. See the distribution of images in the table below.

Notably, the UFDD dataset also includes a large set of distractor images, which are usually ignored by existing datasets. Distractors either contain non-human faces, such as animal faces, or no faces at all. The presence of such images is especially important for measuring the performance of a face detector in rejecting non-face images and for studying the false positive rate of the algorithms.

Images were collected from different sources on the web, such as Google, Bing, Yahoo, Creative Commons search, Pixabay, Pixels, Wikimedia Commons, Flickr, Unsplash, Vimeo, and Baidu. After collection and duplicate removal, the images were resized to a width of 1024 pixels while preserving the original aspect ratio.
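
This resizing step is straightforward to reproduce; a minimal sketch (with hypothetical file names) could look like this:

```python
from PIL import Image

def resize_to_width(path_in: str, path_out: str, target_width: int = 1024) -> None:
    """Resize an image to a fixed width while preserving its aspect ratio."""
    img = Image.open(path_in)
    w, h = img.size
    new_h = round(h * target_width / w)   # scale the height by the same factor as the width
    img.resize((target_width, new_h)).save(path_out)

# Example usage (hypothetical file names):
# resize_to_width("raw/ufdd_000123.jpg", "resized/ufdd_000123.jpg")
```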

For annotation, the images were uploaded to Amazon Mechanical Turk (AMT). Each image was assigned to around 5 to 9 AMT workers, who were asked to annotate all recognizable faces in the image. Once the annotation was complete, the labels were cleaned and consolidated.

Evaluation and Analysis

The researchers selected several recent face detection approaches to evaluate them on the proposed UFDD dataset:

  • Faster-RCNN is among the first end-to-end CNN-based object detection methods. It was selected as a baseline approach because it was the first to propose anchor boxes, on which most face detectors are now based.
  • HR-ER specifically addresses the problem of large variations in scale by designing scale-specific detectors based on ResNet-101.
  • SSH consists of multiple detectors placed on top of different convolutional layers of VGG-16 to explicitly address scale variations.
  • S3FD is based on the popular object detection framework called the single shot detector (SSD), with VGG-16 as the base network.

These algorithms were evaluated on the proposed UFDD dataset in two different scenarios:

  • After they were pre-trained on the original WIDER FACE dataset.
  • After they were pre-trained on the synthetic WIDER FACE dataset, which was created by augmenting the original images with variations such as rain, snow, blur and lens impediments (see the example in the image below).
Sample annotated images from the synthetic WIDER FACE dataset (left to right and top to bottom: rain, snow, motion blur, Gaussian blur, illumination, lens impediments)

The next figure shows the precision-recall curves corresponding to different approaches as evaluated on the UFDD dataset.

Evaluation results of different face detection algorithms on the proposed UFDD dataset, trained on the original WIDER FACE dataset (left) and synthetic WIDER FACE dataset (right)

Table 3 below contains the mean average precision (mAP) corresponding to different methods and different training sets.

As you can see, these new challenging conditions are not well addressed by existing state-of-the-art approaches. However, detection performance improves when the networks are trained on the synthesized dataset, which further confirms the need for a dataset that reflects real-world conditions such as rain and haze.
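
For readers unfamiliar with the metric, here is a minimal sketch of computing average precision from ranked detections; the matching of detections to ground truth and the exact interpolation rules of the UFDD protocol are not reproduced here:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """
    Average precision (area under the precision-recall curve) for one detector.

    scores : confidence score of each detection
    is_tp  : 1 if the detection matched a previously unmatched ground-truth face, else 0
    num_gt : total number of ground-truth faces in the evaluation set
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])

    recall = np.concatenate(([0.0], tp / num_gt))
    precision = np.concatenate(([1.0], tp / np.maximum(tp + fp, 1e-9)))
    # step-wise integration of precision over recall
    return float(np.sum(np.diff(recall) * precision[1:]))

# Toy example: 4 ranked detections, 3 ground-truth faces.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=3))
```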

Cohort analysis

Next, the researchers individually analyzed the effect of different conditions on the performance of recent state-of-the-art face detection methods. See the detection results for all benchmarked methods below:

Face detection results on the proposed UFDD dataset

Here are the results in the form of precision-recall curves.

Cohort analysis: Individual precision-recall curves of different face detection algorithms on the proposed UFDD dataset

The results demonstrate that all the degradations hinder the performance of the benchmarked methods. This doesn't come as a surprise, considering that they are trained on datasets that usually don't include a sufficient number of images with these conditions.

The evaluation results also uncover a significant effect of the distractors on the performance of face detection algorithms. These images contain objects that can easily be mistaken for human faces and thus lead to a high false positive rate. See the drop in detection accuracy in the presence of distractor images:

Face detection results on the proposed UFDD dataset with and without distractors

Bottom Line

Despite the immense progress of the last few years, face detection algorithms still show a significant performance gap when processing images taken in extreme weather conditions or containing motion and focus blur or lens impediments. That's mainly because existing datasets ignore these conditions.

The newly created UFDD dataset addresses this issue, and hopefully it will fuel further research in unconstrained face detection, so that we'll soon witness new state-of-the-art approaches that can easily detect faces in extreme conditions.

More Than 10 NLP Datasets Available from IBM’s Project Debater

30 August 2018
In June 2018, IBM announced that an artificial intelligence system had engaged in the first-ever live, public debates with humans. Their so-called “Project Debater”, the first cognitive system able to debate humans on complex topics, was tested against a champion debater and proved able to engage in a complex debate on controversial topics.

Project Debater

Project Debater is, in fact, just one in a series of successful large-scale AI projects from IBM Research focused on pushing one of the boundaries of AI: mastering language. Over a period of six years, a global research team led by IBM's lab in Haifa, Israel, built a cognitive system with remarkable debating capabilities: first, data-driven speech writing and delivery; second, listening comprehension that can identify key claims hidden within long continuous spoken language; and third, modeling human dilemmas in a unique knowledge graph to enable principled arguments.

Building such a complex AI system requires careful identification of separate tasks, as well as careful design and implementation of many modules and sub-modules capable of solving those tasks. Building Project Debater involved advancing research in a range of artificial intelligence fields (as noted by IBM). To facilitate this research, the large research team working on Project Debater developed and used a number of datasets and published them as open-source datasets for the community.

IBM's Project Debater datasets can be found on the project's page and can be downloaded upon request after filling out a request form. The datasets are released under the following licenses:

© Copyright Wikipedia.

© Copyright IBM 2014. Released under CC-BY-SA.

Datasets

All the data in the Project Debater dataset repository is divided into 5 major groups, each comprising several sub-groups of datasets. Structuring the datasets this way helps not only to organize the data conveniently but also to design the system's modules and sub-modules.

1. Argument Detection

The first major group of datasets is Argument Detection, which falls within the Argument Mining research field, a prominent area of AI. Four datasets are available under this group: “Claims Sentences Search”, “Evidence Sentences” and two “Claims and Evidence” datasets.

2. Argument Stance Classification and Sentiment Analysis

Another major group of NLP datasets from Project Debater is “Argument Stance Classification and Sentiment Analysis”. It contains three subgroups: Claim Stance (one dataset that includes stance annotations for claims), Sentiment Analysis (two large datasets that were used to build the stance classification engine) and Expert Stance, which contains datasets about experts' stance towards a debate (currently a single dataset, Wikipedia Category Stance, containing manually extracted Wikipedia categories).

3. Debate Speech Analysis

An important part of a debating cognitive system (and of many other conversational systems) is speech analysis and understanding. Project Debater built and used the “Recorded Debating Dataset”, containing recordings of 10 expert debaters. This dataset is available in three versions: the full dataset, compressed audio files, and a light version (no audio data).

4. Expressive Text to Speech

This group contains data on translating text to speech, and more specifically (in the single dataset currently available under this category) on emphasizing certain parts or words of the speech.

5. Basic NLP Tasks

The development of a cognitive debating system such as Project Debater involves many basic NLP tasks. This category contains the datasets developed for Project Debater which fall into “basic NLP” and are divided into three sub-groups:

5.1 Semantic Relatedness: Two datasets are available for semantic relatedness tasks: Wikipedia Oriented Relatedness Dataset and Multi-word Term Relatedness Benchmark

5.2 Mention Detection: A category which contains the datasets related to the task of detecting mentioned concepts (from a knowledge database) in a text (or speech). One dataset is available in this sub-group for the moment.

5.3 Text Clustering: A general group which comprises datasets used for text clustering. As of now, Project Debater has developed and used a single dataset named: “Thematic Clustering of Sentences”.

More datasets from the Project Debater are expected to be released as the project evolves, making the Debater Dataset a large and comprehensive repository of diverse NLP datasets.

“Falling Things”: A Synthetic Dataset by Nvidia for Pose Estimation

26 April 2018
Deep learning has made a lot of progress in image recognition and visual arts. The deep learning approach is used to solve problems such as image recognition, image captioning, and image segmentation. One of the main problems is that you need a lot of data to train a model, and such datasets are not easily available. Many research teams are working to create different kinds of datasets.

Similarly, robotic manipulation requires the detection and pose estimation of multiple object categories, which poses a two-fold challenge for designing robotic perception algorithms. The first difficulty is creating a model with a small amount of training data. Another problem is acquiring ground-truth data, which is time-consuming, error-prone and potentially expensive. Existing techniques for obtaining real-world data do not scale and, as a result, cannot generate the large datasets needed for training deep neural networks.

“Falling Things” Dataset

These problems can be alleviated by using synthetically generated data. Synthetic data is any production data applicable to a given situation that is not obtained by direct measurement. Researchers have been using synthetic data as an efficient means of both training and validating deep neural networks for tasks where obtaining ground truth is very hard, such as segmentation. 3D pose estimation and detection lie within this category, where acquiring ground truth is complicated, making these tasks a good fit for synthetic data.

The Falling Things (FAT) dataset consists of more than 61,000 images for training and validating robotic scene understanding algorithms in a household environment. Only two existing datasets provide accurate ground-truth poses of multiple objects, namely T-LESS and YCB-Video, but neither of them contains extreme lighting conditions or multiple modalities. FAT incorporates the capabilities that were missing from these two datasets.

Figure 1: The Falling Things (FAT) dataset was generated by placing 3D household object models in virtual environments. Pixelwise segmentation of the objects (bottom left), depth (bottom center), 2D/3D bounding box coordinates (bottom right)

Unreal Engine

The FAT dataset was generated using Unreal Engine 4 (UE4). The data was generated in three virtual environments within UE4: a kitchen, a sun temple, and a forest. These environments were chosen for their high-fidelity modeling and quality, as well as for the variety of indoor and outdoor scenes. For every environment, five manually selected locations cover a range of terrain and lighting conditions (e.g., on a kitchen counter or tile floor, next to a rock, above a grassy field, and so forth). This yields 15 different locations with a variety of 3D backgrounds, lighting conditions, and shadows.

There are 21 household objects from the YCB dataset. The objects were placed in random positions and orientations within a vertical cylinder of radius 5 cm and height 10 cm placed at a fixation point. As the objects fell, the virtual camera system was rapidly teleported to random azimuths, elevations, and distances with respect to the fixation point to collect data. The azimuth ranged from -120° to +120° (to avoid collision with the wall, when present), the elevation from 5° to 85°, and the distance from 0.5 m to 1.5 m. The virtual camera used in data generation consists of a pair of stereo RGBD cameras. This design decision allows the dataset to support at least three different sensor modalities: whereas single RGBD sensors are commonly used in robotics, stereo sensors have the potential to yield higher-quality output with fewer distortions, and a monocular RGB camera has distinct advantages regarding cost, simplicity, and availability.
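
A minimal sketch of sampling such camera positions in spherical coordinates around the fixation point is shown below; the coordinate convention is an assumption, and only the ranges come from the description above:

```python
import math
import random

def sample_camera_position(rng: random.Random):
    """Sample a random virtual-camera position around the fixation point (placed at the origin)."""
    azimuth = math.radians(rng.uniform(-120.0, 120.0))   # limited to avoid hitting a wall behind the scene
    elevation = math.radians(rng.uniform(5.0, 85.0))
    distance = rng.uniform(0.5, 1.5)                     # meters

    # Spherical-to-Cartesian conversion (z up); the camera looks at the origin.
    x = distance * math.cos(elevation) * math.cos(azimuth)
    y = distance * math.cos(elevation) * math.sin(azimuth)
    z = distance * math.sin(elevation)
    return (x, y, z)

rng = random.Random(42)
camera_positions = [sample_camera_position(rng) for _ in range(10)]  # e.g., one pose per captured frame
```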

The dataset consists of 61,500 unique images with a resolution of 960 × 540, and it is divided into two parts:

1. Single objects: the first part of the dataset was generated by dropping each object model in isolation about 5 times at each of the 15 locations.

2. Mixed objects: the second part of the dataset was generated in the same manner, except that a random number of objects, sampled uniformly from 2 to 10, was dropped. To allow multiple instances of the same category in an image, the objects were sampled with replacement.

A natural way to split the dataset for training and testing is to hold out one location per scene as the test set and use the remaining data for training. Figure 2 shows the total number of occurrences of each object class in the FAT dataset.
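
A simple sketch of such a location-based split (with hypothetical location identifiers and sample records) could look like this:

```python
def split_by_location(samples, test_locations):
    """
    samples        : list of dicts with at least a 'location' key
    test_locations : set of held-out location ids (one per scene)
    """
    train = [s for s in samples if s["location"] not in test_locations]
    test = [s for s in samples if s["location"] in test_locations]
    return train, test

# Example with made-up location ids and file names:
samples = [
    {"location": "kitchen_0", "image": "img_000001.png"},
    {"location": "kitchen_1", "image": "img_000002.png"},
    {"location": "forest_0", "image": "img_000003.png"},
]
held_out = {"kitchen_0", "temple_0", "forest_0"}  # one held-out location per scene
train, test = split_by_location(samples, held_out)
```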

Figure 2: Total appearance count of the 21 YCB objects in the FAT dataset. Light color bars indicate object visibility higher than 25%, while solid bars indicate visibility higher than 75%.

Bottom Line

This new dataset will help accelerate research in object detection, pose estimation, segmentation and depth estimation. The proposed dataset focuses on household items from the YCB dataset.

Figure 3: Datasets for object detection and pose estimation. The FAT dataset provides all of the listed capabilities.

This dataset helps researchers find solutions to open problems such as object detection, pose estimation, depth estimation from monocular and/or stereo cameras, and depth-based segmentation, advancing the field of robotics.

Figure 4: Some examples from the dataset

Note: The dataset will be publicly available no later than June 2018.

Muneeb ul Hassan