New Datasets for 3D Human Pose Estimation

8 November 2018

Human pose estimation is a fundamental problem in computer vision. A computer's ability to recognize and understand humans in images and videos is crucial for many tasks, including autonomous driving, action recognition, human-computer interaction, augmented reality and robotics vision.

In recent years, significant progress has been achieved in 2D human pose estimation. The crucial factor behind this success is the availability of large-scale annotated human pose datasets that make it possible to train networks for 2D human pose estimation. At the same time, advances in 3D human pose estimation remain limited because obtaining ground-truth information on dense correspondences, depth, motion, body-part segmentation and occlusions is a very challenging task.

In this article, we present several recently created datasets that attempt to address the shortage of annotated datasets for 3D human pose estimation.

DensePose

Number of images: 50K

Number of annotated correspondences: 5M

Year: 2018

DensePose is a large-scale ground-truth dataset with image-to-surface correspondences manually annotated on 50K COCO images. To build this dataset, the Facebook AI Research team employed human annotators who established dense correspondences from 2D images to surface-based representations of the human body using a purpose-built annotation pipeline.

As shown below, in the first stage annotators delineate regions corresponding to visible, semantically defined body parts. In the second stage, every part region is sampled with a set of roughly equidistant points, and annotators are asked to bring these points into correspondence with the surface. To avoid manual rotation of the surface, the researchers provide annotators with six pre-rendered views of the same body part and allow them to place landmarks on any of these views.

Annotation pipeline

Below are visualizations of annotations on images from the validation set: Image (left), U (middle) and V (right) values for the collected points.

Visualization of annotations

DensePose is the first manually-collected ground truth dataset for the task of dense human pose estimation.

SURREAL

Number of frames: 6.5M

Number of subjects: 145

Year: 2017

Generating photorealistic synthetic images

SURREAL (Synthetic hUmans foR REAL tasks) is a new large-scale dataset with synthetically generated but realistic images of people rendered from 3D sequences of human motion capture data. It includes over 6 million frames accompanied by ground-truth poses, depth maps, and segmentation masks.

As described in the original research paper, images in SURREAL are rendered from 3D sequences of MoCap data. Since the realism of synthetic data is usually limited, the researchers create the synthetic bodies with the SMPL body model, whose parameters are fitted by the MoSh method given raw 3D MoCap marker data. Moreover, the creators of the SURREAL dataset ensured a large variety of viewpoints, clothing, and lighting.

The pipeline for generating a synthetic human is demonstrated below:

  • a 3D human body model is posed using motion capture data;
  • a frame is rendered using a background image, a texture map on the body, lighting and a camera position;
  • all the “ingredients” are randomly sampled to increase the diversity of the data;
  • generated RGB images are accompanied with 2D/3D poses, surface normal, optical flow, depth images, and body-part segmentation maps.
Pipeline for generating synthetic data
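
To make the random "ingredient" sampling above concrete, here is a minimal sketch of what one draw of a frame configuration could look like. All parameter names, counts and ranges are illustrative assumptions; the actual generator samples SMPL shape coefficients, MoSh-fitted MoCap poses, clothing textures, spherical-harmonic lighting and camera placement as described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_frame_ingredients():
    """Draw one random configuration for a synthetic frame.

    The ranges and counts below are illustrative placeholders, not the
    values used by the SURREAL generator.
    """
    return {
        # 10 SMPL shape coefficients, roughly standard-normal
        "shape_betas": rng.normal(0.0, 1.0, size=10),
        # which MoCap clip and clothing texture to use (placeholder counts)
        "mocap_clip_id": int(rng.integers(0, 2000)),
        "texture_id": int(rng.integers(0, 900)),
        # spherical-harmonic lighting coefficients
        "lighting_sh": rng.normal(0.0, 0.7, size=9),
        # camera placed on a circle around the body
        "camera_yaw_deg": float(rng.uniform(0.0, 360.0)),
        # random static background image
        "background_id": int(rng.integers(0, 10000)),
    }

config = sample_frame_ingredients()
print(config["shape_betas"].shape, config["camera_yaw_deg"])
```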

The resulting dataset contains 145 subjects, more than 67.5K clips and over 6.5M frames.

Even though SURREAL contains synthetic images, the researchers behind this dataset demonstrate that CNNs trained on SURREAL allow for accurate human depth estimation and human part segmentation in real RGB images. Hence, this dataset provides new possibilities for advancing 3D human pose estimation using cheap and large-scale synthetic data.

UP-3D

Number of subjects: 5,569

Number of images: 5,569 training images and 1,208 test images

Year: 2017

Bottom: Validated 3D body model fits on various datasets constitute the initial UP-3D dataset. Top: improved 3D fits can extend the initial dataset

UP-3D is a dataset that "Unites the People" of several existing datasets for multiple tasks. In particular, using the recently introduced SMPLify method, the researchers obtain high-quality 3D body model fits for several human pose datasets; human annotators only sort the fits into good and bad ones.

This dataset combines the two LSP datasets (11,000 training images and 1,000 test images) and the single-person part of the MPII-HumanPose dataset (13,030 training images and 2,622 test images). While it would have been possible to use an automatic segmentation method to provide foreground silhouettes, the researchers involved human annotators for reliability. They built an interactive annotation tool on top of the Opensurfaces package to work with Amazon Mechanical Turk (AMT) and used the interactive GrabCut algorithm to obtain image-consistent silhouette borders.

The annotators were asked to evaluate the fits for:

  • foreground silhouettes;
  • six body part segmentation.

While the average foreground labeling task took 108 s on LSP images and 168 s on MPII images, annotating the segmentation for six body parts took considerably longer: 236 s on average.

The annotators sorted the fits into good and bad ones, and here are the percentages of accepted fits per dataset:

Thus, the validated fits formed the initial UP-3D dataset with 5,569 training images and 1,208 test images. After the experiments on semantic body part segmentation, pose estimation and 3D fitting, the improved 3D fits can extend the initial dataset.

Results from various methods trained on labels generated from the UP-3D dataset

The presented dataset allows for a holistic view of human-related prediction tasks. It sets a new mark in terms of level of detail by including high-fidelity semantic body part segmentation into 31 parts and 91-landmark human pose estimation. It was also demonstrated that training the pose estimator on the full 91-keypoint dataset helps improve the state of the art for 3D human pose estimation on the two popular benchmark datasets HumanEva and Human3.6M.

Bottom Line

As you can see, there are many possible approaches to building a dataset for 3D human pose estimation. The datasets presented here focus on different aspects of recognizing and understanding humans in images, yet all of them can be handy for estimating human poses in real-life applications.

Learning Physical Skills from YouTube Videos using Deep Reinforcement Learning

6 November 2018

Realistic, humanlike characters represent a very important area of computer animation. These characters are vital components of many applications, such as cartoons, computer games, cinematic special effects, virtual reality and artistic expression. However, character animation production typically goes through a number of creation stages and as such remains a laborious task.

Previous Work

Such a labor-intensive task represents a bottleneck in the whole process of computer animation creation. In the past, there have been a number of attempts to overcome this problem by supporting the task with automatic tools or even automating it completely.

Many of the approaches proposed in the past have failed to produce robust and naturalistic motion controllers that enable virtual characters to perform complex skills in physically simulated environments. Early attempts focused mostly on understanding the physics and biomechanics involved and on formulating and replicating motion patterns for virtual characters. More recently, data-driven approaches have attracted attention. However, most data-driven approaches, save for a few exceptions, are based on motion capture data, which often requires costly instrumentation and heavy pre-processing.

State-of-the-art Idea

Recently, researchers from Berkeley AI Research at the University of California have proposed a novel Reinforcement Learning-based approach for learning character animation from videos.

Combining motion estimation from videos and deep reinforcement learning, their method is able to synthesize a controller given a monocular video as input. Additionally, the proposed method is able to predict potential human motions from still images, by forward simulation of learned controllers initialized from the observed pose.

The proposed pipeline for learning acrobatic skills from YouTube videos.

Method

The researchers propose a framework that takes a monocular video as input and outputs an imitation of the motion performed by a simulated character model. The whole approach is based on pose estimation in the frames of the video, which is later used for motion reconstruction and motion imitation to achieve the final goal.

First, the input video is processed by the pose estimation stage, where learned 2D and 3D pose estimators are applied to estimate the pose of the actor in each frame. Next, the set of predicted poses proceeds to the motion reconstruction stage, where a reference motion trajectory is optimized so that it is consistent with both the 2D and 3D pose predictions while also enforcing temporal consistency between frames. The reference motion is then used in the motion imitation stage, where a control policy is trained to enable the character to reproduce the reference motion in a physically simulated environment.

Pose Estimation Stage

The first module in the pipeline is the pose estimation module. At this stage, the goal is to estimate the pose of the actor from a single still image, i.e., each video frame. A number of challenges have to be addressed at this point in order to obtain accurate pose estimates. First, the variability in body orientation among different actors performing the same movement is very high. Second, pose estimation is performed on each frame independently of the previous and next frames, without accounting for temporal consistency.

To address both of these issues, the researchers use an ensemble of existing, proven pose estimation methods. Along with that, they use a simple data augmentation technique to improve pose predictions in the domain of acrobatic movements.

They train an ensemble of estimators on the augmented dataset and obtain 2D and 3D pose estimations for each frame, which define the 2D and 3D motion trajectories, respectively.

Comparison of the motions generated by different stages of the pipeline for backflip motion. Top-to-Bottom: Input video clip, 3D pose estimator, 2D pose estimator, simulated character.

Motion Reconstruction Stage

In the motion reconstruction stage, the independent predictions from the pose estimators are consolidated into the final reference motion. The goal of this stage is to improve the quality of the reference motions by fixing errors and removing motion artifacts that often manifest as nonphysical behaviours. According to the researchers, these motion artifacts appear due to inconsistent predictions across adjacent frames.

An optimization technique is again applied at this stage: a common 3D pose trajectory is optimized to agree with both pose estimators while enforcing temporal consistency between consecutive frames. The optimization is performed in the latent space of the pose estimators, leveraging their encoder-decoder architecture.

Motion Imitation Stage

In the final stage, deep reinforcement learning is applied to reach the final objective. From a machine learning perspective, the goal here is to learn a policy that enables the character to reproduce the demonstrated skill in a physically simulated environment. The reference motion extracted previously is used to define an imitation objective, and a policy is then trained to imitate the given motion.

The reward function is designed to incentivize the character to track the joint orientations of the reference motion. In particular, quaternion differences are computed between the character's joint rotations and the joint rotations of the extracted reference motion.
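
As a rough illustration of a quaternion-difference tracking reward, here is a sketch that scores how closely the character's joint rotations match the reference. The exponential form, the scale factor and the omission of the other reward terms (velocities, end-effectors, centre of mass) are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_tracking_reward(char_quats, ref_quats, scale=2.0):
    """Tracking reward from per-joint quaternion differences.

    char_quats, ref_quats: arrays of shape (J, 4) in (x, y, z, w) order.
    The exponentiated sum of squared rotation-angle differences is an
    illustrative choice; the full objective also has velocity,
    end-effector and centre-of-mass terms.
    """
    # relative rotation between reference and character for every joint
    diff = R.from_quat(ref_quats) * R.from_quat(char_quats).inv()
    angles = diff.magnitude()  # rotation angle of each difference, in radians
    return float(np.exp(-scale * np.sum(angles ** 2)))

# identical poses give the maximum reward of 1.0
identity = np.tile([0.0, 0.0, 0.0, 1.0], (14, 1))
print(pose_tracking_reward(identity, identity))
```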

Final result of the method: Character imitating a 2-handed vault.

Results

To demonstrate the proposed framework and evaluate the method, the researchers employ a 3D humanoid character and a simulated Atlas robot. A qualitative evaluation was done by comparing snapshots of the simulated characters with the original video demonstrations. All video clips were collected from YouTube and depict human actors performing various acrobatic motions. As mentioned in the paper, since it is difficult to quantify the difference between the motion of the actor in the video and the simulated character, performance was evaluated with respect to the extracted reference motion. The figures below show overlapping snapshots of the real videos and the simulated characters for a qualitative evaluation.

The simulated Atlas robot performing skills learned from videos.

Qualitative evaluation using simulated characters performing different skills learned from video clips.

Conclusions

The proposed method for data-driven animation leverages the abundance of publicly available video clips on the web to learn full-body motion skills, which is a significant contribution. The framework shows the potential of combining multiple different techniques to reach a specific objective. A big advantage of the modular design is that new advances relevant to the various stages of the pipeline can be incorporated later to further improve the overall effectiveness of the framework.

Dense Human Pose Estimation In The Wild

23 August 2018

Scene understanding is one of the holy grails of computer vision. A lot of research has been done towards the ultimate goal of understanding a scene given an image, and inferring any kind of additional information helps push that understanding forward. Until recently, researchers focused mostly on simpler tasks in order to provide some satisfactory level of scene description and understanding. In the past few years, however, more and more complex problems have been tackled and solved, at least to some extent, from object detection and classification, segmentation, object localization and scene classification all the way up to contextual reasoning.

We have seen remarkable advancement in inferring 3D information from 2D data. Recent work from Google DeepMind showed that it is possible to render a 3D scene from flat 2D images. Addressing these kinds of problems pushes the boundaries of image understanding even further. Researchers from the French Institute for Research in Computer Science (INRIA) and Facebook AI Research have proposed a method for dense human pose estimation from images. In their paper, they propose a deep learning method that infers a 3D, surface-based representation of the human body from a single flat image. As mentioned before, scene or context understanding has so far been addressed through smaller sub-problems such as object detection, classification and localization. The novel dense pose estimation method proposed in this work relies on these problems as prerequisites and builds gradually on top of their outcome.

Besides the proposed architecture for learning a surface-based representation of the human body, the authors create a large-scale ground-truth dataset with image-to-surface correspondences manually annotated using 50 thousand images from the COCO Dataset.

Dataset

Having a rich, high-quality labeled dataset of sufficient size is crucial in supervised learning. Different problems require different labeling of the data, and very often this represents a bottleneck in the modeling process. For this reason, the researchers created an annotated dataset with image-to-surface correspondences by annotating 50K images from the COCO dataset. They introduce this new dataset, called DensePose-COCO, along with evaluation metrics as another contribution. The dataset was created with a smart annotation pipeline that reduces the required human effort as much as possible. The annotation includes segmenting the image and marking correspondences using the SMPL model to obtain UV fields.

The annotation pipeline used to create the dataset

Method

To address the problem of estimating the human body surface from flat 2D images, the authors cast the problem as regressing body surface coordinates at each image pixel. Using the manually annotated dataset, they train a deep neural network architecture, Mask R-CNN, in a fully supervised manner. They combine the Mask R-CNN network with DenseReg (a dense regression system) to obtain the correspondences between the RGB image and a 3D body surface model.

The DensePose-RCNN architecture

The first and simpler architecture employed is a fully convolutional network (FCN) that combines classification and regression. The first part segments the image by classifying each pixel into one of several classes: background or a specific region of the body. In this way, a coarse estimate of the surface coordinate correspondences is passed to the second part, which regresses the exact coordinates. The first part is trained with a pixel-wise cross-entropy loss. The second part, i.e., the regression of the exact coordinates, maps each pixel to a point in a 2D coordinate system given by the parametrization of each piece (part of the human body). In fact, the second part acts as a correction to the classification of the first part, so the regression loss is taken into account only if the pixel lies within the specific part. Finally, each pixel is mapped to the U, V coordinates of the parametrization of its body part (in this case one of the 25 defined body parts).
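
A minimal PyTorch sketch of how such a combined loss could be written: pixel-wise cross-entropy for the part classification, plus a UV regression term that is only counted for pixels that fall inside an annotated body part. Tensor layouts, the smooth-L1 choice and the helper name are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def part_and_uv_loss(part_logits, uv_pred, part_labels, uv_gt):
    """part_logits: (B, C, H, W) scores over background plus the body-part classes.
    uv_pred:     (B, C, 2, H, W) per-part UV regression outputs.
    part_labels: (B, H, W) LongTensor of ground-truth part indices (0 = background).
    uv_gt:       (B, 2, H, W) ground-truth UV coordinates.
    """
    # classification: which part (or background) each pixel belongs to
    cls_loss = F.cross_entropy(part_logits, part_labels)

    # pick the UV prediction of the labelled part at every pixel
    B, C, _, H, W = uv_pred.shape
    idx = part_labels.view(B, 1, 1, H, W).expand(B, 1, 2, H, W)
    uv_for_label = uv_pred.gather(1, idx).squeeze(1)        # (B, 2, H, W)

    # regression loss only on foreground pixels (inside an annotated part)
    fg = part_labels > 0
    if fg.any():
        reg_loss = F.smooth_l1_loss(uv_for_label.permute(0, 2, 3, 1)[fg],
                                    uv_gt.permute(0, 2, 3, 1)[fg])
    else:
        reg_loss = uv_pred.sum() * 0.0
    return cls_loss + reg_loss

# toy shapes just to show the expected tensor layout
logits = torch.randn(1, 25, 32, 32)
uv = torch.rand(1, 25, 2, 32, 32)
labels = torch.randint(0, 25, (1, 32, 32))
uv_target = torch.rand(1, 2, 32, 32)
print(part_and_uv_loss(logits, uv, labels, uv_target))
```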

The output of the method (left). The new dataset with the body segmentation and parametrization in a new 2D coordinate system (right)

The authors improve the method by introducing region-based regression. They place a fully convolutional network (as discussed above) on top of ROI pooling, entirely devoted to these two tasks, with a classification head and a regression head that provide the part assignment and part coordinate predictions.

The final architecture is a cascade that proposes regions of interest (ROIs), extracts region-adapted features through ROI pooling and feeds the results to a region-specific branch.
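
A simplified sketch of this region-based processing, using torchvision's roi_align as a stand-in for the ROI pooling step; the feature sizes, box coordinates and the tiny 1x1-convolution heads are made up for illustration and are much simpler than the actual DensePose-RCNN heads.

```python
import torch
from torchvision.ops import roi_align

# feature map from a backbone: (batch, channels, height, width)
features = torch.randn(1, 256, 64, 64)

# one region of interest per row: (batch_index, x1, y1, x2, y2) in feature-map coords
rois = torch.tensor([[0.0, 10.0, 8.0, 50.0, 60.0]])

# pool each region to a fixed grid, as R-CNN-style heads expect
region_feats = roi_align(features, rois, output_size=(14, 14), spatial_scale=1.0)

# toy region-specific heads: part classification and per-part UV regression
num_classes = 25                      # background plus body-part regions
cls_head = torch.nn.Conv2d(256, num_classes, kernel_size=1)
uv_head = torch.nn.Conv2d(256, 2 * num_classes, kernel_size=1)
part_scores = cls_head(region_feats)  # (1, 25, 14, 14)
uv_coords = uv_head(region_feats)     # (1, 50, 14, 14)
print(part_scores.shape, uv_coords.shape)
```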

The final cross-cascading architecture

Cross-modal supervision

The method is further improved by introducing cross-modal supervision. The supervision signal is weak: only a small subset of image pixels is annotated in each training sample. Training the network in this way is still possible by excluding the pixels that have no ground-truth correspondence from the pixel-wise loss calculation.

However, to further amplify the supervision signal, they propose a cross-modal supervision approach with a teacher network that learns the mapping from a sparsely annotated surface to a fully annotated human body surface. They argue that this kind of "in-painting" of the supervision signal is efficient and improves the overall results.

Cross-modal supervision is used by including a teacher network that “in-paints” the missing pixels

Evaluation and conclusions

The comparison with other methods is given in the tables below. It is worth noting that the comparison between this approach and previous approaches has to be interpreted carefully, since the new method makes use of the new dataset the authors created, DensePose-COCO.

AUC comparison with different methods

AUC and IoU comparison with different architectures and approaches

Geodesic error

The qualitative and quantitative evaluations show that the method infers body surface coordinates with high accuracy and handles large amounts of occlusion, pose variation and scale change. Moreover, the results show that a fully convolutional approach underperforms compared with the newly proposed ROI-based method trained with cross-modal supervision.

Testing the method on different realistic images

Synthesising Images of Humans in Unseen Poses

19 July 2018

Humans have an incredible capability to imagine things in a different context. At the core of our intelligence are imagination and learning from experience. The two are connected, and creativity always draws on memory and experience. That is why we can estimate the shape of an object even though we see it from only one particular viewpoint, and we can imagine the motion or deformation of an object just by looking at it while it is static. Our memory provides us with the ability to visualize complicated things, such as what a person would see in a different context or a different pose.

Researchers from the Massachusetts Institute of Technology have addressed the computer vision task of novel human pose synthesis. In their work, they present an image synthesis method that, given an image containing a person and a desired target pose, synthesises a realistic depiction of the person in that pose.

synthesizing images of human poses

They combine multiple techniques, and they frame the novel pose synthesis as a deep learning problem. The approach is unique, and it produces realistic images as they demonstrate in a few different use cases.

Problem Statement

The problem statement: given an image and a target pose, synthesise the person in the picture in that pose. The task of novel human pose synthesis is non-trivial, and there are a few crucial things to take into account.

The problem statement: Given an image and a target pose, synthesise the person in the image in that pose

First, the generated image has to be as realistic as possible. Second, changing the pose requires segmenting the person from the background or other objects present in the picture. Third, introducing a novel pose leaves gaps in the background caused by disocclusion, which have to be filled appropriately; moreover, self-occlusion has to be handled as well.

Capturing these complex changes in the image space represents a challenge and in their approach, the researchers tackle the problem by dividing it into smaller sub-problems, solved by separate modules.

Solution

In fact, they design a modular architecture comprising several modules, each addressing a different challenge, to provide realistic image synthesis in the end. They propose an architecture of four modules:

A. Source Image Segmentation

B. Spatial Transformation

C. Foreground Synthesis

D. Background Synthesis

The architecture is trained in a supervised manner, mapping a tuple of a source image, source pose and target pose to a target image; the whole architecture is trained jointly.

The proposed architecture comprising 4 modules

A. Source Image Segmentation

Differences in poses and motion introduced by a pose transformation often involve several moving body parts, large displacements, occlusions and self-occlusion. To handle this, the first module segments the source image. The segmentation is two-fold: first, the image is segmented into foreground and background; then the foreground (the person) is segmented into body parts such as arms, legs, etc. The output of the segmentation stage therefore consists of one background layer and L foreground layers, one for each of the L predefined body parts.

As mentioned before, the input is a tuple of the source image, its pose and the desired target pose. Unlike the source image, which is an RGB image, the poses are defined as a stack of multiple layers: a pose is represented as a 3D volume in R^(H×W×J). Each of the J layers (or channels) in the pose representation contains a “Gaussian bump” centred at the (x, y) location of the corresponding joint. The Gaussian representation (instead of a deterministic dense representation) introduces a form of regularization, since joint location estimates can often be noisy or incorrect. In the experiments, the authors use 14 joints (head, neck, shoulders, elbows, wrists, hips, knees and ankles), i.e., 14 channels.
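
To make this representation concrete, here is a small sketch that builds such a J-channel volume of Gaussian bumps from 2D joint locations; the image size and the sigma value are arbitrary choices for illustration.

```python
import numpy as np

def pose_to_gaussian_volume(joints_xy, height, width, sigma=6.0):
    """Encode J joints as an (H, W, J) stack of Gaussian "bumps".

    joints_xy: array of shape (J, 2) holding (x, y) pixel coordinates.
    The soft Gaussian acts as a mild regularizer against noisy joint estimates.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    volume = np.zeros((height, width, len(joints_xy)), dtype=np.float32)
    for j, (x, y) in enumerate(joints_xy):
        volume[:, :, j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return volume

# 14 joints on a 256x256 image; this volume is what gets concatenated
# channel-wise with the RGB image before the U-Net segmentation module
joints = np.random.randint(0, 256, size=(14, 2))
pose_volume = pose_to_gaussian_volume(joints, 256, 256)
print(pose_volume.shape)  # (256, 256, 14)
```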

The results from the segmentation module using Gaussian bumps

The segmentation module is a U-Net that takes the concatenated volume (the input image plus the pose layers) as input and outputs L layer masks specifying the rough location of each body part. The output is, in fact, a 2D Gaussian mask over the approximate spatial region of each body part, which yields the desired segmentation.

B. Spatial Transformation

The segmented layers from the segmentation module are then spatially transformed to fit the desired pose parts. The spatial transformation is not learned but directly computed from the input poses.

C. Foreground Synthesis

The foreground synthesis module is again a U-shaped network (an encoder-decoder with skip connections) that takes the spatially transformed layers concatenated with the target pose layers and produces two outputs (by branching the end of the network): the target foreground and the target mask.

D. Background Synthesis

The background synthesis module fills in the missing background, i.e., the regions occluded by the person in the input image. This module is also a U-Net; it takes the input image (with Gaussian noise in place of the foreground pixels) together with the input pose mask and outputs a realistic background without the foreground (the person in the image).

The results of the separate submodules building gradually the final synthesised image

Image Synthesis

Finally, the target background and foreground images are fused by a weighted linear sum that takes the target mask into account (a sketch of this compositing step is shown after the loss description below).

As in many recent generative models, the researchers add an adversarial discriminator to push the generator towards realistic images. The generative model was trained using an L1 loss, a feature-wise loss (denoted L-VGG) and, finally, a combined L-VGG + GAN loss that uses the binary cross-entropy classification error of the discriminator.
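
Here is a minimal PyTorch sketch of the compositing step referenced above, together with the L1 term; the mask-weighted sum is the standard compositing form, and the VGG feature loss and adversarial loss are left out for brevity, so this is an illustration rather than the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def compose_and_l1(foreground, background, mask, target):
    """Fuse the synthesized foreground and background with the predicted
    target mask, then score the composite with an L1 loss.

    foreground, background, target: (B, 3, H, W) images; mask: (B, 1, H, W) in [0, 1].
    """
    composite = mask * foreground + (1.0 - mask) * background
    return composite, F.l1_loss(composite, target)

fg = torch.rand(1, 3, 128, 128)
bg = torch.rand(1, 3, 128, 128)
mask = torch.rand(1, 1, 128, 128)
target = torch.rand(1, 3, 128, 128)
composite, loss = compose_and_l1(fg, bg, mask, target)
print(composite.shape, float(loss))
```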

Results with the different loss function
A plot of pixel gradient magnitude for different loss functions

Evaluation

The method was evaluated using videos of people performing actions, collected from YouTube. The experiments were done with videos from three action classes: golf swings, yoga/workout routines and tennis actions, with 136, 60 and 70 videos, respectively. Simple data augmentation is also used to increase the size of the dataset.

Errors (lower is better) and SSIM score (higher is better) of the proposed method vs. a UNet architecture
Some outputs of the proposed method
Comparison of different loss functions using the same input

Bottom line

The evaluation shows that the method is capable of synthesizing realistic images across different action classes. Though trained on pairs of images within the same video, the method can generalize to pose-appearance combinations that it has never seen (e.g., a golfer in a tennis player’s pose). The decoupling approach proved successful in this non-trivial task, and it shows that tackling a problem by dividing it into sub-problems can give outstanding results despite the complexity of the problem itself.

Dane Mitrev

Inferring a 3D Human Pose out of a 2D Image with FBI

16 July 2018

Autonomous driving, virtual reality, human-computer interaction and video surveillance are all application scenarios where you would like to derive a 3D human pose from a single RGB image. Significant advances have been made in this area since convolutional neural networks were employed to solve the problem of 3D pose inference. However, the task remains challenging for outdoor environments, as it is very difficult to obtain 3D pose ground truth for in-the-wild images.

So, let’s see how this fancy “FBI” abbreviation helps with inferring a 3D human pose out of a single RGB image.

Suggested Approach

A group of researchers from Shenzhen (China) has proposed a novel framework for deriving a 3D human pose from a single image. In particular, they suggest exploiting, for each bone, the information of whether it points forward or backward with respect to the camera view. They refer to this data as Forward-or-Backward Information (or simply, FBI).

Their method starts with training a convolutional neural network with two branches: one predicts the 2D joint locations from an image, and the other predicts the FBI of the bones. In fact, several state-of-the-art methods use information on the 2D joint locations to predict a 3D human pose. However, this is an ill-posed problem, since different valid 3D poses can explain the same observed 2D joints. At the same time, knowing whether each bone points forward or backward, combined with the 2D joint locations, determines a unique 3D joint position. So, the researchers claim that feeding both the 2D joint locations and the FBI of bones into a deep regression network provides better predictions of the 3D positions of joints.

Distribution of out-of-plane angles for all bones marked as “uncertain”

Furthermore, to support the training, they developed an annotation user interface and labeled the FBI for around 12,000 in-the-wild images. They simplified the problem by distinguishing 14 bones, each having one of three states with respect to the camera view: forward, backward or parallel to the line of sight. Hired annotators were asked to label images randomly selected from the MPII dataset, where the 2D bones are provided. For each bone, the annotator was asked to choose one of three options: forward, backward or uncertain (considering how difficult it is to give an accurate judgment for the “parallel to sight” option). In total, around 20% of bones were reportedly marked as uncertain. The figure above illustrates the distribution of out-of-plane angles for all uncertain bones. As expected, people show more uncertainty when the bone is closer to parallel with the view plane.

Network Architecture

Let's now look in more depth at the network architecture of the suggested approach.

Network architecture

The network consists of three components:

1. 2D pose estimator. It takes an image of a human as input and outputs the 2D locations of 16 joints of the human.

2. FBI predictor. This component also takes an image as input but outputs the FBI of the 14 bones, each with one of three possible statuses: forward, backward or uncertain. The network starts with a sequence of convolutional layers, followed by two successive stacked hourglass modules. The extracted feature maps are then fed into a set of convolutional layers followed by a fully connected layer and a softmax layer that outputs the classification results.

3. 3D pose regressor. At this stage, a deep regression network is learned to infer the 3D coordinates of the joints by taking both their 2D locations and the FBI as input. To retain more information, the regressor takes the probability matrix generated by the softmax layer as input. Thus, the 2D locations and the probability matrix are concatenated together and then mapped to the 3D pose by two cascaded blocks.
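
A toy sketch of how the regressor's input can be assembled and mapped to 3D joints; the hidden sizes and the plain MLP used as a stand-in for the two cascaded blocks are assumptions for illustration.

```python
import torch
import torch.nn as nn

NUM_JOINTS, NUM_BONES, NUM_FBI_CLASSES = 16, 14, 3

# inputs: 2D joint locations and the softmax probability matrix over FBI classes
joints_2d = torch.rand(1, NUM_JOINTS, 2)                       # (x, y) per joint
fbi_probs = torch.softmax(torch.rand(1, NUM_BONES, NUM_FBI_CLASSES), dim=-1)

# concatenate both cues into a single feature vector
regressor_input = torch.cat([joints_2d.flatten(1), fbi_probs.flatten(1)], dim=1)

# stand-in for the two cascaded fully connected blocks
regressor = nn.Sequential(
    nn.Linear(NUM_JOINTS * 2 + NUM_BONES * NUM_FBI_CLASSES, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, NUM_JOINTS * 3),                           # 3D coordinates per joint
)
pose_3d = regressor(regressor_input).view(1, NUM_JOINTS, 3)
print(pose_3d.shape)  # torch.Size([1, 16, 3])
```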

Comparisons against existing methods

The quantitative comparison was carried out on Human3.6M, a dataset containing 3.6 million RGB images that capture 7 professional actors performing 15 different activities (e.g., walking, eating, sitting). The mean per joint position error (MPJPE) between the ground truth and the prediction was used as the evaluation metric, and the results are presented in Table 1.
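
MPJPE is simply the mean Euclidean distance between predicted and ground-truth joint positions; a minimal sketch with made-up numbers:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance (e.g. in mm)
    between predicted and ground-truth 3D joints, both of shape (J, 3)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

gt = np.random.rand(16, 3) * 1000        # 16 joints, coordinates in mm
pred = gt + np.random.randn(16, 3) * 50  # noisy prediction
print(mpjpe(pred, gt))
```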

Table 1. Quantitative comparisons based on MPJPE. Ordinal [19] is a concurrent work with the method presented here. The best score without consideration of this work is marked in blue bold. Black bold is used to highlight the best score when taking this work for comparison.

For some of the previous works, the prediction has been further aligned with the ground truth via a rigid transformation. The results are presented in the table below.

Table 2. Quantitative comparisons based on MPJPE after rigid transformation. Ordinal [19] is a concurrent work with the method presented here. The best score without consideration of this work is marked in blue bold. Black bold is used to highlight the best score when taking this work for comparison.

The results of the quantitative comparison demonstrate that the presented approach outperforms previous works on almost all actions and brings considerable improvements on complicated actions such as sitting and sitting down. However, it is worth noting that one of the works, marked as Ordinal [19] in the tables above, exploited a similar strategy and achieved comparable results. Specifically, it proposed an annotation tool for collecting the depth relations of all joints. However, its annotation procedure seems to be a much more tedious task compared to the one presented in this article.

To confirm the efficiency of this method on in-the-wild images, the researchers took 1,000 images from their FBI dataset as test data and conducted another comparison against the state-of-the-art method of Zhou et al. Here, the correctness ratio of the FBI derived from the predicted 3D pose was used as the evaluation metric: the method of Zhou et al. achieved a 75% correctness ratio, while the presented approach reached 78%. You can also see the results of a qualitative comparison in the image below.

Qualitative comparison results of the suggested method on some in-the-wild (ITW) images

Bottom line

The proposed approach exploits a new piece of information, the Forward-or-Backward Information (FBI) of bones, for 3D human pose estimation, and this data, in fact, helps to extract more 3D-aware features from images. As a result, the method outperforms all previous works. However, this is not the only contribution of this research team. They have also labeled the FBI for 12,000 in-the-wild images with a well-designed user interface. These images will become publicly available to benefit other researchers working in this area.

“Seeing Beyond Walls” — Human Pose Estimation Under Occlusions

21 June 2018

Being able to see beyond walls has long been considered a superhuman power, seen in many sci-fi movies. In 2011, researchers at the Massachusetts Institute of Technology announced that they had developed a new radar technology that provides real-time video of what is going on behind solid walls. Although successful, this is a complex radar technology intended for specific use cases.

Seven years later, researchers from the same university, MIT, have proposed a new method for "seeing beyond walls".

Very much like humans and animals, who see via waves of visible light that bounce off objects and then strike the retina, we can "see" through walls by sending waves that bounce off targets and return to receivers. In the new approach, the researchers leverage radio frequencies in the WiFi range, which traverse walls and reflect off the human body. Utilizing these signals together with deep learning techniques, they demonstrate very accurate human pose estimation through walls and under occlusions.

The Input

The proposed method, called RF-Pose, uses a low-power wireless signal (about 1,000 times lower in power than WiFi). It uses the radio reflections and a deep neural network to provide accurate pose estimation even under occlusion and behind walls. Walls are solid objects, often made of concrete, and they block signals or at least weaken them. However, some signals (or at least some frequencies) can traverse walls, and WiFi-range signals are among them. To detect what is behind a concrete wall, one has to record the reflected signal that traversed the wall and bounced off an object. In RF-Pose, the researchers use a well-known technique called FMCW (Frequency Modulated Continuous Wave) together with antenna arrays. In essence, FMCW separates the RF reflections based on the distance of the reflecting object, while the antenna arrays separate the reflections based on their spatial direction.

In this way, the input to the RF-Pose method consists of two projections of the signal reflections, represented as heat maps produced by the two antenna arrays: a vertical one and a horizontal one.

The horizontal and vertical projection of the heat maps

The Method

To exploit deep learning techniques, it is crucial to define a proper input-output scheme along with a proper architecture design, taking into account the limitations and the nature of the data. Having proved very successful, convolutional neural networks have been used in many cases where pixels are not the natural representation. The problem here is similar: RF (radio frequency) signals differ significantly from visual data in terms of their properties.

To overcome this, the authors explain and take into account the limitations of RF signals. Mainly, they argue that RF signals, and especially the frequencies that traverse walls, have a low spatial resolution (measured in tens of centimetres), as opposed to vision data, whose resolution reaches fractions of a millimetre.

Secondly, they argue that the wavelength has to be tuned to the human body so that humans act as reflectors and not as scatterers. Finally, the data representation differs a lot from images, i.e., visual data, because RF signals are given as complex numbers.

Having defined the problem-specific requirements, the researchers propose a method based on cross-modal supervision to tackle the problem of human pose estimation under occlusions. They suggest a teacher-student architecture using synchronized pairs of RGB images and heatmap projections of the RF-signal reflections.

The proposed network architecture with the teacher and student network

The teacher network is trained on RGB images, and it learns to predict 14 keypoints corresponding to anatomical parts of the human body: head, neck, shoulders, elbows, wrists, hips, knees and ankles. The confidence maps over these keypoints predicted by the teacher network are used as explicit cross-modal supervision for the student network.

Therefore, the training objective is the minimization of the error between the student network's prediction and the teacher network's prediction. To achieve this, they define a pixel-wise loss between the confidence maps, corresponding to a binary cross-entropy. Since the radio used in this approach generates 30 pairs of heatmaps per second, the authors could train the network by aggregating information from multiple subsequent snapshots of RF heatmaps; they do this to overcome the difficulty of localizing keypoints from a single snapshot.
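
Written out, a pixel-wise binary cross-entropy between the student's confidence maps S and the teacher's maps T takes the following form (a reconstruction of the loss described in words above, not the paper's exact notation):

$$
\mathcal{L} = -\sum_{c=1}^{14} \sum_{(i,j)} \Big[ T_c(i,j)\,\log S_c(i,j) + \big(1 - T_c(i,j)\big)\,\log\big(1 - S_c(i,j)\big) \Big]
$$

where c indexes the 14 keypoints and (i, j) ranges over the pixels of the confidence maps.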

To overcome the problem of the different viewpoints of the camera and the RF heatmaps, they impose an encoder-decoder architecture on the student network that forces it to learn the transformation from the RF-heatmap views to the camera view. They employ two encoder networks, one for the horizontal and one for the vertical heatmaps (each processing multiple snapshots), and one decoder network that predicts the keypoint confidence maps from the channel-wise concatenated encodings of both encoders.

The spatiotemporal convolutional encoder networks take 100 frames (3.3 seconds) as input, each encoder having 10 layers of 9x5x5 convolutions. The decoder network consists of 4 layers of 3x6x6 convolutions, and both networks use ReLU activations and batch normalization. The implementation was done in PyTorch, and training used a batch size of 24.
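
A condensed PyTorch sketch of a spatiotemporal encoder-decoder in this spirit: the 9x5x5 and 3x6x6 kernel sizes, ReLU and batch normalization follow the description above, while the channel counts, strides, number of layers shown and input resolution are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3d_block(cin, cout, kernel, stride=1):
    pad = tuple(k // 2 for k in kernel)
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel, stride=stride, padding=pad),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True),
    )

class HeatmapEncoder(nn.Module):
    """Spatio-temporal encoder over a clip of RF heatmaps (9x5x5 kernels).
    Only 3 of the 10 described layers are shown to keep the sketch short."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv3d_block(1, 16, (9, 5, 5)),
            conv3d_block(16, 32, (9, 5, 5), stride=(1, 2, 2)),
            conv3d_block(32, 64, (9, 5, 5), stride=(1, 2, 2)),
        )
    def forward(self, x):
        return self.net(x)

class KeypointDecoder(nn.Module):
    """Decodes the channel-wise concatenated encodings into 14 keypoint
    confidence maps using 3x6x6 kernels; no upsampling is shown here."""
    def __init__(self, num_keypoints=14):
        super().__init__()
        self.net = nn.Sequential(
            conv3d_block(128, 64, (3, 6, 6)),
            conv3d_block(64, 32, (3, 6, 6)),
            conv3d_block(32, 16, (3, 6, 6)),
            nn.Conv3d(16, num_keypoints, (3, 6, 6), padding=(1, 2, 2)),
            nn.Sigmoid(),  # confidence values in [0, 1] for the pixel-wise BCE loss
        )
    def forward(self, x):
        return self.net(x)

# 100 heatmap frames (~3.3 s) from each antenna view, batch of 1 for illustration
horizontal = torch.randn(1, 1, 100, 64, 64)
vertical = torch.randn(1, 1, 100, 64, 64)
enc_h, enc_v, decoder = HeatmapEncoder(), HeatmapEncoder(), KeypointDecoder()

encodings = torch.cat([enc_h(horizontal), enc_v(vertical)], dim=1)  # channel-wise concat
confidence_maps = decoder(encodings)

# training signal: pixel-wise binary cross-entropy against the teacher's maps
teacher_maps = torch.rand_like(confidence_maps)
loss = F.binary_cross_entropy(confidence_maps, teacher_maps)
print(confidence_maps.shape, float(loss))
```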

Comparison of the proposed method with an already existing method called OpenPose

Evaluation and Conclusions

This approach shows very good results, taking into account the low number of examples used for training. RF-Pose outperforms OpenPose on visible scenes in terms of the AP evaluation metric (the mean average precision over 10 different OKS thresholds ranging from 0.5 to 0.95). The comparison between the two is given in the tables and the plot below.

Comparison between RF-Pose and OpenPose

Arguing that occlusion is one of the most significant problems in human pose estimation, the researchers propose a new method using deep neural networks and RF signals to overcome it and provide robust and accurate human pose estimation. We expect to see many applications of this approach since the problem of (human) pose estimation represents an important task especially in the fields of surveillance, activity recognition, gaming, etc.

Comparison of RF-Pose and OpenPose given the RGB image from the camera

Dane Mitriev