Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors

14 August 2018


Precise facial landmark detection lays the foundation for high-quality performance in many computer vision and computer graphics tasks, such as face recognition, face animation, and face reenactment. Many face recognition methods rely on the locations of detected facial landmarks to spatially align faces, and imprecise landmarks can lead to bad alignment and degrade face recognition performance. Precise facial landmark detection is still an unsolved problem. While significant work has been done on image-based facial landmark detection, these detectors tend to be accurate but not precise, i.e., the detector's bias is small but the variance is significant. The main causes could be:

  1. insufficient training samples
  2. imprecise annotations

since human annotations inherently have limits on precision and consistency. Other methods focus on video facial landmark detection and utilize both detection and tracking to combat jittering and increase precision, but these methods require per-frame annotations in a video.

This research tackles the task with unsupervised learning: Supervision-by-Registration (SBR) augments the training loss function with supervision automatically extracted from unlabeled videos. The key observation is that the coherency between two signals is a source of supervision:

  1. the detections of the same landmark in adjacent frames
  2. the registration between those frames, i.e., optical flow

Framework Overview

Fig. 1: The SBR framework takes labeled images and unlabeled video as input to train an image-based facial landmark detector

SBR is an end-to-end trainable model consisting of two components: a generic detector built on convolutional networks and a differentiable Lucas-Kanade (LK) operation. During the forward pass, the LK operation takes the landmark detections from the past frame and estimates their locations in the current frame. The tracked landmarks are then compared with the direct detections on the current frame, and the registration loss is defined as the offset between them. In the backward pass, the gradient from the registration loss is back-propagated through the LK operation to encourage temporal coherency in the detector. The final output of the method is an enhanced image-based facial landmark detector that has leveraged large amounts of unlabeled video to achieve higher precision on both images and videos, as well as more stable predictions in videos. SBR thus brings additional supervisory signals from registration to enhance the precision of the detector. In sum, SBR has the following benefits:

  1. SBR can enhance the precision of a generic facial landmark detector on both images and video in an unsupervised fashion.
  2. Since the supervisory signal of SBR does not come from annotations, SBR can utilize a very large amount of unlabeled video to enhance the detector.
  3. SBR can be trained end-to-end with the widely used gradient back-propagation method.

Network Architecture

SBR Network Architecture

SBR consists of two complementary parts: the generic facial landmark detector and the LK tracking operation. The training procedure of supervision-by-registration uses two complementary losses. The detection loss utilizes the appearance of a single image and its label information to learn a better landmark detector. Many facial landmark detectors take an image I as input and regress to the coordinates of the facial landmarks, i.e., D(I) = L. An L2 loss is then applied to the predicted coordinates L against the ground-truth labels L*.

Detection loss
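To make the detection loss concrete, here is a minimal sketch assuming a regression-style detector that directly outputs K landmark coordinates; the function and argument names are illustrative, not from the paper:

```python
import torch.nn.functional as F

def detection_loss(detector, image, gt_landmarks):
    """L2 detection loss between predicted and annotated landmark coordinates.

    image:        (B, 3, H, W) cropped face images
    gt_landmarks: (B, K, 2) ground-truth (x, y) coordinates L*
    """
    pred = detector(image)                 # D(I) = L, shape (B, K, 2)
    return F.mse_loss(pred, gt_landmarks)  # mean squared L2 error
```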

The registration loss uncovers temporal consistency by incorporating a Lucas-Kanade operation into the network. This loss can be computed in an unsupervised manner to enhance the detector. It is realized with a forward-backward communication scheme between the detection output and the LK operation: the forward communication computes the registration loss, while the backward communication evaluates the reliability of the LK operation. The loss is as follows:

Registration loss
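The sketch below illustrates this forward-backward scheme, assuming a differentiable tracker lk_track(points, frame_from, frame_to) stands in for the paper's LK operation; the reliability check and its threshold are illustrative simplifications, not the paper's exact formulation:

```python
def registration_loss(detector, frame_prev, frame_curr, lk_track, fb_thresh=1.0):
    """Unsupervised registration loss between tracked and detected landmarks.

    frame_prev, frame_curr: (B, 3, H, W) adjacent video frames, no annotations needed.
    lk_track: assumed differentiable LK-style tracker mapping (B, K, 2) -> (B, K, 2).
    """
    det_prev = detector(frame_prev)                       # detections on frame t-1
    det_curr = detector(frame_curr)                       # detections on frame t

    # Forward communication: track the previous detections into the current frame
    # and penalize their offset from the current-frame detections.
    tracked = lk_track(det_prev, frame_prev, frame_curr)  # (B, K, 2)
    offset = (tracked - det_curr).norm(dim=-1)            # per-landmark distance

    # Backward communication: track back to frame t-1 and discard landmarks whose
    # forward-backward error suggests an unreliable LK estimate.
    back = lk_track(tracked, frame_curr, frame_prev)
    reliable = ((back - det_prev).norm(dim=-1) < fb_thresh).float()

    return (reliable * offset).sum() / reliable.sum().clamp(min=1.0)
```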

Complete loss function: Let N be the number of training samples with ground truth. For notation brevity, we assume there is only one unlabeled video with T frames. Then, the complete loss function of SBR is as follows:

Complete loss function
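Spelled out in a form consistent with the description above (the balancing weight λ is an assumption; the article does not state how the two terms are weighted), the objective looks roughly like:

```latex
\mathcal{L}_{\text{SBR}}
  = \sum_{i=1}^{N} \mathcal{L}_{\text{det}}\!\left(D(I_i), L_i^{*}\right)
  + \lambda \sum_{t=2}^{T} \mathcal{L}_{\text{reg}}\!\left(D(I_{t-1}), D(I_t)\right)
```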

The first detector is CPM, which utilizes an ImageNet pre-trained model for its feature extraction part. The first four convolutional layers of VGG-16 are used for feature extraction, and only three CPM stages are used for heatmap prediction. The faces are cropped and resized to 256×256 for pre-processing. The CPM is trained with a batch size of 8 for 40 epochs in total. The learning rate starts at 0.00005 and is reduced by 0.5 at the 20th and 30th epochs.
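That schedule maps onto a training loop roughly like the sketch below, reusing the detection_loss sketch from above; cpm_detector and labeled_loader are placeholders for the CPM model and the labeled-image loader, and the choice of SGD is an assumption since the article does not name the optimizer:

```python
import torch

optimizer = torch.optim.SGD(cpm_detector.parameters(), lr=5e-5, momentum=0.9)
# Halve the learning rate at the 20th and 30th epochs, as described above.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 30], gamma=0.5)

for epoch in range(40):
    for images, landmarks in labeled_loader:   # batches of 8 cropped 256x256 faces
        optimizer.zero_grad()
        loss = detection_loss(cpm_detector, images, landmarks)
        loss.backward()
        optimizer.step()
    scheduler.step()
```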


The second detector is a simple regression network, denoted as Reg. VGG-16 is used as the base model, and the output neurons of the last fully-connected layer are changed to K×2, where K is the number of landmarks. Input images are cropped to 224×224 for this regression network.
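A minimal sketch of that modification using torchvision's VGG-16, assuming K = 68 landmarks (the article does not fix K):

```python
import torch.nn as nn
from torchvision import models

K = 68                                        # number of landmarks (an assumption)
reg_detector = models.vgg16(pretrained=True)  # ImageNet-pretrained base model
# Replace the last fully-connected layer so it outputs K x 2 coordinate values.
in_features = reg_detector.classifier[-1].in_features
reg_detector.classifier[-1] = nn.Linear(in_features, K * 2)
# A 224x224 face crop now maps to a flat vector of K (x, y) coordinates.
```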

Result

results table

qualitative result

The datasets used are 300-W, AFLW, 300-VW, YouTube Faces, and YouTube Celebrities. The results of SBR, shown in Fig. 2, cover both the Reg (regression-based) and CPM (heatmap-based) detectors on AFLW and 300-W. Normalized Mean Error (NME) is used to evaluate performance on images.
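For reference, NME averages the point-to-point distances between predicted and ground-truth landmarks and divides by a normalization term (commonly the inter-ocular distance or the face size, depending on the dataset). A minimal sketch:

```python
import numpy as np

def nme(pred, gt, norm):
    """Normalized Mean Error for one face.

    pred, gt: (K, 2) arrays of landmark coordinates.
    norm:     normalization term, e.g. inter-ocular distance (convention varies by dataset).
    """
    return np.mean(np.linalg.norm(pred - gt, axis=1)) / norm
```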

The Bottom Line

SBR achieves state-of-the-art results on all of the datasets. The main advantages of supervision-by-registration are:

  1. It does not rely on human annotations which tend to be imprecise,
  2. The detector is no longer limited to the quantity and quality of human annotations, and
  3. Back-propagating through the LK layer enables more accurate gradient updates than self-training.

Also, experiments on synthetic data show that annotation errors in the evaluation set may make a well-performing model seem like it is performing poorly, so one should be careful of annotation imprecision when interpreting quantitative results.

Muneeb ul Hassan

Anatomically-Aware Facial Animation from a Single Image

7 August 2018


Let’s say you have a picture of Hugh Jackman for an advertisement. He looks great, but the client wants him to look a little bit happier. No, you don’t need to invite the celebrity for another photo shoot. You don’t even need to spend hours in Photoshop. You can automatically generate a dozen images from this single image, in which Hugh has smiles of different intensities. You can even create an animation from this single image, where Hugh changes his facial expression from absolutely serious to absolutely happy.

The authors of a novel GAN conditioning scheme based on Action Unit (AU) annotations claim that such a scenario is not futuristic but absolutely feasible today. In this article, we’re going to give an overview of their approach, look at the results generated with the suggested method, and compare its performance with state-of-the-art approaches.

Suggested Method

Facial expressions are the result of the combined and coordinated action of facial muscles. They can be described in terms of the so-called Action Units, which are anatomically related to the contractions of specific facial muscles. For example, the facial expression for fear is generally produced with the following activations: Inner Brow Raiser (AU1), Outer Brow Raiser (AU2), Brow Lowerer (AU4), Upper Lid Raiser (AU5), Lid Tightener (AU7), Lip Stretcher (AU20) and Jaw Drop (AU26). The magnitude of each AU defines the extent of emotion.
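For illustration only, the "fear" pattern above could be encoded as a sparse AU-intensity vector; the set of active AUs comes from the text, while the AU list and intensity values below are made up:

```python
import numpy as np

# Action units the model is assumed to condition on (an illustrative subset).
MODELED_AUS = [1, 2, 4, 5, 6, 7, 9, 12, 20, 25, 26]

# "Fear" from the text: active AUs with made-up intensities in [0, 1].
fear = {1: 0.8, 2: 0.7, 4: 0.6, 5: 0.9, 7: 0.5, 20: 0.6, 26: 0.4}

# Conditioning vector: one intensity per modeled AU, zero where the AU is inactive.
au_vector = np.array([fear.get(au, 0.0) for au in MODELED_AUS])
```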

Building on this approach to defining facial expressions, Pumarola and his colleagues suggest a GAN architecture conditioned on a one-dimensional vector indicating the presence/absence and the magnitude of each action unit. They train this architecture in an unsupervised manner that only requires images annotated with their activated AUs. Next, they split the problem into two main stages (sketched in code after the list):

  • Rendering a new image under the desired expression by considering an AU-conditioned bidirectional adversarial architecture provided with a single training photo.
  • Rendering back the synthesized image to the original pose.
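Conceptually, the two stages amount to applying the same generator twice, once toward the target AU vector and once back toward the original one, which also provides an identity/cycle signal. Here is a rough sketch with assumed names (G(image, aus) is a placeholder signature, not the paper's interface):

```python
import torch

def bidirectional_step(G, real_image, original_aus, target_aus):
    """Apply the (assumed) generator G twice: to the target expression and back."""
    fake = G(real_image, target_aus)   # stage 1: render the desired expression
    recon = G(fake, original_aus)      # stage 2: render back to the original expression
    # Cycle/identity term: the reconstruction should match the input image,
    # which lets the model train without paired before/after photos.
    cycle_loss = torch.mean(torch.abs(recon - real_image))
    return fake, cycle_loss
```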

Moreover, the researchers wanted to ensure that their system can handle images under changing backgrounds and illumination conditions. Hence, they added an attention layer to their network, which focuses the action of the network only on those regions of the image that are relevant to conveying the novel expression.

Let’s now move on to the next section to reveal the details of this network architecture – how does it succeed in generating anatomically coherent facial animations from images in the wild?

Overview of the approach

Network Architecture

The proposed architecture consists of two main blocks:

  • Generator G regresses attention and color masks (note that it’s applied twice: first to map the input image to the target expression and then to render it back). The aim is to make the generator focus only on those regions that are responsible for synthesizing the novel facial expression while keeping the rest of the image, such as hair, glasses, hats, or jewelry, untouched. Thus, instead of regressing the full image, the generator outputs a color mask C and an attention mask A. The mask A indicates to what extent each pixel of C contributes to the output image (see the sketch after this list). This leads to sharper and more realistic images.
Attention-based generator
  • Conditional critic. The critic D evaluates the generated image in terms of its photorealism and how well it fulfills the expression conditioning.
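Following the description above, where A gives the per-pixel contribution of the color mask C (the exact blending convention in the paper may differ), the generator's output can be composed roughly as:

```python
def compose_output(color_mask, attention_mask, input_image):
    """Blend the regressed color mask with the input image using the attention mask.

    color_mask:     (B, 3, H, W) regressed colors C
    attention_mask: (B, 1, H, W) values in [0, 1]; higher means the pixel is taken from C
    input_image:    (B, 3, H, W) original image
    """
    return attention_mask * color_mask + (1.0 - attention_mask) * input_image
```

Because background pixels get a near-zero attention value, they are copied straight from the input image, which is what keeps hair, glasses, and other non-face regions untouched.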

The loss function for this network is a linear combination of several partial losses: image adversarial loss, attention loss, conditional expression loss, and identity loss:

loss function

The model is trained on a subset of 200,000 images from the EmotioNet dataset using the Adam optimizer with a learning rate of 0.0001, beta1 = 0.5, beta2 = 0.999, and a batch size of 25.
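Those hyperparameters translate into an optimizer setup along these lines; generator and critic are placeholders for the two networks described above:

```python
import torch

g_optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
d_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.5, 0.999))
# Batches of 25 EmotioNet images are then fed to the usual alternating
# generator/critic updates.
```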

Experimental Evaluation

The image below demonstrates the model’s ability to activate AUs at different intensities while preserving the person’s identity. For instance, you can see that the model properly handles the case with zero intensity and generates an identical copy of the input image. For the non-zero cases, the model realistically renders complex facial movements and outputs images that are usually indistinguishable from the real ones.

AU intensity
Single-AU editing: varying the levels of intensity

The next figure displays the attention mask A and the color mask C. You can see how the model focuses its attention (darker areas) on the corresponding action units in an unsupervised manner. Hence, only the pixels relevant to the expression change are carefully estimated, while the background pixels are directly copied from the input image. This property of the model is very handy when dealing with images in the wild.

attention mask
Attention Model

Now, let’s see how the model handles the task of editing multiple AUs. The outcomes are depicted below. You can observe remarkably smooth and consistent transformations across frames, even with challenging lighting conditions and non-real-world data, as in the case of the avatar. These results encourage the authors to further extend their model to video generation. They should definitely try, shouldn’t they?

editing multiple AUs
Editing multiple Action Units

Meanwhile, the authors compare their approach against several state-of-the-art methods to see how well their model performs at generating different facial expressions from a single image. The results are shown below. The bottom row, representing the suggested approach, contains much more visually compelling images with notably higher spatial resolution. As we’ve already discussed, the attention mask allows applying the transformation only to the cropped face and putting it back onto the original image without producing artifacts.

face expression
Qualitative comparison with state-of-the-art

Limitations of the Model

Let’s now discuss the model’s limits: which types of challenging images it is still able to handle, and when it actually fails. As demonstrated in the image below, the model succeeds when dealing with human-like sculptures, non-realistic drawings, non-homogeneous textures across the face, anthropomorphic faces with non-real textures, non-standard illumination/colors, and even face sketches.

However, there are several cases where it fails. The first failure case depicted below results from errors in the attention mechanism when given extreme input expressions. The model may also fail when the input image contains previously unseen occlusions, such as an eye patch, causing artifacts in the missing face attributes. It is also not ready to deal with non-human anthropomorphic distributions, as in the case of cyclopes. Finally, the model can also generate artifacts such as human face features when dealing with animals.

face expression failure
Success and failure cases

Bottom Line

The model presented here is able to generate anatomically-aware facial animations from images in the wild. The resulting images are striking in their realism and high spatial resolution. The suggested approach advances prior work, which had only addressed discrete emotion-category editing and portrait images. Its key contributions include a) encoding face deformations by means of AUs to render a wide range of expressions, and b) embedding an attention model to focus only on the relevant regions of the image. The few observed failure cases are presumably due to insufficient training data. Thus, we can conclude that the results of this approach are very promising, and we look forward to observing its performance on video sequences.