Pairwise Relational Network – New Method for Face Recognition

28 November 2018

With the rapid progress of deep learning in the past few years, many computer vision problems have been tackled and solved with human or even beyond human performance. One of these tasks, in fact, a very popular one, is face recognition.

Until recently, face recognition was seen as something straight out of science fiction. But over the past decade or two, face recognition has become not only a solved problem but also a widespread technology with applications in several industries.

Previous Works

Since face recognition is a challenging task, it took some time for researchers in different domains to reach satisfactory results. Researchers in pattern recognition, computer vision, and artificial intelligence have proposed many solutions in the past. The main goal was to reduce the impact of difficulties such as highly variable face poses, varying image quality, etc., so as to improve robustness and recognition accuracy.

A number of deep learning-based face recognition methods have been proposed in the past few years, starting with the remarkable results of DeepFace (2014), followed by methods like DeepID (2014), FaceNet (2015), and VGGFace (2015), all the way to more recent methods like CosFace (2018) and ArcFace (2018).

State-of-the-art Idea

Recently, researchers from Pohang University of Science and Technology in Korea have proposed a novel face recognition method that achieved state-of-the-art results on some of the benchmark datasets. The new method, called the pairwise relational network (PRN), takes local appearance features around landmark points on the feature map and captures pairwise relations that are unique within the same identity and discriminative between different identities.

Pairwise Relational Network

The idea is to build a method that represents a face image in such a way that the extracted features are discriminative across faces of different people.

Method

The proposed method takes as input local appearance features obtained by ROI projection around landmark points on the feature map. These features are used to train a PRN (pairwise relational network) that captures pairwise relations between pairs of local appearance features. Arguing that such pairwise relations are identity dependent, the researchers employ an LSTM to learn an additional facial identity state feature. The architecture of the method, as well as of the pairwise relational network, is given in the figure below.

The architecture of the proposed method
Learning face identity state feature
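
To make the pairwise-relation idea more concrete, here is a minimal PyTorch-style sketch of relating every pair of local appearance features with a shared MLP and aggregating the results into a single face representation. The module name, dimensions, and mean aggregation are illustrative assumptions, not the exact architecture from the paper.

```python
import itertools
import torch
import torch.nn as nn

class PairwiseRelationSketch(nn.Module):
    """Relate every pair of local appearance features with a shared MLP.

    `num_landmarks`, `feat_dim`, and `rel_dim` are assumed sizes, and mean
    aggregation is an illustrative choice, not the paper's exact design.
    """
    def __init__(self, num_landmarks=68, feat_dim=256, rel_dim=512):
        super().__init__()
        self.relation_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, rel_dim), nn.ReLU(),
            nn.Linear(rel_dim, rel_dim), nn.ReLU(),
        )
        self.pairs = list(itertools.combinations(range(num_landmarks), 2))

    def forward(self, local_feats):  # local_feats: (batch, num_landmarks, feat_dim)
        relations = []
        for i, j in self.pairs:
            # Concatenate the two local features and compute their relation.
            pair = torch.cat([local_feats[:, i], local_feats[:, j]], dim=-1)
            relations.append(self.relation_mlp(pair))
        # Aggregate all pairwise relations into one face representation.
        return torch.stack(relations, dim=1).mean(dim=1)
```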

For learning and optimization, the method combines a triplet ratio loss, a pairwise loss, and a softmax loss. Stochastic gradient descent was used as the optimizer with an initial learning rate of 0.1.
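
As a rough illustration of how such a combined objective might be wired up, the snippet below sums three loss terms with weights and sets up SGD with the stated initial learning rate of 0.1. The `triplet_ratio_fn` and `pairwise_fn` callables and the weights are hypothetical placeholders, not the paper's exact loss definitions.

```python
import torch
import torch.nn.functional as F

def combined_loss(embeddings, logits, labels, triplet_ratio_fn, pairwise_fn,
                  w_triplet=1.0, w_pair=1.0, w_softmax=1.0):
    """Schematic weighted sum of the three loss terms; weights are assumptions."""
    softmax_loss = F.cross_entropy(logits, labels)
    return (w_triplet * triplet_ratio_fn(embeddings, labels)
            + w_pair * pairwise_fn(embeddings, labels)
            + w_softmax * softmax_loss)

# Optimization as described in the paper: SGD with an initial learning rate of 0.1.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```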

Evaluation and Comparison

The proposed method was evaluated on the LFW dataset, a standard benchmark for face verification in unconstrained environments and a popular dataset for evaluating face recognition methods. It contains 13,233 highly variable face images from 5,749 different identities. The PRN method reaches 99.76% accuracy on this dataset, which is almost the same as the state-of-the-art method ArcFace (99.78%).

Comparison with other methods on the LFW Dataset.

Moreover, when evaluated on the YTF dataset (which has similar characteristics to LFW), the PRN method achieves state-of-the-art results – 96.3%.

Comparison with other methods on the YTF Dataset. The PRN method achieves state-of-the-art performance.

Additionally, the method was evaluated on IJB-A and IJB-B datasets for evaluating face verification and face identification. The results obtained on these datasets as well as on LFW and YTF compared with other methods are reported in the tables below.

Comparison of performances of the proposed PRN method with the state-of-the-art on the IJB-B dataset.
Comparison of performances of the proposed PRN method with the state-of-the-art on the IJB-A dataset.

Conclusion

The researchers proposed an interesting approach to a well-known problem – face recognition. In their paper, they show that capturing those kinds of unique and discriminative pairwise relations actually solves the problem of face identification to a high degree of accuracy. Extensive experiments have been done on popular datasets and the method achieves very good results on all of them and state-of-the-art performance on one of them.

PIFR: Pose Invariant 3D Face Reconstruction

26 November 2018

3D face geometry needs to be recovered from 2D images in many real-world applications, including face recognition, face landmark detection, 3D emoticon animation, etc. However, this task remains challenging, especially under large poses, when much of the information about the face is not observable.

Jiang and Wu from Jiangnan University (China) and Kittler from the University of Surrey (UK) suggest a novel 3D face reconstruction algorithm that significantly improves the accuracy of reconstruction even under extreme poses.

But let’s first shortly review the previous work on 3D face models and 3D face reconstruction.

Related Work

The research mentions four publicly available 3D deformable face models. This paper uses the BFM (Basel Face Model), which is the most popular one.

There are several approaches to reconstructing a 3D face model from 2D images.

State-of-the-art idea

The paper by Jiang, Wu, and Kittler proposes a novel Pose-Invariant 3D Face Reconstruction (PIFR) algorithm based on 3D Morphable Model (3DMM).

First, they suggest generating a frontal image by normalizing the single input face image. This step allows additional identity information of the face to be restored.

The next step is to use a weighted sum of the 3D parameters of both images: the frontal one and the original one. This preserves the pose of the original image while also enhancing the identity information.

The pipeline for the suggested approach is provided below.

Overview of the Pose-Invariant 3D Face Reconstruction (PIFR) method

The experiments show that the PIFR algorithm significantly improves the performance of 3D face reconstruction compared to previous methods, especially in extreme pose cases.

Let’s now have a closer look at the suggested model…

Model details

The PIFR method relies largely on the 3DMM fitting process, which can be expressed as minimizing the error between the 2D coordinates of the projected 3D points and the ground truth. However, the face generated by the 3D model has about 50,000 vertices, so iterative calculations over all of them result in slow and inefficient convergence. To overcome this problem, the researchers suggest using landmarks (e.g., eye centers, mouth corners, and the nose tip) as the ground truth in the fitting process. Specifically, they use weighted landmark 3DMM fitting.

Top row: the original image and landmark. Bottom row: 3D face model and its alignment to the 2D image
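
The fitting objective can be pictured as a weighted least-squares problem over landmark reprojection errors. The NumPy sketch below assumes a weak-perspective projection and a particular packing of the parameters; it illustrates the idea rather than the exact formulation used in PIFR, and the residual it returns could be fed to a standard least-squares solver such as scipy.optimize.least_squares.

```python
import numpy as np

def landmark_fitting_residual(params, mean_shape, shape_basis,
                              landmark_idx, gt_landmarks_2d, weights):
    """Weighted landmark fitting residual (illustrative, weak-perspective).

    `params` packs the shape coefficients plus a 2x3 projection matrix; this
    packing and the camera model are assumptions, not PIFR's exact setup.
    """
    n_coef = shape_basis.shape[1]
    alpha, P = params[:n_coef], params[n_coef:].reshape(2, 3)
    # 3DMM shape = mean shape + linear combination of shape basis vectors.
    shape = (mean_shape + shape_basis @ alpha).reshape(-1, 3)
    projected = shape[landmark_idx] @ P.T          # project landmark vertices to 2D
    residual = projected - gt_landmarks_2d         # (num_landmarks, 2)
    return (weights[:, None] * residual).ravel()   # weighted error to be minimized
```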

The next challenge is to reconstruct 3D faces in large poses. To solve this problem, the researchers use the High-Fidelity Pose and Expression Normalization (HPEN) method, but only to normalize the pose, not the expression. In addition, Poisson editing is used to recover areas of the face occluded due to the viewing angle.

Performance Comparison with Other Methods

The performance of the PIFR method was evaluated for face reconstruction:

  • in small and medium poses;
  • in large poses;
  • in extreme poses (±90° yaw angles).

For this purpose, the researchers used three publicly available datasets:

  • AFW dataset, which was created using Flickr images, contains 205 images with 468 marked faces, complex backgrounds and face poses.
  • LFPW dataset, which has 224 face images in the test set and 811 face images in the training set; each image is marked with 68 feature points; 900 face images from both sets were selected for testing in this research.
  • AFLW dataset is a large-scale face database that contains around 25,000 hand-labeled face images, each marked with 21 feature points. This study used only extreme-pose face images from this dataset for qualitative analysis.

Quantitative analysis. Using the Mean Euclidean Metric (MEM), the study compares the performance of the PIFR method to E-3DMM and FW-3DMM on the AFW and LFPW datasets. The cumulative errors distribution (CED) curves look like this:

Comparisons of cumulative errors distribution (CED) curves on AFW and LFPW datasets

As you can see from these plots and the tables below, the PIFR method shows superior performance compared to the other two methods. Its reconstruction performance in large poses is particularly good.
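
For reference, the Mean Euclidean Metric reported here is simply the average Euclidean distance between corresponding predicted and ground-truth points; a minimal version (ignoring any normalization the paper may apply) looks like this:

```python
import numpy as np

def mean_euclidean_metric(predicted_points, ground_truth_points):
    """Mean Euclidean distance between corresponding points of shape (N, 2) or (N, 3)."""
    return np.linalg.norm(predicted_points - ground_truth_points, axis=1).mean()
```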

Qualitative analysis. The method was also assessed qualitatively based on face images in extreme poses from the AFLW dataset. The results are shown in the figure below.

Comparison of 3D face reconstruction: (a) Input image; (b) FW-3DMM; (c) E-3DMM; (d) Suggested approach

Even though half of the landmarks are invisible due to the extreme pose, which leads to large errors and failures for other methods, the PIFR method still performs quite well.

Here are some additional examples of the PIFR method performance based on the images from the AFW dataset.

Top row: Input 2D image. Middle row: 3D face. Bottom row: Align to 2D image.

Bottom Line

The novel 3D face reconstruction framework PIFR demonstrates good reconstruction performance even in extreme poses. By taking both the original and the frontal image for weighted fusion, the method restores enough facial information to reconstruct the 3D face.

In the future, the researchers plan to restore even more facial identity information to improve the accuracy of reconstruction further.

Fooling Facial Recognition: Fast Method for Generating Adversarial Faces

2 October 2018

With the rapid progress and state-of-the-art performance in a wide range of tasks, deep learning based methods are in use in a large number of security-sensitive and critical applications. However, despite their remarkable, often beyond human-level performance, deep learning methods are vulnerable to well-designed input samples. Such inputs are called adversarial examples. In a game of “cat and mouse”, researchers have been competing to design stronger adversarial attacks on the one hand and more robust defense mechanisms on the other.

The problem of adversarial attacks is especially pronounced in computer vision tasks such as object recognition and classification. In deep learning-based image processing, small perturbations in the input space can result in a significant change in the output. Such perturbations are almost unnoticeable to humans and do not change the semantics of the image content itself, yet they can trick deep learning methods.

Adversarial attacks are a big concern in security-critical applications such as identity verification, access control etc. One particular target of adversarial attacks is face recognition.

Previous works

The excellent performance of deep learning methods for face recognition has contributed to their adoption in a wide variety of systems.

In the past, adversarial attacks have targeted face recognition systems. These attacks can mainly be divided into two broad groups: intensity-based and geometry-based adversarial attacks. Many of them proved to be very successful in fooling a face recognition system. However, a number of defense mechanisms have been proposed to deal with different kinds of attacks.

Comparison of the proposed attack (Column 2) to an intensity-based attack (Column 3).

To exploit the vulnerability of face recognition systems and surpass defense mechanisms, more and more sophisticated adversarial attacks have been developed. Some of them change pixel intensities, while others spatially transform benign images to perform the attack.

State-of-the-art idea

Researchers from West Virginia University have proposed a new fast method for generating adversarial face images. The core of their approach is to define a face transformation model based on facial landmark locations.

Method

The problem of manipulating an image and transforming it into an adversarial sample is addressed through landmark manipulation. The technique is based on optimizing a displacement field, which is used to spatially warp the input image. It is a geometry-based attack, able to generate an adversarial sample by modifying only a number of landmarks.

Taking into account the fact that facial landmark locations provide highly discriminative information for face recognition tasks, the researchers exploit the gradients of the prediction with respect to the landmark positions to update the displacement field. A scheme of the proposed method for generating adversarial face images is shown in the picture below.

The proposed method optimizes a displacement field to produce adversarial landmark locations.
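
A bare-bones sketch of such a geometry-based attack is shown below: a displacement field over the landmarks is updated by following the gradient of the true-class score. The `model(image, landmarks)` interface and the step schedule are hypothetical stand-ins, not the authors' actual FLM implementation.

```python
import torch

def landmark_displacement_attack(model, image, landmarks, true_label,
                                 steps=20, step_size=0.5):
    """Geometry-based attack sketch: perturb only the landmark locations.

    `model(image, landmarks)` returning class logits is a hypothetical
    interface; the real method optimizes a displacement field that is then
    used to spatially warp the face image.
    """
    flow = torch.zeros_like(landmarks, requires_grad=True)  # displacement field
    for _ in range(steps):
        logits = model(image, landmarks + flow)
        true_score = logits[:, true_label].sum()
        # Gradient of the true-class score with respect to the displacement field.
        grad, = torch.autograd.grad(true_score, flow)
        with torch.no_grad():
            flow -= step_size * grad  # move landmarks to lower the true-class score
    return (landmarks + flow).detach()
```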

To overcome the problem of multiple possible updates of the displacement field due to differing gradient directions, they propose grouping the landmarks semantically. This allows manipulating group properties instead of perturbing each landmark individually, so that the resulting images look natural.

Grouping face landmarks based on semantic regions of the face (eyes, nose, mouth, etc.).

Results

The new adversarial face generator was evaluated by measuring and comparing the performance of the attacks under several defense methods. To further explore the problem of generating adversarial samples of face images the researchers assess how spatially manipulating the face regions affects the performance of a face recognition system.

First, the performance was evaluated in a white-box attack scenario on the CASIA-WebFace dataset. Six experiments were done to investigate the importance of each region of the face in the proposed attack methods. They evaluate the performance of attacks on each of the five main facial regions: 1) eyebrows, 2) eyes, 3) nose, 4) mouth, 5) jaw, and 6) all regions combined. The results are given in the table below.

Comparison of the results of the proposed attacks, FLM and GFLM, to stAdv [33], exploring the influence of different regions of the face.
The researchers calculate the prediction of the true class for faces which are correctly classified and for their manipulated versions.

Comparison with other state-of-the-art

A comparison with several existing methods for generating adversarial faces was made within this study. The methods are compared in terms of success rate as well as speed.

Comparison of the proposed FLM and GFLM attacks to FGSM and stAdv attacks under the state-of-the-art adversarial training defenses.
Linearly interpolating the defined face properties

Conclusion

This approach shows that landmark manipulation can be a reliable way of changing the prediction of face recognition classifiers. The novel method can generate adversarial faces approximately 200 times faster than other geometry-based approaches. This method creates natural samples and can fool state-of-the-art defense mechanisms.

True Face Super-Resolution Upscaling with the Facial Component Heatmaps

1 October 2018

The performance of most facial analysis techniques relies on the resolution of the corresponding image. Face alignment or face identification will not work correctly when the resolution of a face is too low.

What’s Face Super-Resolution?

Face super-resolution (FSR), or face hallucination, provides a viable way to recover a high-resolution (HR) face image from its low-resolution (LR) counterpart. This research area has attracted increasing interest in recent years, and the most advanced deep learning methods achieve state-of-the-art performance in face super-resolution.

However, even these methods often produce results with distorted face structure and only partially recovered facial details. Deep learning-based FSR methods also tend to fail at super-resolving LR faces under large pose variations.

How can we solve this problem?

  • Augmenting training data with large pose variations still leads to suboptimal results where facial details are missing or distorted.
  • Directly detecting facial components or landmarks in LR faces is also suboptimal and may lead to ghosting artifacts in the final result.

But what about a method that super-resolves LR face images while collaboratively predicting face structure? Can we use heatmaps to represent the probability of the appearance of each facial component?

We are going to discover this very soon, but let’s first check the previous approaches to the problem of face super-resolution.

Related Work

Face hallucination methods can be roughly grouped into three categories:

  • ‘Global model’ based approaches aim at super-resolving an LR input image by learning a holistic appearance mapping such as PCA. For instance, Wang and Tang reconstruct an HR output from the PCA coefficients of the LR input; Liu et al. develop a Markov random field (MRF) to reduce ghosting artifacts caused by the misalignments in LR images; Kolouri and Rohde employ optimal transport techniques to morph an HR output by interpolating exemplary HR faces.
  • Part based methods are proposed to super-resolve individual facial regions separately. For instance, Tappen and Liu super-resolve HR facial components by warping the reference HR images; Yang et al. localize facial components in the LR images by a facial landmark detector and then reconstruct missing high-frequency details from similar HR reference components.
  • Deep learning techniques can be very different: Xu et al. employ the framework of generative adversarial networks to recover blurry LR face images; Zhu et al. present a cascade bi-network, dubbed CBN, to localize LR facial components first and then upsample the facial components.

State-of-the-art idea

Xin Yu and his colleagues propose a multi-task deep neural network that not only super-resolves LR images but also estimates the spatial positions of their facial components. Their convolutional neural network (CNN) has two branches: one for super-resolving face images and the other for predicting salient regions of a face, coined facial component heatmaps.

The whole process looks like this:

  1. Super-resolving features of input LR images.
  2. Employing a spatial transformer network to align the feature maps.
  3. Estimating the heatmaps of facial components from the upsampled feature maps.
  4. Concatenating estimated heatmaps of facial components with the upsampled feature maps.

This method can super-resolve tiny unaligned face images (16 x 16 pixels) with the upscaling factor of 8x while preserving face structure.
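
A schematic forward pass covering the four steps above might look as follows; `feature_upsampler`, `stn_align`, `heatmap_branch`, and `hr_decoder` are hypothetical stand-in modules rather than the paper's actual sub-networks.

```python
import torch
import torch.nn as nn

class MultiTaskUpsamplerSketch(nn.Module):
    """Schematic multi-task forward pass; the sub-modules are stand-ins."""
    def __init__(self, feature_upsampler, stn_align, heatmap_branch, hr_decoder):
        super().__init__()
        self.feature_upsampler = feature_upsampler
        self.stn_align = stn_align
        self.heatmap_branch = heatmap_branch
        self.hr_decoder = hr_decoder

    def forward(self, lr_image):
        feats = self.feature_upsampler(lr_image)      # 1. super-resolve LR features
        feats = self.stn_align(feats)                 # 2. align feature maps with an STN
        heatmaps = self.heatmap_branch(feats)         # 3. estimate facial component heatmaps
        fused = torch.cat([feats, heatmaps], dim=1)   # 4. concatenate heatmaps with features
        return self.hr_decoder(fused), heatmaps       # decode the HR face; keep heatmaps for the loss
```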

(a) LR image; (b) HR image; (c) Nearest Neighbors; (d) CBN, (e) TDAE, (f) TDAE trained on better dataset, (g) suggested approach

Now let’s learn the details of the proposed method.

Model overview

The network has the following structure:

  1. A multi-task upsampling network (MTUN):
    1. an upsampling branch (composed of a convolutional autoencoder, deconvolutional layers, and a spatial transformer network);
    2. a facial component heatmap estimation branch (HEB).
  2. Discriminative network, which is constructed by convolutional layers and fully connected layers.
The pipeline of the suggested network

Facial Component Heatmap Estimation. Even the state-of-the-art facial landmark detectors cannot accurately localize facial landmarks in very low-resolution images. So, the researchers propose to predict facial component heatmaps from super-resolved feature maps.

2D photos may exhibit a wide range of poses. Thus, to reduce the number of training images required for learning HEB, they suggest employing a spatial transformer network (STN) to align the upsampled features before estimating heatmaps.

In total, four heatmaps are estimated to represent four components of a face: eyes, nose, mouth, and chin (see the image below).

Visualization of estimated facial component heatmaps: (a) Unaligned LR image; (b) HR image; (c) Heatmaps; (d) Result; (e) The estimated heatmaps overlying the results

Loss Function. The results of using different combinations of losses are provided below:

Comparison of different losses

On the above image:

  1. unaligned LR image,
  2. original HR image,
  3. pixel-wise loss only,
  4. pixel-wise and feature-wise losses combined,
  5. pixel-wise, feature-wise, and discriminative losses,
  6. pixel-wise and face structure losses,
  7. pixel-wise, feature-wise, and face structure losses,
  8. pixel-wise, feature-wise, discriminative, and face structure losses.

In training their multi-task upsampling network, the researchers selected the last option (h): combining pixel-wise, feature-wise, discriminative, and face structure losses.
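
A schematic combination of these four terms could be written as below; the weighting constants are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def total_sr_loss(sr, hr, sr_feats, hr_feats, d_fake_logits,
                  pred_heatmaps, gt_heatmaps,
                  w_feat=1.0, w_adv=1e-3, w_struct=1.0):
    """Schematic combination of the four loss terms (weights are assumptions)."""
    pixel = F.mse_loss(sr, hr)                              # pixel-wise intensity loss
    feature = F.mse_loss(sr_feats, hr_feats)                # feature-wise (perceptual) loss
    adversarial = F.binary_cross_entropy_with_logits(       # discriminative loss:
        d_fake_logits, torch.ones_like(d_fake_logits))      # encourage fooling the discriminator
    structure = F.mse_loss(pred_heatmaps, gt_heatmaps)      # face structure (heatmap) loss
    return pixel + w_feat * feature + w_adv * adversarial + w_struct * structure
```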

Qualitative and Quantitative Comparisons

See the qualitative comparison of the suggested approach with the state-of-the-art methods:

Comparisons with the state-of-the-art methods: (a) Unaligned LR image; (b) HR image; (c) Bicubic interpolation; (d) VDSR; (e) SRGAN; (f) Ma et al.’s method; (g) CBN; (h) TDAE; (i) Suggested approach

As you can see, most of the existing methods fail to generate realistic face details, while the suggested approach outputs realistic and detailed images, which are very close to the original HR image.

Quantitative comparison with the state-of-the-art methods leads us to the same conclusions. All methods were evaluated on the entire test dataset by the average PSNR and the structural similarity (SSIM) scores.

Quantitative comparisons on the entire test dataset

The results in the table show that the approach presented here outperforms the second-best method by a large margin of 1.75 dB in PSNR. This confirms that estimating heatmaps helps in localizing facial components and aligning LR faces more accurately.

Bottom Line

Let’s summarize the contributions of this work:

  • It presents a novel multi-task upsampling network that can super-resolve very small LR face images (16 x 16 pixels) by an upscaling factor of 8x.
  • The method not only exploits image intensity similarity but also estimates the face structure with the help of facial component heatmaps.
  • The estimated facial component heatmaps provide not only spatial information of facial components but also their visibility information.
  • Thanks to the aligning of feature maps before heatmap estimation, the number of images required for training the model is largely reduced.

The method is good at super-resolving very low-resolution faces in different poses and generates realistic and detailed images free from distortions and artifacts.

Learning 3D Face Morphable Model Out of 2D Images

5 September 2018

The 3D Morphable Model (3DMM) is a statistical model of 3D facial shape and texture. 3D Morphable Models have various applications in many fields including computer vision, computer graphics, human behavioral analysis, craniofacial surgery.

In essence, 3D Morphable Models are used to model facial shapes and textures, and modeling human faces is not a trivial task at all. Different identities, highly variable face shapes, and poses make modeling the human face challenging. In this context, a 3D Morphable Model tries to learn a model of facial shape and texture in a space with explicit correspondences. This means that, first, there has to be a point-to-point correspondence between the reconstruction and all other models, enabling morphing, and second, it has to model the underlying transformations between types of faces (male to female, neutral to smile, etc.).


Researchers from Michigan State University propose a novel Deep Learning-based approach to learning a 3D Morphable Model. Exploiting the power of Deep Neural Networks to learn non-linear mappings, they suggest a method for learning 3D Morphable Model out of just 2D images from in-the-wild (images not taken in a controlled environment like a lab).

Previous Approaches

A conventional 3DMM is learned from a set of 3D face scans with associated well-controlled 2D face images. Traditionally, a 3DMM is learned with supervision by performing dimension reduction, typically Principal Component Analysis (PCA), on a training set of co-captured 3D face scans and 2D images. Because it employs a linear model such as PCA, the 3D Morphable Model cannot capture non-linear transformations and facial variations. Moreover, large amounts of high-quality 3D data are needed to model highly variable 3D face shapes.

State of the art idea

The idea of the proposed approach is to leverage the power of deep neural networks, more specifically convolutional neural networks (which are more suitable for the task and less expensive than multilayer perceptrons), to learn the 3D Morphable Model: an encoder network takes a face image as input and generates shape and albedo parameters, from which two decoders estimate the shape and the albedo.

Method

As mentioned before, a linear 3DMM has problems such as the need for 3D face scans for supervised learning, the inability to leverage massive collections of in-the-wild face images for learning, and limited representation power due to the linear model (PCA). The proposed method learns a nonlinear 3DMM model using only large-scale in-the-wild 2D face images.

UV Space Representation

In their method, the researchers use an unwrapped 2D texture (where each 3D vertex v is projected onto the UV space) as the representation for both the shape and the albedo. They argue that keeping spatial information is very important, since they employ convolutional networks, and that frontal face images contain little information about the two sides of the face. Therefore, their choice falls on the UV-space representation.

Three albedo representations. (a) Albedo value per vertex, (b) Albedo as a 2D frontal face, (c) UV space 2D unwarped albedo.
UV space shape representation. x, y, z, and a combined shape representation.

Network architecture

They designed an architecture that, given an input image, encodes it into shape, albedo, and lighting parameters (vectors). The encoded latent vectors for albedo and shape are decoded using two different decoder networks (again convolutional neural networks) to obtain a face skin reflectance image (for the albedo) and a 3D face mesh (for the shape). Then, a differentiable rendering layer generates the reconstructed face by fusing the 3D face, albedo, lighting, and the camera projection parameters estimated by the encoder. The whole architecture is presented in the figure below.

The proposed method’s architecture for learning a non-linear 3DMM
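
The following sketch captures the encoder/two-decoder layout in PyTorch-style pseudocode; the module interfaces and the latent dimensions are assumptions made for illustration, not the paper's exact specification.

```python
import torch
import torch.nn as nn

class Nonlinear3DMMSketch(nn.Module):
    """Rough sketch of the encoder/two-decoder layout (sizes are assumptions)."""
    def __init__(self, encoder, shape_decoder, albedo_decoder, renderer,
                 shape_dim=160, albedo_dim=160, light_dim=27, cam_dim=6):
        super().__init__()
        self.encoder = encoder                  # CNN: image -> latent vector
        self.shape_decoder = shape_decoder      # CNN decoder: latent -> UV shape map
        self.albedo_decoder = albedo_decoder    # CNN decoder: latent -> UV albedo map
        self.renderer = renderer                # differentiable rendering layer
        self.dims = (shape_dim, albedo_dim, light_dim, cam_dim)

    def forward(self, image):
        latent = self.encoder(image)
        f_shape, f_albedo, lighting, camera = torch.split(latent, self.dims, dim=-1)
        shape_uv = self.shape_decoder(f_shape)      # 3D face shape in UV space
        albedo_uv = self.albedo_decoder(f_albedo)   # albedo (skin reflectance) in UV space
        # The rendering layer fuses shape, albedo, lighting, and camera into a face image.
        return self.renderer(shape_uv, albedo_uv, lighting, camera)
```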

The presented robust learning of a non-linear 3D Morphable Model is applied to the 2D face alignment and 3D face reconstruction problems. It can also have many other applications, since it is a general model-learning method that can be applied to different problems.

The proposed rendering layer

Comparison with other methods

The method was evaluated against other methods on the following tasks: 2D Face Alignment, 3D Face Reconstruction and Face Editing. The suggested technique outperforms other state-of-the-art methods on these tasks. Some of the results of the evaluation are presented below.

2D Face Alignment

Face alignment can become one of the critical applications of this kind of approach. Alignment should naturally improve facial analysis in a range of tasks (for example, face recognition). However, alignment is not a straightforward task, and this method proves successful at it.

2D face alignment results. Invisible landmarks are marked as red. The technique can well handle extreme pose, lighting, and expression

3D Face Reconstruction

The approach was also evaluated on another task: 3D Face Reconstruction, yielding outstanding results compared to other methods.

Quantitative evaluation of the 3D reconstruction
3D reconstruction results comparison to Sela et al. The proposed method handles facial hair and occlusions far better than this method
3D reconstruction results comparison to VRN by Jackson et al. on the popular CelebA dataset
3D reconstruction results comparison to Tewari et al. This result shows that the proposed method overcomes the problem of face shrinking when dealing with a different texture (like facial hair)

Face Editing

A method that learns a model and decomposes a face image into individual components allows the image to be modified and the face to be edited by manipulating different elements. The method was also evaluated on face editing tasks such as relighting and attribute manipulation.

Growing mustache editing results. The first column shows original images, the following columns show edited images with increasing magnitudes.
Comparing to Shu et al. results (last row), the proposed method produces more realistic images, and the identity is better preserved.

Conclusions

In conclusion, the proposed method will have a potentially high impact since it improves the way of learning a 3D Morphable Model. This kind of model has been widely adopted in the past since its introduction, but there was not an efficient, robust way of learning this model from in-the-wild data.

The proposed approach exploits the power of deep neural networks as very good function approximators to robustly model the highly variable human face. This unusual way of learning a 3DMM allows different manipulations and many applications of the method, some of which are presented in the paper, with many others expected.

Facial Surface and Texture Synthesis via GAN

3 September 2018

Deep networks can be extremely powerful and effective in answering complex questions. But it is also well-known that in order to train a really complex model, you’ll need lots and lots of data, which closely approximates the complete data distribution.

With the lack of real-world data, many researchers choose data augmentation as a method for extending the size of a given dataset. The idea is to modify the training examples in such a way that keeps their semantic properties intact. That’s not an easy task when dealing with human faces.

The method should account for such complex transformations of data as pose, lighting and non-rigid deformations, yet create realistic samples that follow the real-world data statistics.

So, let’s see how the latest state-of-the-art methods approach this challenging task…

Previous approaches

Generative adversarial networks (GANs) have demonstrated their effectiveness in making synthetic data more realistic. Taking simulated data as input, a GAN produces samples that appear more realistic. However, the semantic properties of these samples might be altered, even with a loss penalizing the change in the parameters of the output.

The 3D morphable model (3DMM) is the most commonly used method for representing and synthesizing geometries and textures, and it was originally proposed in the context of 3D human faces. In this model, the geometric structure and the texture of human faces are linearly approximated as a combination of principal vectors.
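
In code, the linear 3DMM boils down to a mean plus a linear combination of principal components; a minimal NumPy sketch (with assumed array shapes) is shown below.

```python
import numpy as np

def linear_3dmm_sample(mean_shape, shape_basis, mean_texture, texture_basis,
                       shape_coeffs, texture_coeffs):
    """Classic linear 3DMM: face = mean + linear combination of principal vectors."""
    shape = mean_shape + shape_basis @ shape_coeffs          # (3 * num_vertices,)
    texture = mean_texture + texture_basis @ texture_coeffs  # per-vertex texture values
    return shape, texture

# Coefficients are conventionally sampled from a Gaussian prior, which is part of
# why purely 3DMM-generated faces tend to look smooth and averaged.
```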

Recently, the 3DMM model was combined with convolutional neural networks for data augmentation. However, the generated samples tend to be smooth and unrealistic in appearance, as you can observe in the figure below.

Faces synthesized using the 3DMM linear model

Moreover, 3DMM generates samples following a Gaussian distribution, which rarely reflects the true distribution of the data. For instance, see below the first two PCA coefficients plotted for real faces vs the synthesized 3DMM faces. This gap between the real and synthesized distributions may easily result in non-plausible samples.

First two PCA coefficients of real (left) and 3DMM generated (right) faces

State-of-the-art idea

Slossberg, Shamai, and Kimmel from Technion – Israel Institute of Technology propose a new realistic data synthesis approach for human faces by combining GAN and 3DMM model.

In particular, the researchers employ a GAN to imitate the space of parametrized human textures and generate corresponding facial geometries by learning the best 3DMM coefficients for each texture. The generated textures are mapped back onto the corresponding geometries to obtain new generated high-resolution 3D faces.

This approach produces realistic samples, and it:

  • doesn’t suffer from indirect control over such desired attributes as pose and lighting;
  • is not limited to producing new instances of existing individuals.

Let’s have a closer look at their data processing pipeline…

Data processing pipeline

The process includes aligning 3D scans of human faces vertex to vertex and mapping their textures onto a 2D plane using a predefined universal transformation.

Data preparation pipeline

The data preparation pipeline contains four main stages:

  • Data acquisition: the researchers collected about 5000 scans from a wide variety of ethnic, gender, and age groups; each subject was asked to perform five distinct expressions including a neutral one.
  • Landmark annotation: 43 landmarks were added to the meshes semi-automatically by rendering the face and using a pre-trained facial landmark detector on the 2D images.
  • Mesh alignment: this was conducted by deforming a template face mesh according to the geometric structure of each scan, guided by the previously obtained facial landmark points.
  • Texture transfer: the texture is transferred from the scan to the template using a ray casting technique built into the animation rendering toolbox of Blender; then, the texture is mapped from the template to a 2D plane using the predefined universal mapping.

See the resulting mapped textures below:

Flattened aligned facial textures

The next step is to train GAN to learn and imitate these aligned facial textures. For this purpose, the researchers use a progressive growing GAN with the generator and discriminator constructed as symmetric networks. In this implementation, the generator progressively increases the resolution of the feature maps until reaching the output image size, while the discriminator gradually reduces the size back to a single output.

See below the new synthetic facial textures generated by the aforementioned GAN:

Facial textures synthesized by GAN

The final step is to synthesize the geometries of the faces. The researchers explored several approaches to finding plausible geometry coefficients for a given texture. You can observe the qualitative and quantitative (L2 geometric error) comparison between the various methods in the next figure:

Two synthesized textures mapped onto different geometries

Apparently, the least squares approach produces the lowest distortion results. Considering also its simplicity, this method was chosen for all the subsequent experiments.
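
A minimal sketch of such a least-squares fit is given below; it assumes flattened textures are regressed directly to geometry coefficients, which may differ from the exact texture representation the authors use.

```python
import numpy as np

def fit_texture_to_geometry_map(train_textures, train_geometry_coeffs):
    """Linear least-squares map from flattened textures to 3DMM geometry coefficients.

    train_textures: (num_samples, texture_dim), train_geometry_coeffs: (num_samples, num_coeffs).
    Solves W = argmin ||T W - G||^2 over the training set.
    """
    W, *_ = np.linalg.lstsq(train_textures, train_geometry_coeffs, rcond=None)
    return W

def predict_geometry(texture, W):
    # Plausible geometry coefficients for a newly synthesized texture.
    return texture @ W
```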

Experimental results

The proposed method can generate many new identities, and each one of them can be rendered under varying pose, expression, and lighting. Different expressions are added to the neutral geometry using the Blend Shapes model. The resulting images with different pose and lighting are shown below:

Identities generated by the proposed method with different pose and lighting

For quantitative evaluation of the results, the researchers used the sliced Wasserstein distance (SWD) to measure distances between distributions of their training and generated images in different scales:

The table demonstrates that the textures generated by the proposed model are statistically closer to the real data than those generated by 3DMM.

The next experiment was designed to evaluate if the proposed model is capable of generating samples that diverge significantly from the original training set and resemble previously unseen data. Thus, 5% of the identities were held out for evaluation. The researchers measured the L2 distance between each real identity from the test set to the closest identity generated by the GAN, as well as to the closest real identity from the training set.

The distance between the generated and real identities

As can be seen from the figure, the test set identities are closer to the generated identities than to the training set identities. Moreover, the “Test to fake” distances are not significantly larger than the “Fake to real” distances. This implies that the generator does not just produce IDs that are very close to the training set, but also novel IDs that resemble previously unseen examples.

Finally, a qualitative evaluation was performed to check if the proposed pipeline is able to generate original data samples. Thus, facial textures generated by the model were compared to their closest real neighbors in terms of L2 norm between identity descriptors.

Synthesized facial textures (top) vs. corresponding closest real neighbors (bottom)

As you can see, the nearest real textures are far enough to be visually distinguished as different people, which confirms the model’s ability to produce novel identities.

Bottom Line

The suggested model is probably the first to realistically synthesize both the texture and the geometry of human faces. It can be useful for training face detection, face recognition, or face reconstruction models. In addition, it can be applied in cases where many different realistic faces are needed, such as the film industry or computer games. Moreover, this framework is not limited to synthesizing human faces but can be applied to other classes of objects where alignment of the data is possible.

Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors

14 August 2018

Precise facial landmark detection lays the foundation for a high-quality performance of many computer vision and computer graphics tasks, such as face recognition, face animation and face reenactment. Many face recognition methods rely on locations of detected facial landmarks to spatially align faces, and imprecise landmarks could lead to bad alignment and degrade face recognition performance. Precise facial landmark detection is still an unsolved problem. While significant work has been done on image-based facial landmark detection, these detectors tend to be accurate but not precise, i.e., the detector’s bias is small but the variance is significant. The main causes could be:

  1. insufficient training samples
  2. imprecise annotations

as human annotations inherently have limits on precision and consistency. Other methods that focus on video facial landmark detection utilize both detections and tracking to combat jittering and increase precision, but these methods require per-frame annotations in a video.

This research uses unsupervised learning for the task. The proposed approach, Supervision-by-Registration (SBR), augments the training loss function with supervision automatically extracted from unlabeled videos. The key observation is that the coherency between

  1. the detections of the same landmark in adjacent frames, and
  2. registration, i.e., optical flow,

is a source of supervision.

Framework Overview

Fig. 1: The SBR framework takes labeled images and unlabeled video as input to train an image-based facial landmark detector.

SBR is an end-to-end trainable model that consists of two components: a generic detector built on convolutional networks and a differentiable Lucas-Kanade (LK) operation. During the forward pass, the LK operation takes the landmark detections from the previous frame and estimates their locations in the current frame. The tracked landmarks are then compared with the direct detections on the current frame, and the registration loss is defined as the offset between them. In the backward pass, the gradient from the registration loss is back-propagated through the LK operation to encourage temporal coherency in the detector. The final output of the method is an enhanced image-based facial landmark detector which has leveraged large amounts of unlabeled video to achieve higher precision on both images and videos, as well as more stable predictions on videos. SBR thus brings additional supervisory signals from registration to enhance the precision of the detector. In sum, SBR has the following benefits:

  1. SBR can enhance the precision of a generic facial landmark detector on both images and video in an unsupervised fashion.
  2. Since the supervisory signal of SBR does not come from annotations, SBR can utilize a very large amount of unlabeled video to enhance the detector.
  3. SBR can be trained end-to-end with the widely used gradient back-propagation method.

Network Architecture

SBR Network Architecture

SBR consists of two complementary parts: the general facial landmark detector and the LK tracking operation. The training procedure of supervision-by-registration uses two complementary losses. The detection loss utilizes the appearance of a single image together with its label information to learn a better landmark detector. Many facial landmark detectors take an image I as input and regress to the coordinates of the facial landmarks, i.e., D(I) = L; an L² loss is applied between the predicted coordinates L and the ground-truth labels L*.

loss function

The registration loss uncovers temporal consistency by incorporating a Lucas-Kanade operation into the network. Registration loss can be computed in an unsupervised manner to enhance the detector. It is realized with a forward-backward communication scheme between the detection output and the LK operation. The forward communication computes the registration loss while the backward communication evaluates the reliability of the LK operation. The loss is as follows:

loss function

Complete loss function: Let N be the number of training samples with ground truth. For notation brevity, we assume there is only one unlabeled video with T frames. Then, the complete loss function of SBR is as follows:

Complete loss function
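
Putting the pieces together, the losses described above can be written roughly as follows (the notation is a reconstruction from the description in this post, not copied from the paper):

```latex
% Detection loss on N labeled images: L2 error between detections and ground truth
\mathcal{L}_{\text{det}} = \frac{1}{N}\sum_{i=1}^{N}\left\lVert D(I_i) - L_i^{*}\right\rVert_2^2

% Registration loss on an unlabeled video with T frames: detections on frame t
% should agree with landmarks tracked by Lucas-Kanade from frame t-1
\mathcal{L}_{\text{reg}} = \frac{1}{T-1}\sum_{t=2}^{T}\left\lVert D(I_t) - \mathrm{LK}\big(D(I_{t-1}),\, I_{t-1},\, I_t\big)\right\rVert_2^2

% Complete loss: detection loss plus a weighted registration loss
\mathcal{L} = \mathcal{L}_{\text{det}} + \lambda\,\mathcal{L}_{\text{reg}}
```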

The first detector is CPM, which utilizes an ImageNet pre-trained model as the feature extraction part. The first four convolutional layers of VGG-16 are used for feature extraction, and only three CPM stages are used for heatmap prediction. The faces are cropped and resized to 256×256 for pre-processing. The CPM is trained with a batch size of 8 for 40 epochs in total; the learning rate starts at 0.00005 and is reduced by a factor of 0.5 at the 20th and 30th epochs.



The second detector is a simple regression network, denoted as Reg. VGG-16 is used as the base model, with the output neurons of the last fully-connected layer changed to K×2, where K is the number of landmarks. Input images are cropped to 224×224 for this regression network.

Result


The datasets used are 300-W, AFLW, YouTube Faces, 300-VW, and YouTube Celebrities. The results of SBR are reported for both the Reg (regression-based) and CPM (heatmap-based) detectors on AFLW and 300-W. Normalized Mean Error (NME) is used to evaluate the performance on images.

The Bottom Line

SBR achieves state-of-the-art results on all of the datasets. The main advantages of supervision-by-registration (SBR) are:

  1. It does not rely on human annotations which tend to be imprecise,
  2. The detector is no longer limited to the quantity and quality of human annotations, and
  3. Back-propagating through the LK layer enables more accurate gradient updates than self-training.

Also, experiments on synthetic data show that annotation errors in the evaluation set may make a well-performing model seem like it is performing poorly, so one should be careful of annotation imprecision when interpreting quantitative results.

Muneeb ul Hassan

State-of-the-Art Facial Expression Recognition Model: Introducing of Covariances

23 May 2018

Recognizing facial expressions is quite an interesting and at the same time challenging task. Humans are likely to benefit significantly if facial expression recognition is performed automatically by computer algorithms. Possible applications of such an algorithm include better transcription of videos, movie or advertisement recommendations, detection of pain in telemedicine, etc.

Still, not all humans perform equally well at recognizing other people’s emotions, but it seems that machines should be good at this, shouldn’t they? We all know that humans express their emotions with eye, eyebrow, and lip movements. So how good are current state-of-the-art approaches at recognizing these motion patterns? It turns out that modern machine learning algorithms demonstrate around 55% accuracy for recognizing facial expressions from real-world images and 46% accuracy when performing the same task on videos.

Let’s now discover how covariances can improve the accuracy, with which facial expressions are recognized and classified.

Figure 1. Sample images of different expressions and distortion of the region between eyebrows in the corresponding image

What is suggested to improve the results?

A group of researchers from ETH Zurich (Switzerland) and KU Leuven (Belgium) points out that classifying facial expressions into different categories (sadness, anger, joy, etc.) requires capturing regional distortions of facial landmarks. They believe that second-order statistics such as covariance are better suited to capturing such distortions in regional facial features.

The suggested approach was applied to two separate tasks:

  • Facial expression recognition from images: covariance pooling was introduced after the final convolutional layers. Dimensionality reduction was carried out using concepts from the manifold network, which was trained together with conventional CNNs in an end-to-end fashion.
  • Facial expression recognition from videos: covariance pooling was used here to capture the temporal evolution of per-frame features. The researchers conducted several experiments using manifold networks for pooling per-frame features.

Now, let’s dig deeper into this new approach to facial expression recognition using covariance pooling.

Model architecture

First, image-based facial expression recognition will be discussed. Here the algorithm starts with face detection to get rid of irrelevant information contained in real-world images. Detected faces are aligned based on facial landmark locations, and the normalized faces are fed into a deep CNN. Covariance pooling is used to pool the feature maps of the CNN spatially. Finally, the manifold network is employed to deeply learn the second-order statistics.

Figure 2. The pipeline of the proposed model for image-based facial expression recognition

Next, the model for video-based facial expression recognition is mostly similar to the image-based one, but yet has some peculiarities. Firstly, the pipeline starts with getting useful information from videos: all frames are extracted from videos, and then face detection and alignment is performed on each individual frame. Furthermore, the authors of this model suggested pooling the frames over time since, intuitively, the temporal covariance can capture the useful facial motion pattern. Afterward, they again employed the manifold network for dimensionality reduction and non-linearity on covariance matrices.

Figure 3. The overview of the presented model for video-based facial expression recognition

Now, let’s have a short overview of the two core techniques used in the proposed models: covariance pooling and manifold network for learning the second-order features deeply.

Covariance pooling. A covariance matrix is used to summarize the second-order information in a set of features. However, in order to preserve the geometric structure while employing layers of the symmetric positive definite (SPD) manifold network, the covariance matrices are required to be SPD. Even if the matrices are only positive semi-definite, they can be regularized by adding a multiple of the trace to the diagonal entries of the covariance matrix.
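
A compact PyTorch-style sketch of covariance pooling with this trace-based regularization is shown below; the epsilon value is an assumed constant.

```python
import torch

def covariance_pooling(feature_map, eps=1e-4):
    """Covariance pooling of a CNN feature map (batch, C, H, W) -> (batch, C, C)."""
    b, c, h, w = feature_map.shape
    x = feature_map.reshape(b, c, h * w)
    x = x - x.mean(dim=2, keepdim=True)               # center each channel
    cov = x @ x.transpose(1, 2) / (h * w - 1)         # channel-by-channel covariance
    # Regularize so the matrices become strictly positive definite (SPD):
    trace = cov.diagonal(dim1=1, dim2=2).sum(dim=1)
    identity = torch.eye(c, device=cov.device).expand(b, c, c)
    return cov + eps * trace.view(b, 1, 1) * identity
```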

SPD Manifold Network (SPDNet). The covariance matrices calculated in the previous step typically reside on the Riemannian manifold of SPD matrices. They are often large, and their dimension needs to be reduced without losing the geometric structure. So, let’s briefly discuss the specific layers that are used to solve these tasks:

  • Bilinear Mapping Layer (BiMap) accomplishes the task of reducing dimension while preserving the geometric structure.
  • Eigenvalue Rectification Layer (ReEig) is used to introduce non-linearity.
  • Log Eigenvalue Layer (LogEig) maps elements of the Riemannian manifold into a flat space so that matrices can be flattened and standard Euclidean operations can be applied.

Note that BiMap and ReEig layers can be used together, and so the block of these two layers is abbreviated as BiRe.

Figure 4. Illustration of SPD Manifold Network (SPDNet) with 2-BiRe layers
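
For illustration, the three SPDNet layers can be sketched as follows (eigendecomposition-based, with an assumed epsilon; the orthogonality constraint on the BiMap weights is omitted):

```python
import torch

def bimap(spd, weight):
    """BiMap layer: X' = W X W^T reduces dimension while keeping an SPD-shaped output.

    `weight` is expected to be (semi-)orthogonal in SPDNet; enforcing that
    constraint during training is not shown here.
    """
    return weight @ spd @ weight.transpose(-1, -2)

def reeig(spd, eps=1e-4):
    """ReEig layer: rectify small eigenvalues to introduce non-linearity."""
    eigvals, eigvecs = torch.linalg.eigh(spd)
    eigvals = torch.clamp(eigvals, min=eps)
    return eigvecs @ torch.diag_embed(eigvals) @ eigvecs.transpose(-1, -2)

def logeig(spd):
    """LogEig layer: matrix logarithm, mapping SPD matrices to a flat space."""
    eigvals, eigvecs = torch.linalg.eigh(spd)
    return eigvecs @ torch.diag_embed(torch.log(eigvals)) @ eigvecs.transpose(-1, -2)
```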

Results for image-based facial expression recognition

To compare the performance of the suggested approach to some baseline models, researchers used two datasets:

  • Real-world Affective Faces (RAF) contains 15,331 images labeled with seven basic emotion categories, of which 3,068 were used for validation and 12,271 for training.
  • Static Facial Expressions in the Wild (SFEW) 2.0 contains 1,394 images, of which 958 were used for training and 436 for validation.

Then, it was decided to experiment with various models while introducing covariance pooling. You can see the details of the models considered in Table 1.

Table 1. Various models considered for covariance pooling

Now, various models described in the table above, as well as some other state-of-the-art models without covariance pooling, are listed in Table 2 together with the respective accuracies.

Table 2. Comparison of image-based recognition accuracies for various models

As you can see, Model-2 demonstrates 87% accuracy on the RAF dataset and outperforms the baseline model by 2.3%, which is a very good result for such a challenging task as facial expression recognition. Next, Model-4 with covariance pooling shows an improvement of almost 3.7% over the baseline on the SFEW 2.0 dataset, which justifies the use of SPDNet for image-based facial expression recognition. Overall, these are the best results achieved for this kind of problem by state-of-the-art methods so far.

Figure 5. Samples from each class of the SFEW dataset that were most accurately and least accurately classified.

Results for video-based facial expression recognition

Here the Acted Facial Expressions in the Wild (AFEW) dataset was used to compare the novel approach with existing methods. This dataset was prepared by selecting videos from movies. It contains about 1,156 publicly available labeled videos, of which 773 were used for training and 383 for validation.

The results of the proposed methods with covariance pooling as well as of some other state-of-the-art methods selected for comparison are provided below. However, it should be noted that datasets used for pretraining of other models are not uniform, and so the detailed comparison of all existing methods requires further research.

Table 3. Comparison of video-based recognition accuracies for various models.

As can be observed from Table 3, the model with covariance pooling and 4 BiRe layers was able to slightly surpass the results of the baseline model. It also demonstrated higher accuracy than all single models trained on publicly available training datasets. The VGG13 network, which shows much higher accuracy, was trained on a private dataset containing a significantly higher number of samples. Still, we cannot conclude that introducing covariance pooling to the problem of video-based facial expression recognition provides any significant improvement in recognition accuracy.

Conclusion

In summary, this study introduces the end-to-end pooling of second-order statistics for both videos and images in the context of facial expression recognition. However, state-of-the-art results were achieved only for image-based facial expression recognition. Here the recognition accuracy after introducing covariance pooling to the model outperformed all other existing methods.

For the problem of video-based facial expression recognition, training SPDNet on image-based features was still able to obtain results comparable to state-of-the-art results. Not very high accuracy of the suggested method could be a result of the relatively small size of AFEW dataset compared to parameters in the network. The authors of this method conclude that further work is necessary to see if training end-to-end using joint convolutional network and SPDNet can improve the results.