Pairwise Relational Network – New Method for Face Recognition

28 November 2018

With the rapid progress of deep learning in the past few years, many computer vision problems have been tackled and solved with human or even beyond human performance. One of these tasks, in fact, a very popular one, is face recognition.

Until the recent past, face recognition was seen as something straight out of science fiction. But, over the past decade or two face recognition has not only become a solved problem but a widespread technology with applications in several industries.

Previous Works

Since face recognition is a challenging task, it took researchers some time to reach satisfactory results. Researchers in the domains of pattern recognition, computer vision, and artificial intelligence have proposed many solutions over the years. The main goal has been to reduce difficulties such as highly variable face poses and image quality, so as to improve robustness and recognition accuracy.

A number of deep learning-based face recognition methods have been proposed in the past few years, starting with the remarkable results of DeepFace (2014), followed by methods such as DeepID (2014), FaceNet (2015), and VGGFace (2015), all the way to more recent methods like CosFace (2018) and ArcFace (2018).

State-of-the-art Idea

Recently, researchers from Pohang University of Science and Technology in Korea have proposed a novel face recognition method that achieved state-of-the-art results on some of the benchmark datasets. The new method, called pairwise relational network (PRN) takes local appearance features around landmark points on the feature map and captures unique pairwise relations with the same identity and discriminative pairwise relations between different identities.

Pairwise Relational Network

The idea is to build a method that represents a face image in such a manner that the extracted features are discriminative across faces of different people.


The proposed method takes local appearance features as input, obtained by ROI projection around landmark points on the feature map. These features are used to train a PRN (pairwise relational network) to capture unique pairwise relations between pairs of local appearance features. Arguing that the existence of such pairwise relations is identity dependent, the researchers employ an LSTM to learn an additional facial identity state feature. The architecture of the method, as well as of the pairwise relational network, is shown in the figure below.

The architecture of the proposed method
Learning face identity state feature
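The pairwise construction can be sketched as follows; the single-layer MLP, its shapes, and the way the identity state feature is injected are illustrative assumptions rather than the paper's exact architecture:

```python
import numpy as np
from itertools import permutations

def prn_forward(local_feats, mlp_w, id_state):
    """Sketch of one PRN forward pass.

    local_feats : (K, D) local appearance features around landmark points
    mlp_w       : (2*D + S, H) weights of a one-layer MLP (hypothetical g_theta)
    id_state    : (S,) face identity state feature (produced by an LSTM in the paper)
    """
    relations = []
    for i, j in permutations(range(len(local_feats)), 2):
        # each ordered pair of local features is processed together with the identity state
        pair = np.concatenate([local_feats[i], local_feats[j], id_state])
        relations.append(np.maximum(pair @ mlp_w, 0.0))  # ReLU MLP on each pair
    return np.mean(relations, axis=0)  # aggregate all pairwise relations

K, D, S, H = 4, 8, 5, 16
rng = np.random.default_rng(0)
feat = prn_forward(rng.normal(size=(K, D)),
                   rng.normal(size=(2 * D + S, H)),
                   rng.normal(size=S))
print(feat.shape)  # (16,)
```

The aggregated relation vector then serves as the face representation fed into the loss functions described below.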

From a perspective of learning and optimization, the method uses combined triplet ratio loss, pairwise loss, and softmax loss. Stochastic Gradient Descent was used as the optimization method with an initial learning rate of 0.1.
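As a hedged illustration of such a combined objective (a standard triplet margin loss stands in here for the paper's triplet ratio loss, and the unit loss weights are assumptions):

```python
import numpy as np

def softmax_ce(logits, label):
    """Numerically stable softmax cross-entropy for a single sample."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

def combined_loss(anchor, pos, neg, logits, label, margin=1.0):
    """Sketch of a combined objective: triplet term + pairwise term + softmax term.

    The triplet margin form and the unit weights are illustrative stand-ins,
    not the paper's exact triplet ratio / pairwise formulations.
    """
    d_pos = np.linalg.norm(anchor - pos)
    d_neg = np.linalg.norm(anchor - neg)
    triplet = max(0.0, d_pos - d_neg + margin)   # push negatives away
    pairwise = d_pos ** 2                        # pull same-identity pairs together
    return triplet + pairwise + softmax_ce(logits, label)

rng = np.random.default_rng(0)
a, p, n = rng.normal(size=(3, 8))
loss = combined_loss(a, p, n, logits=rng.normal(size=10), label=3)
print(loss)
```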

Evaluation and Comparison

The proposed method was evaluated on the LFW dataset, a popular benchmark for face verification in unconstrained environments. It contains 13,233 highly variable face images from 5,749 different identities. The PRN method reaches 99.76% accuracy on this dataset, which is almost the same as the state-of-the-art method ArcFace (99.78%).

Comparison with other methods on the LFW Dataset.

However, when evaluated on the YTF dataset (which has similar characteristics to LFW), the PRN method achieves state-of-the-art results – 96.3%.

Comparison with other methods on the YTF Dataset. The PRN method achieves state-of-the-art performance.

Additionally, the method was evaluated on IJB-A and IJB-B datasets for evaluating face verification and face identification. The results obtained on these datasets as well as on LFW and YTF compared with other methods are reported in the tables below.

Comparison of performances of the proposed PRN method with the state-of-the-art on the IJB-B dataset.
Comparison of performances of the proposed PRN method with the state-of-the-art on the IJB-A dataset.


The researchers proposed an interesting approach to a well-known problem – face recognition. In their paper, they show that capturing those kinds of unique and discriminative pairwise relations actually solves the problem of face identification to a high degree of accuracy. Extensive experiments have been done on popular datasets and the method achieves very good results on all of them and state-of-the-art performance on one of them.

New Datasets for Disguised Face Recognition

9 October 2018

Face recognition is a common task in deep learning, and convolutional neural networks (CNNs) are doing a pretty good job here. Facebook, for example, usually does a good job of recognizing you and your friends in uploaded images.

But is this really a solved problem? What if the picture is obfuscated? What if the person impersonates somebody else? Can heavy makeup trick the neural network? How easy is it to recognize a person who wears glasses?

Disguised face recognition is still quite a challenging task for neural networks and primarily due to the lack of corresponding datasets. In this article, we are going to feature several face datasets presented recently. Each of them reflects different aspects of face obfuscation, but their goal is the same – to help developers create better models for disguised face recognition.

Disguised Faces in the Wild

Number of images: 11,157

Number of subjects: 1,000

Year: 2018

Sample genuine, cross-subject impostor, impersonator, and obfuscated face images for a single subject

We will start with the most recent dataset presented earlier this year – Disguised Faces in the Wild (DFW). It primarily contains images of celebrities of Indian or Caucasian origin. The dataset focuses on a specific challenge of face recognition under the disguise covariate.

According to the DFW’s description, it covers disguise variations for hairstyles, beard, mustache, glasses, make-up, caps, hats, turbans, veils, masquerades and ball masks. This is coupled with other variations for pose, lighting, expression, background, ethnicity, age, gender, clothing, and camera quality.

There are four types of images in the dataset:

  • Normal Face Image: each subject has a non-disguised frontal face image.
  • Validation Face Image: 903 subjects have an image, which corresponds to a non-disguised frontal face image and can be used for generating a non-disguised pair within a subject.
  • Disguised Face Image: each subject has from 1 to 12 face images with intentional or unintentional disguise.
  • Impersonator Face Image: 874 subjects have from 1 to 21 images of the impersonators. An impersonator of a subject refers to a picture of any other person (intentionally or unintentionally) pretending to be the subject’s identity.
Sample images of 3 subjects from the DFW dataset. Each row corresponds to one subject, containing the normal (gray), validation (yellow), disguised (green), and impersonated (blue) images.

In total, the DFW dataset contains 1,000 normal face images, 903 validation face images, 4,814 disguised face images, and 4,440 impersonator images.
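The four image types above define the verification pairs used in evaluation. A hedged sketch of how genuine and impostor pairs could be assembled per subject (the record layout and filenames are illustrative assumptions, not the official protocol):

```python
# Hypothetical per-subject record mirroring the four DFW image types
subject = {"normal": "n.jpg",
           "validation": "v.jpg",
           "disguised": ["d1.jpg", "d2.jpg"],
           "impersonator": ["i1.jpg"]}

def make_pairs(subj):
    """Genuine pairs keep the same identity (normal vs. validation/disguised
    images); impostor pairs match the normal image against impersonators."""
    genuine = [(subj["normal"], subj["validation"])]
    genuine += [(subj["normal"], d) for d in subj["disguised"]]
    impostor = [(subj["normal"], i) for i in subj["impersonator"]]
    return genuine, impostor

g, i = make_pairs(subject)
print(len(g), len(i))  # 3 1
```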

Makeup Induced Face Spoofing

Number of images: 642

Number of subjects: 107 + 107 target subjects

Year: 2017

An attempt of one subject to spoof multiple identities

The Makeup Induced Face Spoofing dataset (MIFS) is also about impersonation, but with a specific focus on makeup. The researchers extracted the images from YouTube videos in which female subjects applied makeup to transform their appearance to resemble celebrities. It should be noted, though, that the subjects were not deliberately trying to deceive an automated face recognition system; rather, they intended to impersonate a target celebrity from a human vision perspective.

The dataset consists of 107 makeup transformations, with two before-makeup and two after-makeup images per subject. Additionally, two face images of each target identity were taken from the Internet and included in the dataset. However, the target images are not necessarily those used by the spoofer as a reference during the makeup transformation: celebrities sometimes change their facial appearance drastically, so the researchers tried to select target identity images that most resembled the after-makeup image.

Finally, all the acquired images were cropped to the face region, eliminating hair and accessories. Examples of the cropped images are provided below.

Examples of images in the MIFS dataset after cropping: before makeup – after makeup – target identity

So, in total, the MIFS dataset contains 214 images of subjects before makeup, 214 images of the same subjects after makeup applied with the intention of spoofing, and 214 images of the target subjects being spoofed. Note that some subjects attempt to spoof multiple target identities (resulting in duplicate subject identities), and multiple subjects sometimes attempt to spoof the same target identity (resulting in duplicate target identities).

Specs on Faces dataset

Number of images: 42,592

Number of subjects: 112

Year: 2017

Samples from SoF dataset: metadata for each image includes 17 facial landmarks, a glass rectangle, and a face rectangle

It turns out that glasses, as a natural occlusion, threaten the performance of many face detectors and facial recognition systems. That is why a dataset in which all subjects wear glasses is of particular importance. The Specs on Faces (SoF) dataset comprises 2,662 original images of size 640 × 480 pixels of 112 persons (66 males and 46 females) of different ages. Glasses are the common natural occlusion in all images of the dataset. This original set of images consists of two parts:

  • 757 unconstrained face images in the wild that were captured over a long period in several locations under indoor and outdoor illumination environments;
  • 1,905 images specifically dedicated to challenging, harsh illumination changes: 12 persons were filmed under a single lamp placed at arbitrary locations so as to emit light rays in random directions.
Images captured under different lighting directions

Then, for each image of the original set, there are:

  • 6 extra pictures generated by synthetic occlusion – nose and mouth occlusion using a white block;
  • 9 additional pictures made with image filters: Gaussian noise, Gaussian blur, and image posterization using fuzzy logic.
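These synthetic variants are easy to reproduce approximately. A minimal NumPy sketch (block position/size, noise sigma, and posterization levels are arbitrary choices; Gaussian blur is omitted for brevity):

```python
import numpy as np

def occlude(img, y, x, h, w):
    """White-block occlusion, e.g. over the nose or mouth region."""
    out = img.copy()
    out[y:y + h, x:x + w] = 255
    return out

def gaussian_noise(img, sigma=10.0, seed=0):
    """Additive Gaussian noise, clipped back to valid 8-bit intensities."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(float) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def posterize(img, levels=4):
    """Reduce the number of intensity levels (a crude posterization)."""
    step = 256 // levels
    return (img // step) * step

face = np.full((480, 640), 128, dtype=np.uint8)  # stand-in for a 640x480 SoF image
print(occlude(face, 200, 300, 40, 60).max(), posterize(face, 4)[0, 0])  # 255 128
```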

So, in total, the SoF dataset includes 42,592 images of 112 persons, plus a huge bonus – handcrafted metadata containing, for each subject, the subject ID, a view (frontal/near-frontal) label, 17 facial feature points, face and glasses rectangles, gender and age labels, illumination quality, and facial emotion.

Large Age-Gap Face Verification

Number of images: 3,828

Number of subjects: 1,010 celebrities

Year: 2017

Examples of face crops for matching pairs in the LAG dataset

Another challenge is a large age gap. Can an algorithm recognize a person from a picture taken in early childhood? The Large Age-Gap (LAG) dataset was created to help developers solve this challenging task.

The dataset was constructed from photos of celebrities found through Google Image Search and in YouTube videos. The notion of a large age gap has two interpretations: on one hand, it refers to images with an extreme difference in age (e.g., 0 to 80 years old); on the other, it refers to a significant difference in appearance due to the aging process. For instance, as the dataset authors point out, “0 to 15 years old is a relatively small difference in age but has a large change in appearance”.

The LAG dataset reflects both aspects of the large-age-gap concept. It contains 3,828 images of 1,010 celebrities. For each identity, at least one child/young image and one adult/old image are present. From the collected images, a total of 5,051 matching pairs has been generated.

Additional examples of matching pairs in the LAG dataset

Bottom Line

The face recognition problem is still far from fully solved. Many challenging conditions significantly degrade the performance of current facial recognition systems – it turns out that even glasses are a serious problem. Fortunately, new face image datasets appear regularly. While each of them focuses on a different aspect of the problem, together they build a great foundation for significant improvements in the performance of facial recognition systems.

Fooling Facial Recognition: Fast Method for Generating Adversarial Faces

2 October 2018

With their rapid progress and state-of-the-art performance across a wide range of tasks, deep learning-based methods are now used in a large number of security-sensitive and critical applications. However, despite their remarkable, often beyond-human-level performance, deep learning methods are vulnerable to well-designed input samples, known as adversarial examples. In a game of “cat and mouse”, researchers compete to design robust adversarial attacks on one hand and robust defense mechanisms on the other.

The problem of adversarial attacks is especially pronounced in computer vision tasks such as object recognition and classification. In deep learning-based image processing, small perturbations in the input space can produce a significant change in the output. Such perturbations are almost unnoticeable to humans and do not change the semantics of the image content, yet they can trick deep learning methods.

Adversarial attacks are a big concern in security-critical applications such as identity verification, access control etc. One particular target of adversarial attacks is face recognition.

Previous works

The excellent performance of deep learning methods for face recognition has led to their adoption in a wide variety of systems.

In the past, adversarial attacks have targeted face recognition systems. Mainly, these attacks can be divided into two more prominent groups: intensity-based and geometry-based adversarial attacks. Many of them proved to be very successful in fooling a face recognition system. However, a number of defense mechanisms have been proposed to deal with different kinds of attacks.

Comparison of the proposed attack (Column 2) to an intensity-based attack (Column 3).

To exploit the vulnerability of face recognition systems and surpass defense mechanisms, more and more sophisticated adversarial attacks have been developed. Some of them are changing pixel intensities while others are trying to transform benign images to perform the attack spatially.

State-of-the-art idea

Researchers from West Virginia University have proposed a new, fast method for generating adversarial face images. Their approach defines a face transformation model based on facial landmark locations.


The problem of transforming an image into an adversarial sample is addressed through landmark manipulation. The technique optimizes a displacement field, which is used to spatially transform the input image. It is a geometry-based attack that can generate an adversarial sample by modifying only a small number of landmarks.

Taking into account that facial landmark locations provide highly discriminative information for face recognition, the researchers use the gradients of the prediction with respect to the landmark positions to update the displacement field. A scheme of the proposed method for generating adversarial face images is shown in the picture below.

The proposed method is optimizing for a displacement field to produce adversarial landmark locations.

To overcome the problem of conflicting updates of the displacement field caused by gradients pointing in different directions, they propose grouping the landmarks semantically. This allows manipulating group properties instead of perturbing each landmark individually, yielding more natural images.

Grouping face landmarks based on semantic regions of the face (eye, nose, mouth etc.).
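A toy sketch of such a grouped update (landmark indices follow the common 68-point annotation scheme; the step size and the use of the mean gradient per group are illustrative assumptions):

```python
import numpy as np

# Hypothetical semantic groups over 68 landmark indices (iBUG-style layout assumed)
GROUPS = {"eyes": range(36, 48), "nose": range(27, 36), "mouth": range(48, 68)}

def grouped_update(landmarks, grads, step=0.5, group="mouth"):
    """One GFLM-style attack step: move a whole semantic group by the mean
    gradient of the classifier's loss w.r.t. its landmark positions, so all
    landmarks in the group share one displacement (keeping the face natural)."""
    flow = np.zeros_like(landmarks)
    idx = list(GROUPS[group])
    flow[idx] = grads[idx].mean(axis=0)  # one shared displacement for the group
    return landmarks + step * flow

lms = np.zeros((68, 2))          # landmark (x, y) positions
g = np.ones((68, 2))             # stand-in gradients from the face classifier
out = grouped_update(lms, g)
print(out[50], out[0])           # mouth landmarks moved, others unchanged
```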


The new adversarial face generator was evaluated by measuring and comparing the performance of the attacks under several defense methods. To further explore the problem of generating adversarial samples of face images the researchers assess how spatially manipulating the face regions affects the performance of a face recognition system.

First, performance was evaluated in a white-box attack scenario on the CASIA-WebFace dataset. Six experiments investigated the importance of each region of the face to the proposed attacks, evaluating performance on each of the five main regions – 1) eyebrows, 2) eyes, 3) nose, 4) mouth, 5) jaw – as well as 6) all regions together. The results are given in the table.

Comparison of the results of the proposed attacks FLM and GFLM to stAdv [33] and exploring the influence of different regions of the face.
The researchers calculate the prediction of the true class for faces which are correctly classified and their manipulated versions.

Comparison with other state-of-the-art

A comparison with several existing methods for generating adversarial faces has been made within this study. They compare the methods in terms of success rate and also speed.

Comparison of the proposed FLM and GFLM attacks to FGSM and stAdv attacks under the state-of-the-art adversarial training defenses.
Linearly interpolating the defined face properties


This approach shows that landmark manipulation can be a reliable way of changing the prediction of face recognition classifiers. The novel method can generate adversarial faces approximately 200 times faster than other geometry-based approaches. This method creates natural samples and can fool state-of-the-art defense mechanisms.

Identity Verification with Deep Learning: ID-Selfie Matching Method

24 September 2018

A large number of daily activities in our lives require identity verification, which provides a security mechanism in settings ranging from access control to border crossings and bank transactions. However, in many of these activities the verification process is still manual; it is often slow and requires human operators.

Examples of automatic ID document photo matching systems at international borders.

An automated system for identity verification will significantly speed up the process and provide a seamless security check in all those activities where we need to verify our identity. One of the simplest ways to do this is to design a system that will match ID photos with selfie pictures.

Previous works

There have been both successful and unsuccessful attempts in the past to employ an automated system for identity verification. A successful example is Australia’s SmartGate. It is an automated self-service border control system operated by the Australian Border Force and located at immigration checkpoints in arrival halls in eight Australian international airports. It uses a camera to capture a verification picture and tries to match it to a person’s ID. Also, China has introduced such systems at train stations and airports.

While there have also been attempts to match ID documents and selfies using traditional computer vision techniques, the better-performing methods rely on deep learning. Zhu et al. proposed the first deep learning approach for document-to-selfie matching using convolutional neural networks.

State-of-the-art idea

In their new paper, researchers from Michigan State University proposed an improved version of their DocFace – a deep learning approach for document-selfie matching.

They show that gradient-based optimization methods converge slowly when many classes have very few samples – like in the case of existing ID-selfie datasets. To overcome this shortcoming, they propose a method, called Dynamic Weight Imprinting (DWI). Additionally, they introduce a new recognition system for learning unified representations from ID-selfie pairs and an open-source face matcher called DocFace+, for ID-selfie matching.


There are many problems and constraints in building an automated system for ID-selfie matching, and many of its challenges differ from those of general face recognition.

The two main challenges are the low quality of document (as well as selfie) photos due to compression, and the large time gap between document issuance and the moment of verification.

The whole method is based on transfer learning. A base neural network model is trained on a large-scale face dataset (MS-Celeb 1M), and then features are transferred to the target domain of ID-selfie pairs.

Arguing that convergence is very slow and training often gets stuck in local minima when many classes have very few samples, the researchers use the Additive Margin Softmax (AM-Softmax) loss function together with a novel optimization method they call Dynamic Weight Imprinting (DWI).

Generalization performance of different loss functions.
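A minimal single-sample sketch of the AM-Softmax loss (the scale s and margin m values are common defaults from the AM-Softmax literature, not necessarily those used in DocFace+):

```python
import numpy as np

def am_softmax_loss(feat, W, label, s=30.0, m=0.35):
    """Additive Margin Softmax sketch: cosine logits with an additive
    margin m subtracted from the target class, scaled by s."""
    f = feat / np.linalg.norm(feat)                      # normalize embedding
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)    # normalize class weights
    cos = f @ Wn                                         # cosine similarities
    logits = s * cos
    logits[label] = s * (cos[label] - m)                 # margin on target class
    logits -= logits.max()                               # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[label])

rng = np.random.default_rng(1)
loss = am_softmax_loss(rng.normal(size=8), rng.normal(size=(8, 5)), label=2)
print(loss)
```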

Dynamic Weight Imprinting

Since stochastic gradient descent updates the network with mini-batches, in a two-shot case (such as ID-selfie matching) each weight vector receives update signals only twice per epoch. These sparse attraction signals make little difference to the classifier weights. To overcome this, the researchers propose an optimization method that updates the weights based on sample features, thereby avoiding underfitting of the classifier weights and accelerating convergence.

Compared with stochastic gradient descent and other gradient-based optimization methods, DWI updates the weights based only on genuine samples, and only for the classes present in the mini-batch. It also works well with very large datasets where the weight matrix of all classes is too large to load into memory and only a subset of weights can be sampled for training.

Comparison of AM-Softmax loss and the proposed DIAM loss.
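One plausible (hypothetical) form of such an update, replacing the gradient step for classifier weights, is sketched below; the blend factor alpha and the exact prototype computation are illustrative assumptions:

```python
import numpy as np

def dwi_update(W, feats, labels, alpha=0.5):
    """Dynamic Weight Imprinting sketch: instead of a gradient step, move
    each class weight toward the normalized mean feature (prototype) of that
    class's samples in the mini-batch. Only classes present in the batch are
    touched; absent classes keep their weights unchanged."""
    W = W.copy()
    for c in np.unique(labels):
        proto = feats[labels == c].mean(axis=0)
        proto /= np.linalg.norm(proto)
        W[:, c] = (1 - alpha) * W[:, c] + alpha * proto  # imprint the prototype
        W[:, c] /= np.linalg.norm(W[:, c])               # keep weights unit-norm
    return W

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 10))          # classifier weights, 10 classes
feats = rng.normal(size=(4, 16))       # mini-batch embeddings
labels = np.array([3, 3, 7, 7])        # only classes 3 and 7 are present
W2 = dwi_update(W, feats, labels)
```

Note how this addresses the two-shot problem: a class's weight vector is pulled directly onto its sample features whenever the class appears, rather than drifting slowly via sparse gradients.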

The researchers first trained the popular Face-ResNet architecture using stochastic gradient descent and the AM-Softmax loss. They then fine-tuned the model on the ID-selfie dataset, combining the proposed Dynamic Weight Imprinting optimization with Additive Margin Softmax. Finally, a pair of sibling networks sharing high-level parameters is trained to learn domain-specific features of IDs and selfies.

Workflow of the proposed method. A base model is trained on a large-scale unconstrained face dataset. The parameters are then transferred to a pair of sibling networks, which share high-level modules.


The proposed ID-selfie matching method achieves excellent results, obtaining a true acceptance rate (TAR) of 97.51 ± 0.40%. The authors report that their approach, using the MS-Celeb-1M dataset and the AM-Softmax loss function, achieves 99.67% accuracy on the standard LFW verification protocol and a verification rate (VR) of 99.60% at a false accept rate (FAR) of 0.1% on the BLUFR protocol.

Examples of falsely classified images by our model on the Private ID-selfie dataset.
The mean performance of constraining different modules of the sibling networks to be shared
Comparing Static and Dynamic Weight Imprinting regarding TAR

Comparison with other state-of-the-art

The approach was compared with other state-of-the-art general face matchers, since there are no other public ID-selfie matching methods. The comparison is given in terms of TAR (true accept rate) at a given FAR (false accept rate) and shown in the tables below.

The mean (and s.d. of) performance of different matchers on the private ID-selfie dataset
Evaluation results compared with other methods on the Public-IvS dataset


The proposed DocFace+ method for ID-selfie matching shows the potential of transfer learning, especially in tasks where not enough data is available. The proposed method is achieving high accuracy in selfie to ID matching and has potential to be employed in identity verification systems. Additionally, the proposed novel optimization method – Dynamic Weight Imprinting shows improved convergence and better generalization performance and represents a significant contribution to the field of machine learning.

Finding Tiny Faces in the Wild with Generative Adversarial Network

5 June 2018

Face detection is a fundamental problem in computer vision, since it is usually a key step towards many subsequent face-related applications, including face parsing, face verification, face tagging and retrieval, etc. Face detection has been widely studied over the past few decades, and numerous accurate and efficient methods have been proposed for most constrained scenarios. Modern face detectors have achieved impressive results on large and medium faces; however, performance on small faces is far from satisfactory. The main difficulty for small face (e.g., 10 × 10 pixels) detection is that small faces lack sufficient detail to distinguish them from similar background regions, e.g., partial faces or hands. Another problem is that modern CNN-based face detectors use down-sampled convolutional (conv) feature maps with stride 8, 16 or 32 to represent faces; these lose most spatial information and are too coarse to describe small faces.

To deal with these nuisances, a unified, end-to-end convolutional neural network for face detection based on the classical generative adversarial network (GAN) framework is proposed. The detector has two sub-networks: a generator network and a discriminator network.

In the generator sub-network, a super-resolution network (SRN) is used to up-sample small faces to a fine scale in order to find those tiny faces. Compared to re-sizing with a bilinear operation, the SRN can reduce artefacts and improve the quality of up-sampled images at large upscaling factors. However, even with such a sophisticated SRN, the up-sampled images are unsatisfactory (usually blurry and lacking fine details) because the input faces have very low resolution (e.g., 10 × 10 pixels).

Therefore, a refinement network (RN) is proposed to recover some of the missing details in the up-sampled images and generate sharp, high-resolution images for classification. The generated images and real images pass through the discriminator network, which jointly distinguishes whether they are real or generated high-resolution images and whether they are faces or non-faces. More importantly, the classification loss is used to guide the generator to produce clearer faces that are easier to classify.

Network Architecture

The generator network has two components (an up-sample sub-network and a refinement sub-network). The first sub-network takes low-resolution images as input and outputs super-resolution images. Since blurry small faces lack fine details, and due to the influence of the MSE loss, the generated super-resolution faces are usually blurry; the second sub-network therefore refines the super-resolution images produced by the first. Finally, a classification branch is added to the discriminator network for detection, so the discriminator can classify faces vs. non-faces as well as distinguish fake from real images.

Network Architecture
The architecture of the generator and discriminator network

Generator Network

The generator network includes a refinement sub-network, which is also a deep CNN. Batch normalization and rectified linear unit (ReLU) activation follow each convolutional layer except the last. The up-sampling sub-network first up-samples a low-resolution image to a 4× super-resolution image; this image is blurry when the small faces are far from the camera or under fast motion. The refinement sub-network then processes the blurry image and outputs a clear super-resolution image, which is easier for the discriminator to classify as face vs. non-face.
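The two-stage pipeline can be illustrated shape-wise with crude stand-ins (nearest-neighbour upsampling and a mean filter replace the learned sub-networks, which are deep CNNs in the paper):

```python
import numpy as np

def upsample4x(img):
    """Stand-in for the up-sampling sub-network: 4x nearest-neighbour
    upsampling (the real sub-network learns this with conv/deconv layers)."""
    return img.repeat(4, axis=0).repeat(4, axis=1)

def refine(img):
    """Stand-in for the refinement sub-network: a 3x3 mean filter
    (the real RN is a deep CNN with batch norm + ReLU)."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / 9.0

small = np.random.default_rng(0).random((10, 10))  # a 10x10 "tiny face"
sr = refine(upsample4x(small))                     # 4x super-resolved, then refined
print(sr.shape)  # (40, 40)
```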

Discriminator Network

VGG19 is used as the backbone network of the discriminator. To avoid too many down-sampling operations on the small blurry faces, max-pooling is removed from the “conv5” layer. Moreover, all the fully connected layers (i.e., fc6, fc7, fc8) are replaced with two parallel fully connected layers, fc_GAN and fc_clc. The input is the super-resolution image; the output of the fc_GAN branch is the probability of the input being a real image, and the output of the fc_clc branch is the probability of the input being a face.
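The parallel-head design can be sketched as follows (the feature size and random weights are illustrative; in the real network both heads sit on top of shared VGG19 conv features):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def disc_heads(feat, w_gan, w_clc):
    """Two parallel heads replacing fc6-fc8 (sketch): the GAN branch outputs
    p_real (real vs. generated), the classification branch outputs p_face
    (face vs. non-face), both from the same backbone feature."""
    return sigmoid(feat @ w_gan), sigmoid(feat @ w_clc)

rng = np.random.default_rng(0)
feat = rng.normal(size=16)                     # stand-in backbone feature
p_real, p_face = disc_heads(feat, rng.normal(size=16), rng.normal(size=16))
print(p_real, p_face)
```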


Loss Function

Pixel-wise loss: the input of the generator network is small blurry images instead of random noise. A natural way to enforce that the generator's output is close to the super-resolution ground truth is the pixel-wise MSE loss, calculated as:

Loss Function

where I^LR and I^HR denote the small blurry images and the high-resolution images respectively, G1 denotes the up-sampling sub-network, G2 the refinement sub-network, and w the parameters of the generator network.
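The equation itself is lost to an image placeholder; a plausible reconstruction from the symbol definitions above (hedged, since it is rebuilt from context rather than copied from the paper) is:

```latex
L_{\mathrm{MSE}}(w) = \frac{1}{N}\sum_{i=1}^{N}\Big(
    \big\| G_{1}\big(I_{i}^{LR}\big) - I_{i}^{HR} \big\|^{2}
  + \big\| G_{2}\big(G_{1}(I_{i}^{LR})\big) - I_{i}^{HR} \big\|^{2}
\Big)
```

i.e., both the up-sampled output of G1 and the refined output of G2 are pushed toward the high-resolution ground truth.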

Adversarial loss: to achieve more realistic results, an adversarial loss is added to the objective, defined as:

Adversarial loss

Here, the adversarial loss encourages the network to generate sharper high-frequency details for trying to fool the discriminator network.

Classification loss: in order to make the images reconstructed by the generator network easier to classify, a classification loss is also added to the objective. The formulation of the classification loss is:

Classification loss

Classification loss plays two roles, where the first is to distinguish whether the high-resolution images, including both the generated and the natural real high-resolution images, are faces or non-faces in the discriminator network. The other role is to promote the generator network to reconstruct sharper images.

Objective function: based on the above losses, the adversarial loss and the classification loss are incorporated into the pixel-wise MSE loss, and the GAN can be trained with the resulting objective function. For better gradient behaviour, the loss functions of the generator G and the discriminator D are modified as follows:

Objective function Objective function 2

Equation 8 consists of the adversarial loss, MSE loss and classification loss, which enforce the reconstructed images to be similar to the real high-resolution images at the level of high-frequency details, pixels, and semantics respectively. The loss function of the discriminator D in equation 9 introduces the classification loss to classify whether the high-resolution images are faces or non-faces. By adding the classification loss, the images recovered by the generator are more realistic than those optimised with the adversarial loss and MSE loss alone.
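A minimal sketch of how the three generator terms might be combined for a single image. The weights `alpha` and `beta` and the non-saturating form of the adversarial term are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def generator_loss(sr, hr, d_real_prob, d_face_prob, is_face,
                   alpha=0.001, beta=0.01):
    """Pixel-wise MSE + weighted adversarial term + weighted
    classification term, mirroring the structure of equation 8."""
    eps = 1e-12
    mse = np.mean((sr - hr) ** 2)
    # Adversarial term: push the discriminator to judge the SR image real.
    adv = -np.log(d_real_prob + eps)
    # Classification term: push the face branch toward the true label.
    clc = -(is_face * np.log(d_face_prob + eps)
            + (1 - is_face) * np.log(1.0 - d_face_prob + eps))
    return mse + alpha * adv + beta * clc

# Example: perfect pixels, discriminator 90% convinced on both branches.
loss = generator_loss(np.ones((4, 4)), np.ones((4, 4)),
                      d_real_prob=0.9, d_face_prob=0.9, is_face=1)
```

The small `alpha` and `beta` keep the pixel-wise term dominant early in training, a common choice in super-resolution GANs.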

The resulting method outperforms previously studied approaches, and the ablation study below isolates the contribution of each component.

Performance of the baseline model trained with and without the GAN, refinement network, adversarial loss and classification loss on the WIDER FACE validation set


This new method is able to find small faces in the wild by using a GAN. A novel network is designed to directly generate a clear super-resolution image from a small blurry one, with the up-sampling sub-network and refinement sub-network trained in an end-to-end way. Moreover, an additional classification branch is introduced to the discriminator network, which can distinguish fake vs. real and face vs. non-face simultaneously.

Finding faces

Qualitative detection results of the proposed method. Green bounding boxes are ground-truth annotations and red bounding boxes are results from the proposed method. Best seen on a computer, in colour and zoomed in:

Finding Tiny Faces

Muneeb ul Hassan

State-of-the-Art Facial Expression Recognition Model: Introducing Covariances

23 May 2018

State-of-the-Art Facial Expression Recognition Model: Introducing Covariances


Recognizing facial expressions is quite an interesting and at the same time challenging task. The humans are likely to benefit significantly if facial expression recognition is performed automatically by computer algorithms. Possible applications of such an algorithm would include better transcription of videos, movie or advertisement recommendations, detection of pain in telemedicine, etc.

Still, not even all humans perform equally well at recognizing other people’s emotions, but it seems that machines should be good at this, shouldn’t they? We all know that humans express their emotions with movements of the eyes, eyebrows, and lips. So how good are current state-of-the-art approaches at recognizing these motion patterns? It turns out that modern machine learning algorithms demonstrate around 55% accuracy when recognizing facial expressions from real-world images and 46% accuracy when performing the same task on videos.

Let’s now discover how covariances can improve the accuracy, with which facial expressions are recognized and classified.

Figure 1. Sample images of different expressions and distortion of the region between eyebrows in the corresponding image

What is suggested to improve the results?

A group of researchers from ETH Zurich (Switzerland) and KU Leuven (Belgium) points out that classifying facial expressions into different categories (sadness, anger, joy, etc.) requires capturing regional distortions of facial landmarks. They argue that second-order statistics such as covariance are better suited to capture such distortions in regional facial features.

The suggested approach was applied to two separate tasks:

  • Facial expression recognition from images: covariance pooling was introduced after the final convolutional layers. Dimensionality reduction was carried out using concepts from the manifold network, which was trained together with conventional CNNs in an end-to-end fashion.
  • Facial expression recognition from videos: covariance pooling was used here to capture the temporal evolution of per-frame features. The researchers conducted several experiments using manifold networks for pooling per-frame features.

Now, let’s dig deeper into this new approach to facial expression recognition using covariance pooling.

Model architecture

First, image-based facial expression recognition will be discussed. Here the algorithm starts with face detection to get rid of the irrelevant information contained in real-world images. Faces are detected and then aligned based on facial landmark locations. The normalized faces are fed into a deep CNN. To pool the feature maps from the CNN spatially, covariance pooling is used. Finally, the manifold network is employed to deeply learn the second-order statistics.

Figure 2. The pipeline of the proposed model for image-based facial expression recognition

Next, the model for video-based facial expression recognition is mostly similar to the image-based one, yet has some peculiarities. Firstly, the pipeline starts with extracting the useful information from videos: all frames are extracted, and face detection and alignment are performed on each individual frame. The authors then suggest pooling the frames over time since, intuitively, the temporal covariance can capture useful facial motion patterns. Afterward, they again employ the manifold network for dimensionality reduction and non-linearity on covariance matrices.

Figure 3. The overview of the presented model for video-based facial expression recognition

Now, let’s have a short overview of the two core techniques used in the proposed models: covariance pooling and manifold network for learning the second-order features deeply.

Covariance pooling. A covariance matrix is used for summarizing the second-order information in a set of features. However, in order to preserve the geometric structure while employing the layers of the symmetric positive definite (SPD) manifold network, the covariance matrices are required to be SPD. Even if the matrices are only positive semi-definite, they can be regularized by adding a multiple of the trace to the diagonal entries of the covariance matrix.
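A small numpy sketch of covariance pooling with this trace-based regularization; the feature shape (49 spatial positions × 8 channels) and the `eps` value are illustrative assumptions:

```python
import numpy as np

def covariance_pool(features, eps=1e-4):
    """Pool a set of feature vectors (one per row) into a covariance
    matrix, regularized to be symmetric positive definite by adding
    a multiple of the trace to its diagonal entries."""
    x = features - features.mean(axis=0, keepdims=True)
    cov = x.T @ x / (x.shape[0] - 1)
    # C + eps * trace(C) * I is strictly positive definite whenever C
    # is positive semi-definite with non-zero trace.
    return cov + eps * np.trace(cov) * np.eye(cov.shape[1])

rng = np.random.default_rng(1)
feats = rng.standard_normal((49, 8))  # e.g. a 7x7 feature map, 8 channels
c = covariance_pool(feats)            # 8x8 SPD matrix
```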

SPD Manifold Network (SPDNet). The covariance matrices calculated on the previous step typically reside on the Riemannian manifold of SPD matrices. They are often large, and their dimension needs to be reduced without losing the geometric structure. So, let’s briefly discuss specific layers that are used to solve these tasks:

  • Bilinear Mapping Layer (BiMap) accomplishes the task of reducing dimension while preserving the geometric structure.
  • Eigenvalue Rectification Layer (ReEig) is used to introduce non-linearity.
  • Log Eigenvalue Layer (LogEig) maps elements of the Riemannian manifold to a flat space, so that matrices can be flattened and standard Euclidean operations can be applied.

Note that BiMap and ReEig layers can be used together, and so the block of these two layers is abbreviated as BiRe.
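Assuming eigendecomposition-based implementations of these layers (as in the original SPDNet formulation), the three operations can be sketched in a few lines of numpy; the matrix sizes and the toy input are illustrative:

```python
import numpy as np

def bimap(spd, w):
    """BiMap layer: dimension reduction via W^T X W, preserving the
    SPD structure when W has full column rank."""
    return w.T @ spd @ w

def reeig(spd, eps=1e-4):
    """ReEig layer: clamp eigenvalues from below (SPD analogue of ReLU)."""
    vals, vecs = np.linalg.eigh(spd)
    return vecs @ np.diag(np.maximum(vals, eps)) @ vecs.T

def logeig(spd):
    """LogEig layer: matrix logarithm, mapping the SPD manifold into a
    flat space where flattening and Euclidean operations apply."""
    vals, vecs = np.linalg.eigh(spd)
    return vecs @ np.diag(np.log(vals)) @ vecs.T

rng = np.random.default_rng(2)
a = rng.standard_normal((6, 6))
spd = a @ a.T + 1e-3 * np.eye(6)                   # toy 6x6 SPD input
w = np.linalg.qr(rng.standard_normal((6, 3)))[0]   # orthonormal 6x3 map
out = logeig(reeig(bimap(spd, w)))                 # one BiRe block + LogEig
```

Stacking more BiRe blocks before the final LogEig yields the deeper variants (e.g. the 2-BiRe network in Figure 4).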

Figure 4. Illustration of SPD Manifold Network (SPDNet) with 2-BiRe layers

Results for image-based facial expression recognition

To compare the performance of the suggested approach to some baseline models, researchers used two datasets:

  • Real-world Affective Faces (RAF) contains 15339 images labeled with seven basic emotion categories, of which 3068 were used for validation and 12271 for training.
  • Static Facial Expressions in the Wild (SFEW) 2.0 contains 1394 images, of which 958 were used for training and 436 for validation.

Then, it was decided to experiment with various models while introducing covariance pooling. You can see the details of the models considered in Table 1.

Table 1. Various models considered for covariance pooling

Now, various models described in the table above, as well as some other state-of-the-art models without covariance pooling, are listed in Table 2 together with the respective accuracies.

Table 2. Comparison of image-based recognition accuracies for various models

As you can see, Model-2 demonstrates 87% accuracy on the RAF dataset and outperforms the baseline model by 2.3%, which is a very good result for such a challenging task as facial expression recognition. Next, Model-4 with covariance pooling shows an improvement of almost 3.7% over the baseline on the SFEW 2.0 dataset, which justifies the use of SPDNet for image-based facial expression recognition. Overall, these are the best results achieved so far for this kind of problem by the various state-of-the-art methods.

Figure 5. Samples from each class of the SFEW dataset that were most accurately and least accurately classified.

Results for video-based facial expression recognition

Here the Acted Facial Expressions in the Wild (AFEW) dataset was used to compare the novel approach with existing methods. This dataset was prepared by selecting videos from movies. It contains 1156 publicly available labeled videos, of which 773 were used for training and 383 for validation.

The results of the proposed methods with covariance pooling as well as of some other state-of-the-art methods selected for comparison are provided below. However, it should be noted that datasets used for pretraining of other models are not uniform, and so the detailed comparison of all existing methods requires further research.

Table 3. Comparison of video-based recognition accuracies for various models.

As can be observed from Table 3, the model with covariance pooling and 4 BiRe layers was able to slightly surpass the results of the baseline model. It also demonstrated higher accuracy than all single models trained on publicly available training datasets. The VGG13 network, which shows much higher accuracy, was trained on a private dataset containing a significantly higher number of samples. Still, we cannot conclude that introducing covariance pooling to the problem of video-based facial expression recognition provides any significant improvement in recognition accuracy.


In summary, this study introduces the end-to-end pooling of second-order statistics for both images and videos in the context of facial expression recognition. However, state-of-the-art results were achieved only for image-based facial expression recognition, where the model with covariance pooling outperformed all other existing methods.

For the problem of video-based facial expression recognition, training SPDNet on image-based features was still able to obtain results comparable to the state of the art. The modest accuracy of the suggested method could be a result of the relatively small size of the AFEW dataset compared to the number of parameters in the network. The authors conclude that further work is necessary to see whether training end-to-end using a joint convolutional network and SPDNet can improve the results.

How Gfycat Solves the Problem of Recognizing Faces of Asians and Africans

19 April 2018
Gfycat's facial recognition software

How Gfycat Solves the Problem of Recognizing Faces of Asians and Africans


Is it correct to ask such a question at all? The latest study showed that racial differences have to be taken into account by developers of neural networks to improve the accuracy of face recognition. But in society, it is formally considered that all people are equal and bias in this matter should be avoided.

Gfycat software engineer Gurney Gan said that last summer his software successfully identified most of his colleagues, but “stumbled” on one group.

“It got some of our Asian employees mixed up, which was strange because it got everyone else correctly,” says Gan.

However, even the largest companies have similar problems. For example, Microsoft’s and IBM’s face analysis services are 95% more accurate when recognizing white men than when recognizing women with darker skin.

The Google Photos service does not respond to requests “gorilla”, “chimpanzee” or “monkey”. Thus, the risk of repetition of the embarrassment of 2015 is eliminated. The search engine back then mistakenly took black people in photos for monkeys.

A universal standard for testing and eliminating bias in AI systems has not been developed yet.

“Lots of companies are now taking these things seriously, but the playbook for how to fix them is still being written,” says Meredith Whittaker, co-director of AI Now.

Gfycat started face recognition development to create a system that would allow people to find the perfect GIFs to use in messengers. The company’s search system works with about 50 million GIF files — from cats to presidents’ faces. With the help of face recognition, the developers wanted to make it easier to search for famous personalities — from politicians to stars.

“Asian detector”

The company used open source software based on Microsoft research and trained it on millions of photos from collections issued by the Universities of Illinois and Oxford.

But the neural network could not distinguish between Asian celebrities such as Constance Wu and Lucy Liu, and did not reliably distinguish people with dark skin.

First, Gan tried to solve the problem by adding more photos of “problematic” faces to the machine learning examples. But adding a large number of photos of black and Asian celebrities to the dataset helped only partially.

It was possible to solve the problem only by creating a sort of “Asian detector”: upon detecting an Asian face, the system switches into a hypersensitive mode.

According to Gan, this was the only way to make the program distinguish Asians from each other.

The company says that the system is now 98% accurate when identifying white people, and 93% when dealing with Asians.

Intentional search for racial differences may seem a strange way to combat prejudice, but this idea has been supported by a number of scientists and companies. In December, Google published a report on improving the accuracy of its smile recognition system. This was achieved by determining whether the person in the photo is a man or a woman, and to which of four racial groups they belong. But the document also says that artificial intelligence systems should not be used to determine a person’s race, and that using only two gender and four racial categories is not sufficient in all cases.

Some researchers have suggested forming industry standards for transparency and for decreasing AI bias. For instance, Eric Learned-Miller, a professor at the University of Massachusetts, suggests that organizations utilizing face recognition, such as Facebook and the FBI, should disclose the accuracy of their systems for different gender and racial groups.


Neural Network Has Learned to Separate Individuals’ Speech on Video

13 April 2018

Neural Network Has Learned to Separate Individuals’ Speech on Video


The fact that our brain can effectively focus on a particular speaker in a noisy environment, “turning off” background sounds, is no secret. This phenomenon even received the popular name “cocktail party effect”. But despite being well studied, the automatic separation of a particular speaker’s speech is still a difficult task for machines.

Hence the Looking to Listen project, with its combined audio-visual model, was created. The project’s technology makes it possible to select one person’s speech from the soundtrack (including background noise and other people’s voices) and to mute all other sounds.

The method works on completely ordinary videos with a single audio stream. All that is required from users is to select the face of the person they want to hear in the video.

The possible application of this technology is extremely wide — from speech recognition to hearing aids, which today work poorly if several people speak simultaneously.

The new technology uses a combination of audio and video to separate speech. Thanks to the correlation between a speaker’s lip movements and the sounds they produce, it is possible to determine which part of the audio stream is associated with a particular person. This significantly improves the quality of speech separation (in comparison with systems that process only audio), especially in situations where there are several speakers at once.

More importantly, the technology makes it possible to recognise which of the people in the video says what, by linking the separated speech with specific speakers.

How does it work?

As training examples for the neural network, a database of 100,000 videos of lectures and conversations on YouTube was used. From these, fragments with “pure speech” (without background music, audience sounds or other people’s speech) and just one speaker in the frame, with a total duration of about 2000 hours, were singled out.

Then, from these “pure” data, “synthetic cocktail parties” were created — videos in which the speakers’ faces were mixed together with their pre-separated speech and background noises taken from AudioSet.

As a result, it was possible to train the convolutional neural network to extract from the “cocktail party” a separate audio stream for each person speaking in the video.

The architecture of the neural network can be found on the diagram below:

Neural network architecture
The model of the multistream neural network used in Looking to Listen: the video stream takes as input the faces recognized in each frame, and the audio stream takes the soundtrack of the video clip containing both speech and background noise.

First, the recognised faces are extracted from the video stream, after which a convolutional neural network computes features for each face. From the audio stream, in turn, the system first obtains a spectrogram using the STFT and then processes it with a similar neural network. The combined audio-visual representation is obtained by fusing the processed audio and video signals and is further processed using a bidirectional LSTM and three layers of a deep convolutional neural network.

The network creates a complex spectrogram mask for each speaker, which is multiplied with the “noisy” original spectrogram and then converted back into a waveform to obtain an isolated speech signal for each speaker.
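The masking step itself is just an element-wise complex multiplication in the time-frequency domain. A minimal numpy sketch, where the mask values, the 100×257 spectrogram shape, and the random data are purely illustrative (a real pipeline would predict the mask with the network and invert the result with an inverse STFT, e.g. `scipy.signal.istft`):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "noisy" complex spectrogram: 100 time frames x 257 frequency bins.
noisy_spec = (rng.standard_normal((100, 257))
              + 1j * rng.standard_normal((100, 257)))

# Complex mask the network would predict for one speaker (random here).
mask_speaker = (rng.standard_normal((100, 257))
                + 1j * rng.standard_normal((100, 257)))

# Element-wise multiplication keeps only that speaker's
# time-frequency components; an inverse STFT would then
# reconstruct the isolated waveform.
speaker_spec = mask_speaker * noisy_spec
```

Because the mask is complex-valued, it can adjust both the magnitude and the phase of each time-frequency bin, which is what allows a clean waveform to be recovered per speaker.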

More information about the technology and the project’s results can be found in the project documentation and on its GitHub page.

The Results

Below are the results of applying this technology to several videos. All sounds except the speech of the selected person can be either muted entirely or attenuated to the desired level.

This technology can be useful for speech recognition and automatic captions. Existing systems do not cope well when the speech of several people overlaps. Separating sound by source allows more accurate and easier-to-read captions.