Dissecting GANs for Better Understanding and Visualization

5 December 2018

GANs can be taught to create (or generate) worlds similar to our own in any domain: images, music, speech, etc. Since 2014, a large number of improvements to GANs have been proposed, and GANs have achieved impressive results. Researchers from the MIT-IBM Watson Lab have presented GAN Paint, a tool based on Dissecting GAN, a method for checking whether an explicit representation of an object is present in an image (or feature map) from a hidden layer:

The GAN Paint interactive tool

State-of-the-art Idea

However, a concern raised very often in ML is how poorly the methods being developed and applied are understood. Despite the success of GANs, their visualization and understanding remain largely unexplored areas of research.

A group of researchers led by David Bau has conducted the first systematic study of the internal representations of GANs. In their paper, they present an analytic framework to visualize and understand GANs at the unit, object, and scene level.

Their work resulted in a general method for visualizing and understanding GANs at different levels of abstraction, several practical applications enabled by their analytic framework, and open-source interpretation tools for better understanding Generative Adversarial Network models.

Inserting door by setting 20 causal units to a fixed high value at one pixel in the representation.


From what we have seen so far, especially in the image domain, Generative Adversarial Networks can generate super realistic images from different domains. From this perspective, one might say that GANs have learned facts about a higher abstraction level – objects for example. However, there are cases where GANs fail terribly and produce some very unrealistic images. So, is there a way to explain at least these two cases? David Bau and his team tried to answer this question among a few others in their paper. They studied the internal representations of GANs and tried to understand how a GAN represents structures and relationships between objects (from the point of view of a human observer).

As the researchers mention in their paper, there has been previous work on visualizing and understanding deep neural networks but mostly for image classification tasks. Much less work has been done in visualization and understanding of generative models.

The main goal of the systematic analysis is to understand how objects such as trees are encoded by the internal representations of a GAN generator network. To do this, the researchers study the structure of a hidden representation given as a feature map. Their study is divided into two phases that they call: dissection and intervention.

Characterizing units by Dissection

The goal of the first phase, dissection, is to check whether an explicit representation of an object is present in an image (or feature map) from a hidden layer. It also aims to identify which classes from a dictionary of classes have such an explicit representation.

To search for explicit representations of objects, the researchers quantify the spatial agreement between a unit’s thresholded feature map and a concept’s segmentation mask using the intersection-over-union (IoU) measure. The result is called agreement, and it allows individual units to be characterized: the concepts related to each unit can be ranked, and each unit can be labeled with the concept that matches it best.
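The agreement measure can be illustrated with a minimal sketch. It works on a single image, represents the feature map and the mask as nested lists, and assumes the feature map has already been upsampled to the mask resolution. The function names and the threshold value are hypothetical; how the threshold is chosen and how scores are aggregated over a dataset are not covered here.

```python
def threshold_map(feature_map, threshold):
    """Binarize a 2D activation map: True where the activation exceeds the threshold."""
    return [[value > threshold for value in row] for row in feature_map]


def iou(mask_a, mask_b):
    """Intersection-over-union of two same-sized 2D boolean masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += bool(a and b)
            union += bool(a or b)
    return intersection / union if union else 0.0


def best_concept(feature_map, threshold, concept_masks):
    """Rank concepts by IoU with the thresholded unit and return the best label."""
    unit_mask = threshold_map(feature_map, threshold)
    scores = {name: iou(unit_mask, mask) for name, mask in concept_masks.items()}
    return max(scores, key=scores.get), scores


# Toy 3x3 example with made-up activations and masks.
unit_activation = [[0.1, 0.9, 0.8], [0.2, 0.7, 0.1], [0.0, 0.1, 0.0]]
concept_masks = {
    "tree": [[False, True, True], [False, False, False], [False, False, False]],
    "door": [[False, False, False], [False, False, False], [True, True, False]],
}
label, scores = best_concept(unit_activation, 0.5, concept_masks)
```

In this toy case the unit overlaps the "tree" mask on two of three active pixels, so it would be labeled as a tree unit.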

Phase 1: Dissection.

Measuring causal relationships using Intervention

The second important question, mentioned earlier, is causality. Intervention, the second phase, seeks to estimate the causal effect of a set of units on a particular concept.

To measure this effect, in the intervention phase the impact of forcing units on (unit insertion) and off (unit ablation) is measured, again using segmentation masks. More precisely, a feature map’s units are forced on and off, and both resulting images from those two representations are segmented to obtain two segmentation masks. Finally, these masks are compared to measure the causal effect.
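One plausible way to compare the two sets of masks is to look at how much of each image the concept covers with the units forced on versus forced off. The sketch below assumes the segmentation masks are already available as nested lists of 0/1 values; the function names and the coverage-difference formula are illustrative assumptions, not the authors' code.

```python
def concept_fraction(mask):
    """Fraction of pixels labelled as the concept in a 2D 0/1 mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def average_causal_effect(masks_units_on, masks_units_off):
    """Mean concept coverage with units forced on minus coverage with units forced off."""
    on = sum(concept_fraction(m) for m in masks_units_on) / len(masks_units_on)
    off = sum(concept_fraction(m) for m in masks_units_off) / len(masks_units_off)
    return on - off


# Two toy images, each segmented after insertion and after ablation.
masks_on = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
masks_off = [[[0, 0], [0, 0]], [[1, 0], [0, 0]]]
effect = average_causal_effect(masks_on, masks_off)
```

A value near zero would suggest the chosen units have little influence on the concept, while a large positive value would suggest they help produce it.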

Phase 2: Intervention.


For the whole study, the researchers use three variants of Progressive GANs (Karras et al., 2018) trained on LSUN scene datasets. For the segmentation task, they use a recent image segmentation model (Xiao et al., 2018) trained on the ADE20K scene dataset.

The proposed framework was used for an extensive analysis of GANs. The first part, dissection, was used to analyze and compare units across datasets, layers, and models, and to locate artifact units.

Comparing representations learned by progressive GANs trained on different scene types. The units that emerge match objects that commonly appear in the scene type: seats in conference rooms and stoves in kitchens.
Removing successively larger sets of tree-causal units from a GAN.

A set of dominant object classes and the second part of the framework, intervention, were used to locate causal units that can remove and insert objects in different images. The results are presented in the paper and the supplementary material, and a video demonstrating the interactive tool was released. Some of the results are shown in the figures below.

Visualizing the activations of individual units in two GANs.


This is one of the first extensive studies that target the understanding and visualization of generative models. Focusing on the most popular generative model – Generative Adversarial Networks, this work reveals significant insights about generative models. One of the main findings is that the larger part of GAN representations can be interpreted. It shows that GAN’s internal representation encodes variables that have a causal effect on the generation of objects and realistic images.

Many researchers will potentially benefit from the insights that came out of this work and the proposed framework that will provide a basis for analysis, debugging and understanding of Generative Adversarial Network models.

PIFR: Pose Invariant 3D Face Reconstruction

26 November 2018

3D face geometry needs to be recovered from 2D images in many real-world applications, including face recognition, face landmark detection, and 3D emoticon animation. However, this task remains challenging, especially under large poses, when much of the information about the face is unavailable.

Jiang and Wu from Jiangnan University (China) and Kittler from University of Surrey (UK) suggest a novel 3D face reconstruction algorithm, which significantly improves the accuracy of reconstruction even under extreme pose.

But let’s first briefly review the previous work on 3D face models and 3D face reconstruction.

Related Work

The research mentions four publicly available 3D deformation models:

This paper uses a BFM model, which is the most popular.

There are several approaches to reconstructing 3D model from 2D images, including:

State-of-the-art idea

The paper by Jiang, Wu, and Kittler proposes a novel Pose-Invariant 3D Face Reconstruction (PIFR) algorithm based on 3D Morphable Model (3DMM).

Firstly, they suggest generating a frontal image by normalizing a single face input image. This step allows restoring additional identity information of the face.

The next step is to use a weighted sum of the 3D parameters of both images: the frontal one and the original one. This preserves the pose of the original image while also enhancing the identity information.
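The weighted sum can be sketched in a few lines. The example fuses two generic parameter vectors; which 3DMM parameters are fused and the weight value used in the paper are not stated here, so `frontal_weight` and the function name are assumptions for illustration only.

```python
def fuse_parameters(original_params, frontal_params, frontal_weight=0.5):
    """Element-wise weighted sum of two equally long parameter vectors."""
    if len(original_params) != len(frontal_params):
        raise ValueError("parameter vectors must have the same length")
    if not 0.0 <= frontal_weight <= 1.0:
        raise ValueError("frontal_weight must lie in [0, 1]")
    w = frontal_weight
    return [(1.0 - w) * o + w * f for o, f in zip(original_params, frontal_params)]


# Made-up parameters fitted to the original and the frontalized image.
fused = fuse_parameters([1.0, 2.0], [3.0, 6.0], frontal_weight=0.25)
```

With a weight of 0 the result equals the parameters of the original image, and with a weight of 1 it equals those of the frontal image.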

The pipeline for the suggested approach is provided below.

Overview of the Pose-Invariant 3D Face Reconstruction (PIFR) method

The experiments show that the PIFR algorithm significantly improves the performance of 3D face reconstruction compared to previous methods, especially in extreme pose cases.

Let’s now have a closer look at the suggested model…

Model details

The PIFR method relies largely on the 3DMM fitting process, which can be expressed as minimizing the error between the 2D coordinates of the projected 3D points and the ground truth. However, the face generated by the 3D model has about 50,000 vertices, so iterating over all of them makes convergence slow and inefficient. To overcome this problem, the researchers suggest using landmarks (e.g., eye center, mouth corner, and nose tip) as the ground truth in the fitting process. Specifically, they use a weighted landmark 3DMM fitting.
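The objective of a weighted landmark fit can be sketched as a weighted sum of squared 2D distances between projected model landmarks and annotated landmarks. The sketch below assumes a scaled orthographic projection with the rotation already applied to the 3D points; the projection model, the weights, and all function names are illustrative assumptions rather than the paper's exact formulation.

```python
def project_orthographic(points_3d, scale, translation):
    """Scaled orthographic projection: drop z, then scale and shift x and y."""
    tx, ty = translation
    return [(scale * x + tx, scale * y + ty) for x, y, _z in points_3d]


def weighted_landmark_error(projected, ground_truth, weights):
    """Weighted sum of squared 2D distances between corresponding landmarks."""
    total = 0.0
    for (px, py), (gx, gy), w in zip(projected, ground_truth, weights):
        total += w * ((px - gx) ** 2 + (py - gy) ** 2)
    return total


# Two made-up model landmarks, their annotations, and per-landmark weights.
model_landmarks = [(0.0, 0.0, 5.0), (1.0, 0.0, 2.0)]
annotated = [(1.0, 1.0), (3.0, 2.0)]
projected = project_orthographic(model_landmarks, scale=2.0, translation=(1.0, 1.0))
error = weighted_landmark_error(projected, annotated, weights=[1.0, 0.5])
```

A fitting procedure would adjust the model parameters to reduce this error; only a handful of landmarks enter the sum instead of all vertices.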

Top row: the original image and landmark. Bottom row: 3D face model and its alignment to the 2D image

The next challenge is to reconstruct 3D faces in large poses. To solve this problem, the researchers use the High-Fidelity Pose and Expression Normalization (HPEN) method, but only to normalize the pose, not the expression. In addition, Poisson editing is used to recover the areas of the face that are occluded because of the viewing angle.

Performance Comparison with Other Methods

The performance of PIFR method was evaluated for the face reconstruction:

  • in small and medium poses;
  • large poses;
  • extreme poses (±90° yaw angles).

For this purpose, the researchers used three publicly available datasets:

  • AFW dataset, which was created using Flickr images, contains 205 images with 468 marked faces, complex backgrounds and face poses.
  • LFPW dataset, which has 224 face images in the test set and 811 face images in the training set; each image is marked with 68 feature points; 900 face images from both sets were selected for testing in this research.
  • AFLW dataset is a large-scale face database, which contains around 25,000 hand-labeled faces, each marked with up to 21 feature points. This study used only extreme pose face images from this dataset for qualitative analysis.

Quantitative analysis. Using the Mean Euclidean Metric (MEM), the study compares the performance of PIFR method to E-3DMM and FW-3DMM on AFW and LFPW datasets. Cumulative errors distribution (CED) curves look like this:

Comparisons of cumulative errors distribution (CED) curves on AFW and LFPW datasets
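For readers unfamiliar with CED curves, here is a minimal sketch of how one can be computed from per-image errors. The per-image error is taken as the plain mean Euclidean distance over landmarks; any normalization (for example by face size) used in the paper is not reproduced, and the function names are hypothetical.

```python
import math


def mean_euclidean_error(predicted, ground_truth):
    """Mean 2D Euclidean distance between corresponding landmarks of one image."""
    distances = [math.hypot(px - gx, py - gy)
                 for (px, py), (gx, gy) in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


def cumulative_error_distribution(errors, thresholds):
    """For each threshold, the fraction of images whose error is at or below it."""
    n = len(errors)
    return [sum(e <= t for e in errors) / n for t in thresholds]


# Made-up per-image errors and two error thresholds.
errors = [0.02, 0.05, 0.08, 0.2]
curve = cumulative_error_distribution(errors, thresholds=[0.05, 0.1])
```

A method whose curve rises faster reaches a given fraction of images at a lower error, which is why higher curves indicate better reconstruction.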

As you can see from these plots and the tables below, PIFR method shows superior performance compared to the other two methods. Its reconstruction performance in large poses is particularly good.

Qualitative analysis. The method was also assessed qualitatively based on the face images in extreme poses from AFLW dataset. The results are shown in the figure below.

Comparison of 3D face reconstruction: (a) Input image; (b) FW-3DMM; (c) E-3DMM; (d) Suggested approach

Even though half of the landmarks are invisible due to extreme posture, which leads to large errors and failures of other methods, the PIFR method still performs quite well.

Here are some additional examples of the PIFR method performance based on the images from the AFW dataset.

Top row: Input 2D image. Middle row: 3D face. Bottom row: Align to 2D image

Bottom Line

A novel 3D face reconstruction framework PIFR demonstrates good reconstruction performance even in extreme poses. By taking both the original and the frontal images for weighted fusion, the method allows restoring enough face information to reconstruct the 3D face.

In the future, the researchers plan to restore even more facial identity information to improve the accuracy of reconstruction further.

This Neural Network Evaluates Natural Scene Memorability

1 October 2018

One hallmark of human cognition is the remarkable capacity to recall thousands of different images, some in detail, after only a single view. Not all photos are remembered equally well by the human brain. Some images stick in our minds, while others fade away quickly. This capacity is likely to be influenced by individual experience and, like some individual image properties, is subject to some degree of inter-subject variability.

Interestingly, when exposed to a flood of visual images, subjects consistently tend to remember or forget the same pictures. Previous research analyzes why people remember certain images and provides reliable methods for ranking images by memorability score. These works mostly address generic images, object images, and face photographs. For natural scenes, however, obvious cues relevant to memorability are hard to identify. To date, methods for predicting the visual memorability of a natural scene are scarce.

Previous Works

Previous work showed that memorability is an intrinsic property of an image. Deep neural networks (DNNs) have achieved strong results in many research areas, e.g., video coding and computer vision. Several DNN approaches have also been proposed to estimate image memorability, significantly improving prediction accuracy.

  • Data scientists from MIT trained MemNet on a large-scale database, achieving a prediction performance close to human consistency.
  • Baveye et al. fine-tuned GoogLeNet, exceeding the performance of handcrafted features. Researchers have also studied specific image types such as faces and natural scenes.
  • Researchers from MIT have also created a database for studying the memorability of human face photographs. They further explored the contribution of certain traits (e.g., kindness, trustworthiness, etc.) to face memorability, but such characteristics only partly explain facial memorability.

State-of-the-art idea

As a first step towards understanding and predicting the memorability of natural scenes, the LNSIM database was built. It contains 2,632 natural scene images in total. To obtain them, 6,886 images containing natural scenes were first selected from existing databases, including the MIR Flickr, MIT1003, NUSEF, and AVA databases, and the LNSIM images were then chosen from this pool. Fig. 1 shows some example images from the LNSIM database.

Fig. 1: Image samples from the LNSIM database

A memory game is used to quantify the memorability of each image in the LNSIM database. Software was developed for the game, in which 104 subjects (47 females and 57 males) took part. They do not overlap with the volunteers who participated in the image selection. The procedure of the memory game is summarized in Fig. 2.

Fig. 2: The experimental procedure of the memory game. Each level lasts about 5.5 minutes and shows a total of 186 images, drawn from 66 targets, 30 fillers, and 12 vigilance images (the total is consistent with each target and vigilance image being shown twice: 2 × 66 + 30 + 2 × 12 = 186). The specific time durations of the experimental setting are labeled above.

In this experiment, there were 2,632 target images, 488 vigilance images, and 1,200 filler images, all unknown to the subjects. Vigilance and filler images were randomly sampled from the rest of the 6,886 images. Vigilance images were repeated within 7 images to ensure that the subjects were paying attention to the game. Filler images were presented only once, providing spacing between repetitions of the same target or vigilance image. After the data were collected, a memorability score was assigned to each image to quantify how memorable it is. To evaluate human consistency, the subjects were also split into two independent halves (i.e., Group 1 and Group 2).
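The scoring step can be sketched as follows. The text does not give the scoring formula, so this example assumes the definition common in memorability studies: the score of an image is the fraction of subjects who correctly detected its repetition. The data layout and function names are hypothetical.

```python
import random


def memorability_scores(responses):
    """Map each image id to its hit rate.

    `responses` maps an image id to a list of booleans, one per subject who saw
    the repeated image; True means the repeat was detected. (Assumed definition.)
    """
    return {image: sum(hits) / len(hits) for image, hits in responses.items() if hits}


def split_halves(subject_ids, seed=0):
    """Randomly split the subjects into two disjoint, equally sized groups."""
    shuffled = list(subject_ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Made-up responses for two images, and a split of 104 subject ids.
scores = memorability_scores({"a": [True, True, False, True], "b": [False, False]})
group_1, group_2 = split_halves(range(104))
```

Computing the scores separately for each group and comparing the two score lists with a rank correlation gives a measure of human consistency.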

Analysis of Natural Scene Memorability

LNSIM database is mined to better understand how natural scene memorability is influenced by the low, middle and high-level handcrafted features and the learned deep feature.

Low-level features, such as pixels, SIFT, and HOG2, have an impact on the memorability of generic images. The question is whether these features still work on a natural scene image set. To evaluate this, a support vector regression (SVR) model is trained on the training set for each low-level feature to predict memorability, and the Spearman rank correlation coefficient (SRCC) between its predictions and the memorability scores is then measured on the test set. Table 1 reports the SRCC on natural scenes, with the SRCC on generic images as the baseline. Pixels (ρ=0.08), SIFT (ρ=0.28), and HOG2 (ρ=0.29) are clearly less effective on natural scenes than on generic images.
Table 1: The correlation ρ between low-level features and natural scene memorability.
This suggests that the low-level features cannot effectively characterize the visual information for remembering natural scenes.
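Since SRCC is the metric used throughout this analysis, a small self-contained sketch of it may help. It computes the Pearson correlation of average ranks, which is the standard definition of Spearman's coefficient with ties; the function names are hypothetical, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math


def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def srcc(predicted, ground_truth):
    """Spearman rank correlation coefficient between predictions and targets."""
    return pearson(average_ranks(predicted), average_ranks(ground_truth))


# Made-up predicted and measured memorability scores for four images.
rho = srcc([0.1, 0.4, 0.3, 0.9], [0.2, 0.5, 0.6, 0.8])
```

Because only the ranking matters, a predictor can reach a high SRCC even if its raw scores are on a different scale from the measured ones.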

The middle-level GIST feature describes the spatial structure of an image. However, Table 2 shows that the SRCC of GIST is only 0.23 for natural scenes, much less than the ρ=0.38 for generic images. This illustrates that the structural information provided by the GIST feature is less effective for predicting memorability scores on natural scenes.

Table 2: The correlation ρ between middle-level features and natural scene memorability.
Natural scene images contain no salient object, animal, or person, so scene category is the main high-level semantic feature available. To obtain the ground truth of the scene category, two experiments are designed to annotate the scene category of the 2,632 images in the database.
  • Task 1 (Classification Judgement): 5 participants are asked to indicate which scene categories an image has. A random image query was generated for each participant. Participants had to choose proper scene category labels to interpret the scene content of each image.
  • Task 2 (Verification Judgement): A separate task was run on the same set of images with another 5 participants after Task 1. The participants were asked to provide a binary answer to the question for each image. The default answer was set to “No”, and the participants could check the box of the image index to change “No” to “Yes”.

All images are annotated with categories by majority voting over Task 1 and Task 2. Afterward, an SVR predictor with the histogram intersection kernel is trained on the scene category. The scene category attribute achieves a good SRCC of ρ=0.38, outperforming the combination of low-level features. This suggests that the high-level scene category is a clear cue for quantifying natural scene memorability. In the figure below, the horizontal axis lists the scene categories in descending order of average memorability score. The average score ranges from 0.79 to 0.36, giving a sense of how memorability changes across scene categories. The distribution indicates that unusual classes such as aurora tend to be more memorable, while common classes such as mountain are more likely to be forgotten. This is possibly due to how frequently each category appears in daily life.
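The histogram intersection kernel mentioned above has a simple closed form: the sum of element-wise minima of two non-negative feature vectors. The sketch below computes it and the resulting kernel (Gram) matrix; the function names and the toy category histograms are hypothetical.

```python
def histogram_intersection(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def gram_matrix(rows_a, rows_b):
    """Kernel values between every vector in rows_a and every vector in rows_b."""
    return [[histogram_intersection(a, b) for b in rows_b] for a in rows_a]


# Made-up normalized scene-category histograms for three images.
features = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5], [0.0, 0.0, 1.0]]
kernel = gram_matrix(features, features)
```

A matrix like this can be handed to an SVR implementation that accepts precomputed kernels, for example scikit-learn's `SVR(kernel="precomputed")`.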

Comparison of average memorability score and standard deviation of each scene category
To examine how the deep feature influences the memorability of a natural scene, MemNet is fine-tuned on the LNSIM database, using the Euclidean distance between the predicted and ground-truth memorability scores as the loss function. The output of the last hidden layer is extracted as the deep feature (dimension: 4096). To evaluate the correlation between the deep feature and natural scene memorability, an SVR predictor with the histogram intersection kernel is trained on the deep feature, as for the handcrafted features above. The SRCC of the deep feature is 0.44, exceeding all handcrafted features. The DNN therefore works reasonably well for predicting natural scene memorability. Nonetheless, the fine-tuned MemNet has its limitations, since a gap to human consistency (ρ=0.78) remains.

DeepNSM: DNN for natural scene memorability

The fine-tuned MemNet model serves as the baseline model for predicting natural scene memorability. In the proposed DeepNSM architecture, the deep feature is concatenated with the category-related feature to predict the memorability of natural scene images more accurately. Note that the “deep feature” refers to the 4096-dimensional feature extracted from the baseline model.

Figure 2: Architecture of DeepNSM model
The architecture of the DeepNSM model is presented in Figure 2. In the DeepNSM model, the aforementioned category-related feature is concatenated with the deep feature obtained from the baseline model. On top of this concatenated feature, additional fully-connected layers (including one hidden layer of dimension 4096) are designed to predict the memorability scores of natural scene images. In training, the layers of the baseline and ResNet models are initialized from the individually pre-trained models, and the added fully-connected layers are randomly initialized. The whole network is jointly trained end to end, using the Adam optimizer with the Euclidean distance as the loss function.
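The fusion step can be illustrated with a toy forward pass. This is only a sketch under strong assumptions: tiny hand-picked dimensions and weights stand in for the 4096-dimensional features and trained layers, the ReLU activation is assumed, and no training is shown. All names are hypothetical.

```python
def dense(inputs, weights, biases, relu=False):
    """Fully-connected layer: one output per weight row, with optional ReLU."""
    outputs = []
    for row, bias in zip(weights, biases):
        value = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, value) if relu else value)
    return outputs


def predict_memorability(deep_feature, category_feature,
                         hidden_weights, hidden_biases, out_weights, out_bias):
    """Concatenate both features, apply one hidden layer, then a scalar output."""
    fused = list(deep_feature) + list(category_feature)
    hidden = dense(fused, hidden_weights, hidden_biases, relu=True)
    return dense(hidden, [out_weights], [out_bias])[0]


# Toy stand-ins: a 2-d "deep feature" and a 1-d "category feature".
score = predict_memorability(
    deep_feature=[1.0, 2.0],
    category_feature=[0.5],
    hidden_weights=[[1.0, 0.0, 0.0], [0.0, 1.0, -6.0]],
    hidden_biases=[0.0, 0.0],
    out_weights=[0.5, 1.0],
    out_bias=0.1,
)
```

The key point is that the category information enters before the fully-connected layers, so they can learn interactions between scene category and visual appearance.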

Comparison with other models

The performance of the DeepNSM model in predicting natural scene memorability is evaluated in terms of SRCC (ρ). The model is tested on both the test set of the LNSIM database and the NSIM database. Its SRCC performance is compared with state-of-the-art memorability prediction methods, including MemNet, MemoNet, and Lu et al. Among them, MemNet and MemoNet are the latest DNN methods for generic images, which beat the conventional techniques using handcrafted features. Lu et al. is a state-of-the-art method for predicting natural scene memorability.
Fig. 3: The SRCC (ρ) performance of DeepNSM and the compared methods.

Fig. 3 shows the SRCC performance of DeepNSM and the three compared methods. DeepNSM achieves ρ=0.58 and ρ=0.55 on the LNSIM and NSIM databases, respectively. It significantly outperforms the state-of-the-art DNN methods, MemNet and MemoNet. These results demonstrate the effectiveness of DeepNSM in predicting natural scene memorability.


The above approach investigates the memorability of natural scenes from a data-driven perspective. Specifically, it establishes the LNSIM database for analyzing human memorability of natural scenes. In exploring the correlation of memorability with low-, middle-, and high-level features, it is worth noting that the high-level scene category feature plays a vital role in predicting the memorability of a natural scene.

True Face Super-Resolution Upscaling with the Facial Component Heatmaps

1 October 2018

The performance of most facial analysis techniques relies on the resolution of the corresponding image. Face alignment or face identification will not work correctly when the resolution of a face is very low.

What’s Face Super-Resolution?

Face super-resolution (FSR), or face hallucination, provides a viable way to recover a high-resolution (HR) face image from its low-resolution (LR) counterpart. This research area has attracted increasing interest in recent years, and the most advanced deep learning methods achieve state-of-the-art performance in face super-resolution.

However, even these methods often produce results with distorted face structure and only partially recovered facial details. Deep learning based FSR methods fail to super-resolve LR faces under large pose variations.

How can we solve this problem?

  • Augmenting training data with large pose variations still leads to suboptimal results where facial details are missing or distorted.
  • Directly detecting facial components or landmarks in LR faces is also suboptimal and may lead to ghosting artifacts in the final result.

But what about a method that super-resolves LR face images while collaboratively predicting face structure? Can we use heatmaps to represent the probability of the appearance of each facial component?

We will find out shortly, but let’s first look at the previous approaches to the problem of face super-resolution.

Related Work

Face hallucination methods can be roughly grouped into three categories:

  • ‘Global model’ based approaches aim at super-resolving an LR input image by learning a holistic appearance mapping such as PCA. For instance, Wang and Tang reconstruct an HR output from the PCA coefficients of the LR input; Liu et al. develop a Markov random field (MRF) to reduce ghosting artifacts caused by the misalignments in LR images; Kolouri and Rohde employ optimal transport techniques to morph an HR output by interpolating exemplary HR faces.
  • Part-based methods are proposed to super-resolve individual facial regions separately. For instance, Tappen and Liu super-resolve HR facial components by warping the reference HR images; Yang et al. localize facial components in the LR images by a facial landmark detector and then reconstruct missing high-frequency details from similar HR reference components.
  • Deep learning techniques can be very different: Xu et al. employ the framework of generative adversarial networks to recover blurry LR face images; Zhu et al. present a cascade bi-network, dubbed CBN, to localize LR facial components first and then upsample the facial components.


Xin Yu and his colleagues propose a multi-task deep neural network that not only super-resolves LR images but also estimates the spatial positions of their facial components. Their convolutional neural network (CNN) has two branches: one for super-resolving face images and the other for predicting salient regions of a face, called facial component heatmaps.

The whole process looks like this:

  1. Super-resolving features of input LR images.
  2. Employing a spatial transformer network to align the feature maps.
  3. Estimating the heatmaps of facial components with the upsampled feature maps.
  4. Concatenating estimated heatmaps of facial components with the upsampled feature maps.

This method can super-resolve tiny unaligned face images (16 x 16 pixels) with the upscaling factor of 8x while preserving face structure.

(a) LR image; (b) HR image; (c) Nearest Neighbors; (d) CBN, (e) TDAE, (f) TDAE trained on better dataset, (g) suggested approach

Now let’s learn the details of the proposed method.

Model overview

The network has the following structure:

  1. A multi-task upsampling network (MTUN):
    1. an upsampling branch (composed of a convolutional autoencoder, deconvolutional layers, and a spatial transformer network);
    2. a facial component heatmap estimation branch (HEB).
  2. Discriminative network, which is constructed by convolutional layers and fully connected layers.
The pipeline of the suggested network

Facial Component Heatmap Estimation. Even the state-of-the-art facial landmark detectors cannot accurately localize facial landmarks in very low-resolution images. So, the researchers propose to predict facial component heatmaps from super-resolved feature maps.

2D photos may exhibit a wide range of poses. Thus, to reduce the number of training images required for learning HEB, they suggest employing a spatial transformer network (STN) to align the upsampled features before estimating heatmaps.

In total, four heatmaps are estimated to represent four components of a face: eyes, nose, mouth, and chin (see the image below).

Visualization of estimated facial component heatmaps: (a) Unaligned LR image; (b) HR image; (c) Heatmaps; (d) Result; (e) The estimated heatmaps overlying the results
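To give an intuition for what a facial component heatmap looks like as data, the sketch below renders a 2D Gaussian bump centered on a component location and reads the peak back. Using a Gaussian as the target shape is an assumption borrowed from common landmark-heatmap practice, not a detail confirmed by this article; `sigma` and the function names are hypothetical.

```python
import math


def gaussian_heatmap(height, width, center, sigma=1.0):
    """2D heatmap with value 1.0 at `center` (x, y), decaying as a Gaussian."""
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def peak_location(heatmap):
    """Return the (x, y) position of the largest heatmap value."""
    best = max((value, x, y)
               for y, row in enumerate(heatmap)
               for x, value in enumerate(row))
    return best[1], best[2]


# A toy 5x5 heatmap for one facial component located at x=2, y=3.
nose_heatmap = gaussian_heatmap(5, 5, center=(2, 3), sigma=1.0)
```

Each value can be read as how likely the component is to appear at that pixel, which matches the role the heatmaps play in the method.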

Loss Function. The results of using different combinations of losses are provided below:

Comparison of different losses

In the image above:

  (a) unaligned LR image,
  (b) original HR image,
  (c) pixel-wise loss only,
  (d) pixel-wise and feature-wise losses combined,
  (e) pixel-wise, feature-wise, and discriminative losses,
  (f) pixel-wise and face structure losses,
  (g) pixel-wise, feature-wise, and face structure losses,
  (h) pixel-wise, feature-wise, discriminative, and face structure losses.

To train their multi-task upsampling network, the researchers chose the last option (h).
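Combining the four losses of the chosen option can be sketched as a weighted sum. The weights below are placeholders, not values from the paper, and the pixel-wise loss is assumed to be a mean squared error over flattened pixel values; all names are hypothetical.

```python
def pixel_wise_loss(predicted, target):
    """Mean squared error between two equally long lists of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def total_loss(pixel, feature, discriminative, structure,
               weights=(1.0, 0.01, 0.01, 1.0)):
    """Weighted sum of the four loss terms (weights are illustrative only)."""
    w_pixel, w_feature, w_disc, w_structure = weights
    return (w_pixel * pixel + w_feature * feature
            + w_disc * discriminative + w_structure * structure)


# Made-up loss values for one training batch.
loss = total_loss(1.0, 2.0, 3.0, 4.0, weights=(1.0, 0.5, 0.1, 0.25))
```

Setting a weight to zero recovers the ablated variants in the comparison, for example dropping the discriminative term.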

Qualitative and Quantitative Comparisons

See the qualitative comparison of the suggested approach with the state-of-the-art methods:

Comparisons with the state-of-the-art methods: (a) Unaligned LR image; (b) HR image; (c) Bicubic interpolation; (d) VDSR; (e) SRGAN; (f) Ma et al.’s method; (g) CBN; (h) TDAE; (i) Suggested approach

As you can see, most of the existing methods fail to generate realistic face details, while the suggested approach outputs realistic and detailed images, which are very close to the original HR image.

Quantitative comparison with the state-of-the-art methods leads us to the same conclusions. All methods were evaluated on the entire test dataset by the average PSNR and the structural similarity (SSIM) scores.

Quantitative comparisons on the entire test dataset

The results in the table show that the approach presented here outperforms the second-best method by a large margin of 1.75 dB in PSNR. This confirms that estimating heatmaps helps in localizing facial components and aligning LR faces more accurately.
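For reference, PSNR is defined as 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the two images. A minimal sketch for grayscale images stored as nested lists follows; the function name and the 8-bit default are assumptions.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized 2D images."""
    squared = [(r - e) ** 2
               for row_r, row_e in zip(reference, estimate)
               for r, e in zip(row_r, row_e)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A toy 1x1 example: an error of 25.5 on a 0..255 scale gives 20 dB.
value = psnr([[0.0]], [[25.5]])
```

Because the scale is logarithmic, a gain of 1.75 dB corresponds to roughly one third lower mean squared error.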

Bottom Line

Let’s summarize the contributions of this work:

  • It presents a novel multi-task upsampling network that can super-resolve very small LR face images (16 x 16 pixels) by an upscaling factor of 8x.
  • The method not only exploits image intensity similarity but also estimates the face structure with the help of facial component heatmaps.
  • The estimated facial component heatmaps provide not only spatial information of facial components but also their visibility information.
  • Thanks to the aligning of feature maps before heatmap estimation, the number of images required for training the model is largely reduced.

The method is good at super-resolving very low-resolution faces in different poses and generates realistic and detailed images free from distortions and artifacts.

Anatomically-Aware Facial Animation from a Single Image

7 August 2018
GAN animation

Let’s say you have a picture of Hugh Jackman for an advertisement. He looks great, but the client wants him to look a little bit happier. No, you don’t need to invite the celebrity for another photo shoot. You don’t even need to spend hours in Photoshop. You can automatically generate a dozen images from this single image, in which Hugh smiles with different intensities. You can even create an animation from this single image, in which Hugh changes his facial expression from absolutely serious to absolutely happy.

The authors of a novel GAN conditioning scheme based on Action Unit (AU) annotations claim that such a scenario is not futuristic but entirely feasible today. In this article, we give an overview of their approach, look at the results generated with the suggested method, and compare its performance with state-of-the-art approaches.

Suggested Method

Facial expressions are the result of the combined and coordinated action of facial muscles. They can be described in terms of the so-called Action Units, which are anatomically related to the contractions of specific facial muscles. For example, the facial expression for fear is generally produced with the following activations: Inner Brow Raiser (AU1), Outer Brow Raiser (AU2), Brow Lowerer (AU4), Upper Lid Raiser (AU5), Lid Tightener (AU7), Lip Stretcher (AU20) and Jaw Drop (AU26). The magnitude of each AU defines the extent of emotion.
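One way to picture such a description in code is a fixed-length vector with one magnitude per AU. The sketch below is illustrative only: the ordering `AU_ORDER`, the normalization of magnitudes to [0, 1], and the specific magnitudes used for the fear expression are assumptions made here, not values from the paper. Linearly interpolating between two such vectors gives a sequence of intermediate expressions.

```python
# Hypothetical fixed ordering of the AUs tracked by this sketch.
AU_ORDER = [1, 2, 4, 5, 7, 20, 26]

def au_vector(activations, order=AU_ORDER):
    """Return one magnitude in [0, 1] per AU in `order` (0 means inactive)."""
    vec = []
    for au in order:
        value = activations.get(au, 0.0)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"AU{au} magnitude must be in [0, 1]")
        vec.append(value)
    return vec

def interpolate(start, end, alpha):
    """Linear blend between two AU vectors, with alpha in [0, 1]."""
    return [(1 - alpha) * s + alpha * e for s, e in zip(start, end)]

neutral = au_vector({})
# Made-up magnitudes for the AUs listed for fear in the text.
fear = au_vector({1: 0.8, 2: 0.6, 4: 0.5, 5: 0.9, 7: 0.4, 20: 0.7, 26: 0.6})
# Five target vectors going from neutral to the full expression.
frames = [interpolate(neutral, fear, t / 4) for t in range(5)]
```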

Building on this approach to defining facial expressions, Pumarola and his colleagues suggest a GAN architecture, which is conditioned on a one-dimensional vector indicating the presence/absence and the magnitude of each action unit. They train this architecture in an unsupervised manner that only requires images with their activated AUs. Next, they split the problem into two main stages:

  • Rendering a new image under the desired expression by considering an AU-conditioned bidirectional adversarial architecture provided with a single training photo.
  • Rendering back the synthesized image to the original pose.

Moreover, the researchers wanted to ensure that their system would be able to handle images under changing backgrounds and illumination conditions. Hence, they added an attention layer to their network. It focuses the action of the network only on those regions of the image that are relevant to conveying the novel expression.

Let’s now move on to the next section to reveal the details of this network architecture – how does it succeed in generating anatomically coherent facial animations from images in the wild?

Overview of the face expression approach
Overview of the approach

Network Architecture

The proposed architecture consists of two main blocks:

  • Generator G regresses attention and color masks (note that it is applied twice: first to map the input image, and then to render it back). The aim was to make the generator focus only on those regions that are responsible for synthesizing the novel facial expression while keeping the remaining elements of the image, such as hair, glasses, hats or jewelry, untouched. Thus, instead of regressing the full image, this generator outputs a color mask C and an attention mask A. The mask A indicates to what extent each pixel of C contributes to the output image. This leads to sharper and more realistic final images.
Attention-based generator
  • Conditional critic. Critic D evaluates the generated image for its photorealism and for how well it fulfills the expression conditioning.
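The way the two generator outputs are combined can be sketched as a per-pixel blend. The convention assumed here is that an attention value of 1 keeps the input pixel and a value of 0 takes the pixel from the color mask C; images are flat lists of floats to keep the example small, and the function name `blend` is hypothetical.

```python
def blend(image, color_mask, attention_mask):
    """Combine the input image and color mask C using attention mask A.

    Assumed convention: A = 1 keeps the input pixel, A = 0 takes the pixel
    from C. All arguments are flat lists of floats of equal length.
    """
    return [a * i + (1.0 - a) * c
            for i, c, a in zip(image, color_mask, attention_mask)]

image = [0.2, 0.4, 0.6]
color = [0.9, 0.9, 0.9]
attention = [1.0, 0.0, 0.5]   # keep, replace, mix half and half
output = blend(image, color, attention)
```

With this formulation, pixels outside the face region can pass through unchanged, so the generator does not have to re-synthesize hair, glasses or background.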

The loss function for this network is a linear combination of several partial losses: image adversarial loss, attention loss, conditional expression loss, and identity loss:

loss function
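Written out symbolically, the linear combination described above might take the following form. The weighting coefficients λ and the subscript names are notation introduced here for illustration; the arguments of the individual terms are omitted because the text does not give them.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{adv}}
  \;+\; \lambda_{A}\,\mathcal{L}_{\text{att}}
  \;+\; \lambda_{y}\,\mathcal{L}_{\text{expr}}
  \;+\; \lambda_{\text{idt}}\,\mathcal{L}_{\text{idt}}
```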

The model is trained on a subset of 200 000 images from the EmotioNet dataset using the Adam optimizer with the learning rate of 0.0001, beta1 0.5, beta2 0.999 and batch size 25.
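To show what these hyperparameters control, here is a minimal sketch of a single Adam update for one scalar parameter. The learning rate and beta values follow the text; the epsilon value is an assumed common default, and the function name `adam_step` is hypothetical.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    lr, beta1 and beta2 follow the values quoted in the text; eps is an
    assumed common default. `t` is the 1-based step count.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=0.3, m=m, v=v, t=1)
```

A beta1 of 0.5, lower than the usual 0.9, makes the gradient average react faster to recent gradients, which is a common choice in GAN training.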

Experimental Evaluation

The image below demonstrates the model’s ability to activate AUs at different intensities while preserving the person’s identity. For instance, you can see that the model properly handles the case with zero intensity and generates an identical copy of the input image. For the non-zero cases, the model realistically renders complex facial movements and outputs images that are usually indistinguishable from the real ones.

AU intensity
Single AU’s edition: varying the levels of intensity

The next figure displays the attention mask A and the color mask C. You can see how the model focuses its attention (darker area) onto the corresponding action units in an unsupervised manner. Hence, only the pixels relevant to the expression change are carefully estimated, while the background pixels are directly copied from the input image. This feature of the model is very handy when dealing with images in the wild.

attention mask
Attention Model

Now, let’s see how the model handles the task of editing multiple AUs. The outcomes are depicted below. You can observe here remarkably smooth and consistent transformation across frames even with challenging light conditions and non-real world data, as in the case of the avatar. These results encourage the authors to further extend their model to video generation. They should definitely try, shouldn’t they?

editing multiple AUs
Editing multiple Action Units

Meanwhile, they compare their approach against several state-of-the-art methods to see how well their model performs at generating different facial expressions from a single image. The results are shown below. The bottom row, representing the suggested approach, contains much more visually compelling images with notably higher spatial resolution. As discussed above, the attention mask makes it possible to apply the transformation only to the cropped face and put it back into the original image without producing artifacts.

face expression
Qualitative comparison with state-of-the-art

Limitations of the Model

Let’s now discuss the model’s limits: which types of challenging images it can still handle, and when it actually fails. As demonstrated in the image below, the model succeeds when dealing with human-like sculptures, non-realistic drawings, non-homogeneous textures across the face, anthropomorphic faces with non-real textures, non-standard illuminations/colors and even face sketches.

However, there are several cases where it fails. The first failure case depicted below results from errors in the attention mechanism when given extreme input expressions. The model may also fail when the input image contains previously unseen occlusions, such as an eye patch, which cause artifacts in the missing face attributes. It is also not ready to deal with non-human anthropomorphic distributions, as in the case of cyclopes. Finally, the model can also generate artifacts such as human face features when dealing with animals.

face expression failure
Success and failure cases

Bottom Line

The model presented here is able to generate anatomically-aware face animations from images in the wild. The resulting images are surprisingly realistic and have high spatial resolution. The suggested approach advances prior work, which had addressed the problem only for discrete emotion category editing and portrait images. Its key contributions include a) encoding face deformations by means of AUs to render a wide range of expressions and b) embedding an attention model to focus only on the relevant regions of the image. The several failure cases that were observed are presumably due to insufficient training data. Thus, we can conclude that the results of this approach are very promising, and we look forward to observing its performance on video sequences.

Unsupervised Attention-Guided Image-to-Image Translation

30 July 2018
Unsupervised Attention-Guided Image-to-Image Translation

Image-to-image translation is the task of mapping an image from a source domain to a target domain. Applications include image colorization, image super-resolution, style transfer, domain adaptation and data augmentation. Most of the approaches require data from each domain to be paired or under alignment, e.g., when translating satellite images to topographic maps, which restricts applications and may not even be possible for some domains. Unsupervised approaches, such as DiscoGAN and CycleGAN, overcome this problem with cyclic losses which encourage the translated domain to be faithfully reconstructed when mapped back to the original domain. Existing algorithms feed an input image to an encoder–decoder-like neural network architecture called the generator, which tries to translate the image. Then, this output is fed to a discriminator which attempts to classify if the output image has indeed been translated.

However, these approaches are limited by the system’s inability to attend only to specific scene objects. In the unsupervised case, where images are not paired or aligned, the network must additionally learn which parts of the scene are intended to be translated. For example, in Figure 1, a convincing translation between the horse and zebra domains requires the network to attend to each animal and change only those parts of the image. This is challenging for existing approaches, even if they use a localized loss like PatchGAN, as the network itself has no explicit attention mechanism. Instead, they typically aim to minimize the divergence between the underlying data-generating distribution for the entire image in the source and target domains. To overcome this limitation, a new approach is introduced which minimizes the divergence between only the relevant parts of the data-generating distributions for the source and target domains.

Architecture Design

data flow diagram
Data-flow diagram from the source domain S to the target domain T during training

The goal of image translation is to estimate a map F(S→T) from a source image domain S to a target image domain T based on independently sampled data instances X(S) and X(T), such that the distribution of the mapped instances F(S→T)(X(S)) matches the probability distribution P(T) of the target. Training the transfer network F(S→T) requires a discriminator D(T) that tries to distinguish the translated outputs from the observed instances X(T). For cycle consistency, the inverse map F(T→S) and the corresponding discriminator D(S) are trained simultaneously. Solving this problem requires solving two equally important tasks:

  • (1) locating the areas to map in each image, and
  • (2) applying the right mapping to the located areas.

To achieve this, two attention networks A(S) and A(T) are added, which select the areas to translate by maximizing the probability that the discriminator makes a mistake.

Attention-guided generator

An input image s is fed into the attention network A(S), resulting in the attention map s(a) = A(S)(s). The mapped image s′ is obtained by:

attention guided generator

The ‘foreground’ object s(f) is obtained via an element-wise product on each RGB channel: s(f) = s(a)⊙s. Then, the foreground s(f) is fed into the generator F(S→T), which maps s(f) to the target domain T. Finally, the background image s(b) = (1−s(a))⊙s is created and added to the masked output of the generator F(S→T).
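The compositing just described can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: images and attention maps are flat lists of floats rather than RGB tensors, and the `invert` generator is a toy stand-in for F(S→T), not a trained network.

```python
def translate(source, attention, generator):
    """Attention-guided translation of one image.

    `source` and `attention` are flat lists of equal length; `generator`
    is any callable mapping a list of pixels to a list of pixels and
    stands in for F(S→T).
    """
    foreground = [a * s for a, s in zip(attention, source)]          # s(f)
    background = [(1.0 - a) * s for a, s in zip(attention, source)]  # s(b)
    generated = generator(foreground)
    # Mask the generator output with the attention map, then add the background.
    return [a * g + b for a, g, b in zip(attention, generated, background)]

# Toy generator (hypothetical): invert intensities.
def invert(pixels):
    return [1.0 - p for p in pixels]

out = translate([0.2, 0.8], [0.0, 1.0], invert)
```

In this toy run the first pixel has zero attention and is copied from the source unchanged, while the second pixel has full attention and comes entirely from the generator.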

Loss function: This process is governed by the adversarial energy:

 attention guided generator

Attention-guided discriminator

The cycle-consistency loss makes the framework more robust in two ways: (1) it enforces the attended regions in the generated image to conserve content (e.g., pose), and (2) it encourages the attention maps to be sharp (converging towards a binary map), as the cycle-consistency loss of unattended areas will always be zero.

The final energy is obtained by combining the adversarial and cycle-consistency losses for both the source and target domains:

attention guided discriminator

With a continuous attention map, the discriminator may receive ‘fractional’ pixel values, which may be close to zero early in training. While the generator benefits from being able to blend pixels at object boundaries, multiplying real images by these fractional values causes the discriminator to learn that mid gray is ‘real’ (i.e., the answer is pushed towards the midpoint 0 of the normalized [−1,1] pixel space). The learned attention map for the discriminator is as follows:

attention-guided discriminator

Thus, the updated adversarial energy L(adv) is as follows:

 attention guided discriminator


Fréchet Inception Distance (FID) is used to evaluate the image translation framework. FID computes the Fréchet distance between feature representations of real and generated images. Such feature representations are extracted from the last hidden layer of the Inception architecture. This approach achieves the lowest FID in all but one mapping, with CycleGAN as the next best performing approach. UNIT achieves the second-lowest FID value, which suggests that the latent space assumption is useful in this setting. The code can be found here.
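For reference, the Fréchet distance between two Gaussians fitted to the feature representations has a closed form. The symbols below are notation introduced here: μ_r, Σ_r are the mean and covariance of the features of real images, and μ_g, Σ_g those of generated images.

```latex
\mathrm{FID} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
  \;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the two feature distributions are closer, which is why the lowest FID is reported as the best result.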

Fréchet Inception Distance
Fréchet Inception Distance for different algorithms

While modern unsupervised image-to-image translation techniques can map relevant image regions, they inadvertently map irrelevant regions, too. As a result, the generated images fail to look realistic, as the background and foreground are generally not appropriately blended. By incorporating an attention mechanism into unsupervised image-to-image translation, this approach demonstrates significant improvements in the quality of generated images.

Fréchet Inception Distance results

Apples to orange

Apples to orange results

Bonus — results for ablation experiments

By only adopting the holistic image discriminator (‘Ours–D’), the attention networks start to focus on the background as shown in the bottom row:

results for ablation experiments


New Approach to Recovering 3D Shape Structure from a Single 2D Image

27 July 2018
3D recovery

Single-view image-based 3D modeling has been a topic of particular interest over the last few years. That’s likely due to the tremendous success of deep convolutional neural networks (CNN) on image-based learning tasks. However, most deep models provide only a volumetric representation of 3D shapes as output. As a result, important information about shape topology or part structure is lost.

Figure 1. Results of 3D shape structure recovery from a single image. The top 8 images returned by Google when searching for “chair”, “table” and “airplane” were used to test the new approach. Failure cases are marked in red.

The alternative could be to recover 3D shape structure, which encompasses part composition and part relations. This task is quite challenging: inferring a part segmentation for a 3D shape is not an easy task by itself, but even if a segmentation is given, it is still challenging to reason about part relations such as connection, symmetry, parallelism, and others.

In fact, we can talk about several particular challenges here:

  • Part decomposition and relations are not as explicit in 2D images, as, for example, shape geometry. It should be also noted that compared to pixel-to-voxel mapping, recovering part structure from pixels would be a highly ill-posed task.
  • Many 3D CAD models of human-made objects contain diverse substructures, and recovery of those complicated 3D structures is far more challenging than shape synthesis modulated by a shape classification.
  • Objects from real images usually have different textures, lighting conditions, and backgrounds.

What’s Suggested

Chengjie Niu, Jun Li, and Kai Xu suggest learning a deep neural network that directly recovers 3D shape structure of an object from a single RGB image. To accomplish this task, they propose to learn and integrate two networks:

  • Structure masking network, which highlights multi-scale object structures in an input 2D image. It is designed as a multi-scale convolutional neural network (CNN) augmented with jump connections. Its task is to retain shape details while screening out structure-irrelevant information such as background and textures.
  • Structure recovery network, which recursively recovers a hierarchy of object parts abstracted by cuboids. This network takes as input the features extracted in the structure masking network, adds the CNN features of the original 2D image, and then feeds all these features into a recursive neural network (RvNN) for 3D structure decoding. The output is a tree organization of 3D cuboids with plausible spatial configuration and reasonable mutual relations.

The two networks are trained jointly. The training data includes image-mask and cuboid-structure pairs that can be generated by rendering 3D CAD models and extracting the box structure based on the given parts of the shape.

Network Architecture

An overview of the suggested network architecture is depicted in the image below. As you can see from the resultant cuboid structure of the chair, symmetries between chair legs (highlighted by red arrows) were successfully recovered by this network.

Figure 2. Network architecture

Let’s check more closely the details of the suggested solution.

The structure masking network is a two-scale CNN trained to produce a contour mask for the object of interest. The authors decided to include this network as the first step since previous studies of the subject revealed that object contours provide strong cues for understanding shape structures in 2D images. However, instead of utilizing the extracted contour mask, they suggest taking the feature map of the last layer of the structure masking network and feeding it into the structure recovery network.

Next, the structure recovery network combines features from two convolutional channels. One channel takes as input the last feature map before the mask prediction layer from the structure masking network. Another channel is the CNN feature of the original image extracted by a VGG-16. Since it is hard for the masking network to produce perfect mask prediction, the CNN feature of the original image provides complementary information by retaining more object information.

So, the recursive neural network (RvNN) starts from a root feature code and recursively decodes it into a hierarchy of features until reaching the leaf nodes, which can be further decoded into a vector of box parameters. The suggested solution uses three types of nodes in its hierarchy (leaf node, adjacency node, and symmetry node) and three corresponding decoders (box decoder, adjacency decoder, and symmetry decoder). An illustration of the decoder network at a given node is provided below.

Figure 3. Decoder network

Thus, during decoding, two types of part relations are recovered as the classes of internal nodes: adjacency and symmetry. To correctly determine the type of a node and use the corresponding decoder, a separate node classifier is trained jointly with the three decoders. It is learned as part of the structure recovery training task, where the ground-truth box structure is known for each training pair of image and shape structure.
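The control flow of such a recursive decoder can be sketched as follows. Everything concrete here is a stand-in: feature codes are plain integers, the classifier and the three decoders are toy functions rather than learned networks, and the assumption that an adjacency node yields two children while a symmetry node yields one (with symmetry parameters omitted) is made here for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # "leaf", "adjacency" or "symmetry"
    box: tuple = ()                # box parameters, leaves only
    children: list = field(default_factory=list)

def decode(code, classify, decoders):
    """Recursively expand a feature code into a tree of nodes."""
    kind = classify(code)          # node classifier picks the decoder
    if kind == "leaf":
        return Node("leaf", box=decoders["box"](code))
    if kind == "adjacency":
        left, right = decoders["adjacency"](code)
        return Node(kind, children=[decode(left, classify, decoders),
                                    decode(right, classify, decoders)])
    child = decoders["symmetry"](code)
    return Node(kind, children=[decode(child, classify, decoders)])

def count_leaves(node):
    """Number of cuboids in the decoded tree."""
    if node.kind == "leaf":
        return 1
    return sum(count_leaves(c) for c in node.children)

# Toy stand-ins (hypothetical): integer codes that shrink until they reach 1.
def classify(code):
    return "leaf" if code == 1 else "symmetry" if code == 2 else "adjacency"

decoders = {
    "box": lambda code: (1.0, 1.0, 1.0),
    "adjacency": lambda code: (code // 2, code - code // 2),
    "symmetry": lambda code: 1,
}
root = decode(5, classify, decoders)
```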

The dataset for training the model included 800 3D shapes from three categories in ShapeNet: chairs (500), tables (200), airplanes (100). For each 3D shape, the researchers created 36 rendered views around the shape, one for every 30° of rotation at each of 3 elevations. Together with another 24 randomly generated views, there were 60 rendered RGB images in total for each shape. The 3D shapes were then complemented with randomly selected backgrounds from the NYU v2 dataset.
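The view counts can be checked with a short sketch that enumerates the camera poses. The elevation angles and the sampling ranges for the random views are placeholders chosen here, since the text only gives the counts; the function name `render_views` is hypothetical.

```python
import random

def render_views(num_random=24, step_deg=30, elevations=(0, 15, 30), seed=0):
    """List (azimuth, elevation) pairs for one shape.

    The elevation values and the ranges for the random views are
    placeholders; the text only specifies how many views there are.
    """
    # One view every `step_deg` degrees of rotation, at each elevation.
    regular = [(az, el) for el in elevations for az in range(0, 360, step_deg)]
    rng = random.Random(seed)
    extra = [(rng.uniform(0, 360), rng.uniform(0, 90)) for _ in range(num_random)]
    return regular + extra

views = render_views()
print(len(views))          # 36 regular + 24 random = 60
print(800 * len(views))    # 48000 rendered images across the 800 shapes
```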

Results and Application

Some results of recovering 3D shape structure from a single RGB image using the suggested approach can be observed in Figure 1, where top 8 images, returned by Google for the search of “chair”, “table” and “airplane”, were selected, and then for each image a 3D cuboid structure was recovered. From the results, it can be observed that the approach described here is able to recover 3D shape structures from real images in a detailed and accurate way. Moreover, it allows recovering the connection and symmetry relations of the shape parts from single view inputs.

The authors of this approach suggest two settings, where their method of recovering 3D shape structure can be used:

  • structure-aware image editing;
  • structure-assisted 3D volume refinement.

The results of applying their method to these problems are demonstrated in the image below.

Figure 4. Top row: The inferred 3D shape structure can be used to complete and refine the volumetric shape. Bottom row: The structure is used to assist structure-aware image editing.

Bottom Line

The suggested approach to recovering 3D shape structure from a single RGB image has several important strengths:

  • connection and symmetry relations are recovered quite accurately;
  • the overall result is sufficiently detailed;
  • the method can be useful for structure-aware image editing and structure-assisted 3D volume refinement.

However, the method fails to recover structures for object categories not seen in the training set. Moreover, it currently recovers only 3D cuboids and not the underlying part geometry, so a round table appears as a square table in the recovered 3D shape structure.

Figure 5: Comparing single-view, part-based 3D shape reconstruction between our Im2Struct and two alternatives

To sum up, by combining 2 neural networks (structure masking network and structure recovery network) the researchers managed to recover faithful and detailed 3D shape structure of an object from a single 2D image, reflecting part connectivity and symmetries — something that has never been done before.

The main job, namely reflecting part connectivity and symmetries, was done by the second network, while combining it with the structure masking network made the results more accurate in general. From this point of view, the structure recovery network and, in particular, its structure decoding part are the key component of this research.

Image Editing Becomes Easy with Semantically Meaningful Objects Generated

3 July 2018
semantic soft segmentation

Image editing and compositing could be a fascinating creative process if you did not need to spend most of your time on the tedious task of object selection. The process becomes even more time-consuming when fuzzy boundaries and transparency are involved. Existing tools such as the magnetic lasso and the magic wand exploit only low-level cues and produce binary selections that need further refinement by the visual artist to account for soft boundaries.

In this article, we are going to discover how neural networks may assist with this challenging task and create a set of layers that correspond to semantically meaningful regions with accurate soft transitions between different objects.

Suggested Approach

A group of researchers from MIT CSAIL (USA) and ETH Zürich (Switzerland), headed by Y. Aksoy, suggested approaching this problem from a spectral segmentation angle. In particular, they propose a graph structure that combines the texture and color information from the input image with higher-level semantic information generated by a neural network. The soft segments are generated fully automatically via eigendecomposition of the carefully constructed Laplacian matrix. High-quality layers generated from the eigenvectors can then be utilized for quick and easy image editing. Combining elements from different images has always been a powerful way to produce new content, and now it’s also becoming much more efficient with the automatically created layers.

Overview of the suggested approach

Model Specifications

Let’s now discuss this approach to creating semantically meaningful layers step-by-step:

1. Spectral matting. The approach builds upon the work of Levin and his colleagues, who were the first to introduce the matting Laplacian, which uses local color distributions to define a matrix L that captures the affinity between each pair of pixels in a local patch. Using this matrix, they minimize the quadratic functional αᵀLα, subject to user-provided constraints, with α denoting a vector made of all the α values for a layer. So, each soft segment is a linear combination of the K eigenvectors corresponding to the smallest eigenvalues of L that maximizes matting sparsity.
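To build some intuition for why minimizing this quadratic functional produces sensible segments, the sketch below evaluates it on a toy graph. It uses a plain graph Laplacian L = D − W built from hand-picked pairwise affinities, which is a simplified stand-in for the matting Laplacian (the latter is derived from local color statistics). For such a Laplacian the functional equals the sum of affinity-weighted squared differences between neighboring alpha values, so placing a layer boundary across a low-affinity link is cheap.

```python
def laplacian(weights, n):
    """Graph Laplacian L = D - W from a dict {(i, j): affinity}."""
    L = [[0.0] * n for _ in range(n)]
    for (i, j), w in weights.items():
        L[i][j] -= w
        L[j][i] -= w
        L[i][i] += w
        L[j][j] += w
    return L

def quadratic_form(L, alpha):
    """Evaluate alpha^T L alpha."""
    n = len(alpha)
    return sum(alpha[i] * L[i][j] * alpha[j] for i in range(n) for j in range(n))

# Four pixels in a row; the middle pair has low affinity (an image edge).
W = {(0, 1): 1.0, (1, 2): 0.1, (2, 3): 1.0}
L = laplacian(W, 4)
aligned = quadratic_form(L, [1.0, 1.0, 0.0, 0.0])    # boundary at the weak link
misplaced = quadratic_form(L, [1.0, 0.0, 0.0, 0.0])  # boundary at a strong link
```

In this toy case the alpha vector whose boundary follows the image edge has a much lower energy than the one that cuts through similar pixels.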

2. Color affinity. To define nonlocal color affinity, the researchers suggest generating 2500 superpixels and estimating the affinity between each superpixel and all the superpixels within a radius that corresponds to 20% of the image size. This affinity essentially makes sure that regions with very similar colors stay connected in challenging scene structures like the one depicted below.

Nonlocal color affinity

3. High-level semantic affinity. This stage was introduced to create segments that are confined to semantically similar regions. Semantic affinity encourages the grouping of pixels that belong to the same scene object and discourages that of pixels from different objects. Here the researchers build upon prior work in the domain of object recognition to compute a feature vector at each pixel that correlates with the underlying object. Feature vectors are computed via a neural network, which will be discussed in more detail later. Semantic affinity is defined over superpixels similarly to color affinity. However, unlike the color affinity, the semantic affinity only relates nearby superpixels to favor the creation of connected objects. The combination of nonlocal color affinity and local semantic affinity allows creating layers that cover spatially disconnected regions of the same semantically coherent region (e.g., greenery, sky, other types of background).

Semantic affinity

4. Creating the layers. This part is carried out using the affinities defined earlier to form a Laplacian matrix L. The eigenvectors corresponding to the 100 smallest eigenvalues of L are extracted from this matrix, and then a two-step sparsification process is used to create 40 layers from these eigenvectors. Then, the number of layers is reduced by running the k-means algorithm with k = 5. This approach produced better results than trying to directly sparsify the 100 eigenvectors into 5 layers, since such a drastic reduction makes the problem overly constrained. The researchers chose 5 as the number of segments and claim that it is a reasonable number for most images. Still, this number can be changed by the user depending on the scene structure.
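The final grouping step can be sketched as ordinary k-means followed by a merge. The sketch assumes that each soft layer is summarized by a feature vector (hypothetical one-dimensional values here) and that layers in the same cluster are merged by summing their alpha values per pixel; both are assumptions made for illustration, as are the deterministic seeding and the tiny example with 4 layers and k = 2 instead of 40 layers and k = 5.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        for idx, p in enumerate(points):
            labels[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

def merge_layers(layers, labels, k):
    """Sum the soft layers assigned to each cluster, pixel by pixel."""
    merged = [[0.0] * len(layers[0]) for _ in range(k)]
    for layer, label in zip(layers, labels):
        merged[label] = [m + a for m, a in zip(merged[label], layer)]
    return merged

# Four soft layers over two pixels, each with a made-up 1-D feature.
features = [(0.0,), (1.0,), (0.1,), (0.9,)]
layers = [[0.5, 0.0], [0.0, 0.6], [0.5, 0.0], [0.0, 0.4]]
labels = kmeans(features, k=2)
merged = merge_layers(layers, labels, k=2)
```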

Soft segments before and after grouping

5. Semantic feature vectors. In this implementation, a semantic segmentation approach was combined with a network for metric learning. The base network of the feature extractor is based on DeepLab-ResNet-101 and trained with a metric learning approach to maximize the L2 distance between the features of different objects. Thus, the network minimizes the distance between the features of samples having the same ground-truth classes and maximizes the distance otherwise.

Qualitative Comparison to the Related Methods

Comparison between the proposed soft segments and soft color segments by Aksoy

The figures below show the results of the suggested approach (marked as ‘Our result’) together with those of spectral matting, the most closely related soft segmentation method, and two state-of-the-art methods for semantic segmentation: the scene parsing method PSPNet and the instance segmentation method Mask R-CNN.

Qualitative comparison of the semantic soft segmentation approach with the related methods

You may observe that PSPNet and Mask R-CNN tend to produce inaccuracies around object boundaries, while the soft segments of spectral matting often extend beyond object boundaries. At the same time, the semantic soft segmentation approach, described here, captures objects in their entirety without grouping unrelated objects and achieves a high accuracy at edges, including soft transitions when appropriate. However, it should be noted that the semantic features in this method are not instance-aware, i.e. the features of two different objects of the same class are similar. This results in multiple objects being represented in the same layer such as cows or giraffes on the pictures above.

Image Editing with Semantic Soft Segments

Several use cases of soft segments for targeted editing and compositing are demonstrated below. As you can see, the soft segments can also be used to define masks for specific adjustment layers, such as adding motion blur to the train in (2), color grading the people and the backgrounds separately in (5, 6), and stylizing the hot-air balloon, sky, terrain and the person separately in (8). While these edits can be done via user-drawn masks or natural matting algorithms, automatically defining the semantically meaningful objects makes the targeted edits effortless for the visual artist.

Use of semantic soft segmentation in image editing tasks

Bottom Line

The proposed approach generates soft segments that correspond to semantically meaningful regions in the image by fusing the high-level information from a neural network with low-level image features fully automatically. However, the method has several limitations. First of all, it is relatively slow: runtime for a 640 x 480 image lies between 3 and 4 minutes. Secondly, the method does not generate separate layers for different instances of the same class of objects. And finally, as demonstrated below, the method may fail at the initial constrained sparsification step when the object colors are very similar (top example), or the grouping of soft segments may fail due to unreliable semantic feature vectors around large transition regions (bottom example).

Failure cases

Still, the soft segments generated using the presented approach provide a convenient intermediate image representation that makes it much easier to handle image editing and compositing tasks, which otherwise require a lot of manual labor.

FAIR Proposed a New Partially Supervised Training Paradigm to Segment Every Thing

26 June 2018
image segmentation

Object detectors have become significantly more accurate and gained new capabilities. One of the most exciting is the ability to predict a foreground segmentation mask for each detected object, a task called instance segmentation. In practice, typical instance segmentation systems are restricted to a narrow slice of the vast visual world that includes only around 100 object categories. A principal reason for this limitation is that state-of-the-art instance segmentation algorithms require strong supervision and such supervision may be limited and expensive to collect for new categories. By comparison, bounding box annotations are more abundant and cheaper.

FAIR (Facebook AI Research) introduced a new partially supervised instance segmentation task and proposed a novel transfer learning method to address it. The partially supervised instance segmentation task is defined as follows:

  1. Given a set of categories of interest, a small subset has instance mask annotations, while the other categories have only bounding box annotations.
  2. The instance segmentation algorithm should utilize this data to fit a model that can segment instances of all object categories in the set of interest.

Since the training data is a mixture of strongly annotated examples (those with masks) and weakly annotated examples (those with only boxes), the task is referred to as partially supervised. To address partially supervised instance segmentation, the authors propose a novel transfer learning approach built on Mask R-CNN. Mask R-CNN is well-suited to this task because it decomposes the instance segmentation problem into the subtasks of bounding box object detection and mask prediction.

Learning to Segment Every Thing

Let C be the set of object categories for which an instance segmentation model is to be trained. Standard approaches assume that all training examples in C are annotated with instance masks. Here this requirement is relaxed: it is assumed that C = A ∪ B, where examples from the categories in A have masks, while those in B have only bounding boxes. Since the examples of the B categories are weakly labeled with respect to the target task (instance segmentation), training on the combination of strong and weak labels is referred to as a partially supervised learning problem. Given an instance segmentation model like Mask R-CNN that has a bounding box detection component and a mask prediction component, the authors propose the Mask^X R-CNN method, which transfers category-specific information from the model’s bounding box detectors to its instance mask predictors.
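The split C = A ∪ B can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the `Example` record, the helper names, and the toy categories are hypothetical, and the rule that the mask loss is applied only to categories in A is an assumption that follows from B having no masks to supervise against.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Example:
    category: str
    box: Tuple[float, float, float, float]      # (x1, y1, x2, y2)
    mask: Optional[List[List[int]]] = None      # None when only a box is annotated


def split_categories(examples: List[Example]) -> Tuple[Set[str], Set[str]]:
    """Return (A, B): categories that have masks, and box-only categories."""
    all_categories = {ex.category for ex in examples}
    set_a = {ex.category for ex in examples if ex.mask is not None}
    return set_a, all_categories - set_a


def active_losses(example: Example, set_a: Set[str]) -> List[str]:
    """Box losses apply to every example; the mask loss only to categories in A."""
    losses = ["box_classification", "box_regression"]
    if example.category in set_a and example.mask is not None:
        losses.append("mask")
    return losses


# Hypothetical toy data: "person" is strongly annotated, "kite" is box-only.
examples = [
    Example("person", (10, 20, 50, 120), mask=[[0, 1], [1, 1]]),
    Example("kite", (200, 15, 260, 60)),
]
set_a, set_b = split_categories(examples)
for ex in examples:
    print(ex.category, active_losses(ex, set_a))
```

Every example contributes to the detection losses, so the detector sees all of C, while mask supervision comes from A alone.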

Mask^X R-CNN method
Detailed illustration of proposed Mask^X R-CNN method.

This method is built on Mask R-CNN because it is a simple instance segmentation model that also achieves state-of-the-art results. In Mask R-CNN, the last layer in the bounding box branch and the last layer in the mask branch both contain category-specific parameters that are used to perform bounding box classification and instance mask prediction, respectively, for each category. Instead of learning the category-specific bounding box parameters and mask parameters independently, the authors propose to predict a category’s mask parameters from its bounding box parameters using a generic, category-agnostic weight transfer function that can be jointly trained as part of the whole model.

For a given category c, let w(det) be the class-specific object detection weights in the last layer of the bounding box head, and w(seg) be the class-specific mask weights in the mask branch. Instead of treating w(seg) as model parameters, w(seg) is parameterized using a generic weight prediction function T(·):

w(seg) = T(w(det); θ)

where θ are class-agnostic, learned parameters. The same transfer function T(·) may be applied to any category c and, thus, θ should be set such that T generalizes to classes whose masks are not observed during training.

T(·) can be implemented as a small fully connected neural network. The illustration above shows how the weight transfer function fits into Mask R-CNN to form Mask^X R-CNN. Note that the bounding box head contains two types of detection weights: the RoI classification weights w(cls) and the bounding box regression weights w(box). Either type on its own, or the concatenation of the two, can serve as the input w(det).
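A minimal NumPy sketch of such a transfer function is given below. It is an illustration, not the authors' implementation: the class name `WeightTransfer`, the hidden width, the LeakyReLU slope, and the random initialization are all assumptions. The input and output sizes (1024-d w(cls), 4096-d w(box), 256-d w(seg)) and the number of categories follow the COCO setup reported in this article.

```python
import numpy as np


def leaky_relu(x, slope=0.1):
    # The slope value is an assumption; the article does not state it.
    return np.where(x > 0, x, slope * x)


class WeightTransfer:
    """T(w_det; theta): a 2-layer fully connected net shared by all categories."""

    def __init__(self, det_dim, hidden_dim, seg_dim, seed=0):
        rng = np.random.default_rng(seed)
        # theta = (w1, b1, w2, b2) is class-agnostic: one set for every category.
        self.w1 = rng.normal(0.0, 0.01, size=(det_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.01, size=(hidden_dim, seg_dim))
        self.b2 = np.zeros(seg_dim)

    def __call__(self, w_det):
        hidden = leaky_relu(w_det @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


num_classes = 80                                   # COCO categories
rng = np.random.default_rng(1)
w_cls = rng.normal(size=(num_classes, 1024))       # stand-in RoI classification weights
w_box = rng.normal(size=(num_classes, 4096))       # stand-in box regression weights
w_det = np.concatenate([w_cls, w_box], axis=1)     # 'cls+box' input, shape (80, 5120)

transfer = WeightTransfer(det_dim=5120, hidden_dim=256, seg_dim=256)
w_seg = transfer(w_det)                            # predicted mask weights, shape (80, 256)
print(w_seg.shape)
```

Because θ does not depend on the category, the same call produces mask weights for categories in B, whose masks were never seen during training.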

Experiments on COCO

This method is evaluated on the COCO dataset, which is small-scale with respect to the number of categories but contains exhaustive mask annotations for 80 categories. This property enables rigorous quantitative evaluation using standard detection metrics, like average precision (AP). Each class has a 1024-d RoI classification parameter vector w(cls) and a 4096-d bounding box regression parameter vector w(box) in the detection head, and a 256-d segmentation parameter vector w(seg) in the mask head. The output mask resolution is M × M = 28 × 28. The table below compares the full Mask^X R-CNN method (i.e., Mask R-CNN with ‘transfer+MLP’ and T implemented as ‘cls+box, 2-layer, LeakyReLU’) and the class-agnostic baseline using end-to-end training.

Experiments on COCO

Mask^X R-CNN outperforms the class-agnostic baseline by a large margin (over 20% relative increase in mask AP).
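To make the dimensions from the experimental setup concrete, the following sketch shows how a single 256-d w(seg) vector could turn RoI features into a 28 × 28 mask. It assumes that the last layer of the mask head acts as a 1 × 1 convolution over a 256-channel feature map, omits any bias term, and uses random stand-in features; the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict_mask(roi_features, w_seg_c):
    """Apply one category's mask weights as a 1x1 convolution.

    roi_features: (256, M, M) feature map for one region of interest.
    w_seg_c:      (256,) mask weights for category c.
    Returns an (M, M) map of per-pixel foreground probabilities.
    """
    logits = np.einsum("c,chw->hw", w_seg_c, roi_features)
    return sigmoid(logits)


rng = np.random.default_rng(0)
roi_features = rng.normal(size=(256, 28, 28))      # stand-in RoI features
w_seg_c = rng.normal(scale=0.05, size=256)         # stand-in for a predicted w(seg)

mask_probabilities = predict_mask(roi_features, w_seg_c)
binary_mask = mask_probabilities > 0.5             # threshold to a foreground mask
print(mask_probabilities.shape, binary_mask.dtype)
```

In Mask^X R-CNN the vector `w_seg_c` would come from the transfer function T rather than being learned directly, which is what lets box-only categories receive a mask predictor.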


Mask^X R-CNN approach
Mask predictions from the class-agnostic baseline (top row) vs. Mask^X R-CNN approach (bottom row). Green boxes are classes in set A while the red boxes are classes in set B. The left 2 columns are A = {voc} and the right 2 columns are A = {non-voc}.

This research addresses the problem of large-scale instance segmentation by formulating a partially supervised learning paradigm in which only a subset of classes have instance masks during training while the rest have box annotations. FAIR proposes a novel transfer learning approach, where a learned weight transfer function predicts how each class should be segmented based on parameters learned for detecting bounding boxes. Experimental results on the COCO dataset demonstrate that this method significantly improves the generalization of mask prediction to categories without mask training data. The approach also makes it possible to build a large-scale instance segmentation model covering 3000 classes in the Visual Genome dataset.

Mask predictions on 3000 classes in Visual Genome.
Mask predictions from Mask^X R-CNN on 3000 classes in Visual Genome.

Muneeb Ul Hassan