Dissecting GANs for Better Understanding and Visualization

5 December 2018
dissecting gan paper

GANs can be taught to create (or generate) worlds similar to our own in any domain: images, music, speech, etc. Since 2014, a large number of improvements of GANs have been proposed, and GANs have achieved impressive results. Researchers from MIT-IBM Watson Lab have presented GAN Paint based on Dissecting GAN – the method to validate if an explicit representation of an object is present in an image (or feature map) from a hidden layer:

GAN paint gif
The GAN Paint interactive tool

State-of-the-art Idea

However, a concern raised very often in machine learning is the lack of understanding of the methods being developed and applied. Despite the success of GANs, their visualization and understanding remain little-explored areas of research.

A group of researchers led by David Bau has conducted the first systematic study of the internal representations of GANs. In their paper, they present an analytic framework to visualize and understand GANs at the unit, object, and scene level.

Their work resulted in a general method for visualizing and understanding GANs at different levels of abstraction, several practical applications enabled by their analytic framework, and open-source interpretation tools for better understanding Generative Adversarial Network models.

dissecting gan
Inserting door by setting 20 causal units to a fixed high value at one pixel in the representation.

Method

From what we have seen so far, especially in the image domain, Generative Adversarial Networks can generate highly realistic images across different domains. From this perspective, one might say that GANs have learned facts at a higher level of abstraction, for example objects. However, there are also cases where GANs fail terribly and produce very unrealistic images. So, is there a way to explain at least these two cases? David Bau and his team try to answer this question, among a few others, in their paper. They studied the internal representations of GANs and tried to understand how a GAN represents structures and relationships between objects (from the point of view of a human observer).

As the researchers mention in their paper, there has been previous work on visualizing and understanding deep neural networks but mostly for image classification tasks. Much less work has been done in visualization and understanding of generative models.

The main goal of the systematic analysis is to understand how objects such as trees are encoded by the internal representations of a GAN generator network. To do this, the researchers study the structure of a hidden representation given as a feature map. Their study is divided into two phases that they call: dissection and intervention.

Characterizing units by Dissection

The goal of the first phase, dissection, is to determine whether an explicit representation of an object is present in a feature map from a hidden layer, and to identify which classes from a dictionary of concepts have such an explicit representation.

To search for explicit representations of objects, they quantify the spatial agreement between a unit's thresholded feature map and a concept's segmentation mask using the intersection-over-union (IoU) measure. The result, called agreement, characterizes individual units: it makes it possible to rank the concepts related to each unit and to label each unit with the concept that matches it best.

Dissection algorithm
Phase 1: Dissection.
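
As a concrete illustration, here is a minimal sketch of the agreement computation described above, assuming we already have a unit's upsampled feature map and a binary segmentation mask per concept; the fixed-quantile threshold is a simplification of the per-unit threshold used in the paper.

```python
import numpy as np

def dissect_unit(feature_map, concept_mask, quantile=0.995):
    """Agreement (IoU) between one unit's thresholded feature map
    and a concept's segmentation mask (both HxW arrays)."""
    # Threshold the unit's activations at a high quantile, as a stand-in
    # for the per-unit threshold used in the paper.
    t = np.quantile(feature_map, quantile)
    unit_mask = feature_map > t
    intersection = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return intersection / union if union > 0 else 0.0

def label_unit(feature_map, concept_masks):
    """Label a unit with the concept whose mask agrees with it best."""
    scores = {c: dissect_unit(feature_map, m) for c, m in concept_masks.items()}
    return max(scores, key=scores.get), scores
```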

Measuring causal relationships using Intervention

The second important question mentioned before is causality. The second phase, intervention, seeks to estimate the causal effect of a set of units on a particular concept.

To measure this effect, the intervention phase evaluates the impact of forcing units on (unit insertion) and off (unit ablation), again using segmentation masks. More precisely, a set of units in the feature map is forced on and off, the images resulting from those two representations are segmented to obtain two segmentation masks, and the masks are compared to measure the causal effect.

Intervention algorithm
Phase 2: Intervention.
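
A minimal sketch of this measurement, assuming a generator whose intermediate features can be edited (the `features` and `render` methods are hypothetical) and a segmentation network that returns a binary mask for the target concept; it illustrates the idea rather than reproducing the paper's exact procedure.

```python
import torch

def causal_effect(generator, segment, z, units, on_value=10.0):
    """Average causal effect of a set of units on a concept: how much the
    concept's segmented area changes when the units are forced on vs. off."""
    with torch.no_grad():
        feats = generator.features(z)           # hypothetical: intermediate feature maps (B, C, H, W)
        feats_off, feats_on = feats.clone(), feats.clone()
        feats_off[:, units] = 0.0               # ablation: force the chosen units off
        feats_on[:, units] = on_value           # insertion: force them to a fixed high value
        mask_off = segment(generator.render(feats_off))   # binary concept mask per image
        mask_on = segment(generator.render(feats_on))
        # Causal effect = average growth of the concept's segmented area.
        return (mask_on.float() - mask_off.float()).mean().item()
```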

Results

For the whole study, the researchers use three variants of Progressive GANs (Karras et al., 2018) trained on LSUN scene datasets. For the segmentation task, they use a recent image segmentation model (Xiao et al., 2018) trained on the ADE20K scene dataset.

An extensive analysis was done using the proposed framework for understanding and visualizing GANs. The first part, dissection, was used by the researchers for analyzing and comparing units across datasets, layers, and models, and for locating artifact units.

Comparing representations learned by progressive GANs
Comparing representations learned by progressive GANs trained on different scene types. The units that emerge match objects that commonly appear in the scene type: seats in conference rooms and stoves in kitchens.
GAN
Removing successively larger sets of tree-causal units from a GAN.

The second part of the framework, intervention, was applied to a set of dominant object classes to locate causal units that can remove or insert objects in different images. The results are presented in the paper and the supplementary material, and a video demonstrating the interactive tool was released. Some of the results are shown in the figures below.

Visualizing the activations of individual units in two GANs.

Conclusion

This is one of the first extensive studies targeting the understanding and visualization of generative models. Focusing on the most popular generative model, Generative Adversarial Networks, this work reveals significant insights about generative models. One of the main findings is that a large part of GAN representations can be interpreted: the internal representation encodes variables that have a causal effect on the generation of objects and realistic images.

Many researchers will benefit from the insights that came out of this work, as well as from the proposed framework, which provides a basis for analyzing, debugging, and understanding Generative Adversarial Network models.

PIFR: Pose Invariant 3D Face Reconstruction

26 November 2018
pifr reconstruction

3D face geometry needs to be recovered from 2D images in many real-world applications, including face recognition, face landmark detection, 3D emoticon animation, etc. However, this task remains challenging, especially under large poses, when much of the information about the face is unobservable.

Jiang and Wu from Jiangnan University (China) and Kittler from the University of Surrey (UK) suggest a novel 3D face reconstruction algorithm that significantly improves the accuracy of reconstruction, even under extreme poses.

But let's first briefly review the previous work on 3D face models and 3D face reconstruction.

Related Work

The research mentions four publicly available 3D deformation models; this paper uses the BFM model, which is the most popular.

There are several approaches to reconstructing a 3D model from 2D images.

State-of-the-art idea

The paper by Jiang, Wu, and Kittler proposes a novel Pose-Invariant 3D Face Reconstruction (PIFR) algorithm based on 3D Morphable Model (3DMM).

Firstly, they suggest generating a frontal image by normalizing a single face input image. This step helps restore additional identity information about the face.

The next step is to use a weighted sum of the 3D parameters of both images: the frontal one and the original one. This preserves the pose of the original image while enhancing the identity information.

The pipeline for the suggested approach is provided below.

Overview of the Pose-Invariant 3D Face Reconstruction (PIFR) method

The experiments show that the PIFR algorithm significantly improves the performance of 3D face reconstruction compared to previous methods, especially in extreme pose cases.

Let’s now have a closer look at the suggested model…

Model details

The PIFR method relies largely on the 3DMM fitting process, which can be expressed as minimizing the error between the 2D coordinates of the projected 3D points and the ground truth. However, the face generated by the 3D model has about 50,000 vertices, so iterative calculations lead to slow and ineffective convergence. To overcome this problem, the researchers suggest using landmarks (e.g., eye centers, mouth corners, and nose tip) as the ground truth in the fitting process. Specifically, they use a weighted landmark 3DMM fitting.

Top row: the original image and landmark. Bottom row: 3D face model and its alignment to the 2D image
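
The weighted landmark fitting objective can be sketched as follows; the projection function, parameter layout, and weights here are placeholders rather than the authors' exact formulation.

```python
import numpy as np

def landmark_fitting_loss(params, model_landmarks_3d, target_landmarks_2d, weights, project):
    """Weighted 2D error between projected 3D model landmarks and ground-truth
    landmarks. `project` maps 3D points to 2D given pose/shape parameters."""
    projected = project(model_landmarks_3d, params)      # (68, 2)
    residuals = projected - target_landmarks_2d          # (68, 2)
    # Landmarks such as the eye centers, mouth corners, and nose tip
    # can be given larger weights than contour points.
    return np.sum(weights[:, None] * residuals ** 2)
```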

The next challenge is to reconstruct 3D faces in large poses. To solve this problem, the researchers use the High-Fidelity Pose and Expression Normalization (HPEN) method, but only for normalizing the pose, not the expression. In addition, Poisson editing is used to recover the areas of the face occluded due to the viewing angle.

Performance Comparison with Other Methods

The performance of PIFR method was evaluated for the face reconstruction:

  • in small and medium poses;
  • large poses;
  • extreme poses (±90 yaw angles).

For this purpose, the researchers used three publicly available datasets:

  • The AFW dataset, created using Flickr images, contains 205 images with 468 marked faces, complex backgrounds, and varied face poses.
  • The LFPW dataset has 224 face images in the test set and 811 face images in the training set; each image is marked with 68 feature points. 900 face images from both sets were selected for testing in this research.
  • The AFLW dataset is a large-scale face database containing around 25,000 hand-labeled face images, each marked with 21 feature points. This study used only the extreme-pose face images from this dataset, for qualitative analysis.

Quantitative analysis. Using the Mean Euclidean Metric (MEM), the study compares the performance of PIFR method to E-3DMM and FW-3DMM on AFW and LFPW datasets. Cumulative errors distribution (CED) curves look like this:

Comparisons of cumulative errors distribution (CED) curves on AFW and LFPW datasets

As you can see from these plots and the tables below, the PIFR method shows superior performance compared to the other two methods. Its reconstruction performance in large poses is particularly good.
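
For reference, here is a minimal sketch of how such an evaluation is typically computed, assuming per-image predicted and ground-truth point sets; it illustrates MEM and CED curves in general, not the authors' evaluation code.

```python
import numpy as np

def mean_euclidean_error(pred_points, gt_points):
    """Mean Euclidean distance between predicted and ground-truth points, shape (N, 2) or (N, 3)."""
    return np.mean(np.linalg.norm(pred_points - gt_points, axis=1))

def ced_curve(errors, thresholds):
    """Cumulative errors distribution: fraction of test images with error below each threshold."""
    errors = np.asarray(errors)
    return np.array([(errors <= t).mean() for t in thresholds])

# Per-image errors for one method, then the curve on a grid of thresholds:
# errors = [mean_euclidean_error(p, g) for p, g in zip(predictions, ground_truths)]
# curve = ced_curve(errors, thresholds=np.linspace(0, 10, 100))
```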

Qualitative analysis. The method was also assessed qualitatively based on the face images in extreme poses from AFLW dataset. The results are shown in the figure below.

Comparison of 3D face reconstruction: (a) Input image; (b) FW-3DMM; (c) E-3DMM; (d) Suggested approach

Even though half of the landmarks are invisible due to the extreme pose, which leads to large errors and failures for the other methods, the PIFR method still performs quite well.

Here are some additional examples of the PIFR method performance based on the images from the AFW dataset.

Top row: Input 2D image. Middle row: 3D face. Bottom row: Align to 2D image

Bottom Line

The novel 3D face reconstruction framework PIFR demonstrates good reconstruction performance even in extreme poses. By using a weighted fusion of both the original and the frontalized images, the method restores enough facial information to reconstruct the 3D face.

In the future, the researchers plan to restore even more facial identity information to improve the accuracy of reconstruction further.

BrainNet – Brain-to-Brain Interface for Direct Collaboration Between Brains

23 October 2018
brain-to-brain interface

In the past few years, direct brain-to-computer communication has started to gain more and more attention. A few breakthroughs have traced the path for researchers towards building powerful brain-to-computer interfaces that will enable a different kind of human-machine communication and interaction. Most of the research on brain communication interfaces has focused on brain-computer interfaces (BCI). Less work has been done on connecting two or more brains through a brain-to-brain interface (BBI).

State-of-the-art Idea

One famous study that examined a brain-to-brain interface, called "20 Questions", was done a few years ago; it focused on question answering by connecting two brains and outputting a result by touching a screen. This study, among a few others, represents an initial, rather primitive brain-to-brain interface that allows only the most straightforward interaction and only two subjects (brains).

Recently, researchers from the University of Washington and Carnegie Mellon University presented the first multi-person, non-invasive, direct brain-to-brain interface for collaborative problem-solving. In their work, they address several problems and shortcomings of previous brain-to-brain interfaces to create an interface that pushes the boundaries of direct brain-to-brain communication. The proposed method, as well as the experiments conducted within this study, is explained in the sections that follow.

Method

The method uses EEG to record brain signals and transcranial magnetic stimulation (TMS) to deliver information to the brain non-invasively. The architecture of the proposed brain-to-brain interface uses two kinds of modules: a BCI (brain-computer interface) based on EEG, which conveys information about the collaborative task, and a CBI (computer-brain interface) based on TMS. A scheme of the interface is given in the image below.

brainnet - brain-to-brain interface
The architecture of BrainNet.

The interface allows three human subjects to collaborate and solve a task using direct brain-to-brain communication. In the experiments, the novel brain-to-brain interface was used to perform a collaborative work based on a Tetris-like game.

Since this is the first non-invasive, multi-person brain-to-brain interface, the researchers also explored the possibility of prioritizing information, an essential feature of communication in social networks.

The initial implementation of BrainNet – the proposed new-generation brain-to-brain interface, allows for two “Sender” subjects and one “Receiver” subject to communicate. The interface does not require any physical action by the subjects.

brain-to-brain network
Examples of Screens seen by the Receiver and the Senders across Two Rounds.

Experiments and Results

The experiments were organized so that three participants try to solve a collaborative task: one of them, designated as the Receiver, is in charge of deciding whether or not to rotate a block in the Tetris-like game before it drops to fill a gap in a line at the bottom of the screen. This participant cannot see the bottom of the screen and therefore cannot make an informed decision about the game on their own. The two other participants, designated as the Senders, can see the entire screen; their task is to make the correct choice (rotate or not) based on the shape of the current block and the gap at the bottom, and to inform the Receiver of the decision via the brain-to-brain interface.

In a setup like this, a trial is composed of two rounds: the first round is as described above; while in the second round, the Senders are given the opportunity to examine the Receiver’s decision (shown on their screen as the block, now potentially rotated, mid-way through its fall) and have another chance to make new (possibly corrective) suggestion.

brain-to-brain network accuracy
The accuracy achieved by each of the five triads of participants. Accuracy is defined as the proportion of correct block rotations achieved by the triad.

To measure performance, the researchers use both simple and more advanced evaluation metrics. First, they measure the number of correct block rotations, reporting a mean accuracy of 0.8125 (much higher than the chance level of 0.5). They also compute binary classification metrics such as the AUC, the area under the Receiver Operating Characteristic (ROC) curve.

ROC Curves for the Five Triads of Participants
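
A small sketch of how these metrics can be computed with scikit-learn from per-trial decisions; the arrays below are hypothetical examples, not data from the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, mutual_info_score

# Hypothetical per-trial data for one triad: the correct action, the
# Receiver's decision, and one Sender's decision (1 = rotate, 0 = do not rotate).
correct = np.array([1, 0, 1, 1, 0, 1, 0, 1])
receiver = np.array([1, 0, 1, 0, 0, 1, 0, 1])
sender = np.array([1, 0, 1, 1, 0, 1, 1, 1])

accuracy = (receiver == correct).mean()    # proportion of correct block rotations
auc = roc_auc_score(correct, receiver)     # area under the ROC curve
mi = mutual_info_score(sender, receiver)   # mutual information between Sender and Receiver (nats)

print(f"accuracy={accuracy:.3f}  AUC={auc:.3f}  MI={mi:.3f}")
```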

To quantify the degree of communication and the amount of information transmitted between subjects, the researchers employ the mutual information (MI) measure. They report a significant difference in the amount of transmitted information compared to chance performance. Also, as expected from the experimental design, they report significantly higher MI values (that is, more information being transferred) between a good Sender and the Receiver than between a bad Sender and the Receiver.

The mutual information between the Receiver's and a Sender's decisions is computed as

MI(R; S) = \sum_{r,s} p_{R,S}(r, s) \log \frac{p_{R,S}(r, s)}{p_R(r)\, p_S(s)},

where r is a decision made by the Receiver (0 or 1, corresponding to "do not rotate" or "rotate"), s is a decision made by one of the Senders, p_R(r) and p_S(s) are the probabilities of the Receiver and of a Sender making decisions r and s, and p_{R,S}(r, s) is the joint probability of the Receiver making decision r and a Sender making decision s.
The difference in the amount of transmitted information given by the MI measure for good and bad Senders

Conclusions

This work, conducted by researchers at the University of Washington and Carnegie Mellon University, represents a significant contribution to the field of brain-to-brain interfaces. It is a new-generation method that improves on previous brain-to-brain interfaces in many ways, scaling BBIs to multiple human subjects working collaboratively to solve a task.

It is the first brain-to-brain interface to combine EEG and TMS signals in the same human subject, and it introduces an important feature – differentiation of the credibility of information. As such, this contribution represents an important step towards seamless brain-to-brain communication and collaborative problem solving through BBIs.

Fooling Facial Recognition: Fast Method for Generating Adversarial Faces

2 October 2018
Fooling Facial Recognition Fast Method for Generating Adversarial Faces

With rapid progress and state-of-the-art performance in a wide range of tasks, deep learning based methods are in use in a large number of security-sensitive and critical applications. However, despite their remarkable, often beyond-human-level performance, deep learning methods are vulnerable to well-designed input samples, called adversarial examples. In a game of cat and mouse, researchers have been competing to design ever more effective adversarial attacks on one hand and more robust defense mechanisms on the other.

The problem of adversarial attacks is especially pronounced in computer vision tasks such as object recognition and classification. In image processing with deep learning, small perturbations in the input space can result in a significant change in the output. Such perturbations are almost unnoticeable to humans and do not change the semantics of the image content, yet they can fool deep learning methods.

Adversarial attacks are a big concern in security-critical applications such as identity verification, access control etc. One particular target of adversarial attacks is face recognition.

Previous works

The excellent performance of deep learning methods for face recognition has led to their adoption in a wide variety of systems.

In the past, adversarial attacks have targeted face recognition systems. These attacks can be divided into two main groups: intensity-based and geometry-based adversarial attacks. Many of them have proved very successful in fooling face recognition systems, and in response a number of defense mechanisms have been proposed to deal with different kinds of attacks.

face adversarial attacks
Comparison of the proposed attack (Column 2) to an intensity-based attack (Column 3).

To exploit the vulnerability of face recognition systems and bypass defense mechanisms, more and more sophisticated adversarial attacks have been developed. Some of them change pixel intensities, while others spatially transform benign images to perform the attack.

State-of-the-art idea

Researchers from West Virginia University have proposed a new, fast method for generating adversarial face images. Their approach defines a face transformation model based on facial landmark locations.

Method

The problem of manipulating an image and transforming it into an adversarial sample is addressed through landmark manipulation. The technique optimizes a displacement field that is used to spatially warp the input image. It is a geometry-based attack, able to generate an adversarial sample by modifying only a small number of landmarks.

Taking into account that facial landmark locations provide highly discriminative information for face recognition, the researchers use the gradients of the prediction with respect to the landmark positions to update the displacement field. A scheme of the proposed method for generating adversarial face images is shown in the picture below.

face adversarial attacks architecture
The proposed method is optimizing for a displacement field to produce adversarial landmark locations.
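
The core idea, taking a gradient step on a per-landmark displacement field to lower the classifier's confidence in the true identity, can be sketched as follows; `warp_by_landmarks` is an assumed differentiable warping function, and this is not the authors' FLM/GFLM implementation.

```python
import torch

def landmark_attack_step(image, landmarks, true_label, classifier, warp_by_landmarks, step=0.5):
    """One iteration of a geometry-based attack: move the landmarks in the
    direction that lowers the classifier's confidence in the true identity."""
    flow = torch.zeros_like(landmarks, requires_grad=True)          # per-landmark displacement field
    warped = warp_by_landmarks(image, landmarks, landmarks + flow)  # assumed differentiable warp
    logits = classifier(warped)
    loss = logits[:, true_label].sum()    # confidence in the true class
    loss.backward()
    with torch.no_grad():
        # Move the landmarks against the gradient of the true-class score.
        flow_update = -step * flow.grad.sign()
    return landmarks + flow_update
```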

To overcome the problem of conflicting updates of the displacement field caused by gradients pointing in different directions, they propose grouping the landmarks semantically. This makes it possible to manipulate group properties instead of perturbing each landmark individually, yielding more natural-looking images.

face landmarks
Grouping face landmarks based on semantic regions of the face (eye, nose, mouth etc.).

Results

The new adversarial face generator was evaluated by measuring and comparing the performance of the attacks under several defense methods. To further explore the problem of generating adversarial face samples, the researchers assess how spatially manipulating face regions affects the performance of a face recognition system.

First, the performance was evaluated in a white-box attack scenario on the CASIA-WebFace dataset. Six experiments were conducted to investigate the importance of each region of the face in the proposed attack methods: the attacks were evaluated on each of the five main regions of the face, 1) eyebrows, 2) eyes, 3) nose, 4) mouth, and 5) jaw, as well as 6) all regions combined. The results are given in the table.

face adversarial comparison
Comparison of the results of the proposed attacks FLM and GFLM to stAdv [33] and exploring the influence of different regions of the face.
The researchers calculate the prediction of the true class for faces which are correctly classified and their manipulated versions.

Comparison with other state-of-the-art

A comparison with several existing methods for generating adversarial faces has been made within this study. They compare the methods in terms of success rate and also speed.

Face adversarial
Comparison of the proposed FLM and GFLM attacks to FGSM and stAdv attacks under the state-of-the-art adversarial training defenses.
face adversarial attacks technique
Linearly interpolating the defined face properties

Conclusion

This approach shows that landmark manipulation can be a reliable way of changing the prediction of face recognition classifiers. The novel method can generate adversarial faces approximately 200 times faster than other geometry-based approaches. This method creates natural samples and can fool state-of-the-art defense mechanisms.

This Neural Network Evaluates Natural Scene Memorability

1 October 2018
natural scene memorability score by neural network

One hallmark of human cognition is the remarkable capacity to recall thousands of different images, some in detail, after only a single viewing. Not all photos are remembered equally by the human brain. Some images stick in our minds, while others fade away in a short time. This capacity is likely influenced by individual experiences and is also subject to some degree of inter-subject variability, as are some individual image properties.

Interestingly, when exposed to an overflow of visual images, subjects have a consistent tendency to remember or forget the same pictures. Previous research has analyzed why people tend to remember certain images and has provided reliable solutions for ranking images by memorability score. These works mostly address generic images, object images, and face photographs. However, it is difficult to identify obvious cues relevant to the memorability of a natural scene, and to date, methods for predicting the visual memorability of natural scenes are scarce.

Previous Works

Previous work showed that memorability is an intrinsic property of an image. DNNs have demonstrated impressive achievements in many research areas, e.g., video coding and computer vision, and several DNN approaches have been proposed to estimate image memorability, significantly improving prediction accuracy.

  • Data scientists from MIT trained MemNet on a large-scale database, achieving prediction performance close to human consistency.
  • Baveye et al. fine-tuned GoogLeNet, exceeding the performance of handcrafted features. Researchers have also studied and targeted specific kinds of images, such as faces and natural scenes.
  • Researchers from MIT have also created a database for studying the memorability of human face photographs. They further explored the contribution of certain traits (e.g., kindness, trustworthiness) to face memorability, but such characteristics only partly explain facial memorability.

State-of-the-art idea

As a first step towards understanding and predicting the memorability of natural scenes, the authors built the LNSIM database, which contains 2,632 natural scene images in total. To obtain them, 6,886 images containing natural scenes were selected from existing databases, including MIR Flickr, MIT1003, NUSEF, and AVA. Figure 1 shows some example images from the LNSIM database.

Figure 1: Image samples from the LNSIM database

A memory game is used to quantify the memorability of each image in the LNSIM database. Software was developed for the game, in which 104 subjects (47 females and 57 males) took part; they do not overlap with the volunteers who participated in the image selection. The procedure of the memory game is summarized in Figure 2.

Figure 2: The experimental procedure of the memory game. Each level lasts about 5.5 minutes and shows a total of 186 images, including 66 targets, 30 fillers, and 12 vigilance images. The specific time durations of the experimental setup are labeled in the figure.
In this experiment, there were 2,632 target images, 488 vigilance images, and 1,200 filler images, all unknown to the subjects. Vigilance and filler images were randomly sampled from the remaining images among the 6,886. Vigilance images were repeated within 7 images of their first presentation, to ensure that the subjects were paying attention to the game. Filler images were presented only once, so that spacing between repetitions of the same target or vigilance image could be inserted. After collecting the data, a memorability score is assigned to each image to quantify how memorable it is. Also, to evaluate human consistency, the subjects are split into two independent halves (i.e., Group 1 and Group 2).
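
A minimal sketch of how a memorability score (the fraction of subjects who correctly recognize an image's repeat) and the split-half human consistency can be computed from such game logs; the data structure is hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

def memorability_scores(responses):
    """responses: dict mapping image id -> list of 0/1 flags, one per subject,
    indicating whether that subject correctly recognized the repeated target."""
    return {img: float(np.mean(flags)) for img, flags in responses.items()}

def split_half_consistency(responses, seed=0):
    """Spearman correlation between scores computed from two random halves of subjects."""
    rng = np.random.default_rng(seed)
    s1, s2 = [], []
    for flags in responses.values():
        flags = np.asarray(flags)
        idx = rng.permutation(len(flags))
        half = len(flags) // 2
        s1.append(flags[idx[:half]].mean())
        s2.append(flags[idx[half:]].mean())
    rho, _ = spearmanr(s1, s2)
    return rho
```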

Analysis of Natural Scene Memorability

The LNSIM database is mined to better understand how natural scene memorability is influenced by low-, middle-, and high-level handcrafted features as well as by learned deep features.

Low-level features, like pixels, SIFT, and HOG2, have an impact on the memorability of generic images, so the authors investigated whether these low-level features still work on natural scene images. To evaluate this, a support vector regression (SVR) model is trained on the training set for each low-level feature to predict memorability, and the Spearman rank correlation coefficient (SRCC) between these features and memorability is then evaluated on the test set. Table 1 reports the SRCC results on natural scenes, with the SRCC on generic images as a baseline. It is evident that pixels (ρ=0.08), SIFT (ρ=0.28), and HOG2 (ρ=0.29) are not as effective as expected on natural scenes, especially compared to generic images.
Table 1: The correlation ρ between low-level features and natural scene memorability.
This suggests that the low-level features cannot effectively characterize the visual information for remembering natural scenes.
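
The evaluation protocol for a single handcrafted feature can be sketched as follows, with random arrays standing in for the real feature vectors and memorability scores.

```python
import numpy as np
from sklearn.svm import SVR
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 512)), rng.uniform(0.3, 0.9, 2000)  # stand-in features/scores
X_test, y_test = rng.normal(size=(632, 512)), rng.uniform(0.3, 0.9, 632)

svr = SVR(kernel="rbf").fit(X_train, y_train)     # one predictor per handcrafted feature
rho, _ = spearmanr(svr.predict(X_test), y_test)   # SRCC between predictions and ground truth
print(f"SRCC rho = {rho:.2f}")
```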

The middle-level GIST feature describes the spatial structure of an image. However, Table 2 shows that the SRCC of GIST is only 0.23 for natural scenes, much less than the ρ=0.38 obtained for generic images. This illustrates that the structural information provided by the GIST feature is less effective for predicting memorability scores of natural scenes.

Table 2: The correlation ρ between middle-level features and natural scene memorability.
Since there is no salient object, animal, or person in natural scene images, scene semantics is used as a high-level feature. To obtain the ground truth of scene category, two experiments were designed to annotate the scene categories of the 2,632 images in the database.
  • Task 1 (Classification Judgment): 5 participants were asked to indicate which scene categories an image contains. A random image query was generated for each participant, and participants had to choose the proper scene category labels describing the scene content of each image.
  • Task 2 (Verification Judgment): A separate task was run on the same set of images with another 5 participants recruited after Task 1. The participants were asked to provide a binary answer for each image indicating whether the assigned categories fit it. The default answer was set to "No", and participants could check the box of an image index to change "No" to "Yes".

All images are annotated with categories through majority voting over Task 1 and Task 2. Afterward, an SVR predictor with a histogram intersection kernel is trained on the scene category feature. The scene category attribute achieves a good performance of SRCC (ρ=0.38), outperforming the combination of low-level features. This suggests that the high-level scene category is an informative cue for quantifying natural scene memorability. As shown in the figure below, the horizontal axis lists scene categories in descending order of their average memorability scores. The average score ranges from 0.79 down to 0.36, giving a sense of how memorability changes across scene categories. The distribution indicates that unusual classes like aurora tend to be more memorable, while common classes like mountain are more likely to be forgotten, possibly due to how frequently each category appears in daily life.

Comparison of average memorability score and standard deviation of each scene category
To examine how deep features influence the memorability of a natural scene, a fine-tuned MemNet is trained on the LNSIM database, using the Euclidean distance between the predicted and ground-truth memorability scores as the loss function. The output of the last hidden layer is extracted as the deep feature (dimension: 4096). To evaluate the correlation between the deep feature and natural scene memorability, an SVR predictor with a histogram intersection kernel is trained on the deep feature, as was done for the handcrafted features above. The SRCC of the deep feature is 0.44, exceeding all handcrafted features. This shows that DNNs indeed work well for predicting the memorability of a natural scene, as the deep feature yields rather high prediction accuracy. Nonetheless, the fine-tuned MemNet has its limitations, since a gap to human consistency (ρ=0.78) remains.

DeepNSM: DNN for natural scene memorability

The fine-tuned MemNet model serves as the baseline for predicting natural scene memorability. In the proposed DeepNSM architecture, the deep feature is concatenated with a category-related feature to accurately predict the memorability of natural scene images. Note that the "deep feature" refers to the 4096-dimensional feature extracted from the baseline model.

Figure 2: Architecture of DeepNSM model
The architecture of the DeepNSM model is presented in Figure 2. In DeepNSM, the aforementioned category-related feature is concatenated with the deep feature obtained from the baseline model. Based on this concatenated feature, additional fully connected layers (including one hidden layer with a dimension of 4096) predict the memorability scores of natural scene images. In training, the layers of the baseline and ResNet models are initialized from the individually pre-trained models, while the added fully connected layers are randomly initialized. The whole network is then jointly trained end-to-end, using the Adam optimizer with the Euclidean distance as the loss function.
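
Here is a minimal PyTorch sketch of the fusion head described above, assuming the 4096-dimensional deep feature and a category-related feature have already been extracted; the category feature dimension and learning rate are assumptions.

```python
import torch
import torch.nn as nn

class DeepNSMHead(nn.Module):
    """Fuses the 4096-d deep feature with a category-related feature and
    regresses a single memorability score."""
    def __init__(self, deep_dim=4096, category_dim=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(deep_dim + category_dim, 4096),  # hidden layer of dimension 4096
            nn.ReLU(inplace=True),
            nn.Linear(4096, 1),
        )

    def forward(self, deep_feat, category_feat):
        x = torch.cat([deep_feat, category_feat], dim=1)
        return self.fc(x).squeeze(1)

# Training uses the Euclidean (MSE) loss between predicted and ground-truth scores.
model = DeepNSMHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()
```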

Comparison with other models

The performance of the DeepNSM model in predicting natural scene memorability is measured using SRCC (ρ). The model is tested on both the test set of the LNSIM database and the NSIM database, and its SRCC performance is compared with state-of-the-art memorability prediction methods, including MemNet, MemoNet, and Lu et al. Among them, MemNet and MemoNet are recent DNN methods for generic images, which beat conventional techniques based on handcrafted features, while Lu et al. is a state-of-the-art method for predicting natural scene memorability.
Figure 3: The SRCC (ρ) performance of DeepNSM and the compared methods.

Figure 3 shows the SRCC performance of DeepNSM and the three compared methods. DeepNSM achieves strong SRCC performance, i.e., ρ=0.58 and 0.55 on the LNSIM and NSIM databases, respectively, significantly outperforming the state-of-the-art DNN methods MemNet and MemoNet. These results demonstrate the effectiveness of DeepNSM in predicting natural scene memorability.

Conclusion

The above approach investigated the memorability of natural scenes from a data-driven perspective. Specifically, it established the LNSIM database for analyzing human memorability of natural scenes. In exploring the correlation of memorability with low-, middle-, and high-level features, it is worth noting that the high-level scene category feature plays a vital role in predicting the memorability of a natural scene.

Temporal Relational Reasoning in Videos

25 September 2018
temporal relation network

The ability to reason about the relations between entities over time is crucial for intelligent decision-making. Temporal relational reasoning allows intelligent species to analyze the current situation relative to the past and formulate hypotheses on what may happen next. Figure 1 shows that given two observations of an event, people can easily recognize the temporal relation between two states of the visual world and deduce what has happened between the two frames of a video.

Figure 1

Humans can easily infer the temporal relations and transformations between these observations, but this task remains difficult for neural networks. Figure 1 shows:

  • a – Poking a stack of cans, so it collapses;
  • b – Stack something;
  • c – Tidying up a closet;
  • d – Thumb up.

Activity recognition in videos has been one of the core topics in computer vision. However, it remains difficult due to the ambiguity of describing activities at appropriate timescales.

Previous Work

With the rise of deep convolutional neural networks (CNNs), which achieve state-of-the-art performance on image recognition tasks, many works have looked into designing effective deep convolutional neural networks for activity recognition. For instance, various approaches for fusing RGB frames over the temporal dimension have been explored on the Sports-1M dataset.

Another technique uses two-stream CNNs, with one stream processing static images and the other processing optical flow, to fuse information about object appearance and short-term motion.

Yet another technique uses a CNN+LSTM model: a CNN extracts frame features, and an LSTM integrates the features over time to recognize activities in videos. For temporal reasoning, instead of designing the temporal structures manually, the work presented here uses a more generic structure to learn the temporal relations in end-to-end training.

A further approach uses a two-stream Siamese network to learn the transformation matrix between two frames, and then uses brute-force search to infer the action category.

State-of-the-art idea

The idea is to use Temporal Relation Networks (TRN), with a focus on modeling multi-scale temporal relations in videos. Time-contrastive networks have been used for self-supervised imitation learning of object manipulation from third-person video observation; this work instead aims to learn various temporal relations in videos in a supervised learning setting. The proposed TRN can also be extended to self-supervised learning for robot object manipulation.

The illustration of Temporal Relation Networks.
Figure 2: The illustration of Temporal Relation Networks.

TRN is simple and can easily be plugged into any existing convolutional neural network architecture to enable temporal relational reasoning. The pairwise temporal relation is defined as a composite function

T_2(V) = h_\phi\left( \sum_{i<j} g_\theta(f_i, f_j) \right),

where the input is a video V with n selected ordered frames, V = {f_1, f_2, …, f_n}, and f_i is a representation of the i-th frame of the video, e.g., the output activation of a standard CNN. The functions h_\phi and g_\theta fuse the features of the ordered frames and are implemented as multilayer perceptrons. The 2-frame composite function can be extended to higher-order relations, such as the 3-frame relation

T_3(V) = h'_\phi\left( \sum_{i<j<k} g'_\theta(f_i, f_j, f_k) \right),

where the sum is again over sets of frames i, j, k that have been uniformly sampled and sorted.
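
To make the composite function above concrete, here is a minimal PyTorch sketch of the 2-frame relation T_2, with g_θ and h_φ implemented as small MLPs over precomputed per-frame CNN features; the feature dimension, hidden size, and number of classes are illustrative assumptions, not the paper's exact settings.

```python
import itertools
import torch
import torch.nn as nn

class TemporalRelation2(nn.Module):
    """T2(V) = h_phi( sum_{i<j} g_theta(f_i, f_j) ) over ordered frame pairs."""
    def __init__(self, feat_dim=256, hidden=256, num_classes=174):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU())
        self.h = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, num_classes))

    def forward(self, frames):               # frames: (batch, n, feat_dim), temporally ordered
        n = frames.shape[1]
        pair_sum = 0
        for i, j in itertools.combinations(range(n), 2):
            pair_sum = pair_sum + self.g(torch.cat([frames[:, i], frames[:, j]], dim=1))
        return self.h(pair_sum)              # class logits

# Example: 4 videos, 8 sampled frames each, 256-d per-frame features.
logits = TemporalRelation2()(torch.randn(4, 8, 256))
```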

Experiments

Evaluation has been done on a variety of activity recognition tasks using TRN-equipped networks. For recognizing activities that depend on temporal relational reasoning, TRN-equipped networks outperform a baseline network without a TRN by a large margin. TRN-equipped networks also obtain competitive results on activity classification on the Something-Something dataset for human-object interaction recognition, the Charades dataset for daily activity recognition, and the Jester dataset for hand gesture recognition.

Statistics of the datasets used in evaluating the TRNs

The networks used for extracting image features play an important role in visual recognition tasks; features from deeper networks such as ResNet usually perform better. The goal here is to evaluate the effectiveness of the TRN module for temporal relational reasoning in videos. Thus, the base network is fixed throughout all the experiments, and the performance of the CNN model is compared with and without the proposed TRN modules.

Result

Something-Something is a recent video dataset for human-object interaction recognition. There are 174 classes, and some of the ambiguous activity categories are challenging, such as 'Tearing Something into two pieces' versus 'Tearing Something just a little bit', or 'Turn something upside down' versus 'Pretending to turn something upside down'. The results on the validation set and test set of the Something-V1 and Something-V2 datasets are listed in Figure 3.

Figure 3: Results on the validation set and test set (left); comparison of TRN and TSN as the number of frames increases (right)

TRN outperforms TSN by a large margin as the number of frames increases, showing the importance of temporal order. TRN-equipped networks were also evaluated on the Jester dataset, a video dataset for hand gesture recognition with 27 classes. The results on the validation set of the Jester dataset are shown in Figure 4.

Figure 4: Jester Dataset Results on (left) the validation set and (right) the test set
Prediction examples on a) Something-Something, b) Jester, and c) Charades. For each example drawn from Something-Something and Jester, the top two predictions are shown, with green text indicating a correct prediction and red indicating an incorrect one.

Comparison with other state-of-the-art

The approach was also compared with other state-of-the-art methods by evaluating the MultiScale TRN on the Charades dataset for daily activity recognition. The results are listed in Figure 5. The method outperforms various approaches such as two-stream networks and the recent Asynchronous Temporal Field (TempField) method.

Figure 5: Results on Charades activity classification

The TRN model is capable of correctly identifying actions for which the overall temporal ordering of frames is essential for a successful prediction. This performance shows the effectiveness of the TRN for temporal relational reasoning and its strong generalization ability across different datasets.

Conclusion

The proposed Temporal Relation Network is a simple and interpretable module that enables temporal relational reasoning in neural networks for videos. It was evaluated on several recent datasets, established competitive results using only discrete frames, and showed that the TRN module discovers visual common-sense knowledge in videos.

Temporal alignment of videos from the (a) Something-Something and (b)Jester datasets using the most representative frames as temporal anchor points.

Identity Verification with Deep Learning: ID-Selfie Matching Method

24 September 2018
ID selfie verification

A large number of daily activities in our lives require identity verification. Identity verification provides a security mechanism in everything from access control for computer systems to border crossings and bank transactions. However, in many of the activities that require identity verification, the process is done manually; it is often slow and requires human operators.

Examples of automatic ID document photo matching systems at international borders

An automated system for identity verification will significantly speed up the process and provide a seamless security check in all those activities where we need to verify our identity. One of the simplest ways to do this is to design a system that will match ID photos with selfie pictures.

Previous works

There have been both successful and unsuccessful attempts in the past to employ an automated system for identity verification. A successful example is Australia’s SmartGate. It is an automated self-service border control system operated by the Australian Border Force and located at immigration checkpoints in arrival halls in eight Australian international airports. It uses a camera to capture a verification picture and tries to match it to a person’s ID. Also, China has introduced such systems at train stations and airports.

While there have also been attempts to match ID documents and selfies using traditional computer vision techniques, the better-performing methods rely on deep learning. Zhu et al. proposed the first deep learning approach for document-to-selfie matching using Convolutional Neural Networks.

State-of-the-art idea

In their new paper, researchers from Michigan State University proposed an improved version of their DocFace – a deep learning approach for document-selfie matching.

They show that gradient-based optimization methods converge slowly when many classes have very few samples – like in the case of existing ID-selfie datasets. To overcome this shortcoming, they propose a method, called Dynamic Weight Imprinting (DWI). Additionally, they introduce a new recognition system for learning unified representations from ID-selfie pairs and an open-source face matcher called DocFace+, for ID-selfie matching.

Method

There are many problems and constraints in building an automated system for ID-selfie matching, and many of its challenges differ from those of general face recognition.

The two main challenges are the low quality of document (as well as selfie) photos due to compression, and the large time gap between the document issue date and the moment of verification.

The whole method is based on transfer learning. A base neural network model is trained on a large-scale face dataset (MS-Celeb 1M), and then features are transferred to the target domain of ID-selfie pairs.

Arguing that convergence is very slow and training often gets stuck in local minima when dealing with many classes that have very few samples, the researchers propose to use the Additive Margin Softmax (AM-Softmax) loss function together with a novel optimization method they call Dynamic Weight Imprinting (DWI).

Generalization performance of different loss functions.

Dynamic Weight Imprinting

Since Stochastic Gradient Descent updates the network with mini-batches, in a two-shot case (like that of ID-selfie matching) each weight vector receives signals only twice per epoch. These sparse attraction signals make little difference to the classifier weights. To overcome this problem, the researchers propose a new optimization method whose idea is to update the weights based on sample features, thereby avoiding underfitting of the classifier weights and accelerating convergence.

Compared with stochastic gradient descent and other gradient-based optimization methods, the proposed DWI updates the weights based only on genuine samples, and only for the classes that are present in the mini-batch. It also works well with very large datasets where the weight matrix of all classes is too large to be loaded and only a subset of the weights can be sampled for training.
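
A sketch of the weight-imprinting idea: only the classifier weights of classes present in the mini-batch are updated, each being pulled toward the normalized mean feature of its genuine samples. The exact update rule and normalization here are assumptions, not the DocFace+ implementation.

```python
import torch
import torch.nn.functional as F

def dynamic_weight_imprinting(weights, features, labels, alpha=1.0):
    """Update only the classifier weights of classes present in the mini-batch.

    weights:  (num_classes, d) classifier weight matrix
    features: (batch, d) embeddings of the mini-batch
    labels:   (batch,) class ids
    """
    features = F.normalize(features, dim=1)
    with torch.no_grad():
        for c in labels.unique():
            class_mean = features[labels == c].mean(dim=0)
            # Blend the old weight with the mean feature of the class's samples.
            new_w = (1 - alpha) * weights[c] + alpha * class_mean
            weights[c] = F.normalize(new_w, dim=0)
    return weights
```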

Comparison of AM-Softmax loss and the proposed DIAM loss.

The researchers trained the popular Face-ResNet architecture using stochastic gradient descent and the AM-Softmax loss. They then fine-tuned the model on the ID-selfie dataset by combining the proposed Dynamic Weight Imprinting optimization method with the Additive Margin Softmax loss. Finally, a pair of sibling networks sharing high-level parameters is trained to learn domain-specific features of IDs and selfies.

Workflow of the proposed method. A base model is trained on a large scale unconstrained face dataset. Then, the parameters are transferred to a pair of sibling networks, who have shared high-level modules.

Results

The proposed ID-selfie matching method achieves excellent results, obtaining a true accept rate (TAR) of 97.51 ± 0.40%. The authors report that their approach, using the MS-Celeb-1M dataset and the AM-Softmax loss function, achieves 99.67% accuracy on the standard verification protocol of LFW and a Verification Rate (VR) of 99.60% at a False Accept Rate (FAR) of 0.1% on the BLUFR protocol.

Examples of falsely classified images by our model on the Private ID-selfie dataset.
The mean performance of constraining different modules of the sibling networks to be shared
Comparing Static and Dynamic Weight Imprinting regarding TAR

Comparison with other state-of-the-art

The approach was compared with state-of-the-art general face matchers, since there are no existing public ID-selfie matching methods. The comparison is given in terms of TAR (true accept rate) at a given FAR (false accept rate) and is shown in the tables below.

The mean (and s.d. of) performance of different matches on the private ID-selfie dataset
Evaluation results were compared with other methods on Public-IvS dataset

Conclusion

The proposed DocFace+ method for ID-selfie matching shows the potential of transfer learning, especially in tasks where not enough data is available. The method achieves high accuracy in selfie-to-ID matching and has the potential to be employed in identity verification systems. Additionally, the novel Dynamic Weight Imprinting optimization method shows improved convergence and better generalization performance, and represents a notable contribution to the field of machine learning.

Deep Clustering Approach for Image Classification Task

20 September 2018
deepcluster facebook

Clustering of images seems to be a well-researched topic. But in fact, little work has been done to adapt it to the end-to-end training of visual features on large-scale datasets.

The existence and usefulness of ImageNet, a fully-supervised dataset, has contributed to pre-training of convolutional neural networks. However, ImageNet is not so large by today standards: it “only” contains a million images. Now we need to move to the next level and build a bigger and more diverse dataset, potentially consisting of billions of images.

No Supervision Required

Can you imagine the number of manual annotations required for this kind of dataset? It would be huge. Replacing labels with raw metadata is not a good solution either, as this leads to biases in the visual representations with unpredictable consequences.

So, it looks like we need methods that can be trained on internet-scale datasets with no supervision. That’s precisely what a Facebook AI Research team suggests. DeepCluster is a novel clustering approach for the large-scale end-to-end training of convolutional neural networks.

The authors of this method claim that the resulting model outperforms the current state of the art by a significant margin on all the standard benchmarks. But let’s first discover the previous works in this research area.

Previous Works

All the related work can be arranged into three groups:

  • Unsupervised learning of features: for example, Yang et al. iteratively learn convnet features and clusters with a recurrent framework, while Bojanowski and Joulin learn visual features on a large dataset with a loss that attempts to preserve the information flowing through the network.
  • Self-supervised learning: for instance, Doersch et al. use the prediction of the relative position of patches in an image as a pretext task, and Noroozi and Favaro train a network to spatially rearrange shuffled patches. These approaches are usually domain dependent.
  • Generative models: for example, Donahue et al. and Dumoulin et al. have shown that using a GAN with an encoder results in visual features that are quite competitive.

State-of-the-art idea

DeepCluster is a clustering method presented recently by a Facebook AI Research team. The method iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network. For simplicity, the researchers have focused their study on k-means, but other clustering approaches can also be used, like for instance, Power Iteration Clustering (PIC).

Images and their 3 nearest neighbors: query –> results from randomly initialized network –> results from the same network after training with DeepCluster PIC

Such an approach has a significant advantage over the self-supervised methods as it doesn’t require specific signals from the output or extended domain knowledge. As we will see later, DeepCluster achieves significantly higher performance than previously published unsupervised methods.

Let’s now have a closer look at the design of this model.

Method overview

The performance of random convolutional networks is intimately tied to their convolutional structure, which gives a strong prior on the input signal. The idea of DeepCluster is to exploit this weak signal to bootstrap the discriminative power of a convnet.

As illustrated below, the method alternates between clustering the deep features and using the cluster assignments as pseudo-labels to learn the parameters of the convnet.

DeepCluster
Illustration of the proposed method
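
The alternating procedure can be sketched in a few lines, with scikit-learn's k-means standing in for the clustering step and the classifier assumed to be a linear layer of size (feature dim, k); the PCA/whitening of features, empty-cluster reassignment, and uniform sampling tricks from the paper are omitted, and the loader is assumed to iterate in a fixed order.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deepcluster_epoch(convnet, classifier, loader, k=10000, lr=0.05):
    """One DeepCluster iteration: cluster deep features, then train on the pseudo-labels."""
    # 1) Compute features for the whole dataset and cluster them with k-means.
    convnet.eval()
    with torch.no_grad():
        feats = torch.cat([convnet(x) for x, _ in loader]).cpu().numpy()
    pseudo_labels = torch.as_tensor(KMeans(n_clusters=k).fit_predict(feats))

    # 2) Re-initialize the classifier and train on the cluster assignments.
    classifier.reset_parameters()
    params = list(convnet.parameters()) + list(classifier.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    convnet.train()
    for batch_idx, (x, _) in enumerate(loader):     # loader must iterate in a fixed order
        start = batch_idx * loader.batch_size
        y = pseudo_labels[start:start + x.size(0)]
        loss = criterion(classifier(convnet(x)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```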

This type of alternating procedure is prone to trivial solutions, which we’re going to discuss briefly right now:

  • Empty clusters. Automatically reassigning empty clusters solves this problem during the k-means optimization.
  • Trivial parametrization. If the vast majority of images is assigned to a few clusters, the parameters will exclusively discriminate between them. The solution to this issue lies in sampling images based on a uniform distribution over the classes, or pseudo-labels.

DeepCluster is based on a standard AlexNet architecture with five convolutional layers and three fully connected layers. To remove color and increase local contrast, the researchers apply a fixed linear transformation based on Sobel filters.
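
The fixed preprocessing can be sketched as a frozen convolution: grayscale conversion followed by horizontal and vertical Sobel filters; the exact scaling constants here are assumptions.

```python
import torch
import torch.nn as nn

class SobelFilter(nn.Module):
    """Fixed (non-trainable) transform: grayscale + 2-channel Sobel edges."""
    def __init__(self):
        super().__init__()
        self.gray = nn.Conv2d(3, 1, kernel_size=1, bias=False)
        self.gray.weight.data.fill_(1.0 / 3.0)                  # average the RGB channels
        self.sobel = nn.Conv2d(1, 2, kernel_size=3, padding=1, bias=False)
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.sobel.weight.data = torch.stack([gx, gx.t()]).unsqueeze(1)
        for p in self.parameters():
            p.requires_grad = False                             # keep the transform fixed

    def forward(self, x):                                       # x: (B, 3, H, W)
        return self.sobel(self.gray(x))                         # (B, 2, H, W)

edges = SobelFilter()(torch.randn(2, 3, 224, 224))
```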

So, the model doesn’t look complicated, but let’s check its performance on the ImageNet classification and transfer tasks.

Results

The results of preliminary studies are demonstrated below:

  • (a) the evolution of the Normalized Mutual Information (NMI) between the cluster assignments and the ImageNet labels during training;
  • (b) the development of the model’s stability along the training epochs;
  • (c) the impact of the number of clusters k on the quality of the model (k = 10,000 gives the best performance).
Preliminary studies

To assess the quality of a target filter, the researchers learn an input image that maximizes its activation. The figure below shows these synthetic filter visualizations and the top 9 activated images from a subset of 1 million images from YFCC100M.

Filter visualization and top 9 activated images for target filters in the layers conv1, conv3, and conv5 of an AlexNet trained with DeepCluster

Deeper layers in the network seem to capture larger textural structures. However, it looks like some filters in the last convolutional layers merely replicate the texture already captured in the previous layers.

The results below are for the last convolutional layer, this time using a VGG-16 architecture instead of AlexNet.

Filter visualization and top 9 activated images for target filters in the last convolutional layer of VGG-16 trained with DeepCluster

As you can see, the filters learned without any supervision can capture quite complex structures.

The next figure shows the top 9 activated images for some filters that appear to be semantically coherent. The filters in the top row respond to structures that are highly correlated with object classes, while the filters in the bottom row seem to trigger on style.

Top 9 activated images for target filters in the last convolutional layer

Comparison

To compare DeepCluster to other methods, the researchers train a linear classifier on top of different frozen convolutional layers. The table below reports the classification accuracy of different state-of-the-art approaches on the ImageNet and Places datasets.

On ImageNet, DeepCluster outperforms the previous state of the art on the conv2 to conv5 layers by 1-6%. The poor performance of the first layer is probably due to the Sobel filtering discarding color. Remarkably, the performance gap between DeepCluster and a supervised AlexNet is only around 4% at the conv2-conv3 layers, but it rises to 12.3% at conv5, showing where the AlexNet stores most of the class-level information.
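
The linear-probe protocol amounts to freezing the convolutional trunk and training only a linear classifier on top of it; here is a minimal sketch, with a stand-in trunk and random data in place of ImageNet:

```python
import torch
import torch.nn as nn

# Stand-in for the frozen convolutional trunk (e.g. AlexNet up to conv3 trained with DeepCluster).
features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4))
for p in features.parameters():
    p.requires_grad = False                  # keep the representation fixed

probe = nn.Linear(16 * 4 * 4, 1000)          # only the linear classifier is trained
opt = torch.optim.SGD(probe.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 3, 64, 64)                # stand-in batch of images
y = torch.randint(0, 1000, (8,))             # stand-in class labels
with torch.no_grad():
    feats = features(x).flatten(1)
opt.zero_grad()
loss = criterion(probe(feats), y)
loss.backward()
opt.step()
```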

Table 1. Linear classification on ImageNet and Places using activations from the convolutional layers of an AlexNet as features

The same experiment on the Places dataset reveals that DeepCluster yields conv3-conv4 features that are comparable to those trained with ImageNet labels. This implies that when the target task is sufficiently far from the domain covered by ImageNet, labels are less important.

The next table summarizes the comparisons of DeepCluster with other feature learning approaches on the three tasks: classification, detection, and semantic segmentation. As you can see, DeepCluster outperforms all previous unsupervised methods on all three tasks with the most substantial improvement in semantic segmentation.

Table 2: Comparison of the proposed approach to state-of-the-art unsupervised feature learning on classification, detection, and segmentation on Pascal VOC

Bottom Line

DeepCluster proposed by the Facebook AI Research team achieves performance that is significantly better than the previous state of the art on every standard transfer task. What is more, when tested on the Pascal VOC 2007 object detection task with fine-tuning, DeepCluster is only 1.4% below the supervised topline.

This approach makes few assumptions about the inputs and doesn't require much domain-specific knowledge, which makes it a good candidate for learning deep representations in specific domains where labeled datasets are not available.

Realistic Exemplar-Based Image Colorization

18 September 2018
colorization method

Realistic Exemplar-Based Image Colorization

Image colorization is a widespread problem within computer vision. The ultimate objective of image colorization is to map a gray-scale image to a visually plausible and perceptually meaningful color image.…

Image colorization is a widespread problem within computer vision. The ultimate objective of image colorization is to map a gray-scale image to a visually plausible and perceptually meaningful color image.

It is important to mention that image colorization is an ill-posed problem. The very goal of obtaining a "visually plausible and perceptually meaningful" colorization is conditioned by the multi-modal nature of the problem: various colorizations are possible for a single gray-scale image.

Realistic Exemplar-Based Image Colorization

Image decolorization (color-to-grayscale conversion) is a non-invertible image degradation, so multiple meaningful and plausible results are possible for a single input image. Moreover, what counts as a "visually plausible" result is subjective and differs considerably from person to person.

Previous works

However, researchers have previously proposed solutions to the ill-posed problem of realistic image colorization. As mentioned above, the result is highly subjective, and many of the proposed methods include human intervention to obtain better colorization results. In the past few years, several colorization approaches exploiting the power of deep learning have been proposed; here, the colorization is learned from large amounts of data, with the hope of improved generalization.

More recently, new approaches have examined the trade-off between the controllability offered by user interaction and the robustness gained from learning.

In a novel approach, researchers from Microsoft Research Asia introduce the first deep learning method for exemplar-based local colorization.

State-of-the-art idea

The idea of the proposed approach is to provide a reference color image in addition to the input gray-scale image. Given a reference image that is potentially semantically similar to the input, a convolutional neural network maps the gray-scale image to a colorized output in an end-to-end manner.

The proposed idea: given an input gray-scale image and a reference image, the method outputs a realistic colorized image

Method

Besides the proposed method, which is the first deep-learning-based exemplar colorization approach, the paper provides a few more contributions: a reference image retrieval algorithm for reference recommendation, with which fully automatic colorization can also be obtained; transferability of the method to unnatural images; and an extension to video colorization.

The main contribution is the colorization method itself, which is able to colorize an image according to a given, semantically "similar" reference image. This reference is either provided as a user's input or obtained with the image retrieval system given as the second contribution. That system tries to find an image semantically similar to the input so that local color patches can be reused, producing a more realistic and plausible colorization.

The colorization method employs two deep convolutional neural networks: a similarity sub-network and a colorization sub-network. The first is a pre-processing network that measures the semantic similarity between the reference and the target using a VGG-19 network pre-trained on a gray-scale image object recognition task. Its output feeds the colorization network and provides a robust and meaningful similarity metric.
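
A simplified stand-in for the similarity measurement: compare VGG-19 feature maps of the target and reference with per-location cosine similarity. Note that the paper's similarity sub-network uses a VGG-19 trained on gray-scale recognition and computes bidirectional similarity maps; the standard torchvision weights are used here purely for illustration:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Truncated VGG-19 feature extractor (frozen, evaluation mode).
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features[:21].eval()

def feature_similarity(target, reference):
    with torch.no_grad():
        f_t = vgg(target)        # (1, C, H, W) feature maps
        f_r = vgg(reference)
    # Per-location cosine similarity across the channel dimension -> (1, H, W) map.
    return F.cosine_similarity(f_t, f_r, dim=1)

sim_map = feature_similarity(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```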

The two branches in the colorization network

The second network, the colorization sub-network, is an end-to-end convolutional neural network that learns selection, propagation, and prediction of colors simultaneously. It takes as input the output of the pre-processing done by the similarity sub-network as well as the input gray-scale image. More precisely, its input consists of the target gray-scale image, the aligned reference image, and bidirectional similarity maps between the input and the reference image.

To both exploit true reference colors where the reference is reliable and fall back to a natural colorization when no reliable reference color is available, the authors propose a branch learning scheme for the colorization network. This multi-task network involves two branches, a Chrominance branch and a Perceptual branch. The same network is trained with different inputs and a different loss function depending on the branch.

In the Chrominance branch, the network learns to selectively propagate the correct reference colors, depending on how well the target and the reference are matched. While this branch enforces chrominance consistency, the other branch, through a "perceptual loss," enforces a close match between the result and the ground-truth color image in terms of high-level feature representations. Both branches are shown in the picture below.

The proposed deep learning architecture for image colorization
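
The two branches can be thought of as two loss functions applied to the same network; the sketch below uses a smooth-L1 chrominance term and a feature-matching perceptual term as illustrative stand-ins for the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def chrominance_loss(pred_ab, ref_ab):
    """Chrominance branch (sketch): penalize deviation of the predicted chrominance
    from the (aligned) reference chrominance with a robust smooth-L1 loss."""
    return F.smooth_l1_loss(pred_ab, ref_ab)

def perceptual_loss(pred_rgb, target_rgb, feature_net):
    """Perceptual branch (sketch): match high-level features of the prediction and
    the ground-truth color image; `feature_net` is any frozen feature extractor."""
    return F.mse_loss(feature_net(pred_rgb), feature_net(target_rgb))
```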

Comparison with other state-of-the-art

The proposed method gives more than satisfactory results in realistic image colorization. The evaluation is divided into a few groups: comparison with exemplar-based methods, comparison with learning-based methods, and comparison with interactive methods.

To compare against exemplar-based methods, the authors collected around 35 image pairs from the papers of the comparison methods and compared the results quantitatively and qualitatively. They show that this method outperforms other existing exemplar-based methods, yielding better visual results. They argue that the success comes from the sophisticated mechanism of color sample selection and propagation that is jointly learned from data rather than hand-crafted through heuristics.

Comparison with other exemplar-based methods

By feeding the colorized results into a VGG-19 or VGG-16 network pre-trained on an image recognition task, the authors measure how natural the generated images are and compare against other learning-based methods. Under this evaluation, they show that their method outperforms the existing ones, although some of the competing methods achieve a higher PSNR than the proposed approach.
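
For reference, PSNR between a colorized result and the ground-truth color image can be computed as follows (a standard definition, not tied to this paper's evaluation code):

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

score = psnr(torch.rand(3, 256, 256), torch.rand(3, 256, 256))  # higher is better
```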

Comparison with other learning-based methods
Evaluation of learning-based methods
The assessment based on user’s colorization preference. Comparison against the ground truth that represents the user’s colorization choice

Conclusion

The color reference recommendation pipeline

In this novel colorization approach, the authors once again show the power of deep learning to solve an important, albeit ill-posed, problem. They leverage the flexibility and capacity of deep convolutional neural networks to provide a robust and controllable image colorization method. They also offer a complete system for automatic image colorization based on the most similar reference image, which extends to unnatural images as well as videos. In conclusion, a deployable system is proposed, incorporating an innovative deep-learning-based method for colorizing gray-scale images in a robust and realistic way.