“What Is Going On?” Neural Network by Facebook Detects and Recognises Human-Object Interactions

11 July 2018

Deep learning has played an important role in recent years in improving the visual recognition of individual instances, e.g. object detection and pose estimation. However, recognizing individual objects is only a first step for machines to comprehend the visual world. To understand what is happening in images, it is also necessary to identify relationships between individual instances.

From a practical perspective, photos containing people contribute a considerable portion of daily uploads to the internet and social networking sites, and thus human-centric understanding has significant demand in practice. The fine granularity of human actions and their interactions with a wide array of object types presents a new challenge compared to recognition of entry-level object categories.

The idea is to present a human-centric model for recognizing human-object interactions. The central observation is that a person’s appearance, which reveals their action and pose, is highly informative for inferring where the target object of the interaction may be located. The search space for the target object can thus be narrowed by conditioning on this estimate. Although many objects are often detected, the inferred target location helps the model quickly pick the correct object associated with a specific action.

The Faster R-CNN framework is used to implement the human-centric recognition branch. Specifically, on a region of interest (RoI) associated with a person, this branch performs action classification and density estimation for the action’s target object location. For each action type, the density estimator predicts a 4-d Gaussian distribution that models the likely position of the target object relative to the person. The prediction is based purely on the human appearance. This human-centric recognition branch, together with a standard object detection branch and a simple pairwise interaction branch, forms a multitask learning system that can be jointly optimized.

Model Architecture

The model consists of:

  • an object detection branch;
  • a human-centric branch;
  • an optional interaction branch.

The person features and their layers are shared between the human-centric and interaction branches (blue boxes).

       object detection branch

The goal is to detect and recognize triplets of the form (human, verb, object). The solution is to extend Fast R-CNN with an additional human-centric branch that classifies actions and estimates a probability density over the target object location for each action. The human-centric branch reuses features extracted by Fast R-CNN for object detection, so its marginal computation is lightweight. Specifically, given a set of candidate boxes, Fast R-CNN outputs a set of object boxes and a class label for each box. The model is extended by assigning a triplet score to each pair of candidate human/object boxes b(h), b(o) and an action a. The triplet score is decomposed into four terms.

Fast R-CNN

While the model has multiple components, the basic idea is straightforward. s(h) and s(o) are the Fast R-CNN class scores for b(h) containing a human and b(o) containing an object.

Object Detection

The object detection branch of the network, shown in Figure 1, is identical to that of Faster R-CNN.

Action Classification

The first role of the human-centric branch is to assign an action classification score s(a, h) to each human box b(h) and action a. The training objective is to minimize the binary cross entropy losses between the ground-truth action labels and the scores s(a, h) predicted by the model.
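
As a minimal sketch of this objective, the snippet below computes the binary cross-entropy between multi-label ground-truth action labels and predicted scores s(a, h) for one human box. The function name, the averaging over actions and the example numbers are illustrative assumptions, not details taken from the paper.

```python
import math

def binary_cross_entropy(labels, scores, eps=1e-7):
    """Mean binary cross-entropy over actions for a single human box.

    labels: 0/1 ground-truth flags, one per action (a person may perform several).
    scores: predicted s(a, h) in (0, 1), e.g. sigmoid outputs.
    Averaging over actions is an assumption made for this sketch.
    """
    total = 0.0
    for y, s in zip(labels, scores):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(s) + (1 - y) * math.log(1.0 - s))
    return total / len(labels)

# Hypothetical example: three actions, the person performs the first and third.
loss_good = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
loss_bad = binary_cross_entropy([1, 0, 1], [0.2, 0.7, 0.3])
```

Scores that agree with the labels give the lower loss, which is what training pushes towards.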

Target Localization

The second role of the human-centric branch is to predict the target object location based on a person’s appearance (again represented as features pooled from b(h)). The approach predicts a density over possible locations and uses this output together with the locations of actually detected objects to precisely localize the target. The density over the target object’s location is modeled as a Gaussian function whose mean is predicted from the human appearance and the action being performed. Formally, the human-centric branch predicts µ(a,h), the target object’s 4-d mean location given the human box b(h) and action a. The resulting target localization term can be used to test the compatibility of an object box b(o) with the predicted target location µ(a,h).
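
The compatibility test and the four-term triplet score can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the relative box encoding, the value of sigma and all function names are hypothetical choices, and the triplet score is taken to be the plain product of the four terms.

```python
import math

def relative_box(b_h, b_o):
    """Encode object box b_o relative to human box b_h as a 4-d vector.

    Boxes are (cx, cy, w, h). The encoding (offsets scaled by the human box
    size, log size ratios) is an assumption borrowed from box regression.
    """
    return (
        (b_o[0] - b_h[0]) / b_h[2],
        (b_o[1] - b_h[1]) / b_h[3],
        math.log(b_o[2] / b_h[2]),
        math.log(b_o[3] / b_h[3]),
    )

def target_compatibility(b_h, b_o, mu, sigma=0.3):
    """Gaussian compatibility between box b_o and the predicted target location mu.

    sigma is a hyperparameter; 0.3 is an assumed value for this sketch.
    """
    diff = [r - m for r, m in zip(relative_box(b_h, b_o), mu)]
    sq_dist = sum(d * d for d in diff)
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def triplet_score(s_h, s_o, s_ah, g_aho):
    """Product of human score, object score, action score and compatibility."""
    return s_h * s_o * s_ah * g_aho

# Hypothetical boxes: one person and two candidate objects.
human = (100.0, 100.0, 50.0, 120.0)
near_obj = (130.0, 110.0, 40.0, 40.0)
far_obj = (400.0, 300.0, 40.0, 40.0)
mu = relative_box(human, near_obj)  # pretend the branch predicted this location exactly

g_near = target_compatibility(human, near_obj, mu)
g_far = target_compatibility(human, far_obj, mu)
```

With equal detection and action scores, the object that sits where the person's appearance suggests wins the triplet score, which is how the predicted location picks one object out of many detections.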

Interaction Recognition

The human-centric model scores actions based on the human appearance. While effective, this does not take into account the appearance of the target object. To improve the discriminative power of the model, and to demonstrate the flexibility of the framework, s(a, h) is replaced with an interaction branch that scores an action based on the appearance of both the human and the target object.

Estimating target object density from the person features

The model is first trained on the COCO set (excluding the V-COCO val images). This model, which is in essence Faster R-CNN, has 33.8 object detection AP on the COCO val set. InteractNet, which is also evaluated on the HICO-DET dataset, has an AP(role) of 40.0 on all action classes of the V-COCO test set. This is an absolute gain of 8.2 points over the strong baseline’s 31.8, a relative improvement of 26%. The results are shown in the table below.




Results on some V-COCO test images

The research addresses the task of detecting human-object interactions. The proposed approach correctly detects triplets of one person taking multiple actions on multiple objects. Moreover, InteractNet can detect multiple interaction instances in an image. The figures below show two test images with all detected triplets.

All detected triplets on two V-COCO test images
An individual person can take multiple actions and affect multiple objects

Muneeb Ul Hassan

Human Rights Violations Recognition in Images Using CNN Features

18 May 2018

Human rights violations have occurred throughout human history, and today they increasingly appear in many different forms around the world. Human rights violations are actions executed by state or non-state actors that breach any part of those rights that protect individuals and groups from behaviors that interfere with fundamental freedoms and human dignity.

Photos and videos have become an essential source of information for human rights investigations, including Commissions of Inquiry and Fact-finding Missions. Investigators often receive digital images directly from witnesses, providing high-quality corroboration of their testimonies. In most instances, investigators receive images from third parties (e.g. journalists or NGOs), but their provenance and authenticity are unknown.

A third source of digital images is social media, e.g. images uploaded to Facebook, again with uncertainty regarding authenticity or source. The sheer volume of images means that manually sifting through the photos to verify whether any abuse is taking place, and then acting on it, would be tedious and time-consuming work for humans. For this reason, a software tool aimed at identifying potential abuses of human rights, capable of going through images quickly to narrow down the field, would greatly assist human rights investigators. The major contributions of this work are as follows:

  1. A new dataset of human rights abuses, containing approximately 3k images for 8 violation categories.
  2. Assess the representation capability of deep object-centric CNNs and scene-centric CNNs for recognizing human rights abuses.
  3. Attempt to enhance human rights violations recognition by combining object-centric and scene-centric CNN features over different fusion mechanisms.
  4. Evaluate the effects of different feature fusion mechanisms for human rights violations recognition.

Human Rights Violations Database

Many organizations concerned with human rights advocacy use digital images as a tool for exposing human rights and international humanitarian law violations that would otherwise be impossible to expose. To advance the automated recognition of human rights violations, a well-sampled image database is required.

The Human Rights Archive (HRA) database is a repository of approximately 3k photographs of various human rights violations captured in real-world situations and surroundings. The images are labeled with eight semantic categories, comprising the types of human rights abuses encountered in the world, plus a supplementary ‘no violation’ class.

Human rights violations recognition is closely related to, but radically different from, object and scene recognition. For this reason, following a conventional image collection procedure is not appropriate for collecting images with respect to human rights violations. The first issue encountered is that the query terms for describing different categories of human rights violations must be provided by experts in the field of human rights.

To create the dataset, non-governmental organizations (NGOs) and their public repositories are considered.

The first NGO considered is Human Rights Watch, which offers an online media platform exposing human rights and international humanitarian law violations through various media types such as videos, photo essays, satellite imagery and audio clips. Its online repository contains nine main topics in the context of human rights violations (arms, business, children’s rights, disabilities, health and human rights, international justice, LGBT, refugee rights and women’s rights) and 49 subcategories. One considerable drawback is the presence of a watermark in most of the video files available from that platform. As a result, all recorded images that originally contained the watermark had to be cropped in a suitable way.

Only colour images of 600×900 pixels or larger were retained after the cropping stage. In addition, all photo essays available for each topic and its subcategories were added, contributing 342 more images to the final set. The entire pipeline used for collecting and filtering the pictures from Human Rights Watch is depicted in Figure 1.

The second NGO investigated is the United Nations which presents an online collection of images in the context of human rights. Their website is equipped with a search mechanism capable of returning relevant images for simple and complex query terms.

Figure 1: Pipeline used for collecting and filtering the pictures from Human Rights Watch

The final dataset contains 2847 images across 8 human rights violation categories. 367 ready-made images were downloaded from the two online repositories, representing 12.89% of the entire dataset, while the remaining 2480 images were recorded from videos on the Human Rights Watch media platform. The eight categories are as follows:

  1. Arms
  2. Child Labour
  3. Child Marriage
  4. Detention Centres
  5. Disability Rights
  6. Displaced populations
  7. Environment
  8. Out of School
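
The composition figures quoted before the category list can be checked with a few lines of arithmetic. This is only a sanity check; the variable names are illustrative.

```python
total_images = 2847
downloaded = 367  # ready-made images from the two online repositories

# Images recorded from Human Rights Watch videos.
from_video = total_images - downloaded

# Share of the dataset made up of ready-made images, in percent.
share_downloaded = round(100 * downloaded / total_images, 2)
```

Most of the dataset therefore comes from video frames, which is worth keeping in mind because frames from the same video tend to be visually similar.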

How It Works

Given the impressive classification performance of deep convolutional neural networks, three modern object-centric CNN architectures (ResNet50, VGG16 and VGG19) are fine-tuned on HRA to create baseline CNN models.

Transfer learning injects knowledge from other tasks by deploying weights and parameters from a pre-trained network in the new one, and it has become a commonly used method for learning task-specific features.

Considering the size of the dataset, the chosen method to apply a deep CNN is to reduce the number of free parameters. To achieve this, the first filter stages can be trained in advance on different tasks of object or scene recognition and held fixed during training on human rights violations recognition. By freezing (preventing the weights from getting updated during training) the earlier layers, overfitting can be avoided.
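
The effect of freezing can be illustrated without any deep learning framework. The sketch below applies one gradient step to a toy set of scalar parameters and skips every parameter marked as frozen; the parameter names and the `sgd_step` helper are hypothetical and stand in for whatever mechanism the chosen framework offers.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """One gradient step that leaves every parameter named in `frozen` untouched."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }

# Hypothetical two-stage model: pre-trained early filters and a new classifier head.
params = {"conv_early": 0.5, "classifier": -0.2}
grads = {"conv_early": 1.0, "classifier": 2.0}

updated = sgd_step(params, grads, frozen={"conv_early"})
```

Only the classifier moves, so the number of free parameters fitted to the small dataset is limited to the unfrozen layers.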

Feature extraction modules have been initialized using models pre-trained on two different large-scale datasets, ImageNet and Places. ImageNet is an object-centric dataset that contains images of generic objects, including people, and is therefore a good option for understanding the contents of the image region comprising the target person. In contrast, Places is a scene-centric dataset specifically created for high-level visual understanding tasks such as recognizing scene categories.

Figure 2: Network architecture used for high-level feature extraction with HRA

Hence, pre-training the image feature extraction model on Places provides global (high-level) contextual support. For the target task (human rights violation recognition), the network outputs scores for the eight target categories of the HRA dataset, or ‘no violation’ if none of the categories is present in the image.


The classification results for top-1 accuracy and coverage are listed below. A more natural performance metric in this situation is coverage, the fraction of examples for which the system can produce a response. For all the experiments in the paper, a threshold of 0.85 over the prediction confidence is employed to report the coverage metric.
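
A minimal sketch of the coverage metric with the 0.85 threshold is shown below. It assumes that the system "produces a response" when the top prediction confidence is at least the threshold, and that accuracy is then measured on the answered examples only; both choices, as well as the function name and sample data, are assumptions made for illustration.

```python
def coverage_and_accuracy(confidences, predictions, labels, threshold=0.85):
    """Coverage = fraction of examples answered; accuracy is measured on those only."""
    answered = [i for i, c in enumerate(confidences) if c >= threshold]
    coverage = len(answered) / len(confidences)
    if not answered:
        return coverage, None
    correct = sum(predictions[i] == labels[i] for i in answered)
    return coverage, correct / len(answered)

# Hypothetical predictions for five images.
conf = [0.95, 0.60, 0.90, 0.85, 0.40]
pred = ["arms", "environment", "child labour", "arms", "no violation"]
true = ["arms", "arms", "child labour", "environment", "no violation"]

cov, acc = coverage_and_accuracy(conf, pred, true)
```

Here three of the five images clear the threshold, so coverage is 0.6, and two of those three are classified correctly.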

Performance metric

Figure 3 shows the responses to examples predicted by the best-performing HRA-CNN, VGG19. Broadly, one type of misclassification can be identified given the current label attribution of HRA: images depicting evidence of a particular situation rather than the actual action, such as schools being targeted by armed attacks. Future development of the HRA database will explore assigning multiple ground-truth labels or free-form sentences to images to better capture the richness of visual descriptions of human rights violations.

Figure 3: Predictions by the best-performing HRA-CNN, VGG19

This technique addresses the problem of recognizing abuses of human rights given a single image. The HRA dataset is built from images captured in non-controlled environments, containing activities which reveal a human right being violated without any other prior knowledge. Using this dataset and a two-phase deep transfer learning scheme, state-of-the-art deep learning models are presented for the problem of visual human rights violations recognition. A technology capable of identifying potential human rights abuses in the same way as humans do has many potential applications in human-assistive technologies and would significantly support human rights investigators.

Muneeb Ul Hassan

Image Inpainting for Irregular Holes Using Partial Convolutions

8 May 2018

Deep learning is one of the fastest-growing areas of artificial intelligence. It has been used extensively in many fields, including real-time object detection, image recognition and video classification. Deep learning is usually implemented as a Convolutional Neural Network, Deep Belief Network, Recurrent Neural Network, etc. One image-related problem is image inpainting, the task of filling the holes in an image. The goal of this work is to propose a model for image inpainting that operates robustly on irregular hole patterns and produces semantically meaningful predictions that blend smoothly with the rest of the image without any additional post-processing or blending operation. It has many applications, e.g. in image editing it can remove unwanted image content while filling the resulting hole with plausible content.

Many approaches to image inpainting exist, but they have limitations. One method that does not use deep learning, PatchMatch, iteratively searches for the best-fitting patches to fill in the holes. While this approach generally produces smooth results, it is limited by the available image statistics and has no concept of visual semantics. Another limitation of many recent approaches is the focus on rectangular holes, often assumed to be at the center of the image. These limitations may lead to overfitting to rectangular holes and ultimately limit the utility of these models in applications.

How Does It Work?

To overcome the limitations of previous approaches, Nvidia Research uses partial convolutions to solve the image inpainting problem. A Partial Convolution Layer comprises a masked and re-normalized convolution operation followed by a mask-update step. The main extension is the automatic mask update step, which removes any masking where the partial convolution was able to operate on an unmasked value. The following contributions have been made:

  • The use of partial convolutions with an automatic mask update step for achieving state-of-the-art on image inpainting.
  • Substituting convolutional layers with partial convolutions and mask updates can achieve state-of-the-art inpainting results.
  • Demonstrate the efficacy of training image-inpainting models on irregularly shaped holes.

Partial Convolutional Layer

The model uses stacked partial convolution operations and mask updating steps to perform image inpainting. The partial convolution operation and the mask update function are jointly referred to as the Partial Convolutional Layer.

Let W be the weights of the convolution filter and b the corresponding bias. X is the feature values (pixel values) for the current convolution (sliding) window and M is the corresponding binary mask. The partial convolution at every location is expressed as:

$$x' = \begin{cases} W^T (X \odot M) \dfrac{1}{\mathrm{sum}(M)} + b, & \text{if } \mathrm{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}$$

where ⊙ denotes element-wise multiplication. As can be seen, output values depend only on the unmasked inputs. The scaling factor 1/sum(M) applies appropriate scaling to adjust for the varying amount of valid (unmasked) inputs. After each partial convolution operation, the mask is updated. The unmasking rule is simple: if the convolution was able to condition its output on at least one valid input value, then the mask is removed for that location. This is expressed as:

$$m' = \begin{cases} 1, & \text{if } \mathrm{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}$$

and can easily be implemented in any deep learning framework as part of the forward pass.
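
To make the forward pass concrete, here is a minimal NumPy sketch of a partial convolution with mask update. It is deliberately simplified and is not the authors' code: it assumes a single channel, stride 1, no padding and a scalar bias, and it uses explicit loops for readability instead of a framework convolution.

```python
import numpy as np

def partial_conv2d(x, mask, weight, bias=0.0):
    """Single-channel partial convolution (stride 1, no padding) with mask update.

    x, mask: 2-D arrays of equal shape; mask is 1 for valid pixels, 0 for holes.
    weight: 2-D kernel. Returns (output, updated_mask).
    """
    kh, kw = weight.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    new_mask = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            m = mask[i:i + kh, j:j + kw]
            valid = m.sum()
            if valid > 0:
                window = x[i:i + kh, j:j + kw]
                # Masked convolution, rescaled by 1/sum(M) as described in the text.
                out[i, j] = (weight * window * m).sum() / valid + bias
                new_mask[i, j] = 1.0  # at least one valid input, so unmask
    return out, new_mask

x = np.arange(25, dtype=float).reshape(5, 5)
mask = np.ones((5, 5))
mask[0:3, 0:3] = 0.0  # a 3x3 hole in the top-left corner
weight = np.ones((3, 3))

out, new_mask = partial_conv2d(x, mask, weight)
```

In this example only the output location whose window lies entirely inside the hole stays masked; every other location sees at least one valid pixel and is unmasked, which is how holes shrink layer by layer.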

Network Architecture

The partial convolution layer is implemented by extending standard PyTorch. The straightforward implementation is to define binary masks of size C×H×W, the same size as their associated images/features, and to implement mask updating with a fixed convolution layer that has the same kernel size as the partial convolution operation but with weights identically set to 1 and bias set to 0. Inference of the entire network on a 512×512 image takes 0.23s on a single NVIDIA V100 GPU, regardless of the hole size.

The architecture is UNet-like, replacing all convolutional layers with partial convolutional layers and using nearest neighbor up-sampling in the decoding stage.

Figure 1: The architecture used for image inpainting, where all the convolutional layers are replaced with partial convolutional layers

ReLU is used in the encoding stage and LeakyReLU with alpha = 0.2 is used between all decoding layers. The encoder comprises eight partial convolutional layers with stride=2. The kernel sizes are 7, 5, 5, 3, 3, 3, 3 and 3. The channel sizes are 64, 128, 256, 512, 512, 512, 512, and 512. The last partial convolution layer’s input will contain the concatenation of the original input image with hole and original mask.
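
The encoder description can be turned into a quick shape calculation. The sketch below assumes a 512×512 input and a padding of kernel // 2 at every layer; the padding is an assumption, since it is not stated above, and the helper name is illustrative.

```python
def conv_output_size(size, kernel, stride=2):
    """Spatial size after one convolution, assuming padding of kernel // 2."""
    padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1

kernel_sizes = [7, 5, 5, 3, 3, 3, 3, 3]
channels = [64, 128, 256, 512, 512, 512, 512, 512]

size = 512
encoder_shapes = []
for kernel, ch in zip(kernel_sizes, channels):
    size = conv_output_size(size, kernel)
    encoder_shapes.append((ch, size, size))
```

Under these assumptions each layer halves the resolution, so the eight encoder layers reduce a 512×512 image to a 2×2 feature map with 512 channels.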

Loss Function

The loss functions target both per-pixel reconstruction accuracy as well as composition, i.e. how smoothly the predicted hole values transition into their surrounding context. Given an input image with hole Iin and mask M, the network prediction Iout and ground truth image Igt, then the pixel loss is defined as:

Loss Function
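
As a hedged illustration of the per-pixel terms, the sketch below computes separate L1 losses on hole and valid pixels and builds a composited image that keeps the prediction only inside the hole. It assumes the mask convention of the partial convolution (1 for valid pixels, 0 for holes), uses a plain sum without any normalisation, and the function names are hypothetical.

```python
import numpy as np

def pixel_losses(i_out, i_gt, mask):
    """L1 losses on hole pixels and on valid pixels (mask: 1 = valid, 0 = hole).

    A summed absolute error is used here; any normalisation is left out.
    """
    diff = np.abs(i_out - i_gt)
    loss_hole = ((1.0 - mask) * diff).sum()
    loss_valid = (mask * diff).sum()
    return loss_hole, loss_valid

def composite(i_out, i_gt, mask):
    """Keep the prediction inside the hole and the ground truth elsewhere."""
    return mask * i_gt + (1.0 - mask) * i_out

i_gt = np.array([[1.0, 2.0], [3.0, 4.0]])
i_out = np.array([[1.5, 2.0], [3.0, 3.0]])
mask = np.array([[0.0, 1.0], [1.0, 1.0]])  # the top-left pixel is the hole

loss_hole, loss_valid = pixel_losses(i_out, i_gt, mask)
i_comp = composite(i_out, i_gt, mask)
```

Keeping the two terms separate makes it possible to weight errors inside the hole differently from errors on pixels the network could simply copy.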

Perceptual loss functions measure high-level perceptual and semantic differences between images. They make use of a loss network pretrained for image classification, so these loss functions are themselves deep convolutional neural networks. The perceptual loss is defined as:

Loss Function

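The perceptual loss computes the L1 distances between both Iout and Icomp and the ground truth, where Icomp is the network output with the non-hole pixels set to the ground truth. A style loss term, which performs an autocorrelation on each feature map, is also introduced.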

Loss Function

The total loss is the combination of all the above losses:

Total Loss


Partial convolution outperforms other methods. To show this, the L1 error, peak signal-to-noise ratio (PSNR), Structural SIMilarity (SSIM) index and Inception score (IScore) are used as evaluation metrics. The table below shows the comparison results. The PConv method outperforms all the other methods on these measurements on irregular masks.

measurements on irregular masks

Using a partial convolution layer with an automatic mask updating mechanism achieves state-of-the-art image inpainting results. The model can robustly handle holes of any shape, size, location, or distance from the image borders. Further, the performance does not deteriorate catastrophically as holes increase in size, as seen in Figure 2.

Figure 2: Top row: input; bottom row: corresponding inpainted results
Figure 3: Comparison between typical convolution layer based results (Conv) and partial convolution layer based results (PConv)

More results of partial convolution (PConv) approach:

PConv PConv

Muneeb ul Hassan

Materials for Masses: SVBRDF Acquisition with a Single Mobile Phone Image

3 May 2018

A wide variety of images around us are the outcome of interactions between lighting, shapes and materials. In recent years, the advent of convolutional neural networks (CNNs) has led to significant advances in recovering shape using just a single image. Material estimation, by contrast, has not seen as much progress, which might be attributed to multiple causes. First, material properties can be more complex. Even discounting more complex global illumination effects, materials are represented by a spatially-varying bidirectional reflectance distribution function (SVBRDF), which is an unknown high-dimensional function that depends on incident lighting directions. Second, pixel observations in a single image contain entangled information from factors such as shape and lighting, besides material, which makes estimation ill-posed.

Researchers from Adobe developed a state-of-the-art technique to recover the SVBRDF from a single image of a near-planar surface, acquired using the camera of a mobile phone. This is in contrast to conventional BRDF capture setups, which usually require significant equipment and expense. The convolutional neural network is specifically designed to account for the physical form of BRDFs and the interaction of light with materials.

How It Works

The authors propose a novel architecture that encodes the input image into a latent representation, which is decoded into components corresponding to surface normal, diffuse texture and specular roughness. The experiments demonstrate advantages over several baselines and prior works in quantitative comparisons, while also achieving superior qualitative results. The generalization ability of this network, trained on the synthetic BRDF dataset, is demonstrated by strong performance on real images acquired in the wild, in both indoor and outdoor environments, using multiple different phone cameras. Given the estimated BRDF parameters, the authors also demonstrate applications such as material editing and relighting of novel shapes. To summarise, the authors propose the following contributions:

  • A novel lightweight SVBRDF acquisition method that produces state-of-the-art reconstruction quality.
  • A CNN architecture that exploits domain knowledge for joint SVBRDF reconstruction and material classification.
  • Novel DCRF-based post-processing that accounts for the microfacet BRDF model to refine network outputs.
  • An SVBRDF dataset that is large-scale and specifically attuned to the estimation of spatially-varying materials.

Figure 2: Distribution of materials in the training and test sets

Setup: The goal is to reconstruct the spatially-varying BRDF of a near-planar surface from a single image captured by a mobile phone with the flash turned on for illumination. The authors assume that the z-axis of the camera is approximately perpendicular to the planar surface (the experiments explicitly evaluate this assumption). For most mobile devices, the flashlight is very close to the camera, which provides a univariate sampling of an isotropic BRDF. The surface appearance is represented by a microfacet parametric BRDF model. Let di, ni, ri be the diffuse colour, normal and roughness, respectively, at pixel i. The BRDF model is defined as:

BRDF model

where vi and li are the view and light directions and hi is the half-angle vector. Given an observed image I(di, ni, ri, L), captured under unknown illumination L, the authors wish to recover the parameters di, ni and ri for each pixel in the image.

Dataset: The dataset used is the Adobe Stock 3D Material dataset, which contains 688 materials with high-resolution (4096 x 4096) spatially-varying BRDFs. The authors use 588 materials for training and 100 materials for testing. For data augmentation, they randomly crop 12, 8, 4, 2 and 1 image patches of size 512, 1024, 2048, 3072 and 4096, respectively. The distribution is shown in Figure 2.

Network Design

The basic network architecture consists of a single encoder and three decoders which reconstruct the three spatially-varying BRDF parameters: diffuse colour di, normal ni and roughness ri. The intuition behind using a single encoder is that different BRDF parameters are correlated, so representations learned for one should be useful for inferring the others, which allows a significant reduction in the size of the network. The input to the network is an RGB image, augmented with the pixel coordinates as a fourth channel. The authors add the pixel coordinates because the distribution of light intensities is closely related to the location of pixels; for instance, the centre of the image will usually be much brighter. Since CNNs are spatially invariant, the network needs this extra signal to learn to behave differently for pixels at different locations. Skip links are added to connect the encoder and decoders to preserve details of the BRDF parameters. The encoder network has seven convolutional layers of stride 2, so that the receptive field of every output pixel covers the entire image.

For each BRDF parameter, the authors use an L2 loss for direct supervision. For each batch, they create novel lights by randomly sampling the point light source on the upper hemisphere. This ensures that the network does not overfit to collocated illumination and is able to reproduce appearance under other light conditions. The final loss function for the encoder-decoder part of the network is:

encoder-decoder part



Here the four terms are the L2 losses for diffuse, normal, roughness and rendered image predictions, respectively. The highest-level features extracted by the encoder are sent to a classifier to predict the material type. The BRDF parameters are then evaluated for each material type, and the classification results (the output of the softmax layer) are used as weights. This averages the predictions from different material types to obtain the final BRDF reconstruction results. The classifier is trained together with the encoder and decoder from scratch, with the weight of each label set to be inversely proportional to the number of examples in Figure 2 to balance different material types in the loss function. The overall loss function of the network with the classifier is:

network with the classifier
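
The classifier-weighted averaging and the inverse-frequency label weights can be sketched in NumPy as follows. The shapes, function names and example numbers are hypothetical, and normalising the label weights to sum to 1 is an assumption made for this sketch.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def blend_by_material(class_logits, per_type_predictions):
    """Average per-material-type predictions using classifier probabilities as weights.

    class_logits: shape (K,), one logit per material type.
    per_type_predictions: shape (K, ...), one BRDF parameter map per material type.
    """
    weights = softmax(class_logits)
    return np.tensordot(weights, per_type_predictions, axes=1)

def class_weights(counts):
    """Loss weights inversely proportional to class frequency, normalised to sum to 1."""
    inv = 1.0 / np.asarray(counts, dtype=float)
    return inv / inv.sum()

# Hypothetical: 3 material types, a 2x2 roughness map predicted for each type.
logits = np.array([2.0, 0.0, 0.0])
rough = np.stack([np.full((2, 2), 0.1), np.full((2, 2), 0.5), np.full((2, 2), 0.9)])
blended = blend_by_material(logits, rough)
w = class_weights([100, 50, 25])
```

Because the blend is a soft average, a confident but wrong classification pulls the result towards the wrong material type, which matches the kind of failure case described for misclassified materials.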


Acquisition setup: To verify the generalizability of the method to real data, the authors show results on real images captured with different mobile devices in both indoor and outdoor environments. They capture linear RAW images (with potentially clipped highlights) with the flash enabled, using the Adobe Lightroom Mobile app. The mobile phones were hand-held, and the optical axis of the camera was only approximately perpendicular to the surfaces (see Figure 4).

Figure 4: Acquisition setup

Qualitative results with different mobile phones: Figure 6 presents SVBRDF and normal estimation results for real images captured with three different mobile devices: Huawei P9, Google Tango and iPhone 6s. Even with a single image, the network successfully predicts the SVBRDF and normals, and images rendered using the predicted parameters appear very similar to the input. Also, the exact same network generalizes well to different mobile devices, which shows that the data augmentation successfully helps the network factor out variations across devices. For some materials with specular highlights, the network can hallucinate information lost due to saturation. The network can also reconstruct reasonable normals even for complex instances.

Figure 5: A failure case, due to incorrect material classification into metal, which causes the specularity to be over-smoothed
Figure 6: BRDF reconstruction results on real data. The authors used different mobile devices to capture raw images using Adobe Lightroom. The input images were captured using Huawei P9 (first three rows), Google Tango (fourth row) and iPhone 6s (fifth row), all with a handheld mobile phone where the z-axis of the camera was only approximately perpendicular to the sample surface.

Muneeb ul Hassan

“What I See” vs “What My Camera Sees”: How DNN Can Bring Details Back to Overexposed and Underexposed Images

28 April 2018

Human eyes are better than any known camera, so far. When we look at scenes where there’s a huge gap between the bright and dark tones (for example sunrise and sunset), we can see details everywhere. However, cameras struggle in these situations, and what we get as a result is an image with the shadows crushed down to black or the highlights blown out to white. This is a long-standing problem, and many approaches and techniques have been proposed to correct (adjust) an image to obtain a visually pleasing one.

what i see
A meme illustrating a typical situation: struggling to capture a beautiful view with a camera.

Researchers from the University of Hong Kong and the Dalian University of Technology have proposed a new method for transforming an image to a visually pleasing one using deep neural networks.

The proposed technique corrects under- and overexposed images and restores many details. It takes a standard LDR (low dynamic range) image and produces an enhanced image, again in the LDR domain but visually enriched with the recovered details. The deep neural network can bring back the details because they exist in the HDR (high dynamic range) domain but have been diminished in the LDR domain, and this is actually where the magic comes from.
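To see why detail is diminished in the LDR domain, here is a minimal sketch of a toy camera model. It assumes a simple clip to [0, 1], a display gamma of 2.2 and 8-bit quantization; the function name `to_ldr` and all constants are illustrative assumptions and are not taken from the paper.

```python
def to_ldr(radiance, exposure=1.0, gamma=2.2):
    """Toy camera model (assumed): expose, clip to [0, 1], gamma-encode, quantize to 8 bits."""
    exposed = radiance * exposure
    clipped = min(1.0, max(0.0, exposed))  # anything brighter than 1.0 is lost here
    encoded = clipped ** (1.0 / gamma)
    return round(255 * encoded)

# Two clearly different scene radiances in a bright sky...
bright_a, bright_b = 1.5, 3.0
# ...collapse to the same saturated pixel value in the LDR image.
ldr_a, ldr_b = to_ldr(bright_a), to_ldr(bright_b)

# A mid-tone survives the conversion with a non-saturated value.
mid = to_ldr(0.5)
```

In this toy model both bright values end up saturated, so the difference between them can only be recovered by estimating what the HDR scene looked like.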

How Does It Work?

The new method is called Deep Reciprocating HDR Transformation, and as the name suggests it works by applying a reciprocal transformation utilizing two deep neural networks. In fact, the idea is simple: taking an LDR image, we reconstruct the details in the HDR domain and map the image back to the LDR domain (enriched with details). Although it sounds super simple, there are a few tricks that have to be performed to make all this work, and I explain them below.

To perform this reciprocal transformation, two convolutional neural networks (CNNs) are used. The first one, called the HDR estimation network, takes the input image, encodes it into a latent representation (of lower dimension, of course) and then decodes this representation to reconstruct an HDR image. The second one, called the LDR correction network, performs the reciprocal transformation: it takes the estimated HDR image (from the first network) and outputs the corrected LDR image. Both networks are simple auto-encoders that encode the data into a latent representation of size 512.

The two networks are trained jointly and have the same architecture. However, as I mentioned before, there are some tricks: the optimization and the cost function are defined specifically to address the problem at hand.

underexposed image in dark scene
Visual comparison of the new method (DRHT) on an underexposed image of a dark scene.

The HDR Estimation Network

The first network is trained to predict the HDR data. It uses ELU activations and batch normalisation, and its loss function is a simple mean squared error (MSE). And here comes the first trick: the MSE is computed between the network output and the ground-truth HDR image after that ground truth has been modified with the constants used to convert HDR data to LDR.
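The first trick can be sketched as follows. This is a minimal illustration that assumes the HDR-to-LDR conversion has the common scale-and-gamma form alpha * S^gamma; the function names and the values of `alpha` and `gamma` are placeholders chosen for easy arithmetic, not the constants used by the authors.

```python
def hdr_to_ldr_domain(hdr_pixels, alpha, gamma):
    """Assumed conversion: scale and gamma-compress ground-truth HDR values."""
    return [alpha * (s ** gamma) for s in hdr_pixels]

def mse(prediction, target):
    """Plain mean squared error over two equally long lists of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(prediction)

def hdr_estimation_loss(prediction, hdr_ground_truth, alpha=0.5, gamma=0.5):
    """MSE between the network output and the *converted* HDR ground truth."""
    target = hdr_to_ldr_domain(hdr_ground_truth, alpha, gamma)
    return mse(prediction, target)

hdr_ground_truth = [4.0, 16.0]   # converted target is [1.0, 2.0] with the placeholders
perfect_loss = hdr_estimation_loss([1.0, 2.0], hdr_ground_truth)
worse_loss = hdr_estimation_loss([1.0, 4.0], hdr_ground_truth)
```

Because the target is compressed before the error is measured, very bright HDR values do not dominate the loss the way they would with an MSE on raw HDR data.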

The LDR Correction Network

The second network takes the output of the first network and produces the corrected LDR image. The second trick comes in this part: the output of the first network (the HDR estimation) is modified before it is fed into the second network. In fact, the output of the first network is still in the LDR domain (a consequence of the first trick), so it is first converted to the HDR domain via gamma correction, and a logarithmic operation is then applied.
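The second trick might look like the sketch below. It assumes the first network was trained against targets of the form alpha * S^gamma, so the conversion back to HDR is the inverse of that mapping, followed by a logarithm with a small offset `delta` to keep the argument positive. The names and the values of `alpha`, `gamma` and `delta` are illustrative assumptions, not the authors' settings.

```python
import math

def ldr_domain_to_hdr(value, alpha, gamma):
    """Invert the assumed mapping alpha * s**gamma (an inverse gamma correction)."""
    return (value / alpha) ** (1.0 / gamma)

def log_compress(hdr_value, delta):
    """Compress the large HDR range; delta (assumed) avoids log(0)."""
    return math.log(hdr_value + delta)

def prepare_for_correction(prediction, alpha=0.5, gamma=0.5, delta=1.0):
    """Turn the first network's output into the input of the second network."""
    return [log_compress(ldr_domain_to_hdr(v, alpha, gamma), delta) for v in prediction]

first_network_output = [0.0, 1.0, 2.0]
second_network_input = prepare_for_correction(first_network_output)
```

With these placeholder constants the values 1.0 and 2.0 map back to HDR values 4.0 and 16.0 before the logarithm is taken.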

The two auto-encoder networks share the same architecture and employ skip connections. Here is the complete architecture of the network.

HDR Estimation Network and LDR Correction Network
The two autoencoder networks: HDR Estimation Network (left) and LDR Correction Network (right)

Each network is composed of five convolutional and five deconvolutional layers:

Conv1 Layer: 9 x 9 kernel size, 64 feature maps
Conv2 Layer: 5 x 5 kernel size, 64 feature maps
Conv3 Layer: 3 x 3 kernel size, 128 feature maps
Conv4 Layer: 3 x 3 kernel size, 256 feature maps
Conv5 Layer: 3 x 3 kernel size, 256 feature maps
Latent representation: 1 x 512
Deconv1 Layer: 3 x 3 kernel size, 256 feature maps
Deconv2 Layer: 3 x 3 kernel size, 256 feature maps
Deconv3 Layer: 3 x 3 kernel size, 128 feature maps
Deconv4 Layer: 5 x 5 kernel size, 64 feature maps
Deconv5 Layer: 9 x 9 kernel size, 64 feature maps
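The layer list above can be written down as data and checked. The sketch below confirms that the decoder mirrors the encoder, and estimates the number of weights per layer under loudly stated assumptions: a 3-channel RGB input, one bias per feature map, a plain layer-to-layer chain, and no extra channels from the skip connections or the latent layer. The helper names are hypothetical and the counts are a rough guide only.

```python
# (name, kernel size, output feature maps), copied from the list above
ENCODER = [("conv1", 9, 64), ("conv2", 5, 64), ("conv3", 3, 128),
           ("conv4", 3, 256), ("conv5", 3, 256)]
DECODER = [("deconv1", 3, 256), ("deconv2", 3, 256), ("deconv3", 3, 128),
           ("deconv4", 5, 64), ("deconv5", 9, 64)]

def is_mirrored(encoder, decoder):
    """True if the decoder repeats the encoder's (kernel, maps) pairs in reverse."""
    enc = [(k, c) for _, k, c in encoder]
    dec = [(k, c) for _, k, c in decoder]
    return enc == dec[::-1]

def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of one k x k (de)convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def count_params(layers, in_channels=3):
    """Per-layer parameter counts for a plain chain of layers (assumption)."""
    counts = {}
    channels = in_channels
    for name, kernel, out_channels in layers:
        counts[name] = conv_params(kernel, channels, out_channels)
        channels = out_channels
    return counts

counts = count_params(ENCODER + DECODER)
total = sum(counts.values())
mirrored = is_mirrored(ENCODER, DECODER)
```

Under these assumptions most of the weights sit in the 256-map layers around the bottleneck, while the large 9 x 9 kernels at both ends are comparatively cheap because they touch few channels.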

Training was done with the Adam optimizer, using an initial learning rate of 1e-2.
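For readers unfamiliar with Adam, here is a minimal sketch of a single update for one scalar parameter. Only the learning rate of 1e-2 comes from the article; `beta1`, `beta2` and `eps` are the usual defaults from the Adam paper and are assumed here, since the article does not state them.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from theta = 1.0 with gradient 2.0.
theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.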


Two datasets are used for training and testing the method: the city scene panorama dataset and the Sun360 outdoor panorama dataset. Moreover, the authors mention that they used Photoshop to generate ground-truth LDR images with human supervision for the intermediate task. The training set used to train both networks contains around 40,000 image triplets (original LDR image, ground-truth HDR image and ground-truth LDR image).


As the authors state, the proposed method performs favourably against state-of-the-art methods. The evaluation was done by comparing the method to 5 state-of-the-art methods: Cape, WVM, SMF, L0S, and DJF.

The evaluation is not straightforward, since the goal is to create visually pleasing images, a quality that is both difficult to quantify and subjective. The authors therefore use a range of evaluation metrics. They use HDR-VDP-2 to evaluate the first network, as it reflects human perception, and PSNR, SSIM and FSIM to assess the whole method and compare it with existing methods.
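Of the metrics mentioned, PSNR is the simplest to compute by hand. The sketch below implements the standard definition, 10 * log10(MAX^2 / MSE), on flat lists of pixel values scaled to [0, 1]; the function name and the toy pixel values are illustrative and unrelated to the paper's data.

```python
import math

def psnr(reference, image, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(image):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, image)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

ground_truth = [0.0, 0.0]
rough = psnr(ground_truth, [0.1, 0.1])      # every pixel off by 0.1
closer = psnr(ground_truth, [0.01, 0.01])   # every pixel off by 0.01
```

A higher PSNR means the corrected image is numerically closer to the ground truth, although it does not always agree with perceived quality, which is one reason structural metrics such as SSIM and FSIM are reported as well.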

Quantitative evaluation
Table 1: Quantitative evaluation of the proposed method for image correction and other existing methods
Quantitative evaluation of the HDR prediction method
Table 2: Quantitative evaluation of the HDR prediction method (the first network) with other methods.

A visual evaluation is also provided, in which the results of the proposed method can be seen next to those of other existing methods.

(1) Comparison of the proposed method (DRHT) with existing methods and LDR ground truth.
(2) Comparison of the proposed method (DRHT) with existing methods and LDR ground truth.

In conclusion, the proposed method shows that details buried in under- and overexposed images can be recovered, and that deep learning has proved successful in yet another critical and non-trivial task.

Dan Mitriev