Automatic Creation of Personalized GIFs: a New State-of-the-Art Approach

27 July 2018


Suppose you have a 10-minute video but are interested only in a small portion of it. You are thinking about creating a 5-second GIF out of this video, but video editing can be quite a cumbersome task. Would it be possible to create such a GIF automatically? Would the algorithm be able to detect the moments you want to highlight? In this article, we are going to talk about a new approach to this task. It takes into account the history of GIFs you previously created and suggests an option that is likely to highlight the moments you are interested in.

Figure 1. Taking into account users' previously selected highlights when creating GIFs

Typically, highlight detection models are trained to identify cues that make visual content appealing or interesting to most people. However, the “interestingness” of a video segment or image is in fact subjective. As a result, such highlight models often provide results that are of limited relevance for the individual user. Another approach suggests training one model per user, but this turns out to be inefficient and, in addition, requires large amounts of personal information, which is typically not available. So…

What is suggested?

Ana Garcia del Molino and Michael Gygli suggested a new global ranking model that can condition on a particular user’s interests. Rather than training one model per user, their model is personalized via its inputs, which allows it to effectively adapt its predictions given only a few user-specific examples. It builds on the success of deep ranking models for highlight detection but adds the crucial enhancement of making highlight detection personalized.

Put simply, the researchers suggest using information about the GIFs a user previously created, since this represents their interests and thus provides a strong signal for personalization. Knowing that a specific user is interested in basketball, for example, is not sufficient. One user may edit basketball videos to extract the slams, another may just be interested in the team’s mascot jumping, and a third may prefer to see the kiss cam segments of the game.

Figure 2. Some examples of user histories from the paper

To obtain data about the GIFs previously created by different users, the researchers collected a novel, large-scale dataset of users and the GIFs they created. Moreover, they made this dataset publicly available. It consists of 13,822 users with 222,015 annotations on 119,938 videos.

Model Architecture

The model that is suggested predicts the score of a segment based on both the segment itself and the user’s previously selected highlights. The method uses a ranking approach, where a model is trained to score positive video segments higher than negative segments from the same video. In contrast to previous works, however, the predictions are not based on the segment solely, but also take a user’s previously chosen highlights, their history, into account.
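To make the ranking idea concrete, here is a minimal sketch of a pairwise ranking objective. The article does not state the exact loss, so the hinge form and the `margin` value are assumptions chosen for illustration, and the scores are made-up numbers.

```python
def pairwise_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Mean hinge loss over all (positive, negative) segment pairs of one video.

    The loss is zero for a pair when the positive segment outscores the
    negative one by at least `margin` (an assumed hyperparameter).
    """
    losses = [
        max(0.0, margin - pos + neg)
        for pos in pos_scores
        for neg in neg_scores
    ]
    return sum(losses) / len(losses)


# Hypothetical scores a model assigned to segments of a single video.
highlight_scores = [2.0, 1.5]  # segments the user turned into GIFs
other_scores = [0.2, 1.0]      # remaining segments of the same video
loss = pairwise_ranking_loss(highlight_scores, other_scores)
```

Only the pair (1.5, 1.0) violates the margin here, so it is the only pair that contributes to the loss.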

Figure 3. Model architecture. Proposed model (in bold) and alternative ways to encode the history and fuse predictions

In fact, the researchers propose two models, which are combined with late fusion. One takes the segment representation and the aggregated history as input (PHD-CA), while the second directly uses the distances between the segments and the history (SVM-D). For the model with the aggregated history, they suggest using a feed-forward neural network (FNN). It is a fairly small network with 2 hidden layers of 512 and 64 neurons. For the distance-based model, they create a feature vector that contains the cosine distances to a fixed number of the most similar history elements. As the two models differ in the range of their predictions and in their performance, a weight is applied when ensembling them.
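The inputs of the two models and their fusion can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the mean is just one possible way to aggregate the history, and the names `k` and `weight` are hypothetical parameters.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def aggregate_history(history):
    """Element-wise mean of the history descriptors (assumed aggregation)."""
    n = len(history)
    return [sum(column) / n for column in zip(*history)]


def distance_features(segment, history, k):
    """Cosine distances to the k most similar history elements, closest first."""
    distances = sorted(cosine_distance(segment, h) for h in history)
    return distances[:k]


def late_fusion(score_nn, score_svm, weight):
    """Weighted sum of the two model scores; `weight` rescales the second one."""
    return score_nn + weight * score_svm


segment = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = distance_features(segment, history, k=2)
mean_history = aggregate_history(history)
fused = late_fusion(0.5, 0.2, weight=0.5)
```

In a full pipeline, the segment descriptor together with `mean_history` would feed the neural network, while `features` would feed the ranking SVM.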

Performance of the proposed model in comparison to other methods
The performance of the suggested model was compared against several strong baselines:

  • Video2GIF. This approach is the state-of-the-art for automatic highlight detection for GIF creation. The comparison was carried out for both the original pre-trained model and a model with slight variations trained on the new dataset, which is referred to as Video2GIF (ours).
  • Highlight SVM. This model is a ranking SVM trained to correctly rank positive and negative segments, but only using the segment’s descriptor and ignoring the user history.
  • Maximal similarity. This baseline scores segments according to their maximum similarity with the elements in the user history. Cosine similarity was used as a similarity measure.
  • Video-MMR. Within this model, the segments that are most similar to the user history are scored highly. Specifically, the mean cosine similarity to the history elements is used as an estimate of the relevance of a segment.
  • Residual Model. Here the researchers decided to adopt an idea from another study, where a generic regression model was used together with a user-specific model that personalizes predictions by fitting the residual error of the generic model. So, in order to adapt this idea to the ranking setting, they proposed training a user-specific ranking SVM that gets the generic predictions from Video2GIF (ours) as an input, in addition to the segment representation.
  • Ranking SVM on the distances (SVM-D). This one corresponds to the second part of the proposed model (distance-based model).
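Two of the history-only baselines in the list above differ only in how they pool the similarities to the user history. A minimal sketch, with toy 2-d descriptors standing in for real segment features:

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_similarity_score(segment, history):
    """Maximal similarity baseline: best match against any history element."""
    return max(cosine_similarity(segment, h) for h in history)


def mean_similarity_score(segment, history):
    """Video-MMR style relevance: mean similarity to all history elements."""
    similarities = [cosine_similarity(segment, h) for h in history]
    return sum(similarities) / len(similarities)


segment = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0]]
max_score = max_similarity_score(segment, history)
mean_score = mean_similarity_score(segment, history)
```

The maximum rewards a segment that closely matches a single past GIF, whereas the mean favors segments that resemble the history as a whole.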
The following metrics were used for quantitative comparison: mAP (mean average precision), nMSD (normalized Meaningful Summary Duration), and Recall@5 (the ratio of frames from the user-generated GIFs, i.e. the ground truth, that are included in the 5 highest-ranked GIFs).
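As an illustration of the Recall@5 definition, the sketch below counts how many ground-truth frames fall inside the top-ranked segments. Representing segments as `(start_frame, end_frame)` pairs with an exclusive end is an assumption made for this example.

```python
def recall_at_k(ranked_segments, ground_truth_frames, k=5):
    """Fraction of ground-truth frames covered by the k highest-ranked segments.

    ranked_segments: (start_frame, end_frame) pairs sorted by descending score,
    with end_frame exclusive.
    """
    covered = set()
    for start, end in ranked_segments[:k]:
        covered.update(range(start, end))
    truth = set(ground_truth_frames)
    return len(covered & truth) / len(truth)


# Hypothetical ranking of six 10-frame segments, best first.
ranking = [(0, 10), (50, 60), (20, 30), (90, 100), (70, 80), (30, 40)]
user_gif_frames = range(25, 45)  # frames 25..44 of the user's GIF
recall = recall_at_k(ranking, user_gif_frames)
```

Here only frames 25 to 29 are covered by the top five segments, because the segment (30, 40) is ranked sixth.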
Here are the results:

Table 1. Comparison of the suggested approach (denoted as Ours) to the state-of-the-art alternatives for videos segmented into 5-second long shots. For mAP and R@5, the higher the score, the better the method. For nMSD, smaller is better. Best result per category in bold.

Table 2. Comparison of different ways to represent and aggregate the history, as well as ways to use the distances to the history to improve the prediction.

As you can see, the proposed method outperforms all baselines by a significant margin. Adding information about the user history to the highlight detection model (Ours (CA + SVM-D)) leads to a relative improvement over generic highlight detection (Video2GIF (ours)) of 5.2% (+0.8%) in mAP, 4.3% (-1.8%) in nMSD and 8% (+2.3%) in Recall@5.

Figure 4. Qualitative comparison to the state-of-the-art method (Video2GIF). Correct results have green borders. (c) shows a failure case where the user’s history misleads the model

Let’s sum up

A novel model for personalized highlight detection was introduced. The distinctive feature of this model is that its predictions are conditioned on a specific user by providing their previously chosen highlight segments as inputs to the model. The experiments demonstrated that users often have high consistency in the content they select, which allows the proposed model to outperform other state-of-the-art methods. In particular, the suggested approach outperforms generic highlight detection by 8% in Recall@5. This is a considerable improvement in this challenging high-level task.

Finally, a new large-scale dataset with personalized highlight information was introduced, which can be of particular use for further studies in this area.

“What Is Going On?” Neural Network by Facebook Detects and Recognises Human-Object Interactions

11 July 2018


Deep learning has played an important role in recent years in improving the visual recognition of individual instances, e.g., object detection and pose estimation. However, recognizing individual objects is just a first step for machines to comprehend the visual world. To understand what is happening in images, it is also necessary to identify relationships between individual instances.

From a practical perspective, photos containing people contribute a considerable portion of daily uploads to the internet and social networking sites, and thus human-centric understanding has significant demand in practice. The fine granularity of human actions and their interactions with a wide array of object types presents a new challenge compared to recognition of entry-level object categories.

The idea is to present a human-centric model for recognizing human-object interaction. The central observation is that a person’s appearance, which reveals their action and pose, is highly informative for inferring where the target object of the interaction may be located. The search space for the target object can thus be narrowed by conditioning on this estimation. Although there are often many objects detected, the inferred target location can help the model quickly pick the correct object associated with a specific action.

The Faster R-CNN framework is used to model the human-centric recognition branch. Specifically, on a region of interest (RoI) associated with a person, this branch performs action classification and density estimation for the action’s target object location. The density estimator predicts a 4-d Gaussian distribution, for each action type, that models the likely position of the target object relative to the person. The prediction is based purely on the human appearance. This human-centric recognition branch, along with a standard object detection branch and a simple pairwise interaction branch, forms a multitask learning system that can be jointly optimized.

Model Architecture

The model consists of:

  • an object detection branch;
  • a human-centric branch;
  • an optional interaction branch.

The person features and their layers are shared between the human-centric and interaction branches (blue boxes).


The goal is to detect and recognize triplets of the form (human, verb, object). The solution is to extend Fast R-CNN with an additional human-centric branch that classifies actions and estimates a probability density over the target object location for each action. The human-centric branch reuses features extracted by Fast R-CNN for object detection, so its marginal computation is lightweight. Specifically, given a set of candidate boxes, Fast R-CNN outputs a set of object boxes and a class label for each box. The model is extended by assigning a triplet score to pairs of candidate human/object boxes b(h), b(o) and an action a. The triplet score is decomposed into four terms.


While the model has multiple components, the basic idea is straightforward. s(h) and s(o) are the Fast R-CNN class scores for b(h) containing a human and b(o) containing an object.

Object Detection

The object detection branch of the network, shown in Figure 1, is identical to that of Faster R-CNN.

Action Classification

The first role of the human-centric branch is to assign an action classification score s(a, h) to each human box b(h) and action a. The training objective is to minimize the binary cross entropy losses between the ground-truth action labels and the scores s(a, h) predicted by the model.
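A minimal sketch of such a binary cross-entropy objective is shown below. It treats every action as an independent yes/no decision for one human box; the example scores and labels are made up, and the clipping constant `eps` is an implementation detail added here for numerical safety.

```python
import math


def binary_cross_entropy(scores, labels, eps=1e-12):
    """Mean binary cross entropy between predicted scores in (0, 1) and 0/1 labels."""
    total = 0.0
    for score, label in zip(scores, labels):
        score = min(max(score, eps), 1.0 - eps)  # avoid log(0)
        total -= label * math.log(score) + (1 - label) * math.log(1.0 - score)
    return total / len(scores)


# Hypothetical scores s(a, h) of one human box for three actions.
action_scores = [0.9, 0.2, 0.6]
action_labels = [1, 0, 0]  # only the first action is actually performed
loss = binary_cross_entropy(action_scores, action_labels)
```

Because each action gets its own binary term, one person can be assigned several actions at the same time.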

Target Localization

The second role of the human-centric branch is to predict the target object location based on a person’s appearance (again represented as features pooled from b(h)). This approach predicts a density over possible locations and uses this output together with the locations of the actually detected objects to precisely localize the target. The density over the target object’s location is modeled as a Gaussian function whose mean is predicted based on the human appearance and the action being performed. Formally, the human-centric branch predicts µ(a,h), the target object’s 4-d mean location given the human box b(h) and action a. The resulting target localization term can be used to test the compatibility of an object box b(o) and the predicted target location µ(a,h).
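The following sketch shows one way such a compatibility test and the final triplet score could be computed. Several details are assumptions made for illustration: the relative box encoding, the hypothetical spread parameter `sigma`, and combining the four terms by multiplication.

```python
import math


def relative_box(human, obj):
    """Encode the object box relative to the human box as a 4-d vector.

    Boxes are (x_center, y_center, width, height); this encoding is assumed.
    """
    hx, hy, hw, hh = human
    ox, oy, ow, oh = obj
    return [(ox - hx) / hw, (oy - hy) / hh, math.log(ow / hw), math.log(oh / hh)]


def target_compatibility(human, obj, mu, sigma):
    """Gaussian-shaped score: 1.0 when the object sits exactly at the predicted mean."""
    encoded = relative_box(human, obj)
    squared_distance = sum((e - m) ** 2 for e, m in zip(encoded, mu))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def triplet_score(s_h, s_o, s_ah, g):
    """Combine the four terms multiplicatively (assumed combination rule)."""
    return s_h * s_o * s_ah * g


human_box = (100.0, 100.0, 50.0, 100.0)
cup_box = (150.0, 100.0, 25.0, 50.0)
far_box = (300.0, 100.0, 25.0, 50.0)
mu = relative_box(human_box, cup_box)  # pretend the branch predicted this mean

g_cup = target_compatibility(human_box, cup_box, mu, sigma=0.3)
g_far = target_compatibility(human_box, far_box, mu, sigma=0.3)
score = triplet_score(0.9, 0.8, 0.5, g_cup)
```

An object far from the predicted location gets a compatibility close to zero, which suppresses its triplet score even if the detector is confident about the object itself.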

Interaction Recognition

The human-centric model scores actions based on the human appearance. While effective, this does not take into account the appearance of the target object. To improve the discriminative power of the model, and to demonstrate the flexibility of the framework, s(a, h) is replaced with an interaction branch that scores an action based on the appearance of both the human and the target object.

Estimating target object density from the person features

The model is first trained on the COCO set (excluding the V-COCO val images). This model, which is in essence Faster R-CNN, has 33.8 object detection AP on the COCO val set. InteractNet has an AP(role) of 40.0 evaluated on all action classes on the V-COCO test set; it was also evaluated on the HICO-DET dataset. This is an absolute gain of 8.2 points over the strong baseline’s 31.8, which is a relative improvement of 26%. The result is shown in the table below.




Results on some V-COCO test images

The research addresses the task of human-object interaction detection. The proposed approach correctly detects triplets in which one person takes multiple actions on multiple objects. Moreover, InteractNet can detect multiple interaction instances in an image. The figures below show two test images with all detected triplets.

All detected triplets on two V-COCO test images
An individual person can take multiple actions and affect multiple objects


Muneeb Ul Hassan

“What I See” vs “What My Camera Sees”: How DNN Can Bring Details Back to Overexposed and Underexposed Images

28 April 2018

Human eyes are better than any known camera, so far. When we look at scenes where there’s a huge gap between the bright and dark tones (for example sunrise and sunset), we can see details everywhere. However, cameras struggle in these situations and what we get as a result is an image with the shadows crunched down to black or highlights blown out to white. Many similar problems have arisen in the past, and many approaches and techniques have been proposed to correct (adjust) an image to obtain a visually pleasing one.

A meme illustrating a typical situation where we struggle to capture a beautiful view with a camera.

Researchers from the University of Hong Kong and the Dalian University of Technology have proposed a new method for transforming an image to a visually pleasing one using deep neural networks.

The proposed technique makes it possible to correct an under- or overexposed image and recover many details. It takes a standard LDR (low dynamic range) image and produces an enhanced image, again in the LDR domain but visually enriched with the recovered details. The deep neural network can bring back the details because they exist in the HDR (high dynamic range) domain but are diminished in the LDR domain, and this is actually where the magic comes from.

How Does It Work?

The new method is called Deep Reciprocating HDR Transformation, and as the name suggests it works by applying a reciprocal transformation utilizing two deep neural networks. In fact, the idea is simple: taking an LDR image, we reconstruct the details in the HDR domain and map the image back to the LDR domain (enriched with details). Although it sounds super simple, there are a few tricks that have to be performed to make all this work, and I explain them below.

To perform this reciprocal transformation, two convolutional neural networks (CNNs) are used. The first one, called the HDR estimation network, takes the input image, encodes it into a lower-dimensional latent representation, and then decodes this representation to reconstruct an HDR image. The second one, called the LDR correction network, performs the reciprocal transformation: it takes the estimated HDR image (from the first network) and outputs the corrected LDR image. Both networks are simple auto-encoders that encode the data into a latent representation of size 512.

The two networks are trained jointly and have the same architecture. However, as I mentioned before, there are some tricks: the optimization and the cost function are explicitly defined to address the problem at hand.

Visual comparison of the new method (DRHT) on an underexposed image of a dark scene.

The HDR Estimation Network

The first network is trained to predict the HDR data. It uses ELU activations and batch normalisation, and its loss function is a simple mean squared error (MSE). And here comes the first trick: the MSE loss is computed as the difference between the network output and the ground truth after the ground truth has been modified with the constants used to convert HDR data to LDR.
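To illustrate the first trick, here is a rough sketch of an MSE loss whose target is the HDR ground truth mapped into the LDR domain. The article does not give the conversion, so the form `alpha * (x + eps) ** gamma` and all constant values are assumptions used only to show the idea.

```python
def hdr_to_ldr_domain(hdr_pixels, alpha, gamma, eps):
    """Map HDR values into the LDR domain with an assumed scale-and-gamma curve."""
    return [alpha * (p + eps) ** gamma for p in hdr_pixels]


def hdr_estimation_loss(predicted, hdr_ground_truth, alpha, gamma, eps=0.0):
    """MSE between the network output and the converted HDR ground truth."""
    target = hdr_to_ldr_domain(hdr_ground_truth, alpha, gamma, eps)
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(squared_errors) / len(squared_errors)


hdr_truth = [0.0, 1.0, 4.0]  # toy HDR pixel values
perfect = hdr_estimation_loss([0.0, 1.0, 2.0], hdr_truth, alpha=1.0, gamma=0.5)
off_by_one = hdr_estimation_loss([0.0, 1.0, 3.0], hdr_truth, alpha=1.0, gamma=0.5)
```

With this target, a network that minimizes the loss produces outputs that still live in the LDR value range, which is why a further conversion is needed before the second network.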

The LDR Correction Network

The second network takes the output of the first network and produces the corrected LDR image. The second trick comes in this part: the output of the first network (the HDR estimation) is modified before being fed into the second network. In fact, the output of the first network is still in the LDR domain (this follows from the first trick). So, this output is converted to the HDR domain via gamma correction, and then a logarithmic operation is applied.
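The preprocessing between the two networks could look roughly like the sketch below. The exponent `inverse_gamma` and the offset `delta` are hypothetical values picked for illustration; the article only states that gamma correction is followed by a logarithm.

```python
import math


def prepare_for_ldr_correction(first_network_output, inverse_gamma=2.2, delta=1.0):
    """Expand LDR-domain values to the HDR domain, then compress them with a log.

    inverse_gamma and delta are assumed constants, not values from the paper.
    """
    expanded = [max(p, 0.0) ** inverse_gamma for p in first_network_output]
    return [math.log(value + delta) for value in expanded]


prepared = prepare_for_ldr_correction([0.0, 1.0, 2.0])
```

The logarithm keeps very bright values from dominating the input of the second network while preserving their order.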

The two auto-encoder networks share the same architecture and employ skip connections. Here is the complete architecture of the network.

The two autoencoder networks: HDR Estimation Network (left) and LDR Correction Network (right)

Each network is composed of five convolutional and five deconvolutional layers:

Conv1 Layer: 9 x 9 kernel size, 64 feature maps
Conv2 Layer: 5 x 5 kernel size, 64 feature maps
Conv3 Layer: 3 x 3 kernel size, 128 feature maps
Conv4 Layer: 3 x 3 kernel size, 256 feature maps
Conv5 Layer: 3 x 3 kernel size, 256 feature maps
Latent representation: 1 x 512
Deconv1 Layer: 3 x 3 kernel size, 256 feature maps
Deconv2 Layer: 3 x 3 kernel size, 256 feature maps
Deconv3 Layer: 3 x 3 kernel size, 128 feature maps
Deconv4 Layer: 5 x 5 kernel size, 64 feature maps
Deconv5 Layer: 9 x 9 kernel size, 64 feature maps
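To get a feeling for the size of the encoder, the layer list can be turned into a rough parameter count. The sketch assumes a 3-channel RGB input, square kernels and one bias per feature map; strides, padding and normalisation parameters are not given in the article and are ignored here.

```python
# (name, kernel_size, feature_maps) as listed above for the encoder half.
ENCODER_LAYERS = [
    ("Conv1", 9, 64),
    ("Conv2", 5, 64),
    ("Conv3", 3, 128),
    ("Conv4", 3, 256),
    ("Conv5", 3, 256),
]


def conv_parameter_counts(layers, in_channels=3):
    """Return {layer name: number of weights + biases} for a chain of conv layers."""
    counts = {}
    for name, kernel, out_channels in layers:
        counts[name] = kernel * kernel * in_channels * out_channels + out_channels
        in_channels = out_channels  # the next layer consumes this layer's maps
    return counts


counts = conv_parameter_counts(ENCODER_LAYERS)
total = sum(counts.values())
```

Under these assumptions the five convolutional layers hold about 1.08 million parameters, most of them in Conv4 and Conv5.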

Training was done using the Adam optimizer with an initial learning rate of 1e-2.


Two datasets are used for training and testing the method: the city scene panorama dataset and the Sun360 outdoor panorama dataset. Moreover, the authors mention that they used Photoshop to generate ground-truth LDR images with human supervision for the intermediate task. The training set used to train both networks contains around 40,000 image triplets (original LDR image, ground-truth HDR image, and ground-truth LDR image).


As the authors state, the proposed method performs favourably against state-of-the-art methods. The evaluation was done by comparing the method to 5 state-of-the-art methods: Cape, WVM, SMF, L0S, and DJF.

The evaluation is not straightforward, since the goal is to create visually pleasing images, which is both difficult to quantify and subjective. However, the authors use a range of evaluation metrics. They used the HDR-VDP-2 metric to evaluate the first network, as it reflects human perception, and several metrics, such as PSNR, SSIM, and FSIM, to assess the whole method and compare it with existing methods.
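Of these metrics, PSNR is the simplest to compute by hand. A minimal sketch on flat lists of pixel values, assuming 8-bit images (`max_value=255`):

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference_pixels = [0.0, 0.0, 0.0, 0.0]
corrected_pixels = [255.0, 0.0, 0.0, 0.0]
value = psnr(reference_pixels, corrected_pixels)
```

Higher PSNR means the corrected image is closer to the ground truth pixel by pixel, which is why it is usually reported together with structure-aware metrics such as SSIM.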

Table 1: Quantitative evaluation of the proposed method for image correction and other existing methods
Table 2: Quantitative evaluation of the HDR prediction method (the first network) with other methods.

Visual evaluation is also provided where the significant results of the proposed method can be seen next to the results from other existing methods.

(1) Comparison of the proposed method (DRHT) with existing methods and LDR ground truth.
(2) Comparison of the proposed method (DRHT) with existing methods and LDR ground truth.

In conclusion, the proposed method shows that buried details in the under/over exposed images can be recovered and deep learning proved successful in another critical and non-trivial task.

Dan Mitriev