New Datasets for Action Recognition

22 October 2018


Action recognition is vital for many real-life applications, including video surveillance, healthcare, and human-computer interaction. What do we need to do to classify video clips based on the actions being performed in these videos?

We need to identify different actions from video clips where the action may or may not be performed throughout the entire duration of the video. This looks similar to the image classification problem, but here the task extends to multiple frames, with the per-frame predictions then aggregated into a single result. Since the introduction of the ImageNet dataset, deep learning algorithms have done a pretty good job at image classification. But do we observe the same progress in video classification or action recognition tasks?
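
To make the idea of aggregating per-frame predictions concrete, here is a minimal sketch in Python. It assumes that some image classifier has already produced a probability vector for each frame; the function name and the simple averaging rule are illustrative choices, not a method taken from any of the datasets discussed here.

```python
def aggregate_frame_predictions(frame_probs):
    """Average per-frame class probabilities into one clip-level prediction.

    frame_probs: list of per-frame probability lists, one entry per class.
    Returns (index of the most likely class, averaged probabilities).
    """
    if not frame_probs:
        raise ValueError("need at least one frame")
    num_classes = len(frame_probs[0])
    clip_probs = [
        sum(frame[c] for frame in frame_probs) / len(frame_probs)
        for c in range(num_classes)
    ]
    best_class = max(range(num_classes), key=lambda c: clip_probs[c])
    return best_class, clip_probs


# Hypothetical scores for a 3-frame clip and 3 action classes.
frames = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
]
label, probs = aggregate_frame_predictions(frames)
```

Averaging is only the simplest option. It ignores frame order, which is one reason why capturing long temporal context makes action recognition harder than image classification.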

Actually, a number of things make action recognition a much more challenging task: the huge computational cost, the need to capture long temporal context, and, of course, the need for good datasets.

A good dataset for the action recognition problem should have a number of frames comparable to ImageNet and a diversity of action classes that allows the trained architecture to generalize to many different tasks.

Fortunately, several such datasets were presented during the last year. Let’s have a look.


Kinetics-600

Number of videos: 500,000

Number of action classes: 600

Year: 2018

Samples from the Kinetics-600 dataset

We start with the dataset introduced by Google’s DeepMind team: Kinetics, a large-scale, high-quality dataset of YouTube URLs created to advance models for human action recognition. Its latest version, Kinetics-600, includes around 500,000 video clips that cover 600 human action classes, with at least 600 video clips for each action class.

Each clip in Kinetics-600 is taken from a unique YouTube video, lasts around 10 seconds and is labeled with a single class. The clips have been through multiple rounds of human annotation. A single-page web application was built for the labeling task, and you can see the labeling interface below.

Labeling Interface

If a worker responded with ‘Yes’ to the initial question “Can you see a human performing the action class-name?”, they were also asked the follow-up question “Does the action last for the whole clip?” so that this signal could be used later during model training.

The creators of Kinetics-600 also checked whether the dataset is gender balanced and found that approximately 15% of the action classes are imbalanced, but that this does not lead to biased performance.

The actions cover a broad range of classes including human-object interactions such as playing instruments, arranging flowers, mowing a lawn, scrambling eggs and so on.

Moments in Time

Number of videos: 1,000,000

Number of action classes: 339

Year: 2018

Samples from the Moments in Time dataset

Moments in Time is another large-scale dataset, developed by the MIT-IBM Watson AI Lab. With a collection of one million labeled 3-second videos, it is not restricted to human actions and includes people, animals, objects and natural phenomena that capture the gist of a dynamic scene.

The dataset has a significant intra-class variation among the categories. For instance, video clips labeled with the action “opening” include people opening doors, gates, drawers, curtains and presents, animals and humans opening eyes, mouths and arms, and even a flower opening its petals.

It is natural for humans to recognize that all of the above scenarios belong to the same category, “opening”, even though they look very different from each other visually. So, as the researchers point out, the challenge is to develop deep learning algorithms that can discriminate between different actions, yet generalize to other agents and settings within the same action.

The action classes in the Moments in Time dataset were chosen to include the most commonly used verbs in the English language, covering a wide and diverse semantic space. There are 339 different action classes in the dataset, with 1,757 labeled videos per class on average; each video is labeled with only one action class.

Labeling interface

As you can see from the image, the annotation process was very straightforward: workers were presented with video-verb pairs and asked to press a Yes or No key to indicate whether the action is happening in the scene. For the training set, the researchers ran each video through at least 3 rounds of annotation and required a human consensus of at least 75%. For the validation and test sets, they increased the minimum number of annotation rounds to 4 and the required consensus to at least 85%.
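
The acceptance rule described above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are hypothetical, and the sketch assumes that "consensus" means the fraction of annotators who pressed Yes.

```python
def passes_consensus(votes, min_rounds, min_agreement):
    """Decide whether a video-verb pair is accepted.

    votes: list of booleans, True when an annotator pressed Yes.
    min_rounds: minimum number of annotation rounds required.
    min_agreement: minimum fraction of Yes votes required.
    """
    if len(votes) < min_rounds:
        return False
    return sum(votes) / len(votes) >= min_agreement


# Thresholds quoted in the text for the two kinds of split.
TRAIN = dict(min_rounds=3, min_agreement=0.75)
EVAL = dict(min_rounds=4, min_agreement=0.85)

three_of_four = [True, True, True, False]
accepted_for_training = passes_consensus(three_of_four, **TRAIN)
accepted_for_evaluation = passes_consensus(three_of_four, **EVAL)
```

With three Yes votes out of four, the pair reaches 75% agreement, which is enough for the training set but not for the stricter validation and test sets.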


SLAC

Number of videos: 520,000 videos → 1.75M 2-second clips

Number of action classes: 200

Year: 2017

Data collection procedure

A Sparsely Labeled ACtions Dataset (SLAC) was introduced by a group of researchers from MIT and Facebook. Like Kinetics, the dataset focuses on human actions. It includes over 520K untrimmed videos retrieved from YouTube, with an average length of 2.6 minutes. 2-second clips were sampled from the videos using a novel active sampling approach. This resulted in 1.75M clips, including 755K positive samples and 993K negative samples, as annotated by a team of 70 professional annotators.

As you can see, the distinctive feature of this dataset is the presence of negative samples, illustrated below.

Negative samples from the SLAC dataset

The dataset includes 200 action classes taken from the ActivityNet dataset.

Please note that even though the paper introducing this dataset was released in December 2017, the dataset is still not available for download. Hopefully, this will change very soon.


VLOG

Number of videos: 114,000

Year: 2017

Samples from the VLOG dataset

The VLOG dataset differs from the previous datasets in the way it was collected. The traditional approach to getting data starts with a laundry list of action classes and then searches for videos tagged with the corresponding labels.

However, such an approach runs into trouble because everyday interactions are not likely to be tagged on the Internet. Could you imagine uploading and tagging a video of yourself opening a microwave, opening a fridge, or getting out of bed? People tend to tag unusual things, such as jumping in a pool, presenting the weather, or playing the harp. As a result, available datasets are often imbalanced, with more data featuring unusual events and less data on our day-to-day activities.

To solve this issue, the researchers from the University of California suggest starting out with a superset of what we actually need, namely interaction-rich video data, and then annotating and analyzing it after the fact. They start data collection from lifestyle VLOGs, an immensely popular genre of video that people publicly upload to YouTube to document their lives.

Illustration of the automatic gathering process

As the data was gathered implicitly, it presents certain challenges for annotation. The researchers decided to focus on the crucial part of the interaction, the hands, and how they interact with semantic objects at the frame level. Thus, this dataset can also make headway on the difficult problem of understanding hands in action.

Bottom Line

The action recognition problem requires huge computational costs and lots of data. Fortunately, several very good datasets have appeared during the last year. Together with the previously available benchmarks (ActivityNet, UCF101, HMDB), they build a great foundation for significant improvements in the performance of the action recognition systems.

This Neural Network Evaluates Natural Scene Memorability

1 October 2018

One hallmark of human cognition is the remarkable capacity to recall thousands of different images, some in detail, after only a single view. Not all photos are remembered equally by the human brain. Some images stick in our minds, while others fade away in a short time. This capacity is likely to be influenced by individual experiences and is also subject to some degree of inter-subject variability, similar to some individual image properties.

Interestingly, when exposed to an overflow of visual images, subjects consistently tend to remember or forget the same pictures. Previous research has analyzed why people tend to remember certain images and has provided reliable solutions for ranking images by memorability scores. These works mostly address generic images, object images and face photographs. However, it is difficult to identify obvious cues relevant to the memorability of a natural scene. To date, methods for predicting the visual memorability of a natural scene are scarce.

Previous Works

Previous work showed that memorability is an intrinsic property of an image. Deep neural networks (DNNs) have achieved strong results in many research areas, e.g., video coding and computer vision. Several DNN approaches have also been proposed to estimate image memorability, significantly improving prediction accuracy.

  • Data scientists from MIT trained MemNet on a large-scale database, achieving prediction performance close to human consistency.
  • Baveye et al. fine-tuned GoogLeNet, exceeding the performance of handcrafted features. Researchers have also studied specific targets such as faces and natural scenes.
  • Researchers from MIT have also created a database for studying the memorability of human face photographs. They further explored the contribution of certain traits (e.g., kindness, trustworthiness, etc.) to face memorability, but such characteristics only partly explain facial memorability.

State-of-the-art idea

As a first step towards understanding and predicting the memorability of a natural scene, the LNSIM database was built. It contains 2,632 natural scene images in total. To obtain them, 6,886 images containing natural scenes were first selected from existing databases, including MIR Flickr, MIT1003, NUSEF, and AVA. Fig. 1 shows some example images from the LNSIM database.

Fig. 1: Image samples from the LNSIM database

A memory game is used to quantify the memorability of each image in the LNSIM database. Software was developed for the game, in which 104 subjects (47 females and 57 males) took part. They do not overlap with the volunteers who participated in the image selection. The procedure of the memory game is summarized in Fig. 2.

Fig. 2: The experimental procedure of the memory game. Each level lasts about 5.5 minutes with a total of 186 images. Those 186 images are composed of 66 targets, 30 fillers, and 12 vigilance images (the total is consistent with each target and vigilance image being shown twice: 2 × 66 + 30 + 2 × 12 = 186). The specific time durations for the experiment setting are labeled above.

In this experiment, there were 2,632 target images, 488 vigilance images and 1,200 filler images, all unknown to the subjects. Vigilance and filler images were randomly sampled from the rest of the 6,886 images. Vigilance images were repeated within 7 images to ensure that the subjects were paying attention to the game. Filler images were presented only once, providing spacing between repetitions of the same target or vigilance image. After the data was collected, a memorability score was assigned to each image to quantify how memorable it is. Also, to evaluate human consistency, subjects were split into two independent halves (i.e., Group 1 and Group 2).

Analysis of Natural Scene Memorability

The LNSIM database is mined to better understand how natural scene memorability is influenced by low-, middle- and high-level handcrafted features and by the learned deep feature.

Low-level features, such as pixels, SIFT and HOG2, have an impact on the memorability of generic images. To investigate whether these low-level features still work on the natural scene image set, a support vector regression (SVR) model is trained for each low-level feature on the training set to predict memorability, and the Spearman rank correlation coefficient (SRCC) between each feature’s predictions and memorability is then evaluated on the test set. Table 1 reports the SRCC results on natural scenes, with SRCC on generic images as the baseline. It is evident that pixels (ρ=0.08), SIFT (ρ=0.28) and HOG2 (ρ=0.29) are not as effective as expected on natural scenes, especially compared to generic images.

Table 1: The correlation ρ between low-level features and natural scene memorability.

This suggests that low-level features cannot effectively characterize the visual information for remembering natural scenes.
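
Since SRCC (ρ) is the metric used for every comparison in this article, a small self-contained sketch of how it can be computed may help. It treats SRCC as the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the toy scores are illustrative assumptions, not the evaluation code of the original work.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def srcc(predicted, ground_truth):
    """Spearman rank correlation between predicted and true scores."""
    return pearson(average_ranks(predicted), average_ranks(ground_truth))


# Toy example: predicted vs. measured memorability for four images.
rho = srcc([0.1, 0.4, 0.35, 0.8], [0.2, 0.5, 0.6, 0.9])
```

Because only the ranking matters, a predictor can score ρ = 1 without matching the absolute memorability values, as long as it orders the images the same way the human data does.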

The middle-level feature GIST describes the spatial structure of an image. However, Table 2 shows that the SRCC of GIST is only 0.23 for natural scenes, much less than the ρ=0.38 of generic images. This illustrates that the structural information provided by the GIST feature is less effective for predicting memorability scores on natural scenes.

Table 2: The correlation ρ between middle-level features and natural scene memorability.

There is no salient object, animal or person in natural scene images, so scene semantics are considered as the high-level feature. To obtain the ground truth scene categories, two experiments were designed to annotate the scene category of the 2,632 images in the database.
  • Task 1 (Classification Judgement): 5 participants were asked to indicate which scene categories an image has. A random image query was generated for each participant. Participants had to choose proper scene category labels to describe the scene content of each image.
  • Task 2 (Verification Judgement): A separate task was run on the same set of images with another 5 participants after Task 1. The participants were asked to provide a binary answer to the question for each image. The default answer was set to “No”, and the participants could check the box of the image index to change “No” to “Yes”.

All images are annotated with categories through majority voting over Task 1 and Task 2. Afterward, an SVR predictor with the histogram intersection kernel is trained on the scene category. The scene category attribute achieves a good SRCC (ρ=0.38), outperforming the combination of low-level features. This suggests that the high-level scene category is an obvious cue for quantifying natural scene memorability. In the figure below, the horizontal axis represents scene categories in descending order of their average memorability scores. The average score ranges from 0.79 to 0.36, giving a sense of how memorability changes across scene categories. The distribution indicates that some unusual classes like aurora tend to be more memorable, while usual classes like mountain are more likely to be forgotten. This is possibly due to how frequently each category appears in daily life.

Comparison of average memorability score and standard deviation of each scene category

To examine how the deep feature influences the memorability of a natural scene, MemNet is fine-tuned on the LNSIM database, using the Euclidean distance between the predicted and ground truth memorability scores as the loss function. The output of the last hidden layer is extracted as the deep feature (dimension: 4096). To evaluate the correlation between the deep feature and natural scene memorability, an SVR predictor with the histogram intersection kernel is trained on the deep feature, as with the handcrafted features above. The SRCC of the deep feature is 0.44, exceeding all handcrafted features. The DNN thus works well for predicting the memorability of a natural scene, as the deep feature shows rather high prediction accuracy. Nonetheless, the fine-tuned MemNet still has its limitations, since a gap to human consistency (ρ=0.78) remains.

DeepNSM: DNN for natural scene memorability

The fine-tuned MemNet model serves as the baseline model in predicting natural scene memorability. In the proposed DeepNSM architecture, the deep feature is concatenated with a category-related feature to predict the memorability of natural scene images accurately. Note that the “deep feature” refers to the 4096-dimension feature extracted from the baseline model.

Figure 2: Architecture of DeepNSM model

The architecture of the DeepNSM model is presented in Figure 2. In the DeepNSM model, the aforementioned category-related feature is concatenated with the deep feature obtained from the baseline model. Based on this concatenated feature, additional fully-connected layers (including one hidden layer with a dimension of 4096) are designed to predict the memorability scores of natural scene images. In training, the layers of the baseline and ResNet models are initialized from the individually pre-trained models, and the added fully-connected layers are randomly initialized. The whole network is jointly trained end to end, using the Adam optimizer with the Euclidean distance as the loss function.

Comparison with other models

The performance of the DeepNSM model in predicting natural scene memorability is evaluated in terms of SRCC (ρ). The DeepNSM model is tested on both the test set of the LNSIM database and the NSIM database. Its SRCC performance is compared with state-of-the-art memorability prediction methods, including MemNet, MemoNet, and Lu et al. Among them, MemNet and MemoNet are the latest DNN methods for generic images, which beat conventional techniques based on handcrafted features. Lu et al. is a state-of-the-art method for predicting natural scene memorability.

Fig. 3: The SRCC (ρ) performance of DeepNSM and the compared methods.

Fig. 3 shows the SRCC performance of DeepNSM and the three compared methods. DeepNSM achieves an SRCC of ρ=0.58 and ρ=0.55 on the LNSIM and NSIM databases, respectively. It significantly outperforms the state-of-the-art DNN methods, MemNet and MemoNet. These results demonstrate the effectiveness of DeepNSM in predicting natural scene memorability.


The above approach investigated the memorability of a natural scene from a data-driven perspective. Specifically, it established the LNSIM database for analyzing human memorability of natural scenes. In exploring the correlation of memorability with low-, middle- and high-level features, it is worth noting that the high-level scene category feature plays a vital role in predicting the memorability of a natural scene.

A Style-Aware Content Loss for Real-time HD Style Transfer

14 August 2018


A picture may be worth a thousand words, but at least it contains a lot of very diverse information. This not only comprises what is portrayed, e.g., a composition of a scene and individual objects but also how it is depicted, referring to the artistic style of a painting or filters applied to a photo. Especially when considering artistic images, it becomes evident that not only content but also style is a crucial part of the message an image communicates (just imagine van Gogh’s Starry Night in the style of Pop Art). A vision system then faces the challenge to decompose and separately represent the content and style of an image to enable a direct analysis based on each individually. The ultimate test for this ability is style transfer, exchanging the style of an image while retaining its content.

Neural Style Transfer Example

Recent work has used neural networks, and the crucial representation in all these approaches has been based on a VGG16 or VGG19 network pretrained on ImageNet. However, a recent trend in deep learning has been to avoid supervised pre-training on a million images with tediously labeled object bounding boxes. In the setting of style transfer, this has the particular benefit of avoiding from the outset any bias introduced by ImageNet, which has been assembled without artistic consideration. Rather than utilizing a separate pre-trained VGG network to measure and optimize the quality of the stylistic output, this approach uses an encoder-decoder architecture with an adversarial discriminator to stylize the input content image, and uses the encoder to measure the reconstruction loss.

State of the Art

To enable fast style transfer that instantly transfers a content image, or even the frames of a video, according to a particular style, a feed-forward architecture is required rather than a slow optimization-based approach. To this end, an encoder-decoder architecture is used: an encoder network E maps an input content image x onto a latent representation z = E(x). A generative decoder G then plays the role of a painter and generates the stylized output image y = G(z) from the sketchy content representation z. Stylization then only requires a single forward pass, thus working in real time.


1) Training with a Style-Aware Content Loss

Previous approaches have been limited in that training worked only with a single style image. In contrast, in this work, a single style image y0 is given together with a set Y of related style images yj ∈ Y. To train E and G, a standard adversarial discriminator D is used to distinguish the stylized output G(E(xi)) from real examples yj ∈ Y. The transformed image loss is then defined as:


where C × H × W is the size of image x, and T is initialized with uniform weights for training. Fig. 3 illustrates the full pipeline of the approach. To summarize, the full objective of the model is:


where λ controls the relative importance of adversarial loss.
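
The two formulas referred to above are not reproduced in this text. A plausible reconstruction, consistent with the surrounding definitions, is sketched below. It rests on several assumptions that the text itself does not spell out: T is taken to be a learned transformation block applied to both the content image and the stylized output, the content term is written as L_c and measured in the encoder’s latent space, and L_D denotes the adversarial loss of the discriminator D. Treat it as a reading aid rather than the authors’ exact notation.

```latex
% Transformed image loss (assumed form): compare x and its stylization after the block T
\mathcal{L}_{T}(E, G) =
  \mathbb{E}_{x \sim p_X(x)}
  \left[ \frac{1}{CHW} \left\| T(x) - T\big(G(E(x))\big) \right\|_2^2 \right]

% Full objective (assumed form): content term, transformed image term, weighted adversarial term
\mathcal{L}(E, G, D) =
  \mathcal{L}_{c}(E, G) + \mathcal{L}_{T}(E, G) + \lambda \, \mathcal{L}_{D}(E, G, D)
```

Under this reading, E and G are trained to minimize the objective while D is trained to maximize the adversarial term, as is standard for adversarial training.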

2) Style Image Grouping

Given a single style image y0, the task is to find a set Y of related style images yj ∈ Y. A VGG16 network is trained from scratch on the Wikiart dataset to predict the artist of a given artwork. The network is trained on the 624 largest (by number of works) artists from the Wikiart dataset. Artist classification, in this case, is the surrogate task for learning meaningful features in the artworks’ domain, which allows retrieving artworks similar to image y0.

Let φ(y) be the activations of the fc6 layer of the VGG16 network C for input image y. To get a set of style images related to y0 from the Wikiart dataset Y, all nearest neighbors of y0 are retrieved based on the cosine distance δ of the activations φ(·), i.e.
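
A minimal sketch of this retrieval step in Python follows. The feature vectors stand in for the fc6 activations φ(y), and the `threshold` parameter that decides what counts as a near neighbor is a hypothetical choice for illustration, since the text does not state how the cutoff is set.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def related_style_images(query_features, dataset_features, threshold):
    """Return the names of images whose features lie close to the query.

    dataset_features: dict mapping an image name to its feature vector.
    threshold: maximum cosine distance (an illustrative assumption).
    """
    return [
        name
        for name, features in dataset_features.items()
        if cosine_distance(query_features, features) < threshold
    ]


# Toy 2-D features standing in for the real fc6 activations.
artworks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 0.1]}
neighbors = related_style_images([1.0, 0.0], artworks, threshold=0.1)
```

Cosine distance ignores the magnitude of the activations, so artwork "c" counts as close to the query even though its vector is twice as long.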


The basis for the style transfer model is an encoder-decoder architecture. The encoder network contains 5 conv layers: 1×conv-stride-1 and 4×conv-stride-2. The decoder network has 9 residual blocks, 4 upsampling blocks, and 1×conv-stride-1. The discriminator is a fully convolutional network with 7×conv-stride-2 layers. During training, 768×768 content image patches are sampled from the training set of Places365 [51] and 768×768 style image patches from the Wikiart dataset. The model is trained for 300,000 iterations with batch size 1, learning rate 0.0002 and the Adam optimizer. The learning rate is reduced by a factor of 10 after 200,000 iterations.
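
The learning rate schedule described here is a single step decay, which can be written as a small function. The function name and the convention that iterations are counted from zero are assumptions made for this sketch.

```python
def learning_rate(iteration, base_lr=0.0002, decay_at=200_000, decay_factor=10):
    """Step schedule: base_lr until decay_at iterations are done, then base_lr / decay_factor.

    iteration is assumed to be zero-based, so iteration 200_000 is the first
    one that runs after 200,000 iterations have completed.
    """
    if iteration >= decay_at:
        return base_lr / decay_factor
    return base_lr


TOTAL_ITERATIONS = 300_000
first_lr = learning_rate(0)
last_lr = learning_rate(TOTAL_ITERATIONS - 1)
```

With these values, the first two thirds of training run at 0.0002 and the final 100,000 iterations at 0.00002.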

Table: 1
Training time

Experts were asked to choose the one image that best and most realistically reflects the current style. The score is computed as the fraction of times a specific method was chosen as the best in the group. The mean expert score is calculated for each method over 18 different styles and reported in Tab. 1.
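
The scoring rule can be illustrated with a short Python sketch. The method names and the votes below are made up for the example, and the assumption is that each expert picks exactly one method per style.

```python
from collections import Counter


def expert_scores(choices, methods):
    """Fraction of times each method was chosen as best for one style."""
    counts = Counter(choices)
    return {method: counts[method] / len(choices) for method in methods}


def mean_expert_score(per_style_choices, methods):
    """Average the per-style fractions over all evaluated styles."""
    per_style = [expert_scores(choices, methods) for choices in per_style_choices]
    return {
        method: sum(scores[method] for scores in per_style) / len(per_style)
        for method in methods
    }


# Hypothetical votes for two styles (the study itself used 18 styles).
methods = ["ours", "baseline"]
votes = [
    ["ours", "ours", "baseline", "ours"],  # style 1: four expert picks
    ["baseline", "ours"],                  # style 2: two expert picks
]
scores = mean_expert_score(votes, methods)
```

Averaging per style, rather than pooling all votes, keeps a style with many votes from dominating the final score.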


This paper has addressed major conceptual issues in state-of-the-art approaches for style transfer. The proposed style-aware content loss enables a real-time, high-resolution encoder-decoder based stylization of images and videos and significantly improves stylization by capturing how style affects content.


Result in high resolution

Anatomically-Aware Facial Animation from a Single Image

7 August 2018

Let’s say you have a picture of Hugh Jackman for an advertisement. He looks great, but the client wants him to look a little bit happier. No, you don’t need to invite the celebrity for another photo shoot. You don’t even need to spend hours in Photoshop. You can automatically generate a dozen images from this single image, in which Hugh smiles with different intensities. You can even create an animation from this single image, in which Hugh changes his facial expression from absolutely serious to absolutely happy.

The authors of a novel GAN conditioning scheme based on Action Unit (AU) annotations claim that such a scenario is not futuristic but absolutely feasible today. In this article, we give an overview of their approach, look at the results generated with the suggested method, and compare its performance with state-of-the-art approaches.

Suggested Method

Facial expressions are the result of the combined and coordinated action of facial muscles. They can be described in terms of the so-called Action Units, which are anatomically related to the contractions of specific facial muscles. For example, the facial expression for fear is generally produced with the following activations: Inner Brow Raiser (AU1), Outer Brow Raiser (AU2), Brow Lowerer (AU4), Upper Lid Raiser (AU5), Lid Tightener (AU7), Lip Stretcher (AU20) and Jaw Drop (AU26). The magnitude of each AU defines the extent of emotion.

Building on this approach to defining facial expressions, Pumarola and his colleagues suggest a GAN architecture, which is conditioned on a one-dimensional vector indicating the presence/absence and the magnitude of each action unit. They train this architecture in an unsupervised manner that only requires images with their activated AUs. Next, they split the problem into two main stages:

  • Rendering a new image under the desired expression by considering an AU-conditioned bidirectional adversarial architecture provided with a single training photo.
  • Rendering back the synthesized image to the original pose.

Moreover, the researchers wanted to ensure that their system will be able to handle images under changing backgrounds and illumination conditions. Hence, they’ve added an attention layer to their network. It focuses the action of the network only in those regions of the image that are relevant to convey the novel expression.

Let’s now move on to the next section to reveal the details of this network architecture – how does it succeed in generating anatomically coherent facial animations from images in the wild?

Overview of the approach

Network Architecture

The proposed architecture consists of two main blocks:

  • Generator G regresses attention and color masks (note that it’s applied twice, first to map the input image and then to render it back). The aim is to make the generator focus only on those regions that are responsible for synthesizing the novel face expression while keeping the remaining elements of the image, such as hair, glasses, hats or jewelry, untouched. Thus, instead of regressing the full image, this generator outputs a color mask C and an attention mask A. The mask A indicates to what extent each pixel of C contributes to the output image. This leads to sharper and more realistic images.
Attention-based generator
  • Conditional critic. Critic D evaluates the generated image in its photorealism and expression conditioning fulfillment.
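
To illustrate how the two generator outputs combine into the final image, here is a minimal per-pixel blending sketch in Python. It assumes the convention in which an attention value of 1 keeps the original pixel and a value of 0 takes the pixel entirely from the color mask C; the text above does not state the exact formula, so treat this as an illustration with hypothetical function names.

```python
def blend_pixel(attention, color, original):
    """Blend one RGB pixel: (1 - A) * C + A * I (assumed convention)."""
    return tuple(
        (1.0 - attention) * c + attention * o
        for c, o in zip(color, original)
    )


def blend_image(attention_mask, color_mask, original):
    """Apply the per-pixel blend over images stored as nested row lists."""
    return [
        [blend_pixel(a, c, o) for a, c, o in zip(a_row, c_row, o_row)]
        for a_row, c_row, o_row in zip(attention_mask, color_mask, original)
    ]


# A 1x2 toy image: the first pixel is left untouched (A = 1),
# the second is fully replaced by the color mask (A = 0).
original = [[(0.9, 0.8, 0.7), (0.9, 0.8, 0.7)]]
color = [[(0.2, 0.2, 0.2), (0.2, 0.3, 0.4)]]
attention = [[1.0, 0.0]]
result = blend_image(attention, color, original)
```

Because untouched regions are copied straight from the input, the generator only has to synthesize the pixels that actually change with the expression.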

The loss function for this network is a linear combination of several partial losses: image adversarial loss, attention loss, conditional expression loss, and identity loss:

loss function

The model is trained on a subset of 200,000 images from the EmotioNet dataset using the Adam optimizer with a learning rate of 0.0001, beta1 = 0.5, beta2 = 0.999 and a batch size of 25.

Experimental Evaluation

The image below demonstrates the model’s ability to activate AUs at different intensities while preserving the person’s identity. For instance, you can see that the model properly handles the case with zero intensity and generates an identical copy of the input image. For the non-zero cases, the model realistically renders complex facial movements and outputs images that are usually indistinguishable from the real ones.

Single AU’s edition: varying the levels of intensity

The next figure displays the attention mask A and the color mask C. You can see how the model focuses its attention (darker area) onto the corresponding action units in an unsupervised manner. Hence, only the pixels relevant to the expression change are carefully estimated, while the background pixels are directly copied from the input image. This feature of the model is very handy when dealing with images in the wild.

Attention Model

Now, let’s see how the model handles the task of editing multiple AUs. The outcomes are depicted below. You can observe here remarkably smooth and consistent transformation across frames even with challenging light conditions and non-real world data, as in the case of the avatar. These results encourage the authors to further extend their model to video generation. They should definitely try, shouldn’t they?

Editing multiple Action Units

Meanwhile, they compare their approach against several state-of-the-art methods to see how well their model performs at generating different facial expressions from a single image. The results are shown below. The bottom row, representing the suggested approach, appears to contain much more visually compelling images with notably higher spatial resolution. As discussed above, the use of the attention mask allows the transformation to be applied only to the cropped face, which is then put back into the original image without producing artifacts.

Qualitative comparison with state-of-the-art

Limitations of the Model

Let’s now discuss the model’s limits: which types of challenging images it is still able to handle, and when it actually fails. As demonstrated in the image below, the model succeeds when dealing with human-like sculptures, non-realistic drawings, non-homogeneous textures across the face, anthropomorphic faces with non-real textures, non-standard illuminations/colors and even face sketches.

However, there are several cases in which it fails. The first failure case depicted below results from errors in the attention mechanism when given extreme input expressions. The model may also fail when the input image contains previously unseen occlusions, such as an eye patch, causing artifacts in the missing face attributes. It is also not ready to deal with non-human anthropomorphic distributions, as in the case of cyclopes. Finally, the model can also generate artifacts such as human face features when dealing with animals.

Success and failure cases

Bottom Line

The model presented here is able to generate anatomically-aware face animations from images in the wild. The resulting images surprise with their realism and high spatial resolution. The suggested approach advances earlier works, which had only addressed the problem for discrete emotion categories and portrait images. Its key contributions include a) encoding face deformations by means of AUs to render a wide range of expressions and b) embedding an attention model to focus only on the relevant regions of the image. The several failure cases that were observed are presumably due to insufficient training data. Thus, we can conclude that the results of this approach are very promising, and we look forward to observing its performance on video sequences.

Automatic Creation of Personalized GIFs: a New State-of-the-Art Approach

27 July 2018

Suppose you have a 10-minute video but are interested only in a small portion of it. You are thinking about creating a 5-second GIF out of this video, but video editing can be quite a cumbersome task. Would it be possible to create such a GIF automatically? Would the algorithm be able to detect the moments you want to highlight? Well, in this article we are going to talk about a new approach to this task. It takes into account the history of GIFs you have previously created and suggests an option that is likely to highlight the moments you are interested in.

Figure 1. Taking into account users previously selected highlights when creating GIFs

Typically, highlight detection models are trained to identify cues that make visual content appealing or interesting to most people. However, the “interestingness” of a video segment or image is in fact subjective. As a result, such highlight models often provide results that are of limited relevance for the individual user. Another approach suggests training one model per user, but this turns out to be inefficient and, in addition, requires large amounts of personal information, which is typically not available. So…

What is suggested?

Ana Garcia del Molino and Michael Gygli, while working at, suggested a new global ranking model, which can condition on a particular user’s interests. Rather than training one model per user, their model is personalized via its inputs, which allows it to effectively adapt its predictions, given only a few user-specific examples. It is built on the success of deep ranking models for highlight detection but makes the crucial enhancement of making highlight detection personalized.

Put simply, the researchers suggest using information on the GIFs that a user previously created, as this represents his or her interests and thus provides a strong signal for personalization. Knowing that a specific user is interested in basketball, for example, is not sufficient. One user may edit basketball videos to extract the slams, another one may just be interested in the team’s mascot jumping. A third one may prefer to see the kiss cam segments of the game.

Figure 2. Some examples of user histories from the paper

To obtain data about the GIFs previously created by different users, the researchers have turned to and its user base and collected a novel and large-scale dataset of users and the GIFs they created. Moreover, they made this dataset publicly available from here. It consists of 13,822 users with 222,015 annotations on 119,938 videos.

Model Architecture

The model that is suggested predicts the score of a segment based on both the segment itself and the user’s previously selected highlights. The method uses a ranking approach, where a model is trained to score positive video segments higher than negative segments from the same video. In contrast to previous works, however, the predictions are not based on the segment solely, but also take a user’s previously chosen highlights, their history, into account.
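To make the ranking idea concrete, here is a minimal sketch of a pairwise ranking loss in NumPy. The article does not give the exact loss used in the paper, so the hinge form and the `margin` value are assumptions, and `pairwise_ranking_loss` is a hypothetical helper name.

```python
import numpy as np

def pairwise_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge loss over all (positive, negative) segment pairs from one video.

    The hinge form and the margin are illustrative assumptions.
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]  # shape (P, 1)
    neg = np.asarray(neg_scores, dtype=float)[None, :]  # shape (1, N)
    # A pair costs nothing once the positive beats the negative by `margin`.
    return float(np.maximum(0.0, margin - (pos - neg)).mean())

# Positives already ranked above negatives: no loss.
well_ranked = pairwise_ranking_loss([2.0, 3.0], [0.0, 0.5])
# Negatives ranked above positives: large loss.
badly_ranked = pairwise_ranking_loss([0.0, 0.5], [2.0, 3.0])
```

In the personalized setting, the scores themselves would come from a model that also sees the user's history; the loss only cares about the ordering of segments within one video.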

Figure 3. Model architecture. Proposed model (in bold) and alternative ways to encode the history and fuse predictions

In fact, the researchers propose two models, which are combined with late fusion. One takes the segment representation and the aggregated history as input (PHD-CA), while the second directly uses the distances between the segments and the history (SVM-D). For the model with the aggregated history, they suggest using a feed-forward neural network (FNN). It is a fairly small network with 2 hidden layers of 512 and 64 neurons. For the distance-based model, they create a feature vector that contains the cosine distances to a fixed number of the most similar history elements. As the two models differ in the range of their predictions and in their performance, a weight is applied when ensembling them.
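The following sketch shows roughly how such a network and the late fusion could look. Only the hidden layer sizes (512 and 64) come from the text; the feature dimension, the mean-pooling of the history, the ReLU activations, the random weights, and the form of the fusion (a weighted sum) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_fnn(in_dim, hidden=(512, 64)):
    """Random weights for a feed-forward net with two hidden layers and a scalar output."""
    dims = (in_dim, *hidden, 1)
    return [(rng.normal(0, 0.01, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def fnn_score(params, segment, history):
    """Score one segment from its features and the aggregated user history."""
    # Assumption: the history is aggregated by averaging its feature vectors.
    x = np.concatenate([segment, history.mean(axis=0)])
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only (assumed)
    return float(x[0])

def late_fusion(score_a, score_b, weight=0.5):
    """Weighted sum of the two model outputs; `weight` would be tuned on held-out data."""
    return score_a + weight * score_b

feat_dim = 128  # hypothetical feature size
params = init_fnn(2 * feat_dim)
segment = rng.normal(size=feat_dim)
history = rng.normal(size=(5, feat_dim))  # five previously selected highlights
s = fnn_score(params, segment, history)
fused = late_fusion(s, 0.3, weight=0.5)   # 0.3 stands in for the distance-based score
```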

Performance of the proposed model in comparison to other methods
The performance of the suggested model was compared against several strong baselines:

  • Video2GIF. This approach is the state of the art for automatic highlight detection for GIF creation. The comparison was carried out for both the original pre-trained model and a model with slight variations trained on the dataset, which is referred to as Video2GIF (ours).
  • Highlight SVM. This model is a ranking SVM trained to correctly rank positive and negative segments, but only using the segment’s descriptor and ignoring the user history.
  • Maximal similarity. This baseline scores segments according to their maximum similarity with the elements in the user history. Cosine similarity was used as a similarity measure.
  • Video-MMR. In this model, the segments that are most similar to the user history are scored highly. Specifically, the mean cosine similarity to the history elements is used as an estimate of the relevance of a segment.
  • Residual Model. Here the researchers decided to adopt an idea from another study, where a generic regression model was used together with a user-specific model that personalizes predictions by fitting the residual error of the generic model. So, in order to adapt this idea to the ranking setting, they proposed training a user-specific ranking SVM that gets the generic predictions from Video2GIF (ours) as an input, in addition to the segment representation.
  • Ranking SVM on the distances (SVM-D). This one corresponds to the second part of the proposed model (distance-based model).
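The similarity-based baselines in the list above are simple enough to sketch in a few lines of NumPy. The function names are hypothetical, and `distance_features` is only one plausible reading of the feature vector used by the distance-based model (the `k` smallest cosine distances), so treat it as an assumption.

```python
import numpy as np

def cosine_similarities(segment, history):
    """Cosine similarity between one segment vector and each history vector."""
    seg = segment / np.linalg.norm(segment)
    hist = history / np.linalg.norm(history, axis=1, keepdims=True)
    return hist @ seg

def max_similarity_score(segment, history):
    """Maximal-similarity baseline: closest history element decides the score."""
    return float(cosine_similarities(segment, history).max())

def mean_similarity_score(segment, history):
    """Mean cosine similarity to the history, as described for Video-MMR."""
    return float(cosine_similarities(segment, history).mean())

def distance_features(segment, history, k):
    """Cosine distances to the k most similar history elements (assumed layout)."""
    return np.sort(1.0 - cosine_similarities(segment, history))[:k]

history = np.array([[1.0, 0.0], [0.0, 1.0]])
segment = np.array([1.0, 0.0])
top1 = distance_features(segment, history, k=1)
```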

The following metrics were used for quantitative comparison: mAP (mean average precision), nMSD (normalized Meaningful Summary Duration), and Recall@5 (the ratio of frames from the user-generated GIFs, i.e. the ground truth, that are included in the 5 highest-ranked GIFs).

Here are the results:

Table 1. Comparison of the suggested approach (denoted as Ours) to the state-of-the-art alternatives for videos segmented into 5-second-long shots. For mAP and R@5, the higher the score, the better the method. For nMSD, smaller is better. Best result per category in bold.

Table 2. Comparison of different ways to represent and aggregate the history, as well as ways to use the distances to the history to improve the prediction.

As you can see, the proposed method outperforms all baselines by a significant margin. Adding information about the user history to the highlight detection model (Ours (CA + SVM-D)) leads to a relative improvement over generic highlight detection (Video2GIF (ours)) of 5.2% (+0.8%) in mAP, 4.3% (-1.8%) in nMSD and 8% (+2.3%) in Recall@5.

Figure 4. Qualitative comparison to the state-of-the-art method (Video2GIF). Correct results have green borders. (c) shows a failure case in which the user’s history misleads the model

Let’s sum up

A novel model for personalized highlight detection was introduced. The distinctive feature of this model is that its predictions are conditioned on a specific user by providing their previously chosen highlight segments as inputs to the model. The experiments demonstrated that users often have high consistency in the content they select, which allows the proposed model to outperform other state-of-the-art methods. In particular, the suggested approach outperforms generic highlight detection by 8% in Recall@5. This is a considerable improvement in this challenging high-level task.

Finally, a new large-scale dataset with personalized highlight information was introduced, which can be of particular use for further studies in this area.

Realistic 3D Avatars from a Single Image

27 July 2018

Digital media needs realistic 3D avatars with faces. The recent surge in augmented and virtual reality platforms has created an even stronger demand for high-quality content, and rendering realistic faces plays a crucial role in achieving engaging face-to-face communication between digital avatars in simulated environments.

So, what would a perfect algorithm look like? A person takes a “selfie” on a mobile phone, uploads the picture, and gets an avatar in the simulated environment with accurately modeled facial shape and reflectance. In practice, however, significant compromises are made to balance the amount of input data to be captured, the amount of computation required, and the quality of the final output.

Figure 1. Inferring high-fidelity facial reflectance and geometry maps from a single image

Despite the high complexity of the task, a group of researchers from the USC Institute for Creative Technologies claims that their model can efficiently create accurate, high-fidelity 3D avatars from a single input image captured in an unconstrained environment. Furthermore, the avatars are said to be close in quality to those created by professional capture systems while requiring minimal computation and no special expertise on the part of the photographer.

So, let’s discover their approach to creating high-fidelity avatars from a single image without extensive computations or manual efforts.

Overview of the Suggested Approach

First of all, the model is trained with high-resolution facial scans obtained using a state-of-the-art multi-view photometric facial scanning system. This approach helps to get high-resolution and high-fidelity geometric and reflectance maps from a 2D input image, which can be captured under arbitrary illumination and contain partial occlusions of the face. The inferred maps can then be used to render a compelling and realistic 3D avatar in novel lighting conditions. The whole process can be accomplished in seconds.

Considering the complexity of the task, it was decomposed into several problems, which are addressed by separate convolutional neural networks:

· Stage 1 includes obtaining the coarse geometry by fitting a 3D template model to the input image, extracting an initial facial albedo map from this model, and then using networks that estimate illumination-invariant specular and diffuse albedo and displacement maps from this texture.

· Stage 2: the inferred maps, which may have missing regions due to occlusions in the input image, are passed through networks for texture completion. High-fidelity textures are obtained using a multi-resolution image-to-image translation network, in which latent convolutional features are flipped so as to achieve a natural degree of symmetry while maintaining local variations.

· Stage 3: another network is used to obtain additional details in the completed regions.

· Stage 4: a convolutional neural network performs super-resolution to increase the pixel resolution of the completed texture from 512 × 512 to 2048 × 2048.

Let’s discuss the architecture of the suggested model in more detail.

Model Architecture

The pipeline of the proposed model is illustrated below. Given a single input image, the base mesh and corresponding facial texture map are extracted. This map is passed through two convolutional neural networks (CNNs) that perform inference to obtain the corresponding reflectance and displacement maps. Since these maps may contain large missing regions, the next step includes texture completion and refinement to fill these regions based on the information from the visible regions. And finally, super-resolution is performed. The resulting high-resolution reflectance and geometry maps may be used to render high-fidelity avatars in novel lighting environments.

Figure 2. The pipeline of the proposed model

Reflectance and geometry inference. A pixel-wise optimization algorithm is adopted to obtain the base facial geometry, head orientation, and camera parameters. This data is then used to project the face into a texture map in UV space. The non-skin region is removed. The extracted RGB texture is fed into a U-net with skip connections to obtain the corresponding diffuse and specular reflectance maps and the mid- and high-frequency displacement maps.

To obtain the best overall performance, two networks with identical architectures were employed: one operating on the diffuse albedo map (subsurface component), and the other on the tensor obtained by concatenating the specular albedo map with the mid- and high-frequency displacement maps (collectively surface components).

Symmetry-aware texture completion. Again, the best results were obtained by training two network pipelines: one to complete the diffuse albedo, and another to complete the other components (specular albedo and mid- and high-frequency displacement).

Next, it was discovered that completing large areas at high resolution doesn’t give satisfactory results due to the high complexity of the learning objective. Thus, the inpainting problem was divided into simpler sub-problems, as shown in the picture below.

Figure 3. Texture completion pipeline

Furthermore, the researchers leveraged the spatial symmetry of the UV parameterization and maximized the feature coverage by flipping intermediate features over the V-axis in UV space and concatenating them with the original features. This allowed completed textures to contain a natural degree of symmetry as seen in real faces instead of an uncanny degree of near-perfect symmetry.
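The flip-and-concatenate trick can be illustrated on a plain NumPy array. This is only a sketch: it assumes a `(channels, height, width)` layout in which the width axis corresponds to U, so that flipping over the V-axis is a left-right mirror, and `flip_and_concat` is a hypothetical name.

```python
import numpy as np

def flip_and_concat(features):
    """features: (channels, height, width) feature map in UV space.

    Returns the original channels followed by their left-right mirrored copies,
    so later layers can see both halves of the face at every location.
    """
    mirrored = features[:, :, ::-1]  # mirror across the vertical centre line
    return np.concatenate([features, mirrored], axis=0)

feats = np.arange(24, dtype=float).reshape(2, 3, 4)
out = flip_and_concat(feats)  # shape (4, 3, 4): channel count doubles
```

Because the mirrored copy is concatenated rather than averaged in, the network is free to use it where one side of the face is missing and ignore it elsewhere, which fits the point about avoiding near-perfect symmetry.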

Each network was trained using the Adam optimizer with a learning rate set to 0.0002.

Figure 4. Examples of resulting renderings in new lighting conditions


Quantitative evaluations of the system’s ability to faithfully recover the reflectance and geometry data from a set of 100 test images are depicted in the Table below.

Table 1. Peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) of the inferred images for 100 test images compared to the ground truth
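For readers unfamiliar with the metric, PSNR can be computed from the root-mean-square error between the inferred map and the ground truth. The sketch below assumes images scaled to the range [0, 1] (hence `max_value=1.0`); SSIM is more involved and is left out.

```python
import numpy as np

def rmse(pred, target):
    """Root-mean-square error between two arrays of the same shape."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    err = rmse(pred, target)
    if err == 0.0:
        return float("inf")  # identical images
    return float(20.0 * np.log10(max_value / err))

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)   # uniform error of 0.1
score = psnr(pred, target)    # about 20 dB
```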

Even though we observe relatively large differences from the ground truth of the specular albedo results, qualitative evaluations demonstrate that the inferred data is still sufficient for rendering compelling and high-quality avatars.

Figure 5. Zoom-in results showing synthesized mesoscopic details

Furthermore, the results were compared quantitatively and qualitatively to other state-of-the-art methods. This comparison revealed that the new approach presented here results in more coherent and plausible facial textures than any of the alternative methods.

Figure 6. Comparison with PCA, Visio-lization [Mohammed et al. 2009], and a state-of-the-art diffuse albedo inference method [Saito et al. 2017] 

Table 2. Quantitative comparison of the suggested model with several alternative methods, measured using the PSNR and the root-mean-square error (RMSE)


In summary, the suggested approach makes it feasible to infer high-resolution reflectance and geometry maps using a single unconstrained image. Not only are these maps sufficient for rendering compelling and realistic avatars, but they can also be obtained within several seconds rather than the several minutes required by alternative methods. These results are possible in large part due to the use of high-quality ground truth 3D scans and the corresponding input images. Moreover, the technique of flipping and concatenating convolutional features encoded in the latent space of the model made it possible to perform texture completion while preserving the natural degree of facial symmetry.

Figure 7. Demonstration of the model’s limitations

Still, the suggested approach has several limitations that are demonstrated in the figure above. The method produces artifacts in the presence of strong shadows and non-skin objects due to segmentation failures. Also, volumetric beards are not faithfully reconstructed, and strong dynamic wrinkles may cause artifacts in the inferred displacement maps.

Nevertheless, these limitations do not diminish the contribution that the suggested approach makes to the problem of creating high-fidelity avatars for simulated environments.

3D Hair Reconstruction Out of a Single Image

10 July 2018

Generating a realistic 3D model of an object from 2D data is a challenging task that has been explored by many researchers in the past. Creating and rendering a high-quality 3D model is challenging in itself, and estimating the 3D shape of an object from a 2D image is harder still. People have been trying to address this issue, especially when digitizing virtual humans (in contexts ranging from video games to medical applications). Although there has been enormous progress, the generation of high-quality, realistic 3D object models is still not a solved problem. In human shape modeling, for example, there has been great success in reconstructing faces but much less in generating 3D hair models.

This problem has been addressed recently by researchers from the University of Southern California, the USC Institute for Creative Technologies, Pinscreen, and Microsoft Research Asia, who propose a deep-learning-based method for 3D hair reconstruction from a single unconstrained 2D image.

Unlike previous approaches, the proposed deep learning method directly generates hair strands instead of volumetric grids or point cloud structures. According to the authors, the new approach achieves state-of-the-art performance in resolution and quality and brings significant improvements in speed and storage costs. Moreover, as an important contribution, the model provides a smooth, compact, and continuous representation of hair geometry, which enables smooth sampling and interpolation.

Data representation in the proposed method

The Method

The proposed approach consists of three steps:

  1. Preprocessing that calculates the 2D orientation field of the hair region.
  2. A deep neural network that takes the 2D orientation field and outputs generated hair strands (in a form of sequences of 3D points).
  3. A reconstruction step that generates a smooth and dense hair model.

As mentioned before, the first step is preprocessing the image to obtain the 2D orientation field of the hair region only. Therefore, the hair region is extracted first, using a robust pixel-wise hair mask on the portrait image. After that, Gabor filters are used to detect the orientation and construct the pixel-wise 2D orientation map. It is also worth noting that the researchers use undirected orientation, as they are interested only in the orientation and not in the actual hair growth direction. To improve the segmentation of the hair region, they also apply human head and body segmentation masks. Finally, the output of the preprocessing step is a 3 x 256 x 256 image in which the first two channels encode the colour-coded orientation and the third one is the actual segmentation.

Deep Neural Network

Data Representation

The output of the hair prediction network is a hair model represented as sequences of ordered 3D points, one sequence per modeled hair strand. In the experiments, each sequence consists of 100 3D points, each carrying position and curvature attributes. Thus, a hair model contains N strands (sequences).

3D Hair Reconstruction

The input orientation image is first encoded into a high-level feature vector and then decoded to 32 x 32 individual strand-features. Then, each of these features is decoded to a hair geometry represented by positions and curvatures for each of the points in the strand.
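As a rough picture of this output format, the sketch below stores a hair model as two NumPy arrays laid out on the 32 x 32 strand grid with 100 points per strand. Storing curvature as a single scalar per point is an assumption, and the helper names are hypothetical.

```python
import numpy as np

GRID = 32                # strands are laid out on a 32 x 32 grid
POINTS_PER_STRAND = 100  # ordered 3D points per strand

def empty_hair_model():
    """Allocate position and curvature arrays for one hair model."""
    positions = np.zeros((GRID, GRID, POINTS_PER_STRAND, 3))
    curvatures = np.zeros((GRID, GRID, POINTS_PER_STRAND))  # scalar per point (assumed)
    return positions, curvatures

def strand_lengths(positions):
    """Polyline length of every strand: sum of distances between consecutive points."""
    steps = np.diff(positions, axis=-2)
    return np.linalg.norm(steps, axis=-1).sum(axis=-1)

positions, curvatures = empty_hair_model()
# Toy data: every strand is a straight vertical line of length 1.
positions[..., 1] = np.linspace(0.0, 1.0, POINTS_PER_STRAND)
lengths = strand_lengths(positions)
```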

Network Architecture

The network takes the orientation image as input and gives two matrices as output: the positions and the curvatures, as explained above. It has an encoder-decoder convolutional architecture that deterministically encodes the input image into a latent vector of fixed size 512. This latent vector represents the hair feature, which is then decoded by the decoder part. The encoder consists of 5 convolutional layers and a max pooling layer. The decoder consists of 3 deconvolutional layers that turn the latent vector into multiple strand feature vectors (as mentioned above), and finally an MLP decodes each feature vector into the desired geometry consisting of curvatures and positions.

The proposed Encoder-Decoder architecture that does the 3D hair reconstruction

To optimize this architecture for the specific problem, the authors employ 3 loss functions: two of them are L2 reconstruction losses on the geometry (3D position and curvature), and the third is a collision loss measuring the collision between the hair strands and the human body.
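A minimal sketch of how these three terms could be combined is shown below. The article does not give the exact formulas, so the ellipsoid penalty (positive inside the ellipsoid, zero outside), the axis-aligned ellipsoids, the equal weighting, and all function names are assumptions.

```python
import numpy as np

def l2_loss(pred, target):
    """Mean squared error between predicted and ground-truth values."""
    return float(np.mean((pred - target) ** 2))

def collision_loss(points, center, radii):
    """Penalty for points inside one axis-aligned ellipsoid.

    points: (..., 3); center and radii: (3,).
    """
    # Positive inside the ellipsoid, zero on its surface, negative outside.
    inside = 1.0 - np.sum(((points - center) / radii) ** 2, axis=-1)
    return float(np.sum(np.maximum(0.0, inside)))

def total_loss(pred_pos, pred_curv, gt_pos, gt_curv, ellipsoids, w_col=1.0):
    """Position L2 + curvature L2 + weighted collision penalty over all ellipsoids."""
    col = sum(collision_loss(pred_pos, c, r) for c, r in ellipsoids)
    return l2_loss(pred_pos, gt_pos) + l2_loss(pred_curv, gt_curv) + w_col * col

center = np.zeros(3)
radii = np.array([1.0, 2.0, 1.0])
points = np.array([[0.0, 0.0, 0.0],   # at the centre: penalised
                   [0.0, 3.0, 0.0]])  # outside: no penalty
curv = np.zeros(2)
penalty = collision_loss(points, center, radii)
loss = total_loss(points, curv, points, curv, [(center, radii)])
```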

The ellipsoids used for collision testing

Evaluation and Conclusions

To evaluate the defined method and approach towards the problem of 3D hair reconstruction, the researchers use quantitative as well as qualitative evaluation metrics. In fact, for the quantitative analysis, they compute the reconstruction loss of the visible and the non-visible part of the hair separately to make a comparison. They create a synthetic test set with 100 random hair models and 4 images rendered from random views for each hair model. The results and the comparison with existing methods are given in the following table.

Comparison of the proposed method with already existing methods. Divided into subgroups of visible and invisible parts
Space and time complexity of the method and comparison to Chai et al. approach

On the other hand, to evaluate the performance of the proposed approach qualitatively, the researchers test a few real portrait photographs as input and show that the method is able to handle different shapes (short, medium, long hair) as well as to reconstruct different levels of curliness within hairstyles.

Comparison on 4 real portrait images
Comparison of 2 real portrait images

Moreover, they also test smooth sampling and interpolation. They show that their model is able to smoothly interpolate between hairstyles (from straight to curly or short to long).
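One common way to obtain such transitions with an encoder-decoder is to interpolate linearly between the latent vectors of two hairstyles and decode each intermediate vector. Whether the authors do exactly this is not stated here, so the sketch below is an assumption; it covers only the interpolation step, and the decoder is left out.

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps):
    """Return `steps` latent vectors evenly spaced from z_a to z_b (inclusive)."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_a + t * z_b

z_short = np.zeros(512)  # stand-in for the encoding of a short hairstyle
z_long = np.ones(512)    # stand-in for the encoding of a long hairstyle
path = interpolate_latents(z_short, z_long, steps=5)
```

Each row of `path` would then be passed through the decoder to produce one intermediate hairstyle.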

Interpolation results between two hairstyles (short and long)

Overall, the proposed method is interesting in many ways. It shows that an end-to-end network architecture is able not only to reconstruct 3D hair from a 2D image, which is impressive in itself, but also to transition smoothly between hairstyles via interpolation, thanks to the employed encoder-decoder architecture.

Dane Mitrev

FAIR Proposed a New Partially Supervised Training Paradigm to Segment Every Thing

26 June 2018

Object detectors have become significantly more accurate and gained new capabilities. One of the most exciting is the ability to predict a foreground segmentation mask for each detected object, a task called instance segmentation. In practice, typical instance segmentation systems are restricted to a narrow slice of the vast visual world that includes only around 100 object categories. A principal reason for this limitation is that state-of-the-art instance segmentation algorithms require strong supervision and such supervision may be limited and expensive to collect for new categories. By comparison, bounding box annotations are more abundant and cheaper.

FAIR (Facebook AI Research) introduced a new partially supervised instance segmentation task and proposed a novel transfer learning method to address it. The partially supervised instance segmentation task is defined as follows:

  1. Given a set of categories of interest, a small subset has instance mask annotations, while the other categories have only bounding box annotations.
  2. The instance segmentation algorithm should utilize this data to fit a model that can segment instances of all object categories in the set of interest.

Since the training data is a mixture of strongly annotated examples (those with masks) and weakly annotated examples (those with only boxes), the task is referred to as partially supervised. To address partially supervised instance segmentation, the authors propose a novel transfer learning approach built on Mask R-CNN. Mask R-CNN is well-suited to this task because it decomposes the instance segmentation problem into the subtasks of bounding box object detection and mask prediction.

Learning to Segment Every Thing

Let C be the set of object categories for which an instance segmentation model is to be trained. Typically, all training examples in C are annotated with instance masks. Here, it is instead assumed that C = A ∪ B, where samples from the categories in A have masks, while those in B have only bounding boxes. Since the examples of the B categories are weakly labeled w.r.t. the target task (instance segmentation), training on the combination of strong and weak labels is referred to as a partially supervised learning problem. Given an instance segmentation model like Mask R-CNN that has a bounding box detection component and a mask prediction component, the authors propose the Mask^X R-CNN method, which transfers category-specific information from the model’s bounding box detectors to its instance mask predictors.

Detailed illustration of proposed Mask^X R-CNN method.

This method is built on Mask R-CNN because it is a simple instance segmentation model that also achieves state-of-the-art results. In Mask R-CNN, the last layer in the bounding box branch and the last layer in the mask branch both contain category-specific parameters that are used to perform bounding box classification and instance mask prediction, respectively, for each category. Instead of learning the category-specific bounding box parameters and mask parameters independently, the authors propose to predict a category’s mask parameters from its bounding box parameters using a generic, category-agnostic weight transfer function that can be jointly trained as part of the whole model.

For a given category c, let w(det) be the class-specific object detection weights in the last layer of the bounding box head, and w(seg) be the class-specific mask weights in the mask branch. Instead of treating w(seg) as model parameters, w(seg) is parameterized using a generic weight prediction function T (·):

w(seg) = T(w(det); θ)

where θ are class-agnostic, learned parameters. The same transfer function T(·) may be applied to any category c and, thus, θ should be set such that T generalizes to classes whose masks are not observed during training.

T (·) can be implemented as a small fully connected neural network. Figure 1 illustrates how the weight transfer function fits into Mask R-CNN to form Mask^X R-CNN. Note that the bounding box head contains two types of detection weights: the RoI classification weights w(cls) and the bounding box regression weights w(box).
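To give a feel for what T(·) does, here is a minimal NumPy sketch of a two-layer fully connected network with a LeakyReLU that maps concatenated detection weights to mask weights. The input and output sizes (1024-d, 4096-d and 256-d) are the ones reported for the COCO experiments; the hidden size, the random initialisation, and the function names are assumptions, and no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
CLS_DIM, BOX_DIM, SEG_DIM = 1024, 4096, 256
HIDDEN = 512  # hypothetical hidden size

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def init_transfer(in_dim=CLS_DIM + BOX_DIM, hidden=HIDDEN, out_dim=SEG_DIM):
    """Class-agnostic parameters theta, shared by every category."""
    return {
        "w1": rng.normal(0, 0.01, (in_dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0, 0.01, (hidden, out_dim)), "b2": np.zeros(out_dim),
    }

def transfer(theta, w_cls, w_box):
    """Predict mask weights w(seg) from detection weights w(det) = [w(cls), w(box)]."""
    w_det = np.concatenate([w_cls, w_box], axis=-1)
    h = leaky_relu(w_det @ theta["w1"] + theta["b1"])
    return h @ theta["w2"] + theta["b2"]

theta = init_transfer()
w_cls = rng.normal(size=(80, CLS_DIM))  # one row per category
w_box = rng.normal(size=(80, BOX_DIM))
w_seg = transfer(theta, w_cls, w_box)   # shape (80, 256)
```

Because `theta` does not depend on the category, the same function can be applied to a category from set B whose masks were never seen, as long as its detection weights are available.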

Experiments on COCO

This method is evaluated on the COCO dataset, which is small-scale w.r.t. the number of categories but contains exhaustive mask annotations for 80 categories. This property enables rigorous quantitative evaluation using standard detection metrics, like average precision (AP). Each class has a 1024-d RoI classification parameter vector w(cls) and a 4096-d bounding box regression parameter vector w(box) in the detection head, and a 256-d segmentation parameter vector w(seg) in the mask head. The output mask resolution is M × M = 28 × 28. The table below compares the full Mask^X R-CNN method (i.e., Mask R-CNN with ‘transfer+MLP’ and T implemented as ‘cls+box, 2-layer, LeakyReLU’) with the class-agnostic baseline using end-to-end training.

Experiments on COCO

Mask^X R-CNN outperforms these approaches by a large margin (over 20% relative increase in mask AP).


Mask predictions from the class-agnostic baseline (top row) vs. Mask^X R-CNN approach (bottom row). Green boxes are classes in set A while the red boxes are classes in set B. The left 2 columns are A = {voc} and the right 2 columns are A = {non-voc}.

This research addresses the problem of large-scale instance segmentation by formulating a partially supervised learning paradigm in which only a subset of classes have instance masks during training while the rest have box annotations. FAIR proposes a novel transfer learning approach, where a learned weight transfer function predicts how each class should be segmented based on parameters learned for detecting bounding boxes. Experimental results on the COCO dataset demonstrate that this method significantly improves the generalization of mask prediction to categories without mask training data. This approach helps to build a large-scale instance segmentation model over 3000 classes in the Visual Genome dataset.

Mask predictions from Mask^X R-CNN on 3000 classes in Visual Genome.

Muneeb Ul Hassan

Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning

12 June 2018

An immeasurable amount of multimedia data is recorded and shared in the current era of the Internet. Video is one of the most common and richest modalities, although it is also one of the most expensive to process. Algorithms for fast and accurate video processing thus become crucially important for real-world applications. Video object segmentation, i.e. classifying the set of pixels of a video sequence into the object(s) of interest and background, is among the tasks that, despite having numerous attractive applications, cannot currently be performed at a satisfactory quality level and an acceptable speed.

The problem is modeled in a simple and intuitive, yet powerful and unexplored way. Video object segmentation is formulated as pixel-wise retrieval in a learned embedding space. Ideally, in the embedding space, pixels belonging to the same object instance are close together, and pixels from other objects are further apart. The model is built by training a Fully Convolutional Network (FCN) as the embedding model, using a modified triplet loss tailored for video object segmentation, where no clear correspondence between pixels is given.

This formulation has several main advantages. Firstly, the proposed method is highly efficient: there is no fine-tuning at test time, and processing each frame requires only a single forward pass through the embedding network and a nearest-neighbour search. Secondly, the method is flexible enough to support different types of user input (e.g., clicked points, scribbles, segmentation masks) in a unified framework. Moreover, the embedding process is independent of user input, so the embedding vectors do not need to be recomputed when the user input changes, which makes this method ideal for the interactive scenario.

Interactive Video Object Segmentation: Interactive Video Object Segmentation relies on iterative user interaction to segment the object of interest. Many techniques have been proposed for the task.

Deep Metric Learning: The key idea of deep metric learning is usually to transform the raw features by a network and then compare the samples in the embedding space directly. Usually, metric learning is performed to learn the similarity between images or patches, and methods based on pixel-wise metric learning are limited.

Proposed Architecture

Video object segmentation is formulated as a pixel-wise retrieval problem: for each pixel in the video, we look for the most similar reference pixel in the embedding space and assign that reference pixel's label to it. The method consists of two steps:

  1. First, embed each pixel into a d-dimensional embedding space using the proposed embedding network.
  2. Secondly, perform per-pixel retrieval in embedding space to transfer labels to each pixel according to its nearest reference pixel.
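The retrieval step can be illustrated with a minimal sketch. It assumes the embeddings have already been produced by the network and uses toy 2-dimensional vectors; the function names and the brute-force search are illustrative assumptions written in plain Python for clarity, not the authors' implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def transfer_labels(query_embeddings, reference_embeddings, reference_labels):
    """Assign each query pixel the label of its nearest reference pixel."""
    labels = []
    for query in query_embeddings:
        # Index of the reference embedding closest to this query pixel.
        nearest = min(
            range(len(reference_embeddings)),
            key=lambda k: squared_distance(query, reference_embeddings[k]),
        )
        labels.append(reference_labels[nearest])
    return labels


# Toy embeddings with d = 2; real embeddings would come from the network.
reference_embeddings = [(0.0, 0.0), (1.0, 1.0)]
reference_labels = ["background", "object"]
query_embeddings = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.3)]

predicted = transfer_labels(query_embeddings, reference_embeddings, reference_labels)
```

In practice the search would run over many thousands of pixels per frame, so a vectorised or approximate nearest-neighbour search would likely replace the loop shown here.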

Embedding Network

User input to fine-tune the model: The first way is to fine-tune the network to the specific object based on the user input. For example, techniques such as OSVOS or MaskTrack fine-tune the network at test time based on the user input. When processing a new video, they require many iterations of training to adapt the model to the specific target object. This approach can be time-consuming (seconds per sequence) and is therefore impractical for real-time applications, especially with a human in the loop.

User input as the network input: Another way of injecting user interaction is to use it as an additional input to the network. In this way, no training is performed at test time. A drawback of these methods is that the network output has to be recomputed every time the user input changes. This can still take a considerable amount of time, especially for video, given the large number of frames.

In contrast to the two approaches above, the proposed work disentangles user input from the network computation. The forward pass of the network therefore needs to be computed only once. The only computation after the user input is the nearest-neighbour search, which is very fast and enables a rapid response to the user input.
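A small sketch can make this separation concrete. The class below is hypothetical: it stores embeddings that stand in for the output of the single forward pass and reruns only the nearest-neighbour search when a new annotation arrives. The class name, method names, and toy one-dimensional embeddings are assumptions made for illustration.

```python
class InteractiveSegmenter:
    """Keeps per-pixel embeddings fixed and relabels as user input changes."""

    def __init__(self, embeddings):
        # Stand-in for the single forward pass: embeddings are stored once.
        self.embeddings = list(embeddings)
        self.references = {}  # pixel index -> label given by the user

    def annotate(self, pixel_index, label):
        """Record a user annotation; no embedding is recomputed."""
        self.references[pixel_index] = label

    def segment(self):
        """Label every pixel by its nearest annotated pixel in embedding space."""
        if not self.references:
            raise ValueError("at least one annotated pixel is required")

        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [
            self.references[
                min(self.references, key=lambda r: dist(e, self.embeddings[r]))
            ]
            for e in self.embeddings
        ]


# Toy one-dimensional embeddings for five pixels of one frame.
segmenter = InteractiveSegmenter([(0.0,), (0.1,), (0.4,), (0.9,), (1.0,)])
segmenter.annotate(0, "background")
segmenter.annotate(4, "object")
first_pass = segmenter.segment()

# The user corrects pixel 2; only the nearest-neighbour search is rerun.
segmenter.annotate(2, "object")
second_pass = segmenter.segment()
```

The stored embeddings are never touched after construction, which is the property that keeps each round of interaction cheap.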

Embedding Model: The proposed embedding model f represents each pixel x_{j,i} as a d-dimensional embedding vector e_{j,i} = f(x_{j,i}). Ideally, pixels belonging to the same object are close to each other in the embedding space, and pixels belonging to different objects are distant from each other. The embedding model is built on DeepLab with a ResNet backbone.

  1. First, the network is pre-trained for semantic segmentation on COCO.
  2. Secondly, the final classification layer is removed and replaced with a new convolutional layer with d output channels.
  3. Finally, the network is fine-tuned to learn the embedding for video object segmentation.

The DeepLab architecture serves as the base feature extractor, with two convolutional layers as the embedding head. The resulting network is fully convolutional, so the embedding vectors of all pixels in a frame can be obtained in a single forward pass. For an image of size h × w pixels the output is a tensor of shape [h/8, w/8, d], where d is the dimension of the embedding space. Since an FCN is used as the embedding model, spatial and temporal information is not preserved, because the convolution operation is translation invariant. To compensate, the embedding function also takes the pixel position and the frame number as inputs. Formally, the embedding function can be represented as:

e_{j,i} = f(x_{j,i}, i, j)

where i and j refer to the i-th pixel in frame j. The network is trained with a modified triplet loss:

Modified Triplet
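As a rough illustration, a triplet-style loss over pixel embeddings could be sketched as follows. The sketch assumes that, because no pixel-to-pixel correspondence is given, each anchor is compared against its nearest embedding in a pool of same-object pixels and its nearest embedding in a pool of other pixels. The function name, the margin alpha, the hinge, and the way the pools are built are illustrative assumptions rather than the authors' exact formulation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pooled_triplet_loss(anchors, positive_pool, negative_pool, alpha=1.0):
    """Sum over anchors of a hinge on (nearest positive - nearest negative + alpha)."""
    total = 0.0
    for anchor in anchors:
        # Closest same-object embedding and closest other-object embedding.
        nearest_pos = min(squared_distance(anchor, p) for p in positive_pool)
        nearest_neg = min(squared_distance(anchor, n) for n in negative_pool)
        # Penalise anchors whose nearest negative is not at least alpha
        # further away than their nearest positive.
        total += max(0.0, nearest_pos - nearest_neg + alpha)
    return total


anchors = [(0.0, 0.0)]
positive_pool = [(0.0, 1.0), (3.0, 0.0)]

# Negatives far from the anchor: the margin is satisfied.
well_separated = pooled_triplet_loss(anchors, positive_pool, [(2.0, 0.0), (5.0, 5.0)])

# A negative closer than the nearest positive: the loss is positive.
too_close = pooled_triplet_loss(anchors, positive_pool, [(0.5, 0.0)])
```

Taking the minimum over each pool means an anchor only needs to be close to some pixel of its own object, not to every one of them, which suits objects whose appearance varies across the video.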

The proposed method is evaluated on the DAVIS 2016 and DAVIS 2017 datasets in both the semi-supervised and interactive scenarios. In semi-supervised Video Object Segmentation (VOS), the fully annotated mask of the first frame is provided as input.

Evaluation results on DAVIS 2016 validation set


Illustration of pixel-wise feature distribution

This work presents a conceptually simple yet highly effective method for video object segmentation. The problem is cast as pixel-wise retrieval in an embedding space learned via a modification of the triplet loss designed specifically for video object segmentation. In this way, the annotated pixels in the video (via scribbles, a segmentation mask on the first frame, clicks, etc.) serve as the reference samples, and the remaining pixels are classified via a simple and fast nearest-neighbour approach.

Video object segmentation

Muneeb Ul Hassan