New Datasets for Object Tracking

8 November 2018


Object tracking in the wild is far from being solved. Existing object trackers do quite a good job on the established datasets (e.g., VOT, OTB), but these datasets are relatively small and do not fully represent the challenges of real-life tracking tasks. Deep learning is at the core of most state-of-the-art trackers today. However, a dedicated large-scale dataset for training deep trackers is still lacking.

In this article, we discuss three recently introduced datasets for object tracking. They differ in scale, annotations and other characteristics, but each of them can contribute something to solving the object tracking problem: TrackingNet is the first large-scale dataset for object tracking in the wild, MOT17 is a benchmark for multiple object tracking, and Need for Speed is the first higher frame rate video dataset.


TrackingNet

Number of videos: 30,132 (train) + 511 (test)

Number of annotations: 14,205,677 (train) + 225,589 (test)

Year: 2018

Examples from TrackingNet test set

TrackingNet is the first large-scale dataset for object tracking in the wild. It includes over 30K videos with an average duration of 16.6s and more than 14M dense bounding box annotations. The dataset is not limited to a specific context but instead covers a wide selection of object classes in broad and diverse contexts. TrackingNet has a number of notable advantages:

  • the large scale of the dataset enables the development of deep architectures designed specifically for tracking;
  • because it was created specifically for object tracking, the dataset enables model architectures to focus on the temporal context between consecutive frames;
  • the dataset was sampled from YouTube videos and thus represents real-world scenarios and contains a large variety of frame rates, resolutions, contexts and object classes.

The TrackingNet training set was derived from YouTube-Bounding Boxes (YT-BB), a large-scale dataset for object detection with roughly 300K video segments, annotated every second with upright bounding boxes. To build TrackingNet, the researchers filtered out 90% of the videos, keeping only those that a) are longer than 15 seconds; b) include bounding boxes that cover less than 50% of the frame; c) contain a reasonable amount of motion between bounding boxes.
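The three selection criteria can be sketched as a simple filter. This is only an illustration: the `Segment` structure, the motion measure and the `min_motion` threshold are assumptions, since the article does not say how "a reasonable amount of motion" was quantified.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A box is (x, y, w, h) in pixels; one box per annotated second, as in YT-BB.
Box = Tuple[float, float, float, float]


@dataclass
class Segment:
    """Hypothetical container for one annotated video segment."""
    duration_s: float
    frame_w: int
    frame_h: int
    boxes: List[Box]


def mean_center_shift(seg: Segment) -> float:
    """Average displacement of the box centre between consecutive annotations,
    normalised by the frame diagonal (an assumed motion measure)."""
    diag = (seg.frame_w ** 2 + seg.frame_h ** 2) ** 0.5
    centres = [(x + w / 2, y + h / 2) for x, y, w, h in seg.boxes]
    shifts = [
        ((cx2 - cx1) ** 2 + (cy2 - cy1) ** 2) ** 0.5 / diag
        for (cx1, cy1), (cx2, cy2) in zip(centres, centres[1:])
    ]
    return sum(shifts) / len(shifts) if shifts else 0.0


def keep_segment(seg: Segment, min_motion: float = 0.01) -> bool:
    """Apply the three criteria; min_motion is an assumed threshold."""
    frame_area = seg.frame_w * seg.frame_h
    long_enough = seg.duration_s > 15.0
    small_boxes = all(w * h < 0.5 * frame_area for _, _, w, h in seg.boxes)
    moving = mean_center_shift(seg) >= min_motion
    return long_enough and small_boxes and moving
```

A segment passes only if all three checks hold, so a long video with a static object or with a box filling most of the frame is discarded.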

To increase the annotation density beyond the 1 fps provided by YT-BB, the creators of TrackingNet rely on a mixture of state-of-the-art trackers. They claim that any tracker is reliable over a short interval of 1 second. So, they densely annotated 30,132 videos using a weighted average between a forward and a backward pass of the DCF tracker. The code for automatically downloading the videos from YouTube and extracting the annotated frames is also available.
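The idea of blending a forward and a backward tracking pass can be illustrated as follows. The linear weighting by temporal distance to each annotated frame is an assumption made for this sketch; the article does not specify the exact weights, and `fuse_tracks` is a hypothetical helper name.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def fuse_tracks(forward: List[Box], backward: List[Box]) -> List[Box]:
    """Blend two box trajectories that cover the same frames.

    Frame 0 and frame n-1 are the annotated anchors. The forward pass is
    trusted most near frame 0 and the backward pass most near frame n-1, so
    the backward weight grows linearly from 0 to 1 (an assumed scheme).
    """
    if len(forward) != len(backward):
        raise ValueError("both passes must cover the same frames")
    n = len(forward)
    fused = []
    for i, (f, b) in enumerate(zip(forward, backward)):
        w_back = i / (n - 1) if n > 1 else 0.0
        fused.append(tuple((1.0 - w_back) * fv + w_back * bv for fv, bv in zip(f, b)))
    return fused
```

With this weighting, drift accumulated by either pass is damped by the other pass, which started from the opposite annotated frame.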

Comparison of tracking datasets across the number of videos, the average length of the videos, and the number of annotated bounding boxes (reflected with the circle’s size)

Finally, the TrackingNet dataset comes with a new benchmark composed of 511 novel YouTube videos with a Creative Commons license, named YT-CC. These videos have the same object class distribution as the training set and are annotated with the help of Amazon Mechanical Turk workers. With tight supervision in the loop, the TrackingNet team ensured the quality of the annotations over a few iterations, discouraging bad annotators and rewarding good ones.

Thus, by sequestering the annotations of the test set and maintaining an online evaluation server, the researchers behind TrackingNet provide a fair benchmark for the development of object trackers.


MOT17

Number of videos: 21 (train) + 21 (test)

Number of annotations: 564,228

Year: 2017

Examples from the MOT17 dataset

MOT17 (Multiple Object Tracking) is an extended version of the MOT16 dataset with new and more accurate ground truth. As is evident from its name, this dataset focuses specifically on multi-target tracking. It should also be noted that the context of the MOTChallenge datasets, including MOT17, is limited to street scenes.

The new MOT17 benchmark includes a set of 42 sequences with crowded scenarios, camera motion, and varying weather conditions. The annotations for all sequences were created from scratch by qualified researchers following a strict protocol. Moreover, to ensure the highest annotation accuracy, all the annotations were double-checked. Another thing that distinguishes this dataset from earlier MOTChallenge datasets is that not only pedestrians are annotated, but also vehicles, sitting people, occluding objects, and other significant object classes.

An overview of annotated classes and example of an annotated frame

The researchers have defined some classes as targets (shown in orange in the image above); these are the central classes to evaluate on. The red classes include ambiguous cases, so neither recovering nor missing them is penalized in the evaluation. Finally, the classes in green are annotated for training purposes and for computing the occlusion level of all pedestrians.

The example annotated frame shows that partially cropped objects are also annotated beyond the frame boundary. Also, note that the bounding box encloses the entire person but not the pedestrian’s white bag.

Rich ground truth information provided within the MOT17 dataset can be very useful for developing more accurate tracking methods and advancing the field further.


Need for Speed (NfS)

Number of videos: 100

Number of annotations: 383,000

Year: 2017

The effect of tracking higher frame rate videos

NfS (Need for Speed) is the first higher frame rate video dataset and benchmark for visual object tracking. It includes 100 videos comprising 380K frames, captured with 240 FPS cameras, which are now often used in real-world scenarios.

Specifically, 75 videos were captured using the iPhone 6 (and above) and the iPad Pro, while 25 videos were taken from YouTube. The tracking targets include vehicles, humans, faces, animals, aircraft, boats and generic objects such as sports balls, cups, bags, etc.

All frames in the NfS dataset are annotated with axis-aligned bounding boxes using the VATIC toolbox. Moreover, all videos are manually labeled with nine visual attributes: occlusion, illumination variation, scale variation, object deformation, fast motion, viewpoint change, out of view, background clutter, and low resolution.

Comparing lower frame rate (green boxes) to higher frame rate (red boxes) tracking. Ground truth is shown by blue boxes

The NfS benchmark provides a great opportunity to evaluate state-of-the-art trackers on higher frame rate sequences. In fact, this dataset has already revealed some surprising results: at higher frame rates, simple trackers such as correlation filters apparently outperform complex deep learning algorithms.

Bottom Line

Because dedicated large-scale tracking datasets are scarce, object trackers based on deep learning are forced to rely on object detection datasets instead of dedicated object tracking ones. Of course, this limits advances in the object tracking field. Fortunately, the recently introduced object tracking datasets, especially the large-scale TrackingNet dataset, give data-hungry trackers great opportunities for significant performance improvements.

3D Hair Reconstruction Out of In-the-Wild Videos

22 October 2018

3D hair reconstruction is a problem with numerous applications in areas such as Virtual Reality, Augmented Reality, video games, medical software, etc. Because the problem is non-trivial, researchers have proposed various solutions in the past, some more successful than others. Generating a realistic 3D hair model is a challenge even in controlled, relatively sterile environments. Generating a 3D hair model in the wild from ordinary photos or videos is therefore even more challenging.

Previous works

Recently, we wrote about an approach for realistic 3D hair reconstruction from a single image. Methods of this kind work reasonably well, but they fail to produce high-fidelity 3D hair models because a single view leaves the problem limited and ambiguous. Other approaches use multiple images or views and yield improved results at the cost of a more complex solution. These approaches require controlled environments with 360-degree views of the person and multiple images.

Additionally, some approaches require input such as hair segmentation, making the whole process of 3D hair reconstruction more cumbersome.

State-of-the-art idea

A new approach proposed by researchers from the University of Washington can take in an in-the-wild video and automatically output a full head model with a 3D hair-strand model. The input to the proposed method is a video whose frames are used by a few components to produce hair strands that are estimated and deformed in 3D (rather than in 2D, as in prior state-of-the-art methods), thus enabling superior results.


The method is composed of 4 components which are shown in the illustration below:

A: Module which uses structure from motion to get relative camera poses, depth maps and a visual hull shape with view-confidence values.

B: Module in which hair segmentation and gradient direction networks are trained to apply on each frame and obtain 2D strands.

C: The segmentations from module B are used to recover the texture of the face area, and a 3D face morphable model is used to estimate face and bald head shapes.

D: The last module and the core of the algorithm where the depth maps and 2D strands are used to create 3D strands. These 3D strands are used to query a hair database and the strands from the best match are refined both globally and locally to fit the input frames from the video.

In this way, a robust and flexible method is obtained which can successfully recover 3D hair strands from in-the-wild video frames.

The proposed method’s architecture showing the four components

Module A: The first module is used to obtain a rough head shape. Each frame of the video is pre-processed using semantic segmentation to separate the person from the background. The goal here is to estimate the camera pose per frame and to create a rough initial structure from all the frames.

After pre-processing and background removal, the moving head is reconstructed across all frames using a structure-from-motion approach, which estimates the camera pose and depth for every frame of the video. The output of this module is a rough visual hull shape.

Module B: The second module contains trained hair segmentation and hair direction classifiers, which label the hair pixels in each video frame and predict the hair direction at those pixels. It is inspired by the strand direction estimation method of Chai et al. 2016.

Hair segmentation, directional labels and 2D hair strands of example video frames

Module C: In this module, the segmented frames are used to select the frame that is closest to a frontal face (where yaw and pitch are approximately 0), and this frame is fed to a morphable-model-based face model estimator.

Module D: The last module, which is in fact the core of the method, estimates 3D hair strands using the outputs of modules A, B and C. Since each frame already has an estimate of the 2D strands, an initial estimate of the 3D strands is obtained by projecting the 2D strands onto the depth maps. Because these initial strands are incomplete, they are then used to query a database of 3D hair models. In their work, the researchers use the hair dataset created by Chai et al. 2016, which contains 35,000 different hairstyles, each hairstyle model consisting of more than 10,000 hair strands. Finally, a global and a local deformation are applied to refine the obtained 3D hair strands.
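Lifting a 2D strand to 3D with a depth map can be sketched as a back-projection. The pinhole camera model, the intrinsics `fx`, `fy`, `cx`, `cy` and the function name `lift_strand` are assumptions made for illustration; the article does not describe the camera model the authors use, and the step that brings the points of different frames into a common coordinate system with the estimated camera poses is omitted here.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def lift_strand(
    strand_2d: Sequence[Point2D],
    depth_map: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Point3D]:
    """Back-project a 2D strand (pixel coordinates) into camera space.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal point
    (cx, cy); depth_map[v][u] is the depth at column u, row v.
    Points that fall outside the depth map are skipped.
    """
    points = []
    for u, v in strand_2d:
        col, row = int(round(u)), int(round(v))
        if not (0 <= row < len(depth_map) and 0 <= col < len(depth_map[0])):
            continue
        z = depth_map[row][col]
        points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

Strands lifted this way only cover hair that is visible and has valid depth, which is consistent with the article's remark that the initial strands are incomplete and need a database lookup.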

The local and global transformation applied to the 3D hair strands


To evaluate the proposed approach, the researchers use both quantitative and qualitative evaluation as well as a human study. The quantitative comparison is made by projecting the reconstructed hair as lines onto the images and computing the intersection-over-union (IoU) with the ground truth hair mask per frame. The results are shown in the table below. A larger IoU means that the reconstructed hair approximates the input better.
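For a single frame, the intersection-over-union between the projected hair and the ground truth mask can be computed as below. Representing each mask as a set of pixel coordinates is a choice made for this sketch, and returning 1.0 for two empty masks is a convention, not something stated in the article.

```python
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)


def mask_iou(predicted: Iterable[Pixel], ground_truth: Iterable[Pixel]) -> float:
    """Intersection-over-union of two sets of 'hair' pixels for one frame."""
    pred: Set[Pixel] = set(predicted)
    truth: Set[Pixel] = set(ground_truth)
    union = pred | truth
    if not union:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return len(pred & truth) / len(union)
```

Averaging this value over all frames of a video would give a per-video score comparable across methods.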

This figure shows the results compared to the state-of-the-art methods

The approach was evaluated qualitatively against some state-of-the-art methods. Moreover, human preference tests using Mechanical Turk have been done, and the results are shown in the tables.

This figure shows four example frames comparing the silhouettes of the reconstructed hairstyles to the hair segmentation results.
The ratio of preference of the methods’ results over total compared to Hu et al. 2017 based on Amazon Mechanical Turk tests.
The ratio of preference of the methods’ results over total compared to Zhang et al. 2017 based on Amazon Mechanical Turk tests.


In this paper, researchers from the University of Washington proposed a fully automatic method for 3D hair reconstruction from in-the-wild videos, which has a wide variety of potential applications. Although the method itself is quite complex and involves many steps, the results are more than satisfactory. The approach shows that higher fidelity can be obtained by incorporating information from multiple video frames in which slightly different views are present. The proposed system exploits this to reconstruct a 3D hair model without being restricted to specific views and head poses.

Vid2Vid – Conditional GANs for Video-to-Video Synthesis

3 September 2018


Researchers from NVIDIA and MIT’s Computer Science and Artificial Intelligence Lab have proposed a novel method for video-to-video synthesis, showing impressive results. The proposed method, Vid2Vid, can synthesize high-resolution, photorealistic, temporally coherent videos from a diverse set of input formats including segmentation masks, sketches, and poses.

Previous works

Due to the inherent complexity of video-to-video synthesis, this topic has remained relatively unexplored. Compared to its image counterpart, image-to-image synthesis, far fewer studies have tackled this problem.

Arguing that a general-purpose solution to video-to-video synthesis has not yet been explored in prior work (as opposed to image-to-image translation), the researchers benchmark their approach against a strong baseline that combines a state-of-the-art video style transfer algorithm with a state-of-the-art image-to-image translation approach.

State-of-the-art idea

The general idea is to learn a mapping function that can convert an input video to a realistic output video. They frame this problem as a distribution matching problem, and they leverage the generative adversarial learning framework to produce a method that can generate photorealistic videos given an input video (such as a sequence of segmentation masks, poses, sketches, etc.).


Vid2Vid Method

As I mentioned before, the authors proposed a method based on Generative Adversarial Networks. They tackle the complex problem of video-to-video translation or video-to-video synthesis in a really impressive way by carefully designing an adversarial learning framework.

Their goal is to learn a mapping function that will map a sequence of input (source) images to a series of realistic output images, where the conditional distribution of the generated sequence given the source sequence is identical to the distribution of the original sequence given the source sequence:
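With notation assumed here for illustration (s_1^T for the source sequence, x_1^T for the corresponding real video and x̃_1^T for the generated one), this condition can be written as:

```latex
p\left(\tilde{x}_1^T \mid s_1^T\right) = p\left(x_1^T \mid s_1^T\right)
```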


Matching the distributions forces the method to learn to produce realistic and temporally coherent output videos. In a generative adversarial learning context, a Generator-Discriminator framework is designed to learn the mapping function. The generator is trained by solving an optimization problem: minimizing the Jensen-Shannon divergence between the two distributions. A minimax optimization is applied to a defined objective function:
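For reference, a conditional GAN objective of this kind typically takes the following form. This is a sketch in assumed notation rather than a quotation: F is the generator, D the discriminator, s_1^T the source sequence and x_1^T the real video.

```latex
\max_{D} \min_{F} \;
\mathbb{E}_{(x_1^T,\, s_1^T)}\left[\log D\left(x_1^T, s_1^T\right)\right]
+ \mathbb{E}_{s_1^T}\left[\log\left(1 - D\left(F(s_1^T), s_1^T\right)\right)\right]
```

The discriminator tries to score real pairs high and generated pairs low, while the generator tries to make its outputs indistinguishable from real videos for the same source.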


As the authors note in their paper, and as is widely known, optimizing such an objective function is a very challenging task. Training the generator and discriminator models often becomes very unstable, or even impossible, depending on the optimization problem being solved. Therefore, they propose a simplified sequential generator based on a few assumptions. They make a Markov assumption to factorize the conditional distribution and decouple the dependencies across frames in the sequences.

In simple words, they simplify the problem by assuming that the video frames can be generated sequentially and the generation of the t-th frame only depends on three things: 1) the current source image, 2) the past L source images, and 3) the past L generated images.
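Under this assumption, the conditional distribution of the whole generated video factorizes over frames. In notation assumed here for illustration (x̃_t for the t-th generated frame, s_t for the t-th source image, and a_i^j for the sub-sequence a_i, ..., a_j), this can be written as:

```latex
p\left(\tilde{x}_1^T \mid s_1^T\right)
= \prod_{t=1}^{T} p\left(\tilde{x}_t \mid \tilde{x}_{t-L}^{t-1},\, s_{t-L}^{t}\right)
```

Each factor depends only on the current source image, the past L source images and the past L generated images, which matches the three dependencies listed above.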

From this point, they design a feed-forward network F to learn a mapping from the current and past L source images and the past L generated images to a newly generated output image.


To model such a network, the researchers make another assumption: if the optical flow from the current frame to the next frame is known, it can be used to warp the current frame to generate an estimate of the next frame. Arguing that this holds everywhere except in occluded areas (where it is unknown what is happening), they build the model around optical flow warping.

The estimation network F is modeled in such a way to take into account a given occlusion mask, the estimated optical flow between the previous and the current image (which is given by an optical flow estimation function) and a hallucinated image (generated from scratch). The hallucinated image is necessary to fill the areas under occlusions.
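The way the three ingredients combine can be sketched on tiny grayscale images stored as nested lists. This is an illustration only: in the actual method the flow, the mask and the hallucinated image are predicted by networks, whereas here they are plain inputs, and the backward-warping convention and nearest-neighbour sampling are assumptions of this sketch.

```python
from typing import List, Tuple

Image = List[List[float]]                  # grayscale, image[row][col]
Flow = List[List[Tuple[float, float]]]     # flow[row][col] = (dx, dy) in pixels


def warp(prev: Image, flow: Flow) -> Image:
    """Backward-warp prev: output pixel (r, c) samples prev at (r + dy, c + dx),
    using nearest-neighbour lookup clamped to the image border."""
    h, w = len(prev), len(prev[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            dx, dy = flow[r][c]
            src_r = min(max(int(round(r + dy)), 0), h - 1)
            src_c = min(max(int(round(c + dx)), 0), w - 1)
            row.append(prev[src_r][src_c])
        out.append(row)
    return out


def compose_frame(prev: Image, flow: Flow, hallucinated: Image, mask: Image) -> Image:
    """mask = 1 where the pixel is occluded (use the hallucinated image),
    mask = 0 where the warped previous frame can be trusted."""
    warped = warp(prev, flow)
    return [
        [(1.0 - m) * wv + m * hv for m, wv, hv in zip(m_row, w_row, h_row)]
        for m_row, w_row, h_row in zip(mask, warped, hallucinated)
    ]
```

A soft mask with values between 0 and 1 blends the two sources, so the hallucinated content only takes over where warping has nothing reliable to offer.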

Similarly to the image case, where many generative adversarial methods use local discriminators in addition to a global one, the authors propose an interesting approach that uses a conditional image discriminator and a conditional video discriminator.

Additionally, the researchers extend their approach to multimodal synthesis. They propose a generative method based on a feature embedding scheme and Gaussian mixture models, which outputs several videos with different visual appearances depending on which feature vectors are sampled.


The proposed method yields impressive results. It was tested on several datasets such as Cityscapes, Apolloscape, Face video dataset, Dance video dataset. Moreover, two strong baselines were used for comparison: pix2pix method and modified video style transfer method (CONVST) where they changed the stylization network to pix2pix.

Face vid2vid

Pose-to-body vid2vid



Comparison with other state-of-the-art methods

They use both subjective and objective metrics for performance evaluation: a human preference score and the Fréchet Inception Distance (FID). A comparison between the proposed method and other methods is given in the tables.

Future video prediction results. Top left: ground truth. Top right: PredNet. Bottom left: MCNet. Bottom right: proposed.
Example output from the multimodal video synthesis.


A new state-of-the-art method for video synthesis has been proposed. The conditional GAN-based approach shows impressive results on several different tasks within the scope of video-to-video translation. Such methods have numerous applications in computer vision, robotics, and computer graphics. Using a learned video synthesis model, one can generate realistic videos for many different purposes.

Everybody Dance Now: a New Approach to “Do As I Do” Motion Transfer

30 August 2018

Not very good at dancing? Not a problem anymore! Now you can easily impress your friends with a stunning video, where you dance like a superstar.

Researchers from UC Berkeley propose a new method of transferring motion between human subjects in different videos. They claim that given a source video of a person dancing they can transfer that performance to a novel target after only a few minutes of the target subject performing standard moves.

But let’s first look at the previous approaches to this kind of task.

Previous Works

Motion transfer, or retargeting, has received considerable attention from researchers over the last two decades. Early methods created new content by manipulating existing video footage.

So, what is the idea behind this new approach?

State-of-the-art Idea

This method poses the problem as a per-frame image-to-image translation with spatiotemporal smoothing. The researchers use pose detection, represented with the pose stick figures, as an intermediate representation between source and target. The aligned data is used to learn an image-to-image translation model between pose stick figures and images of the target person in a supervised way.

Two additional components improve the results: conditioning the prediction at each frame on that of the previous time step for temporal smoothness and a specialized GAN for realistic face synthesis.

Before diving deeper into the architecture of the suggested approach, let’s check the results with this short video:

So, in essence, the model is trained to produce personalized videos of a specific target subject. Motion transfer occurs when the pose stick figures go into the trained model to obtain images of the target subject in the same pose as the source.


The pipeline of the suggested approach includes three stages:

  1. Pose detection – using a pretrained state-of-the-art pose detector to create pose stick figures from the source video.
  2. Global pose normalization – accounting for differences between the source and target subjects in body shapes and locations within the frame.
  3. Mapping from normalized pose stick figures to the target subject.

Here is an overview of the method:

Method overview

For training, the model uses a pose detector P to create pose stick figures from video frames of the target subject. Then, the mapping G is learned alongside an adversarial discriminator D, which attempts to distinguish between the “real” correspondence pair (x, y) and the “fake” pair (G(x), y).

Next, for transfer, the pose detector P is used to obtain pose joints for the source person. These are then transformed by the normalization process Norm into joints for the target person, from which pose stick figures are created. Finally, the trained mapping G is applied.
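One simple way to realize such a normalization is to rescale the source joints by the ratio of the subjects' heights and to move the feet to the target's ground line. This is a hypothetical sketch: the function name, the use of body height and ankle position as reference measurements, and keeping the mean horizontal position fixed are all assumptions, since the article only says that body shape and location differences are accounted for.

```python
from typing import List, Tuple

Joint = Tuple[float, float]  # (x, y) in pixels, y grows downwards


def normalize_pose(
    joints: List[Joint],
    source_height: float, source_ankle_y: float,
    target_height: float, target_ankle_y: float,
) -> List[Joint]:
    """Rescale source joints to the target's size and move the feet to the
    target's ground line. Heights and ankle positions are assumed to have
    been measured beforehand from each subject's video."""
    scale = target_height / source_height
    # Keep the horizontal centre of the pose fixed while scaling (assumption).
    anchor_x = sum(x for x, _ in joints) / len(joints)
    return [
        (anchor_x + (x - anchor_x) * scale,
         target_ankle_y + (y - source_ankle_y) * scale)
        for x, y in joints
    ]
```

The normalized joints can then be drawn as a stick figure and passed to the trained mapping G.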

The researchers base their method on the objective presented in pix2pixHD with some extensions to produce temporally coherent video frames and synthesize realistic face images.

Temporal Smoothing

To create video sequences, they modify the single-image generation setup to enforce temporal coherence between adjacent frames, as shown in the figure below.

Temporal smoothing setup

In brief, the current frame G(x_t) is conditioned on its corresponding pose stick figure x_t and the previously synthesized frame G(x_{t−1}) to obtain temporally smooth outputs. Discriminator D then attempts to differentiate the “real” temporally coherent sequence (x_{t−1}, x_t, y_{t−1}, y_t) from the “fake” sequence (x_{t−1}, x_t, G(x_{t−1}), G(x_t)).
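At inference time this conditioning turns frame generation into a simple loop, sketched below. The generator here is a toy stand-in (`toy_generator`), frames are tiny flattened lists, and conditioning the first frame on an all-zero image is an assumption of this sketch.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # a flattened image, kept tiny for illustration


def synthesize_video(
    generator: Callable[[Frame, Frame], Frame],
    poses: Sequence[Frame],
) -> List[Frame]:
    """Generate frames one by one; each call sees the current pose stick
    figure x_t and the previously synthesized frame G(x_{t-1}).
    The first frame is conditioned on an all-zero image (an assumption here)."""
    outputs: List[Frame] = []
    previous = [0.0] * len(poses[0]) if poses else []
    for pose in poses:
        previous = generator(pose, previous)
        outputs.append(previous)
    return outputs


def toy_generator(pose: Frame, prev: Frame) -> Frame:
    """Stand-in for the trained mapping G: averages pose and previous frame."""
    return [0.5 * p + 0.5 * q for p, q in zip(pose, prev)]
```

Because every output depends on the previous one, abrupt changes between adjacent frames are discouraged, which is the purpose of the temporal smoothing setup.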

Face GAN setup

The researchers further extend the model with a specialized GAN setup designed to add more detail and realism to the face region, as shown in the figure below. Specifically, the face discriminator is a single 70×70 Patch-GAN discriminator.

Face GAN setup

Now let’s move to the results of the experiments…


The target subjects were recorded for around 20 minutes of real time footage at 120 frames per second. Moreover, considering that the model does not encode information about clothes, the target subjects wear tight clothing with minimal wrinkling.

Videos of the source subjects were found online. These videos just need to be of reasonably high quality and show the subject performing a dance.

Here are the results with the top row showing the source subject, the middle row showing the normalized pose stick figures, and the bottom row depicting the model outputs of the target person:

Transfer results for five consecutive frames

The tables below demonstrate the performance of the full model (with both temporal smoothing and Face GAN setups) in comparison to the baseline model (pix2pixHD) alone and the baseline model with a temporal smoothing setup. The quality of individual frames was assessed with the measure of Structural Similarity (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS).

Comparison of synthesis results for different models (T.S.: a model with temporal smoothing, T.S. + Face: a full model with both temporal smoothing setup and Face GAN)

To further analyze the quality of the results, the researchers run the pose detector P on the outputs of each model and compare these reconstructed key points to the pose detections of the original input video. If all body parts are synthesized correctly, then the reconstructed pose should be close to the input pose on which the output was conditioned. See the results in the tables below:

table 1
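A comparison of this kind needs a distance between two sets of key points. One natural choice, used here as an assumption because the article does not spell out the exact formula, is the mean Euclidean distance between corresponding joints:

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float]


def pose_distance(pose_a: Sequence[Joint], pose_b: Sequence[Joint]) -> float:
    """Mean Euclidean distance between corresponding joints of two poses."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and have the same joints")
    total = sum(math.dist(a, b) for a, b in zip(pose_a, pose_b))
    return total / len(pose_a)
```

A low distance between the pose detected in the synthesized frame and the input pose suggests that the body parts were generated in the right places.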

As you can see from the tables, the temporal smoothing setup doesn’t seem to add much to the baseline if we look only at the quantitative results. However, qualitatively, the temporal smoothing setup helps with smooth motion, color consistency across frames, and also individual frame synthesis.

The Face GAN setup, on the other hand, improves both the quantitative and qualitative results of the model. As is evident from the pictures below, this component adds considerable detail to the output video and encourages synthesizing realistic body parts.

Face image comparison from different models on the validation set


The presented model is able to create reasonable and arbitrarily long videos of a target person dancing, given the body movements of another subject in an input video. However, the results still often suffer from jittering. This is especially the case when the input movements or movement speed differ from those seen at training time.

Considering that jittering and shakiness remain even if the target person tries to copy the movements of the source subject during the training sequence, the researchers suppose that jittering could also result from the underlying difference between how the source and target subjects move given their unique body structures. Still, this approach to motion transfer is able to generate compelling videos given a variety of inputs.

Real-time Video Style Transfer: Fast, Accurate and Temporally Consistent

25 July 2018

Developers all over the world deploy convolutional neural networks for recomposing images in the style of other pictures, or simply image style transfer. Once existing methods achieved high enough processing speed, video style transfer also gained interest among researchers and developers. However, image style transfer models usually don’t work well for videos due to high temporal inconsistency, which can be observed visually as flickering between consecutive stylized frames and inconsistent stylization of moving objects. Some video style transfer models have succeeded in improving temporal consistency, yet they fail to guarantee fast processing speed and good perceptual style quality at the same time.

To solve this challenging task, a novel real-time video style transfer model, ReCoNet, was introduced recently. Its authors claim that it can generate temporally coherent style transfer videos while maintaining favorable perceptual styles. Moreover, when compared to other existing methods, ReCoNet demonstrates outstanding performance both quantitatively and qualitatively. So let’s see how the authors of this model were able to achieve high temporal consistency, fast processing speed, and good perceptual style quality, all at the same time!

Suggested Approach

The real-time coherent video style transfer network (ReCoNet) was proposed by a group of researchers from the University of Hong Kong as a state-of-the-art approach to video style transfer. It is a feed-forward neural network that generates coherent stylized video at real-time speed. The video is processed frame by frame through an encoder and a decoder. A VGG loss network is responsible for capturing the perceptual style of the transfer target.

The novelty of their approach lies in introducing a luminance warping constraint in the output-level temporal loss. It captures luminance changes of traceable pixels in the input video and increases stylization stability in areas with illumination effects. Overall, this constraint is key to suppressing temporal inconsistency. In addition, the authors propose a feature-map-level temporal loss, which penalizes variations in high-level features of the same object in consecutive frames and hence further enhances temporal consistency on traceable objects.
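The luminance warping constraint can be illustrated with a small per-pixel computation. This is a sketch under stated assumptions, not the authors' implementation: images are nested lists of RGB tuples, the previous frames are assumed to be already warped to the current frame with optical flow, luminance uses Rec. 709 weights, and the loss is averaged over all pixels.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]
RGBImage = List[List[RGB]]


def luminance(pixel: RGB) -> float:
    """Relative luminance with Rec. 709 weights (assumed colour convention)."""
    r, g, b = pixel
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def output_temporal_loss(
    in_prev_warped: RGBImage, in_curr: RGBImage,
    out_prev_warped: RGBImage, out_curr: RGBImage,
    traceable: List[List[float]],
) -> float:
    """Penalise stylized-output changes that differ from the input's luminance
    change, over traceable pixels only. The *_warped images are the previous
    frames already aligned to the current frame with optical flow."""
    total, count = 0.0, 0
    for r, mask_row in enumerate(traceable):
        for c, m in enumerate(mask_row):
            lum_change = luminance(in_curr[r][c]) - luminance(in_prev_warped[r][c])
            for ch in range(3):
                out_change = out_curr[r][c][ch] - out_prev_warped[r][c][ch]
                total += m * (out_change - lum_change) ** 2
            count += 1
    return total / count if count else 0.0
```

If the input brightens and the stylized output brightens by the same amount, the loss is zero; a change in the output with no matching change in the input is penalised as flicker.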

Network Architecture

Let’s now discover the technical details of the suggested approach and study more carefully the network architecture, presented in Figure 1.

Figure 1. Pipeline of ReCoNet

ReCoNet consists of three modules:

1. An encoder converts input image frames to encoded feature maps with aggregated perceptual information. There are three convolutional layers and four residual blocks in the encoder.

2. A decoder generates stylized images from feature maps. To reduce checkerboard artifacts, the decoder includes two up-sampling convolutional layers with a final convolutional layer instead of one traditional deconvolutional layer.

3. A VGG-16 loss network computes the perceptual losses. It is pre-trained on the ImageNet dataset.

Additionally, a multi-level temporal loss is added to the output of the encoder and the output of the decoder to reduce temporal incoherence.
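The module layout above can be sketched as a simple layer plan that traces the spatial size of a frame through the encoder and decoder. The layer counts follow the description; the scale factors (two stride-2 convolutions in the encoder, two ×2 up-sampling steps in the decoder) and all names are assumptions made for illustration only.

```python
# Each entry: (layer name, kind, spatial scale factor).
# Layer counts follow the text; the scale factors are assumptions.
ENCODER = (
    [("conv1", "conv", 1), ("conv2", "conv", 0.5), ("conv3", "conv", 0.5)]
    + [(f"res{i}", "residual", 1) for i in range(1, 5)]
)
DECODER = [
    ("up1", "upsample+conv", 2),
    ("up2", "upsample+conv", 2),
    ("conv_out", "conv", 1),
]


def trace_sizes(height, width, layers):
    """Return (name, height, width) after each layer in `layers`."""
    sizes = []
    for name, _kind, scale in layers:
        height, width = int(height * scale), int(width * scale)
        sizes.append((name, height, width))
    return sizes


if __name__ == "__main__":
    encoded = trace_sizes(256, 256, ENCODER)
    decoded = trace_sizes(encoded[-1][1], encoded[-1][2], DECODER)
    for name, h, w in encoded + decoded:
        print(f"{name:9s} {h}x{w}")
```

Under these assumptions, the feature-map-level temporal loss is computed at the reduced resolution of the encoder output, while the output-level temporal loss is computed at the full frame resolution.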

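In the training stage, a two-frame synergic training mechanism is used. This means that in each iteration the network generates feature maps and stylized output for two consecutive image frames in two runs. Note that in the inference stage, only one image frame is processed by the network in a single run. During training, however, the temporal losses are computed using the feature maps and stylized output of both frames, while the perceptual losses are computed on each frame independently and summed up. The final loss function for the two-frame synergic training, as given in the ReCoNet paper, is:

$$L(t-1, t) = \sum_{i \in \{t-1,\, t\}} \left( \alpha L_{content}(i) + \beta L_{style}(i) + \gamma L_{tv}(i) \right) + \lambda_f L_{temp,f}(t-1, t) + \lambda_o L_{temp,o}(t-1, t)$$

where L_content, L_style, and L_tv are the per-frame content loss, style loss, and total variation regularizer, L_temp,f and L_temp,o are the feature-map-level and output-level temporal losses, and α, β, γ, λ_f, and λ_o are hyper-parameters for the training process.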
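The way the individual terms are combined in two-frame synergic training can be sketched in a few lines of Python. This is only a sketch: the class and function names are hypothetical, the three per-frame perceptual terms are assumed to be content, style, and total variation losses, and the weights used in the usage note are placeholders rather than the values used by the authors.

```python
from dataclasses import dataclass


@dataclass
class PerceptualLosses:
    """Per-frame perceptual terms (names assumed for this sketch)."""
    content: float
    style: float
    tv: float  # total variation regularizer


@dataclass
class Weights:
    """Hyper-parameters alpha, beta, gamma, lambda_f, lambda_o."""
    alpha: float
    beta: float
    gamma: float
    lambda_f: float
    lambda_o: float


def two_frame_loss(prev, cur, temp_feature, temp_output, w):
    """Combine the losses of two consecutive frames into one training loss.

    Perceptual losses are weighted per frame and summed over both frames;
    each temporal loss is a single value computed from the frame pair.
    """
    perceptual = sum(
        w.alpha * f.content + w.beta * f.style + w.gamma * f.tv
        for f in (prev, cur)
    )
    temporal = w.lambda_f * temp_feature + w.lambda_o * temp_output
    return perceptual + temporal
```

For example, `two_frame_loss(PerceptualLosses(1, 2, 3), PerceptualLosses(4, 5, 6), 0.5, 0.25, Weights(1, 2, 3, 4, 8))` adds the two weighted per-frame sums (14 and 32) to the weighted temporal terms (2 and 2). Because the temporal terms need both frames, two forward passes are required per training iteration, while inference still runs on a single frame.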

Results generated by ReCoNet

Figure 2 demonstrates how the suggested method transfers four different styles on three consecutive video frames. As you can see, ReCoNet successfully reproduces color, strokes, and textures of the style target and creates visually coherent video frames.

Figure 2. Video style transfer results using ReCoNet

Next, the researchers carried out a quantitative comparison of ReCoNet’s performance against three other methods. The table below shows the temporal errors of four video style transfer models on five different scenes. Ruder et al.’s model has the lowest errors, but as you can see from its FPS value, it is not suitable for real-time use because of its low inference speed. Huang et al.’s model shows lower temporal errors than ReCoNet, but let’s turn to the qualitative analysis to see whether this model captures strokes and minor textures as well as ReCoNet does.

As is evident from the top row of Figure 3, Huang et al.’s model fails to learn much about the perceptual strokes and patterns. This could be because it uses a low weight ratio between the perceptual losses and the temporal loss to maintain temporal coherence. In addition, the model uses feature maps from a deeper layer of the loss network, relu4_2, to calculate the content loss, which makes it more difficult to capture low-level features such as edges.

Figure 3. Qualitative comparison of style transfer results against other approaches

The bottom row of Figure 3 shows that Chen et al.’s work preserves the perceptual information of both the content image and the style image well. However, the zoomed-in regions reveal noticeable inconsistency in its stylized results, which is confirmed by its higher temporal errors.

Interestingly, the models were also compared through a user study. For each of the two comparisons, 4 different styles were applied to 4 different video clips, and 50 people were asked to answer the following questions:

  • Q1. Which model perceptually resembles the style image more, regarding the color, strokes, textures, and other visual patterns?
  • Q2. Which model is more temporally consistent such as fewer flickering artifacts and consistent color and style of the same object?
  • Q3. Which model is preferable overall?

The results of this user study, as shown in Table 3, validate the conclusions reached from the qualitative analysis: ReCoNet achieves much better temporal consistency than Chen et al.’s model while maintaining similarly good perceptual styles; Huang et al.’s model outperforms ReCoNet when it comes to temporal consistency, but is much worse in perceptual styles.

Bottom line

This novel approach to video style transfer generates coherent stylized videos at real-time processing speed while maintaining good perceptual style. The authors suggest using a luminance warping constraint in the output-level temporal loss for better stylization stability under illumination effects, and a feature-map-level temporal loss for better temporal consistency. Even though these constraints are effective in improving the temporal consistency of the resulting videos, ReCoNet still trails some of the state-of-the-art methods in temporal consistency. However, considering its high processing speed and its strong results in capturing the perceptual information of both the content image and the style image, this approach is certainly at the forefront of video style transfer.