A Style-Aware Content Loss for Real-time HD Style Transfer

14 August 2018

A Style-Aware Content Loss for Real-time HD Style Transfer

A picture may be worth a thousand words, but at least it contains a lot of very diverse information. This not only comprises what is portrayed, e.g., a composition of…

A picture may be worth a thousand words, but at least it contains a lot of very diverse information. This not only comprises what is portrayed, e.g., a composition of a scene and individual objects but also how it is depicted, referring to the artistic style of a painting or filters applied to a photo. Especially when considering artistic images, it becomes evident that not only content but also style is a crucial part of the message an image communicates (just imagine van Gogh’s Starry Night in the style of Pop Art). A vision system then faces the challenge to decompose and separately represent the content and style of an image to enable a direct analysis based on each individually. The ultimate test for this ability is style transfer, exchanging the style of an image while retaining its content.

Neural Style Transfer Example
Neural Style Transfer Example

Recent work has been done using neural networks and the crucial representation in all these approaches has been based on a VGG16 or VGG19 network, pretrained on ImageNet. However, a recent trend in deep learning has been to avoid supervised pre-training on a million images with tediously labeled object bounding boxes. In the setting of style transfer, this has the particular benefit of avoiding from the outset any bias introduced by ImageNet, which has been assembled without artistic consideration. Rather than utilizing a separate pre-trained VGG network to measure and optimize the quality of the stylistic output, an encoder-decoder architecture with adversarial discriminator is used, to stylize the input content image and also use the encoder to measure the reconstruction loss.

State of the Art

To enable a fast style transfer that instantly transfers a content image or even frames of a video according to a particular style, a feed-forward architecture is required rather than the slow optimization-based approach. To this end, t an encoder-decoder architecture that utilizes an encoder network E to map an input content image x onto a latent representation z = E(x). A generative decoder G then plays the role of a painter and generates the stylized output image y = G(z) from the sketchy content representation z. Stylization then only requires a single forward pass, thus working in real-time.


1) Training with a Style-Aware Content Loss

Previous approaches have been limited in that training worked only with a single style image. In contrast, in this work, a single image y0 is given with a set Y of related style images yj ∈ Y. To train E and G, a standard adversarial discriminator D is used to distinguish the stylized output G(E(xi)) from real examples yj ∈ Y. The transformed image loss is defined as then:


where C × H × W is the size of image x and for training T is initialized with uniform weights. Fig. 3 illustrates the full pipeline of approach. To summarize, the full objective of our model is:


where λ controls the relative importance of adversarial loss.

2) Style Image Grouping

Given a single style image y0 the task is to find a set Y of related style images yj ∈ Y. A VGG16 is trained from scratch on Wikiart dataset to predict an artist given the artwork. The network is trained on the 624 largest (by the number of works) artists from the Wikiart dataset. Artist classification, in this case, is the surrogate task for learning meaningful features in the artworks’ domain, which allows retrieving similar artworks to image y0.

Let φ(y) be the activations of the fc6 layer of the VGG16 network C for input image y. To get a set of related style images to y0 from the Wikiart dataset Y we retrieve all nearest neighbors of y0 based on the cosine distance δ of the activations φ(·), i.e.


The basis for style transfer model is an encoder-decoder architecture. The encoder network contains 5 conv layers: 1×conv-stride-1 and 4×conv-stride-2. The decoder network has 9 residual blocks, 4 upsampling blocks, and 1×conv-stride-1. The discriminator is a fully convolutional network with 7×conv-stride-2 layers. During the training process sample 768 × 768 content image patches from the training set of Places365 [51] and 768×768 style image patches from the Wikiart dataset. We train for 300000 iterations with batch size 1, learning rate 0.0002 and Adam optimizer. The learning rate is reduced by a factor of 10 after 200000 iterations.

Table: 1
Table: 1
Training time

Experts were asked to choose one image which best and most realistically reflects the current style. The score is computed as the fraction of times a specific method was chosen as the best in the group. Mean expert score is calculated for each method using 18 different styles and report them in Tab. 1.


This paper has addressed major conceptual issues in state-of-the-art approaches for style transfer. The proposed style-aware content loss enables a real-time, high-resolution encoder-decoder based stylization of images and videos and significantly improves stylization by capturing how style affects content.


High-resolution-style-transfer-result-e1533903958847 (1)
Result in high resolution

Real-time Video Style Transfer: Fast, Accurate and Temporally Consistent

25 July 2018
video style transfer

Real-time Video Style Transfer: Fast, Accurate and Temporally Consistent

Developers all over the world deploy convolutional neural networks for recomposing images with the style of other pictures or simply image style transfer. After existing methods achieved high enough processing…

Developers all over the world deploy convolutional neural networks for recomposing images with the style of other pictures or simply image style transfer. After existing methods achieved high enough processing speed, video style transfer also gained interests among researchers and developers. However, image style transfer models usually don’t work well for videos due to high temporal inconsistency, which can be observed visually as flickering between consecutive stylized frames and inconsistent stylization of moving objects. Some video style transfer models have succeeded in improving temporal consistency, yet they fail to guarantee fast processing speed and nice perceptual style quality at the same time.

To solve this challenging task, a novel real-time video style transfer model, ReCoNet, was introduced recently. Its authors claim that it can generate temporally coherent style transfer videos while maintaining favorable perceptual styles. Moreover, when compared to the other existing methods, ReCoNet demonstrates outstanding performance both quantitatively and qualitatively. So, let’s now discover, how the authors of this model were able to achieve high temporal consistency, fast processing speed, and nice perceptual style quality — all at the same time!

Suggested Approach

Real-time coherent video style transfer network (ReCoNet) is proposed by a group of researchers from the University of Hong Kong as a state-of-the-art approach to video style transfer. This is a feed-forward neural network that generates coherent stylized video in real-time speed. The process goes frame by frame through an encoder and a decoder. VGG loss network is responsible for capturing the perceptual style of the transfer target.

The novelty of their approach lies in introducing a luminance warping constraint in the output-level temporal loss. It allows to capture luminance changes of traceable pixels in the input video and increases stylization stability in the areas with illumination effects. Overall, this constraint is a key to suppressing temporal inconsistency. However, the authors also propose a feature-map-level temporal loss, which penalizes variations in high-level features of the same object in consecutive frames, and hence, further enhances temporal consistency on traceable objects.

Network Architecture

Let’s now discover the technical details of the suggested approach and study more carefully the network architecture, presented in Figure 1.

Figure 1. Pipeline of ReCoNet

ReCoNet consists of three modules:

1. An encoder converts input image frames to encoded feature maps with aggregated perceptual information. There are three convolutional layers and four residual blocks in the encoder.

2. A decoder generates stylized images from feature maps. To reduce checkerboard artifacts, the decoder includes two up-sampling convolutional layers with a final convolutional layer instead of one traditional deconvolutional layer.

3. A VGG-16 loss network computes the perceptual losses. It is pre-trained on the ImageNet dataset.

Additionally, a multi-level temporal loss is added to the output of the encoder and the output of the decoder to reduce temporal incoherence.

In the training stage, a two-frame synergic training mechanism is carried out. This implies that for each iteration, the network generates feature maps and stylized output for two consecutive image frames in two runs. Note that in the inference stage, only one image frame is processed by the network in a single run. Yet, during the training, the temporal losses are computed using the feature maps and stylized output of both frames, and the perceptual losses are computed on each frame independently and summed up. The final loss function for the two-frame synergic training is:

where α, 𝛽, 𝛾, 𝜆𝑓, and 𝜆𝜊 are hyper-parameters for the training process.

Results generated by ReCoNet

Figure 2 demonstrates how the suggested method transfers four different styles on three consecutive video frames. As you can see, ReCoNet successfully reproduces color, strokes, and textures of the style target and creates visually coherent video frames.

Figure 2. Video style transfer results using ReCoNet

Next, the researchers carried out a quantitative comparison of ReCoNet’s performance against three other methods. The table below demonstrates temporal errors of four video transfer models on five different scenes. Ruder et al’s model demonstrates the lowest errors, but as you can see from its FPS parameter, it is not suitable for real-time usage due to the low inference speed. Huang et al’s model shows lower temporal errors than ReCoNet, but let’s turn to the qualitative analysis to see if this model is able to capture strokes and minor textures similarly to ReCoNet.

As obvious from the top row of Figure 3, Huang et al’s model fails to learn much about the perceptual strokes and patterns. This could be due to the fact that they use a low weight ratio between perceptual losses and temporal loss to maintain temporal coherence. In addition, their model uses feature maps from a deeper layer relu4_2 in the loss network to calculate the content loss, which makes it more difficult to capture low-level features such as edges.

Figure 3. 
Qualitative comparison of style transfer results against other approaches

The bottom row of Figure 3 shows that Chen et al’s work maintains well the perceptual information of both the content image and the style image. However, zoom-in regions reveal a noticeable inconsistency in their stylized results, as confirmed by higher temporal errors.

Interestingly, the models were also compared through a user study. For each of the two comparisons, 4 different styles were applied to 4 different video clips, and 50 people were asked to answer the following questions:

  • Q1. Which model perceptually resembles the style image more, regarding the color, strokes, textures, and other visual patterns?
  • Q2. Which model is more temporally consistent such as fewer flickering artifacts and consistent color and style of the same object?
  • Q3. Which model is preferable overall?

The results of this user study, as shown in Table 3, validate the conclusions reached from the qualitative analysis: ReCoNet achieves much better temporal consistency than Chen et al’s model while maintaining similarly good perceptual styles; Huang et al’s model outperforms ReCoNet when it comes to temporal consistency, but is much worse in perceptual styles.

Bottom line

This novel approach to video style transfer performs great at generating coherent stylized videos in real-time processing speed while maintaining really nice perceptual style. The authors suggested using a luminance warping constraint in the output-level temporal loss and a feature-map level temporal loss for better stylization stability under illumination effects as well as better temporal consistency. Even though these constraints are effective in improving the temporal consistency of the resulted videos, ReCoNet is still behind some of the state-of-the-art methods when it comes to temporal consistency. However, considering its high processing speed and outstanding results in capturing perceptual information of both the content image and the style image, this approach is for sure at the forefront of video style transfer.

Twin-GAN: Cross-Domain Translation of Human Portraits

25 June 2018
Neural style transfer

Twin-GAN: Cross-Domain Translation of Human Portraits

That comes as no surprise that many discoveries and inventions in any domain come out of the researchers’ personal interests. This new approach to translation of human portraits is also…

That comes as no surprise that many discoveries and inventions in any domain come out of the researchers’ personal interests. This new approach to translation of human portraits is also one of these inspiring personal projects. The author of Twin-GAN, Jerry Li, was interested in anime, but not satisfied with his attempts to draw his favorite characters. So, when he started doing machine learning, he arrived at the question: “How to turn human portraits into anime characters using AI?”. And voila, now we have a tool that can turn a human portrait into an original anime character, a cat face or any character given by the user.

But let’s, first, check the previous attempts at teaching AI how to draw.

Neural Style Transfer. Within this approach, the style of one image is applied to another, as you can see below. The important notion is that style transfer method requires a trained object detection network. Most such networks are trained on real-life objects. So, this solution is not likely to help with anime style, unless you create a new dataset manually by hiring labelers to label all the noses, mouths, eyes, hair and other specific features. But that’s going to cost you LOTS of money.

Style Transfer result from Deep Painterly Harmonization
Style Transfer result from Deep Painterly Harmonization

Generative Adversarial Network (GAN) is another way to the anime world. GAN includes a pair of competing for neural networks that can mimic any data given enough samples, good enough network, and enough time for training. Below you can see incredibly realistic faces generated using progressive growing of GANs (PGGAN).

Realistic human faces generated by PGGAN
Realistic human faces generated by PGGAN

Besides generating pretty high-quality images, GAN is also capable of translating one type of images into another. However, this approach requires paired data (one image from each domain), but unfortunately, there is no paired datasets on human and anime portraits.

Unpaired cross-domain GAN and CycleGAN. Luckily, Facebook in 2016 (Unsupervised Cross-Domain Image Generation) and Jun-Yan Zhu et.al. in 2017 (CycleGAN) introduced quite similar approaches on how to translate two type of images, with one type having labels, without paired data. Both of the models assume: when translating image type A to image type B, and translate B back to A, the resulting image after two translations should not be too different from the original input image. This difference, that anyway occurs, is called the cycle consistency loss, and it can be used for training an image translation model.

Taken from Unsupervised Cross-Domain Image Generation
Taken from Unsupervised Cross-Domain Image Generation

So, before creating Twin-GAN, Jerry Li tried to use CycleGAN for translation of human portraits into anime characters. He took 200K images from CelebA dataset as human portraits and around 30K anime figures from Getchu website. Two days of training and he got the results depicted below.

Results from CycleGAN
Results from CycleGAN

The results are not bad, but they reveal some limitations of CycleGAN. This network is minimizing the cycle consistency loss, and hence, it is forced to find a one-to-one mapping for all information from the input to the target domain. However, this is not always possible between human portraits and anime characters: people usually don’t have purple or green hair color and their faces are much more detailed than in anime. Forcing the network to find a one-to-one mapping in such circumstances will probably not yield good results.

So, the question was: without labeled data, how to find the matching parts from the two domains and innovate a little bit on the rest?

To solve this issue, Jerry Li looked for some inspirations from Style Transfer. He refers to Vincent Dumoulin from Google Brain, who has found that by only learning two variables in Batch Normalization, it is possible to use the same network to output images with a wide range of styles. Moreover, it is even possible to mix and match the style by mixing those parameters.

With these ideas in mind, the structure of Twin-GAN was finally created. PGGAN was chosen as a generator. This network takes a high dimensional vector as its input, and in our case, an image is an input. Thus, the researcher used an encoder with structure symmetric to PGGAN to encode the image into the high dimensional vector. Next, in order to keep the details of the input image, he used the UNet structure for connecting the convolutional layers in the encoder with the corresponding layers in the generator.

The input and output fall into the following three categories:

1. Human Portrait->Encoder->High Dimensional Vector->PGGAN Generator + human-specific-batch-norm->Human Portrait

2. Anime Portrait->Encoder->High Dimensional Vector->PGGAN Generator + anime-specific-batch-norm->Anime Portrait

3. Human Portrait->Encoder->High Dimensional Vector->PGGAN Generator + anime-specific-batch-norm->Anime Portrait

The idea behind this structure is that letting the human and anime portraits share the same network will help the network realize that although they look different, both human and anime portraits are describing a face. This is crucial to image translation. The switch that decides whether to output human or anime portrait lies in the batch norm parameters.

Regarding the loss function, the following four losses were used: 1) human portrait reconstruction loss; 2) anime portrait reconstruction loss; 3) human to anime GAN loss; 4) human to anime cycle consistency loss.

Here are the results of translating human portraits into anime characters using Twin-GAN.

Results from Twin-GAN
Results from Twin-GAN

The results are quite impressive. But it turned out that Twin-GAN can do even more. Since anime and real human portraits share the same latent embedding, it is possible to extract that embedding and do the nearest neighbor search in both domains. To put it simply, given an image, we can find who looks the closest in both real and anime images! See the results below.

Results of matching the closest human and anime images using Twin-GAN
Results of matching the closest human and anime images using Twin-GAN

Although the results are quite good with only a few error cases, sometimes you might not be satisfied with the image translation result due to some personal preferences or requests. For example, if the input image has brown hair and you want the anime character to have bright green hair, the model won’t allow you to make direct modifications of such features.

In these cases, you can use illust2vec to extract character features that you wish to copy, supply those features as embeddings to the generator, and then, when generating an image, add an additional anime character as input. The final result should look like that additional character, with the position and facial expression of the human portrait. See some examples below:

Results of creating anime characters with the requested features
Results of creating anime characters with the requested features

Twin-GAN can turn a human portrait into an original anime character, cat face or any character given by the user, and the algorithm demonstrates quite a good performance when completing these tasks. However, it is also prone to some errors like mistaking the background color for hair color, ignoring important features and/or misrecognizing them. When considering anime characters generation, the problem is also with the availability of well-balanced dataset: most of the anime faces collected by the researcher are female, and so the network is prone to translating male human portraits into female anime characters, like on the image below.

Failure case generated by Twin-GAN
Failure case generated by Twin-GAN

To sum up, this is a great start, but some more work needs to be done to improve the performance of this budding approach.