Dissecting GANs for Better Understanding and Visualization

5 December 2018

GANs can be taught to create (or generate) worlds similar to our own in any domain: images, music, speech, etc. Since 2014, a large number of improvements to GANs have been proposed, and GANs have achieved impressive results. Researchers from the MIT-IBM Watson AI Lab have presented GAN Paint, an interactive tool based on GAN Dissection – a method for validating whether an explicit representation of an object is present in a feature map of a hidden layer:

The GAN Paint interactive tool

State-of-the-art Idea

However, a concern raised very often in ML is the lack of understanding of the methods being developed and applied. Despite the success of GANs, their visualization and understanding remain little-explored areas of research.

A group of researchers led by David Bau have done the first systematic study for understanding the internal representations of GANs. In their paper, they present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level.

Their work resulted in a general method for visualizing and understanding GANs at different levels of abstraction, several practical applications enabled by their analytic framework, and open-source interpretation tools for better understanding Generative Adversarial Network models.

Inserting a door by setting 20 causal units to a fixed high value at one pixel in the representation.

Method

From what we have seen so far, especially in the image domain, Generative Adversarial Networks can generate highly realistic images across different domains. From this perspective, one might say that GANs have learned facts at a higher level of abstraction – objects, for example. However, there are also cases where GANs fail terribly and produce very unrealistic images. So, is there a way to explain at least these two cases? David Bau and his team tried to answer this question, among others, in their paper. They studied the internal representations of GANs and tried to understand how a GAN represents structures and relationships between objects (from the point of view of a human observer).

As the researchers mention in their paper, there has been previous work on visualizing and understanding deep neural networks but mostly for image classification tasks. Much less work has been done in visualization and understanding of generative models.

The main goal of the systematic analysis is to understand how objects such as trees are encoded by the internal representations of a GAN generator network. To do this, the researchers study the structure of a hidden representation given as a feature map. Their study is divided into two phases, which they call dissection and intervention.

Characterizing units by Dissection

The goal of the first phase, dissection, is to determine whether an explicit representation of an object is present in a feature map of a hidden layer, and to identify which classes from a dictionary of concepts have such an explicit representation.

To search for explicit representations of objects, they quantify the spatial agreement between a unit's thresholded feature map and a concept's segmentation mask using the intersection-over-union (IoU) measure. The result is called the agreement, and it allows individual units to be characterized: the concepts related to each unit can be ranked, and each unit can be labeled with the concept that matches it best.
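To make the computation concrete, here is a minimal NumPy sketch of that agreement score for a single unit and a single concept. It assumes the unit's activations have already been upsampled to the segmentation resolution, and it uses a fixed top-quantile threshold, which is a simplification of the thresholding rule used in the paper.

```python
import numpy as np

def unit_concept_iou(unit_acts, concept_masks, quantile=0.99):
    """unit_acts: (N, H, W) activations of one unit, upsampled to mask resolution.
    concept_masks: (N, H, W) binary segmentation masks for one concept."""
    # Threshold the unit at a fixed top quantile of its own activation distribution
    t = np.quantile(unit_acts, quantile)
    unit_on = unit_acts > t

    inter = np.logical_and(unit_on, concept_masks).sum()
    union = np.logical_or(unit_on, concept_masks).sum()
    return inter / max(union, 1)  # IoU ("agreement") between the unit and the concept
```

Ranking concepts by this score and keeping the best match per unit is what characterizes the units in the dissection phase.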

Dissection algorithm
Phase 1: Dissection.

Measuring causal relationships using Intervention

The second important question mentioned before is causality. Intervention, the second phase, seeks to estimate the causal effect of a set of units on a particular concept.

To measure this effect, the intervention phase measures the impact of forcing units on (unit insertion) and off (unit ablation), again using segmentation masks. More precisely, a set of units in the feature map is forced on and off, the two resulting images are segmented, and the two segmentation masks are compared to measure the causal effect.
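Below is an illustrative PyTorch-style sketch of a single intervention. The helper methods forward_to / forward_from and the segment function are assumptions used to split the generator around the chosen layer; they are not the authors' API.

```python
import torch

def causal_effect(generator, segment, z, units, layer, concept_id, on_value=10.0):
    """Ablate / insert a set of units at one layer and compare how much of the
    target concept the segmenter finds in the generated image."""
    units = torch.tensor(units)

    def run(value):
        with torch.no_grad():
            feats = generator.forward_to(layer, z)      # hidden feature map (1, C, H, W)
            feats.index_fill_(1, units, value)          # force the chosen channels to `value`
            img = generator.forward_from(layer, feats)  # finish the forward pass
            seg = segment(img)                          # per-pixel class predictions
        return (seg == concept_id).float().mean()       # fraction of pixels showing the concept

    ablated = run(0.0)          # units forced off
    inserted = run(on_value)    # units forced to a high value
    return inserted - ablated   # estimated causal effect of the unit set on the concept
```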

Intervention algorithm
Phase 2: Intervention.

Results

For the whole study, the researchers use three variants of Progressive GANs (Karras et al., 2018) trained on LSUN scene datasets. For the segmentation task, they use a recent image segmentation model (Xiao et al., 2018) trained on the ADE20K scene dataset.

An extensive analysis was done using the proposed framework for understanding and visualizing GANs. The first part, dissection, was used for analyzing and comparing units across datasets, layers, and models, and for locating artifact units.

Comparing representations learned by progressive GANs trained on different scene types. The units that emerge match objects that commonly appear in the scene type: seats in conference rooms and stoves in kitchens.
Removing successively larger sets of tree-causal units from a GAN.

The second part of the framework, intervention, was used together with a set of dominant object classes to locate causal units that can remove and insert objects in different images. The results are presented in the paper and the supplementary material, and a video demonstrating the interactive tool was released. Some of the results are shown in the figures below.

Visualizing the activations of individual units in two GANs.

Conclusion

This is one of the first extensive studies that targets the understanding and visualization of generative models. Focusing on the most popular generative model, the Generative Adversarial Network, this work reveals significant insights. One of the main findings is that a large part of GAN representations can be interpreted. The study shows that a GAN's internal representation encodes variables that have a causal effect on the generation of objects and realistic images.

Many researchers will potentially benefit from the insights that came out of this work, and the proposed framework provides a basis for analyzing, debugging, and understanding Generative Adversarial Network models.

Pix2Pix – Image-to-Image Translation Neural Network

27 November 2018

The pix2pix architecture was presented in 2016 by researchers from Berkeley in their work “Image-to-Image Translation with Conditional Adversarial Networks.” Many problems in image processing and computer vision can be posed as “translating” an input image into a corresponding output image. For example, a scene may be rendered as an RGB image, a gradient field, an edge map, a semantic label map, etc. In analogy to automatic language translation, automatic image-to-image translation can be defined as the task of translating one possible representation of a scene into another, given a large amount of training data. With the rise of deep learning, CNNs have become the common workhorse behind a wide variety of image prediction problems.

The paper: https://arxiv.org/pdf/1611.07004.pdf

Datasets

  • Cityscapes: a large-scale dataset containing diverse stereo video sequences recorded in street scenes from 50 different cities, with high-quality pixel-level annotations of 5,000 frames (an order of magnitude larger than similar previous attempts). 2,975 training images from the Cityscapes training set were used, trained for 200 epochs with random jitter and mirroring; the validation set was used for testing.
  • Maps ↔ aerial photographs: 1,096 training images scraped from Google Maps, trained for 200 epochs with batch size 1. Data were split into train and test sets (with a buffer region added to ensure that no training pixel appeared in the test set).
  • BW → color: 1.2 million training images (the ImageNet training set), trained for 6 epochs with batch size 4, with only mirroring and no random jitter. Tested on a subset of the ImageNet validation set.
  • Edges → shoes: 50k training images from the UT Zappos50K dataset, trained for 15 epochs with batch size 4. Data were split into train and test sets randomly.

Pix2pix architecture

Pix2pix uses a conditional generative adversarial network (cGAN) to learn a mapping from an input image to an output image. The network is made up of two main pieces: the Generator and the Discriminator. The Generator transforms the input image to produce the output image. The Discriminator looks at the input image together with an unknown image (either a target image from the dataset or an output image from the Generator) and tries to guess whether the unknown image was produced by the Generator.

Generator

The Generator has the job of taking an input image and performing the transform to produce the target image. The encoder-decoder architecture consists of:

encoder: C64-C128-C256-C512-C512-C512-C512-C512

decoder: CD512-CD512-CD512-C512-C256-C128-C64

An example input could be a black-and-white image, with the output being a colorized version of it. The structure of the generator is called an “encoder-decoder,” and in pix2pix the encoder-decoder looks more or less like this:

pix2pix decoder
A convolution is applied after the last layer in the decoder to map to the number of output channels (3 in general, except in colorization, where it is 2), followed by a Tanh function. Batch normalization is not applied to the first C64 layer in the encoder. All ReLUs in the encoder are leaky, with slope 0.2, while ReLUs in the decoder are not leaky. The U-Net architecture is identical except for skip connections between each layer i in the encoder and layer n−i in the decoder, where n is the total number of layers. The skip connections concatenate the activations from layer i onto layer n−i.
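For readers who prefer code, here is a minimal PyTorch sketch of the Ck / CDk building blocks referred to above (encoder blocks downsample by a factor of 2, decoder blocks upsample by 2). The full U-Net generator stacks these blocks and concatenates the skip connections, which is omitted here for brevity.

```python
import torch.nn as nn

def C(in_ch, out_ch, bn=True):
    """Ck encoder block: Convolution-BatchNorm-LeakyReLU with stride 2."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)]
    if bn:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

def CD(in_ch, out_ch):
    """CDk decoder block: transposed Convolution-BatchNorm-Dropout-ReLU with stride 2."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.Dropout(0.5),
        nn.ReLU(),
    )
```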

Discriminator

The Discriminator has the job of taking two images, an input image and an unknown image (either a target or an output image from the generator), and deciding whether the second image was produced by the generator or not. The 70×70 discriminator architecture is:

C64-C128-C256-C512

pix2pix neural network

A convolution is applied after the last layer to map to a 1-dimensional output, followed by a Sigmoid function. BatchNorm is not applied to the first C64 layer. All ReLUs are leaky, with slope 0.2. All other discriminators follow the same basic architecture, with depth varied to modify the receptive field size:

  • 1×1 discriminator: C64-C128 (note that in this special case, all convolutions are 1×1 spatial filters)
  • 16×16 discriminator: C64-C128
  • 286×286 discriminator: C64-C128-C256-C512-C512-C512

In the pix2pix implementation, each pixel from this 30×30 image corresponds to the believability of a 70×70 patch of the input image (the patches overlap a lot since the input images are 256×256). The architecture is called a “PatchGAN”.
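A sketch of such a 70×70 PatchGAN in PyTorch is shown below. It follows the C64-C128-C256-C512 description above; minor padding and stride details of the reference implementation may differ, but for a 256×256 input it produces the 30×30 patch map discussed in the text.

```python
import torch.nn as nn

def patchgan_70(in_channels=6):  # input: source and target/output images concatenated
    def block(i, o, stride=2, bn=True):
        layers = [nn.Conv2d(i, o, kernel_size=4, stride=stride, padding=1)]
        if bn:
            layers.append(nn.BatchNorm2d(o))
        layers.append(nn.LeakyReLU(0.2))
        return layers

    return nn.Sequential(
        *block(in_channels, 64, bn=False),          # C64 (no BatchNorm on the first layer)
        *block(64, 128),                            # C128
        *block(128, 256),                           # C256
        *block(256, 512, stride=1),                 # C512
        nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # 1-channel 30x30 patch map
        nn.Sigmoid(),
    )
```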

Training

For training the network, two steps have to be taken: training the discriminator and training the generator. All networks were trained from scratch. Weights were initialized from a Gaussian distribution with mean 0 and standard deviation of 0.02.

First, the generator produces an output image. The discriminator looks at the input/target pair and the input/output pair and produces its guess about how realistic they look. The discriminator's weights are then adjusted based on the classification error of the input/output pair and the input/target pair.

The generator’s weights are then adjusted based on the output of the discriminator as well as the difference between the output and the target image.

The thing to remember here is that when the generator is trained on the output of the discriminator, the gradient is computed through the discriminator, which means that as the discriminator improves, the generator is trained to beat an ever-stronger discriminator. If the discriminator is good at its job and the generator is capable of learning the correct mapping function through gradient descent, you should get generated outputs that could fool a human.
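The alternating update can be summarized in a short training-step sketch (an illustration, not the reference implementation); here G and D are the generator and discriminator described above, x is the input image, y the target image, and the L1 weight of 100 follows the paper's default.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, x, y, opt_g, opt_d, l1_weight=100.0):
    # --- discriminator: real input/target pair vs. fake input/output pair ---
    fake = G(x)
    d_real = D(torch.cat([x, y], dim=1))
    d_fake = D(torch.cat([x, fake.detach()], dim=1))
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- generator: fool the discriminator and stay close to the target (L1) ---
    d_fake = D(torch.cat([x, fake], dim=1))
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake)) + \
             l1_weight * F.l1_loss(fake, y)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```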

The objective function we want to optimize during training is that of a conditional GAN, which can be expressed as:
pix2pix GAN
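In LaTeX form, the conditional GAN objective combined with the L1 term from the paper reads:

\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]

G^{*} = \arg\min_{G}\max_{D} \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G), \qquad \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert y - G(x, z) \rVert_{1}]

with λ = 100 in the reported experiments.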

Use Cases and Implementation

The nice thing about pix2pix is that it is not specific to a particular class of images, i.e., it is generic. It does not require defining any explicit relationship between the two types of images. Instead, it learns the objective during training by comparing the defined inputs and outputs and inferring the mapping. This makes pix2pix highly adaptable to a wide variety of situations, including ones where it is not easy to verbally or explicitly define the task we want to model.

[Tensorflow][Pytorch][Keras]

Result

The pix2pix results suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss adapted to the task and data at hand, which makes them applicable to a wide variety of settings. The results achieved by pix2pix are state of the art.

pix2pix
Example results on Cityscapes labels → photo, compared to ground truth

Example results on automatically detected edges → shoes, compared to ground truth

AlphaGAN: Natural Image Matting

11 September 2018

Many image-editing and film post-production applications rely on natural image matting as one of the processing steps. The task of the matting algorithm is to estimate the opacity of a foreground object in an image or video sequence accurately. Researchers from Trinity College Dublin propose the AlphaGAN architecture for natural image matting.

In mathematical terms, every pixel i in the image is assumed to be a linear combination of the foreground and background colors:

alphagan i

where αi is a scalar value that defines the foreground opacity at pixel i and is referred to as the alpha value.
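Written out, the compositing equation shown above is:

I_i = \alpha_i F_i + (1 - \alpha_i) B_i, \qquad \alpha_i \in [0, 1]

where I_i is the observed color and F_i, B_i are the (unknown) foreground and background colors at pixel i.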

So, how to solve this equation with so many unknown values? Let’s first discover the current state-of-the-art approaches to solving this problem…

Previous Works

Many current algorithms aim to solve the matting equation by treating it as a color problem, following either sampling or propagation approaches.

Sample-based image matting assumes that the true foreground and background colors of an unknown pixel can be derived from the known foreground and background samples near that pixel.

Propagation-based image matting works by propagating the known alpha values from known local foreground and background samples to the unknown pixels.

However, the over-dependency on color information can lead to artifacts in images where the foreground and background color distributions overlap.

Thus, several deep learning approaches to natural image matting were recently introduced, including:

  • a two-stage network consisting of an encoder-decoder stage and a refinement stage by Xu et al.;
  • end-to-end CNN for deep automatic portrait matting by Shen et al.;
  • end-to-end CNN that utilizes the results deduced from local and non-local matting algorithms by Cho et al.;
  • granular deep learning (GDL) architecture by Hu et al.

But is it possible to improve further the performance of these algorithms by applying GANs? Let’s find out now!

State-of-the-art Idea

Lutz, Amplianitis, and Smolić from Trinity College Dublin are the first to propose a generative adversarial network (GAN) for natural image matting. Their generator network is trained to predict visually appealing alphas, while the discriminator is trained to classify well-composited images.

The researchers build their approach by improving the network architecture of Xu et al. to better deal with the spatial localization issues inherent in CNNs. In particular, they use dilated convolutions to capture global context information without downscaling feature maps and losing spatial information.

We are now ready to move on to the details of AlphaGAN – the name Lutz and his colleagues give to their image matting algorithm.

Network Architecture

AlphaGAN architecture consists of one generator G and one discriminator D.

The generator G is a convolutional encoder-decoder network that is trained both with the help of the ground-truth alphas and with the adversarial loss from the discriminator. It takes as input an image composited from the foreground, the alpha, and a random background, with the trimap appended as a 4th channel, and attempts to predict the correct alpha. The ResNet-50 architecture is used for the encoder.

As you can see from the figure below, the decoder part of the network includes skip connections from the encoder to improve the alpha prediction by reusing local information to capture fine structures in the image.

The generator of the AlphaGAN

The discriminator D tries to distinguish between real 4-channel inputs and fake inputs where the first three channels are composited from the foreground, the background, and the predicted alpha. The PatchGAN introduced by Isola et al. is used as the discriminator in this network.

The full objective of the network includes alpha-prediction loss, compositional loss, and adversarial loss:

Loss Alphagan
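In compact form, the full objective combines these three terms; assuming a simple unweighted sum (a simplification if the paper weights the terms differently), it can be sketched as:

\mathcal{L}(G, D) = \mathcal{L}_{\alpha}(G) + \mathcal{L}_{comp}(G) + \mathcal{L}_{GAN}(G, D)

where \mathcal{L}_{\alpha} penalizes the difference between the predicted and ground-truth alphas, \mathcal{L}_{comp} penalizes the difference between images composited with them, and \mathcal{L}_{GAN} is the adversarial term.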

Experimental Results

The proposed method was evaluated based on two datasets:

  • the Composition-1k dataset, which includes 1000 test images composed of 50 unique foreground objects;
  • the alphamatting.com dataset, which consists of 28 training images and 8 test images; for each set, three different sizes of trimaps are provided, namely, “small” (S), “large” (L) and “user” (U).

The Composition-1k Dataset

The metrics used include the sum of absolute differences (SAD), mean square error (MSE), gradient and connectivity errors.

The researchers compared their method with several state-of-the-art approaches where there is public code available. For all methods, the original code from the authors was used, without any modifications.

Quantitative results on the Composition-1k dataset. AlphaGAN's results are shown in parentheses

As can be observed from the table, AlphaGAN delivers noticeably better results than the other image matting algorithms selected for comparison. There is only one case (the gradient error from the comprehensive sampling approach) where it does not achieve the best result.

See also some qualitative results of this comparison in the next set of pictures:

Results

results 2 alphagan
Comparison of results on the Composition-1k testing dataset

The Alphamatting.com Dataset

The researchers submitted results generated by AlphaGAN to the alphamatting.com benchmark and reached top positions for some of the images:

SAD and gradient results for the top five methods on the alphamatting.com dataset. Best results are shown in bold

Specifically, they achieved the best results for the Troll and Doll images, and first place overall on the gradient evaluation metric. The strong results on these particular images demonstrate the advantage of using the adversarial loss from the discriminator to correctly predict alpha values for fine structures such as hair.

The worst results of the proposed method come from the Net image. However, even though the AlphaGAN approach appears low in the rankings for this image, its results still look very close to those of the top-performing approaches:

matting

Alpha matting predictions for the “Troll” and “Doll” images (best results) and the “Net” image (worst result), taken from the alphamatting.com dataset. From left to right: DCNN [7], IF [1], DI [33], ‘Ours’

Bottom Line

AlphaGAN is the first algorithm that uses GANs for natural image matting. Its generator is trained to predict alpha mattes from input images, while the discriminator is trained to distinguish good images composited with the ground-truth alpha from images composited with the predicted alpha.

This network architecture produces visually appealing compositions with state-of-the-art or comparable results on the primary metrics. Exceptional performance is achieved for images with fine structures such as hair, which is of great importance in practical matting applications, including film and TV production.

Facial Surface and Texture Synthesis via GAN

3 September 2018

Deep networks can be extremely powerful and effective in answering complex questions. But it is also well-known that in order to train a really complex model, you’ll need lots and lots of data, which closely approximates the complete data distribution.

With the lack of real-world data, many researchers choose data augmentation as a method for extending the size of a given dataset. The idea is to modify the training examples in such a way that keeps their semantic properties intact. That’s not an easy task when dealing with human faces.

The method should account for such complex transformations of data as pose, lighting and non-rigid deformations, yet create realistic samples that follow the real-world data statistics.

So, let’s see how the latest state-of-the-art methods approach this challenging task…

Previous approaches

Generative adversarial networks (GANs) have demonstrated their effectiveness in making synthetic data more realistic. Taking simulated data as input, a GAN produces samples that appear more realistic. However, the semantic properties of these samples might be altered, even with a loss penalizing the change in the parameters of the output.

The 3D morphable model (3DMM) is the most commonly used method for representing and synthesizing geometries and textures, and it was originally proposed in the context of 3D human faces. In this model, the geometric structure and the texture of human faces are linearly approximated as a combination of principal vectors.

Recently, the 3DMM model was combined with convolutional neural networks for data augmentation. However, the generated samples tend to be smooth and unrealistic in appearance, as you can observe in the figure below.

Faces synthesized using the 3DMM linear model

Moreover, 3DMM generates samples following a Gaussian distribution, which rarely reflects the true distribution of the data. For instance, see below the first two PCA coefficients plotted for real faces vs the synthesized 3DMM faces. This gap between the real and synthesized distributions may easily result in non-plausible samples.

First two PCA coefficients of real (left) and 3DMM generated (right) faces

State-of-the-art idea

Slossberg, Shamai, and Kimmel from Technion – Israel Institute of Technology propose a new realistic data synthesis approach for human faces by combining GAN and 3DMM model.

In particular, the researchers employ a GAN to imitate the space of parametrized human textures and generate corresponding facial geometries by learning the best 3DMM coefficients for each texture. The generated textures are mapped back onto the corresponding geometries to obtain new generated high-resolution 3D faces.

This approach produces realistic samples, and it:

  • doesn’t suffer from indirect control over such desired attributes as pose and lighting;
  • is not limited to producing new instances of existing individuals.

Let’s have a closer look at their data processing pipeline…

Data processing pipeline

The process includes aligning 3D scans of human faces vertex to vertex and mapping their textures onto a 2D plane using a predefined universal transformation.

Data preparation pipeline

The data preparation pipeline contains four main stages:

  • Data acquisition: the researchers collected about 5000 scans from a wide variety of ethnic, gender, and age groups; each subject was asked to perform five distinct expressions including a neutral one.
  • Landmark annotation: 43 landmarks were added to the meshes semi-automatically by rendering the face and using a pre-trained facial landmark detector on the 2D images.
  • Mesh alignment: this was conducted by deforming a template face mesh according to the geometric structure of each scan, guided by the previously obtained facial landmark points.
  • Texture transfer: the texture is transferred from the scan to the template using a ray casting technique built into the animation rendering toolbox of Blender; then, the texture is mapped from the template to a 2D plane using the predefined universal mapping.

See the resulting mapped textures below:

Flattened aligned facial textures

The next step is to train a GAN to learn and imitate these aligned facial textures. For this purpose, the researchers use a progressively growing GAN with the generator and discriminator constructed as symmetric networks. In this implementation, the generator progressively increases the resolution of the feature maps until reaching the output image size, while the discriminator gradually reduces the size back to a single output.

See below the new synthetic facial textures generated by the aforementioned GAN:

Facial textures synthesized by GAN

The final step is to synthesize the geometries of the faces. The researchers explored several approaches to finding plausible geometry coefficients for a given texture. You can observe the qualitative and quantitative (L2 geometric error) comparison between the various methods in the next figure:

Two synthesized textures mapped onto different geometries

As the figure shows, the least-squares approach produces the lowest-distortion results. Considering also its simplicity, this method was chosen for all subsequent experiments.
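A minimal NumPy sketch of such a least-squares fit is given below, assuming the task is to learn a linear map from texture coefficients to 3DMM geometry coefficients on the training scans and then apply it to the coefficients of GAN-generated textures; the variable names and bias handling are illustrative assumptions.

```python
import numpy as np

def fit_texture_to_geometry(T_train, G_train):
    """T_train: (n, d_t) texture coefficients; G_train: (n, d_g) geometry coefficients."""
    T1 = np.hstack([T_train, np.ones((T_train.shape[0], 1))])  # add a bias column
    W, *_ = np.linalg.lstsq(T1, G_train, rcond=None)           # solve T1 @ W ~= G_train
    return W

def predict_geometry(W, t_new):
    """Predict 3DMM geometry coefficients for a new texture coefficient vector."""
    return np.append(t_new, 1.0) @ W
```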

Experimental results

The proposed method can generate many new identities, and each one of them can be rendered under varying pose, expression, and lighting. Different expressions are added to the neutral geometry using the Blend Shapes model. The resulting images with different pose and lighting are shown below:

Identities generated by the proposed method with different pose and lighting

For quantitative evaluation of the results, the researchers used the sliced Wasserstein distance (SWD) to measure distances between distributions of their training and generated images in different scales:

The table demonstrates that the textures generated by the proposed model are statistically closer to the real data than those generated by 3DMM.

The next experiment was designed to evaluate if the proposed model is capable of generating samples that diverge significantly from the original training set and resemble previously unseen data. Thus, 5% of the identities were held out for evaluation. The researchers measured the L2 distance between each real identity from the test set to the closest identity generated by the GAN, as well as to the closest real identity from the training set.

The distance between the generated and real identities

As can be seen from the figure, the test-set identities are closer to the generated identities than to the training-set identities. Moreover, the “Test to fake” distances are not significantly larger than the “Fake to real” distances. This implies that the model does not just produce identities that are very close to the training set, but also novel identities that resemble previously unseen examples.

Finally, a qualitative evaluation was performed to check if the proposed pipeline is able to generate original data samples. Thus, facial textures generated by the model were compared to their closest real neighbors in terms of L2 norm between identity descriptors.

Synthesized facial textures (top) vs. corresponding closest real neighbors (bottom)

As you can see, the nearest real textures are far enough to be visually distinguished as different people, which confirms the model’s ability to produce novel identities.

Bottom Line

The suggested model is probably the first to realistically synthesize both the texture and geometry of human faces. It can be useful for training face detection, face recognition, or face reconstruction models. In addition, it can be applied in cases where many different realistic faces are needed, for instance in the film industry or computer games. Moreover, the framework is not limited to synthesizing human faces and can be employed for other classes of objects where alignment of the data is possible.

Vid2Vid – Conditional GANs for Video-to-Video Synthesis

3 September 2018

Researchers from NVIDIA and MIT’s Computer Science and Artificial Intelligence Lab have proposed a novel method for video-to-video synthesis, showing impressive results. The proposed method – Vid2Vid – can synthesize high-resolution, photorealistic, temporally coherent videos on a diverse set of input formats including segmentation masks, sketches, and poses.

Previous works

Due to the inherent complexity of the problem of video-to-video synthesis, this topic has remained relatively unexplored in the past. Compared to its image counterpart, image-to-image synthesis, far fewer studies have explored and tackled this problem.

Arguing that a general-purpose solution to video-to-video synthesis has not yet been explored in prior work (as opposed to its image counterpart, image-to-image translation), the researchers compare and benchmark their approach against a strong baseline that combines a state-of-the-art video style transfer algorithm with a state-of-the-art image-to-image translation approach.

State-of-the-art idea

The general idea is to learn a mapping function that can convert an input video to a realistic output video. They frame this problem as a distribution matching problem, and they leverage the generative adversarial learning framework to produce a method that can generate photorealistic videos given an input video (such as a sequence of segmentation masks, poses, sketches, etc.).

vid2vid

Vid2Vid Method

As I mentioned before, the authors proposed a method based on Generative Adversarial Networks. They tackle the complex problem of video-to-video translation or video-to-video synthesis in a really impressive way by carefully designing an adversarial learning framework.

Their goal is to learn a mapping function that will map a sequence of input (source) images to a series of realistic output images, where the conditional distribution of the generated sequence given the source sequence is identical to the distribution of the original sequence given the source sequence:

vid2vid

Matching the distributions forces the method to output realistic and temporally coherent videos. In a generative adversarial learning context, a generator-discriminator framework is designed to learn the mapping function. The generator is trained by solving an optimization problem – minimizing the Jensen-Shannon divergence between the two distributions. A minimax optimization is applied to a defined objective function:

vid2vid

As mentioned in their paper, and as is widely known, optimizing such an objective function is a very challenging task. The training of generator and discriminator models often becomes unstable or even impossible, depending on the optimization problem being solved. Therefore, the researchers propose a simplified sequential generator that makes a few assumptions. They make a Markov assumption to factorize the conditional distribution and decouple the dependencies across frames in the sequences.

In simple words, they simplify the problem by assuming that the video frames can be generated sequentially and the generation of the t-th frame only depends on three things: 1) the current source image, 2) the past L source images, and 3) the past L generated images.

From this point, they design a feed-forward network F that maps the current and past L source images and the past L generated images to the next generated output frame.
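In the paper's notation (s for source frames, x̃ for generated frames, L for the length of the history), this Markov assumption factorizes the conditional distribution roughly as:

p(\tilde{x}_{1}^{T} \mid s_{1}^{T}) = \prod_{t=1}^{T} p(\tilde{x}_{t} \mid \tilde{x}_{t-L}^{t-1},\, s_{t-L}^{t})

so that each frame is generated sequentially from a short history rather than from the entire video.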


To model such a network, the researchers make another assumption based on the fact that if the optical flow from the current frame to the next frame is known, it can be used to warp the current frame into an estimate of the next frame. Arguing that this will be largely true except for occluded areas (where it is unknown what will appear), they consider a specific model.

The estimation network F is modeled to take into account an occlusion mask, the estimated optical flow between the previous and current images (given by an optical flow estimation function), and a hallucinated image (generated from scratch). The hallucinated image is needed to fill in the occluded areas.
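Up to notation, the resulting per-frame composition described in the paper can be sketched as:

F(\tilde{x}_{t-L}^{t-1}, s_{t-L}^{t}) = (1 - \tilde{m}_{t}) \odot \tilde{w}_{t-1}(\tilde{x}_{t-1}) + \tilde{m}_{t} \odot \tilde{h}_{t}

where m̃_t is the (soft) occlusion mask, w̃_{t-1}(·) warps the previously generated frame with the estimated optical flow, and h̃_t is the hallucinated image that fills in the occluded regions.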

As in the image domain, where many adversarial methods employ local discriminators in addition to a global one, the authors propose an approach that uses both a conditional image discriminator and a conditional video discriminator.

Additionally, to make their method even more capable, the researchers extend their approach to multimodal synthesis. They propose a generative scheme based on feature embeddings and a Gaussian mixture model that outputs several videos with different visual appearances depending on the sampled feature vectors.

Results

The proposed method yields impressive results. It was tested on several datasets, such as Cityscapes, Apolloscape, a face video dataset, and a dance video dataset. Moreover, two strong baselines were used for comparison: the pix2pix method and a modified video style transfer method (CONVST) in which the stylization network is replaced with pix2pix.

Face vid2vid

pose-to-body vid2vi2

vid2vid

edge-to-face

Comparison with other state-of-the-art

They use both subjective and objective metrics for performance evaluation: the human preference score and the Fréchet Inception Distance (FID). A comparison between the proposed method and other methods is given in the tables.

Future video prediction results. Top left: ground truth. Top right: PredNet. Bottom left: MCNet. Bottom right: proposed.
Example output from the multimodal video synthesis.

Conclusion

A new state-of-the-art method for video synthesis has been proposed. The conditional GAN-based approach shows impressive results on several different tasks within the scope of video-to-video translation. There are numerous applications of this kind of method in computer vision, robotics, and computer graphics. Using a learned video synthesis model, one can generate realistic videos for many different purposes and applications.

Everybody Dance Now: a New Approach to “Do As I Do” Motion Transfer

30 August 2018

Not very good at dancing? Not a problem anymore! Now you can easily impress your friends with a stunning video, where you dance like a superstar.

Researchers from UC Berkeley propose a new method of transferring motion between human subjects in different videos. They claim that given a source video of a person dancing they can transfer that performance to a novel target after only a few minutes of the target subject performing standard moves.

But let's first check what the previous approaches to this kind of task were.

Previous Works

Motion transfer, or retargeting, has received considerable attention from researchers over the last two decades. Early methods created new content by manipulating existing video footage.

So, what is the idea behind this new approach?

State-of-the-art Idea

This method poses the problem as a per-frame image-to-image translation with spatiotemporal smoothing. The researchers use pose detection, represented with the pose stick figures, as an intermediate representation between source and target. The aligned data is used to learn an image-to-image translation model between pose stick figures and images of the target person in a supervised way.

Two additional components improve the results: conditioning the prediction at each frame on that of the previous time step for temporal smoothness and a specialized GAN for realistic face synthesis.

Before diving deeper into the architecture of the suggested approach, let’s check the results with this short video:

So, in essence, the model is trained to produce personalized videos of a specific target subject. Motion transfer occurs when the pose stick figures go into the trained model to obtain images of the target subject in the same pose as the source.

Method

The pipeline of the suggested approach includes three stages:

  1. Pose detection – using a pretrained state-of-the-art pose detector to create pose stick figures from the source video.
  2. Global pose normalization – accounting for differences between the source and target subjects in body shapes and locations within the frame.
  3. Mapping from normalized pose stick figures to the target subject.

Here is an overview of the method:

Method overview

For training, the model uses a pose detector P to create pose stick figures from video frames of the target subject. The mapping G is then learned alongside an adversarial discriminator D, which attempts to distinguish between the “real” correspondence pair (x, y) and the “fake” pair (G(x), y).

Next, for transfer, the pose detector P is used to obtain pose joints for the source person. These are then transformed by the normalization process Norm into joints for the target person, from which pose stick figures are created. Finally, the trained mapping G is applied.

The researchers base their method on the objective presented in pix2pixHD with some extensions to produce temporally coherent video frames and synthesize realistic face images.

Temporal Smoothing

To create video sequences, they modify the single-image generation setup to enforce temporal coherence between adjacent frames, as shown in the figure below.

Temporal smoothing setup

In brief, the current frame G(xt) is conditioned on its corresponding pose stick figure xt and the previously synthesized frame G(xt−1) to obtain temporally smooth outputs. The discriminator D then attempts to differentiate the “real” temporally coherent sequence (xt−1, xt, yt−1, yt) from the “fake” sequence (xt−1, xt, G(xt−1), G(xt)).
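Up to notation, the temporal-smoothing objective this describes can be written as:

\mathcal{L}_{smooth}(G, D) = \mathbb{E}_{(x,y)}\big[\log D(x_{t-1}, x_t, y_{t-1}, y_t)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(x_{t-1}, x_t, G(x_{t-1}), G(x_t))\big)\big]

which replaces the single-frame adversarial term of the pix2pixHD objective mentioned above.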

Face GAN setup

The researchers further extend the model with a specialized GAN setup designed to add more detail and realism to the face region as shown in the figure below.  To be specific, the model uses a single 70×70 Patch-GAN discriminator for the face discriminator.

Face GAN setup

Now let’s move to the results of the experiments…

Results

The target subjects were recorded for around 20 minutes of real time footage at 120 frames per second. Moreover, considering that the model does not encode information about clothes, the target subjects wear tight clothing with minimal wrinkling.

Videos of the source subjects were found online – these videos just need to be of reasonably high quality and show the subject performing a dance.

Here are the results with the top row showing the source subject, the middle row showing the normalized pose stick figures, and the bottom row depicting the model outputs of the target person:

Transfer results for five consecutive frames

The tables below demonstrate the performance of the full model (with both the temporal smoothing and Face GAN setups) in comparison with the baseline model (pix2pixHD) alone and the baseline model with the temporal smoothing setup. The quality of individual frames was assessed with the Structural Similarity (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS) measures.

table 01

Comparison of synthesis results for different models (T.S.: a model with temporal smoothing, T.S. + Face: a full model with both temporal smoothing setup and Face GAN)

To further analyze the quality of the results, the researchers ran the pose detector P on the outputs of each model and compared the reconstructed keypoints to the pose detections of the original input video. If all body parts are synthesized correctly, the reconstructed pose should be close to the input pose on which the output was conditioned. See the results in the tables below:

table 1

As you can see from the tables, the temporal smoothing setup doesn't seem to add much to the baseline if one looks only at the quantitative results. Qualitatively, however, it helps with smooth motion, color consistency across frames, and individual frame synthesis.

The Face GAN setup, on the other hand, improves both the quantitative and qualitative results of the model. As is evident from the pictures below, this component adds considerable detail to the output video and encourages the synthesis of realistic body parts.

Face image comparison from different models on the validation set

Conclusion

The presented model is able to create reasonable and arbitrarily long videos of a target person dancing, given body movements to follow in an input video of another subject. However, the results still often suffer from jittering. This is especially the case when the input movements or movement speed differ from the movements seen at training time.

Considering that jittering and shakiness remain even if the target person tries to copy the movements of the source subject during the training sequence, the researchers suppose that jittering could also result from the underlying difference between how the source and target subjects move given their unique body structures. Still, this approach to motion transfer is able to generate compelling videos given a variety of inputs.

DeepWrinkles: Accurate and Realistic Clothing Modeling

28 August 2018


Realistic garment reconstruction is a notoriously complex problem, and its importance is undeniable in many research areas and applications, such as accurate body shape and pose estimation in the wild (i.e., from observations of clothed humans), realistic AR/VR experiences, movies, video games, virtual try-on, etc. For the past decades, physics-based simulations have set the standard in the movie and video game industries, even though they require hours of labor by experts.

Facebook Research presents a novel approach called DeepWrinkles to generate accurate and realistic clothing deformation from real data capture. It consists of two complementary modules:

  • A statistical model is learned from 3D scans of clothed people in motion, from which clothing templates are precisely non-rigidly aligned.
  • Fine geometric details are added to normal maps generated using a conditional adversarial network whose architecture is designed to enforce realism and temporal consistency.

The goal is to recover all observable geometric details. Assuming the finest details are captured at sensor image pixel resolution and are reconstructed in 3D, all existing geometric details can then be encoded in a normal map of the 3D scan surface at a lower resolution, as shown in the figure below.

clothes modelling

Cloth deformation is modeled by learning a linear subspace model that factors out body pose and shape. However, unlike physics-based simulation, this model is learned from real data.


The strategy ensures deformations are represented compactly and with high realism. First, robust template-based non-rigid registrations are computed from a 4D scan sequence; then a statistical model of clothing deformation is derived; and finally, a regression model is learned for pose retargeting.

Data Preparation

Data capture: For each type of clothing, 4D scan sequences are captured at 60 fps (e.g., 10.8k frames for 3 min) of a subject in motion, and dressed in a full-body suit with one piece of clothing with colored boundaries on top. Each frame consists of a 3D surface mesh with around 200k vertices yielding very detailed folds on the surface but partially corrupted by holes and noise. In addition, capturing only one garment prevents occlusions where clothing normally overlaps (e.g., waistbands) and items of clothing can be freely combined with each other.

Registration: The template of clothing T is defined by choosing a subset of the human template with consistent topology. T should contain enough vertices to model deformations (e.g., 5k vertices for a T-shirt). The clothing template is then registered to the 4D scan sequence using a variant of non-rigid ICP based on grid deformation.

Statistical model

The statistical model is computed using linear subspace decomposition by PCA. The poses of all n registered meshes are factored out of the model by pose-normalization using inverse skinning. Each registration R_i can be represented by a mean shape and vertex offsets o_i, such that R_i = M + o_i, where the mean shape M ∈ R^(3×v) is obtained by averaging vertex positions. Finally, each R_i can be compactly represented by a linear blend shape function B(φ_i) = M + Σ_k φ_{i,k} P_k, where the P_k are the principal components retained by the PCA and φ_i are the shape parameters of R_i.

Pose-to-shape prediction

A predictive model f is learned that takes joint poses as input and outputs a set of k shape parameters. This enables powerful applications where deformations are induced by the pose. To take into account the deformation dynamics that occur during human motion, the model is also trained with pose velocity, acceleration, and shape-parameter history.
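As a purely hypothetical illustration of what such a regressor could look like (the paper does not commit to this exact architecture; the layer sizes are made up), a small fully connected network mapping the concatenated pose features to k shape parameters would be:

```python
import torch.nn as nn

def make_pose_to_shape_regressor(input_dim, k, hidden=256):
    """input_dim: size of the concatenated pose, velocity, acceleration and history vector."""
    return nn.Sequential(
        nn.Linear(input_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, k),  # predicted blend-shape coefficients
    )
```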

Outline of DeepWrinkles

Architecture

neural network design clothing modelling

As stated earlier, the goal is to recover all observable geometric details, encoding the finest ones in a normal map of the 3D scan surface at a lower resolution. To automatically add these fine details on the fly to reconstructed clothing, a generative adversarial network that operates on normal maps is proposed.

The proposed network is based on a conditional generative adversarial network (cGAN) inspired by image-to-image transfer. A convolution-BatchNorm-ReLU structure with a U-Net is used in the generative network, since the skip connections transfer information across the network layers and allow the overall structure of the image to be preserved. Temporal consistency is achieved by extending the L1 network loss term. For compelling animations, it is not only important that each frame looks realistic; there should also be no sudden jumps in the rendering. To ensure a smooth transition between consecutively generated images across time, an additional loss L_temp is introduced to the GAN objective that penalizes discrepancies between generated images at time t and expected images (from the training dataset) at t − 1:

loss function

where L_data helps to generate images close to the ground truth in an L1 sense (for less blurring). The temporal consistency term L_temp is meant to capture global fold movements over the surface.

The cGAN network is trained on a dataset of 9,213 consecutive frames. The first 8,000 images compose the training set, the next 1,000 images the test set, and the remaining 213 images the validation set. The test and validation sets contain poses and movements not seen in the training set. The U-Net auto-encoder is constructed with 2 × 8 layers and 64 filters in each of the first convolutional layers. The discriminator uses patches of size 70 × 70. The L_data weight is set to 100, the L_temp weight to 50, and the GAN weight to 1. The images have a resolution of 256 × 256, although early experiments also showed promising results at 512 × 512.
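Putting the pieces together, the overall objective sketched by the description above (with the weights just quoted) is roughly:

\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + 100\,\mathcal{L}_{data}(G) + 50\,\mathcal{L}_{temp}(G)

where L_cGAN is the usual conditional adversarial term, L_data is the L1 distance to the ground-truth normal map, and L_temp is the temporal consistency term described above; the exact formulation in the paper may group the weights differently.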

Result

DeepWrinkles is an entirely data-driven framework for capturing and reconstructing clothing in motion from 4D scan sequences. The evaluations show that high-frequency details can be added to low-resolution normal maps using a conditional adversarial network. A temporal loss added to the GAN objective preserves geometric consistency across time, and the authors show qualitative and quantitative evaluations on different datasets.

Results
a) Physics-based simulation, b) Subspace (50 coefficients) c) Registration d) DeepWrinkles e) 3D scan (ground truth)

A Style-Aware Content Loss for Real-time HD Style Transfer

14 August 2018


A picture may be worth a thousand words, but at least it contains a lot of very diverse information. This not only comprises what is portrayed, e.g., a composition of a scene and individual objects but also how it is depicted, referring to the artistic style of a painting or filters applied to a photo. Especially when considering artistic images, it becomes evident that not only content but also style is a crucial part of the message an image communicates (just imagine van Gogh’s Starry Night in the style of Pop Art). A vision system then faces the challenge to decompose and separately represent the content and style of an image to enable a direct analysis based on each individually. The ultimate test for this ability is style transfer, exchanging the style of an image while retaining its content.

Neural Style Transfer Example
Neural Style Transfer Example

Recent work has used neural networks, and the crucial representation in all these approaches has been based on a VGG16 or VGG19 network pretrained on ImageNet. However, a recent trend in deep learning has been to avoid supervised pre-training on a million images with tediously labeled object bounding boxes. In the setting of style transfer, this has the particular benefit of avoiding, from the outset, any bias introduced by ImageNet, which has been assembled without artistic consideration. Rather than utilizing a separate pre-trained VGG network to measure and optimize the quality of the stylistic output, an encoder-decoder architecture with an adversarial discriminator is used to stylize the input content image, and the encoder is also used to measure the reconstruction loss.

State of the Art

To enable a fast style transfer that instantly transfers a content image or even frames of a video according to a particular style, a feed-forward architecture is required rather than the slow optimization-based approach. To this end, an encoder-decoder architecture is used that utilizes an encoder network E to map an input content image x onto a latent representation z = E(x). A generative decoder G then plays the role of a painter and generates the stylized output image y = G(z) from the sketchy content representation z. Stylization then requires only a single forward pass, thus working in real time.

style-transfer-video-neural-network

1) Training with a Style-Aware Content Loss

Previous approaches have been limited in that training worked only with a single style image. In contrast, in this work a single image y0 is given together with a set Y of related style images yj ∈ Y. To train E and G, a standard adversarial discriminator D is used to distinguish the stylized output G(E(xi)) from real examples yj ∈ Y. The transformed image loss is then defined as:

content-loss

where C × H × W is the size of image x, and for training T is initialized with uniform weights. Figure 3 of the paper illustrates the full pipeline of the approach. To summarize, the full objective of the model is:

full-network

where λ controls the relative importance of the adversarial loss.

2) Style Image Grouping

Given a single style image y0, the task is to find a set Y of related style images yj ∈ Y. A VGG16 is trained from scratch on the Wikiart dataset to predict the artist of a given artwork. The network is trained on the 624 artists with the largest number of works in the Wikiart dataset. Artist classification, in this case, is the surrogate task for learning meaningful features in the artworks' domain, which allows retrieving artworks similar to image y0.

Let φ(y) be the activations of the fc6 layer of the VGG16 network C for input image y. To get a set of style images related to y0 from the Wikiart dataset Y, all nearest neighbors of y0 are retrieved based on the cosine distance δ of the activations φ(·), i.e.

wikiart-dataset
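A small Python sketch of that retrieval step is shown below. Here phi is assumed to return the fc6 feature vector of the artist-classification VGG16 described above, and the distance threshold is an illustrative choice rather than the value used in the paper.

```python
import numpy as np

def related_styles(phi, y0, wikiart_images, threshold=0.2):
    """Collect style images whose fc6 features are close to those of y0 (cosine distance)."""
    f0 = phi(y0)
    f0 = f0 / np.linalg.norm(f0)
    related = []
    for y in wikiart_images:
        f = phi(y)
        cos_dist = 1.0 - float(np.dot(f0, f / np.linalg.norm(f)))
        if cos_dist < threshold:   # close in the learned artwork feature space
            related.append(y)
    return related
```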

The basis of the style transfer model is an encoder-decoder architecture. The encoder network contains 5 conv layers: 1× conv-stride-1 and 4× conv-stride-2. The decoder network has 9 residual blocks, 4 upsampling blocks, and 1× conv-stride-1. The discriminator is a fully convolutional network with 7× conv-stride-2 layers. During training, 768×768 content image patches are sampled from the training set of Places365 [51] and 768×768 style image patches from the Wikiart dataset. The model is trained for 300,000 iterations with batch size 1, learning rate 0.0002, and the Adam optimizer. The learning rate is reduced by a factor of 10 after 200,000 iterations.

Table: 1
Training time

Experts were asked to choose the one image which best and most realistically reflects the current style. The score is computed as the fraction of times a specific method was chosen as the best in the group. The mean expert score is calculated for each method over 18 different styles and reported in Table 1.

Result

This paper has addressed major conceptual issues in state-of-the-art approaches for style transfer. The proposed style-aware content loss enables a real-time, high-resolution encoder-decoder based stylization of images and videos and significantly improves stylization by capturing how style affects content.

comparison-style-transfer-methods

Result in high resolution

Can Computers See Outside the Box? – Realistic Image Outpainting Using GAN

14 August 2018

Deep learning has been applied to a huge number of computer vision tasks, and so far it has proven successful in many of them. Nevertheless, there are still some tasks where deep neural networks struggle and traditional computer vision approaches work better. Historically, some tasks have been more appealing and therefore well studied and explored, whereas others have attracted much less attention. Image outpainting (or image extrapolation) is one of the latter. While filling holes or missing details in images, i.e., image inpainting, has been widely studied, image outpainting has been addressed in only a few studies and is not a very popular topic among researchers.

However, researchers from Stanford have presented a deep learning approach to the problem of image extrapolation (i.e., image outpainting). They take an interesting approach and address the problem using adversarial learning.

Generative adversarial learning – DCGAN

Generative adversarial learning has received a lot of attention in the past few years and has been applied to a variety of generative tasks. In this work, the researchers use Generative Adversarial Networks to outpaint an image by extrapolating and filling equal-sized regions on the sides of the input image.

As in many generative tasks in computer vision, the goal is to produce a realistic (and visually pleasing) image. Outpainting can be seen as hallucinating past the image boundaries, and intuitively it is not a trivial task, since (almost) anything might appear outside the boundaries of the image in reality. Therefore, a significant amount of additional content is needed that matches the original image, especially near its boundaries. Generating realistic content near the image boundaries is challenging because it has to match the original image; generating realistic content further from the boundaries is almost as challenging, but mostly for the opposite reason – the lack of neighboring information.

In this work, a DCGAN architecture has been employed to tackle the problem of image extrapolation. The authors show that their method is able to generate realistic samples of 128×128 color images and, moreover, that it allows recursive outpainting (to some extent) to obtain larger images.

Data

Example images from the Places365 Dataset

The Places365 dataset has been used to both train and evaluate the proposed method. The authors define a specific preprocessing procedure that consists of three steps: normalizing the images, defining a binary mask that masks out part of the image (horizontally only), and computing the mean pixel intensity over the unmasked regions. After the preprocessing, each input image is represented as a pair of two images: the original image and the preprocessed image. The preprocessed image is obtained by masking the original image and concatenating it with the mean pixel intensity image (per channel).

Method

As mentioned before, the generative model is a GAN trained with a three-phase training procedure to improve the stability of the training process. The generator network is a non-symmetric convolutional encoder-decoder, and the discriminator combines global and local discriminators. The generator has 9 layers (8 convolutional and 1 deconvolutional), while the discriminator has 5 convolutional layers and 1 fully connected layer, plus a concatenation layer that combines the outputs of the local discriminators to produce a single output.

The training pipeline used in the method
The training pipeline used in the method

All layers are followed by a ReLU activation except the output layers of both networks, and dilated convolutions are used to further improve the outpainting. The authors argue that dilated convolutions strongly affect the quality of the generated image and the actual capability of outpainting it. The improvement comes from the increased local receptive field, which makes it possible to outpaint the whole image; dilated convolutions are simply an efficient way to increase the local receptive field of convolutional layers without increasing the computational complexity.
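A tiny PyTorch illustration of that last point: at the same parameter count and output resolution, a dilated 3×3 convolution covers a wider local receptive field than an ordinary one.

```python
import torch.nn as nn

ordinary = nn.Conv2d(64, 64, kernel_size=3, padding=1)               # each output sees a 3x3 window
dilated  = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # each output sees a 5x5 window
```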

Generator

Discriminator

Evaluation and conclusions

This approach shows promising results, as it has proved able to generate relatively realistic image outpaintings. The authors evaluated the method mostly qualitatively, as a consequence of the nature of the problem, and they also use RMSE as a reference quantitative evaluation metric – a modified RMSE that accounts for simple image postprocessing by renormalizing the images. In the final part of the paper, they describe the recursive outpainting experiments they conducted and show that recursively outpainted images remain relatively realistic, even though noise compounds with successive iterations. A recursively outpainted image with 5 iterations is given as an example in the image below.

Recursively outpainted image with 5 iterations. It can be noticed that the noise compounds with the number of iterations
The effect of using local discriminators along with the global one, as opposed to using only the global discriminator