Pairwise Relational Network – New Method for Face Recognition

28 November 2018

With the rapid progress of deep learning in the past few years, many computer vision problems have been tackled and solved with human-level or even beyond-human performance. One of these tasks, and a very popular one at that, is face recognition.

Until the recent past, face recognition seemed like something straight out of science fiction. But over the past decade or two, it has become not only a largely solved problem but also a widespread technology with applications in several industries.

Previous Works

Since face recognition is a challenging task, it took some time for researchers to reach satisfactory results. Researchers in the domains of pattern recognition, computer vision, and artificial intelligence have proposed many solutions in the past. The main goal has been to reduce difficulties such as highly variable face poses and image quality, so as to improve robustness and recognition accuracy.

A number of deep learning-based face recognition methods have been proposed in the past few years: starting with the remarkable results of DeepFace (2014), through methods like DeepID (2014), FaceNet (2015), and VGGFace (2015), all the way to more recent ones like CosFace (2018) and ArcFace (2018).

State-of-the-art Idea

Recently, researchers from Pohang University of Science and Technology in Korea have proposed a novel face recognition method that achieves state-of-the-art results on some of the benchmark datasets. The new method, called pairwise relational network (PRN), takes local appearance features around landmark points on the feature map and captures pairwise relations that are unique within the same identity and discriminative between different identities.

Pairwise Relational Network

In fact, the idea is to build a method that represents a face image in such a manner that the extracted features are discriminative across faces of different people.

Method

The proposed method takes as input local appearance features, obtained by ROI projection around landmark points on the feature map. These features are used to train a PRN (pairwise relational network) that captures unique pairwise relations between pairs of local appearance features. Arguing that the existence of such pairwise relations is identity dependent, the researchers employ an LSTM to learn an additional facial identity state feature. The architecture of the method, as well as of the pairwise relational network, is given in the figure below.
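
For intuition, here is a minimal sketch of such a pairwise relation module in PyTorch. This is not the authors' exact architecture: the feature dimensions, the relation MLP, and the averaging aggregation are illustrative assumptions, and the identity-state conditioning via the LSTM is omitted.

```python
import torch
import torch.nn as nn

class PairwiseRelationModule(nn.Module):
    """Toy pairwise relational module: for every ordered pair of local
    appearance features, an MLP predicts a relation vector, and the
    relations are aggregated by averaging."""

    def __init__(self, feat_dim=256, relation_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, relation_dim),
            nn.ReLU(),
            nn.Linear(relation_dim, relation_dim),
        )

    def forward(self, local_feats):
        # local_feats: (batch, num_landmarks, feat_dim)
        b, n, d = local_feats.shape
        # Build all ordered pairs (i, j) by broadcasting and concatenation.
        f_i = local_feats.unsqueeze(2).expand(b, n, n, d)
        f_j = local_feats.unsqueeze(1).expand(b, n, n, d)
        pairs = torch.cat([f_i, f_j], dim=-1)   # (b, n, n, 2 * feat_dim)
        relations = self.mlp(pairs)             # (b, n, n, relation_dim)
        return relations.mean(dim=(1, 2))       # aggregate over all pairs
```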

The architecture of the proposed method
Learning face identity state feature

From the perspective of learning and optimization, the method uses a combination of triplet ratio loss, pairwise loss, and softmax loss. Stochastic gradient descent was used as the optimization method, with an initial learning rate of 0.1.

Evaluation and Comparison

The proposed method was evaluated on the LFW dataset, the standard benchmark for face verification in unconstrained environments and a popular choice for evaluating face recognition methods. It contains 13,233 highly variable face images of 5,749 different identities. The PRN method reaches 99.76% accuracy on this dataset, almost the same as the state-of-the-art method ArcFace (99.78%).

Comparison with other methods on the LFW Dataset.

However, when evaluated on the YTF dataset (which has characteristics similar to LFW), the PRN method achieves state-of-the-art results – 96.3%.

Comparison with other methods on the YTF Dataset. The PRN method achieves state-of-the-art performance.

Additionally, the method was evaluated on IJB-A and IJB-B datasets for evaluating face verification and face identification. The results obtained on these datasets as well as on LFW and YTF compared with other methods are reported in the tables below.

Comparison of performances of the proposed PRN method with the state-of-the-art on the IJB-B dataset.
Comparison of performances of the proposed PRN method with the state-of-the-art on the IJB-A dataset.

Conclusion

The researchers proposed an interesting approach to a well-known problem – face recognition. In their paper, they show that capturing these unique and discriminative pairwise relations solves the problem of face identification to a high degree of accuracy. Extensive experiments were conducted on popular datasets; the method achieves very good results on all of them and state-of-the-art performance on one of them.

Learning Physical Skills from Youtube Videos using Deep Reinforcement Learning

6 November 2018

Realistic, humanlike characters represent a very important area of computer animation. These characters are vital components of many applications, such as cartoons, computer games, cinematic special effects, virtual reality, artistic expression, etc. However, character animation production typically goes through a number of creation stages, and as such it is a laborious task.

Previous Work

This labor-intensive task represents a bottleneck in the whole process of computer animation creation. In the past, there have been a number of attempts to overcome this problem by supporting the task with automatic tools, or even automating it completely.

Many approaches proposed in the past have failed to produce robust and naturalistic motion controllers that enable virtual characters to perform complex skills in physically simulated environments. The first attempts focused mostly on understanding the physics and biomechanics, and on formulating motion patterns and replicating them with virtual characters. More recently, data-driven approaches have attracted attention. However, most data-driven approaches, save for a few exceptions, are based on motion capture data, which often requires costly instrumentation and heavy pre-processing.

State-of-the-art Idea

Recently, researchers from Berkeley AI Research at the University of California have proposed a novel Reinforcement Learning-based approach for learning character animation from videos.

Combining motion estimation from videos and deep reinforcement learning, their method is able to synthesize a controller given a monocular video as input. Additionally, the proposed method is able to predict potential human motions from still images, by forward simulation of learned controllers initialized from the observed pose.

The proposed pipeline for learning acrobatic skills from Youtube videos.

Method

The researchers propose a framework that takes a monocular video and outputs motion imitation done by a simulated character model. The whole approach is based on pose estimation in the frames of the video, which is later used for motion reconstruction and motion imitation to achieve the final goal.

First, the input video is processed by the pose estimation stage, where learned 2D and 3D pose estimators are applied to extract (estimate) the pose of the actor in each frame. Next, the set of predicted poses proceeds to the motion reconstruction stage, where a reference motion trajectory is optimized to be consistent with both the 2D and 3D pose predictions, while also enforcing temporal consistency between frames. The reference motion is then utilized in the motion imitation stage, where a control policy is trained to enable the character to reproduce the reference motion in a physically simulated environment.
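
The overall flow can be summarized in a few lines of code. This is an outline only; each stage is injected as a callable placeholder rather than the authors' actual API.

```python
def video_to_policy(frames, estimate_pose_2d, estimate_pose_3d,
                    reconstruct_motion, train_imitation_policy):
    """Chain the three stages of the pipeline; every argument after
    `frames` is a hypothetical stand-in for the corresponding module."""
    poses_2d = [estimate_pose_2d(f) for f in frames]   # per-frame 2D poses
    poses_3d = [estimate_pose_3d(f) for f in frames]   # per-frame 3D poses
    # Optimize a temporally consistent reference trajectory that agrees
    # with both sets of predictions.
    reference_motion = reconstruct_motion(poses_2d, poses_3d)
    # Train an RL policy to track the reference motion in simulation.
    return train_imitation_policy(reference_motion)
```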

Pose Estimation Stage

The first module in the pipeline is the pose estimation module. At this stage, the goal is to estimate the pose of the actor from a single still image, i.e., each video frame. A number of challenges have to be addressed here in order to obtain accurate pose estimates. First, the variability in body orientation among different actors performing the same movement is very high. Second, pose estimation is done at each frame independently of the previous and next frames, without accounting for temporal consistency.

To address both of these issues, the researchers use an ensemble of existing, proven pose estimation methods, along with a simple data augmentation technique to improve pose predictions in the domain of acrobatic movements.

They train an ensemble of estimators on the augmented dataset and obtain 2D and 3D pose estimations for each frame, which define the 2D and 3D motion trajectories, respectively.

Comparison of the motions generated by different stages of the pipeline for backflip motion. Top-to-Bottom: Input video clip, 3D pose estimator, 2D pose estimator, simulated character.

Motion Reconstruction Stage

In the motion reconstruction stage, the independent predictions from the pose estimators are consolidated to form the final reference motion. The ultimate goal that the researchers were aiming for in this stage is to improve the quality of the reference motions by fixing errors and removing motion artifacts often manifested as nonphysical behaviours. According to the researchers, these motion artifacts appear due to inconsistent predictions across adjacent frames.

Again, an optimization technique is applied at this stage, optimizing for a common 3D pose trajectory across the pose estimators while enforcing temporal consistency between consecutive frames. The optimization is done in the latent space of the pose estimators, leveraging their encoder-decoder architecture.

Motion Imitation Stage

In the final stage, deep reinforcement learning is applied to reach the final objective. From a machine learning perspective, the goal is to learn a policy that enables the character to reproduce the demonstrated skill in a physically simulated environment. The reference motion extracted previously is used to define an imitation objective, and a policy is then trained to imitate the given motion.

The reward function is designed to incentivize the character to track the joint orientations of the reference motion: quaternion differences are computed between the character’s joint rotations and the joint rotations of the extracted reference motion.
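
Here is a minimal sketch of such a tracking reward, in the spirit of DeepMimic-style imitation objectives. The quaternion utility and the scaling constant are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def quat_diff_angle(q1, q2):
    """Rotation angle (radians) between two unit quaternions (w, x, y, z)."""
    dot = abs(float(np.dot(q1, q2)))                # abs handles double cover
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))

def pose_reward(char_joint_quats, ref_joint_quats, scale=2.0):
    """Exponentiated sum of squared joint-rotation differences between
    the simulated character and the reference motion at one time step."""
    err = sum(quat_diff_angle(q, r) ** 2
              for q, r in zip(char_joint_quats, ref_joint_quats))
    return np.exp(-scale * err)
```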

Final result of the method: Character imitating a 2-handed vault.

Results

To demonstrate the framework and evaluate the proposed method, the researchers employ a 3D humanoid character and a simulated Atlas robot. A qualitative evaluation was done by comparing snapshots of the simulated characters with the original video demonstrations. All video clips were collected from YouTube and depict human actors performing various acrobatic motions. As mentioned in the paper, since it is difficult to quantify the difference between the motion of the actor in the video and the simulated character, performance was evaluated with respect to the extracted reference motion. The figures below show overlapping snapshots of the real videos and the simulated characters for a qualitative evaluation.

The simulated Atlas robot performing skills learned from videos.

Qualitative evaluation using simulated characters performing different skills learned from video clips.

Conclusions

The proposed method for data-driven animation creation leverages the abundance of publicly available video clips from the web to learn full-body motion skills, and as such it represents a significant contribution. The framework shows the potential of combining multiple different techniques to reach a specific objective. The modular design is a big advantage: new advances relevant to the various stages of the pipeline can be incorporated later to further improve the overall effectiveness of the framework.

“The Sound of Pixels”: Self-supervised Method for Sound Localisation and Separation

29 October 2018

We, as human beings, are able to digest a video with sound almost effortlessly. Detecting and tracking objects within the frames of a video gives us contextual information and a rough understanding of what is going on. At the same time, we process the audio information, tightly coupled with those frames, and gain an even better understanding of the action happening in the video.

Previous work

Quite a lot of research has been done exploring the relationship between vision and sound. A number of problems arise from this relationship, and a number of methods have been proposed to solve some of them. Especially recently, researchers have addressed problems such as sound localization in videos, sound generation for silent videos, self-supervision in videos using the audio signal, etc.

State-of-the-art idea

Recent work by researchers from the Massachusetts Institute of Technology, the MIT-IBM Watson AI Lab, and Columbia University explores the relationship between vision and sound in a different way. The researchers propose a novel self-supervised method that learns to locate the image regions which produce sounds and to separate the input audio into a set of components, one for each pixel.

The architecture of the proposed method.

Method

Their approach leverages the natural synchronization of visual and audio information to learn to separate and locate sound components within a video in a self-supervised manner. They introduce a system called PixelPlayer that learns to recognize and localize objects in images and to separate the audio component coming from each object. Additionally, the researchers introduce a new musical instrument video dataset, called MUSIC, built for the purpose of this work.

Example frames and sounds from the created video dataset – MUSIC.
Dataset Statistics.

As mentioned before, the proposed method localizes sound sources in a video and separates the audio into its components without supervision. It is composed of three major modules: a video analysis network, an audio analysis network, and an audio synthesizer network. This architecture allows extracting visual and audio features for the goal of audio-visual source separation and localization.

Video Analysis Network

The video analysis network extracts visual features for each frame in the video; temporal pooling over the per-frame features then yields a visual feature vector for each pixel. For this, the researchers employ a variant of the popular ResNet-18 network with dilated convolutions.

Audio Analysis Network

In parallel with the extraction of visual features, the audio analysis network splits the input sound of the video into K components. Here, the researchers use sound spectrograms instead of raw waveforms and employ a convolutional architecture that has proven successful with audio data in the past – Audio U-Net. Using this encoder-decoder architecture, they extract K feature maps from the spectrogram, containing features of different components of the input sound. The spectrogram itself is obtained beforehand by applying the Short-Time Fourier Transform (STFT) to the audio.

Audio Synthesizer Network

The core module of the proposed method is the synthesizer network, which takes both the output of the video analysis network and the output of the audio analysis network. More precisely, it takes a pixel-level visual feature and the audio features and outputs a mask that can separate the sound of that pixel from the input spectrogram. The final sound for each pixel is obtained by multiplying the corresponding mask with the input spectrogram; applying the inverse STFT then yields the waveform.
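
To illustrate that last step, here is how a predicted per-pixel mask might be applied to the mixture spectrogram and inverted back to a waveform with librosa. The STFT parameters are placeholders, not the values used in the paper.

```python
import librosa

N_FFT, HOP = 1022, 256   # illustrative STFT parameters

def separate_pixel_sound(mixture_waveform, mask):
    """Apply a (freq x time) mask predicted for one pixel to the
    mixture spectrogram, then invert back to a waveform."""
    spec = librosa.stft(mixture_waveform, n_fft=N_FFT, hop_length=HOP)
    masked = mask * spec     # element-wise masking (binary or ratio mask)
    return librosa.istft(masked, hop_length=HOP)
```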

The proposed “Mix-and-Separate” training framework.

In order to train such an architecture in an unsupervised (or rather self-supervised) manner, the researchers propose a training framework they call Mix-and-Separate. The idea builds on the assumption that sounds are approximately additive: they mix sounds from different videos to generate a complex audio input signal, and the learning objective is then to separate the sound source of interest conditioned on the visual input associated with it.
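
Schematically, one Mix-and-Separate training step could look as follows. Every network and helper here is an injected placeholder, and the mean-squared mask loss is a generic stand-in for the loss used in the paper.

```python
import numpy as np

def mix_and_separate_loss(frames_a, audio_a, frames_b, audio_b,
                          video_net, audio_net, synth_net,
                          to_spec, ideal_mask):
    """One self-supervised step: mix two audio tracks, then score how
    well each track is recovered conditioned on its own video frames."""
    mix_spec = to_spec(audio_a + audio_b)       # sounds are ~additive
    loss = 0.0
    for frames, audio in ((frames_a, audio_a), (frames_b, audio_b)):
        mask = synth_net(video_net(frames), audio_net(mix_spec))
        target = ideal_mask(to_spec(audio), mix_spec)
        loss += np.mean((mask - target) ** 2)   # generic mask loss
    return loss
```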

“Which pixels are making sounds?” Energy distribution of sound across image pixels.
“What sounds do these pixels make?” Clustering of sound in the pixel space.

Results

For an initial evaluation, the authors again use their Mix-and-Separate framework: they create a validation set of synthetic audio mixtures and evaluate the separation on it. Since the goal is to predict the spectrogram, qualitative evaluation compares the ground-truth and estimated spectrograms. For quantitative evaluation, they use the Normalized Signal-to-Distortion Ratio (NSDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifact Ratio (SAR) as metrics on the validation set of synthetic videos.

Quantitative evaluation and comparison with traditional methods such as NMF (Non-negative Matrix Factorization) and Spectral Regression. Also, comparison between using a binary and ratio mask in the proposed method.
Qualitative evaluation. Comparison between the ground-truth and the estimated spectrogram.

Conclusion

Overall, the proposed method is interesting from several points of view. First, it shows that self-supervised learning can be applied to problems like this. Second, the method jointly performs several tasks: locating the image regions that produce sounds and separating the input audio into a set of components that represent the sound from each pixel. And third, it is one of the first studies to explore the correspondence between individual pixels and sound in videos.

3D Hair Reconstruction Out of In-the-Wild Videos

22 October 2018

3D hair reconstruction is a problem with numerous applications in different areas such as virtual reality, augmented reality, video games, medical software, etc. As a non-trivial problem, it has attracted various proposed solutions in the past, some more successful than others. Generating a realistic 3D hair model is a challenge even in controlled, relatively sterile environments; generating a 3D hair model in-the-wild, out of ordinary photos or videos, is harder still.

Previous works

Recently, we wrote about an approach to realistic 3D hair reconstruction from a single image. Such methods work, but fail to produce high-fidelity 3D hair reconstructions due to the limitations and ambiguity of the single-view problem. Other approaches use multiple images or views and yield improved results at the cost of increased complexity: they require controlled environments with 360-degree views of the person and multiple images.

Additionally, some approaches require input such as hair segmentation, making the whole process of 3D hair reconstruction more cumbersome.

State-of-the-art idea

A new approach proposed by researchers from the University of Washington can take an in-the-wild video and automatically output a full head model with a 3D hair-strand model. The input is a video whose frames are processed by a few components to produce hair strands that are estimated and deformed in 3D (rather than in 2D, as in the previous state of the art), thus enabling superior results.

Method

The method is composed of four components, which are shown in the illustration below:

A: Module which uses structure from motion to get relative camera poses, depth maps and a visual hull shape with view-confidence values.

B: Module in which hair segmentation and gradient direction networks are trained and applied to each frame to obtain 2D strands.

C: The segmentations from module B are used to recover the texture of the face area, and a 3D face morphable model is used to estimate face and bald head shapes.

D: The last module and the core of the algorithm where the depth maps and 2D strands are used to create 3D strands. These 3D strands are used to query a hair database and the strands from the best match are refined both globally and locally to fit the input frames from the video.

In this way, a robust and flexible method is obtained which can successfully recover 3D hair strands from in-the-wild video frames.

The proposed method’s architecture showing the four components

Module A: The first module is used to obtain a rough head shape. Each frame of the video is pre-processed using semantic segmentation to separate the person from the background. The goal is to estimate the camera pose per frame and to create a rough initial structure from all the frames.

First, after pre-processing and background removal, the head moving through the frames is reconstructed using a structure-from-motion approach – estimating camera poses and per-frame depth for all frames in the video. The output of this module is a rough visual hull shape.

Module B: The second module contains trained hair segmentation and hair direction classifiers that label hair pixels and predict the hair direction in each video frame, inspired by the strand direction estimation method of Chai et al. 2016.

Hair segmentation, directional labels and 2D hair strands of example video frames

Module C: In this module, the segmented frames are used to select the frame closest to a frontal face (where yaw and pitch are approximately 0), which is fed to a morphable-model-based face model estimator.

Module D: The last and core module estimates 3D hair strands using the outputs of modules A, B, and C. Since each frame has an estimate of 2D strands, initial 3D strands are obtained by projecting the 2D strands to the estimated depths. Because these initial strands are incomplete, they are used to query a database of 3D hair models. The researchers use the hair dataset created by Chai et al. 2016, which contains 35,000 different hairstyles, each hairstyle model consisting of more than 10,000 hair strands. Global and local deformations are applied in the end to refine the obtained 3D hair strands so that they fit the input frames of the video.

The local and global transformation applied to the 3D hair strands

Results

To evaluate the proposed approach, the researchers use quantitative and qualitative evaluation as well as a human study. The quantitative comparison is made by projecting the reconstructed hair as lines onto the images and computing the per-frame intersection-over-union (IOU) with the ground-truth hair mask. The results are shown in the table below; a larger IOU means the reconstructed hair approximates the input better.
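
Computing this per-frame IOU is straightforward; here is a minimal NumPy version, assuming boolean hair masks.

```python
import numpy as np

def hair_mask_iou(pred_mask, gt_mask):
    """Intersection-over-union between a rendered hair mask and the
    ground-truth hair segmentation for one frame (boolean arrays)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0   # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union
```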

This figure shows the results compared to the state-of-the-art methods

The approach was also evaluated qualitatively against some state-of-the-art methods. Moreover, human preference tests using Amazon Mechanical Turk were conducted, and the results are shown in the tables.

This figure shows four example frames comparing the silhouettes of the reconstructed hairstyles to the hair segmentation results.
The ratio of preference of the methods’ results over total compared to Hu et al. 2017 based on Amazon Mechanical Turk tests.
The ratio of preference of the methods’ results over total compared to Zhang et al. 2017 based on Amazon Mechanical Turk tests.

Conclusion

In this paper, researchers from the University of Washington propose a fully automatic way of reconstructing 3D hair from in-the-wild videos, which can have a wide variety of potential applications. Although the method is quite complex and involves many steps, the results are more than satisfactory. The approach shows that higher-fidelity results can be obtained by incorporating information from multiple video frames with slightly different views; the proposed system exploits this to reconstruct a 3D hair model while not being restricted to specific views and head poses.

Head Reconstruction from Internet Photos

15 October 2018

Methods that reconstruct 3D models of people’s heads from images need to account for varying 3D pose, lighting, non-rigid changes due to expressions, the relatively smooth surfaces of faces, ears, and neck, and finally, the hair. Great reconstructions can be achieved nowadays when the input photos are captured in a calibrated lab setting or a semi-calibrated setup where the person participates in the capture session (see related work).

Reconstructing from Internet photos, however, is an open problem due to the high degree of variability across uncalibrated images. Lighting, pose, cameras and resolution change dramatically across photos. In recent years, reconstruction of faces from the Internet has received a lot of attention. All face-focused methods, however, mask out the head using a fixed face mask and focus only on the face area.

Previous Works

Calibrated head modeling has achieved amazing results over the last decade. Reconstruction of people from Internet photos recently achieved good results.

  • Shlizerman et al. showed that it is possible to reconstruct a face from a single Internet photo using a template model of a different person. One way to approach the uncalibrated head reconstruction problem is to use the morphable model approach.
  • Hsieh et al. showed that with morphable models the face is fitted to a linear space of 200 face scans, and the head is reconstructed from the linear space as well. In practice, morphable model methods work well for face tracking.
  • Adobe Research proved that hair modeling could be done from a single photo by fitting to a database of synthetic hairs or by fitting helices.

State-of-the-art idea

This idea addresses the new direction of head reconstruction directly from Internet data. Given a photo collection, obtained by searching for photos of a specific person on Google image search, the task is to reconstruct a 3D model of that person’s head (not just the face area). With only one or two photos per view, the problem is very challenging due to lighting inconsistencies across views, the difficulty of segmenting the face profile from the background, and the challenge of merging images across views. The key idea is that with many more photos (hundreds) per 3D view, these problems can be overcome. For celebrities, one can quickly acquire such collections from the Internet; for others, such photos can be extracted from Facebook or mobile photos.

The method works as follows: a person’s photo collection is divided into clusters of approximately the same azimuth angle of the 3D pose. Given the clusters, a depth map of the frontal face is reconstructed, and the method gradually grows the reconstruction by estimating surface normals per view cluster and then constraining using boundary conditions coming from neighboring views. The final result is a head mesh of the person that combines all the views.

Figure 2

The given photos are divided into view clusters V_i; photos in the same view cluster have approximately the same 3D pose and azimuth angle. Seven clusters are used, with azimuths i = 0, −30, 30, −60, 60, −90, 90 degrees. Figure 2 shows the averages of each cluster after rigid alignment using fiducial points (1st row) and after subsequent alignment using the Collection Flow method (2nd row), which calculates optical flow from each cluster photo to the cluster average.

Head Mesh Initialization

The goal is to reconstruct the head mesh M. The algorithm starts by estimating a depth map and surface normals of the frontal cluster V0 and assigning each reconstructed pixel to a vertex of the mesh. It proceeds as follows:

  • Dense 2D alignment: Photos are first rigidly aligned using 2D fiducial points. The head region, including neck and shoulders, is segmented in each image using semantic segmentation. Then Collection Flow is run on all the photos in V0 to densely align them with the average photo of that set. Challenging photos do not break the method: given that the majority of the images are segmented well, Collection Flow corrects for inconsistencies. Collection Flow also helps overcome differences in hairstyle by warping all the photos to the dominant style.
  • Surface normals estimation: A template face mask is used to find the face region in all the photos. Photometric Stereo (PS) is then applied to the face regions of the flow-aligned photos. The face regions are arranged in an n×pk matrix Q, where n is the number of photos and pk is the number of face pixels determined by the template facial mask. Rank-4 PCA is computed to factorize Q into lighting and normals: Q = LN. After estimating the lighting L for each photo, N is calculated for all p head pixels, including the ear, chin, and hair regions. Two key components made PS work on uncalibrated head photos:
    1. Resolving the Generalized Bas-Relief (GBR) ambiguity using a template 3D face of a different individual.
    2. Using a per-pixel surface normal estimation, where each point uses a different subset of photos to estimate the normal.
  • Depth map estimation: The surface normals are integrated to create a depth map D0 by solving a linear system of equations that satisfies the gradient constraints dz/dx = −nx/nz and dz/dy = −ny/nz, where (nx, ny, nz) are the components of the surface normal at each point. Stacking these constraints for the z-values of the depth map generates a sparse 2p×p matrix M, and the depths are found by solving min_z ||Mz − v||² in the least-squares sense (a rough sketch of this integration follows the list).
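
Below is a rough sketch of this kind of normal integration, assembling the gradient constraints into a sparse linear system and solving it with SciPy. Masking, boundary handling, and the exact matrix layout of the paper are simplified here.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def integrate_normals(normals):
    """Recover a depth map z (H x W) from per-pixel unit normals
    (H, W, 3) via dz/dx = -nx/nz and dz/dy = -ny/nz."""
    h, w, _ = normals.shape

    def idx(y, x):
        return y * w + x

    M = lil_matrix((2 * h * w, h * w))
    v = np.zeros(2 * h * w)
    for y in range(h - 1):
        for x in range(w - 1):
            nx, ny, nz = normals[y, x]
            if abs(nz) < 1e-6:      # skip near-degenerate normals
                continue
            r = 2 * idx(y, x)
            M[r, idx(y, x + 1)] = 1.0      # z(y, x+1) - z(y, x) = -nx/nz
            M[r, idx(y, x)] = -1.0
            v[r] = -nx / nz
            M[r + 1, idx(y + 1, x)] = 1.0  # z(y+1, x) - z(y, x) = -ny/nz
            M[r + 1, idx(y, x)] = -1.0
            v[r + 1] = -ny / nz

    z = lsqr(M.tocsr(), v)[0]   # min_z ||Mz - v||^2
    return z.reshape(h, w)
```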

Boundary-Value Growing

To complete the side views of the mesh, boundary-value growing is introduced. Starting from the frontal-view mesh of V0, more regions of the head are gradually completed in the order V30, V60, V90 and V−30, V−60, V−90, with two additional key constraints.

  • Ambiguity recovery: Rather than recovering the ambiguity A that arises from Q = LA^(−1)AN using the template model, the already-computed neighboring cluster is used: for V±30, N0 is used; for V±60, N±30; and for V±90, N±60. Specifically, the out-of-plane pose is estimated from the initial 3D mesh of V0 to the average image of pose cluster V30.
  • Depth constraint: In addition to the gradient constraints, the boundary constraints are also modified. Let Ω0 be the boundary of D′0. Then the part of Ω0 that intersects the mask of D30 keeps the same depth values: D30(Ω0) = D′0(Ω0). With both the boundary constraints and the gradient constraints, the optimization becomes a least-squares problem combining the two sets of linear equations.

After each depth-stage reconstruction (0, 30, 60, … degrees), the estimated depth is projected onto the head mesh. In this way, the head is gradually filled in by gathering vertices from all the views.

Result

The figure below shows the per-view reconstructions that are later combined into a single mesh. For example, the ear in the 90 and −90 degree views is reconstructed well, while the other views are not able to reconstruct the ear.

Individual reconstructions per view cluster, with depth and ambiguity constraints

Figure 5 shows how well the two key constraints work in the 90-degree view reconstruction. Without the correct reference normals and depth constraint, the reconstructed shape is flat and the profile facial region is blurred, which makes it harder to align the result back to the frontal view.

Figure 5. Comparison between without and with two key constraints

The left two shapes show two views of the 90-degree shape reconstructed independently, without the two key constraints; the right two shapes show the result with the two key constraints. Figure 6 shows the reconstruction results for 4 subjects, with each mesh rotated to five different perspectives.


Figure 6: Final reconstructed mesh rotated to 5 views to show the reconstruction from all sides. Each color image is an example from the roughly 1,000-photo collection for each person.

Comparison with other models

A comparison with the software FaceGen, which implements a morphable model approach, is shown below.


Figure 6

For a quantitative comparison, the reprojection error of the shapes produced by three methods (the suggested approach, Space Carving, and FaceGen) is calculated for each person over 600 photos in different poses and lighting variations.

Comparison with the Space Carving method

The average reprojection error is shown in the table below.

Reprojection error from 3 reconstruction methods

The error map of an example image is shown in Figure 7. Notice that the shapes from FaceGen and Space Carving might look good from the frontal view, but they are not correct when rotated to the target view – see how different the ear region is in the figure.

Figure 7: Visualisation of the re-projection error for 3 methods

Conclusion

This approach shows that it is possible to reconstruct a head from Internet photos. However, it has a number of limitations. First, it assumes a Lambertian model for surface reflectance; while this works well, accounting for specularities should improve the results. Second, fiducials for the side views were labeled manually. Third, the model is not complete – the top of the head is missing. Solving this requires adding photos with different elevation angles, rather than focusing only on azimuth changes.

True Face Super-Resolution Upscaling with the Facial Component Heatmaps

1 October 2018

The performance of most facial analysis techniques relies on the resolution of the corresponding image. Face alignment or face identification will not work correctly when the resolution of a face is too low.

What’s Face Super-Resolution?

Face super-resolution (FSR), or face hallucination, provides a viable way to recover a high-resolution (HR) face image from its low-resolution (LR) counterpart. This research area has attracted increasing interest in recent years, and the most advanced deep learning methods achieve state-of-the-art performance in face super-resolution.

However, even these methods often produce results with distorted face structure and only partially recovered facial details. In particular, deep learning based FSR methods fail to super-resolve LR faces under large pose variations.

How can we solve this problem?

  • Augmenting training data with large pose variations still leads to suboptimal results where facial details are missing or distorted.
  • Directly detecting facial components or landmarks in LR faces is also suboptimal and may lead to ghosting artifacts in the final result.

But what about a method that super-resolves LR face images while collaboratively predicting the face structure? Can we use heatmaps to represent the probability of the appearance of each facial component?

We are going to discover this very soon, but let’s first check the previous approaches to the problem of face super-resolution.

Related Work

Face hallucination methods can be roughly grouped into three categories:

  • ‘Global model’ based approaches aim at super-resolving an LR input image by learning a holistic appearance mapping such as PCA. For instance, Wang and Tang reconstruct an HR output from the PCA coefficients of the LR input; Liu et al. develop a Markov random field (MRF) to reduce ghosting artifacts caused by the misalignments in LR images; Kolouri and Rohde employ optimal transport techniques to morph an HR output by interpolating exemplary HR faces.
  • Part based methods are proposed to super-resolve individual facial regions separately. For instance, Tappen and Liu super-resolve HR facial components by warping the reference HR images; Yang et al. localize facial components in the LR images by a facial landmark detector and then reconstruct missing high-frequency details from similar HR reference components.
  • Deep learning techniques can be very different: Xu et al. employ the framework of generative adversarial networks to recover blurry LR face images; Zhu et al. present a cascade bi-network, dubbed CBN, to localize LR facial components first and then upsample the facial components.

State-of-the-art idea

Xin Yu and his colleagues propose a multi-task deep neural network that not only super-resolves LR images but also estimates the spatial positions of their facial components. Their convolutional neural network (CNN) has two branches: one for super-resolving face images and the other for predicting salient regions of a face, coined facial component heatmaps.

The whole process looks like this (a rough sketch follows the list):

  1. Super-resolving features of input LR images.
  2. Employing a spatial transformer network to align the feature maps.
  3. Estimating the heatmaps of facial components with the upsampled feature maps.
  4. Concatenating estimated heatmaps of facial components with the upsampled feature maps.
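
A schematic of this forward pass in PyTorch might look as follows. All submodules are illustrative placeholders for the branches described above, not the authors' exact layers.

```python
import torch
import torch.nn as nn

class MultiTaskUpsampler(nn.Module):
    """Schematic multi-task upsampling: upsample features, align them,
    estimate component heatmaps, then concatenate heatmaps with the
    features before decoding the HR image."""

    def __init__(self, upsampler, stn, heatmap_branch, decoder):
        super().__init__()
        self.upsampler = upsampler            # LR image -> upsampled features
        self.stn = stn                        # spatial transformer alignment
        self.heatmap_branch = heatmap_branch  # features -> 4 component heatmaps
        self.decoder = decoder                # features + heatmaps -> HR image

    def forward(self, lr_image):
        feats = self.upsampler(lr_image)
        feats = self.stn(feats)                 # align the feature maps
        heatmaps = self.heatmap_branch(feats)   # eyes, nose, mouth, chin
        fused = torch.cat([feats, heatmaps], dim=1)
        return self.decoder(fused)
```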

This method can super-resolve tiny unaligned face images (16 × 16 pixels) with an upscaling factor of 8× while preserving the face structure.

(a) LR image; (b) HR image; (c) Nearest Neighbors; (d) CBN; (e) TDAE; (f) TDAE trained on a better dataset; (g) suggested approach

Now let’s learn the details of the proposed method.

Model overview

The network has the following structure:

  1. A multi-task upsampling network (MTUN):
    1. an upsampling branch (composed of a convolutional autoencoder, deconvolutional layers, and a spatial transformer network);
    2. a facial component heatmap estimation branch (HEB).
  2. Discriminative network, which is constructed by convolutional layers and fully connected layers.
The pipeline of the suggested network

Facial Component Heatmap Estimation. Even the state-of-the-art facial landmark detectors cannot accurately localize facial landmarks in very low-resolution images. So, the researchers propose to predict facial component heatmaps from super-resolved feature maps.

2D photos may exhibit a wide range of poses. Thus, to reduce the number of training images required for learning HEB, they suggest employing a spatial transformer network (STN) to align the upsampled features before estimating heatmaps.

In total, four heatmaps are estimated to represent four components of a face: eyes, nose, mouth, and chin (see the image below).

Visualization of estimated facial component heatmaps: (a) Unaligned LR image; (b) HR image; (c) Heatmaps; (d) Result; (e) The estimated heatmaps overlying the results

Loss Function. The results of using different combinations of losses are provided below:

Comparison of different losses

On the above image:

  (a) unaligned LR image,
  (b) original HR image,
  (c) pixel-wise loss only,
  (d) pixel-wise and feature-wise losses combined,
  (e) pixel-wise, feature-wise, and discriminative losses,
  (f) pixel-wise and face structure losses,
  (g) pixel-wise, feature-wise, and face structure losses,
  (h) pixel-wise, feature-wise, discriminative, and face structure losses.

In training their multi-task upsampling network, the researchers selected the last option (h).

Qualitative and Quantitative Comparisons

See the qualitative comparison of the suggested approach with the state-of-the-art methods:

Comparisons with the state-of-the-art methods: (a) Unaligned LR image; (b) HR image; (c) Bicubic interpolation; (d) VDSR; (e) SRGAN; (f) Ma et al.’s method; (g) CBN; (h) TDAE; (i) Suggested approach

As you can see, most of the existing methods fail to generate realistic face details, while the suggested approach outputs realistic and detailed images, which are very close to the original HR image.

Quantitative comparison with the state-of-the-art methods leads to the same conclusions. All methods were evaluated on the entire test dataset by the average PSNR and structural similarity (SSIM) scores.
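
Both metrics are standard; with scikit-image (0.19+) they can be computed as follows, in a minimal sketch that assumes 8-bit RGB images.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(hr_image, sr_image):
    """PSNR and SSIM between a ground-truth HR image and a
    super-resolved output (uint8 arrays of shape H x W x 3)."""
    psnr = peak_signal_noise_ratio(hr_image, sr_image, data_range=255)
    ssim = structural_similarity(hr_image, sr_image,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```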

Quantitative comparisons on the entire test dataset

The results in the table show that the presented approach outperforms the second best by a large margin of 1.75 dB in PSNR. This confirms that estimating heatmaps helps in localizing facial components and aligning LR faces more accurately.

Bottom Line

Let’s summarize the contributions of this work:

  • It presents a novel multi-task upsampling network that can super-resolve very small LR face images (16 x 16 pixels) by an upscaling factor of 8x.
  • The method not only exploits image intensity similarity but also estimates the face structure with the help of facial component heatmaps.
  • The estimated facial component heatmaps provide not only spatial information of facial components but also their visibility information.
  • Thanks to the aligning of feature maps before heatmap estimation, the number of images required for training the model is largely reduced.

The method is good at super-resolving very low-resolution faces in different poses and generates realistic and detailed images free from distortions and artifacts.

Deforming Autoencoders (DAEs) – Learning Disentangled Representations

21 September 2018

Generative models are drawing a lot of attention within the machine learning research community. These models have practical applications in different domains. Two of the most commonly used and efficient approaches recently are Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).

While vanilla autoencoders can learn to generate compact representations and reconstruct their inputs well, they are quite limited when it comes to practical applications. The fundamental problem of standard autoencoders is that the latent space in which they encode the input data distribution may not be continuous and therefore may not allow smooth interpolation. Variational Autoencoders (VAEs) solve this problem: their latent spaces are, by design, continuous, allowing easy random sampling and interpolation. This has made VAEs very popular for many different tasks, especially in computer vision.

However, controlling and understanding deep neural networks, especially deep autoencoders, is a difficult task, and being able to control what the networks learn is of crucial importance.

Previous works

The problem of feature disentanglement has been explored in the literature for image and video processing and text analysis. Disentangling the factors of variation is necessary for controlling and understanding deep networks, and many attempts have been made to solve this problem.

Past work has explored the separation of latent image representations into dimensions that account for different factors of variation: for example, identity, illumination, and spatial support; low-dimensional transformations such as rotation, translation, or scaling; or more descriptive factors of variation such as age, gender, or wearing glasses.

State-of-the-art idea

Recently, Zhixin Shu et al. introduced Deforming Autoencoders, or DAEs for short – a generative model for images that disentangles shape from appearance in an unsupervised manner. In their paper, the researchers propose to disentangle shape and appearance by assuming that object instances are obtained by deforming a prototypical object, or ‘template’. This means that the object’s variability can be separated into variations associated with spatial transformations linked to the object’s shape, and variations associated with appearance. As simple as the idea sounds, this kind of disentanglement using deep autoencoders and unsupervised learning proved to be quite powerful.

Method

The proposed method disentangles shape and appearance as factors of variation in a learned lower-dimensional latent space. It employs a deep architecture comprising an encoder network that encodes the input image into two latent vectors (one for shape and one for appearance) and two decoder networks that take the latent vectors as input and output the generated texture and deformation, respectively.

The proposed Deforming Autoencoder architecture, comprising one encoder and two decoder networks

Independent decoder networks learn the appearance and deformation functions, and the generated spatial deformation is used to warp the texture to the observed image coordinates. In this way, the Deforming Autoencoder can reconstruct the input image while disentangling shape and appearance into separate features. The whole architecture is trained in an unsupervised manner using only a simple image reconstruction loss.
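
The warping step can be expressed compactly with PyTorch's grid sampling. This is a minimal sketch, assuming the deformation decoder outputs a sampling grid normalized to [-1, 1]; the real model's parametrization may differ.

```python
import torch.nn.functional as F

def warp_texture(texture, deformation_grid):
    """Warp a decoded texture into image coordinates using the decoded
    deformation field. texture: (B, C, H, W); deformation_grid:
    (B, H, W, 2) with sampling locations normalized to [-1, 1]."""
    return F.grid_sample(texture, deformation_grid, align_corners=True)
```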

In addition to Deforming Autoencoders (DAEs), the researchers propose Class-aware Deforming Autoencoders, which learn to reconstruct an image while disentangling the shape and appearance factors of variation conditioned on the class. To make this possible, they introduce a classifier network that takes a third latent vector, used to encode the class alongside the latent vectors for shape and appearance. This architecture allows learning a mixture model conditioned on the class of the input image (rather than a joint multi-modal distribution).

The proposed class-aware Deforming Autoencoder
The proposed class-aware Deforming Autoencoder

They show that introducing class-aware learning drastically improves the performance and stability of training. Intuitively, the network learns to separate the spatial deformations that differ between classes.

The researchers also propose a Deforming Autoencoder that learns to disentangle albedo and shading (a well-known problem in computer vision) from facial images. They call this architecture the Intrinsic Deforming Autoencoder; it is shown in the picture below.

Intrinsic Deforming Autoencoder (Intrinsic DAE)

Results

The results show that the method can successfully disentangle shape and appearance while learning to reconstruct the input image in an unsupervised manner, and that class-aware Deforming Autoencoders provide better results in both reconstruction and appearance learning.

Results of the image reconstruction of MNIST images using Deforming Autoencoder

Besides the qualitative evaluation, the proposed Deforming Autoencoder architecture was evaluated quantitatively with respect to landmark localization accuracy. The method was evaluated on:

  1. unsupervised image alignment/appearance inference;
  2. learning semantically meaningful manifolds for shape and appearance;
  3. unsupervised intrinsic decomposition;
  4. unsupervised landmark detection.
Results of the image reconstruction of MNIST images using Class-aware Deforming Autoencoder
Unsupervised alignment on images of palms of left hands. (a) The input images; (b) reconstructed images; (c) texture images warped with the average of the decoded deformation; (d) the average input image; and (e) the average texture


Smooth interpolation of the latent space representation


Comparison with other state-of-the-art

The proposed method was evaluated on the MAFL test set (mean error on unsupervised landmark detection), where it outperforms the self-supervised approach proposed by Thewlis et al.

Mean error on unsupervised landmark detection on the MAFL test set.

Conclusion

Lighting interpolation with Intrinsic-DAE

As I mentioned previously, being able to disentangle factors of variation can be of crucial importance for many tasks. Disentanglement allows better control and understanding of deep neural network models, and it may be key to solving many problems. This work introduced Deforming Autoencoders, an architecture able to disentangle particular factors of variation – here, shape and appearance. The results show that the method can successfully disentangle these factors using an autoencoder architecture.

Facial Surface and Texture Synthesis via GAN

3 September 2018

Deep networks can be extremely powerful and effective in answering complex questions. But it is also well known that in order to train a really complex model, you need lots and lots of data that closely approximates the complete data distribution.

With the lack of real-world data, many researchers choose data augmentation as a method for extending the size of a given dataset. The idea is to modify the training examples in such a way that keeps their semantic properties intact. That’s not an easy task when dealing with human faces.

The method should account for such complex transformations of data as pose, lighting and non-rigid deformations, yet create realistic samples that follow the real-world data statistics.

So, let’s see how the latest state-of-the-art methods approach this challenging task…

Previous approaches

Generative adversarial networks (GANs) have demonstrated their effectiveness in making synthetic data more realistic. Taking simulated data as input, a GAN produces samples that appear more realistic. However, the semantic properties of these samples might be altered, even with a loss that penalizes changes in the parameters of the output.

The 3D morphable model (3DMM) is the most commonly used method for representing and synthesizing geometries and textures, and it was originally proposed in the context of 3D human faces. In this model, the geometric structure and the texture of human faces are linearly approximated as a combination of principal vectors.
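
As a minimal illustration of such a linear model, sampling a random face geometry could look like this. The array shapes and the Gaussian coefficient model are the standard 3DMM assumptions, not this paper's specifics.

```python
import numpy as np

def sample_3dmm(mean_shape, shape_basis, sigma, rng=None):
    """Draw a random face geometry from a linear 3DMM: the mean shape
    plus a Gaussian-weighted combination of principal components.
    mean_shape: (3N,); shape_basis: (3N, K); sigma: (K,) std devs."""
    rng = rng or np.random.default_rng()
    coeffs = rng.standard_normal(len(sigma)) * sigma   # Gaussian coefficients
    return mean_shape + shape_basis @ coeffs
```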

Recently, the 3DMM was combined with convolutional neural networks for data augmentation. However, the generated samples tend to be smooth and unrealistic in appearance, as you can observe in the figure below.

Faces synthesized using the 3DMM linear model

Moreover, 3DMM generates samples following a Gaussian distribution, which rarely reflects the true distribution of the data. For instance, see below the first two PCA coefficients plotted for real faces vs the synthesized 3DMM faces. This gap between the real and synthesized distributions may easily result in non-plausible samples.

First two PCA coefficients of real (left) and 3DMM generated (right) faces

State-of-the-art idea

Slossberg, Shamai, and Kimmel from the Technion – Israel Institute of Technology propose a new realistic data synthesis approach for human faces that combines a GAN with the 3DMM.

In particular, the researchers employ a GAN to imitate the space of parametrized human textures and generate corresponding facial geometries by learning the best 3DMM coefficients for each texture. The generated textures are mapped back onto the corresponding geometries to obtain new generated high-resolution 3D faces.

This approach produces realistic samples, and it:

  • doesn’t suffer from indirect control over such desired attributes as pose and lighting;
  • is not limited to producing new instances of existing individuals.

Let’s have a closer look at their data processing pipeline…

Data processing pipeline

The process includes aligning 3D scans of human faces vertex to vertex and mapping their textures onto a 2D plane using a predefined universal transformation.

Data preparation pipeline

The data preparation pipeline contains four main stages:

  • Data acquisition: the researchers collected about 5000 scans from a wide variety of ethnic, gender, and age groups; each subject was asked to perform five distinct expressions including a neutral one.
  • Landmark annotation: 43 landmarks were added to the meshes semi-automatically by rendering the face and using a pre-trained facial landmark detector on the 2D images.
  • Mesh alignment: this was conducted by deforming a template face mesh according to the geometric structure of each scan, guided by the previously obtained facial landmark points.
  • Texture transfer: the texture is transferred from the scan to the template using a ray casting technique built into the animation rendering toolbox of Blender; then, the texture is mapped from the template to a 2D plane using the predefined universal mapping.

See the resulting mapped textures below:

Flattened aligned facial textures

The next step is to train a GAN to learn and imitate these aligned facial textures. For this purpose, the researchers use a progressively growing GAN with the generator and discriminator constructed as symmetric networks. In this implementation, the generator progressively increases the resolution of the feature maps until reaching the output image size, while the discriminator gradually reduces the size back to a single output.

See below the new synthetic facial textures generated by the aforementioned GAN:

Facial textures synthesized by GAN

The final step is to synthesize the geometries of the faces. The researchers explored several approaches to finding plausible geometry coefficients for a given texture. You can observe the qualitative and quantitative (L2 geometric error) comparison between the various methods in the next figure:

Two synthesized textures mapped onto different geometries

The least squares approach produces the lowest-distortion results; considering also its simplicity, it was chosen for all subsequent experiments.
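A minimal NumPy sketch of that least squares fit, under the assumption that each training face is given as a flattened texture feature vector paired with its 3DMM geometry coefficients (all sizes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(size=(5000, 512))   # texture features, one row per training scan
G = rng.normal(size=(5000, 200))   # matching 3DMM geometry coefficients

# Fit a linear map W minimizing ||T @ W - G||^2 over the training pairs
W, *_ = np.linalg.lstsq(T, G, rcond=None)

t_new = rng.normal(size=(1, 512))  # feature vector of a newly generated texture
g_new = t_new @ W                  # plausible geometry coefficients for it
```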

Experimental results

The proposed method can generate many new identities, and each one of them can be rendered under varying pose, expression, and lighting. Different expressions are added to the neutral geometry using the Blend Shapes model. The resulting images with different pose and lighting are shown below:
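The Blend Shapes idea itself is just a weighted sum of per-expression offsets on top of the neutral geometry; a toy NumPy sketch with random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(1)
neutral = rng.normal(size=(30000, 3))            # neutral mesh vertices
deltas = rng.normal(size=(4, 30000, 3)) * 0.01   # offsets for 4 expressions

weights = np.array([0.7, 0.0, 0.3, 0.0])         # mix of two expressions
expressive = neutral + np.tensordot(weights, deltas, axes=1)
```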

Identities generated by the proposed method with different pose and lighting

For quantitative evaluation of the results, the researchers used the sliced Wasserstein distance (SWD) to measure distances between the distributions of their training and generated images at different scales:

The table demonstrates that the textures generated by the proposed model are statistically closer to the real data than those generated by 3DMM.
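For intuition, here is a minimal NumPy version of SWD (the multi-scale patch extraction used in the paper's evaluation is omitted, and both sets are assumed to contain the same number of samples):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, seed=0):
    """Average 1D Wasserstein distance over random projections."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_projections, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    px = np.sort(X @ dirs.T, axis=0)   # sorted 1D projections of X
    py = np.sort(Y @ dirs.T, axis=0)   # sorted 1D projections of Y
    return np.abs(px - py).mean()      # L1 between matched quantiles
```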

The next experiment was designed to evaluate if the proposed model is capable of generating samples that diverge significantly from the original training set and resemble previously unseen data. Thus, 5% of the identities were held out for evaluation. The researchers measured the L2 distance between each real identity from the test set to the closest identity generated by the GAN, as well as to the closest real identity from the training set.
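That measurement is straightforward to reproduce in outline; a sketch with random placeholder identity descriptors (real descriptors would come from the aligned textures or an embedding network):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
test = rng.normal(size=(250, 128))    # 5% held-out identity descriptors
train = rng.normal(size=(4750, 128))  # training identity descriptors
fake = rng.normal(size=(5000, 128))   # GAN-generated identity descriptors

test_to_fake = cdist(test, fake).min(axis=1)    # nearest generated identity
test_to_train = cdist(test, train).min(axis=1)  # nearest training identity
print(test_to_fake.mean(), test_to_train.mean())
```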

The distance between the generated and real identities

As can be seen from the figure, the test identities are closer to the generated identities than to the training identities. Moreover, the “Test to fake” distances are not significantly larger than the “Fake to real” distances. This implies that the generator does not simply reproduce IDs that are very close to the training set but also creates novel IDs that resemble previously unseen examples.

Finally, a qualitative evaluation was performed to check if the proposed pipeline is able to generate original data samples. Thus, facial textures generated by the model were compared to their closest real neighbors in terms of L2 norm between identity descriptors.

Synthesized facial textures (top) vs. corresponding closest real neighbors (bottom)

As you can see, the nearest real textures are far enough to be visually distinguished as different people, which confirms the model’s ability to produce novel identities.

Bottom Line

The suggested model is probably the first to realistically synthesize both the texture and geometry of human faces. It can be useful for training face detection, face recognition, or face reconstruction models. In addition, it can be applied wherever many different realistic faces are needed, for instance in the film industry or computer games. Moreover, the framework is not limited to synthesizing human faces and can be employed for other classes of objects where alignment of the data is possible.

DeepWrinkles: Accurate and Realistic Clothing Modeling

28 August 2018

DeepWrinkles: Accurate and Realistic Clothing Modeling

Realistic garment reconstruction is notoriously a complex problem and its importance is undeniable in many research work and applications, such as accurate body shape and pose estimation in the wild…

Realistic garment reconstruction is notoriously a complex problem, and its importance is undeniable in many research areas and applications, such as accurate body shape and pose estimation in the wild (i.e., from observations of clothed humans), realistic AR/VR experiences, movies, video games, virtual try-on, etc. For the past decades, physics-based simulations have been setting the standard in the movie and video game industries, even though they require hours of labor by experts.

Facebook Research presents a novel approach called DeepWrinkles to generate accurate and realistic clothing deformation from real data capture. It consists of two complementary modules:

  • A statistical model is learned from 3D scans of clothed people in motion, from which clothing templates are precisely non-rigidly aligned.
  • Fine geometric details are added to normal maps generated using a conditional adversarial network whose architecture is designed to enforce realism and temporal consistency.

The goal is to recover all observable geometric details. Assuming the finest details are captured at sensor image pixel resolution and are reconstructed in 3D, all existing geometric details can then be encoded in a normal map of the 3D scan surface at a lower resolution, as shown in the figure below.

clothes modelling
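To make the normal-map encoding concrete, here is a toy NumPy sketch that bakes per-vertex unit normals into a texture-space image by remapping them from [-1, 1] to [0, 255] (real pipelines rasterize per pixel rather than per vertex):

```python
import numpy as np

def bake_normal_map(normals, uvs, size=256):
    """Write per-vertex normals into the 2D texture plane at their UVs."""
    image = np.zeros((size, size, 3), dtype=np.uint8)
    pixels = (uvs * (size - 1)).astype(int)       # UVs in [0, 1] -> pixel grid
    image[pixels[:, 1], pixels[:, 0]] = ((normals + 1) * 127.5).astype(np.uint8)
    return image

rng = np.random.default_rng(3)
normals = rng.normal(size=(5000, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)  # unit normals
normal_map = bake_normal_map(normals, rng.uniform(size=(5000, 2)))
```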

Cloth deformation is modeled by learning a linear subspace model that factors out body pose and shape. In contrast to previous work, however, this model is learned from real data capture.

The strategy ensures deformations are represented compactly and with high realism. First, robust template-based non-rigid registrations are computed from a 4D scan sequence; then a clothing deformation statistical model is derived; finally, a regression model is learned for pose retargeting.

Data Preparation

Data capture: for each type of clothing, 4D scan sequences are captured at 60 fps (e.g., 10.8k frames for 3 min) of a subject in motion, dressed in a full-body suit with one piece of clothing with colored boundaries on top. Each frame consists of a 3D surface mesh with around 200k vertices, yielding very detailed folds on the surface, though partially corrupted by holes and noise. In addition, capturing only one garment prevents the occlusions that occur where clothing normally overlaps (e.g., waistbands) and allows items of clothing to be freely combined with each other.

Registration: the clothing template T is defined by choosing a subset of the human template with consistent topology. T should contain enough vertices to model deformations (e.g., 5k vertices for a T-shirt). The clothing template is then registered to the 4D scan sequence using a variant of non-rigid ICP based on grid deformation.

Statistical model

The statistical model is computed using linear subspace decomposition by PCA. Poses of all n registered meshes are factored out of the model by pose-normalization using inverse skinning. Each registration R_i can then be represented by a mean shape plus vertex offsets o_i, such that R_i = M + o_i, where the mean shape M ∈ R^(3×v) is obtained by averaging vertex positions. Applying PCA to the offsets yields k principal blend shapes B_1, …, B_k, so each R_i can be compactly represented by a linear blend shape function B(β_i) = M + Σ_j β_{i,j} B_j, where β_i are the shape parameters of mesh i.
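A NumPy sketch of that decomposition, with illustrative sizes (1000 registrations of a 5k-vertex template, k = 50 components):

```python
import numpy as np

rng = np.random.default_rng(4)
R = rng.normal(size=(1000, 5000 * 3))   # pose-normalized registrations, flattened
M = R.mean(axis=0)                      # mean shape
offsets = R - M                         # o_i = R_i - M

# PCA via SVD: rows of Vt are the principal blend shapes B_k
U, S, Vt = np.linalg.svd(offsets, full_matrices=False)
k = 50
basis = Vt[:k]
betas = offsets @ basis.T               # shape parameters beta_i per mesh

R0_approx = M + betas[0] @ basis        # compact reconstruction of R_0
```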

Pose-to-shape prediction

A predictive model f is learned that takes joint poses as input and outputs a set of k shape parameters. This enables powerful applications where deformations are induced directly by the pose. To take into account the deformation dynamics that occur during human motion, the model is also trained with pose velocity, acceleration, and shape parameter history.
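A hedged sketch of such a regressor: the paper does not fix an architecture here, so the MLP below, the axis-angle pose size, and the finite-difference dynamics features are assumptions (shape parameter history is omitted for brevity):

```python
import torch
import torch.nn as nn

n_joints, k = 24 * 3, 50                      # axis-angle pose dims, k parameters

f = nn.Sequential(                            # pose + velocity + acceleration -> betas
    nn.Linear(3 * n_joints, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, k),
)

poses = torch.randn(100, n_joints)            # a 100-frame pose sequence
vel = torch.diff(poses, dim=0, prepend=poses[:1])   # per-frame velocity
acc = torch.diff(vel, dim=0, prepend=vel[:1])       # per-frame acceleration
betas = f(torch.cat([poses, vel, acc], dim=1))      # (100, k) shape parameters
```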

Outline of DeepWrinkles

Architecture

neural network design clothing modelling

As stated above, the goal is to recover all observable geometric details, encoded in a normal map of the 3D scan surface. To automatically add these fine details on the fly to reconstructed clothing, a generative adversarial network that operates on the normal maps is proposed.

The proposed network is based on a conditional Generative Adversarial Network (cGAN) inspired by image-to-image translation. A convolution-batchnorm-ReLU structure with a U-Net is used in the generator, since its skip connections transfer information across the network layers and allow the overall structure of the image to be preserved. Temporal consistency is achieved by extending the network's L1 loss term: for compelling animations, it is not only important that each frame looks realistic, but also that no sudden jumps occur in the rendering. To ensure a smooth transition between consecutively generated images across time, an additional loss term L(temp) is added to the GAN objective that penalizes discrepancies between images generated at time t and expected images (from the training dataset) at time t – 1:

L(G, D) = L(cGAN)(G, D) + λ(data) · L(data)(G) + λ(temp) · L(temp)(G)

where L(data) helps to generate images close to the ground truth in an L1 sense (for less blurring). The temporal consistency term L(temp) is meant to capture global fold movements over the surface.

The cGAN is trained on a dataset of 9213 consecutive frames: the first 8000 images compose the training set, the next 1000 the test set, and the remaining 213 the validation set. Test and validation sets contain poses and movements not seen in the training set. The U-Net auto-encoder is constructed with 2 x 8 layers and 64 filters in each of the first convolutional layers. The discriminator uses patches of size 70 x 70. The L(data) weight is set to 100, the L(temp) weight to 50, and the GAN weight to 1. The images have a resolution of 256 x 256, although early experiments also showed promising results at 512 x 512.
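A minimal PyTorch sketch of the extended generator objective described above, using the stated weights (the adversarial cGAN term is omitted; fake_t is the generated normal map at time t, real_prev the ground-truth map at t – 1):

```python
import torch.nn.functional as F

def generator_loss(fake_t, real_t, real_prev, w_data=100.0, w_temp=50.0):
    l_data = F.l1_loss(fake_t, real_t)     # L(data): stay close to ground truth
    l_temp = F.l1_loss(fake_t, real_prev)  # L(temp): smooth transitions in time
    return w_data * l_data + w_temp * l_temp
```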

Result

DeepWrinkles is an entirely data-driven framework to capture and reconstruct clothing in motion from 4D scan sequences. The evaluations show that high-frequency details can be added to low-resolution normal maps using a conditional adversarial neural network. A temporal loss is also introduced to the GAN objective to preserve geometric consistency across time, and qualitative and quantitative evaluations are presented on different datasets.

Results
a) Physics-based simulation, b) Subspace (50 coefficients), c) Registration, d) DeepWrinkles, e) 3D scan (ground truth)