3D Hair Reconstruction Out of In-the-Wild Videos

22 October 2018
hair reconstruction from video

3D hair reconstruction is a problem with numerous applications in areas such as Virtual Reality, Augmented Reality, video games, and medical software. As a non-trivial problem, it has attracted various proposed solutions over the years, some more successful than others. Generating a realistic 3D hair model is challenging even in controlled, relatively sterile environments, so generating a 3D hair model in the wild, from ordinary photos or videos, is an even harder task.

Previous works

Recently, we wrote about an approach for realistic 3D hair reconstruction from a single image. Such methods work reasonably well but fail to produce high-fidelity 3D hair models due to the limitations and ambiguity of the single-view setting. Other approaches use multiple images or views and yield improved results at the cost of a more complex pipeline: they typically require controlled environments with many images covering 360-degree views of the person.

Additionally, some approaches require extra input such as a hair segmentation, making the whole 3D hair reconstruction process more cumbersome.

State-of-the-art idea

A new approach proposed by researchers from the University of Washington takes an in-the-wild video and automatically outputs a full head model with a 3D hair-strand model. The input is a video whose frames are processed by several components to produce hair strands that are estimated and deformed directly in 3D (rather than in 2D, as in prior work), enabling superior results.

Method

The method is composed of four components, shown in the illustration below:

A: Module which uses structure from motion to get relative camera poses, depth maps and a visual hull shape with view-confidence values.

B: Module in which hair segmentation and gradient direction networks are trained and applied to each frame to obtain 2D strands.

C: The segmentations from module B are used to recover the texture of the face area, and a 3D face morphable model is used to estimate face and bald head shapes.

D: The last module, and the core of the algorithm, where the depth maps and 2D strands are used to create 3D strands. These 3D strands are used to query a hair database, and the strands of the best match are refined both globally and locally to fit the input video frames.

In this way, a robust and flexible method is obtained which can successfully recover 3D hair strands from in-the-wild video frames.

The proposed method’s architecture showing the four components

Module A: The first module is used to obtain a rough head shape. Each frame of the video is pre-processed using semantic segmentation to separate the person from the background. The goal is to estimate a camera pose per frame and to create a rough initial structure from all the frames.

First, after pre-processing and background removal, the moving head is reconstructed across all frames using a structure-from-motion approach: camera poses are estimated per frame, together with per-frame depth maps. The output of this module is a rough visual hull shape.

Module B: The second module contains trained hair segmentation and hair direction classifiers that label hair pixels and predict hair direction in each video frame, inspired by the strand direction estimation method of Chai et al. 2016.

Hair segmentation, directional labels and 2D hair strands of example video frames

Module C: In this module, the segmented frames are used to select the frame closest to a frontal face (yaw and pitch approximately 0), which is fed to a morphable-model-based face estimator.

Module D: The last and core module estimates 3D hair strands using the outputs of modules A, B, and C. Since each frame has an estimate of 2D strands, initial 3D strands are obtained by projecting the 2D strands to their estimated depths. Because these initial strands are incomplete, they are used to query a database of 3D hair models. The researchers use the hair dataset created by Chai et al. 2016, which contains 35,000 different hairstyles, each consisting of more than 10,000 hair strands. Finally, a global and a local deformation are applied to refine the obtained 3D hair strands.
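
The lifting of 2D strands to initial 3D strands can be illustrated with a minimal sketch. This is not the paper's code; it assumes a pinhole camera with intrinsics K, a per-frame depth map (as produced by module A), and strands given as pixel polylines.

```python
# Hypothetical sketch of the "2D strands -> initial 3D strands" step of module D.
# Assumptions (not from the paper): pinhole intrinsics K, a valid per-frame depth map.
import numpy as np

def lift_strand_to_3d(strand_px, depth_map, K):
    """Back-project a 2D strand (list of (u, v) pixels) to camera-space 3D points."""
    K_inv = np.linalg.inv(K)
    points_3d = []
    for u, v in strand_px:
        z = depth_map[int(v), int(u)]          # depth estimated by structure from motion
        if z <= 0:                             # skip pixels without a valid depth
            continue
        ray = K_inv @ np.array([u, v, 1.0])    # viewing ray in camera coordinates
        points_3d.append(ray * z)              # scale the ray by the depth value
    return np.array(points_3d)

# Example with dummy data
K = np.array([[800.0, 0, 128], [0, 800.0, 128], [0, 0, 1]])
depth = np.full((256, 256), 1.5)
strand = [(100, 40), (101, 45), (103, 50)]
print(lift_strand_to_3d(strand, depth, K).shape)   # (3, 3)
```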

The local and global transformation applied to the 3D hair strands

Results

To evaluate the proposed approach, the researchers use quantitative and qualitative evaluation as well as a human preference study. The quantitative comparison is made by projecting the reconstructed hair as lines onto the images and computing the intersection-over-union (IoU) with the ground-truth hair mask per frame. The results are shown in the table below; a larger IoU means that the reconstructed hair approximates the input better.

This figure shows the results compared to the state-of-the-art methods
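
The IoU metric used in this comparison is straightforward to compute once the reconstructed strands have been rendered into a binary mask; the rendering itself is assumed to happen elsewhere. A minimal sketch:

```python
# Minimal sketch of the per-frame IoU between a rendered hair mask and the
# ground-truth hair segmentation.
import numpy as np

def hair_iou(rendered_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    rendered = rendered_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(rendered, gt).sum()
    union = np.logical_or(rendered, gt).sum()
    return float(intersection) / float(union) if union > 0 else 1.0

# Average IoU over all frames of a video:
# ious = [hair_iou(r, g) for r, g in zip(rendered_masks, gt_masks)]
# print(np.mean(ious))
```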

The approach was evaluated qualitatively against several state-of-the-art methods. Moreover, human preference tests were run on Amazon Mechanical Turk, and the results are shown in the tables below.

This figure shows four example frames comparing the silhouettes of the reconstructed hairstyles to the hair segmentation results.
The ratio of preference of the methods' results over total, compared to Hu et al. 2017, based on Amazon Mechanical Turk tests.
The ratio of preference of the methods' results over total, compared to Zhang et al. 2017, based on Amazon Mechanical Turk tests.

Conclusion

In this paper, researchers from the University of Washington proposed a fully automatic way to reconstruct 3D hair from in-the-wild videos, with a wide variety of potential applications. Although the method is quite complex and involves many steps, the results are more than satisfactory. The approach shows that higher fidelity can be obtained by incorporating information from multiple video frames with slightly different views. The proposed system exploits this to reconstruct a 3D hair model without being restricted to specific views or head poses.

Head Reconstruction from Internet Photos

15 October 2018
head reconstruction internet photos

Methods that reconstruct 3D models of people’s heads from images need to account for varying 3D pose, lighting, non-rigid changes due to expressions, relatively smooth surfaces of faces, ears, and neck, and finally, the hair. Great reconstructions can be achieved nowadays in case the input photos are captured in a calibrated lab setting or semi-calibrated setup where the person has to participate in the capturing session (see related work).

Reconstructing from Internet photos, however, is an open problem due to the high degree of variability across uncalibrated images. Lighting, pose, cameras and resolution change dramatically across photos. In recent years, reconstruction of faces from the Internet has received a lot of attention. All face-focused methods, however, mask out the head using a fixed face mask and focus only on the face area.

Previous Works

Calibrated head modeling has achieved amazing results over the last decade. Reconstruction of people from Internet photos recently achieved good results.

  • Shlizerman et al. showed that it is possible to reconstruct a face from a single Internet photo using a template model of a different person. One way to approach the uncalibrated head reconstruction problem is to use the morphable model approach.
  • Hsieh et al. showed that with morphable models the face is fitted to a linear space of 200 face scans, and the head is reconstructed from the linear space as well. In practice, morphable model methods work well for face tracking.
  • Adobe Research proved that hair modeling could be done from a single photo by fitting to a database of synthetic hairs or by fitting helices.

State-of-the-art idea

This idea addresses a new direction: head reconstruction directly from Internet data. Given a photo collection obtained by searching for photos of a specific person on Google image search, the task is to reconstruct a 3D model of that person's head (not only the face area). With only one or two photos per view, the problem is very challenging due to lighting inconsistency across views, difficulty in segmenting the face profile from the background, and challenges in merging the images across views. The key idea is that with many more (hundreds of) photos per 3D view, these problems can be overcome. For celebrities, one can quickly acquire such collections from the Internet; for others, such photos can be extracted from Facebook or mobile photo collections.

The method works as follows: a person’s photo collection is divided into clusters of approximately the same azimuth angle of the 3D pose. Given the clusters, a depth map of the frontal face is reconstructed, and the method gradually grows the reconstruction by estimating surface normals per view cluster and then constraining using boundary conditions coming from neighboring views. The final result is a head mesh of the person that combines all the views.

Figure 2

The given photos are divided into view clusters V_i; photos in the same cluster have approximately the same 3D pose and azimuth angle. Seven clusters are used, with azimuths i = 0, −30, 30, −60, 60, −90, 90 degrees. Figure 2 shows the averages of each cluster after rigid alignment using fiducial points (1st row) and after subsequent alignment using the Collection Flow method (2nd row), which computes optical flow from each cluster photo to the cluster average.

Head Mesh Initialization

The goal is to reconstruct the head mesh M. The algorithm starts by estimating a depth map and surface normals for the frontal cluster V0 and assigning each reconstructed pixel to a vertex of the mesh. It proceeds as follows:

  • Dense 2D alignment: Photos are first rigidly aligned using 2D fiducial points. The head region, including neck and shoulders, is segmented in each image using semantic segmentation. Collection Flow is then run on all the photos in V0 to densely align them with the average photo of the set. Challenging photos do not derail the method: as long as the majority of the images are segmented well, Collection Flow corrects for inconsistencies. Collection Flow also helps overcome differences in hairstyle by warping all the photos toward the dominant style.
  • Surface normals estimation: A template face mask is used to find the face region in all the photos. Photometric stereo (PS) is then applied to the face region of the flow-aligned photos. The face regions are arranged in an n × p_k matrix Q, where n is the number of photos and p_k is the number of face pixels determined by the template facial mask. Rank-4 PCA is computed to factorize Q into lighting and normals: Q = LN. After estimating the lighting L for each photo, N is computed for all p head pixels, including the ear, chin, and hair regions. Two key components make PS work on uncalibrated head photos:
    1. Resolving the Generalized Bas-Relief (GBR) ambiguity using a template 3D face of a different individual.
    2. Using a per-pixel surface normal estimation, where each point uses a different subset of photos to estimate the normal.
  • Depth map estimation: The surface normals are integrated to create a depth map D_0 by solving a linear system of equations that satisfies the gradient constraints dz/dx = −n_x/n_z and dz/dy = −n_y/n_z, where (n_x, n_y, n_z) are the components of the surface normal at each point. Combining these constraints over all head pixels yields a sparse linear system over the depth values, which is solved in the least-squares sense (a minimal sketch of this integration is shown after this list).
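
The following is a minimal sketch of such a depth-from-normals integration, not the paper's implementation. It assumes a per-pixel normal map of shape H x W x 3 and a binary mask of head pixels, builds the two gradient constraints per pixel as a sparse system, and solves it with a least-squares solver; the boundary constraints used later for side views are omitted here.

```python
# Hypothetical sketch: integrate surface normals into a depth map via sparse
# least squares, using the constraints dz/dx = -nx/nz and dz/dy = -ny/nz.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def integrate_normals(normals, mask):
    H, W, _ = normals.shape
    idx = -np.ones((H, W), dtype=int)
    ys, xs = np.where(mask)
    idx[ys, xs] = np.arange(len(ys))             # linear index of each valid pixel
    A = lil_matrix((2 * len(ys), len(ys)))
    b = []
    r = 0
    for y, x in zip(ys, xs):
        nx, ny, nz = normals[y, x]
        if nz == 0:
            continue
        if x + 1 < W and idx[y, x + 1] >= 0:     # horizontal constraint: dz/dx = -nx/nz
            A[r, idx[y, x + 1]] = 1.0
            A[r, idx[y, x]] = -1.0
            b.append(-nx / nz)
            r += 1
        if y + 1 < H and idx[y + 1, x] >= 0:     # vertical constraint: dz/dy = -ny/nz
            A[r, idx[y + 1, x]] = 1.0
            A[r, idx[y, x]] = -1.0
            b.append(-ny / nz)
            r += 1
    z = lsqr(A.tocsr()[:r], np.array(b))[0]      # least-squares depth values
    depth = np.zeros((H, W))
    depth[ys, xs] = z
    return depth
```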

Boundary-Value Growing

To complete the side views of the mesh, boundary-value growing is introduced. Starting from the frontal-view mesh V0, more regions of the head are gradually completed, in the order V30, V60, V90 and V−30, V−60, V−90, using two additional key constraints.

  • Ambiguity recovery: Rather than recovering the ambiguity A that arises from Q = LA^(−1)AN using the template model, the already-computed neighboring cluster is used: for V±30 the normals N_0 are used, for V±60 the normals N±30, and for V±90 the normals N±60. Specifically, the out-of-plane pose is estimated from the initial 3D mesh of V0 to the average image of pose cluster V30.
  • Depth constraint: In addition to the gradient constraints, boundary constraints are added. Let Ω0 be the boundary of D′0. Then the part of Ω0 that intersects the mask of D30 is required to have the same depth values: D30(Ω0) = D′0(Ω0). The optimization then combines the gradient constraints with these boundary equalities in a single linear least-squares system.

After each depth stage reconstruction (0,30,60,.. degrees), the estimated depth is projected to the head mesh. By this process, the head is gradually filled in by gathering vertices from all the views.

Result

The figure below shows the per-view reconstructions that are later combined into a single mesh. For example, the ear is reconstructed well in the 90 and −90 degree views, while the other views cannot reconstruct it.

Individual reconstructions per view cluster, with depth and ambiguity constraints

Figure 5 shows how the two key constraints help in the 90-degree view reconstruction. Without the correct reference normals and the depth constraint, the reconstructed shape is flat and the profile facial region is blurred, which makes it difficult to align the result back to the frontal view.

Figure 5. Comparison between without and with two key constraints

The left two shapes show two views of the 90-degree shape reconstructed independently, without the two key constraints. The right two shapes show two views of the result with the two key constraints. Figure 6 shows the reconstruction results for 4 subjects; each mesh is rotated to five different perspectives.


Fig. 6: Final reconstructed mesh rotated to 5 views to show the reconstruction from all sides. Each color image is one example from the roughly 1,000-photo collection of each person.

Comparison with other models

In Figure 6, a comparison is shown with the software FaceGen, which implements a morphable model approach.


Figure 6

For a quantitative comparison, the reprojection error of the shapes produced by three methods (the proposed approach, Space Carving, and FaceGen) is computed for each person over 600 photos with different poses and lighting variations. The 3D shape comes from each reconstruction method.

Comparison with the Space Carving method

The average reprojection error is shown in the table below.

Reprojection error from 3 reconstruction methods

The error map of an example image is shown in Figure 7. Notice that the shapes from FaceGen and Space Carving might look good from the frontal view, but they are not correct when rotating to the target view. See how different the ear part is in the figure.

Figure 7: Visualisation of the re-projection error for 3 methods

Conclusion

This approach shows that it is possible to reconstruct a head from Internet photos. However, it has a number of limitations. First, it assumes a Lambertian model for surface reflectance; while this works well, accounting for specularities should improve results. Second, fiducials for side views were labeled manually. Third, the model is not complete: the top of the head is missing. Solving this requires adding photos with different elevation angles rather than focusing only on azimuth changes.

Learning 3D Face Morphable Model Out of 2D Images

5 September 2018
3D morphable model out of single image

The 3D Morphable Model (3DMM) is a statistical model of 3D facial shape and texture. 3D Morphable Models have various applications in many fields including computer vision, computer graphics, human behavioral analysis, craniofacial surgery.

In essence, 3D Morphable Models are used to model facial shape and texture, and modeling human faces is far from trivial: different identities, highly variable face shapes, and varying poses make it challenging. In this context, a 3D Morphable Model learns a model of facial shape and texture in a space with explicit correspondences. First, there has to be a point-to-point correspondence between the reconstruction and all other models, enabling morphing; second, the model has to capture the underlying transformations between types of faces (male to female, neutral to smile, etc.).


Researchers from Michigan State University propose a novel Deep Learning-based approach to learning a 3D Morphable Model. Exploiting the power of Deep Neural Networks to learn non-linear mappings, they suggest a method for learning a 3D Morphable Model from just in-the-wild 2D images (images not taken in a controlled environment such as a lab).

Previous Approaches

A conventional 3DMM is learned from a set of 3D face scans with associated, well-controlled 2D face images. Traditionally, a 3DMM is learned with supervision by performing dimension reduction, typically Principal Component Analysis (PCA), on a training set of co-captured 3D face scans and 2D images. With a linear model such as PCA, non-linear transformations and facial variations cannot be captured. Moreover, large amounts of high-quality 3D data are needed to model highly variable 3D face shapes.
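
For context, the conventional linear 3DMM amounts to PCA over co-registered face scans: any face is a mean shape plus a linear combination of principal vectors. A minimal sketch, assuming `scans` is an (n, 3v) matrix of n scans with v vertices already in dense correspondence:

```python
# Minimal sketch of a linear (PCA-based) 3DMM, not any specific published model.
import numpy as np

def fit_linear_3dmm(scans: np.ndarray, n_components: int = 50):
    mean_shape = scans.mean(axis=0)
    centered = scans - mean_shape
    # Principal components of shape variation are the rows of Vt
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:n_components]
    return mean_shape, basis

def reconstruct(mean_shape, basis, coeffs):
    # Any face in the model is the mean plus a linear combination of principal vectors
    return mean_shape + coeffs @ basis

# mean, basis = fit_linear_3dmm(scans)
# face = reconstruct(mean, basis, np.random.randn(basis.shape[0]) * 0.1)
```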

State of the art idea

The idea of the proposed approach is to leverage Deep Neural Networks, specifically Convolutional Neural Networks (which are better suited for the task and less expensive than multilayer perceptrons), to learn the 3D Morphable Model: an encoder network takes a face image as input and generates the shape and albedo parameters, from which two decoders estimate the shape and the albedo.

Method

As mentioned before, a linear 3DMM has several problems: it needs 3D face scans for supervised learning, it cannot leverage massive amounts of in-the-wild face images, and its representation power is limited by the linear model (PCA). The proposed method learns a nonlinear 3DMM using only large-scale in-the-wild 2D face images.

UV Space Representation

In their method, the researchers use an unwrapped 2D texture (where each 3D vertex v is projected onto the UV space) as the representation for both shape and albedo. They argue that keeping spatial information is important because they employ convolutional networks, and frontal face images contain little information about the two sides of the head; their choice therefore falls on the UV-space representation.

Three albedo representations. (a) Albedo value per vertex, (b) Albedo as a 2D frontal face, (c) UV space 2D unwarped albedo.
UV space shape representation. x, y, z, and a combined shape representation.

Network architecture

They designed an architecture that, given an input image, encodes it into shape, albedo, and lighting parameters (vectors). The latent vectors for albedo and shape are decoded by two different decoder networks (again convolutional neural networks) to obtain the face skin reflectance image (for the albedo) and the 3D face mesh (for the shape). A differentiable rendering layer then generates the reconstructed face by fusing the 3D face, albedo, lighting, and the camera projection parameters estimated by the encoder. The whole architecture is presented in the figure below.

The proposed method's architecture for learning a non-linear 3DMM

The presented robust learning of a non-linear 3D Morphable Model is applied to the 2D face alignment and 3D face reconstruction problems. Since it is a model-learning method, it can also serve many other applications.

The proposed rendering layer

Comparison with other methods

The method was evaluated against other methods on the following tasks: 2D Face Alignment, 3D Face Reconstruction and Face Editing. The suggested technique outperforms other state-of-the-art methods on these tasks. Some of the results of the evaluation are presented below.

2D Face Alignment

One of the critical applications of this kind of approach can become face alignment. Alignment naturally should improve facial analysis in a range of tasks (for example face recognition). However, alignment is not a straightforward task, and this method proves successful in face alignment.

2D face alignment results. Invisible landmarks are marked as red. The technique can well handle extreme pose, lighting, and expression

3D Face Reconstruction

The approach was also evaluated on another task: 3D Face Reconstruction, yielding outstanding results compared to other methods.

Quantitative evaluation of the 3D reconstruction
3D reconstruction results comparison to Sela et al. The proposed method handles facial hair and occlusions far better than this method
3D reconstruction results comparison to VRN by Jackson et al. on the popular CelebA dataset
3D reconstruction results comparison to Tewari et al. This result shows that the proposed method overcomes the problem of face shrinking when dealing with a different texture (like facial hair)

Face Editing

A method that learns a model and decomposes a face image into individual components allows the image to be modified and the face to be edited by manipulating different elements. The method was also evaluated on face editing tasks such as relighting and attribute manipulation.

Growing mustache editing results. The first column shows original images, the following columns show edited images with increasing magnitudes.
Comparing to Shu et al. results (last row), the proposed method produces more realistic images, and the identity is better preserved.

Conclusions

In conclusion, the proposed method will have a potentially high impact since it improves the way of learning a 3D Morphable Model. This kind of model has been widely adopted in the past since its introduction, but there was not an efficient, robust way of learning this model from in-the-wild data.

The proposed approach exploits the power of deep neural networks as very good function approximators to robustly model the highly variable human face. This unusual path to learning a 3DMM allows different manipulations and many applications, some of which are presented in the paper, with many others expected.

Facial Surface and Texture Synthesis via GAN

3 September 2018
face texture synthesis

Deep networks can be extremely powerful and effective in answering complex questions. But it is also well-known that in order to train a really complex model, you’ll need lots and lots of data, which closely approximates the complete data distribution.

With the lack of real-world data, many researchers choose data augmentation as a method for extending the size of a given dataset. The idea is to modify the training examples in such a way that keeps their semantic properties intact. That’s not an easy task when dealing with human faces.

The method should account for such complex transformations of data as pose, lighting and non-rigid deformations, yet create realistic samples that follow the real-world data statistics.

So, let’s see how the latest state-of-the-art methods approach this challenging task…

Previous approaches

Generative adversarial networks (GANs) have demonstrated their effectiveness in making synthetic data more realistic. Taking the simulated data as input, GAN produces samples that appear more realistic. However, the semantic properties of these samples might be altered, even with a loss penalizing the change in the parameters of the output.

The 3D morphable model (3DMM) is the most commonly used method for representing and synthesizing geometries and textures, and it was originally proposed in the context of 3D human faces. Under this model, the geometric structure and the texture of human faces are linearly approximated as a combination of principal vectors.

Recently, the 3DMM model was combined with the convolutional neural networks for data augmentation. However, the generated samples tend to be smooth and unrealistic in appearance as you can observe in the figure below.

Faces synthesized using the 3DMM linear model

Moreover, 3DMM generates samples following a Gaussian distribution, which rarely reflects the true distribution of the data. For instance, see below the first two PCA coefficients plotted for real faces vs the synthesized 3DMM faces. This gap between the real and synthesized distributions may easily result in non-plausible samples.

First two PCA coefficients of real (left) and 3DMM generated (right) faces

State-of-the-art idea

Slossberg, Shamai, and Kimmel from Technion – Israel Institute of Technology propose a new realistic data synthesis approach for human faces by combining GAN and 3DMM model.

In particular, the researchers employ a GAN to imitate the space of parametrized human textures and generate corresponding facial geometries by learning the best 3DMM coefficients for each texture. The generated textures are mapped back onto the corresponding geometries to obtain new generated high-resolution 3D faces.

This approach produces realistic samples, and it:

  • doesn’t suffer from indirect control over such desired attributes as pose and lighting;
  • is not limited to producing new instances of existing individuals.

Let’s have a closer look at their data processing pipeline…

Data processing pipeline

The process includes aligning 3D scans of human faces vertex to vertex and mapping their textures onto a 2D plane using a predefined universal transformation.

Data preparation pipeline

The data preparation pipeline contains four main stages:

  • Data acquisition: the researchers collected about 5000 scans from a wide variety of ethnic, gender, and age groups; each subject was asked to perform five distinct expressions including a neutral one.
  • Landmark annotation: 43 landmarks were added to the meshes semi-automatically by rendering the face and using a pre-trained facial landmark detector on the 2D images.
  • Mesh alignment: this was conducted by deforming a template face mesh according to the geometric structure of each scan, guided by the previously obtained facial landmark points.
  • Texture transfer: the texture is transferred from the scan to the template using a ray casting technique built into the animation rendering toolbox of Blender; then, the texture is mapped from the template to a 2D plane using the predefined universal mapping.

See the resulting mapped textures below:

Flattened aligned facial textures

The next step is to train GAN to learn and imitate these aligned facial textures. For this purpose, the researchers use a progressive growing GAN with the generator and discriminator constructed as symmetric networks. In this implementation, the generator progressively increases the resolution of the feature maps until reaching the output image size, while the discriminator gradually reduces the size back to a single output.

See below the new synthetic facial textures generated by the aforementioned GAN:

Facial textures synthesized by GAN

The final step is to synthesize the geometries of the faces. The researchers explored several approaches to finding plausible geometry coefficients for a given texture. You can observe the qualitative and quantitative (L2 geometric error) comparison between the various methods in the next figure:

Two synthesized textures mapped onto different geometries

The least-squares approach produces the lowest distortion results; considering also its simplicity, this method was chosen for all subsequent experiments.
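
As a rough illustration of this least-squares option, one can learn a linear map from a texture descriptor to the 3DMM geometry coefficients on the training set and then apply it to newly synthesized textures. This is only a sketch under assumptions: the choice of descriptor (here, a flattened, downsampled texture) and the bias term are not taken from the paper.

```python
# Hypothetical sketch: fit a linear texture -> geometry-coefficient map by least squares.
import numpy as np

def fit_texture_to_geometry(train_textures, train_geom_coeffs):
    # train_textures: (n, d) texture descriptors; train_geom_coeffs: (n, k) 3DMM coefficients
    X = np.hstack([train_textures, np.ones((train_textures.shape[0], 1))])  # add bias column
    W, *_ = np.linalg.lstsq(X, train_geom_coeffs, rcond=None)
    return W

def predict_geometry(texture_descriptor, W):
    x = np.append(texture_descriptor, 1.0)
    return x @ W   # k geometry coefficients for the given texture
```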

Experimental results

The proposed method can generate many new identities, and each one of them can be rendered under varying pose, expression, and lighting. Different expressions are added to the neutral geometry using the Blend Shapes model. The resulting images with different pose and lighting are shown below:

Identities generated by the proposed method with different pose and lighting

For quantitative evaluation of the results, the researchers used the sliced Wasserstein distance (SWD) to measure distances between the distributions of their training and generated images at different scales.

The table demonstrates that the textures generated by the proposed model are statistically closer to the real data than those generated by 3DMM.
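
The sliced Wasserstein distance itself can be approximated by projecting both sample sets onto random 1D directions, sorting the projections, and averaging the differences. The sketch below assumes equal-sized sets of flattened patches and omits the patch extraction and multi-scale pyramid used in the evaluation.

```python
# Minimal sketch of a sliced Wasserstein distance estimate via random projections.
import numpy as np

def sliced_wasserstein(real, fake, n_projections=512, rng=None):
    # real, fake: (n, d) arrays of flattened image patches (same n for simplicity)
    rng = rng or np.random.default_rng(0)
    d = real.shape[1]
    directions = rng.normal(size=(n_projections, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # unit directions
    proj_real = np.sort(real @ directions.T, axis=0)                 # 1D sorted projections
    proj_fake = np.sort(fake @ directions.T, axis=0)
    return np.abs(proj_real - proj_fake).mean()

# swd = sliced_wasserstein(real_patches, generated_patches)
```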

The next experiment was designed to evaluate if the proposed model is capable of generating samples that diverge significantly from the original training set and resemble previously unseen data. Thus, 5% of the identities were held out for evaluation. The researchers measured the L2 distance between each real identity from the test set to the closest identity generated by the GAN, as well as to the closest real identity from the training set.

The distance between the generated and real identities

As can be seen from the figure, the test-set identities are closer to the generated identities than to the training-set identities. Moreover, the “Test to fake” distances are not significantly larger than the “Fake to real” distances. This implies that the generator does not merely reproduce identities that are very close to the training set, but also produces novel identities resembling previously unseen examples.

Finally, a qualitative evaluation was performed to check if the proposed pipeline is able to generate original data samples. Thus, facial textures generated by the model were compared to their closest real neighbors in terms of L2 norm between identity descriptors.

Synthesized facial textures (top) vs. corresponding closest real neighbors (bottom)

As you can see, the nearest real textures are far enough to be visually distinguished as different people, which confirms the model’s ability to produce novel identities.

Bottom Line

The suggested model is probably the first to realistically synthesize both the texture and the geometry of human faces. It can be useful for training face detection, face recognition, or face reconstruction models. In addition, it can be applied wherever many different realistic faces are needed, for instance in the film industry or computer games. Moreover, the framework is not limited to synthesizing human faces and can be employed for other classes of objects where alignment of the data is possible.

DeepWrinkles: Accurate and Realistic Clothing Modeling

28 August 2018

Realistic garment reconstruction is notoriously a complex problem and its importance is undeniable in many research work and applications, such as accurate body shape and pose estimation in the wild (i.e., from observations of clothed humans), realistic AR/VR experience, movies, video games, virtual try-on, etc. For the past decades, physics-based simulations have been setting the standard in movie and video game industries, even though they require hours of labor by experts.

Facebook Research presents a novel approach called DeepWrinkles to generate accurate and realistic clothing deformation from real data capture. It consists of two complementary modules:

  • A statistical model is learned from 3D scans of clothed people in motion, from which clothing templates are precisely non-rigidly aligned.
  • Fine geometric details are added to normal maps generated using a conditional adversarial network whose architecture is designed to enforce realism and temporal consistency.

The goal is to recover all observable geometric details. Assuming the finest details are captured at sensor image pixel resolution and are reconstructed in 3D, all existing geometric details can then be encoded in a normal map of the 3D scan surface at a lower resolution, as shown in the figure below.


Cloth deformation is modeled by learning a linear subspace model that factors out body pose and shape; in contrast to physics-based simulation, this model is learned from real data.


The strategy ensures that deformations are represented compactly and with high realism. First, robust template-based non-rigid registrations are computed from a 4D scan sequence; then a clothing deformation statistical model is derived; and finally, a regression model is learned for pose retargeting.

Data Preparation

Data capture: For each type of clothing, 4D scan sequences are captured at 60 fps (e.g., 10.8k frames for 3 min) of a subject in motion, and dressed in a full-body suit with one piece of clothing with colored boundaries on top. Each frame consists of a 3D surface mesh with around 200k vertices yielding very detailed folds on the surface but partially corrupted by holes and noise. In addition, capturing only one garment prevents occlusions where clothing normally overlaps (e.g., waistbands) and items of clothing can be freely combined with each other.

Registration: The template of clothing T is defined by choosing a subset of the human template with consistent topology. T should contain enough vertices to model deformations (e.g., 5k vertices for a T-shirt). The clothing template is then registered to the 4D scan sequence using a variant of non-rigid ICP based on grid deformation.

Statistical model

The statistical model is computed using linear subspace decomposition by PCA. The poses of all n registered meshes are factored out of the model by pose-normalization using inverse skinning. Each registration R_i can be represented by a mean shape and vertex offsets o_i, such that R_i = M + o_i, where the mean shape M ∈ R^(3×v) is obtained by averaging vertex positions. Finally, each R_i can be compactly represented by a linear blend shape function B(α_i) = M + Σ_k α_(i,k) P_k, where the P_k are the first k principal components of the offsets and α_i are the corresponding shape parameters.
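
A minimal sketch of such a PCA blend-shape model is shown below. It assumes `registrations` is an (n, 3v) array of n registered, pose-normalized meshes with v vertices each; it is an illustration of the linear subspace idea, not the paper's code.

```python
# Hypothetical sketch: PCA subspace over registered clothing meshes.
import numpy as np

def build_blend_shape_model(registrations, k=50):
    mean_shape = registrations.mean(axis=0)                  # M
    offsets = registrations - mean_shape                     # o_i
    _, _, Vt = np.linalg.svd(offsets, full_matrices=False)
    components = Vt[:k]                                      # principal deformation modes P_k
    params = offsets @ components.T                          # alpha_i for each R_i
    return mean_shape, components, params

def blend_shape(mean_shape, components, alpha):
    # B(alpha) = M + sum_k alpha_k * P_k
    return mean_shape + alpha @ components
```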

Pose-to-shape prediction

A predictive model f is learned that takes joint poses as input and outputs a set of k shape parameters (α). This enables powerful applications where deformations are induced by the pose alone. To account for the deformation dynamics that occur during human motion, the model is also trained with pose velocity, acceleration, and a history of shape parameters.
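
As a rough illustration of such a pose-to-shape regressor, a simple ridge regression can stand in for f: the input vector stacks joint poses with velocity, acceleration, and a short history of shape parameters, and the output is the k blend-shape coefficients. The actual predictor in the paper may differ.

```python
# Hypothetical sketch: linear (ridge) regression from pose features to shape parameters.
import numpy as np

def fit_pose_to_shape(X, A, reg=1e-3):
    # X: (n, d) pose features; A: (n, k) shape parameters
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ A)
    return W

def predict_shape(pose_features, W):
    return pose_features @ W   # k shape parameters driving the clothing deformation
```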

Outline of DeepWrinkles

Architecture


The goal is to recover all observable geometric details. Assuming the finest details are captured at sensor image pixel resolution and are reconstructed in 3D, all existing geometric details can be encoded in a normal map of the 3D scan surface at a lower resolution. To automatically add fine details on the fly to reconstructed clothing, a generative adversarial network is proposed to leverage normal maps.

The proposed network is based on a conditional Generative Adversarial Network (cGAN) inspired by image-to-image translation. A convolution-batchnorm-ReLU structure with a U-Net is used in the generator, since it transfers information across the network layers and preserves the overall structure of the image. Temporal consistency is achieved by extending the L1 network loss term: for compelling animations, it is not only important that each frame looks realistic, but also that no sudden jumps occur in the rendering. To ensure a smooth transition between consecutively generated images across time, an additional loss L_temp is added to the GAN objective that penalizes discrepancies between the image generated at time t and the expected image (from the training dataset) at t − 1:

L = L_cGAN(G, D) + λ_data · L_data(G) + λ_temp · L_temp(G)

where L_data helps generate images close to the ground truth in an L1 sense (for less blurring). The temporal consistency term L_temp is meant to capture global fold movements over the surface.
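
A minimal sketch of these two loss terms, written with PyTorch tensors, is given below. It follows the description above (L_temp compares the image generated at t with the expected image at t − 1); the adversarial cGAN term is assumed to be computed elsewhere, and the weights are the ones reported later in the text.

```python
# Hypothetical sketch of the L1 data term and temporal-consistency term.
import torch

def deepwrinkles_losses(gen_t, gt_t, gt_prev, w_data=100.0, w_temp=50.0):
    l_data = torch.mean(torch.abs(gen_t - gt_t))        # L1 term, reduces blurring
    l_temp = torch.mean(torch.abs(gen_t - gt_prev))     # temporal consistency term
    return w_data * l_data + w_temp * l_temp

# total = gan_loss + deepwrinkles_losses(G(x_t), y_t, y_prev)
```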

The cGAN network is trained on a dataset of 9213 consecutive frames. The first 8000 images compose the training set, the next 1000 images the test set, and the remaining 213 images the validation set. Test and validation sets contain poses and movements not seen in the training set. The U-Net auto-encoder is constructed with 2 × 8 layers and 64 filters in each of the first convolutional layers. The discriminator uses patches of size 70 × 70. The L_data weight is set to 100, the L_temp weight to 50, and the GAN weight to 1. The images have a resolution of 256 × 256, although the authors' early experiments also showed promising results at 512 × 512.

Result

DeepWrinkles is an entirely data-driven framework to capture and reconstruct clothing in motion from 4D scan sequences. The evaluations show that high-frequency details can be added to low-resolution normal maps using a conditional adversarial network. A temporal loss added to the GAN objective preserves geometric consistency across time, and qualitative and quantitative evaluations are shown on different datasets.

Results: a) Physics-based simulation, b) Subspace (50 coefficients), c) Registration, d) DeepWrinkles, e) 3D scan (ground truth)

New Approach to Recovering 3D Shape Structure from a Single 2D Image

27 July 2018
3D recovery

Single-view image-based 3D modeling has been a topic of particular interest in the last few years, likely due to the tremendous success of deep convolutional neural networks (CNNs) on image-based learning tasks. However, most deep models output only a volumetric representation of 3D shapes. As a result, important information about shape topology and part structure is lost.

Figure 1. Results of 3D shape structure recovery from a single image. The top-8 images returned by Google when searching for “chair”, “table” and “airplane” were used to test the new approach. Failure cases are marked in red.

The alternative could be to recover 3D shape structure, which encompasses part composition and part relations. This task is quite challenging: inferring a part segmentation for a 3D shape is not an easy task by itself, but even if a segmentation is given, it is still challenging to reason about part relations such as connection, symmetry, parallelism, and others.

In fact, we can talk about several particular challenges here:

  • Part decomposition and relations are not as explicit in 2D images, as, for example, shape geometry. It should be also noted that compared to pixel-to-voxel mapping, recovering part structure from pixels would be a highly ill-posed task.
  • Many 3D CAD models of human-made objects contain diverse substructures, and recovery of those complicated 3D structures is far more challenging than shape synthesis modulated by a shape classification.
  • Objects from real images usually have different textures, lighting conditions, and backgrounds.

What’s Suggested

Chengjie Niu, Jun Li, and Kai Xu suggest learning a deep neural network that directly recovers 3D shape structure of an object from a single RGB image. To accomplish this task, they propose to learn and integrate two networks:

  • Structure masking network, which highlights multi-scale object structures in the input 2D image. It is designed as a multi-scale convolutional neural network (CNN) augmented with jump connections. Its task is to retain shape details while screening out structure-irrelevant information such as background and textures.
  • Structure recovery network, which recursively recovers a hierarchy of object parts abstracted by cuboids. This network takes as input the features extracted in the structure masking network, adds the CNN features of the original 2D image, and then feeds all these features into a recursive neural network (RvNN) for 3D structure decoding. The output is a tree organization of 3D cuboids with plausible spatial configuration and reasonable mutual relations.

The two networks are trained jointly. The training data includes image-mask and cuboid-structure pairs that can be generated by rendering 3D CAD models and extracting the box structure based on the given parts of the shape.

Network Architecture

An overview of the suggested network architecture is depicted in the image below. As you can see from the resultant cuboid structure of the chair, symmetries between chair legs (highlighted by red arrows) were successfully recovered by this network.

Figure 2. Network architecture

Let’s check more closely the details of the suggested solution.

The structure masking network is a two-scale CNN trained to produce a contour mask for the object of interest. The authors decided to include this network as the first step since previous studies of the subject revealed that object contours provide strong cues for understanding shape structures in 2D images. However, instead of utilizing the extracted contour mask, they suggest taking the feature map of the last layer of the structure masking network and feeding it into the structure recovery network.

Next, the structure recovery network combines features from two convolutional channels. One channel takes as input the last feature map before the mask prediction layer from the structure masking network. Another channel is the CNN feature of the original image extracted by a VGG-16. Since it is hard for the masking network to produce perfect mask prediction, the CNN feature of the original image provides complementary information by retaining more object information.

So, the recursive neural network (RvNN) starts from a root feature code and recursively decodes it into a hierarchy of features until reaching the leaf nodes, which are further decoded into vectors of box parameters. The suggested solution uses three types of nodes in its hierarchy (leaf, adjacency, and symmetry nodes) together with the corresponding decoders (box decoder, adjacency decoder, and symmetry decoder). An illustration of the decoder network at a given node is provided below.

Figure 3. Decoder network

Thus, during decoding, two types of part relations are recovered as the classes of internal nodes: adjacency and symmetry. In order to correctly determine the type of each node and use the corresponding decoder, a separate node classifier is trained jointly with the three decoders. It is learned on the structure recovery training task, where the ground-truth box structure is known for a given training pair of image and shape structure.
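
A highly simplified sketch of this recursive decoding step is shown below. It is not the authors' implementation: the feature size, box parameterization, and how symmetry parameters would replicate the generated boxes are all assumptions; it only illustrates how a node classifier can route each feature code to one of three decoders.

```python
# Hypothetical sketch of recursive structure decoding with a node classifier.
import torch
import torch.nn as nn

FEAT, BOX = 256, 12   # feature code size and box parameter size (assumed)

node_classifier = nn.Linear(FEAT, 3)       # 0: leaf, 1: adjacency, 2: symmetry
box_decoder = nn.Linear(FEAT, BOX)         # leaf -> cuboid parameters
adj_decoder = nn.Linear(FEAT, 2 * FEAT)    # adjacency -> two child codes
sym_decoder = nn.Linear(FEAT, FEAT + 8)    # symmetry -> child code + symmetry parameters

def decode_structure(code, depth=0, max_depth=8):
    node_type = node_classifier(code).argmax(dim=-1).item()
    if node_type == 0 or depth >= max_depth:            # leaf node: emit one cuboid
        return [box_decoder(code)]
    if node_type == 1:                                  # adjacency node: split and recurse
        left, right = adj_decoder(code).split(FEAT, dim=-1)
        return decode_structure(left, depth + 1) + decode_structure(right, depth + 1)
    child, sym_params = sym_decoder(code).split([FEAT, 8], dim=-1)
    boxes = decode_structure(child, depth + 1)          # symmetry node: decode generator part
    return boxes                                        # (sym_params would replicate the boxes)

root_code = torch.randn(1, FEAT)
cuboids = decode_structure(root_code)
print(len(cuboids), cuboids[0].shape)
```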

The dataset for training the model included 800 3D shapes from three ShapeNet categories: chairs (500), tables (200), and airplanes (100). For each 3D shape, the researchers rendered 36 views around the shape (every 30° of rotation at 3 elevations). Together with another 24 randomly generated views, there were 60 rendered RGB images in total for each shape. The 3D shapes were then composited onto backgrounds randomly selected from the NYU v2 dataset.

Results and Application

Some results of recovering 3D shape structure from a single RGB image using the suggested approach can be observed in Figure 1, where top 8 images, returned by Google for the search of “chair”, “table” and “airplane”, were selected, and then for each image a 3D cuboid structure was recovered. From the results, it can be observed that the approach described here is able to recover 3D shape structures from real images in a detailed and accurate way. Moreover, it allows recovering the connection and symmetry relations of the shape parts from single view inputs.

The authors of this approach suggest two settings, where their method of recovering 3D shape structure can be used:

  • structure-aware image editing;
  • structure-assisted 3D volume refinement.

The results of applying their method to these problems are demonstrated in the image below.

Figure 4. Top row: The inferred 3D shape structure can be used to complete and refine the volumetric shape. Bottom row: The structure is used to assist structure-aware image editing.

Bottom Line

The suggested approach to recovering 3D shape structure from a single RGB image has several important strengths:

  • connection and symmetry relations are recovered quite accurately;
  • the overall result is sufficiently detailed;
  • the method can be useful for structure-aware image editing and structure-assisted 3D volume refinement.

However, the method fails to recover structures for object categories unseen in the training set. Moreover, it currently recovers 3D cuboids only, not the underlying part geometry, so a round table appears as a square table in the recovered 3D shape structure.

Figure 5: Comparing single-view, part-based 3D shape reconstruction between our Im2Struct and two alternatives

To sum up, by combining 2 neural networks (structure masking network and structure recovery network) the researchers managed to recover faithful and detailed 3D shape structure of an object from a single 2D image, reflecting part connectivity and symmetries — something that has never been done before.

The main job was done by the second network (namely, reflecting part connectivity and symmetries) while combining it with the structure masking network allowed for more accurate results in general. From this point of view, we may say that structure recovery network, and, in particular, structure decoding part of this network is a key component of this research.

Realistic 3D Avatars from a Single Image

27 July 2018
3d avatar

Digital media needs realistic 3D avatars with faces. The recent surge in augmented and virtual reality platforms has created an even stronger demand for high-quality content, and rendering realistic faces plays a crucial role in achieving engaging face-to-face communication between digital avatars in simulated environments.

So, what would the perfect algorithm look like? A person takes a mobile “selfie”, uploads the picture, and gets an avatar in a simulated environment with accurately modeled facial shape and reflectance. In practice, however, significant compromises are made to balance the amount of input data to be captured, the amount of computation required, and the quality of the final output.

Figure 1. Inferring high-fidelity facial reflectance and geometry maps from a single image

Despite the high complexity of the task, a group of researchers from the USC Institute for Creative Technologies claims that their model can efficiently create accurate, high-fidelity 3D avatars from a single input image captured in an unconstrained environment. Furthermore, the avatars are close in quality to those created by professional capture systems while requiring minimal computation and no special expertise from the photographer.

So, let’s discover their approach to creating high-fidelity avatars from a single image without extensive computations or manual efforts.

Overview of the Suggested Approach

First of all, the model is trained with high-resolution facial scans obtained using a state-of-the-art multi-view photometric facial scanning system. This approach helps to get high-resolution and high-fidelity geometric and reflectance maps from a 2D input image, which can be captured under arbitrary illumination and contain partial occlusions of the face. The inferred maps can be next used to render a compelling and realistic 3D avatar in novel lighting conditions. The whole process can be accomplished in seconds.

Considering the complexity of the task, it was decomposed into several problems, which are addressed by separate convolutional neural networks:

· Stage 1 includes obtaining the coarse geometry by fitting a 3D template model to the input image, extracting an initial facial albedo map from this model, and then using networks that estimate illumination-invariant specular and diffuse albedo and displacement maps from this texture.

· Stage 2: the inferred maps, which may have missing regions due to occlusions in the input image, are passed through networks for texture completion. High-fidelity textures are obtained using a multi-resolution image-to-image translation network, in which latent convolutional features are flipped so as to achieve a natural degree of symmetry while maintaining local variations.

· Stage 3: another network is used to obtain additional details in the completed regions.

· Stage 4: a convolutional neural network performs super-resolution to increase the pixel resolution of the completed texture from 512 × 512 into 2048 × 2048.

Let’s discuss the architecture of the suggested model in more details.

Model Architecture

The pipeline of the proposed model is illustrated below. Given a single input image, the base mesh and corresponding facial texture map are extracted. This map is passed through two convolutional neural networks (CNNs) that perform inference to obtain the corresponding reflectance and displacement maps. Since these maps may contain large missing regions, the next step includes texture completion and refinement to fill these regions based on the information from the visible regions. And finally, super-resolution is performed. The resulting high-resolution reflectance and geometry maps may be used to render high-fidelity avatars in novel lighting environments.

Figure 2. The pipeline of the proposed model

Reflectance and geometry inference. A pixel-wise optimization algorithm is adopted to obtain the base facial geometry, head orientation, and camera parameters. These are then used to project the face into a texture map in UV space, and the non-skin region is removed. The extracted RGB texture is fed into a U-net model with skip connections to obtain the corresponding diffuse and specular reflectance maps and the mid- and high-frequency displacement maps.

To obtain the best overall performance, two networks with identical architectures were employed: one operating on the diffuse albedo map (subsurface component), and the other on the tensor obtained by concatenating the specular albedo map with the mid- and high-frequency displacement maps (collectively surface components).

Symmetry-aware texture completion. Again, the best results were obtained by training two network pipelines: one to complete the diffuse albedo, and another to complete the other components (specular albedo, mid- and high-level displacement).

Next, it was discovered that completing large areas at high resolution doesn’t give satisfactory results due to the high complexity of the learning objective. Thus, the inpainting problem was divided into simpler sub-problems as shown on the picture below.

Figure 3. Texture completion pipeline

Furthermore, the researchers leveraged the spatial symmetry of the UV parameterization and maximized feature coverage by flipping intermediate features over the V-axis in UV space and concatenating them with the original features. This allows the completed textures to contain a natural degree of symmetry, as seen in real faces, instead of an uncanny degree of near-perfect symmetry.
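
This flip-and-concatenate trick is easy to illustrate. The sketch below assumes NCHW feature maps and that the mirroring happens along the width dimension of the UV texture; the exact axis and where in the network this is applied are assumptions, not the paper's code.

```python
# Hypothetical sketch: mirror intermediate UV-space features and concatenate them
# with the originals along the channel dimension.
import torch

def flip_and_concat(features: torch.Tensor) -> torch.Tensor:
    # features: (batch, channels, height, width) feature maps in UV space
    flipped = torch.flip(features, dims=[-1])      # mirror across the vertical axis
    return torch.cat([features, flipped], dim=1)   # doubles the channel count

feat = torch.randn(1, 64, 32, 32)
print(flip_and_concat(feat).shape)    # torch.Size([1, 128, 32, 32])
```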

Each network was trained using the Adam optimizer with a learning rate set to 0.0002.

Figure 4. Examples of resulting renderings in new lighting conditions

Results

Quantitative evaluations of the system’s ability to faithfully recover the reflectance and geometry data from a set of 100 test images are depicted in the Table below.

Table 1. Peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) of the inferred images for 100 test images compared to the ground truth

Even though we observe relatively large differences from the ground truth of the specular albedo results, qualitative evaluations demonstrate that the inferred data is still sufficient for rendering compelling and high-quality avatars.

Figure 5. Zoom-in results showing synthesized mesoscopic details

Furthermore, the results were compared quantitatively and qualitatively to other state-of-the-art methods. This comparison revealed that the new approach presented here results in more coherent and plausible facial textures than any of the alternative methods.

Figure 6. Comparison with PCA, Visio-lization [Mohammed et al. 2009], and a state-of-the-art diffuse albedo inference method [Saito et al. 2017] 

Table 2. Quantitative comparison of the suggested model with several alternative methods, measured using the PSNR and the root-mean-square error (RMSE)

Conclusion

In summary, the suggested approach makes it feasible to infer high-resolution reflectance and geometry maps from a single unconstrained image. Not only are these maps sufficient for rendering compelling and realistic avatars, but they can be obtained within seconds rather than the several minutes required by alternative methods. These results are possible in large part due to the use of high-quality ground-truth 3D scans and the corresponding input images. Moreover, the technique of flipping and concatenating convolutional features encoded in the latent space of the model makes it possible to perform texture completion while preserving the natural degree of facial symmetry.

Figure 7. Demonstration of the model’s limitations

Still, the suggested approach has several limitations that are demonstrated in the figure above. The method produces artifacts in the presence of strong shadows and non-skin objects due to segmentation failures. Also, volumetric beards are not faithfully reconstructed, and strong dynamic wrinkles may cause artifacts in the inferred displacement maps.

Nevertheless, these limitations do not diminish the significant contribution that the suggested approach makes to the problem of creating high-fidelity avatars for simulated environments.

3D Hair Reconstruction Out of a Single Image

10 July 2018
3D Hair Reconstruction Out of a Single Image

3D Hair Reconstruction Out of a Single Image

Generating a realistic 3D model of an object from 2D data is a challenging task, and the problem has been explored by many researchers in the past. The creation…

Generating a realistic 3D model of an object from 2D data is a challenging task, and the problem has been explored by many researchers in the past. The creation and rendering of a high-quality 3D model is itself challenging, and estimating an object’s 3D shape from a 2D image is a very difficult task. People have been trying to address this issue, especially when digitizing virtual humans (in contexts ranging from video games to medical applications). Although there has been enormous success, the generation of high-quality, realistic 3D object models is still not a solved problem. In human shape modeling, there has been great success in reconstructing the human face, but far less in generating 3D hair models.

This problem of generating 3D hair models has recently been addressed by researchers from the University of Southern California, the USC Institute for Creative Technologies, Pinscreen, and Microsoft Research Asia, who propose a deep-learning-based method for 3D hair reconstruction from a single unconstrained 2D image.

Unlike previous approaches, the proposed deep-learning-based method is able to directly generate hair strands instead of volumetric grids or point-cloud structures. According to the authors, the new approach achieves state-of-the-art resolution and quality and brings significant improvements in speed and storage costs. Moreover, as an important contribution, the model provides a smooth, compact, and continuous representation of hair geometry, which enables smooth sampling and interpolation.

Data representation in the proposed method

The Method

The proposed approach consists of three steps:

  1. Preprocessing that calculates the 2D orientation field of the hair region.
  2. A deep neural network that takes the 2D orientation field and outputs generated hair strands (in the form of sequences of 3D points).
  3. A reconstruction step that generates a smooth and dense hair model.

As mentioned before, the first step is the preprocessing of the image, where the authors obtain the 2D orientation field of the hair region only. The first filter therefore extracts the hair region, using a robust pixel-wise hair mask on the portrait image. After that, Gabor filters are used to detect the orientation and construct the pixel-wise 2D orientation map. It is also worth noting that the researchers use undirected orientations, since they are interested only in the orientation and not the actual hair growth direction. To further improve the hair region segmentation, they also apply human head and body segmentation masks. Finally, the output of the preprocessing step is a 3 x 256 x 256 image whose first two channels encode the colour-coded orientation and whose third channel is the segmentation mask.
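
As a rough illustration of this preprocessing step, the sketch below builds an undirected orientation map with a bank of OpenCV Gabor filters. The filter parameters, the cosine/sine colour-coding, and the externally supplied `hair_mask` are assumptions for illustration, not the authors' exact choices.

```python
import cv2
import numpy as np

def hair_orientation_input(gray, hair_mask, n_orient=32):
    """Build a 3 x H x W input: two channels of colour-coded undirected orientation plus the hair mask."""
    h, w = gray.shape
    responses = np.zeros((n_orient, h, w), np.float32)
    for i in range(n_orient):
        theta = np.pi * i / n_orient                              # undirected: angles in [0, pi)
        kern = cv2.getGaborKernel((17, 17), 4.0, theta, 8.0, 0.5, 0.0)
        responses[i] = np.abs(cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kern))
    angle = np.pi * responses.argmax(axis=0) / n_orient           # dominant orientation per pixel
    mask = hair_mask.astype(np.float32)
    out = np.stack([np.cos(2 * angle), np.sin(2 * angle), mask])  # assumed colour-coding of orientation
    out[:2] *= mask                                               # orientation kept only inside the hair region
    return out                                                    # shape (3, H, W); images resized to 256 x 256 upstream

gray = np.random.rand(256, 256).astype(np.float32)                # toy grayscale portrait
mask = (np.random.rand(256, 256) > 0.5)                           # toy hair mask from a segmentation step
orientation_input = hair_orientation_input(gray, mask)
```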

Deep Neural Network

Data Representation

The output of the hair prediction network is a hair model represented as sequences of ordered 3D points, one sequence per modeled hair strand. In the experiments, each sequence consists of 100 3D points, each carrying position and curvature attributes. A hair model thus contains N strands (sequences).
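
In array terms, this strand-based representation might look as follows; the exact layout and the number of strands are assumptions used purely for illustration.

```python
import numpy as np

N = 1000                                                   # number of strands in one hair model (toy value)
positions  = np.zeros((N, 100, 3), dtype=np.float32)       # x, y, z for each of the 100 sample points
curvatures = np.zeros((N, 100, 1), dtype=np.float32)       # scalar curvature per sample point
hair_model = np.concatenate([positions, curvatures], -1)   # one hair model: shape (N, 100, 4)
```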


The input orientation image is first encoded into a high-level feature vector and then decoded into 32 x 32 individual strand features. Each of these features is then decoded into hair geometry, represented by the positions and curvatures of the points in the strand.

Network Architecture

The network takes the orientation image as input and outputs the two matrices described above: positions and curvatures. It has an encoder-decoder convolutional architecture that deterministically encodes the input image into a latent vector of fixed size (512). This latent vector represents the hair feature, which is then decoded by the decoder part. The encoder consists of 5 convolutional layers and a max-pooling layer. The encoded latent vector is decoded by a decoder of 3 deconvolutional layers into multiple strand feature vectors (as mentioned above), and finally an MLP further decodes these feature vectors into the desired geometry of curvatures and positions.

The proposed encoder-decoder architecture that performs the 3D hair reconstruction
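
The sketch below mirrors this encoder-decoder layout in PyTorch. The channel counts, the latent-to-grid projection, and the MLP sizes are illustrative assumptions rather than the paper's exact configuration; `HairNetSketch` is a hypothetical name.

```python
import torch
import torch.nn as nn

class HairNetSketch(nn.Module):
    """Illustrative encoder-decoder; layer shapes are assumptions, not the paper's exact values."""
    def __init__(self, latent_dim=512, points=100):
        super().__init__()
        self.points = points
        # Encoder: 5 strided conv layers, then global max pooling to a 512-d latent vector.
        chans = [3, 32, 64, 128, 256, latent_dim]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.AdaptiveMaxPool2d(1), nn.Flatten()]   # stand-in for the paper's max-pooling layer
        self.encoder = nn.Sequential(*layers)
        # Decoder: project the latent to a 4x4 map, then 3 deconvolutions up to 32x32 strand features.
        self.project = nn.Linear(latent_dim, 256 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),  # 8x8
            nn.ConvTranspose2d(128, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),  # 16x16
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),   # 32x32
        )
        # MLP decoding each strand feature into 100 points x (3 position + 1 curvature) values.
        self.mlp = nn.Sequential(nn.Linear(64, 256), nn.ReLU(inplace=True), nn.Linear(256, points * 4))

    def forward(self, orient_img):                        # orient_img: (B, 3, 256, 256)
        z = self.encoder(orient_img)                      # (B, 512) latent hair feature
        grid = self.decoder(self.project(z).view(-1, 256, 4, 4))   # (B, 64, 32, 32)
        feats = grid.flatten(2).transpose(1, 2)           # (B, 1024, 64): one feature per strand
        out = self.mlp(feats).view(-1, 32 * 32, self.points, 4)
        return out[..., :3], out[..., 3:]                 # positions, curvatures

pos, curv = HairNetSketch()(torch.rand(1, 3, 256, 256))
print(pos.shape, curv.shape)   # torch.Size([1, 1024, 100, 3]) torch.Size([1, 1024, 100, 1])
```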

To optimize this architecture for the specific problem, the authors employ 3 loss functions: two of them are L2 reconstruction losses on the geometry (3D positions and curvatures), and the third is a collision loss measuring collisions between the hair strands and the human body.

The ellipsoids used for collision testing
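
A hedged sketch of how the three loss terms described above could be combined; the ellipsoid-based collision penalty and the weighting are simplified stand-ins for the authors' formulation, and the ellipsoid centers and semi-axes are placeholders.

```python
import torch

def hair_losses(pred_pos, pred_curv, gt_pos, gt_curv, ellipsoids, w_collision=1.0):
    # Two L2 reconstruction losses: 3D positions and curvatures.
    loss_pos  = torch.mean((pred_pos - gt_pos) ** 2)
    loss_curv = torch.mean((pred_curv - gt_curv) ** 2)
    # Collision loss: for each body ellipsoid (center c, semi-axes a), points with
    # sum(((p - c) / a)^2) < 1 lie inside the body and are penalized.
    loss_col = 0.0
    for c, a in ellipsoids:
        d = (((pred_pos - c) / a) ** 2).sum(dim=-1)       # normalized squared distance to the ellipsoid
        loss_col = loss_col + torch.relu(1.0 - d).mean()  # positive only for points inside the ellipsoid
    return loss_pos + loss_curv + w_collision * loss_col
```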

Evaluation and Conclusions

To evaluate the proposed method, the researchers use both quantitative and qualitative metrics. For the quantitative analysis, they compute the reconstruction loss separately for the visible and non-visible parts of the hair to enable a comparison. They create a synthetic test set of 100 random hair models, with 4 images rendered from random views for each hair model. The results and a comparison with existing methods are given in the following table.

Comparison of the proposed method with existing methods, split into visible and invisible parts
Space and time complexity of the method and comparison to Chai et al. approach

On the other hand, to qualitatively evaluate the performance of the proposed approach, the researchers test the method on a few real portrait photographs and show that it can handle different hair lengths (short, medium, long) as well as reconstruct different levels of curliness within hairstyles.

Comparison on real portrait images

Moreover, they also test smooth sampling and interpolation, showing that the model can smoothly interpolate between hairstyles (e.g., from straight to curly or from short to long).

Interpolation results between two hairstyles (short and long)
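
Because the hairstyle is encoded in a single latent vector, interpolation can be sketched as a linear blend of two latent codes. The snippet below reuses the hypothetical `HairNetSketch` model from the earlier sketch and is an illustration, not the authors' procedure.

```python
import torch

def interpolate_hair(model, orient_a, orient_b, steps=5):
    """Blend two hairstyles by linearly interpolating their 512-d latent codes."""
    za, zb = model.encoder(orient_a), model.encoder(orient_b)
    results = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1 - t) * za + t * zb                                  # interpolated latent hair feature
        grid = model.decoder(model.project(z).view(-1, 256, 4, 4))
        feats = grid.flatten(2).transpose(1, 2)
        out = model.mlp(feats).view(-1, 32 * 32, model.points, 4)
        results.append((out[..., :3], out[..., 3:]))               # interpolated positions, curvatures
    return results

model = HairNetSketch()                                            # hypothetical model from the earlier sketch
frames = interpolate_hair(model, torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
```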

Overall, the proposed method is interesting in many ways. It shows that an end-to-end network architecture can successfully reconstruct 3D hair from a 2D image, which is impressive in itself, and that it can smoothly transition between hairstyles via interpolation, thanks to the employed encoder-decoder architecture.

Dane Mitrev

Spherical CNN Kernels for 3D Point Clouds

30 May 2018
3d point clouds

Spherical CNN Kernels for 3D Point Clouds

From a data structure point of view, point clouds are unordered sets of vectors. They differ greatly from other data types such as images and videos.…

From a data structure point of view, point clouds are unordered sets of vectors. They differ greatly from other data types such as images and videos. Still, many sensors, such as Microsoft’s Kinect and LIDAR (used in the autonomous driving industry), provide point clouds as output data. This kind of data requires dedicated processing techniques that can exploit and extract as much information as possible.

Typically, a convolutional neural network operates on 2D image data and learns by taking pixels as input. Over time, researchers in the deep learning community have found ways to use other kinds of data, such as videos and binary images, with convolutional architectures. Convolutional neural networks are designed to exploit the spatial structure of the data and have proved very successful at capturing spatial information. They are able to learn a hierarchy of features directly from the pixel data by applying kernel operations in well-defined local regions (called local receptive fields).

When it comes to spatial information, convolutional neural networks have achieved greater success with 2D data than with 3D spatial data, which raises the question: why are CNNs worse with 3D data?

Recently, two types of CNNs have been developed for learning over 3D data: volumetric representation-based CNNs and multi-view-based CNNs. Empirical results have shown that there is a considerable gap between the two and that existing volumetric CNN architectures are unable to fully exploit the power of 3D representations. This stems mostly from the computational and storage costs of the network, which grow cubically with the input resolution. In this context, processing point clouds (which represent 3D spatial data) is computationally very costly, and 3D CNN architectures have been applied only to low input resolutions, ranging from 30x30x30 to 256x256x256. Moreover, 3D CNN kernels are typically applied to volumetric representations of 3D data, which makes learning over point clouds even more difficult and often infeasible.

In a novel approach, researchers from the University of Western Australia propose an innovative way of handling point clouds in CNNs by introducing spherical convolutions. The key idea is to traverse the 3D space with a spherical kernel and to partition the space using an octree data structure.

The proposed spherical kernel, with the sphere uniformly partitioned into bins.

According to the authors, spherical regions are suitable for computing geometrically meaningful features from unstructured 3D data. Their approach takes each point in space (with x, y, z coordinates) and defines a spherical region around it. The sphere is then divided into n x p x q bins by partitioning it uniformly along the radial, azimuth, and elevation dimensions. For each bin, a weight matrix of learnable parameters is defined; together, the matrices from all bins form a single spherical convolutional kernel. To compute the activation of a single point in the point cloud, the relevant weight matrices of all neighbouring points (the points that lie inside the sphere) are applied, with each neighbouring point represented by its spherical coordinates relative to the point of interest.
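
A simplified NumPy sketch of computing one point's activation with such a spherical kernel; the bin layout, the uniform radial split, and the averaging at the end are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def spherical_conv_point(center, neighbors, feats, weights, n_rad=3, n_az=8, n_el=4):
    """
    center:    (3,) point of interest
    neighbors: (K, 3) points inside the sphere around `center`
    feats:     (K, C_in) input features of the neighbors
    weights:   (n_rad * n_az * n_el, C_in, C_out) one learnable weight matrix per bin
    """
    if len(neighbors) == 0:
        return np.zeros(weights.shape[-1])
    rel = neighbors - center                               # relative coordinates
    r = np.linalg.norm(rel, axis=1) + 1e-8
    az = np.arctan2(rel[:, 1], rel[:, 0])                  # azimuth in [-pi, pi]
    el = np.arcsin(np.clip(rel[:, 2] / r, -1, 1))          # elevation in [-pi/2, pi/2]
    # Uniform binning along the radial, azimuth, and elevation dimensions.
    rb = np.minimum((r / r.max() * n_rad).astype(int), n_rad - 1)
    ab = np.minimum(((az + np.pi) / (2 * np.pi) * n_az).astype(int), n_az - 1)
    eb = np.minimum(((el + np.pi / 2) / np.pi * n_el).astype(int), n_el - 1)
    bins = (rb * n_az + ab) * n_el + eb                    # flat bin index per neighbor
    out = np.zeros(weights.shape[-1])
    for k in range(len(neighbors)):
        out += feats[k] @ weights[bins[k]]                 # apply the bin's weight matrix
    return out / len(neighbors)

W = np.random.rand(3 * 8 * 4, 16, 32)                      # (bins, C_in, C_out), toy sizes
act = spherical_conv_point(np.zeros(3), np.random.rand(20, 3), np.random.rand(20, 16), W)
```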

Since point clouds are not in a regular format, most researchers transform such data into regular 3D voxel grids or collections of images (e.g., views) before feeding it to a deep network architecture. In contrast, the authors represent the point cloud with an octree structure. As mentioned before, this is less costly in terms of computation and storage than a volumetric voxel-grid representation, and it can handle irregular 3D point clouds (note that most point clouds coming from sensors are irregular, with highly variable point density). They use an octree of depth L, where each depth level represents a partitioning of the 3D space, from coarser to finer (top to bottom). The network is trained such that kernels are applied in the neighbourhood of each point. The matrices assigned to each bin when the kernel is applied are the weights learned during training.

The octree partitioning and the resulting network architecture.
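
For intuition, here is a tiny recursive octree builder (not the paper's implementation) that produces the coarse-to-fine partition described above; boundary points may land in more than one child in this toy version, and the depth and point counts are arbitrary.

```python
import numpy as np

def build_octree(points, center, half_size, depth, max_depth):
    """Return a nested dict describing which points fall into which octant at each level."""
    node = {"center": center, "half_size": half_size, "points": points, "children": []}
    if depth == max_depth or len(points) <= 1:
        return node
    for dx in (-1, 1):
        for dy in (-1, 1):
            for dz in (-1, 1):
                child_center = center + np.array([dx, dy, dz]) * half_size / 2
                mask = np.all(np.abs(points - child_center) <= half_size / 2, axis=1)
                if mask.any():
                    node["children"].append(
                        build_octree(points[mask], child_center, half_size / 2,
                                     depth + 1, max_depth))
    return node

pts = np.random.rand(1000, 3)                              # toy point cloud in the unit cube
tree = build_octree(pts, center=np.array([0.5, 0.5, 0.5]), half_size=0.5,
                    depth=0, max_depth=3)                  # octree of depth L = 3
```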

Evaluation and Conclusions

This work shows promising results for the classification of objects in (irregular) 3D point clouds. The evaluation was performed on standard benchmark datasets, comparing against state-of-the-art methods. The architecture outperforms ECC and OctNet but fails to outperform PointNet, the current state-of-the-art architecture, when evaluated on ModelNet10 and ModelNet40. The training experiments also show that data augmentation improves results significantly.

Comparison of the proposed spherical kernel method with existing methods.
Improving accuracy by using more data.

Finally, this approach shows very good results on what appears to be a very difficult task: finding an efficient way to use convolutional neural networks with 3D point clouds. It shows that the difficulty of learning the point cloud structure can be reduced by keeping and learning a set of points that represents the skeleton of an object. A suitable data representation that captures this is therefore necessary, and in this case it is the octree. The authors show the evolution of the point cloud as a function of the octree depth.

Evolution of the point cloud representation with octrees of different depths (left). Pattern learned by a spherical kernel (right).

This novel approach opens the door to further investigation and use of non-conventional deep learning techniques (like the spherical kernel), as well as to efficient processing of irregular 3D point cloud data. It shows that a point cloud can be processed with a convolutional neural network in a scalable manner, as demonstrated by the results on the object recognition task.

Dane Mitriev