3D Scene Rendering From a Single Image

10 August 2018

Unlike in the early days of deep learning, techniques such as unsupervised, self-supervised, and multi-view learning have recently started to receive more and more attention. Arguing that the lack of supervision makes some problems much more difficult to solve with deep learning techniques, researchers from the University of California, Berkeley and Google have proposed an interesting multi-view learning approach for 3D scene rendering and prediction.

To overcome the key challenge, the lack of available training data, the researchers rely on a multi-view approach that allows learning a layered representation of the 3D scene. Specifically, they use a representation known as a layered depth image (LDI), and the proposed method is able to infer such a representation of the 3D space from a single given image.

The proposed method for 3D scene rendering out of a single image

Data Representation

There have been many attempts to use deep learning methods to infer a depth map from a single image. Most of them predict a single depth value per pixel, whether from single images, monocular videos, or calibrated stereo pairs. Unlike these approaches, the goal in this paper is to learn a mapping from a single image to a layer-based representation. Multiple ordered depth values therefore have to be estimated per pixel, and the LDI representation allows exactly this.

A simple explanation of the LDI representation. Layers correspond to distance from the visible surface

Method

Reconstructing a 3D scene from a single flat image also means predicting the parts of the space that are not visible from the given viewpoint. In this method, the occluded space behind the visible surfaces is estimated as well, since it is required for full scene rendering and understanding. As mentioned before, a depth map answers, for each pixel, the question: “How far from the camera is the point imaged at this pixel?”. The LDI representation additionally answers the question: “What lies behind the visible content of this pixel?”, which pushes this approach further in the context of scene description and allows prediction beyond 2.5D.

6 randomly generated synthetic training samples

Data

The layered depth image representation describes a 3D structure as layers of depth and color images. The 3D space is not sliced along the depth dimension; instead, the layers are defined relative to the surface that is visible from the camera viewpoint. An LDI therefore consists of L tuples of color (i.e. texture) and disparity (inverse depth) images.

The training dataset used in this method consists of N image pairs, each represented by a source and a target image (Is, It), the camera intrinsics (Ks, Kt), and the relative camera transform given by a rotation and a translation (R, t).
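As a rough illustration, the two data structures can be sketched as follows; the class and field names are hypothetical and only mirror the description above, not the authors' code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LDI:
    """Layered depth image: L tuples of color (texture) and disparity images."""
    textures: np.ndarray     # shape (L, H, W, 3), per-layer color
    disparities: np.ndarray  # shape (L, H, W), per-layer inverse depth

    @property
    def foreground_disparity(self) -> np.ndarray:
        # The first layer is the surface visible from the camera, i.e. what a
        # standard single-layer disparity/depth map would contain.
        return self.disparities[0]

@dataclass
class TrainingPair:
    """One of the N training samples: two views with known relative geometry."""
    I_s: np.ndarray  # source image, (H, W, 3)
    I_t: np.ndarray  # target image, (H, W, 3)
    K_s: np.ndarray  # source camera intrinsics, (3, 3)
    K_t: np.ndarray  # target camera intrinsics, (3, 3)
    R: np.ndarray    # relative rotation, (3, 3)
    t: np.ndarray    # relative translation, (3,)
```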

View Synthesis as Supervision Signal

Given a source image Is, the method estimates the corresponding LDI representation using a convolutional neural network. The supervision signal comes from the target image (a flat image from a different viewpoint): a view-synthesis step renders an image from the estimated LDI and the viewpoint, and the training objective enforces similarity between this rendered image and the target image. The target view is rendered using a geometrically defined rendering function and the known camera transform (the method assumes the camera transform is known). To perform the rendering, the LDI is treated as a textured point cloud. Each source point is forward-projected onto the target frame, occlusions are handled by a proposed “soft z-buffer”, and the target image is finally rendered as a weighted average of the colors of the projected points.
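The following is a heavily simplified NumPy sketch of the soft z-buffer idea; the function name, the exponential depth weighting with a temperature tau, and all shapes are illustrative assumptions, and the paper's actual splatting and weighting details differ.

```python
import numpy as np

def soft_zbuffer_render(pixel_ids, depths, colors, num_pixels, tau=1.0):
    """Blend projected points per target pixel with a 'soft z-buffer'.

    pixel_ids : (P,) index of the target pixel each projected point lands on
    depths    : (P,) depth of each projected point in the target frame
    colors    : (P, 3) color carried over from the source LDI
    """
    # Closer points get exponentially larger weights, so occluded points
    # contribute little while gradients still flow (unlike a hard z-buffer).
    weights = np.exp(-depths / tau)
    rendered = np.zeros((num_pixels, 3))
    weight_sum = np.zeros(num_pixels)
    np.add.at(rendered, pixel_ids, weights[:, None] * colors)
    np.add.at(weight_sum, pixel_ids, weights)
    return rendered / np.maximum(weight_sum[:, None], 1e-8)
```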

Network Architecture

The network architecture used for LDI estimation is a DispNet-style convolutional network: given a color image, it computes spatial features at various resolutions and decodes them back to the original resolution, with skip connections added between the encoder and decoder. To account for the fact that not all layers receive the same learning signal, the authors add a separate prediction block (the last part of the network) for each LDI layer. As a training objective, they enforce that the rendered image should be similar to the image observed from that viewpoint.
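As a toy illustration of the shared trunk with per-layer prediction heads (not the authors' DispNet: the layer count, channel widths, and the missing skip connections are simplifications):

```python
import torch
import torch.nn as nn

class LayeredPredictor(nn.Module):
    """Toy encoder-decoder trunk with one prediction head per LDI layer."""
    def __init__(self, num_ldi_layers=2, feat=32):
        super().__init__()
        # Shared trunk: downsample twice, then upsample back to input resolution.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )
        # Separate prediction block per LDI layer: 3 color channels + 1 disparity.
        self.heads = nn.ModuleList(
            [nn.Conv2d(feat, 4, 3, padding=1) for _ in range(num_ldi_layers)]
        )

    def forward(self, image):                            # image: (B, 3, H, W)
        features = self.trunk(image)
        return [head(features) for head in self.heads]   # one (B, 4, H, W) per layer
```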

The proposed network architecture (DispNet). The last part is separately defined per layer

Evaluation and Conclusions

View synthesis error on synthetic data
View synthesis error on KITTI
Geometry prediction error on synthetic data

To evaluate the proposed method, the researchers have used both synthetic data and data from a real outdoor driving dataset. By evaluating the separate modules and the solution as a whole both quantitatively and qualitatively, they conclude that the method is able to successfully capture occluded structure.

LDI prediction on the KITTI dataset

Although we are still far from full 3D scene understanding, approaches such as this one show that deep learning techniques can be applied successfully to scene understanding, even from a single image. In particular, this method pushes the boundaries of the field and goes a step beyond 2.5D scene prediction.

Depth Estimation Using Encoder-Decoder Networks and Self-Supervised Learning

25 June 2018

Modern autonomous mobile robots (including self-driving cars) require a strong understanding of their environment in order to operate safely and effectively. Comprehensive and accurate models of the surrounding environment are crucial for solving the challenges of autonomous operation. However, only a limited amount of information can be perceived through sensors, which are constrained in their capabilities, their field of view, and the kind of data they provide.

While sensors such as LIDAR, radar, and the Kinect provide 3D data covering all spatial dimensions, cameras only provide a 2D view of the surroundings. Many attempts have been made in the past to extract 3D information from the 2D images a camera produces. The human visual system is remarkably successful at this task, while algorithms very often fail to reconstruct and infer a depth map from an image.

A novel approach proposes using deep learning in a self-supervised manner to tackle the problem of monocular depth estimation. Researchers from University College London have developed an architecture for depth estimation that beats the current state of the art on the KITTI benchmark. Arguing that large-scale, varied datasets with ground-truth depth are scarce, they propose a self-supervised approach based on monocular videos. Their approach and improvements in depth estimation work well with monocular video data as well as with stereo pairs (note: synchronized image pairs from a stereo camera), or even with a combination of both.

Comparison of existing methods with the proposed method (bottom right) on estimating depth from a 2D image

The method

A simple way to address depth estimation from a deep learning perspective is to train a network in a supervised manner using depth images as ground truth. However, as mentioned before, having enough labeled data (in this case, paired 2D-3D data) to train a sufficiently large and deep network architecture is a challenge. As a consequence, the authors explore a self-supervised training approach: they frame the problem as view synthesis, where the network learns to predict a target image as seen from the viewpoint of another image. At test time, the method produces a depth estimate given only a single color image.
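The core of such view-synthesis supervision is a differentiable warp that resamples the source image into the target view using the predicted depth and the relative camera pose (here assumed given; the pose network discussed below predicts it). The sketch below is illustrative only, assuming pinhole intrinsics `K` of shape (B, 3, 3) and a 4×4 relative transform; the warped image would then be compared to the real target image with a photometric loss.

```python
import torch
import torch.nn.functional as F

def inverse_warp(src_img, tgt_depth, T_tgt_to_src, K):
    """Resample src_img at the pixels where the target pixels project (sketch)."""
    B, _, H, W = src_img.shape
    # Target pixel grid in homogeneous coordinates, shape (B, 3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1)
    # Back-project each pixel to a 3D point using the predicted target depth.
    cam = (torch.inverse(K) @ pix) * tgt_depth.view(B, 1, -1)
    # Move the points into the source frame and project them with the intrinsics.
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)
    src_pix = K @ (T_tgt_to_src @ cam_h)[:, :3]
    src_pix = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)
    # Normalize to [-1, 1] and bilinearly sample the source image.
    grid = torch.stack(
        [2 * src_pix[:, 0] / (W - 1) - 1, 2 * src_pix[:, 1] / (H - 1) - 1], dim=-1
    ).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)
```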

An important problem that has to be taken into account when dealing with depth estimation is ego-motion. Especially for autonomous mobile robots, ego-motion estimation is crucial for obtaining good results in many tasks, depth estimation included. To compensate for ego-motion, existing approaches have proposed a separate pose estimation network, whose task is to estimate the relative camera transformation between subsequent sensor measurements (images).

Unlike these existing approaches, which use a separate pose estimation network alongside the depth estimation network, the novel method reuses the encoder of the depth estimation network as the feature extractor for pose estimation. More precisely, the pose estimation network (shown in the figure below) concatenates features obtained from the shared encoder instead of concatenating the raw frames directly. The authors state that this significantly improves the pose estimation results while reducing the number of parameters that have to be learned, and they attribute the improvement to the abstract encoder features, which carry an important understanding of the geometry of the input images. The depth estimation network proposed in this paper is based on a U-Net architecture (a U-shaped encoder-decoder network with skip connections) and uses ELU activations together with sigmoids. The encoder in the network is a pre-trained ResNet18.
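A toy sketch of this shared-encoder idea might look as follows; the channel sizes and the small pose decoder are illustrative assumptions, not the paper's exact head.

```python
import torch
import torch.nn as nn
import torchvision

class SharedEncoderPoseNet(nn.Module):
    """Pose decoder fed with depth-encoder features instead of raw frames (sketch)."""
    def __init__(self):
        super().__init__()
        # Stand-in for the depth network's ResNet18 encoder (the paper uses a
        # pre-trained one); drop the final pooling and classification layers.
        resnet = torchvision.models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 512, h, w)
        self.pose_decoder = nn.Sequential(
            nn.Conv2d(2 * 512, 256, 1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 6, 1),        # 6-DoF output: rotation + translation
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, frame_a, frame_b):
        # Concatenate encoder features of the two frames, not the frames themselves.
        feats = torch.cat([self.encoder(frame_a), self.encoder(frame_b)], dim=1)
        return self.pose_decoder(feats).flatten(1)  # (B, 6) relative camera motion
```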

The Encoder-Decoder depth estimation network (left). The usual pose estimation network (right). The proposed pose estimation network using the encoder from the depth estimation network (middle)

Besides the novel architecture, several improvements are proposed. First, the authors use a specifically designed loss function that combines an L1 term with SSIM (Structural Similarity Index). Second, they compute the photometric error (the loss) at the higher input resolution by up-sampling the low-resolution depth maps; intuitively, this avoids holes appearing in parts of the image when the error is computed on down-sampled (encoded) depth maps. Finally, they add a smoothness term to the loss function.
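A rough sketch of such a combined objective is shown below; the pooled SSIM approximation, the weight alpha, and the edge-aware form of the smoothness term are common choices in self-supervised depth estimation and are only illustrative of the paper's loss.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM map computed with 3x3 average pooling (illustrative)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).clamp(0, 1)

def photometric_loss(pred, target, alpha=0.85):
    """Per-pixel mix of SSIM and L1 between the warped and the observed image."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    ssim_term = ((1 - ssim(pred, target)) / 2).mean(1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1

def smoothness_loss(disp, image):
    """Edge-aware smoothness: penalize disparity gradients away from image edges."""
    dx = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
    wx = torch.exp(-(image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True))
    wy = torch.exp(-(image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True))
    return (dx * wx).mean() + (dy * wy).mean()

# The low-resolution disparities are up-sampled before the photometric error is
# computed at the input resolution, e.g.:
# disp_full = F.interpolate(disp_low, size=target.shape[-2:], mode="bilinear",
#                           align_corners=False)
```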

Experiments

The implementation is done in PyTorch, and training was conducted on the KITTI dataset with an input image resolution of 128×416. The dataset comprises around 39,000 triplets for training and evaluation, and data augmentation is used extensively in the form of horizontal mirroring and brightness, contrast, saturation, and hue jitter. Some of the results are given in the tables below.
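For illustration, the described augmentations map onto standard torchvision transforms roughly as follows; the parameter values are hypothetical, and in practice the same flip and color jitter would have to be applied consistently across all frames of a triplet.

```python
from torchvision import transforms

# Illustrative augmentation pipeline: horizontal mirroring plus
# brightness/contrast/saturation/hue jitter, resized to 128x416.
augment = transforms.Compose([
    transforms.Resize((128, 416)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
])
```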

Comparison of the proposed method to existing methods on the KITTI 2015 dataset (S: stereo supervision, M: monocular supervision, D: KITTI depth supervision at training time)
Results for different variants of the proposed model trained with monocular supervision on the KITTI 2015 dataset

Conclusion

This novel approach shows promising results in depth estimation from images. As an important task for future autonomous robots, 3D depth estimation is receiving more and more attention, raising the question of whether we actually need (often expensive and complex) 3D sensors. Accurate depth estimation has many applications beyond mobile robots and self-driving cars, in fields ranging from simple image editing to complete image understanding. Last but not least, this approach once again confirms the power of deep learning, especially of encoder-decoder convolutional networks, across a wide range of tasks.