Spherical CNN Kernels for 3D Point Clouds

30 May 2018

From a data structure point of view, point clouds are unordered sets of vectors, which makes them fundamentally different from data types such as images and videos. Still, many sensors, such as Microsoft's Kinect and the LIDAR scanners used in the autonomous driving industry, produce point clouds as output. This kind of data requires specialized processing techniques that can exploit it and extract as much information as possible.

Typically, a convolutional neural network operates on 2D image data, learning directly from pixels. Over time, the deep learning community has found ways to use other kinds of data, such as videos and binary images, with convolutional architectures. Convolutional neural networks are designed to exploit the spatial structure in the data, and they have proved very successful at it: they learn a hierarchy of features directly from pixel data by applying kernel operations in well-defined local regions (called local receptive fields).

Regarding spatial information, convolutional neural networks have achieved greater success with 2D data than with 3D spatial data, which raises a question: why are CNNs worse with 3D data?

Recently, two types of CNNs have been developed for learning over 3D data: volumetric representation-based CNNs and multi-view-based CNNs. Empirical results have shown a considerable gap between the two, and existing volumetric CNN architectures are unable to fully exploit the power of 3D representations. This comes mostly from the computational and storage costs of the network, which grow cubically with the input resolution. In this context, processing point clouds (which represent 3D spatial data) is computationally very costly, and 3D-CNN architectures have been applied only to low-resolution inputs, ranging from 30x30x30 to 256x256x256. Moreover, 3D-CNN kernels are typically applied to volumetric representations of 3D data, which makes learning over raw point clouds even more difficult, and often infeasible.

Researchers from the University of Western Australia proposed a novel way of handling point clouds in CNNs by introducing spherical convolutions. The key idea is to traverse the 3D space with a spherical kernel and to partition the space using an octree data structure.

The proposed spherical kernel, with the sphere uniformly partitioned into bins.

According to the authors, spherical regions are suitable for computing geometrically meaningful features from unstructured 3D data. Their approach takes each point in space (with x, y, z coordinates) and defines a spherical region around it. The sphere is then divided into n x p x q bins by partitioning the space uniformly along the azimuth and elevation dimensions. For each bin, they define a matrix of learnable weights; together, the matrices of all bins form a single spherical convolutional kernel. To compute the activation of a single point in the point cloud, they take the weight matrices of the bins containing the neighbouring points (a point counts as a neighbour if it falls inside the sphere), after expressing each neighbouring point in spherical coordinates relative to the point of interest.
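To make the bin lookup concrete, here is a minimal NumPy sketch of a single spherical-kernel activation. This is an illustration, not the authors' implementation: the function name and signature are hypothetical, the sphere is assumed to be split uniformly along the radial dimension as well, and any separate self-weight for the point itself is omitted.

```python
import numpy as np

def spherical_conv_point(center, neighbours, feats, weights,
                         radius, n_r, n_az, n_el):
    """Activation of one point. neighbours: (M, 3) points inside the sphere,
    feats: (M, C_in) their features, weights: (n_r*n_az*n_el, C_in, C_out)."""
    rel = neighbours - center                       # relative Cartesian coords
    r = np.linalg.norm(rel, axis=1)
    az = np.arctan2(rel[:, 1], rel[:, 0])           # azimuth in [-pi, pi]
    el = np.arcsin(np.clip(rel[:, 2] / np.maximum(r, 1e-9), -1.0, 1.0))
    # quantize the relative spherical coordinates into kernel bins
    r_bin = np.clip((r / radius * n_r).astype(int), 0, n_r - 1)
    az_bin = ((az + np.pi) / (2 * np.pi) * n_az).astype(int) % n_az
    el_bin = np.clip(((el + np.pi / 2) / np.pi * n_el).astype(int), 0, n_el - 1)
    idx = (r_bin * n_az + az_bin) * n_el + el_bin   # flat bin index per neighbour
    # each neighbour's features pass through the weight matrix of its bin
    out = np.zeros(weights.shape[2])
    for f, i in zip(feats, idx):
        out += f @ weights[i]
    return np.maximum(out, 0.0)                     # ReLU non-linearity
```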

Since point clouds are not in a regular format, most researchers transform such data into regular 3D voxel grids or collections of images (e.g., views) before feeding it to a deep network. In contrast, the authors represent the point cloud with an octree structure. As mentioned before, this is less costly in computation and storage than a voxel-grid volumetric representation, and it can handle irregular 3D point clouds (note that most point clouds coming from sensors are irregular, with highly variable point density). They use an octree of depth L, where each depth level represents a partitioning of the 3D space, from coarser to finer (top to bottom). The network is trained so that the kernel is applied in the neighbourhood of each point, and the matrices assigned to the bins are the weights learned during training.
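For intuition, a recursive octree partition of a point cloud can be sketched as below. This is a simplified illustration under my own assumptions (class name, stopping rule), not the paper's code; points lying exactly on a cell boundary may end up in more than one child.

```python
import numpy as np

class OctreeNode:
    """Recursively split a cubic cell into 8 octants until max_depth,
    storing the points that fall into each cell."""
    def __init__(self, points, center, half, depth, max_depth):
        self.center, self.half, self.depth = center, half, depth
        self.children, self.points = [], points
        if depth < max_depth and len(points) > 1:
            for octant in range(8):
                sign = np.array([(octant >> a) & 1 for a in range(3)]) * 2 - 1
                child_center = center + sign * half / 2
                inside = np.all(np.abs(points - child_center) <= half / 2, axis=1)
                if inside.any():
                    self.children.append(OctreeNode(points[inside], child_center,
                                                    half / 2, depth + 1, max_depth))

# usage: an octree of depth L = 4 over points in the unit cube
pts = np.random.rand(1000, 3)
root = OctreeNode(pts, center=np.full(3, 0.5), half=0.5, depth=0, max_depth=4)
```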

The octree partitioning and the resulting network architecture.

Evaluation and Conclusions

This work shows promising results for the classification of objects in (irregular) 3D point clouds. The evaluation was done on standard benchmark datasets against state-of-the-art methods. The architecture outperforms ECC and OctNet but fails to outperform PointNet, the current state-of-the-art architecture, when evaluated on ModelNet10 and ModelNet40. The training experiments also show that data augmentation improves results significantly.

Comparison of the proposed spherical kernel method with existing methods.
Improving accuracy by using more data.

Finally, this approach shows very good results on what seems to be a very difficult task: finding an efficient way to use convolutional neural networks with 3D point clouds. In fact, it shows that the difficulty of learning the point cloud structure can be reduced by keeping and learning a set of points that represent the skeleton of an object. A suitable data representation is needed to capture this, and here it is the octree. The authors show how the point cloud representation evolves as a function of octree depth.

Evolution of the point cloud representation with octrees of different depths (left). A pattern learned by a spherical kernel (right).

This novel approach opens the door to further investigation of non-conventional deep learning techniques (like the spherical kernel) and to efficient processing of irregular 3D point cloud data. It shows that a point cloud can be processed with a convolutional neural network in a scalable manner, as demonstrated by the results on the object recognition task.

Dane Mitriev

How Has MS Voxel Deep Network Managed to Improve 3D Object Recognition Using a Cloud Map Only

12 April 2018

Mobile Laser Scanning (MLS) systems can now scan large areas, like cities or even countries. The produced 3D point clouds can be used as maps for autonomous systems. To do so, automatic classification of the data is necessary, and it remains challenging given the number of object classes present in an urban scene. Xavier Roynard, Jean-Emmanuel Deschaud and François Goulette propose both a training method that balances the number of points per class during each epoch and a 3D CNN capable of effectively learning how to classify scenes containing objects at multiple scales.

MS Voxel Deep Network is a new convolutional neural network (CNN) for classifying 3D point clouds of urban or indoor scenes. On the reduced-8 Semantic3D benchmark, this network ranked second, beating the state of the art among point classification methods (those not using a regularization step).

Network Learning Difficulties

Training on scene point clouds leads to some difficulties. For the point classification task, each point is a sample, so the number of samples per class is very unbalanced (from thousands of points for the class “pedestrian” to tens of millions for the class “ground”). Moreover, with the usual deep learning training scheme, an epoch would mean passing through all points of the cloud, which would take a very long time and be largely redundant: two very close points have the same neighbourhood and will therefore be classified in the same way.

The authors propose a training method that solves both problems: at the beginning of each epoch, N points (for example, 1000) are randomly selected in each class, and training proceeds on these points shuffled randomly across classes.
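A minimal NumPy sketch of this balanced per-epoch sampling might look as follows (the function name and signature are mine, not the paper's):

```python
import numpy as np

def balanced_epoch_indices(labels, n_per_class=1000, seed=None):
    """Pick n_per_class point indices from every class and shuffle them,
    producing the training order for one epoch."""
    rng = np.random.default_rng(seed)
    picks = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        take = min(n_per_class, len(idx))      # small classes give all they have
        picks.append(rng.choice(idx, size=take, replace=False))
    epoch = np.concatenate(picks)
    rng.shuffle(epoch)                         # mix points randomly across classes
    return epoch
```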

Once a point p to classify is chosen, the voxel grid given to the convolutional network is built as an occupancy grid centered on p, whose empty voxels contain 0 and occupied voxels contain 1. Only N x N x N cubic grids where N is even are used, with isotropic space discretization steps ∆.
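The occupancy grid itself is simple to sketch in NumPy; the helper below is illustrative, and the default values (n = 32, delta = 0.1 m) are assumptions for the example rather than values quoted from the paper:

```python
import numpy as np

def occupancy_grid(points, p, n=32, delta=0.1):
    """n x n x n binary occupancy grid centered on point p with
    isotropic voxel size delta: occupied voxels 1, empty voxels 0."""
    ijk = np.floor((points - p) / delta).astype(int) + n // 2
    inside = np.all((ijk >= 0) & (ijk < n), axis=1)  # drop points outside the grid
    grid = np.zeros((n, n, n), dtype=np.float32)
    grid[tuple(ijk[inside].T)] = 1.0
    return grid
```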

Network Training

Some classic data augmentation steps are performed before projecting the 3D point clouds into the voxel grid (a sketch of these augmentations follows the list):

• Flip the x and y axes, each with probability 0.5

• Random rotation around z-axis

• Random scale, between 95% and 105%

• Random occlusions (randomly removing points), up to 5%

• Random artefacts (randomly inserting points), up to 5%

• Random noise in point positions; the noise follows a normal distribution centered at 0 with standard deviation 0.01 m
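Here is a sketch of those augmentation steps in NumPy. The list above specifies the ranges; how artefact points are generated is my assumption (jittered copies of existing points):

```python
import numpy as np

def augment(points, rng=None):
    """Apply the augmentations listed above to an (M, 3) point cloud."""
    rng = rng or np.random.default_rng()
    pts = points.copy()
    for axis in (0, 1):                              # flip x and y, prob 0.5 each
        if rng.random() < 0.5:
            pts[:, axis] = -pts[:, axis]
    theta = rng.uniform(0.0, 2 * np.pi)              # random rotation around z-axis
    c, s = np.cos(theta), np.sin(theta)
    pts[:, :2] = pts[:, :2] @ np.array([[c, s], [-s, c]])
    pts *= rng.uniform(0.95, 1.05)                   # random scale, 95% to 105%
    pts = pts[rng.random(len(pts)) >= rng.uniform(0.0, 0.05)]  # occlusions, up to 5%
    n_art = int(rng.uniform(0.0, 0.05) * len(pts))   # artefacts, up to 5% (assumed:
    if n_art > 0:                                    # jittered copies of real points)
        extra = pts[rng.integers(0, len(pts), n_art)] + rng.normal(0, 0.1, (n_art, 3))
        pts = np.vstack([pts, extra])
    pts += rng.normal(0.0, 0.01, pts.shape)          # positional noise, sigma 0.01 m
    return pts
```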

The cost function is cross-entropy, and the optimizer is Adam with a learning rate of 0.001 and ε = 10⁻⁸, which are the default settings in most deep learning libraries.
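In PyTorch, those settings read as below; the placeholder model is mine for illustration, and only the loss and optimizer hyperparameters come from the paper:

```python
import torch

model = torch.nn.Linear(32 ** 3, 9)   # placeholder; stands in for the real network
criterion = torch.nn.CrossEntropyLoss()                   # cross-entropy cost
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, eps=1e-8)
```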

Architecture — Layers

3D Essential Layers

  • Conv(n, k, s, p) a convolutional layer that transforms feature maps from the previous layer into n new feature maps, with a kernel of size k × k × k and stride s and pads p on each side of the grid.
  • DeConv(n, k, s, p) a transposed convolutional layer that transforms feature maps from the previous layer into n new feature maps, with a kernel of size k × k × k and stride s and pads p on each side of the grid.
  • FC(n) a fully-connected layer that transforms the feature maps from the previous layer into n feature maps.
  • MaxPool(k) a layer that downsamples each feature map by taking the maximum over every group of k × k × k neighbouring voxels (8 voxels when k = 2).
  • MaxUnPool(k) a layer that computes an inverse of MaxPool(k).
  • ReLU, LeakyReLU and PReLU, common non-linearities used after linear layers such as Conv and FC. ReLU(x) returns the positive part of x; to avoid a null gradient when x is negative, a slight slope can be added, either fixed (LeakyReLU) or learned (PReLU).
  • SoftMax a non-linearity layer that rescales a tensor to the range [0, 1] so that its elements sum to 1.
  • BatchNorm a layer that normalizes samples over a batch.
  • DropOut(p) a layer that randomly zeroes some of the elements of the input tensor with probability p.

The chosen network architecture is inspired by one that works well in 2D.

Multi-Scale Voxel Network architecture: MS3_DeepVoxScene (all tensors are represented as 2D instead of 3D for simplicity).
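Using the layer vocabulary above, a single-scale 3D branch of such a network can be sketched in PyTorch. The channel counts and layer sizes below are illustrative assumptions, not the exact MS3_DeepVoxScene configuration:

```python
import torch
import torch.nn as nn

class VoxelBranch(nn.Module):
    """One single-scale 3D branch assembled from the layers defined above;
    sizes are illustrative, not the paper's exact configuration."""
    def __init__(self, n_classes=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, stride=1, padding=1),  # Conv(32, 3, 1, 1)
            nn.PReLU(), nn.BatchNorm3d(32),
            nn.MaxPool3d(2),                                       # MaxPool(2)
            nn.Conv3d(32, 64, kernel_size=3, stride=1, padding=1), # Conv(64, 3, 1, 1)
            nn.PReLU(), nn.BatchNorm3d(64),
            nn.MaxPool3d(2),                                       # MaxPool(2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 8 ** 3, 128), nn.PReLU(), # FC(128)
            nn.Dropout(0.5), nn.Linear(128, n_classes),            # DropOut, FC(n)
        )                                   # SoftMax is applied inside the loss

    def forward(self, x):                   # x: (batch, 1, 32, 32, 32) occupancy grid
        return self.classifier(self.features(x))
```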

Datasets

The authors compare three different datasets. Paris-Lille-3D contains 50 classes, but only 9 coarser classes are kept for the experiments. The number of points after subsampling at 2 cm is indicated in brackets in the table below.

Among 3D point cloud scene datasets, these are the ones with the largest covered area and the most variability.

Number of points in each dataset.

The covered area is obtained by projecting each cloud onto a horizontal plane into pixels of size 10 cm × 10 cm, then summing the area of all occupied pixels.
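This area computation is easy to reproduce; a short NumPy sketch (function name mine):

```python
import numpy as np

def covered_area(points, pixel=0.1):
    """Project points onto the horizontal plane, rasterize into
    pixel x pixel cells, and sum the area of occupied cells (in m^2)."""
    cells = np.unique(np.floor(points[:, :2] / pixel).astype(int), axis=0)
    return len(cells) * pixel ** 2
```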

A finer resolution of 5 cm was added to better capture the local surface near the point, and a coarser resolution of 15 cm to better understand the context of the object to which the point belongs. This method achieves better results than all methods that classify a cloud point by point (i.e., without regularization). Even better results could probably be achieved by adding, for example, a CRF after classification.

Example of a classified point cloud on the S3DIS dataset:

Left: classified with MS3_DVS. Right: ground truth (blue: ground, cerulean blue: buildings, dark green: poles, green: bollards, light green: trash cans, yellow: barriers, dark yellow: pedestrians, orange: cars, red: natural).

The results are very close to the ground truth. This is achieved both by focusing on the local shape of the object around a point and by taking into account the context of the object.

Quote:

We observe a confusion between the classes wall and board (and, more slightly, with beam, column, window and door); this is explained mainly by the fact that these classes are geometrically very similar and we do not use color. To improve these results, we should not sub-sample the clouds, in order to keep the fine geometric information (such as a table slightly protruding from the wall), and we should add a 2 cm scale as input to the network, but looking for neighborhoods would then take an unacceptable amount of time.

For a comparison with the state-of-the-art methods on the S3DIS 5th fold, see the table below:

Table 3: comparison with state-of-the-art methods on the S3DIS 5th fold.

To evaluate the architecture choices, the same classification task was also run with VoxNet, one of the first 3D convolutional networks.

Comparison to VoxNet.

Comparison with the state-of-the-art methods on the reduced-8 Semantic3D benchmark:

Per-class IoU on the reduced-8 Semantic3D benchmark.

The per-class comparison between MS1_DeepVoxScene and MS3_DeepVoxScene on the Paris-Lille-3D dataset shows that using multi-scale networks improves the results on some classes: the buildings, barriers and pedestrians classes improve greatly (especially in recall), while the car class loses a lot of precision.

Comparison per class between MS1_DeepVoxScene and MS3_DeepVoxScene on the Paris-Lille-3D dataset.

Conclusion

The proposed training method balances the number of points per class seen during each epoch, and the proposed multi-scale CNN, MS3_DVS, is capable of learning to classify point cloud scenes. You can follow its standing on the Semantic3D benchmark, where it currently ranks second overall, a very good result. This is achieved both by focusing on the local shape of the object around a point and by taking into account the context of the object.