New Datasets for 3D Object Recognition

6 November 2018

Robotics, augmented reality, autonomous driving – all these scenarios rely on recognizing 3D properties of objects from 2D images. This puts 3D object recognition as one of the central problems in computer vision.

Remarkable progress has been achieved in this field after the introduction of several databases that provide 3D annotations to 2D objects (e.g., IKEA, Pascal3D+). However, these datasets are limited in scale and include only about a dozen object categories.

This is not even close to the scale of image datasets such as ImageNet or Microsoft COCO, the huge datasets behind much of the recent progress in image classification. Consequently, large-scale datasets with 3D annotations are likely to significantly benefit 3D object recognition.

In this article, we present one large-scale dataset, ObjectNet3D, along with several specialized datasets for 3D object recognition: MVTec ITODD and T-LESS for industrial settings, and the Falling Things dataset for object recognition in the context of robotics.

ObjectNet3D

Number of images: 90,127

Number of objects: 201,888

Number of categories: 100

Number of 3D shapes: 44,147

Year: 2016

An example image from ObjectNet3D with 2D objects aligned with 3D shapes

ObjectNet3D is a large-scale database where objects in the images are aligned with 3D shapes; the alignment provides both an accurate 3D pose annotation and the closest 3D shape annotation for each 2D object. The scale of this dataset allows for significant progress on such computer vision tasks as recognizing the 3D pose and 3D shape of objects from 2D images.

Examples of 3D shape retrieval. Green boxes indicate the selected shape. Bottom row illustrates two cases where a similar shape was not found among the top 5 shapes

To construct this database, researchers from Stanford University drew on images from existing image repositories and proposed an approach to align 3D shapes (taken from existing 3D shape repositories) with the objects in these images.

In their work, the researchers consider only rigid object categories, for which they can collect a large number of 3D shapes from the web. Here is the full list of categories:

Object categories in ObjectNet3D

2D images were collected from the ImageNet dataset and, for categories not sufficiently covered by ImageNet, through Google Image Search. 3D shapes were acquired from the Trimble 3D Warehouse and the ShapeNet repository. Objects in the images were then aligned with the 3D shapes using a camera model, which is described in detail in the corresponding paper. Finally, 3D annotations were provided for the objects in the 2D images.
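
As an illustration of what such an alignment involves, here is a minimal sketch of projecting a 3D shape into an image with a simple pinhole camera parameterized by viewpoint. The exact camera model used for ObjectNet3D is described in the paper; the parameterization, function names, and values below are our own illustrative assumptions.

```python
# Hedged sketch: project 3D shape vertices into the image plane with a camera
# parameterized by azimuth, elevation, in-plane rotation, and distance.
# This only approximates the kind of model described in the ObjectNet3D paper.
import numpy as np

def rotation_from_viewpoint(azimuth, elevation, theta):
    """Build a world-to-camera rotation from viewpoint angles (radians)."""
    Rz = np.array([[np.cos(azimuth), -np.sin(azimuth), 0],
                   [np.sin(azimuth),  np.cos(azimuth), 0],
                   [0, 0, 1]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(elevation), -np.sin(elevation)],
                   [0, np.sin(elevation),  np.cos(elevation)]])
    Rtheta = np.array([[np.cos(theta), -np.sin(theta), 0],
                       [np.sin(theta),  np.cos(theta), 0],
                       [0, 0, 1]])
    return Rtheta @ Rx @ Rz

def project(points, azimuth, elevation, theta, distance, focal=1.0):
    """Project an Nx3 array of shape vertices to 2D image coordinates."""
    R = rotation_from_viewpoint(azimuth, elevation, theta)
    t = np.array([0.0, 0.0, distance])       # camera placed 'distance' away from the object
    cam = points @ R.T + t                   # world -> camera coordinates
    return focal * cam[:, :2] / cam[:, 2:3]  # perspective division

# Example: project the 8 corners of a unit cube from a given viewpoint.
cube = np.array([[x, y, z] for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (-0.5, 0.5)])
print(project(cube, azimuth=np.pi / 6, elevation=np.pi / 12, theta=0.0, distance=3.0))
```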

The resulting dataset can be used for object proposal generation, 2D object detection, joint 2D detection and 3D object pose estimation, and image-based 3D shape retrieval.

MVTec ITODD

Number of scenes: 800

Number of objects: 28

Number of 3D transformations: 3500

Year: 2017

Example scene of the dataset from all sensors. Top row: grayscale cameras. Bottom row: Z and grayscale image of the High-Quality (left) and Low-Quality (right) 3D sensor

MVTec ITODD is a dataset for 3D object detection and pose estimation with a strong focus on industrial settings and applications. It contains 28 objects arranged in over 800 scenes and labeled with their rigid 3D transformations as ground truth. The scenes are observed by two industrial 3D sensors and three grayscale cameras, allowing the evaluation of methods that work on 3D, image, or combined modalities. The dataset’s creators at MVTec Software GmbH chose grayscale cameras because they are much more common in industrial setups.

As mentioned in the dataset description, the objects were selected to cover a range of different values with respect to surface reflectance, symmetry, complexity, flatness, detail, compactness, and size. Here are images of all 28 objects included in MVTec ITODD along with their names:

Images of 28 objects used in the dataset

For each object, scenes with only a single instance and scenes with multiple instances (e.g., to simulate bin picking) are available. Each scene was acquired once with each of the 3D sensors, and twice with each of the grayscale cameras: once with and once without a random projected pattern.

Finally, for all objects, manually created CAD models are available for training the detection methods. The ground truth was labeled using a semi-manual approach based on the 3D data of the high-quality 3D sensor.

This dataset provides a great benchmark for the detection and pose estimation of 3D objects in industrial scenarios.

T-LESS

Number of images: 39K training + 10K test images from each of three sensors

Number of objects: 30

Year: 2017

Examples of T-LESS test images (left) overlaid with colored 3D object models at the ground-truth 6D poses (right). Instances of the same object have the same color

T-LESS is a new public dataset for estimating the 6D pose, i.e. translation and rotation, of texture-less rigid objects. This dataset includes 30 industry-relevant objects with no significant texture and no discriminative color or reflectance properties. Another unique property of this dataset is that some of the objects are parts of others.

The researchers behind T-LESS took different approaches to the training and test images: training images depict individual objects against a black background, while test images originate from twenty scenes with varying degrees of complexity. Here are examples of the training and test images:

Top: training images and 3D models of 30 objects. Bottom: test images of 20 scenes overlaid with colored 3D object models at the ground truth poses

All training and test images were captured with three synchronized sensors: a structured-light RGB-D sensor, a time-of-flight RGB-D sensor, and a high-resolution RGB camera.

Finally, two types of 3D models are provided for each object: 1) a manually created CAD model and 2) a semi-automatically reconstructed one.

This dataset can be very useful for evaluating approaches to 6D object pose estimation, 2D object detection and segmentation, and 3D object reconstruction. Considering the availability of images from three sensors, it is also possible to study the importance of different input modalities for a given problem.

Falling Things

Number of images: 61,500

Number of objects: 21 household objects

Year: 2018

A sample from FAT dataset

The Falling Things (FAT) dataset is a synthetic dataset for 3D object detection and pose estimation, created by an NVIDIA team. It was generated by placing 3D household object models (e.g., a mustard bottle, a soup can, a gelatin box) in virtual environments.

Each snapshot in this dataset consists of per-pixel class segmentation, 2D/3D bounding box coordinates for all objects, mono and stereo RGB images, dense depth images, and of course, 3D poses. Most of these elements are illustrated in the above image.
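
For illustration, here is a hedged sketch of how one might load such a snapshot. The file names and JSON structure below are assumptions made for the sake of the example, not the dataset's documented layout; consult the official FAT documentation for the actual format.

```python
# Hedged sketch of loading one FAT-style snapshot. File names and annotation
# fields are hypothetical; check the dataset's documentation for the real layout.
import json
import numpy as np
from PIL import Image

def load_snapshot(prefix):
    """Load stereo RGB, depth, segmentation, and annotations for one frame."""
    left_rgb  = np.array(Image.open(f"{prefix}.left.jpg"))
    right_rgb = np.array(Image.open(f"{prefix}.right.jpg"))
    depth     = np.array(Image.open(f"{prefix}.left.depth.png"))  # dense depth image
    seg       = np.array(Image.open(f"{prefix}.left.seg.png"))    # per-pixel class ids
    with open(f"{prefix}.left.json") as f:
        annotations = json.load(f)  # hypothetical: 2D/3D boxes and 6D poses per object
    return left_rgb, right_rgb, depth, seg, annotations
```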

The FAT dataset includes a variety of object poses, backgrounds, compositions, and lighting conditions. See some examples below:

Sample images from the FAT dataset

For more details on the process of building the FAT dataset, check our article dedicated entirely to this dataset.

The Falling Things dataset provides a great opportunity to accelerate research in object detection and pose estimation, as well as in segmentation, depth estimation, and studies of different sensor modalities.

Bottom Line

3D object recognition has multiple important applications, but progress in this field is limited by the available datasets. Fortunately, several new 3D object recognition datasets have been introduced in recent years. While they differ in scale, focus, and characteristics, each of these datasets makes a significant contribution to the improvement of current 3D object recognition systems.

“Falling Things”: A Synthetic Dataset by Nvidia for Pose Estimation

26 April 2018

Deep learning has made a lot of progress in image recognition and the visual arts. The deep learning approach is used to solve problems such as image recognition, image captioning, and image segmentation. One of the main obstacles is that you need a lot of data to train a model, and such datasets are not easily available. Many research teams are therefore working to create different kinds of datasets.

Robotic manipulation is one such problem: it requires the detection and pose estimation of multiple object categories, which poses a two-fold challenge for designing robotic perception algorithms. The first difficulty is training a model with a small amount of training data. The second is acquiring ground-truth data, which is time-consuming, error-prone, and potentially expensive. Existing techniques for obtaining real-world data do not scale and are therefore not capable of generating the large datasets needed for training deep neural networks.

“Falling Things” Dataset

These problems can be addressed with synthetically generated data. Synthetic data is any data applicable to a given situation that is not obtained by direct measurement. Researchers have been using synthetic data as an efficient means of both training and validating deep neural networks on tasks where ground truth is very hard to obtain, such as segmentation. 3D object detection and pose estimation fall into this category: acquiring ground truth for them is complicated, which makes them a good fit for synthetic data.

The Falling Things (FAT) dataset consists of more than 61,000 images for training and validating robotic scene-understanding algorithms in a household environment. Previously, only two datasets provided accurate ground-truth poses of multiple objects: T-LESS and YCB-Video. The problem with these datasets is that they do not contain extreme lighting conditions or multiple modalities. FAT incorporates the capabilities that were missing from the other two datasets.

Figure 1: The Falling Things (FAT) dataset was generated by placing 3D household object models in virtual environments. Pixel-wise segmentation of the objects (bottom left), depth (bottom center), and 2D/3D bounding box coordinates (bottom right)

Unreal Engine

The FAT dataset was generated using Unreal Engine 4 (UE4). The data comes from three virtual environments within UE4: a kitchen, a sun temple, and a forest. These environments were chosen for their high-fidelity modeling and quality, as well as for their variety of indoor and outdoor scenes. For each environment, five manually selected locations cover a range of terrain and lighting conditions (e.g., on a kitchen counter or tile floor, next to a rock, above a grassy field, and so forth). This yields 15 different locations with a variety of 3D backgrounds, lighting conditions, and shadows.

The dataset uses 21 household objects from the YCB object set. The objects were placed at random positions and orientations within a vertical cylinder of radius 5 cm and height 10 cm centered at a fixation point. As the objects fell, the virtual camera system was rapidly teleported to random azimuths, elevations, and distances with respect to the fixation point to collect data. Azimuth ranged from -120° to +120° (to avoid collision with the wall, when present), elevation from 5° to 85°, and distance from 0.5 m to 1.5 m. The virtual camera used for data generation consists of a stereo pair of RGBD cameras. This design decision allows the dataset to support at least three different sensor modalities: whereas single RGBD sensors are commonly used in robotics, stereo sensors have the potential to yield higher-quality output with fewer distortions, and a monocular RGB camera has distinct advantages in cost, simplicity, and availability.
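
The viewpoint sampling described above can be sketched as follows. The actual UE4 capture code is not reproduced here, so the helper below is only an illustrative assumption that mirrors the stated ranges.

```python
# Hedged sketch of random camera placement around a fixation point:
# azimuth in [-120°, 120°], elevation in [5°, 85°], distance in [0.5 m, 1.5 m].
import numpy as np

rng = np.random.default_rng(0)

def sample_camera_position(fixation_point):
    azimuth   = np.radians(rng.uniform(-120.0, 120.0))
    elevation = np.radians(rng.uniform(5.0, 85.0))
    distance  = rng.uniform(0.5, 1.5)
    offset = distance * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    return fixation_point + offset  # the camera then looks back toward the fixation point

print(sample_camera_position(np.array([0.0, 0.0, 1.0])))
```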

The dataset consists of 61,500 unique images at a resolution of 960 × 540 and is divided into two parts:

1. Single objects: The first part of the dataset was generated by dropping each object model in isolation ∼5 times at each of the 15 locations.

2. Mixed objects: The second part of the dataset was generated in the same manner, except that a random number of objects, sampled uniformly from 2 to 10, was dropped. To allow multiple instances of the same category in an image, objects were sampled with replacement.

To split the dataset for training and testing, one location per scene is held out as the test set, and the remaining data is used for training. Figure 2 shows the total number of occurrences of each object class in the FAT dataset.

Figure 2: Total appearance count of the 21 YCB objects in the FAT dataset. Light bars indicate object visibility higher than 25%, while solid bars indicate visibility higher than 75%.

Bottom Line

This new dataset will help accelerate research in object detection, pose estimation, segmentation, and depth estimation. The proposed dataset focuses on household items from the YCB object set.

Figure 3: Datasets for object detection and pose estimation. The FAT dataset provides all of the listed capabilities.

This dataset helps researchers to find solutions for open problems like object detection, pose estimation, depth estimation from monocular and/or stereo cameras, and depth-based segmentation, to advance the field of robotics.

Figure 4: Some examples from the dataset

Note: The dataset will be publicly available no later than June 2018.

Muneeb ul Hassan

How Has the MS Voxel Deep Network Managed to Improve 3D Object Recognition Using a Cloud Map Only

12 April 2018

Mobile Laser Scanning (MLS) systems can now scan large areas, like cities or even countries. The resulting 3D point clouds can be used as maps for autonomous systems. To do so, automatic classification of the data is necessary, and it remains challenging given the number of object classes present in an urban scene. Xavier Roynard, Jean-Emmanuel Deschaud, and François Goulette propose both a training method that balances the number of points per class during each epoch and a 3D CNN capable of effectively learning how to classify scenes containing objects at multiple scales.

MS Voxel Deep Network is a new convolutional neural network (CNN) for classifying 3D point clouds of urban or indoor scenes. On the reduced-8 Semantic3D benchmark, this network ranked second overall and beat the state of the art among point-classification methods (those not using a regularization step).

Network Learning Difficulties

Training on scene point clouds leads to some difficulties. For the point classification task, each point is a sample, so the number of samples per class is very unbalanced (from thousands of points for the class “pedestrian” to tens of millions for the class “ground”). Also, with the usual deep-learning training procedure, an epoch would mean passing through all points of the cloud, which would take a lot of time and add little: two very close points have the same neighbourhood and will therefore be classified in the same way.

The authors propose a training method that solves these two problems: randomly select N (for example, 1000) points in each class, train on these points shuffled randomly across classes, and repeat this selection at the beginning of each epoch.
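
A minimal sketch of this class-balanced sampling, assuming a NumPy array of per-point class labels (the function and argument names are ours):

```python
# Hedged sketch: draw N points per class at the start of each epoch and shuffle them.
import numpy as np

def balanced_epoch_indices(point_labels, n_per_class=1000, rng=None):
    """point_labels: array of per-point class ids; returns shuffled point indices."""
    rng = rng or np.random.default_rng()
    chosen = []
    for c in np.unique(point_labels):
        idx = np.flatnonzero(point_labels == c)
        # sample with replacement only if the class has fewer than n_per_class points
        chosen.append(rng.choice(idx, size=n_per_class, replace=len(idx) < n_per_class))
    epoch = np.concatenate(chosen)
    rng.shuffle(epoch)
    return epoch
```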

Once a point p to classify is chosen, the voxel grid fed to the convolutional network is built as an occupancy grid centered on p, whose empty voxels contain 0 and occupied voxels contain 1. Only N × N × N cubic grids with even N are used, with an isotropic space discretization step ∆.
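
A minimal sketch of building such an occupancy grid, assuming a NumPy point cloud and illustrative values for N and ∆:

```python
# Hedged sketch: N x N x N binary occupancy grid centered on a query point p,
# with an isotropic voxel size delta (the default values here are illustrative).
import numpy as np

def occupancy_grid(cloud, p, n=32, delta=0.1):
    """cloud: Mx3 array of points; p: 3-vector; returns an n^3 grid of 0/1 voxels."""
    grid = np.zeros((n, n, n), dtype=np.float32)
    # voxel index of every point relative to the grid corner
    idx = np.floor((cloud - p) / delta + n / 2).astype(int)
    inside = np.all((idx >= 0) & (idx < n), axis=1)
    grid[tuple(idx[inside].T)] = 1.0
    return grid
```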

Network Training

Some classic data augmentation steps are performed before projecting the 3D point clouds into the voxel grid (a sketch follows the list):

• Flip the x and y axes, each with probability 0.5

• Random rotation around z-axis

• Random scale, between 95% and 105%

• Random occlusions (randomly removing points), up to 5%

• Random artefacts (randomly inserting points), up to 5%

• Random noise in the position of points; the noise follows a normal distribution centered at 0 with standard deviation 0.01 m
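
Here is the sketch of these augmentation steps, applied to an M × 3 point cloud before voxelization. The parameter values follow the list above; the function itself is our own illustration.

```python
# Hedged sketch of the augmentation steps listed above.
import numpy as np

def augment(cloud, rng):
    out = cloud.copy()
    # flip the x and y axes, each with probability 0.5
    for axis in (0, 1):
        if rng.random() < 0.5:
            out[:, axis] = -out[:, axis]
    # random rotation around the z-axis
    a = rng.uniform(0, 2 * np.pi)
    Rz = np.array([[np.cos(a), -np.sin(a), 0],
                   [np.sin(a),  np.cos(a), 0],
                   [0, 0, 1]])
    out = out @ Rz.T
    # random scale between 95% and 105%
    out *= rng.uniform(0.95, 1.05)
    # random occlusions: drop up to 5% of the points
    out = out[rng.random(len(out)) >= rng.uniform(0, 0.05)]
    # random artefacts: insert up to 5% spurious points within the bounding box
    n_art = int(rng.uniform(0, 0.05) * len(out))
    out = np.concatenate([out, rng.uniform(out.min(0), out.max(0), size=(n_art, 3))])
    # Gaussian positional noise with standard deviation 0.01 m
    return out + rng.normal(0.0, 0.01, size=out.shape)

# Example: augmented = augment(np.random.rand(1000, 3), np.random.default_rng(0))
```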

The cost function is cross-entropy, and the optimizer is Adam with a learning rate of 0.001 and ε = 10⁻⁸, which are the default settings in most deep-learning libraries.

Architecture — Layers

3D Essential Layers

  • Conv(n, k, s, p) a convolutional layer that transforms feature maps from the previous layer into n new feature maps, with a kernel of size k × k × k and stride s and pads p on each side of the grid.
  • DeConv(n, k, s, p) a transposed convolutional layer that transforms feature maps from the previous layer into n new feature maps, with a kernel of size k × k × k and stride s and pads p on each side of the grid.
  • FC(n) a fully-connected layer that transforms the feature maps from the previous layer into n feature maps.
  • MaxPool(k) a layer that aggregates every group of k × k × k (e.g., 8 for k = 2) neighbouring voxels on each feature map.
  • MaxUnPool(k) a layer that computes an inverse of MaxPool(k).
  • ReLU, LeakyReLU and PReLU common non-linearities used after linear layers such as Conv and FC. ReLU(x) returns the positive part of x; to avoid a null gradient when x is negative, a slight slope can be added, either fixed (LeakyReLU) or learned (PReLU).
  • SoftMax a non-linearity layer that rescales a tensor to the range [0, 1] so that its values sum to 1.
  • BatchNorm a layer that normalizes samples over a batch.
  • DropOut(p) a layer that randomly zeroes some of the elements of the input tensor with probability p.
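
These building blocks map directly onto standard deep-learning layers. Below is a hedged PyTorch sketch of a small voxel-classification network assembled from them, together with the cross-entropy loss and Adam settings mentioned above; the channel counts and kernel sizes are illustrative, not the exact MS3_DeepVoxScene configuration.

```python
# Hedged sketch: a small 3D voxel classifier built from the layers listed above.
import torch
import torch.nn as nn

class SmallVoxelNet(nn.Module):
    def __init__(self, n_classes=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, stride=1, padding=1),  # Conv(32, 3, 1, 1)
            nn.BatchNorm3d(32),                                    # BatchNorm
            nn.PReLU(),                                            # PReLU
            nn.MaxPool3d(2),                                       # MaxPool(2)
            nn.Conv3d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm3d(64),
            nn.PReLU(),
            nn.MaxPool3d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8 * 8, 128),  # FC(128), assuming a 32^3 input grid
            nn.ReLU(),                       # ReLU
            nn.Dropout(p=0.5),               # DropOut(0.5)
            nn.Linear(128, n_classes),       # FC(n_classes); SoftMax is applied at inference,
        )                                    # while CrossEntropyLoss handles it during training

    def forward(self, x):  # x: (batch, 1, 32, 32, 32) occupancy grids
        return self.classifier(self.features(x))

model = SmallVoxelNet()
criterion = nn.CrossEntropyLoss()  # cross-entropy cost function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, eps=1e-8)
print(model(torch.zeros(2, 1, 32, 32, 32)).shape)  # torch.Size([2, 9])
```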

The chosen network architecture is inspired by one that works well in 2D.

Multi-Scale Voxel Network architecture: MS3_DeepVoxScene (all tensors are represented as 2D tensors instead of 3D for simplicity)

Datasets

The authors compared three different datasets. Paris-Lille-3D contains 50 classes, but for their experiments they keep only 9 coarser classes. The number of points after subsampling at 2 cm is indicated in brackets.

Among 3D point cloud scene datasets, these are the ones with the largest covered area and the most variability.

Table: number of points in each dataset

The covered area is obtained by projecting each cloud on a horizontal plane in pixels of size 10cm × 10cm, then summing the area of all occupied pixels.
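
A minimal sketch of this covered-area computation, assuming a NumPy point cloud with coordinates in meters:

```python
# Hedged sketch: project points onto the horizontal plane, bin them into
# 10 cm x 10 cm pixels, and sum the area of the occupied pixels.
import numpy as np

def covered_area(cloud, pixel=0.1):
    """cloud: Mx3 array in meters; returns the covered area in square meters."""
    ij = np.floor(cloud[:, :2] / pixel).astype(int)  # 2D pixel index for every point
    return len(np.unique(ij, axis=0)) * pixel ** 2

# Dense synthetic cloud over a 10 m x 10 m footprint -> area close to 100 m^2.
print(covered_area(np.random.rand(100000, 3) * [10.0, 10.0, 5.0]))
```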

A finer resolution of 5 cm was added to better capture the local surface near the point, and a coarser resolution of 15 cm to better understand the context of the object to which the point belongs. This method achieves better results than all methods that classify the cloud point by point (i.e., without regularization). Even better results could probably be achieved by adding, for example, a CRF after classification.

Example of a classified point cloud on the Paris-Lille-3D dataset:

Classified with MS3_DVS (left) and ground truth (right); blue: ground, cerulean blue: buildings, dark green: poles, green: bollards, light green: trash cans, yellow: barriers, dark yellow: pedestrians, orange: cars, red: natural.

The results are very close to the ground truth. This is achieved both by focusing on the local shape of the object around a point and by taking into account the context of the object.

Quote:

We observe a confusion between the classes wall and board (and more slightly with beam, column, window and door), this is explained mainly because these classes are very similar geometrically and we do not use color. To improve these results, we should not sub-sample the clouds to keep the geometric information thin (such as the table slightly protruding from the wall) and add a 2 cm scale in input to the network, but looking for neighborhoods would then take an unacceptable amount of time.

For a comparison with the state-of-the-art methods on the S3DIS 5th fold, see the table below:

Table 3: Comparison with state-of-the-art methods on the S3DIS 5th fold

To evaluate the architecture choices, the same classification task was also run with one of the first 3D convolutional networks, VoxNet.

Comparison to VoxNet

Comparison with the state-of-the-art methods on reduced-8 Semantic3D benchmark:

Table: per-class IoU on the reduced-8 Semantic3D benchmark

The comparison per class between MS1_DeepVoxScene and MS3_DeepVoxScene on the Paris-Lille-3D dataset (below) shows that the use of multi-scale networks improves the results on some classes; in particular, the buildings, barriers, and pedestrians classes are greatly improved (especially in recall), while the car class loses a lot of precision.

Comparison per class between MS1_DeepVoxScene and MS3_DeepVoxScene on the Paris-Lille-3D dataset

Conclusion

The proposed training method, used for MS3_DVS, balances the number of points per class seen during each epoch, and the accompanying multi-scale CNN is capable of learning to classify point cloud scenes. You can follow its standing on the Semantic3D benchmark, where it currently ranks second overall, a very good result. This is achieved both by focusing on the local shape of the object around a point and by taking into account the context of the object.