R-CNN – Neural Network for Object Detection and Semantic Segmentation

29 November 2018

Computer vision is an interdisciplinary field that has been gaining huge amounts of traction in recent years (since the rise of CNNs), and self-driving cars have taken center stage. One of the most important parts of computer vision is object detection. Object detection helps solve problems in pose estimation, vehicle detection, surveillance, and more.

Object detection

The difference between object detection algorithms and classification algorithms is that detection algorithms try to draw a bounding box around each object of interest to locate it within the image. With object detection, it is possible to draw many bounding boxes, which may represent different objects or several instances of the same object.

Object detection algorithm

The main problem with a standard convolutional network followed by a fully connected layer is that the required size of the output is not constant: the number of objects appearing in an image is not fixed. A very simple approach to this problem is to take different regions of interest from the image and use a CNN to classify the presence of an object within each region.

DataSet

ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using a crowd-sourcing tool like Amazon’s Mechanical Turk. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1,000 images in each of 1,000 categories.

In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images. ImageNet consists of variable-resolution images, so the images have been down-sampled to a fixed resolution of 256×256: given a rectangular image, it is rescaled and the central 256×256 patch is cropped out.

The PASCAL VOC project provides standardized image datasets for object class recognition, along with a standard set of tools for accessing the datasets and annotations; it enables evaluation and comparison of different methods and has run challenges evaluating performance on object class recognition.

The Architecture

r-cnn architecture

The goal of R-CNN is to take in an image and correctly identify, via bounding boxes, where the primary objects in the picture are.

  • Inputs: an image;
  • Outputs: bounding boxes and labels for every object in the image.

The R-CNN detection system consists of three modules. The first generates category-independent region proposals, which define the set of candidate detections for an image. The second module is a deep convolutional neural network that extracts a feature vector from each region. The third module is a set of class-specific classifiers, i.e. linear SVMs.

R-CNN does what we might intuitively do as well – propose a bunch of boxes in the image and see if any of them correspond to an object. R-CNN creates these bounding boxes, or region proposals, using a process called Selective Search. At a high level, Selective Search (shown in Fig:1 below) looks at the image through windows of different sizes, and for each size tries to group adjacent pixels by texture, color, or intensity to identify objects.

Fig: 1 Selective Search looks through windows of multiple scales and looks for adjacent pixels that share textures, colors, or intensities.

Once the proposals are created, R-CNN warps each region to a standard square size and passes it through a modified version of AlexNet. On the final layer of the CNN, R-CNN adds a Support Vector Machine (SVM) that classifies whether the region contains an object and, if so, which object. This is step 4 in the image above.

Improving the Bounding Boxes

After finding the object in the box, we can tighten the box to fit the object's true dimensions. This is the final step of R-CNN: a simple linear regression is run on the region proposal to generate tighter bounding box coordinates as the final result. The inputs and outputs of this regression model are:

  • Inputs: sub-regions of the image corresponding to objects.
  • Outputs: New bounding box coordinates for the object in the sub-region.
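To make the regression concrete, here is a minimal sketch of the box parameterization commonly used by R-CNN-style regressors; the box format (center x, center y, width, height) and the example values are illustrative, not the paper's exact training code:

```python
import numpy as np

def bbox_regression_targets(proposal, ground_truth):
    """Regression targets (tx, ty, tw, th) for one proposal/ground-truth pair.

    Both boxes are given as (center_x, center_y, width, height).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    tx = (gx - px) / pw          # shift of the center, normalized by proposal size
    ty = (gy - py) / ph
    tw = np.log(gw / pw)         # log-space scaling of width/height
    th = np.log(gh / ph)
    return np.array([tx, ty, tw, th])

def apply_bbox_regression(proposal, deltas):
    """Refine a proposal with predicted deltas (inverse of the targets above)."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return np.array([px + pw * dx, py + ph * dy, pw * np.exp(dw), ph * np.exp(dh)])

# Example: a proposal slightly off the ground truth.
proposal = (50.0, 50.0, 100.0, 80.0)
gt = (55.0, 48.0, 110.0, 90.0)
t = bbox_regression_targets(proposal, gt)
print(t, apply_bbox_regression(proposal, t))   # recovers the ground-truth box
```

The targets are scale-invariant translations of the box center and log-space scalings of its width and height, which keeps the regression well behaved across proposals of different sizes.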

So, to summarize, R-CNN is just the following steps:

  • Generate a set of region proposals for bounding boxes.
  • Run the images in the bounding boxes through a pre-trained AlexNet and finally an SVM to see what object the image in the box is.
  • Run the box through a linear regression model to output tighter coordinates for the box once the object has been classified.
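The three steps can be sketched in a few lines of Python. This is a rough illustration, not the authors' implementation: it assumes opencv-contrib-python for Selective Search and a torchvision AlexNet as the feature extractor, and it stands in a random weight vector for the per-class linear SVMs that would normally be trained on the extracted features; the image path is hypothetical.

```python
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

image = cv2.imread("street.jpg")                       # hypothetical input image

# 1. Region proposals via Selective Search.
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()
boxes = ss.process()[:2000]                            # ~2000 proposals per image

# 2. Warp each proposal to a fixed size and extract CNN features.
alexnet = models.alexnet(weights="IMAGENET1K_V1").eval()
preprocess = T.Compose([T.ToTensor(),
                        T.Resize((224, 224)),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])
features = []
with torch.no_grad():
    for (x, y, w, h) in boxes:
        crop = image[y:y + h, x:x + w, ::-1].copy()    # BGR -> RGB
        feat = alexnet.features(preprocess(crop).unsqueeze(0)).flatten(1)
        features.append(feat.squeeze(0).numpy())

# 3. Score each region with per-class linear SVMs (trained separately);
#    `svm_weights`/`svm_bias` are placeholders for real learned classifiers.
svm_weights = np.random.randn(len(features[0]))
svm_bias = 0.0
scores = [float(f @ svm_weights + svm_bias) for f in features]
```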


Implementation

The time taken to train the network is very long, as it has to classify 2,000 region proposals per image. It also cannot run in real time, since inference takes around 47 seconds per test image. Moreover, Selective Search is a fixed algorithm, so no learning happens at that stage, which can lead to bad candidate region proposals.

[Tensorflow][Keras]

Result

R-CNN provides state-of-the-art results. Previous systems were complex ensembles combining multiple low-level image features with high-level context from object detectors and scene classifiers. R-CNN presents a simple and scalable object detection algorithm that gives a 30% relative improvement over the best previous results on ILSVRC2013.

Per-class average precision (%) on the ILSVRC2013 detection test set.

R-CNN achieved this performance through two insights. The first is to apply high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects. The second is a paradigm for training large CNNs when labeled training data is scarce: supervised pre-training on an auxiliary task followed by domain-specific fine-tuning. The results show that this supervised pre-training is highly effective.


U-Net: Image Segmentation Network

23 November 2018

U-Net is considered one of the standard CNN architectures for image segmentation tasks, where we need not only to classify the whole image but also to segment areas of the image by class, i.e. produce a mask that separates the image into several classes. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization.

The network is trained in end-to-end fashion from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC), U-Net won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512×512 image takes less than a second on a modern GPU.

Brain segmentation

Key Points

  1. Achieves good performance on various real-life tasks, especially biomedical applications;
  2. Has modest computational requirements (it trains in about 10 hours on a single modern GPU);
  3. Needs only a small amount of data to achieve good results.

The U-net Architecture

Fig. 1. U-net architecture (example for 32×32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.

The network architecture is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3×3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2×2 max pooling operation with stride 2 for downsampling.

At each downsampling step, feature channels are doubled. Every step in the expansive path consists of an upsampling of the feature map followed by a 2×2 convolution (up-convolution) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3×3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution.

u-net pipeline

At the final layer, a 1×1 convolution is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers.
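A minimal PyTorch sketch of one contracting step and the matching expansive step described above (unpadded 3×3 convolutions with ReLU, 2×2 max pooling, 2×2 up-convolution, and a cropped skip connection) is shown below; the channel sizes and input resolution are illustrative, not a full 23-layer reimplementation:

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two unpadded 3x3 convolutions, each followed by a ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
    )

def center_crop(feat, target_hw):
    # Crop a contracting-path feature map to match the (smaller) expansive map.
    _, _, h, w = feat.shape
    th, tw = target_hw
    dh, dw = (h - th) // 2, (w - tw) // 2
    return feat[:, :, dh:dh + th, dw:dw + tw]

class UNetStep(nn.Module):
    """One down step and the matching up step of a U-Net."""
    def __init__(self, in_ch=1, ch=64):
        super().__init__()
        self.down = double_conv(in_ch, ch)
        self.pool = nn.MaxPool2d(2)                            # 2x2 max pool, stride 2
        self.bottom = double_conv(ch, ch * 2)                  # channels double per step
        self.up = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)  # 2x2 up-convolution
        self.fuse = double_conv(ch * 2, ch)                    # after concatenation

    def forward(self, x):
        skip = self.down(x)
        bottom = self.bottom(self.pool(skip))
        up = self.up(bottom)
        skip = center_crop(skip, up.shape[-2:])                # crop lost border pixels
        return self.fuse(torch.cat([skip, up], dim=1))

out = UNetStep()(torch.randn(1, 1, 572, 572))   # spatial size shrinks, as in Fig. 1
```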

Training

The input images and their corresponding segmentation maps are used to train the network with stochastic gradient descent. Due to the unpadded convolutions, the output image is smaller than the input by a constant border width. A pixel-wise soft-max over the final feature map, combined with the cross-entropy loss function, gives the energy function. The cross-entropy penalty at each position is defined as:

E = Σ_{x∈Ω} w(x) · log(p_{ℓ(x)}(x)),

where p_k(x) is the pixel-wise soft-max over the output channels, ℓ(x) is the true label of pixel x, and w(x) is a weight map that gives some pixels more importance during training.

The separation border is computed using morphological operations. The weight map is then computed as:

w(x) = w_c(x) + w_0 · exp( −(d_1(x) + d_2(x))² / (2σ²) )

where w_c is the weight map that balances the class frequencies, d_1 denotes the distance to the border of the nearest cell, d_2 denotes the distance to the border of the second-nearest cell, and w_0 and σ are constants (the paper's experiments use w_0 = 10 and σ ≈ 5 pixels).
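Assuming per-cell instance masks are available, the weight map can be sketched with distance transforms as below; the helper name and the toy masks are illustrative, and the distances are measured to the nearest pixel of each cell rather than strictly to its border:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def unet_weight_map(instance_masks, class_weights, w0=10.0, sigma=5.0):
    """Weight map w(x) = w_c(x) + w0 * exp(-(d1 + d2)^2 / (2*sigma^2)).

    instance_masks: list of boolean arrays, one per cell instance.
    class_weights:  array of the same spatial shape holding w_c(x).
    """
    # Distance from every pixel to each cell (0 inside the cell itself).
    dists = np.stack([distance_transform_edt(~m) for m in instance_masks])
    dists.sort(axis=0)
    d1, d2 = dists[0], dists[1]        # nearest and second-nearest cells
    return class_weights + w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))

# Toy example: two square "cells" on a 64x64 image.
masks = [np.zeros((64, 64), bool), np.zeros((64, 64), bool)]
masks[0][10:25, 10:25] = True
masks[1][10:25, 35:50] = True
w = unet_weight_map(masks, class_weights=np.ones((64, 64)))
```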

Use Cases and Implementation

U-Net has been applied to many real-world tasks, particularly in biomedical imaging. The network is versatile and can be used for any reasonable image masking task, and high accuracy is achieved given proper training, an adequate dataset, and sufficient training time.

[Pytorch][Tensorflow][Keras]

Results

Fig:2 Segmentation results (IOU) on the ISBI cell tracking challenge 2015.

U-Net was applied to a cell segmentation task in light microscopy images, part of the ISBI cell tracking challenge 2014 and 2015. The PhC-U373 dataset contains Glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate recorded by phase-contrast microscopy and includes 35 partially annotated training images. Here U-Net achieved an average IOU (intersection over union) of 92%, significantly better than the second-best algorithm at 83% (see Fig. 2). The second dataset, DIC-HeLa, contains HeLa cells on flat glass recorded by differential interference contrast (DIC) microscopy [see the figures below] and includes 20 partially annotated training images. Here U-Net achieved an average IOU of 77.5%, significantly better than the second-best algorithm at 46%.

Result on the ISBI cell tracking challenge. (a) part of an input image of the PhC-U373 data set. (b) Segmentation result (cyan mask) with the manual ground truth (yellow border) (c) input image of the DIC-HeLa data set. (d) Segmentation result (random colored masks) with the manual ground truth (yellow border).

The U-Net architecture achieves outstanding performance on very different biomedical segmentation applications. It needs only very few annotated images and has a very reasonable training time of about 10 hours on an NVidia Titan GPU (6 GB).

VGG16 – Convolutional Network for Classification and Detection

20 November 2018

VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images; the ILSVRC subset used for the challenge covers 1,000 classes. It was one of the famous models submitted to ILSVRC-2014. It improves on AlexNet by replacing the large kernel-sized filters (11×11 and 5×5 in the first and second convolutional layers, respectively) with multiple 3×3 filters stacked one after another. VGG16 was trained for weeks on NVIDIA Titan Black GPUs.

vgg16 architecture

DataSet

ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1,000 images in each of 1,000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images. ImageNet consists of variable-resolution images, so the images have been down-sampled to a fixed resolution of 256×256: given a rectangular image, it is rescaled and the central 256×256 patch is cropped out.

The Architecture

The architecture depicted below is VGG16.

VGG16 Architecture

The input to the conv1 layer is a fixed-size 224×224 RGB image. The image is passed through a stack of convolutional (conv.) layers with filters of a very small receptive field: 3×3 (the smallest size that captures the notion of left/right, up/down, and center). One of the configurations also utilizes 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by a non-linearity). The convolution stride is fixed to 1 pixel, and the spatial padding of the conv. layer input is chosen so that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for the 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window, with stride 2.

Three Fully-Connected (FC) layers follow a stack of convolutional layers (which has a different depth in different architectures): the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.

All hidden layers are equipped with the rectification (ReLU) non-linearity. It is also noted that none of the networks (except for one) contain Local Response Normalisation (LRN); such normalization does not improve performance on the ILSVRC dataset but increases memory consumption and computation time.

Configurations

The ConvNet configurations are outlined in Figure 2. The nets are referred to by their names (A–E). All configurations follow the generic design presented above and differ only in depth: from 11 weight layers in network A (8 conv. and 3 FC layers) to 19 weight layers in network E (16 conv. and 3 FC layers). The width of the conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.

Figure: 2
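The pattern described above (3×3 convolutions whose channel count doubles after each max-pooling layer, followed by three fully connected layers) can be written down compactly. The sketch below builds configuration D (VGG16) in PyTorch; it is an illustration rather than the reference implementation, and a pretrained VGG16 is also available directly via torchvision.models.vgg16.

```python
import torch.nn as nn

# Configuration D (VGG16): numbers are output channels of 3x3 convolutions,
# 'M' marks a 2x2 max-pooling layer with stride 2.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_vgg16(num_classes=1000):
    layers, in_ch = [], 3
    for v in VGG16_CFG:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    features = nn.Sequential(*layers)
    classifier = nn.Sequential(                      # three fully connected layers
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
        nn.Linear(4096, num_classes),                # 1000-way ILSVRC classification
    )
    return nn.Sequential(features, classifier)

model = make_vgg16()    # expects 224x224 RGB input, as described above
```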

Use-Cases and Implementation

Unfortunately, there are two major drawbacks with VGGNet:

  1. It is painfully slow to train.
  2. The network architecture weights themselves are quite large (concerning disk/bandwidth).

Due to its depth and number of fully-connected nodes, VGG16 is over 533 MB, which makes deploying it a tiresome task. VGG16 is used in many deep learning image classification problems; however, smaller network architectures (such as SqueezeNet, GoogLeNet, etc.) are often more desirable. Still, it is a great building block for learning purposes, as it is easy to implement.

[Pytorch]

[Tensorflow]

[Keras]

Result

VGG16 significantly outperforms the previous generation of models from the ILSVRC-2012 and ILSVRC-2013 competitions. The VGG16 result is also competitive with the classification task winner (GoogLeNet, with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with external training data and 11.7% without it. Concerning single-net performance, the VGG16 architecture achieves the best result (7.0% test error), outperforming a single GoogLeNet by 0.9%.

It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture with substantially increased depth.

AlexNet – ImageNet Classification with Deep Convolutional Neural Networks

29 October 2018


AlexNet is the name of a convolutional neural network that has had a large impact on the field of machine learning, specifically in the application of deep learning to machine vision. It famously won the ImageNet LSVRC-2012 competition by a large margin (15.3% vs. 26.2% error for second place). The network has a very similar architecture to LeNet by Yann LeCun et al., but is deeper, with more filters per layer and with stacked convolutional layers. It consists of 11×11, 5×5 and 3×3 convolutions, max pooling, dropout, data augmentation, ReLU activations and SGD with momentum, with a ReLU activation after every convolutional and fully-connected layer. AlexNet was trained for six days simultaneously on two Nvidia GeForce GTX 580 GPUs, which is why the network is split into two pipelines.

Key Points

  1. The ReLU activation function is used instead of Tanh to add non-linearity; it accelerates training by about six times at the same accuracy.
  2. Dropout is used instead of regularisation to deal with overfitting; however, training time doubles with a dropout rate of 0.5.
  3. Overlapping pooling is used to reduce the size of the network; it reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively.

DataSet

ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1,000 images in each of 1,000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images. ImageNet consists of variable-resolution images, so the images have been down-sampled to a fixed resolution of 256×256: given a rectangular image, it is rescaled and the central 256×256 patch is cropped out.

The Architecture

Figure: 01
As depicted in Figure 1, AlexNet contains eight layers with weights; the first five are convolutional and the remaining three are fully connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. The network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution. The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU. The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully-connected layers are connected to all neurons in the previous layer.
In short, AlexNet contains five convolutional layers and three fully connected layers. ReLU is applied after every convolutional and fully connected layer, and dropout is applied before the first and the second fully connected layer. The network has 62.3 million parameters and needs 1.1 billion computation units in a forward pass. The convolutional layers, which account for only about 6% of all the parameters, consume about 95% of the computation.
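For illustration, the eight weight layers can be sketched as a single-stream PyTorch model; the two-GPU split and the local response normalisation of the original are omitted, so this is an approximation of the paper's architecture rather than a faithful reproduction:

```python
import torch.nn as nn

# Single-stream sketch of AlexNet's eight weight layers, for a 227x227 RGB input.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                   # overlapping pooling
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),        # scores fed to a 1000-way softmax in the loss
)
```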

Training

AlexNet was trained for 90 epochs, which took six days on two Nvidia GeForce GTX 580 GPUs; this is why the network is split into two pipelines. SGD with learning rate 0.01, momentum 0.9 and weight decay 0.0005 was used. The learning rate was divided by 10 once the accuracy plateaued, and it was decreased three times during the training process.

The update rule for a weight w was

v_{i+1} = 0.9 · v_i − 0.0005 · ε · w_i − ε · ⟨∂L/∂w⟩_{D_i},    w_{i+1} = w_i + v_{i+1},

where i is the iteration index, v is the momentum variable, ε is the learning rate and ⟨∂L/∂w⟩_{D_i} is the average gradient over the i-th batch D_i. An equal learning rate was used for all layers and adjusted manually throughout training; the heuristic was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate.
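These hyper-parameters map directly onto a standard momentum-SGD setup. The sketch below is illustrative: the stand-in model, the plateau patience and the placeholder validation error are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

model = nn.Linear(9216, 1000)        # stand-in for the AlexNet model above

# SGD exactly as described: lr 0.01, momentum 0.9, weight decay 0.0005.
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01, momentum=0.9, weight_decay=0.0005)

# Divide the learning rate by 10 whenever the validation error plateaus
# (this happened roughly three times over the 90 training epochs).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.1, patience=2)

for epoch in range(90):
    # ... run one training epoch here, then step on the measured validation error ...
    val_error = 1.0 / (epoch + 1)    # placeholder standing in for a real validation error
    scheduler.step(val_error)
```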

Use-Cases and Implementation

The results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning. In the years after AlexNet was published, essentially all entries in the ImageNet competition used convolutional neural networks for the classification task. AlexNet was a pioneering CNN and opened a whole new research era, and implementing it is now easy thanks to the many available deep learning libraries.

Result

The network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%. The best performance achieved during the ILSVRC-2010 competition was 47.1% and 28.2%, with an approach that averages the predictions of six sparse-coding models trained on different features; since then the best published results are 45.7% and 25.7%, with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features.
The results on ILSVRC-2010 are summarized in Table 1.
(Left) Eight ILSVRC-2010 test images and the five labels considered most probable by the model. The correct label is written under each image, and the probability assigned to the correct label is shown with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column; the remaining columns show the six training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector of the test image.

Temporal Relational Reasoning in Videos

25 September 2018

The ability to reason about the relations between entities over time is crucial for intelligent decision-making. Temporal relational reasoning allows intelligent species to analyze the current situation relative to the past and formulate hypotheses on what may happen next. Figure 1 shows that given two observations of an event, people can easily recognize the temporal relation between two states of the visual world and deduce what has happened between the two frames of a video.

Figure 1

Humans can easily infer the temporal relations and transformations between these observations, but this task remains difficult for neural networks. Figure 1 shows:

  • a – Poking a stack of cans, so it collapses;
  • b – Stacking something;
  • c – Tidying up a closet;
  • d – Giving a thumbs up.

Activity recognition in videos has been one of the core topics in computer vision. However, it remains difficult due to the ambiguity of describing activities at appropriate timescales.

Previous Work

With the rise of deep convolutional neural networks (CNNs), which achieve state-of-the-art performance on image recognition tasks, many works have looked into designing effective deep convolutional networks for activity recognition. For instance, various approaches to fusing RGB frames over the temporal dimension have been explored on the Sports-1M dataset.

Another technique uses two-stream CNNs, with one stream of static images and the other of optical flow, to fuse information about object appearance and short-term motion.

One more technique uses a CNN+LSTM model: a CNN extracts frame features and an LSTM integrates those features over time to recognize activities in videos. For temporal reasoning, instead of designing temporal structures manually, the work described here uses a more generic structure to learn the temporal relations in end-to-end training.

Yet another approach uses a two-stream Siamese network to learn the transformation matrix between two frames, then uses brute-force search to infer the action category.

State-of-the-art idea

The idea is to use TRNs (Temporal Relation Networks). The focus is to model multi-scale temporal relations in videos. Time-contrastive networks have been used for self-supervised imitation learning of object manipulation from third-person video observation; this work instead aims to learn various temporal relations in videos in a supervised learning setting. The proposed TRN can also be extended to self-supervised learning for robot object manipulation.

Figure 2: The illustration of Temporal Relation Networks.

TRN is simple and can easily be plugged into any existing convolutional neural network architecture to enable temporal relational reasoning. The pairwise temporal relation is defined as a composite function:

T_2(V) = h_φ( Σ_{i<j} g_θ(f_i, f_j) ),

where the input is the video V with n selected ordered frames, V = {f_1, f_2, …, f_n}, f_i is a representation of the i-th frame of the video (e.g., the output activation from a standard CNN), and g_θ and h_φ are functions (small multilayer perceptrons) that fuse the features of the frame pairs. The composite function of 2-frame temporal relations extends to higher-order relations; for example, the 3-frame relation function is

T_3(V) = h'_φ( Σ_{i<j<k} g'_θ(f_i, f_j, f_k) ),

where the sum is again over sets of frames i, j, k that have been uniformly sampled and sorted.
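A minimal sketch of the 2-frame relation module in PyTorch is shown below; g_θ and h_φ are implemented as small multilayer perceptrons, and the feature dimension, hidden size and number of classes are illustrative assumptions:

```python
import itertools
import torch
import torch.nn as nn

class TwoFrameRelation(nn.Module):
    """Minimal 2-frame temporal relation: T2(V) = h_phi( sum_{i<j} g_theta(f_i, f_j) )."""
    def __init__(self, feat_dim=256, hidden=256, num_classes=174):
        super().__init__()
        self.g_theta = nn.Sequential(                 # fuses a pair of frame features
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(inplace=True))
        self.h_phi = nn.Sequential(                   # maps the pooled relation to classes
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes))

    def forward(self, frames):                        # frames: (batch, n, feat_dim)
        n = frames.shape[1]
        relation = 0
        for i, j in itertools.combinations(range(n), 2):   # ordered pairs i < j
            relation = relation + self.g_theta(
                torch.cat([frames[:, i], frames[:, j]], dim=1))
        return self.h_phi(relation)

# Frame features f_1..f_8 from any base CNN (random here for illustration).
logits = TwoFrameRelation()(torch.randn(4, 8, 256))
```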

Experiments

Evaluation has been done on a variety of activity recognition tasks using TRN-equipped networks. For recognizing activities that depend on temporal relational reasoning, TRN-equipped networks outperform a baseline network without a TRN by a large margin. TRN-equipped networks also obtain competitive results on activity classification on the Something-Something dataset for human-object interaction recognition, on the Charades dataset for daily activity recognition, and on the Jester dataset for hand gesture recognition.

Statistics of the datasets used in evaluating the TRNs

The network used for extracting image features plays an important role in visual recognition tasks; features from deeper networks such as ResNet usually perform better. The goal here, however, is to evaluate the effectiveness of the TRN module for temporal relational reasoning in videos. Thus, the base network is fixed throughout all experiments and the performance of the CNN model is compared with and without the proposed TRN modules.

Result

Something-Something is a recent video dataset for human-object interaction recognition. There are 174 classes, some of the ambiguous activity categories are challenging, such as ‘Tearing Something into two pieces’ versus ‘Tearing Something just a little bit’, ‘Turn something upside down’ versus ‘Pretending to turn something upside down’. The results on the validation set and test set of Something-V1 and Something-V2 datasets are listed in Figure 3.

Fig:03 Results on the validation set and test set (LEFT), Comparison of TRN and TSN as the number of frames (RIGHT)

TRN outperforms TSN by a large margin as the number of frames increases, showing the importance of temporal order. TRN-equipped networks were also evaluated on the Jester dataset, a video dataset for hand gesture recognition with 27 classes. The results on the validation set of the Jester dataset are shown in Figure 4.

Figure 4: Jester dataset results on (left) the validation set and (right) the test set
Prediction examples on a) Something-Something, b) Jester, and c) Charades. For each example drawn from Something-Something and Jester, the top two predictions are shown, with green text indicating a correct prediction and red an incorrect one.

Comparison with other state-of-the-art

The approach was compared with other state-of-the-art methods. The authors evaluate the MultiScale TRN on the Charades dataset for daily activity recognition. The results are listed in Fig. 5. The method outperforms various approaches such as two-stream networks and the recent Asynchronous Temporal Field (TempField) method.

Fig:05 Results on Charades Activity Classification

TRN model is capable of correctly identifying actions for which the overall temporal ordering of frames is essential for a successful prediction. This outstanding performance shows the effectiveness of the TRN for temporal relational reasoning and its strong generalization ability across different datasets.

Conclusion

The proposed simple and interpretable Temporal Relation Network module performs temporal relational reasoning in neural networks for videos. It was evaluated on several recent datasets and established competitive results using only discrete frames, and it was also shown that the TRN module discovers visual common-sense knowledge in videos.

Temporal alignment of videos from the (a) Something-Something and (b) Jester datasets using the most representative frames as temporal anchor points.

Inferring a 3D Human Pose out of a 2D Image with FBI

16 July 2018

Autonomous driving, virtual reality, human-computer interaction and video surveillance: these are all application scenarios in which you would like to derive a 3D human pose from a single RGB image. Significant advances have been made in this area since convolutional neural networks were employed to solve the problem of 3D pose inference. However, the task remains challenging for outdoor environments, as it is very difficult to obtain 3D pose ground truth for in-the-wild images.

So, let’s see how this fancy “FBI” abbreviation helps with inferring a 3D human pose out of a single RGB image.

Suggested Approach

A group of researchers from Shenzhen (China) proposed a novel framework for deriving a 3D human pose from a single image. In particular, they suggest exploiting, for each bone, the information of whether it points forward or backward with respect to the camera view. They refer to this data as Forward-or-Backward Information (or simply, FBI).

Their method starts with training a Convolutional Neural Network with two branches: one is related to mapping 2D joint locations from an image and another comes from FBI of bones. In fact, several state-of-the-art methods use information on the 2D joint locations for predicting a 3D human pose. However, this is an ill-posed problem since different valid 3D poses can explain the same observed 2D joints. At the same time, information on whether each bone is forward or backward when combined with 2D joint locations provides a unique 3D joint position. So, the researchers claim that feeding both 2D joint locations and FBI of bones into a deep regression network will provide better predictions of the 3D positions of joints.

Distribution of out-of-plane angles for all bones marked as “uncertain”

Furthermore, to support the training, they developed an annotation user interface and labeled FBI for around 12,000 in-the-wild images. They simplified the problem by distinguishing 14 bones, with each bone having one of three states with respect to the camera view: forward, backward or parallel to the line of sight. Hired annotators were asked to label images randomly selected from the MPII dataset, where the 2D bones are provided. For each bone, the annotator was asked to choose from three options: forward, backward or uncertain (given the difficulty of making an accurate judgment for the “parallel to sight” option). In total, around 20% of bones were marked as uncertain. The figure above illustrates the distribution of out-of-plane angles for all uncertain bones; as expected, people show more uncertainty when the bone is closer to parallel with the view plane.

Network Architecture

Let’s now discover in more depth the network architecture of the suggested approach.

Network architecture

The network consists of three components:

1. 2D pose estimator. It takes an image of a human as input and outputs the 2D locations of 16 joints of the human.

2. FBI predictor. This component also takes an image as input but outputs the FBI of 14 bones with three possible statuses: forward, backward and uncertain. The network here starts from a sequence of convolutional layers, followed by two successive stacked hourglass modules. The extracted feature maps are then fed into a set of convolutional layers and followed by a fully connected layer with a softmax layer to output classification results.

3. 3D pose regressor. At this stage, a deep regression network is learned to infer the 3D coordinates of the joints by taking both their 2D locations and the FBI as input. To keep more information, the regressor takes the generated probability matrix of the softmax layer as input. Thus, 2D locations and the probability matrix are concatenated together and then mapped to the 3D pose by exploiting two cascaded blocks.
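The third component can be sketched as a small fully connected network that consumes the concatenated 2D joint locations (16×2) and the FBI probability matrix (14×3). The hidden width, the residual-style blocks and all other details below are assumptions for illustration, not the authors' exact regressor:

```python
import torch
import torch.nn as nn

class FBIPoseRegressor(nn.Module):
    """Sketch of the 3D pose regressor: 2D joints + FBI probabilities -> 3D joints.

    16 joints x 2 coordinates and 14 bones x 3 FBI classes follow the text above;
    the hidden size and the two cascaded blocks are illustrative assumptions.
    """
    def __init__(self, hidden=1024):
        super().__init__()
        in_dim = 16 * 2 + 14 * 3          # concatenated 2D locations and FBI softmax matrix
        self.inp = nn.Linear(in_dim, hidden)
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
                          nn.Linear(hidden, hidden), nn.ReLU(inplace=True))
            for _ in range(2)])           # "two cascaded blocks"
        self.out = nn.Linear(hidden, 16 * 3)   # 3D coordinates of the 16 joints

    def forward(self, joints_2d, fbi_probs):
        x = torch.relu(self.inp(torch.cat([joints_2d.flatten(1),
                                           fbi_probs.flatten(1)], dim=1)))
        for block in self.blocks:
            x = x + block(x)              # residual-style connection (assumption)
        return self.out(x).view(-1, 16, 3)

pose3d = FBIPoseRegressor()(torch.randn(2, 16, 2),
                            torch.softmax(torch.randn(2, 14, 3), dim=-1))
```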

Comparisons against existing methods

The quantitative comparison was carried out on Human3.6M, a dataset containing 3.6 million RGB images that capture 7 professional actors performing 15 different activities (e.g., walking, eating, sitting). The mean per-joint position error (MPJPE) between the ground truth and the prediction was used as the evaluation metric, and the results are presented in Table 1.

Table 1. Quantitative comparisons based on MPJPE. Ordinal [19] is a concurrent work with the method presented here. The best score without consideration of this work is marked in blue bold. Black bold is used to highlight the best score when taking this work for comparison.

For some of the previous works, the prediction has been further aligned with the ground truth via a rigid transformation. The results are presented in the table below.

Table 2. Quantitative comparisons based on MPJPE after rigid transformation. Ordinal [19] is a concurrent work with the method presented here. The best score without consideration of this work is marked in blue bold. Black bold is used to highlight the best score when taking this work for comparison.

The results of the quantitative comparison demonstrate that the presented approach outperforms all previous works on almost all actions and makes considerable improvements on complicated actions such as sitting and sitting down. However, it is worth noting that one of the works, marked as Ordinal [19] in the tables above, exploited a similar strategy and achieved comparable results. Specifically, it proposed an annotation tool for collecting the depth relations of all joints; however, that annotation procedure seems to be a much more tedious task compared to the one presented in this article.

To confirm the efficiency of this method for in-the-wild images, the researchers took 1,000 images from their FBI dataset as test data and conducted another comparison against the state-of-the-art method presented by Zhou et al. Here the correctness ratio of the FBI derived from the 3D pose was used as the evaluation metric: the method of Zhou et al. reached a 75% correctness ratio, while the presented approach reached 78%. You can also see the results of a qualitative comparison in the image below.

Qualitative comparison results of the suggested method on some in-the-wild (ITW) images

Bottom line

The proposed approach exploits new information, called Forward-or-Backward Information (FBI) of bones, for 3D human pose estimation, and this piece of data in fact helps to extract more 3D-aware features from images. As a result, the method outperforms all previous works. However, this is not the only contribution of the research team: they have also labeled the FBI for 12,000 in-the-wild images with a well-designed user interface. These images will become publicly available to benefit other researchers working in this area.

A New Method for Image De-blurring

21 June 2018

When we use a camera, we want the recorded image to be a faithful representation of what we see in front of us. However, very often images contain blur, and one of the major sources of blur is the camera or object motion. This blur is commonly known as motion blur, and in the past, there have been many attempts to remove the blur and reconstruct a sharp image.

Blurring and de-blurring?

The convolution operation is the process of applying a general purpose filter to an image (by applying a function using a kernel or convolution matrix to local receptive fields of the image). De-blurring, in essence, is trying to reverse convolution on an image (and it is often called deconvolution). This all comes from the fact that the complex image formation process generates an image and de-blurring is trying to remove the blur introduced to the image during this process.

There are generally two types of approaches to image de-blurring: methods based on blind deconvolution and techniques based on non-blind deconvolution. Blind deconvolution refers to deconvolving the image without explicit knowledge of the impulse response function used in the convolution. Methods relying on blind deconvolution often make appropriate assumptions to estimate the impulse response function, while the others rely on the assumption that the kernel (the impulse response function) is known.

Image de-blurring

New de-blurring approach

Arguing that many of the previously existing approaches assume an over-simplistic image formation model, researchers from the Université Paris-Saclay and Universidad de la República propose a novel de-blurring method based on non-blind deconvolution. In their paper, named “Modeling realistic degradations in non-blind deconvolution,” they tackle the problem of motion de-blurring in images by giving a more realistic (and more complex) image formation model.

Starting from the simplest image acquisition model, which takes into account: the ideal non-blurred (sharp) image, the blurring kernel and a realization of Gaussian noise, the authors propose an extended, more realistic formation model. The simplest model used very often in many approaches is given as a linear combination of the sharp image, the kernel, and the noise. However, the authors argue that it is not powerful enough to capture the process of generating an image and modeling a realistic image acquisition pipeline. As they explain, this is due to non-invertible, non-linear degradations that can occur along the whole formation pipeline. Examples of these degradations addressed with the proposed model are saturation, quantization, and gamma correction.

In the novel approach, the authors approximate the motion blurring function with a model that includes a pixel saturation operator, a pixel quantization function and a gamma correction coefficient. Since imposing a model itself is not enough to solve a problem (especially of this degree), the authors also present a deconvolution method that works under real practical degradations. The technique is a non-blind deconvolution, so it assumes knowledge of the kernel function.
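A rough numpy sketch of such an extended formation model is shown below; the ordering of the degradations and all parameter values are assumptions for illustration, not the authors' exact pipeline:

```python
import numpy as np
from scipy.signal import fftconvolve

def degrade(sharp, kernel, gamma=2.2, noise_std=0.01, levels=256):
    """Sketch of an extended formation model: blur + noise + gamma + saturation + quantization."""
    blurred = fftconvolve(sharp, kernel, mode="same")                   # motion blur (known kernel)
    blurred = blurred + np.random.normal(0, noise_std, blurred.shape)   # sensor noise
    gamma_corrected = np.clip(blurred, 0, None) ** (1.0 / gamma)        # gamma correction
    saturated = np.clip(gamma_corrected, 0.0, 1.0)                      # pixel saturation
    quantized = np.round(saturated * (levels - 1)) / (levels - 1)       # 8-bit quantization
    return quantized

sharp = np.random.rand(128, 128)                 # stand-in for a sharp image in [0, 1]
kernel = np.ones((1, 9)) / 9.0                   # simple horizontal motion-blur kernel
observed = degrade(sharp, kernel)
```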

Image de-blurring method

The method

To explain it simply, the whole de-blurring method is based on defining each degradation as energy (that expresses the data fitting between the ideal (sharp) image and the blurred one) that is minimized using the Stochastic Deconvolution framework.

Based on the coordinate descent algorithm, which is derivative-free and can be applied to any energy (cost) minimization problem, the method minimizes the energies defined for the three image degradations: pixel saturation, quantization, and gamma correction.

In the paper, separate data fitting terms (defining the cost or energy) are given for the three degradations, and finally, a combined one is proposed that addresses the problem from the viewpoint of all three (realistic) degradations.

Experiments and Evaluation

The authors study each of the separate models for the degradations mentioned before (except the gamma correction). They apply the method to images from the dataset BSDS300 and calculate and record PSNR (peak signal-to-noise ratio) as an evaluation metric. They show that their models outperform the previous approaches.

To evaluate the complete method, tackling all three degradations at once, the authors created a realistic dataset of 8 sharp, natural images. They apply an inverse gamma curve, synthetically blur the images and finally saturate the pixels by clipping them at the 98th percentile; they also add Gaussian noise and quantization. In this way, they generate degraded images exhibiting all three degradations that the model can tackle. The results are shown in the figure below.

Results of the evaluation of the method using PSNR (peak signal-to-noise ratio)

This method shows that previous approaches to de-blurring are over-simplistic. Moreover, it shows that the image formation pipeline is a complex non-linear mapping and that de-blurring is not a trivial task. Addressing the common, known image degradations using an energy minimization algorithm and well-defined cost functions gives excellent results despite the complexity of the problem. However, this approach works only with non-blind deconvolution, and the authors leave the extension of the method to blind deconvolution as future work.

Dane Mitrev

 

Human Rights Violations Recognition in Images Using CNN Features

18 May 2018

Human rights violations have been unfolding during the entire human history, while nowadays they increasingly appear in many different forms around the world. Human rights violations refer to the actions executed by state or non-state actors that breach any part of those rights that protect individuals and groups from behaviors that interfere with fundamental freedoms and human dignity.

Photos and videos have become an essential source of information for human rights investigations, including Commissions of Inquiry and Fact-finding Missions. Investigators often receive digital images directly from witnesses, providing high-quality corroboration of their testimonies. In most instances, investigators receive images from third parties (e.g. journalists or NGOs), but their provenance and authenticity are unknown.

A third source of digital images is social media, e.g. images uploaded to Facebook, again with uncertainty regarding authenticity or source. The sheer volume of images means that manually sifting through the photos to verify whether any abuse is taking place, and then acting on it, would be tedious and time-consuming work for humans. For this reason, a software tool aimed at identifying potential abuses of human rights, capable of going through images quickly to narrow down the field, would greatly assist human rights investigators. The major contributions of this work are as follows:

  1. A new dataset of human rights abuses, containing approximately 3k images for 8 violation categories.
  2. An assessment of the representation capability of deep object-centric CNNs and scene-centric CNNs for recognizing human rights abuses.
  3. An attempt to enhance human rights violations recognition by combining object-centric and scene-centric CNN features through different fusion mechanisms.
  4. An evaluation of the effects of different feature fusion mechanisms on human rights violations recognition.

Human Rights Violations Database

Many organizations concerned with human rights advocacy use digital images as a tool for improving the exposure of human rights and international humanitarian law violations that may otherwise be impossible. To advance the automated recognition of human rights violations a well-sampled image database is required.


The Human Rights Archive (HRA) database is a repository of approximately 3k photographs of various human rights violations captured in real-world situations and surroundings, labeled with eight semantic categories covering the types of human rights abuses encountered in the world. The dataset contains eight violation categories and a supplementary ‘no violation’ class.

Human rights violations recognition is closely related to, but radically different from, object and scene recognition. For this reason, following a conventional image collection procedure is not appropriate for collecting images with respect to human rights violations. The first issue encountered is that the query terms for describing different categories of human rights violations must be provided by experts in the field of human rights.

Non-governmental organizations (NGOs) and their public repositories were used to create the dataset.

The first NGO considered is Human Rights Watch, which offers an online media platform exposing human rights and international humanitarian law violations in the form of various media types such as videos, photo essays, satellite imagery and audio clips. Its online repository contains nine main topics in the context of human rights violations (arms, business, children’s rights, disabilities, health and human rights, international justice, LGBT, refugee rights and women’s rights) and 49 subcategories. One considerable drawback in the course of that process is the presence of a watermark in most of the video files available from the platform; as a result, all recorded images that originally contained the watermark had to be cropped in a suitable way.

Only colour images of 600×900 pixels or larger were retrieved after the cropping stage. In addition to those images, all photo essays available for each topic and its subcategories were added, contributing 342 more images to the final set. The entire pipeline used for collecting and filtering the pictures from Human Rights Watch is depicted in Figure 1.

The second NGO investigated is the United Nations, which presents an online collection of images in the context of human rights. Its website is equipped with a search mechanism capable of returning relevant images for simple and complex query terms.

Figure 1

The final dataset contains 8 human rights violation categories and 2,847 images. 367 ready-made images were downloaded from the two online repositories, representing 12.88% of the entire dataset, while the remaining 2,480 images were recorded from videos on the Human Rights Watch media platform. The eight categories are as follows:

  1. Arms
  2. Child Labour
  3. Child Marriage
  4. Detention Centres
  5. Disability Rights
  6. Displaced populations
  7. Environment
  8. Out of School

How It Works

Given the impressive classification performance of deep convolutional neural networks, three modern object-centric CNN architectures, ResNet50, VGG16 and VGG19, are used and then fine-tuned on HRA to create baseline CNN models.

The transfer learning technique injects knowledge from other tasks by deploying weights and parameters from a pre-trained network into the new one; it has become a commonly used method to learn task-specific features.

Considering the size of the dataset, the chosen way to apply a deep CNN is to reduce the number of free parameters. To achieve this, the first filter stages can be trained in advance on different tasks of object or scene recognition and held fixed during training on human rights violations recognition. By freezing the earlier layers (preventing their weights from being updated during training), overfitting can be avoided.

The feature extraction modules have been initialized using pre-trained models from two different large-scale datasets, ImageNet and Places. ImageNet is an object-centric dataset which contains images of generic objects, including people, and is therefore a good option for understanding the contents of the image region containing the target person. In contrast, Places is a scene-centric dataset specifically created for high-level visual understanding tasks such as recognizing scene categories.

Figure 2: Network architecture used for high-level feature extraction with HRA

Hence, pretraining the image feature extraction model on this dataset provides global (high-level) contextual support. For the target task (human rights violation recognition), the network outputs scores for the eight target categories of the HRA dataset, or ‘no violation’ if none of the categories is present in the image.
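The scheme can be illustrated with a short transfer-learning sketch. The PyTorch snippet below is an equivalent illustration rather than the article's own code: it uses torchvision's ImageNet weights (a Places-pretrained backbone would be plugged in the same way), freezes the convolutional stages and replaces the classifier head with a 9-way output (8 violation categories plus ‘no violation’):

```python
import torch.nn as nn
import torchvision.models as models

# Object-centric backbone pretrained on ImageNet.
backbone = models.vgg16(weights="IMAGENET1K_V1")

# Freeze the early convolutional stages so their weights are not updated.
for param in backbone.features.parameters():
    param.requires_grad = False

# Replace the final classifier layer: 8 violation categories + 'no violation'.
num_classes = 9
backbone.classifier[6] = nn.Linear(backbone.classifier[6].in_features, num_classes)

# Only the unfrozen parameters are passed to the optimizer during fine-tuning.
trainable = [p for p in backbone.parameters() if p.requires_grad]
```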

Results

The classification results for top-1 accuracy and coverage are listed below. A more natural performance metric to use in this situation is coverage, the fraction of examples for which the system can produce a response. For all the experiments in this paper, we employ a threshold of 0.85 over the prediction confidence in order to report the coverage performance metric.

Performance metrics

Figure 3 shows the responses to examples predicted by the best-performing HRA-CNN, VGG19. Broadly, one type of misclassification can be identified given the current label attribution of HRA: images depicting the evidence of a particular situation rather than the actual action, such as schools being targeted by armed attacks. Future development of the HRA database will explore assigning multiple ground-truth labels or free-form sentences to images, to better capture the richness of visual descriptions of human rights violations.

Figure 3: Predictions by the best-performing HRA-CNN (VGG19)

This technique addresses the problem of recognizing abuses of human rights from a single image. The HRA dataset was created from images taken in non-controlled environments, containing activities which reveal a human right being violated without any other prior knowledge. Using this dataset and a two-phase deep transfer learning scheme, a state-of-the-art deep learning baseline is presented for the problem of visual human rights violation recognition. A technology capable of identifying potential human rights abuses in the same way humans do has many potential applications in human-assistive technologies and would significantly support human rights investigators.

Muneeb Ul Hassan

Image Inpainting for Irregular Holes Using Partial Convolutions

8 May 2018

Deep learning is one of the fastest-growing areas of artificial intelligence. It has been used extensively in many fields, including real-time object detection, image recognition, and video classification, and is usually implemented as a convolutional neural network, deep belief network, recurrent neural network, etc. One of the problems involving images is image inpainting: the task of filling in holes in an image. The goal of this work is to propose a model for image inpainting that operates robustly on irregular hole patterns and produces semantically meaningful predictions that blend smoothly with the rest of the image, without the need for any additional post-processing or blending operation. It has many applications; for example, it can be used in image editing to remove unwanted image content while filling the image with reasonable content.

Many different approaches, most of them not based on deep learning, have been used for image inpainting, and they have some limitations. One method, called PatchMatch, iteratively searches for the best-fitting patches to fill in the holes. While this approach generally produces smooth results, it is limited by the available image statistics and has no concept of visual semantics. Another limitation of many recent methods is the focus on rectangular-shaped holes, often assumed to be at the center of the image. These limitations may lead to overfitting to rectangular holes and ultimately limit the utility of such models in applications.

How Does It Work?

To overcome the limitations of previous approaches, partial convolutions have been used by Nvidia Research to solve the image inpainting problem. A Partial Convolution Layer comprises a masked and re-normalized convolution operation followed by a mask-update step. The main extension is the automatic mask update step, which removes the masking wherever the partial convolution was able to operate on an unmasked value. The following contributions have been made:

  • The use of partial convolutions with an automatic mask update step to achieve state-of-the-art results on image inpainting.
  • A demonstration that substituting convolutional layers with partial convolutions and mask updates achieves state-of-the-art inpainting results.
  • A demonstration of the efficacy of training image-inpainting models on irregularly shaped holes.

Partial Convolutional Layer

The model uses stacked partial convolution operations and mask-updating steps to perform image inpainting. The partial convolution operation and the mask update function jointly form the Partial Convolutional Layer.

Let W be the convolution filter weights and b the corresponding bias. X denotes the feature values (pixel values) for the current convolution (sliding) window and M is the corresponding binary mask. The partial convolution at every location is expressed as:

x′ = Wᵀ (X ⊙ M) · 1/sum(M) + b,   if sum(M) > 0;   x′ = 0, otherwise,

where ⊙ denotes element-wise multiplication. As can be seen, output values depend only on the unmasked inputs. The scaling factor 1/sum(M) applies appropriate scaling to adjust for the varying amount of valid (unmasked) inputs. After each partial convolution operation, the mask is updated. The unmasking rule is simple: if the convolution was able to condition its output on at least one valid input value, then the mask is removed at that location. This is expressed as:

m′ = 1,   if sum(M) > 0;   m′ = 0, otherwise,

and can easily be implemented in any deep learning framework as part of the forward pass.
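A minimal PyTorch sketch of the two rules above is given below; it uses a single-channel mask for simplicity and is an illustration of the idea rather than NVIDIA's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Minimal sketch of a partial convolution layer with mask update."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)
        # Fixed all-ones kernel used only to count valid (unmasked) inputs per window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # mask: (N, 1, H, W), 1 for valid pixels and 0 for holes.
        with torch.no_grad():
            valid = F.conv2d(mask, self.ones, stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)                          # W^T (X ⊙ M) + b
        bias = self.conv.bias.view(1, -1, 1, 1)
        scale = 1.0 / valid.clamp(min=1.0)                 # re-normalise by 1/sum(M)
        out = (out - bias) * scale + bias
        out = torch.where(valid > 0, out, torch.zeros_like(out))
        new_mask = (valid > 0).float()                     # unmask where any input was valid
        return out, new_mask

layer = PartialConv2d(3, 64, kernel_size=7, stride=2, padding=3)
img, mask = torch.randn(1, 3, 512, 512), torch.ones(1, 1, 512, 512)
mask[:, :, 100:300, 150:350] = 0                           # a hole region
out, updated_mask = layer(img, mask)
```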

Network Architecture

The partial convolution layer is implemented by extending standard PyTorch. The straightforward implementation is to define binary masks of size C×H×W, the same size as their associated images/features. Mask updating is then implemented using a fixed convolution layer with the same kernel size as the partial convolution operation, but with weights identically set to 1 and bias set to 0. The entire network inference on a 512×512 image takes 0.23 s on a single NVIDIA V100 GPU, regardless of the hole size.

The architecture used is a UNet-like architecture, replacing all convolutional layers with partial convolutional layers and using nearest-neighbor up-sampling in the decoding stage.

Network Architecture
Figure 1: The architecture used for image inpainting, where all convolutional layers are replaced with partial convolutional layers

ReLU is used in the encoding stage and LeakyReLU with alpha = 0.2 is used between all decoding layers. The encoder comprises eight partial convolutional layers with stride 2. The kernel sizes are 7, 5, 5, 3, 3, 3, 3 and 3, and the channel sizes are 64, 128, 256, 512, 512, 512, 512, and 512. The last partial convolution layer’s input contains the concatenation of the original input image with holes and the original mask.

Loss Function

The loss functions target both per-pixel reconstruction accuracy and composition, i.e. how smoothly the predicted hole values transition into their surrounding context. Given an input image with holes I_in, its mask M, the network prediction I_out and the ground-truth image I_gt, the pixel losses are defined as:

L_hole = ‖(1 − M) ⊙ (I_out − I_gt)‖₁,    L_valid = ‖M ⊙ (I_out − I_gt)‖₁

The perceptual loss (a loss that measures high-level perceptual and semantic differences between images, using a loss network pretrained for image classification, so that the perceptual loss function is itself a deep convolutional neural network) is defined as:

L_perceptual = Σ_n ‖Ψ_n(I_out) − Ψ_n(I_gt)‖₁ + Σ_n ‖Ψ_n(I_comp) − Ψ_n(I_gt)‖₁

where Ψ_n is the activation map of the n-th selected layer of the pretrained loss network and I_comp is the raw output image I_out with the non-hole pixels set to the ground truth.

The perceptual loss computes the L1 distances between both I_out and I_comp and the ground truth in the feature space of the loss network. To capture feature correlations (autocorrelation), a Gram-matrix-based style loss term is also introduced on each feature map:

L_style_out = Σ_n K_n · ‖ Ψ_n(I_out)ᵀ Ψ_n(I_out) − Ψ_n(I_gt)ᵀ Ψ_n(I_gt) ‖₁

where K_n is a normalization factor for the n-th selected layer; an analogous term L_style_comp is computed for I_comp.

The total loss is a weighted combination of all the losses above:

L_total = L_valid + 6 L_hole + 0.05 L_perceptual + 120 (L_style_out + L_style_comp) + 0.1 L_tv

where L_tv is a total variation smoothness penalty on the hole region.

Results

Partial convolution outperforms other methods. To show this, the l1 error, peak signal-to-noise ratio (PSNR), structural similarity (SSIM) index and Inception score (IScore) are used as evaluation metrics. The table below shows the comparison results: the PConv method outperforms all the other methods on these measurements on irregular masks.

measurements on irregular masks

The use of a partial convolution layer with an automatic mask-updating mechanism achieves state-of-the-art image inpainting results. The model can robustly handle holes of any shape, size, location, or distance from the image borders. Furthermore, performance does not deteriorate catastrophically as holes increase in size, as seen in Figure 2.

Figure 2: Top row: input; bottom row: corresponding inpainted results
Figure 3: Comparison between typical convolution layer based results (Conv) and partial convolution layer based results (PConv)

More results of the partial convolution (PConv) approach:


Muneeb ul Hassan