Deep Clustering Approach for Image Classification Task

20 September 2018

Clustering of images seems to be a well-researched topic. But in fact, little work has been done to adapt it to the end-to-end training of visual features on large-scale datasets.

The existence of ImageNet, a large fully-supervised dataset, has driven much of the progress in pre-training convolutional neural networks. However, ImageNet is not that large by today’s standards: it “only” contains about a million images. We now need to move to the next level and build a bigger and more diverse dataset, potentially consisting of billions of images.

No Supervision Required

Can you imagine the number of manual annotations required for a dataset of that size? It would be enormous. Replacing labels with raw metadata is not a solution either, as this leads to biases in the visual representations with unpredictable consequences.

So, it looks like we need methods that can be trained on internet-scale datasets with no supervision. That’s precisely what a Facebook AI Research team suggests. DeepCluster is a novel clustering approach for the large-scale end-to-end training of convolutional neural networks.

The authors of this method claim that the resulting model outperforms the current state of the art by a significant margin on all the standard benchmarks. But let’s first review the previous work in this research area.

Previous Works

All the related work can be arranged into three groups:

  • Unsupervised learning of features: for example, Yang et al. iteratively learn ConvNet features and clusters with a recurrent framework, while Bojanowski and Joulin learn visual features on a large dataset with a loss that attempts to preserve the information flowing through the network.
  • Self-supervised learning: for instance, Doersch et al. use the prediction of the relative position of patches in an image as a pretext task, while Noroozi and Favaro train a network to spatially rearrange shuffled patches. These approaches are usually domain-dependent.
  • Generative models: for example, Donahue et al. and Dumoulin et al. have shown that using a GAN together with an encoder yields visual features of competitive quality.

State-of-the-art idea

DeepCluster is a clustering method presented recently by a Facebook AI Research team. The method iteratively groups the features with a standard clustering algorithm, k-means, and uses the resulting assignments as supervision to update the weights of the network. For simplicity, the researchers focused their study on k-means, but other clustering approaches, such as Power Iteration Clustering (PIC), can also be used.

Images and their 3 nearest neighbors: query → results from a randomly initialized network → results from the same network after training with DeepCluster PIC

Such an approach has a significant advantage over self-supervised methods, as it doesn’t require specific signals from the inputs or extensive domain knowledge. As we will see later, DeepCluster achieves significantly higher performance than previously published unsupervised methods.

Let’s now have a closer look at the design of this model.

Method overview

The performance of randomly initialized convolutional networks is intimately tied to their convolutional structure, which gives a strong prior on the input signal. The idea of DeepCluster is to exploit this weak signal to bootstrap the discriminative power of a ConvNet.

As illustrated below, the method alternates between clustering the deep features and using the cluster assignments as pseudo-labels to learn the parameters of the ConvNet.

Illustration of the proposed method
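
To make the alternation concrete, here is a minimal sketch of one DeepCluster epoch in PyTorch-style code. It is an illustration rather than the authors’ implementation: the names (`model`, `dataloader`) and hyperparameters are placeholders, and practical details such as PCA reduction and whitening of the features before k-means are omitted.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deepcluster_epoch(model, dataloader, k=10000, device="cpu"):
    # 1) Extract features for the whole (unlabeled) dataset with the current ConvNet.
    model.eval()
    feats = []
    with torch.no_grad():
        for images in dataloader:                 # assumes a non-shuffled loader of images
            feats.append(model(images.to(device)).cpu().numpy())
    feats = np.concatenate(feats)

    # 2) Cluster the features; the cluster ids become pseudo-labels.
    pseudo_labels = KMeans(n_clusters=k).fit_predict(feats)

    # 3) Train the ConvNet plus a freshly re-initialized classification head
    #    to predict the pseudo-labels (a standard supervised training step).
    classifier = nn.Linear(feats.shape[1], k).to(device)
    params = list(model.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.SGD(params, lr=0.05, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    model.train()
    seen = 0
    for images in dataloader:
        targets = torch.as_tensor(pseudo_labels[seen:seen + images.size(0)]).long().to(device)
        seen += images.size(0)
        loss = criterion(classifier(model(images.to(device))), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return pseudo_labels
```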

This type of alternating procedure is prone to trivial solutions, which we will briefly discuss:

  • Empty clusters. Automatically reassigning empty clusters during the k-means optimization solves this problem.
  • Trivial parametrization. If the vast majority of images is assigned to a few clusters, the parameters will exclusively discriminate between them. The solution is to sample images according to a uniform distribution over the classes, or pseudo-labels (see the sampling sketch below).
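
A simple way to implement this uniform sampling over pseudo-labels (a sketch, not necessarily how the authors implemented it) is to weight each image by the inverse size of its cluster:

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def uniform_pseudo_label_sampler(pseudo_labels):
    """Sample images so that every cluster (pseudo-label) is drawn equally often."""
    pseudo_labels = np.asarray(pseudo_labels)
    cluster_sizes = np.bincount(pseudo_labels)
    weights = 1.0 / cluster_sizes[pseudo_labels]     # inverse cluster size per image
    return WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                 num_samples=len(pseudo_labels),
                                 replacement=True)
```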

DeepCluster is based on a standard AlexNet architecture with five convolutional layers and three fully connected layers. To remove color and increase local contrast, the researchers apply a fixed linear transformation based on Sobel filters.
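
The Sobel pre-processing can be expressed as a small fixed (non-trainable) convolution. The snippet below is an illustrative assumption of what such a transform looks like (grayscale averaging followed by horizontal and vertical Sobel kernels), not the authors’ exact code:

```python
import torch
import torch.nn as nn

class SobelFilter(nn.Module):
    """Fixed (non-trainable) transform: RGB -> grayscale -> 2-channel Sobel edges."""
    def __init__(self):
        super().__init__()
        self.gray = nn.Conv2d(3, 1, kernel_size=1, bias=False)
        self.gray.weight.data.fill_(1.0 / 3.0)            # simple average over RGB
        self.sobel = nn.Conv2d(1, 2, kernel_size=3, padding=1, bias=False)
        gx = torch.tensor([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])
        self.sobel.weight.data[0, 0] = gx                  # horizontal gradients
        self.sobel.weight.data[1, 0] = gx.t()              # vertical gradients
        for p in self.parameters():
            p.requires_grad = False                        # keep the transform fixed

    def forward(self, x):
        return self.sobel(self.gray(x))
```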

So, the model doesn’t look complicated, but let’s check its performance on the ImageNet classification and transfer tasks.

Results

The results of preliminary studies are demonstrated below:

  • (a) the evolution of the Normalized Mutual Information (NMI) between the cluster assignments and the ImageNet labels during training;
  • (b) the evolution of the model’s stability across the training epochs;
  • (c) the impact of the number of clusters k on the quality of the model (k = 10,000 gives the best performance).
Preliminary studies
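
NMI, the metric tracked in panel (a), measures how much information one assignment shares with the other: 0 means the assignments are independent, 1 means one can be perfectly predicted from the other. It can be computed directly with scikit-learn; a toy example:

```python
from sklearn.metrics import normalized_mutual_info_score

# Toy cluster assignments vs. ground-truth labels: identical up to a relabeling,
# so the NMI is 1.0 even though the label ids differ.
cluster_assignments = [0, 0, 1, 1, 2, 2]
true_labels         = [1, 1, 0, 0, 2, 2]
print(normalized_mutual_info_score(true_labels, cluster_assignments))  # -> 1.0
```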

To assess the quality of a target filter, the researchers learn an input image that maximizes its activation. The figure below shows these synthetic filter visualizations and the top 9 activated images from a subset of 1 million images from YFCC100M.

Filter visualization and top 9 activated images for target filters in the layers conv1, conv3, and conv5 of an AlexNet trained with DeepCluster

Deeper layers in the network seem to capture larger textural structures. However, it looks like some filters in the last convolutional layers merely replicate the texture already captured in the previous layers.

Below are the results for the last convolutional layer, this time using a VGG-16 architecture instead of AlexNet.

Filter visualization and top 9 activated images for target filters in the last convolutional layer of VGG-16 trained with DeepCluster

As you can see, the filters, learned without any supervision, can capture quite complex structures.

The next figure shows the top 9 activated images of some filters that appear to be semantically coherent. The filters in the top row respond to structures that are highly correlated with object classes, while the filters in the bottom row seem to trigger on style.

Top 9 activated images for target filters in the last convolutional layer

Comparison

To compare DeepCluster to other methods, the researchers train a linear classifier on top of different frozen convolutional layers. The table below reports the classification accuracy of different state-of-the-art approaches on the ImageNet and Places datasets.
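
A linear probe of this kind is straightforward to sketch: freeze the ConvNet, extract activations from the chosen layer, and fit a linear classifier on top. The snippet below is a simplified illustration with hypothetical helper names, not the evaluation code used in the paper:

```python
import torch
from sklearn.linear_model import LogisticRegression

def linear_probe(backbone, extract_layer, train_loader, test_loader):
    """Train a linear classifier on frozen features from one convolutional layer.

    `extract_layer(backbone, images)` is a hypothetical helper that returns the
    activations of the layer being probed (e.g., conv3)."""
    backbone.eval()

    def featurize(loader):
        feats, labels = [], []
        with torch.no_grad():
            for images, y in loader:
                feats.append(extract_layer(backbone, images).flatten(1).cpu())
                labels.append(y)
        return torch.cat(feats).numpy(), torch.cat(labels).numpy()

    x_train, y_train = featurize(train_loader)
    x_test, y_test = featurize(test_loader)
    clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    return clf.score(x_test, y_test)   # accuracy with frozen features
```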

On ImageNet, DeepCluster outperforms the state of the art on the conv2 to conv5 layers by 1-6%. The poor performance of the first layer is probably due to the Sobel filtering discarding color. Remarkably, the gap between DeepCluster and a supervised AlexNet is only around 4% at the conv2-conv3 layers, but rises to 12.3% at conv5, showing where AlexNet stores most of the class-level information.

Table 1. Linear classification on ImageNet and Places using activations from the convolutional layers of an AlexNet as features

The same experiment on the Places dataset reveals that DeepCluster yields conv3-conv4 features that are comparable to those trained with ImageNet labels. This implies that when the target task is sufficiently far from the domain covered by ImageNet, labels are less important.

The next table summarizes the comparisons of DeepCluster with other feature learning approaches on the three tasks: classification, detection, and semantic segmentation. As you can see, DeepCluster outperforms all previous unsupervised methods on all three tasks with the most substantial improvement in semantic segmentation.

Table 2: Comparison of the proposed approach to state-of-the-art unsupervised feature learning on classification, detection, and segmentation on Pascal VOC

Bottom Line

DeepCluster proposed by the Facebook AI Research team achieves performance that is significantly better than the previous state of the art on every standard transfer task. What is more, when tested on the Pascal VOC 2007 object detection task with fine-tuning, DeepCluster is only 1.4% below the supervised topline.

This approach makes few assumptions about the inputs and doesn’t require much domain-specific knowledge, which makes it a good candidate for learning deep representations in specific domains where labeled datasets are not available.

Unsupervised Attention-Guided Image-to-Image Translation

30 July 2018

Image-to-image translation is the task of mapping an image from a source domain to a target domain. Applications include image colorization, image super-resolution, style transfer, domain adaptation, and data augmentation. Most approaches require data from the two domains to be paired or aligned, e.g., when translating satellite images to topographic maps, which restricts applications and may not even be possible for some domains. Unsupervised approaches, such as DiscoGAN and CycleGAN, overcome this problem with cyclic losses which encourage the translated image to be faithfully reconstructed when mapped back to the original domain. Existing algorithms feed an input image to an encoder-decoder-like neural network architecture called the generator, which tries to translate the image. This output is then fed to a discriminator, which attempts to tell whether the image is a real sample from the target domain or a translated one.

However, these approaches are limited by the system’s inability to attend only to specific scene objects. In the unsupervised case, where images are not paired or aligned, the network must additionally learn which parts of the scene are intended to be translated. For example, in Figure 1, a convincing translation between the horse and zebra domains requires the network to attend to each animal and change only those parts of the image. This is challenging for existing approaches, even those using a localized loss like PatchGAN, as the network itself has no explicit attention mechanism. Instead, they typically aim to minimize the divergence between the underlying data-generating distributions for the entire image in the source and target domains. To overcome this limitation, a new approach is introduced which minimizes the divergence between only the relevant parts of the data-generating distributions for the source and target domains.

Architecture Design

Data-flow diagram from the source domain S to the target domain T during training

The goal of image translation is to estimate a map F(S→T) from a source image domain S to a target image domain T, based on independently sampled data instances X(S) and X(T), such that the distribution of the mapped instances F(S→T)(X(S)) matches the target distribution P(T). Training the transfer network F(S→T) requires a discriminator D(T) that tries to distinguish the translated outputs from the observed instances X(T). For cycle consistency, the inverse map F(T→S) and the corresponding discriminator D(S) are trained simultaneously. Solving this problem requires solving two equally important tasks:

  • (1) locating the areas to map in each image, and
  • (2) applying the right mapping to the located areas.

To achieve this, two attention networks A(S) and A(T) are introduced, which select the areas to translate by maximizing the probability that the discriminator makes a mistake.

Attention-guided generator

Input images s are fed into the attention network A(S), resulting in the attention map s(a) = A(S)(s). The mapped image s' is then obtained by:

s' = s(a) ⊙ F(S→T)(s(f)) + (1 − s(a)) ⊙ s

The ‘foreground’ object s(f) is obtained via an element-wise product on each RGB channel: s(f) = s(a) ⊙ s. The foreground s(f) is then fed into the generator F(S→T), which maps it to the target domain T. The background image s(b) = (1 − s(a)) ⊙ s is created and added to the masked output of the generator F(S→T), giving the final result s' above.
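
In code, this composition might look as follows (a sketch assuming `attention_net` and `generator` are the trained A(S) and F(S→T); shapes and normalization are illustrative):

```python
def translate(s, attention_net, generator):
    """Attention-guided translation: edit only the attended regions of s."""
    s_a = attention_net(s)               # attention map in [0, 1], shape (N, 1, H, W)
    s_f = s_a * s                        # 'foreground': attended part of the input
    s_b = (1.0 - s_a) * s                # 'background': everything else, kept unchanged
    return s_a * generator(s_f) + s_b    # s' = s(a) * F(S->T)(s(f)) + (1 - s(a)) * s
```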

Loss function: This process is governed by the adversarial energy:

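The exact formulation is given in the paper; assuming the usual log-likelihood GAN objective, it takes roughly this form, where s' is the composed output defined above:

```latex
\mathcal{L}^{s}_{adv}\big(F_{S\to T}, A_S, D_T\big) =
    \mathbb{E}_{t \sim P_T}\!\left[\log D_T(t)\right]
  + \mathbb{E}_{s \sim P_S}\!\left[\log\big(1 - D_T(s')\big)\right]
```

An analogous term is defined for the target-to-source direction.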

Attention-guided discriminator

The attention-guided cycle-consistency loss makes the framework more robust in two ways: (1) it enforces the attended regions in the generated image to conserve content (e.g., pose), and (2) it encourages the attention maps to be sharp (converging towards a binary map), as the cycle-consistency loss of unattended areas will always be zero.
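
For reference, writing F'(S→T) and F'(T→S) for the full attention-composed mappings (the ones that produce s' and its counterpart), the source-side cycle-consistency term is roughly of the standard L1 form:

```latex
\mathcal{L}^{s}_{cyc}(s) =
    \big\| \, s - F'_{T\to S}\big(F'_{S\to T}(s)\big) \, \big\|_{1}
```

Intuitively, regions that the attention maps leave untouched are copied through unchanged by the composition, so they contribute nothing to this term, which is the second property listed above.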

The final energy is obtained by combining the adversarial and cycle-consistency losses for both the source and target domains:

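Assuming the standard weighting of the cycle-consistency term by a constant λ(cyc), this combination takes roughly the form:

```latex
\mathcal{L} =
    \mathcal{L}^{s}_{adv} + \mathcal{L}^{t}_{adv}
  + \lambda_{cyc}\left( \mathcal{L}^{s}_{cyc} + \mathcal{L}^{t}_{cyc} \right)
```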

With a continuous attention map, the discriminator may receive ‘fractional’ pixel values, which may be close to zero early in training. While the generator benefits from being able to blend pixels at object boundaries, multiplying real images by these fractional values causes the discriminator to learn that mid gray is ‘real’ (i.e., the answer is pushed towards the midpoint 0 of the normalized [−1,1] pixel space). The attention map applied to the discriminator inputs is therefore defined as follows:

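A minimal sketch of the idea, assuming a simple hard threshold (both the rule and the value of `tau` below are placeholders rather than the paper’s exact choice):

```python
def mask_for_discriminator(image, attention_map, tau=0.1):
    """Binarize the attention map before masking the discriminator's inputs,
    so that real images are never multiplied by fractional ('mid gray') values."""
    binary_mask = (attention_map > tau).float()   # hard 0/1 mask instead of soft values
    return binary_mask * image
```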

Thus, the updated adversarial energy L(adv) is as follows:

Equation: updated adversarial energy with attention-masked discriminator inputs

Results

Fréchet Inception Distance (FID) is used to evaluate the image translation framework. FID computes the Fréchet distance between feature representations of real and generated images. Such feature representations are extracted from the last hidden layer of the Inception architecture. This approach achieves the lowest FID in all but one mapping, with CycleGAN as the next best performing approach. UNIT achieves the second-lowest FID value, which suggests that the latent space assumption is useful in this setting. The code can be found here.
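
Concretely, with (μ_r, Σ_r) and (μ_g, Σ_g) the mean and covariance of the Inception features of real and generated images, the distance is:

```latex
\mathrm{FID} = \left\lVert \mu_r - \mu_g \right\rVert_2^{2}
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values indicate that the generated images are statistically closer to the real ones.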

Fréchet Inception Distance for different algorithms

While modern unsupervised image-to-image translation techniques can map the relevant image regions, they also inadvertently map irrelevant ones. As a result, the generated images fail to look realistic, because the background and foreground are generally not blended appropriately. By incorporating an attention mechanism into unsupervised image-to-image translation, this approach demonstrates significant improvements in the quality of the generated images.

Fréchet Inception Distance results

Apples to oranges results

Bonus — results for ablation experiments

By only adopting the holistic image discriminator (‘Ours–D’), the attention networks start to focus on the background as shown in the bottom row:

results for ablation experiments