Deforming Autoencoders (DAEs) – Learning Disentangled Representations

21 September 2018

Generative Models are drawing a lot of attention within the Machine Learning research community. This kind of model has practical applications in many different domains. Two of the most commonly used and effective recent approaches are Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).

While vanilla autoencoders can learn to generate compact representations and reconstruct their inputs well, they are quite limited when it comes to practical applications. The fundamental problem of standard autoencoders is that the latent space in which they encode the input data distribution may not be continuous and therefore may not allow smooth interpolation. A different type of autoencoder, the Variational Autoencoder (VAE), solves this problem: its latent space is, by design, continuous, allowing easy random sampling and interpolation. This has made VAEs very popular for many different tasks, especially in Computer Vision.

However, controlling and understanding deep neural networks, especially deep autoencoders, is a difficult task, and being able to control what the networks learn is of crucial importance.

Previous works

The problem of feature disentanglement has been explored in the literature for image and video processing and text analysis. Disentangling the factors of variation is necessary for controlling and understanding deep networks, and many attempts have been made to solve this problem.

Past work has explored separating the latent image representation into dimensions that account for different factors of variation: for example identity, illumination and spatial support; low-dimensional transformations such as rotation, translation or scaling; or more descriptive factors of variation such as age, gender or wearing glasses.

State-of-the-art idea

Recently, Zhixin Shu et al. introduced Deforming Autoencoders (DAEs), a generative model for images that disentangles shape from appearance in an unsupervised manner. In their paper, the researchers propose to disentangle shape and appearance by assuming that object instances are obtained by deforming a prototypical object or ‘template’. This means that the object’s variability can be separated into variations associated with spatial transformations linked to the object’s shape, and variations associated with its appearance. As simple as the idea sounds, this kind of disentanglement using deep autoencoders and unsupervised learning proved to be quite powerful.

Method

The proposed method can disentangle shape and appearance as factors of variation in a learned lower-dimensional latent space. The technique employs a deep learning architecture comprising an encoder network that encodes the input image into two latent vectors (one for shape and one for appearance) and two decoder networks that take the latent vectors as input and output the generated texture and deformation, respectively.

The proposed Deforming Autoencoder architecture, comprising one encoder and two decoder networks

Independent decoder networks learn the appearance and deformation functions. The generated spatial deformation is used to warp the texture to the observed image coordinates. In this way, the Deforming Autoencoder can reconstruct the input image while disentangling shape and appearance as separate features. The whole architecture is trained in an unsupervised manner using only a simple image reconstruction loss.
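To make the pipeline concrete, here is a minimal PyTorch sketch of the idea described above, not the authors' implementation: the input resolution (64×64), latent sizes, layer shapes, the choice of an L1 reconstruction loss, and the fact that the deformation decoder outputs the sampling grid directly (the paper parameterises the warp more carefully) are all simplifying assumptions.

```python
# Hedged sketch of a Deforming Autoencoder forward pass (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformingAutoencoder(nn.Module):
    def __init__(self, z_texture=32, z_deform=32):
        super().__init__()
        self.z_texture, self.z_deform = z_texture, z_deform
        # Shared encoder producing both latent vectors (assumes 64x64 RGB inputs)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, z_texture + z_deform),
        )
        # Texture decoder: appearance latent -> RGB texture in template coordinates
        self.texture_dec = nn.Sequential(nn.Linear(z_texture, 3 * 64 * 64), nn.Sigmoid())
        # Deformation decoder: shape latent -> dense 2D sampling grid in [-1, 1]
        self.deform_dec = nn.Sequential(nn.Linear(z_deform, 2 * 64 * 64), nn.Tanh())

    def forward(self, x):
        z = self.encoder(x)
        z_t, z_d = z[:, :self.z_texture], z[:, self.z_texture:]
        texture = self.texture_dec(z_t).view(-1, 3, 64, 64)
        grid = self.deform_dec(z_d).view(-1, 64, 64, 2)
        # Warp the template-space texture to the observed image coordinates
        recon = F.grid_sample(texture, grid, align_corners=False)
        return recon, texture, grid

# Training uses only an image reconstruction loss, e.g. loss = F.l1_loss(recon, x)
```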

In addition to Deforming Autoencoders (DAEs), the researchers propose Class-aware Deforming Autoencoders, which learn to reconstruct an image while disentangling the shape and appearance factors of variation conditioned on the class. To make this possible, they introduce a classifier network that takes a third latent vector, used to encode the class, in addition to the latent vectors for shape and appearance. This architecture allows learning a mixture model conditioned on the class of the input image (rather than a joint multi-modal distribution).
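A hedged sketch of how such a class branch could be wired in, assuming class labels are available for the classification term; the latent size, class count, and loss weighting are illustrative assumptions, not the paper's settings.

```python
# Illustrative class-aware head: a third latent vector encodes the class and is
# fed to a small classifier; its cross-entropy loss is added to the DAE loss.
import torch.nn as nn
import torch.nn.functional as F

class ClassHead(nn.Module):
    def __init__(self, z_class=16, num_classes=10):
        super().__init__()
        self.classifier = nn.Linear(z_class, num_classes)

    def forward(self, z_class_vec):
        return self.classifier(z_class_vec)

# During training (illustrative):
# logits = class_head(z_class_vec)
# loss = recon_loss + F.cross_entropy(logits, labels)
```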

The proposed class-aware Deforming Autoencoder

They show that introducing class-aware learning drastically improves the performance and stability of training. Intuitively, this can be explained as the network learning to separate spatial deformations that differ between classes.

The researchers also propose a Deforming Autoencoder that learns to disentangle albedo and shading (a widespread problem in computer vision) from facial images. They call this architecture the Intrinsic Deforming Autoencoder; it is shown in the picture below.

Intrinsic Deforming Autoencoder (Intrinsic DAE)

Results

The authors show that this method can successfully disentangle shape and appearance while learning to reconstruct an input image in an unsupervised manner, and that class-aware Deforming Autoencoders provide better results in both reconstruction and appearance learning.

Results of the image reconstruction of MNIST images using Deforming Autoencoder

Besides the qualitative evaluation, the proposed Deforming Autoencoder architecture is evaluated quantitatively in terms of landmark localization accuracy. The method was evaluated on:

  1. unsupervised image alignment/appearance inference;
  2. learning semantically meaningful manifolds for shape and appearance;
  3. unsupervised intrinsic decomposition;
  4. unsupervised landmark detection.
Results of the image reconstruction of MNIST images using Class-aware Deforming Autoencoder
Unsupervised alignment on images of palms of left hands. (a) The input images; (b) reconstructed images; (c) texture images warped with the average of the decoded deformation; (d) the average input image; and (e) the average texture


Smooth interpolation of the latent space representation


Comparison with other state-of-the-art methods

The proposed method was evaluated on the MAFL test set, measuring the mean error of unsupervised landmark detection. It outperforms the self-supervised approach proposed by Thewlis et al.

Mean error on unsupervised landmark detection on the MAFL test set.

Conclusion

Lighting interpolation with Intrinsic-DAE

As mentioned previously, being able to disentangle factors of variation can be of crucial importance for many tasks. Disentanglement allows greater control and understanding of deep neural network models, and it may be a key to solving many practical problems. This work introduced Deforming Autoencoders as a specific architecture able to disentangle particular factors of variation, in this case shape and appearance, and the results show that the method does so successfully using only an autoencoder architecture and a reconstruction loss.

How Artificial Intelligence Can Help DJs Deliver a Seamless Mix

16 May 2018

It is a common and good practice among DJs to create and divide playlists by mood (aggressive, soulful, melancholy) and energy (slow, medium, fast) rather than by music genre. In this way, the DJ trades off smooth, clean transitions between tracks against a performance that creates a natural mix of tracks expressing a specific style and bringing its own energy. Good DJs are able to provide a seamless and perceptually smooth transition between two tracks, making a mix of different tracks sound like a single flowing one.

Perhaps one of the most difficult tasks DJs face is transitioning smoothly between songs from different genres. As mentioned before, this is necessary to create a playlist that expresses a certain mood or emotion, rather than just a list of same-genre songs carrying no overall energy.

Recent work in machine learning, or more specifically deep learning, has provided useful methods for solving the problem of smooth transitions between tracks of different genres. In fact, Tijn Borghuis et al. propose a generative method that produces drum patterns which can be used to transition seamlessly between different-genre tracks in the electronic dance music domain.

How it Works

The method is based on deep learning, utilizing Variational Autoencoders (VAEs) and interpolation in the latent space. The music data representation, the architecture, and the interpolation procedure are explained below.

The authors created a dataset of drum patterns from three popular electronic music genres: Electro, Techno and Intelligent Dance Music (IDM), ending up with 1–1.5 hours of music per genre. The final dataset consists of 1,782 drum patterns. Each pattern is represented as a two-dimensional array whose rows correspond to the 6 drum instruments and whose columns correspond to time; since every pattern is 64 steps long, each pattern is a 6 × 64 array.
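As a quick illustration of this representation (the specific groove below is invented, not taken from the dataset), a pattern is simply a 6 × 64 array of velocities:

```python
# Illustrative encoding of one drum pattern: rows = instruments, columns = time
# steps, values = (normalised) MIDI velocities. The groove itself is hypothetical.
import numpy as np

N_INSTRUMENTS, N_STEPS = 6, 64
pattern = np.zeros((N_INSTRUMENTS, N_STEPS), dtype=np.float32)
pattern[0, ::8] = 1.0    # bass drum on every 8th step
pattern[1, 4::8] = 0.8   # snare on the off-beats
pattern[2, ::2] = 0.6    # closed hi-hat eighth notes
```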

Ten sample drum patterns in the EDM dataset. Instruments from the top are (1): bass drum, (2): snare drum, (3): closed hi-hat, (4): open hi-hat, (5): rimshot, (6): cowbell. Pixel intensities correspond to MIDI velocities. Top row: Electro-Funk; mid two rows: IDM; bottom two rows: Techno.

The proposed method takes two music patterns (each represented as a 6×64 array), encodes them with the encoder of a trained VAE, interpolates between their latent representations, and decodes the intermediate codes to produce smooth transition patterns as output.

The idea behind the proposed method is that interpolation in latent space gives far better results than interpolation in the pattern (feature) space. The reason lies in the non-linear mappings from the input to the latent space and from the latent space back to the output: intermediate latent codes decode into genuinely new patterns rather than simple blends. In contrast, taking a weighted average of two patterns directly is just a linear combination in the representation space, which amounts to the well-known crossfade (gradually lowering the volume of one track while raising the volume of the other).
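A minimal sketch of this transition procedure, assuming a trained VAE with hypothetical `encode`/`decode` methods and a simple linear interpolation schedule; none of these names come from the paper's code.

```python
# Encode both patterns, interpolate the latent codes, decode each intermediate code.
import torch

def transition_patterns(vae, pattern_a, pattern_b, n_steps=8):
    with torch.no_grad():
        z_a = vae.encode(pattern_a)          # latent code of the first pattern
        z_b = vae.encode(pattern_b)          # latent code of the second pattern
        outputs = []
        for t in torch.linspace(0.0, 1.0, n_steps):
            z_t = (1 - t) * z_a + t * z_b    # interpolate in latent space
            outputs.append(vae.decode(z_t))  # decode to a 6x64 drum pattern
    return outputs

# By contrast, (1 - t) * pattern_a + t * pattern_b in the 6x64 pattern space
# is just a crossfade between the two patterns.
```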

Overview of the encoding/decoding procedure to generate smooth transitioning between patterns.

The encoder of the Variational Autoencoder described in the paper has three main parts: the input, recurrent layers (bi-directional LSTM layers with tanh nonlinearities), and fully-connected layers that map the input to a latent representation, a vector of size 4. The decoder consists only of dense (fully-connected) upsampling layers, each followed by a ReLU activation, and outputs a reconstruction of the same shape (6×64).
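A rough PyTorch sketch of an architecture matching this description; the hidden sizes, number of decoder layers, and class name are assumptions, only the bi-directional LSTM encoder, the 4-dimensional latent code, and the dense ReLU decoder follow the text.

```python
# Hedged sketch of the drum-pattern VAE described above (sizes are illustrative).
import torch
import torch.nn as nn

class DrumVAE(nn.Module):
    def __init__(self, latent_dim=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=6, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(2 * hidden, latent_dim)
        self.to_logvar = nn.Linear(2 * hidden, latent_dim)
        # Dense upsampling decoder with ReLU activations, output shape 6x64
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 6 * 64),
        )

    def encode(self, x):                    # x: (batch, 6, 64)
        seq = x.transpose(1, 2)             # treat time steps as the sequence axis
        _, (h, _) = self.lstm(seq)
        h = torch.cat([h[0], h[1]], dim=1)  # concatenate both LSTM directions
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z):
        return self.decoder(z).view(-1, 6, 64)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterisation
        return self.decode(z), mu, logvar
```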

Encoder Network

This kind of architecture proved to work well enough in the task of generating new music patterns used for transitioning between tracks or for autonomous drumming.

The authors carried out an interesting experiment to assess the originality of the generated patterns: they applied Principal Component Analysis (PCA) to all of the training patterns together with the patterns generated through interpolation.

Decoder Network

They visualize the patterns (both the training and the generated ones) in the space of the first two principal components (which usually preserve a large amount of the variance).

The conclusion is that the interpolation trajectories (the coloured lines in the plot) tend to follow the distribution of the training patterns while introducing new data points; these new points are generated patterns that are genuinely new and original.
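A minimal sketch of this originality check using scikit-learn's PCA; the function name and the choice to fit only on the training patterns are assumptions made for illustration.

```python
# Project training and generated patterns onto the first two principal components.
import numpy as np
from sklearn.decomposition import PCA

def project_patterns(train_patterns, generated_patterns):
    # Each pattern is a 6x64 array; flatten to 384-dimensional vectors.
    X_train = np.stack([p.reshape(-1) for p in train_patterns])
    X_gen = np.stack([p.reshape(-1) for p in generated_patterns])
    pca = PCA(n_components=2).fit(X_train)   # fit on the training patterns
    return pca.transform(X_train), pca.transform(X_gen)
```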

A plot of the patterns in the 2D space of the first two principal components

What About The Music?

The authors show that deep learning, and especially generative models, have enormous potential in the field of music production. From the conducted experiments, they show that their method, and deep learning in general, can be efficiently employed in the process of producing music patterns. Moreover, it can support the creation of musical flow by generating patterns for transitioning between songs, even across genres. They showed that the newly generated patterns are musically appealing and original. Taking into account the speed at which AI is developing and advancing, it seems only a matter of time before AI will be capable of producing pleasing music even without any human intervention.

Dane Mitriev

Neural Network Compresses Images Without Loss Of Quality

24 April 2018

Graphical information requires a lot of storage resources, and the main challenge is to compress images without losing quality.

Computer Vision Laboratory Solution

A group of researchers from a Swiss Computer Vision laboratory proposed a way to process images and video that significantly reduces the amount of memory required to store graphical information. They report that the approach targets images at low bit rates (the number of bits used to store one pixel of the image) and can save up to 67% of memory compared to the BPG method.


The developers decided not to store the entire image, but only its most significant parts; the rest is reconstructed when the data is decoded.

For example, when we watch a video of a walking person, we primarily focus on that person; the background does not really matter if nothing extraordinary is happening there.

The insignificant areas cannot be generated from scratch. Instead, a semantic label map is created from the original image; it records that one part of the image contains, for example, green foliage, another part asphalt, and so on.

The solution became possible thanks to the use of deep neural networks (DNNs) as image compression systems, the same kind of networks we encounter when searching for information or translating text with Google and Yandex.

GAN Modes of Operation

Generative adversarial networks (GANs) developed in the Computer Vision laboratory can work in two modes:

a) global generative compression (GC);


b) selective generative compression (SC).


In the scheme, the original image is denoted x and the semantic label map s. In GC mode, the semantic map is optional; in SC mode, it is required.

Other marks on the diagram:

  • E — the image encoder;
  • w — the image code;
  • ŵ — the code after quantization (a further discretization of the information);
  • G — the generator of the compressed image;
  • x̂ — the compressed (reconstructed) image.

In the SC pipeline, an additional module F is added; it extracts information from the semantic map and tells the generator where to place it.
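In pseudocode terms, the GC pipeline in the notation above looks roughly like the sketch below; `encoder_E`, `quantize`, `generator_G` and `feature_F` are placeholder names for illustration, not the authors' implementation.

```python
# Schematic GC pipeline: x -> E -> w -> quantize -> w_hat -> G -> x_hat
def compress_gc(x, encoder_E, quantize, generator_G):
    w = encoder_E(x)            # E: image -> code w
    w_hat = quantize(w)         # w_hat: quantised code that is actually stored
    x_hat = generator_G(w_hat)  # G: decode/generate the compressed image x_hat
    return w_hat, x_hat

# In SC mode, a feature extractor F additionally maps the semantic map s to
# features that are fed to G together with w_hat, e.g.
# x_hat = generator_G(w_hat, feature_F(s))
```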

The system allows efficient processing and indexing of any compressed images.

Neural Networks

Serious errors in the compressed image occur at bitrates below 0.1 bits per pixel (bpp). As the bitrate approaches zero, it becomes impossible to preserve the full content of the image.

Therefore, it is important to go beyond the standard peak signal-to-noise ratio (PSNR). Adversarial loss functions are considered a promising tool: they capture global semantic information and local texture, and make it possible to obtain high-resolution images from a semantic label map.

Two types of neural networks are used in this technology: autoencoders and recurrent neural networks. They convert the incoming image into a bitstream, which is then compressed with arithmetic or Huffman coding, while the image quality remains the same.

Generative adversarial networks (GANs) have become a popular tool for training neural networks. They produce sharper and more stable images than previous techniques.