OpenAI released Sparse Transformers, deep neural network models that set new records in predicting what comes next in a sequence. Researchers from OpenAI propose an improvement to the concept of “attention” in Transformer models so that they can model longer sequences of text, images, or sound.
Transformer models were proposed by researchers at Google in 2017, who introduced the concept of self-attention to model long-range dependencies and relationships between words in a sentence. Sparse Transformers algorithmically improve on this self-attention mechanism.
The researchers argue that self-attention is computationally expensive, especially for rich data types: because every output position attends to every input position, each layer must compute an attention matrix whose size grows quadratically with the sequence length. The researchers from OpenAI propose a novel approach with sparse attention patterns, where each output position computes weightings from only a subset of input positions.
This makes the attention computation tractable for longer sequences, since each position uses a small subset of the inputs instead of all of them. According to the researchers, visualizing the attention patterns learned by deep Transformers on images revealed interpretable and structured sparsity, suggesting that full attention is not needed at every position.
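As an illustration, the sketch below builds the kind of “strided” sparse attention mask the paper describes, where each position attends to a local window of recent positions plus every stride-th earlier position. This is a minimal NumPy sketch, not the released implementation; the function name and `stride` parameter are illustrative.

```python
import numpy as np

def strided_sparse_mask(seq_len: int, stride: int) -> np.ndarray:
    """Boolean mask where mask[i, j] == True means position i may attend to j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        for j in range(i + 1):               # causal: only current and past positions
            local = (i - j) < stride         # a window of recent positions
            strided = (i - j) % stride == 0  # every stride-th earlier position
            mask[i, j] = local or strided
    return mask

mask = strided_sparse_mask(seq_len=16, stride=4)
print(mask.sum(axis=1))  # number of attended positions per output position
```

With a stride of roughly the square root of the sequence length, each row of the mask contains on the order of √n allowed positions instead of n, which is what brings the overall cost of attention down from quadratic.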
The proposed model was evaluated as a generative model for images: the researchers applied the Sparse Transformer to the task of image completion and showed that the model is able to learn the global structure of an image.
To show that the model can learn long-range dependencies in different kinds of data, the researchers also used it to generate raw audio waveforms. They mention that Sparse Transformers can be adapted from generating images to generating raw audio simply by changing the position embeddings.
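A minimal sketch of that idea follows, with illustrative names and random vectors standing in for learned parameters (the released code differs): images use per-row and per-column position embeddings, raw audio uses a single per-timestep embedding, and both flatten into the same (sequence length, model dimension) shape, so the Transformer body itself stays unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # model embedding size (illustrative)

def image_position_embeddings(height: int, width: int) -> np.ndarray:
    # One vector per row index and per column index; a pixel's position
    # embedding is the sum of its row and column vectors.
    rows = rng.normal(size=(height, 1, d_model))
    cols = rng.normal(size=(1, width, d_model))
    return (rows + cols).reshape(height * width, d_model)

def audio_position_embeddings(n_steps: int) -> np.ndarray:
    # Raw audio only needs one vector per timestep.
    return rng.normal(size=(n_steps, d_model))

image_pos = image_position_embeddings(32, 32)  # -> (1024, 64)
audio_pos = audio_position_embeddings(1024)    # -> (1024, 64)
# Both produce a (sequence_length, d_model) array, so the same
# Transformer body can consume either modality unchanged.
print(image_pos.shape, audio_pos.shape)
```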
Despite the promising initial results, the researchers note that this approach still seems impractical for generating very high-resolution images or video.
The implementation of the proposed Sparse Transformer model has been open-sourced and is available on GitHub. Samples of the generated audio waveforms can be found in the official blog post, and the paper is available on arXiv.