Introducing "UDA": Unsupervised Data Augmentation Method That Achieves State-of-the-art Results

Researchers from Google Brain and Carnegie Mellon University have proposed a new data augmentation method that achieves state-of-the-art results on the IMDb dataset (with just 20 labeled examples) and on the semi-supervised CIFAR-10 and SVHN benchmarks.

The novel method, called UDA (Unsupervised Data Augmentation), follows a very simple idea: augmenting unlabeled data samples. Generally, data augmentation is done by modifying existing samples in the dataset and assigning the same label to the new samples (as they are just small perturbations of the same object or scene).
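To make the conventional, supervised form of augmentation concrete, here is a minimal sketch. The toy image, the horizontal_flip transform, and the augment_labeled helper are all hypothetical names chosen for illustration; they are not taken from the paper.

```python
def horizontal_flip(image):
    """Mirror a 2-D image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def augment_labeled(dataset, transform):
    """Return the original (image, label) pairs plus one transformed
    copy of each image that reuses the original label."""
    augmented = list(dataset)
    for image, label in dataset:
        # The perturbed image is assumed to show the same object,
        # so it keeps the label of the image it was derived from.
        augmented.append((transform(image), label))
    return augmented


# Hypothetical toy dataset: one 2x2 "image" with the label "cat".
toy_dataset = [([[0, 1], [2, 3]], "cat")]
augmented_dataset = augment_labeled(toy_dataset, horizontal_flip)
```

Note that this only works because every original sample already has a label to copy, which is exactly what unlabeled data lacks.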

The idea in unsupervised data augmentation is to compute the model's predicted distribution for an unlabeled sample alongside its predicted distribution for the same sample injected with a small amount of noise, and then to enforce that the two distributions are as similar as possible. In the proposed method, the researchers enforce this smoothness in a slightly different way than usual.

Specifically, they minimize the KL divergence between the model's predicted distribution on an unlabeled example and its predicted distribution on an augmented version of that example. The augmented samples are also generated with more sophisticated augmentation methods (instead of simple Gaussian or dropout noise) in order to produce more realistic and diverse samples.
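The consistency term described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names and the example logits are hypothetical, and the direction of the divergence (the prediction on the original example as the reference distribution) is an assumption, since the article does not specify it.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def consistency_loss(logits_original, logits_augmented):
    """Mean KL divergence between predictions on unlabeled examples
    and predictions on their augmented versions (assumed direction:
    original prediction is the reference)."""
    losses = []
    for orig, aug in zip(logits_original, logits_augmented):
        losses.append(kl_divergence(softmax(orig), softmax(aug)))
    return sum(losses) / len(losses)


# Hypothetical model outputs for two unlabeled examples (3 classes each).
logits_original = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
logits_augmented = [[1.5, 0.7, -0.8], [0.1, 0.2, 0.3]]

loss = consistency_loss(logits_original, logits_augmented)
identical_loss = consistency_loss(logits_original, logits_original)
```

The loss is zero when the model predicts the same distribution for an example and its augmented version, and grows as the two predictions drift apart. No labels are needed to compute it, which is what lets the method use unlabeled data.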

The evaluations show that the proposed method achieves state-of-the-art performance on the IMDb dataset using only 20 labeled examples. UDA also outperforms all previous approaches on the CIFAR-10 (with 4,000 labeled examples) and SVHN datasets, where it reduces the error rates of previous state-of-the-art methods by more than 30%.

More details about the proposed unsupervised data augmentation method can be found in the pre-print paper published on arXiv.