Object detectors based on deep convolutional neural networks have performed remarkably well over the past few years. Most of these models are fully convolutional networks that rely on one handcrafted final step: non-maximum suppression (NMS), which removes duplicate detections. It is precisely this NMS post-processing that prevents these object detectors from being trained fully end-to-end.
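To make the role of NMS concrete, here is a minimal sketch of the classic greedy variant in plain Python. The box format `(x1, y1, x2, y2)`, the function names, and the 0.5 IoU threshold are assumptions chosen for illustration; production detectors use optimized library implementations.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

The sort and the hard keep-or-discard decision run after the network has produced its outputs, so no gradient flows through this step. That is why it sits outside end-to-end training.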
In a recent paper, researchers from Megvii Technology and Xi’an Jiaotong University proposed a fully convolutional deep neural network that can be trained end-to-end. Their method rests on the observation that a proper label assignment lets the network learn to output a single prediction per object. The researchers proposed POTO (Prediction-aware One-to-One label assignment), in which labels are assigned dynamically during training according to the quality of the predictions. They also introduced a new module called 3D Max Filtering, whose sole purpose is to suppress duplicate predictions. It is essentially a multi-scale max filter that transforms the features at each scale. The architecture of the proposed method is given in the diagram below.
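The idea of a multi-scale max filter can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `resize_nearest` and `max_filter_3d`, the nearest-neighbour resizing, the parameters `scale_range` and `kernel`, and the toy score maps are all hypothetical. For each scale, neighbouring scales are resized to the same resolution and every position is replaced by the maximum over a small spatial window and across those scales.

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2D list (an assumption of this sketch)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def max_filter_3d(pyramid, scale_range=1, kernel=3):
    """Max over a kernel x kernel window and over neighbouring scales."""
    r = kernel // 2
    out = []
    for s, ref in enumerate(pyramid):
        h, w = len(ref), len(ref[0])
        lo, hi = max(0, s - scale_range), min(len(pyramid) - 1, s + scale_range)
        # Bring the neighbouring scales to this scale's resolution.
        stack = [resize_nearest(pyramid[k], h, w) for k in range(lo, hi + 1)]
        filtered = []
        for y in range(h):
            row = []
            for x in range(w):
                row.append(max(
                    level[yy][xx]
                    for level in stack
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))
                ))
            filtered.append(row)
        out.append(filtered)
    return out


# Toy two-level pyramid of score maps (made-up values).
fine = [[0.1, 0.2, 0.1, 0.0],
        [0.2, 0.9, 0.3, 0.0],
        [0.1, 0.3, 0.2, 0.0],
        [0.0, 0.0, 0.0, 0.1]]
coarse = [[0.6, 0.1],
          [0.1, 0.2]]
pyramid = [fine, coarse]
filtered = max_filter_3d(pyramid)

# Illustration only: positions whose score equals the filtered maximum.
peaks = [(s, y, x)
         for s, level in enumerate(pyramid)
         for y, row in enumerate(level)
         for x, v in enumerate(row)
         if v == filtered[s][y][x]]
print(peaks)
```

The peak check at the end is only meant to show why a max filter helps against duplicates: a position whose score is below the filtered maximum has a stronger competitor nearby in space or scale. How the filtered features are consumed by the rest of the network is not covered by this sketch.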
To evaluate the performance of the newly proposed detector, the researchers used two widely known datasets: COCO and CrowdHuman. The results of these experiments, together with the ablation studies the researchers conducted, indicate that the method is competitive with state-of-the-art detectors that use NMS. Compared to other end-to-end object detectors, the new method shows superior performance.
The implementation of the method has been open-sourced and can be found on GitHub. The paper was published on arXiv.