PyOD: a Python library for detecting anomalous data

PyOD is a Python library with more than 30 state-of-the-art algorithms for detecting rare and suspicious data or events. Applications of such anomaly detection methods include fraud detection and security systems.

Detecting anomalies (“outliers”) is usually complicated by the lack of labels for such data, which makes it an unsupervised learning task. Dedicated outlier detection algorithms enable reliable pattern recognition in large, unlabeled data sets. PyOD covers a broad range of anomaly detection algorithms, from classic ones such as Isolation Forest to recent deep learning techniques and newer algorithms such as COPOD. The algorithms included in PyOD are often used in research. The library is easy to use and has unified documentation with examples.

All algorithms are initialized with a numerical parameter (contamination) in the range from 0 to 0.5 (0.1 by default) that specifies the expected share of outliers in the data. It is used to set the threshold on the anomaly scores when the model is fitted to a specific data set. The library includes the following classes of algorithms:

linear models, in particular, PCA and One-Class SVM;
proximity-based models that measure distances between data points: points that are close to each other are more likely to be normal, and points that lie far from the rest are more likely to be anomalous;
probabilistic models that use statistical distributions to identify outliers;
ensemble models that combine many base models to detect isolated points (one such algorithm is Isolation Forest);
neural networks: autoencoders, including variational ones, can be trained to recognize anomalies in unlabeled data.
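
To make the outlier-share parameter and the proximity-based idea from the list above concrete, here is a minimal sketch in plain Python. It is not PyOD's code: the function names knn_scores and label_by_contamination are hypothetical, the score is simply the distance to the k-th nearest neighbour, and the threshold is a simple rank-based cut-off that may differ in detail from what the library does internally.

```python
import math


def knn_scores(points, k=3):
    """Anomaly score of each point: distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def label_by_contamination(scores, contamination=0.1):
    """Flag the highest-scoring `contamination` share of points as outliers.

    Returns (labels, threshold), where labels use 1 for outlier, 0 for normal.
    """
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    n_outliers = max(1, round(contamination * len(scores)))
    # The threshold is the n_outliers-th largest score.
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    labels = [1 if s >= threshold else 0 for s in scores]
    return labels, threshold


# Toy data: a tight 3x3 grid of "normal" points plus one distant point.
points = [(x, y) for x in (0, 1, 2) for y in (0, 1, 2)] + [(10, 10)]

scores = knn_scores(points, k=3)
labels, threshold = label_by_contamination(scores, contamination=0.1)
print(labels)  # only the last (distant) point is flagged
```

With ten points and an expected outlier share of 0.1, exactly one point is flagged. Raising the share lowers the threshold and flags more points, which is why the value should reflect what is known about the data.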

In this last class, autoencoders are trained to compress the data and then reconstruct it. Data points with a large reconstruction error are candidates for anomalies. Recently, several generative adversarial network architectures have also been proposed for anomaly detection (for example, MO_GAAL).
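
The reconstruction-error idea can be illustrated without a neural network. The sketch below is an assumption-laden stand-in, not an autoencoder and not PyOD code. It swaps the trained network for the simplest possible compress-and-restore step: a linear projection of 2-D points onto their main direction of variation (as in PCA), followed by mapping the single code value back to 2-D. The function name reconstruction_errors is hypothetical.

```python
import math


def reconstruction_errors(points):
    """Compress 2-D points to 1 number each, restore them, return the errors."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Entries of the 2x2 covariance matrix.
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n

    # Direction of largest variance (closed form for the 2-D case).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)

    errors = []
    for x, y in centered:
        code = x * ux + y * uy          # "encode": two numbers -> one
        rx, ry = code * ux, code * uy   # "decode": one number -> two
        errors.append(math.hypot(x - rx, y - ry))
    return errors


# Toy data: ten points on the line y = x plus one point far off that line.
points = [(i, i) for i in range(10)] + [(2, 7)]

errors = reconstruction_errors(points)
suspect = errors.index(max(errors))
print(suspect, points[suspect])  # index and coordinates of the worst-reconstructed point
```

Points that follow the dominant pattern are restored almost exactly, while the off-line point loses most of its information in the one-number code and gets a much larger error. A real autoencoder learns a nonlinear version of the same encode and decode steps.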

One way to build a more reliable outlier detection model (and to avoid committing to a single algorithm, since none of them is universal) is to combine several models into an ensemble. PyOD supports this as well and lets you aggregate the anomaly scores produced by multiple algorithms.
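
As a rough illustration of score aggregation, the sketch below combines the outputs of three imaginary detectors in plain Python. The score values are made up for the example, the function names standardize and combine are hypothetical, and the approach shown (z-score each detector's scores, then take the average or the maximum per data point) is one common choice rather than a description of PyOD's internals. Standardizing first matters because detectors report scores on very different scales.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Rescale one detector's scores to zero mean and unit variance."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]


def combine(score_lists, how="average"):
    """Aggregate several detectors' scores into one score per data point."""
    if how not in ("average", "maximum"):
        raise ValueError("how must be 'average' or 'maximum'")
    z = [standardize(scores) for scores in score_lists]
    aggregate = mean if how == "average" else max
    # zip(*z) regroups the scores so each tuple belongs to one data point.
    return [aggregate(per_point) for per_point in zip(*z)]


# Made-up scores from three detectors for the same four data points.
detector_a = [0.1, 0.2, 0.1, 0.9]
detector_b = [10.0, 12.0, 11.0, 30.0]
detector_c = [1.0, 5.0, 1.2, 4.0]   # this one disagrees about point 1

combined = combine([detector_a, detector_b, detector_c], how="average")
most_anomalous = combined.index(max(combined))
print(most_anomalous)
```

Averaging rewards points that most detectors agree on, so the single dissenting detector does not change the result here. Taking the maximum instead is more sensitive to any one detector raising an alarm.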