Welcome to PyThresh Documentation


PyThresh is a comprehensive and scalable Python toolkit for thresholding outlier detection likelihood scores in univariate and multivariate data. It was written to work in tandem with PyOD and has similar syntax and data structures, but it is not limited to that library.

PyThresh thresholds the likelihood scores generated by an outlier detector, which removes the need to set a contamination level or to guess beforehand how many outliers the dataset contains. These non-parametric methods were written to reduce user input and guesswork by relying on statistics instead.

For thresholding to be applied correctly, the outlier detection likelihood scores must follow one rule: the higher the score, the higher the probability that the point is an outlier in the dataset. All threshold functions return a binary array in which inliers are represented by 0 and outliers by 1.
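The orientation rule matters when scores come from a detector outside PyOD. As a minimal sketch, assuming a hypothetical detector whose scores are lower for more anomalous points, the scores can be negated before thresholding. The helper `to_outlier_scores` and the sample values are illustrative assumptions, not part of PyThresh.

```python
import numpy as np

def to_outlier_scores(raw_scores, higher_is_outlier=True):
    """Return scores oriented so that larger values mean 'more outlying'.

    Hypothetical helper for illustration; not a PyThresh function.
    """
    scores = np.asarray(raw_scores, dtype=float)
    # Negate scores from detectors where lower values mean more anomalous
    return scores if higher_is_outlier else -scores

# Made-up scores where the most negative value is the most anomalous
raw = [0.12, 0.10, -0.45, 0.08]
oriented = to_outlier_scores(raw, higher_is_outlier=False)
# After negation, the anomalous point (index 2) carries the largest score
```

Negation preserves the ranking of points while reversing its direction, which is all the rule requires.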

PyThresh includes more than 30 thresholding algorithms. These algorithms range from using simple statistical analysis like the Z-score to more complex mathematical methods that involve graph theory and topology.
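To illustrate the simple end of that range, the sketch below applies a one-sided Z-score rule to a set of scores using only NumPy. It is not PyThresh's implementation: the function name `zscore_labels`, the cutoff of 2.0, and the sample scores are assumptions chosen for the example.

```python
import numpy as np

def zscore_labels(scores, cutoff=2.0):
    """Label a score as an outlier (1) when its Z-score exceeds `cutoff`.

    Illustrative sketch only; the cutoff value is an assumption.
    """
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        # All scores identical: nothing stands out
        return np.zeros(scores.shape, dtype=int)
    z = (scores - scores.mean()) / std
    # One-sided test, since only high scores indicate outliers
    return (z > cutoff).astype(int)

scores = [0.10, 0.20, 0.15, 0.12, 0.18, 0.95]
labels = zscore_labels(scores)  # binary array: 0 = inlier, 1 = outlier
```

The output follows the same convention as the thresholders described above: a binary array with 1 marking outliers.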

API Demo:

# train the KNN detector
from pyod.models.knn import KNN
from pythresh.thresholds.clust import CLUST

# X_train is assumed to be an (n_samples, n_features) array defined beforehand
clf = KNN()
clf.fit(X_train)

# get outlier scores
decision_scores = clf.decision_scores_  # raw outlier scores on the train data

# get outlier labels
thres = CLUST()
labels = thres.eval(decision_scores)
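Because the returned labels are a binary array, the flagged points and the implied contamination level can be read off directly. The sketch below uses a hand-written stand-in for `labels` so that it runs without a fitted detector; the values are made up for illustration.

```python
import numpy as np

# Stand-in for the binary array returned by thres.eval(decision_scores)
labels = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

# Positions of the samples flagged as outliers
outlier_idx = np.flatnonzero(labels)

# Fraction of samples flagged, i.e. the contamination level implied by the threshold
contamination = labels.mean()
```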

Benchmarking & Utilities

All of the thresholders have been benchmarked. The MIXMOD thresholder performed best, while the CLF thresholder showed the smallest uncertainty about its mean and is the most robust (its least accurate prediction was the best among the thresholders). For interpretability and general performance, the MIXMOD, FILTER, and META thresholders are good fits.

Further utilities are available to assist in selecting the best outlier detection and thresholding methods (ranking) and in determining the confidence in the selected thresholding method (thresholding confidence).


External Feature Cases

Towards Data Science: Thresholding Outlier Detection Scores with PyThresh

Towards Data Science: When Outliers are Significant: Weighted Linear Regression

ArXiv: Estimating the Contamination Factor’s Distribution in Unsupervised Anomaly Detection


Available Thresholding Algorithms

Abbr      Description                                 References

AUCP      Area Under Curve Percentage                 [RYZ+18]
BOOT      Bootstrapping                               [MR06]
CHAU      Chauvenet’s Criterion                       [BU75]
CLF       Trained Linear Classifier                   [Agg17]
CLUST     Clustering Based                            [KR08]
CPD       Change Point Detection                      [FR16]
DECOMP    Decomposition                               [BP02]
DSN       Distance Shift from Normal                  [AOH21]
EB        Elliptical Boundary                         [FMF13]
FGD       Fixed Gradient Descent                      [QJC21]
FILTER    Filtering Based                             [HGRR19]
FWFM      Full Width at Full Minimum                  [Jon13]
GAMGMM    Bayesian Gamma GMM                          [PBurknerK23]
GESD      Generalized Extreme Studentized Deviate     [Alr21]
HIST      Histogram Based                             [TVAJS15]
IQR       Inter-Quartile Region                       [BD15]
KARCH     Karcher mean (Riemannian Center of Mass)    [AFS11]
MAD       Median Absolute Deviation                   [NP15]
MCST      Monte Carlo Shapiro Tests                   [Coi08]
META      Metamodel Trained Classifier                [ZRA20]
MIXMOD    Normal & Non-Normal Mixture Models          [vV23]
MOLL      Friedrichs’ Mollifier                       [KS97]
MTT       Modified Thompson Tau Test                  [RRF20]
OCSVM     One-Class Support Vector Machine            [BCB22]
QMCD      Quasi-Monte Carlo Discrepancy               [IRRN19]
REGR      Regression Based                            [Agg17]
VAE       Variational Autoencoder                     [XYA20]
WIND      Topological Winding Number                  [JKSH13]
YJ        Yeo-Johnson Transformation                  [RR21]
ZSCORE    Z-score                                     [BP20]
COMB      Thresholder Combination

A comparison of the implemented models is shown below (Figure). For the Jupyter Notebook, please navigate to “/notebooks/Compare All Thesholders.ipynb”.

Comparison of selected models

API Cheatsheet & Reference

The following APIs apply to all thresholders for ease of use.

  • pythresh.thresholders.base.BaseDetector.eval(): evaluate a single set or multiple sets of outlier detection likelihood scores
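When several score sets are evaluated together, the attribute list in this section describes a TruncatedSVD decomposition that reduces them to a 1D array. The sketch below shows the general idea with NumPy's SVD rather than PyThresh's own code. The helper `combine_score_sets`, the column-wise min-max scaling, and the sign correction are all assumptions made for illustration.

```python
import numpy as np

def combine_score_sets(score_sets):
    """Project several score sets onto their first singular direction.

    Illustrative sketch; the preprocessing PyThresh applies may differ.
    """
    # One column per detector: shape (n_samples, n_detectors)
    X = np.column_stack([np.asarray(s, dtype=float) for s in score_sets])
    # Assumed preprocessing: scale each column to [0, 1]
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    X = (X - X.min(axis=0)) / span
    # First right singular vector, as a one-component truncated SVD would use
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    first = vt[0]
    # The sign of a singular vector is arbitrary; keep higher = more outlying
    if first.sum() < 0:
        first = -first
    return X @ first

# Made-up scores from two hypothetical detectors on the same four samples
scores_a = [0.10, 0.20, 0.90, 0.15]
scores_b = [1.0, 1.5, 7.0, 1.2]
combined = combine_score_sets([scores_a, scores_b])  # 1D array, one value per sample
```

The sign correction keeps the combined scores consistent with the rule that higher scores indicate outliers.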

Key Attributes of a threshold:

  • pythresh.thresholders.base.BaseDetector.thresh_: Return the threshold value that separates inliers from outliers. All values above this threshold are considered outliers. Note that the threshold value is derived from likelihood scores normalized between 0 and 1.

  • pythresh.thresholders.base.BaseDetector.confidence_interval_: Return the lower and upper confidence interval of the contamination level. Only applies to the COMB thresholder.

  • pythresh.thresholders.base.BaseDetector.dscores_: 1D array of the TruncatedSVD decomposed decision scores if multiple outlier detector score sets are passed

  • pythresh.thresholders.mixmod.MIXMOD.mixture_: fitted mixture model class of the selected model used for thresholding. Only applies to MIXMOD. Attributes include: components, weights, params. Functions include: fit, loglikelihood, pdf, and posterior.
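Because thresh_ is expressed on scores normalized between 0 and 1, it cannot be compared with raw detector scores directly. The sketch below maps such a threshold back to the raw scale, assuming the normalization is a plain min-max scaling of the same scores. The helper `to_raw_threshold`, the threshold of 0.5, and the raw scores are hypothetical.

```python
import numpy as np

def to_raw_threshold(thresh_normalized, raw_scores):
    """Convert a threshold on [0, 1] back to the raw score scale.

    Assumes min-max normalization of `raw_scores`; illustrative only.
    """
    raw_scores = np.asarray(raw_scores, dtype=float)
    lo, hi = raw_scores.min(), raw_scores.max()
    return lo + thresh_normalized * (hi - lo)

raw_scores = np.array([2.0, 3.0, 10.0, 2.5])  # made-up raw detector scores
thresh_normalized = 0.5                       # stand-in for a fitted thresh_ value

raw_thresh = to_raw_threshold(thresh_normalized, raw_scores)
# Values above the threshold are outliers, matching the thresh_ description
labels = (raw_scores > raw_thresh).astype(int)
```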



References