Welcome to PyThresh Documentation

PyThresh is a comprehensive and scalable Python toolkit for thresholding outlier detection likelihood scores in univariate and multivariate data. It is written to work in tandem with PyOD and has similar syntax and data structures, but it is not limited to that library: PyThresh can threshold the likelihood scores generated by any outlier detector. Thresholding these scores replaces the need to set a contamination level or to guess beforehand how many outliers the dataset contains. The methods are non-parametric and were written to reduce user input and guesswork by relying on statistics instead.

For thresholding to be applied correctly, the outlier detection likelihood scores must follow this rule: the higher the score, the higher the probability that the point is an outlier in the dataset. All threshold functions return a binary array in which inliers are represented by 0 and outliers by 1.
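To make the score rule and the output convention concrete, here is a minimal pure-Python sketch. It is not PyThresh's implementation: the helper names `labels_from_threshold` and `flip_scores` and the cutoff of 0.5 are hypothetical, chosen only for illustration.

```python
def labels_from_threshold(scores, threshold):
    """Return 1 (outlier) for scores above the threshold, 0 (inlier) otherwise."""
    return [1 if s > threshold else 0 for s in scores]


def flip_scores(scores):
    """Negate scores from a detector where LOWER means more outlying,
    so that higher values indicate outliers, as the rule above requires."""
    return [-s for s in scores]


# Scores that already follow the rule: higher = more likely an outlier.
scores = [0.10, 0.20, 0.15, 0.90, 0.12]
labels = labels_from_threshold(scores, threshold=0.5)  # 0.5 is an arbitrary cutoff
print(labels)  # [0, 0, 0, 1, 0]
```

A thresholder's job is to choose the cutoff from the data itself rather than have the user supply it, as is done by hand here.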

PyThresh includes more than 30 thresholding algorithms. These algorithms range from using simple statistical analysis like the Z-score to more complex mathematical methods that involve graph theory and topology.
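As a rough illustration of the simple end of that range, the sketch below labels scores by their Z-score using only the standard library. It is a textbook version under stated assumptions, not necessarily how PyThresh's ZSCORE thresholder works: the function name `zscore_labels` and the cutoff of 2.0 are hypothetical.

```python
from statistics import mean, pstdev


def zscore_labels(scores, cutoff=2.0):
    """Label a score 1 (outlier) when its Z-score exceeds `cutoff`, else 0.

    `cutoff` is an illustrative assumption, not a PyThresh default.
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation
    if sigma == 0:
        # All scores are identical, so nothing stands out.
        return [0] * len(scores)
    return [1 if (s - mu) / sigma > cutoff else 0 for s in scores]


scores = [1.0, 1.2, 0.9, 1.1, 1.0, 5.0]
labels = zscore_labels(scores)
print(labels)  # [0, 0, 0, 0, 0, 1]
```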

API Demo:

# train the KNN detector
from pyod.models.knn import KNN
from pythresh.thresholds.clust import CLUST

clf = KNN()
clf.fit(X_train)  # X_train: your training data, shape (n_samples, n_features)

# get outlier likelihood scores
decision_scores = clf.decision_scores_

# get outlier labels
thres = CLUST()
thres.fit(decision_scores)

labels = thres.labels_ # or thres.predict(decision_scores)

Benchmarking & Utilities

All the thresholders have been benchmarked. The MIXMOD thresholder performed best, while the CLF thresholder had the smallest uncertainty about its mean and was the most robust (its least accurate prediction was the best among the thresholders). For interpretability and general performance, however, the MIXMOD, FILTER, and META thresholders are good fits.

Further utilities are available to assist in selecting the best-performing outlier detection and thresholding methods (ranking) and to determine the confidence of the selected thresholding method (thresholding confidence).

External Feature Cases

Towards Data Science: Thresholding Outlier Detection Scores with PyThresh

Towards Data Science: When Outliers are Significant: Weighted Linear Regression

ArXiv: Estimating the Contamination Factor’s Distribution in Unsupervised Anomaly Detection

Available Thresholding Algorithms

Abbr	Description	References
AUCP	Area Under Curve Percentage	[RYZ+18]
BOOT	Bootstrapping	[MR06]
CHAU	Chauvenet’s Criterion	[BU75]
CLF	Trained Linear Classifier	[Agg17]
CLUST	Clustering Based	[KR08]
CPD	Change Point Detection	[FR16]
DECOMP	Decomposition	[BP02]
DSN	Distance Shift from Normal	[AOH21]
EB	Elliptical Boundary	[FMF13]
FGD	Fixed Gradient Descent	[QJC21]
FILTER	Filtering Based	[HGRR19]
FWFM	Full Width at Full Minimum	[Jon13]
GAMGMM	Bayesian Gamma GMM	[PBurknerK23]
GESD	Generalized Extreme Studentized Deviate	[Alr21]
HIST	Histogram Based	[TVAJS15]
IQR	Inter-Quartile Region	[BD15]
KARCH	Karcher mean (Riemannian Center of Mass)	[AFS11]
MAD	Median Absolute Deviation	[NP15]
MCST	Monte Carlo Shapiro Tests	[Coi08]
META	Metamodel Trained Classifier	[ZRA20]
MIXMOD	Normal & Non-Normal Mixture Models	[vV23]
MOLL	Friedrichs’ Mollifier	[KS97]
MTT	Modified Thompson Tau Test	[RRF20]
OCSVM	One-Class Support Vector Machine	[BCB22]
QMCD	Quasi-Monte Carlo Discrepancy	[IRRN19]
REGR	Regression Based	[Agg17]
VAE	Variational Autoencoder	[XYA20]
WIND	Topological Winding Number	[JKSH13]
YJ	Yeo-Johnson Transformation	[RR21]
ZSCORE	Z-score	[BP20]
COMB	Thresholder Combination
DUMMY	Dummy Percentile Based
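To give a feel for how a statistic from the table can yield a cutoff, the following sketch applies the common interquartile rule, flagging scores above Q3 + 1.5 × (Q3 − Q1). This is the textbook rule, not necessarily what PyThresh's IQR thresholder does; the function name `iqr_labels` and the multiplier `k` are assumptions for illustration.

```python
from statistics import quantiles


def iqr_labels(scores, k=1.5):
    """Label scores above the upper interquartile fence as outliers (1).

    `k=1.5` is the conventional multiplier, assumed here for illustration.
    """
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    upper_fence = q3 + k * (q3 - q1)
    return [1 if s > upper_fence else 0 for s in scores]


scores = [1, 2, 3, 4, 5, 6, 7, 50]
labels = iqr_labels(scores)
print(labels)  # [0, 0, 0, 0, 0, 0, 0, 1]
```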

Tutorial Notebooks

Notebook	Description
Introduction	Basic intro into outlier thresholding
Advanced Thresholding	Additional thresholding options for more advanced use
Threshold Confidence	Calculating the confidence levels around the threshold point
Outlier Ranking	Assisting in selecting the best performing outlier and thresholding method combo using ranking

A comparison of the implemented models is made available below:

API Cheatsheet & Reference

The following APIs are applicable to all thresholders for easy use.

pythresh.thresholds.base.BaseThresholder.eval(): evaluate a single outlier or multiple outlier detection likelihood score set (legacy method).
pythresh.thresholds.base.BaseThresholder.fit(): fit a thresholder for a single outlier or multiple outlier detection likelihood score set.
pythresh.thresholds.base.BaseThresholder.predict(): predict the binary labels using the fitted thresholder on a single outlier or multiple outlier detection likelihood score set.

Key Attributes of a threshold:

pythresh.thresholds.base.BaseThresholder.thresh_: Return the threshold value that separates inliers from outliers. All values above this threshold are considered outliers. Note that the threshold value is derived from likelihood scores normalized between 0 and 1.
pythresh.thresholds.base.BaseThresholder.labels_: Return a binary array of labels for the fitted thresholder on the fitted dataset.
pythresh.thresholds.base.BaseThresholder.confidence_interval_: Return the lower and upper confidence interval of the contamination level. Only applies to the COMB thresholder.
pythresh.thresholds.base.BaseThresholder.dscores_: 1D array of the TruncatedSVD decomposed decision scores if multiple outlier detector score sets are passed.
pythresh.thresholds.mixmod.MIXMOD.mixture_: Fitted mixture model class of the selected model used for thresholding. Only applies to MIXMOD. Attributes include: components, weights, params. Functions include: fit, loglikelihood, pdf, and posterior.
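The fit/predict pattern and the thresh_ and labels_ attributes above can be pictured with a small stand-in class. This is a hypothetical sketch, not PyThresh code: `ToyThresholder` is not part of the library, and its cutoff rule (the mean of the min-max normalized scores) is an arbitrary choice that only illustrates why thresh_ lives on a 0 to 1 scale.

```python
class ToyThresholder:
    """Hypothetical stand-in mimicking the fit/predict pattern described above."""

    def fit(self, scores):
        # Remember the fitted range so predict() normalizes the same way.
        self._lo, self._hi = min(scores), max(scores)
        normalized = self._normalize(scores)
        # Arbitrary illustrative rule: the cutoff is the mean normalized score.
        self.thresh_ = sum(normalized) / len(normalized)
        self.labels_ = [1 if s > self.thresh_ else 0 for s in normalized]
        return self  # returning self is a choice of this sketch

    def predict(self, scores):
        normalized = self._normalize(scores)
        return [1 if s > self.thresh_ else 0 for s in normalized]

    def _normalize(self, scores):
        # Min-max scale using the range seen during fit().
        span = self._hi - self._lo
        if span == 0:
            return [0.0 for _ in scores]
        return [(s - self._lo) / span for s in scores]


thres = ToyThresholder().fit([2.0, 3.0, 2.5, 12.0])
print(thres.labels_)               # [0, 0, 0, 1]
print(thres.predict([2.2, 11.0]))  # [0, 1]
```

Because the cutoff is computed on normalized scores, thresh_ in this sketch cannot be compared directly with the raw detector scores; the same caution applies to the note on thresh_ above.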

References