Welcome to PyThresh Documentation
PyThresh is a comprehensive and scalable Python toolkit for thresholding outlier detection likelihood scores in univariate and multivariate data. It is written to work in tandem with PyOD and has similar syntax and data structures, but it is not limited to that library: PyThresh is meant to threshold the likelihood scores generated by any outlier detector. Thresholding these scores replaces the need to set a contamination level or to guess beforehand how many outliers the dataset contains. The methods are non-parametric and were written to reduce user input and guesswork, relying on statistics instead to threshold the outlier likelihood scores. For thresholding to be applied correctly, the likelihood scores must follow one rule: the higher the score, the higher the probability that the point is an outlier in the dataset. All threshold functions return a binary array in which inliers are represented by 0 and outliers by 1.
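The score rule and the label convention described above can be illustrated with a small NumPy sketch. This is not PyThresh code: the helper names `orient_scores` and `to_binary_labels`, the sample values, and the fixed threshold are all hypothetical and only show what "higher score means more outlying" and "0 for inliers, 1 for outliers" look like in practice.

```python
import numpy as np


def orient_scores(scores, higher_is_outlier=True):
    """Return scores oriented so that larger values mean 'more outlying'."""
    scores = np.asarray(scores, dtype=float)
    # Negating reverses the ranking, so the most anomalous point
    # ends up with the largest score.
    return scores if higher_is_outlier else -scores


def to_binary_labels(scores, threshold):
    """Label scores strictly above `threshold` as 1 (outlier), the rest as 0."""
    return (np.asarray(scores, dtype=float) > threshold).astype(int)


# Hypothetical detector output where LOWER means more anomalous.
raw = np.array([0.9, 0.8, 0.85, -2.5, 0.95])
oriented = orient_scores(raw, higher_is_outlier=False)
labels = to_binary_labels(oriented, threshold=0.0)
print(labels)  # [0 0 0 1 0]
```

A detector that already reports higher scores for more anomalous points needs no such flip, and its scores can be passed on unchanged.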
PyThresh includes more than 30 thresholding algorithms. These algorithms range from using simple statistical analysis like the Z-score to more complex mathematical methods that involve graph theory and topology.
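As a rough illustration of the simplest end of that range, the following sketch thresholds scores by their Z-score using NumPy alone. It is an assumption-laden example, not the library's ZSCORE implementation: the function name `zscore_threshold` and the default `cutoff=2.0` are hypothetical choices made for this sketch.

```python
import numpy as np


def zscore_threshold(scores, cutoff=2.0):
    """Flag scores whose z-score exceeds `cutoff` (a hypothetical default)."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        # All scores are identical, so nothing stands out.
        return np.zeros(scores.shape, dtype=int)
    z = (scores - scores.mean()) / std
    # One-sided test: only unusually HIGH scores count as outliers.
    return (z > cutoff).astype(int)


scores = np.array([0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11, 0.95])
labels = zscore_threshold(scores)
print(labels)  # [0 0 0 0 0 0 0 0 1]
```

The test is one-sided because only high scores indicate outliers under the rule that higher scores mean higher outlier probability.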
API Demo:
from pyod.models.knn import KNN
from pythresh.thresholds.clust import CLUST
# train the KNN detector (X_train is assumed to be an array of shape (n_samples, n_features))
clf = KNN()
clf.fit(X_train)
# get outlier scores
decision_scores = clf.decision_scores_  # raw outlier scores on the train data
# get outlier labels (0 = inlier, 1 = outlier)
thres = CLUST()
labels = thres.eval(decision_scores)
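Once a binary label array is available, the contamination level that would otherwise have to be guessed can be read off directly. The sketch below uses a hand-written stand-in for `labels` (hypothetical values, not the output of a real thresholder) so that it runs with NumPy alone.

```python
import numpy as np

# Stand-in for the `labels` array returned by a thresholder (hypothetical values).
labels = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

n_outliers = int(labels.sum())             # number of flagged points
contamination = float(labels.mean())       # fraction of the data flagged as outliers
outlier_idx = np.flatnonzero(labels == 1)  # positions of the flagged points

print(n_outliers, contamination, outlier_idx)  # 2 0.2 [2 6]
```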
Benchmarking & Utilities
Benchmarking has been done on all the thresholders and it was found that the MIXMOD thresholder performed best while the CLF thresholder provided the smallest uncertainty about its mean and is the most robust (best least accurate prediction). However, for interpretability and general performance the MIXMOD, FILTER, and META thresholders are good fits.
Further utilities are available to assist in selecting the best-performing outlier detection and thresholding methods (ranking) and in determining the confidence of the selected thresholding method (thresholding confidence).
External Feature Cases
Towards Data Science: Thresholding Outlier Detection Scores with PyThresh
Towards Data Science: When Outliers are Significant: Weighted Linear Regression
ArXiv: Estimating the Contamination Factor’s Distribution in Unsupervised Anomaly Detection
Available Thresholding Algorithms
| Abbr | Description | References |
|---|---|---|
| AUCP | Area Under Curve Percentage | [RYZ+18] |
| BOOT | Bootstrapping | [MR06] |
| CHAU | Chauvenet’s Criterion | [BU75] |
| CLF | Trained Linear Classifier | [Agg17] |
| CLUST | Clustering Based | [KR08] |
| CPD | Change Point Detection | [FR16] |
| DECOMP | Decomposition | [BP02] |
| DSN | Distance Shift from Normal | [AOH21] |
| EB | Elliptical Boundary | [FMF13] |
| FGD | Fixed Gradient Descent | [QJC21] |
| FILTER | Filtering Based | [HGRR19] |
| FWFM | Full Width at Full Minimum | [Jon13] |
| GAMGMM | Bayesian Gamma GMM | |
| GESD | Generalized Extreme Studentized Deviate | [Alr21] |
| HIST | Histogram Based | [TVAJS15] |
| IQR | Inter-Quartile Region | [BD15] |
| KARCH | Karcher mean (Riemannian Center of Mass) | [AFS11] |
| MAD | Median Absolute Deviation | [NP15] |
| MCST | Monte Carlo Shapiro Tests | [Coi08] |
| META | Metamodel Trained Classifier | [ZRA20] |
| MIXMOD | Normal & Non-Normal Mixture Models | [vV23] |
| MOLL | Friedrichs’ Mollifier | [KS97] |
| MTT | Modified Thompson Tau Test | [RRF20] |
| OCSVM | One-Class Support Vector Machine | [BCB22] |
| QMCD | Quasi-Monte Carlo Discrepancy | [IRRN19] |
| REGR | Regression Based | [Agg17] |
| VAE | Variational Autoencoder | [XYA20] |
| WIND | Topological Winding Number | [JKSH13] |
| YJ | Yeo-Johnson Transformation | [RR21] |
| ZSCORE | Z-score | [BP20] |
| COMB | Thresholder Combination | |
A comparison of the implemented models is available below (Figure). For Jupyter Notebooks, please navigate to “/notebooks/Compare All Thesholders.ipynb”.
API Cheatsheet & Reference
The following APIs apply to all thresholders for ease of use.
pythresh.thresholders.base.BaseDetector.eval()
: evaluate a single outlier or multiple outlier detection likelihood score sets
Key Attributes of a threshold:
pythresh.thresholders.base.BaseDetector.thresh_
: Return the threshold value that separates inliers from outliers. Outliers are considered all values above this threshold value. Note the threshold value has been derived from likelihood scores normalized between 0 and 1.
pythresh.thresholders.base.BaseDetector.confidence_interval_
: Return the lower and upper confidence interval of the contamination level. Only applies to the COMB thresholder.
pythresh.thresholders.base.BaseDetector.dscores_
: 1D array of the TruncatedSVD decomposed decision scores if multiple outlier detector score sets are passed.
pythresh.thresholders.mixmod.MIXMOD.mixture_
: fitted mixture model class of the selected model used for thresholding. Only applies to MIXMOD. Attributes include: components, weights, params. Functions include: fit, loglikelihood, pdf, and posterior.
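Because the threshold value is derived from scores normalized between 0 and 1, it cannot be compared directly with raw detector scores. The sketch below shows one way to relate the two scales, assuming the normalization is a plain min-max rescaling (an assumption; the text above only states that scores are normalized between 0 and 1). The helper names and the example threshold of 0.5 are hypothetical.

```python
import numpy as np


def min_max_normalize(scores):
    """Rescale scores linearly to the [0, 1] range."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # Constant scores carry no ranking information.
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)


def threshold_to_raw_scale(thresh, raw_scores):
    """Map a threshold on the normalized scale back to the raw score scale."""
    raw_scores = np.asarray(raw_scores, dtype=float)
    lo, hi = raw_scores.min(), raw_scores.max()
    return thresh * (hi - lo) + lo


raw = np.array([2.0, 4.0, 6.0, 12.0])
norm = min_max_normalize(raw)                # [0.0, 0.2, 0.4, 1.0]
raw_cut = threshold_to_raw_scale(0.5, raw)   # 0.5 * (12 - 2) + 2 = 7.0

# Thresholding on either scale yields the same labels.
labels = (norm > 0.5).astype(int)
print(raw_cut, labels)  # 7.0 [0 0 0 1]
```

Under this assumption, a point is an outlier on the raw scale exactly when its normalized score exceeds the normalized threshold, since min-max rescaling preserves the ordering of the scores.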
References