Utility Functions

pythresh.utils.rank module

class pythresh.utils.rank.RANK(od_models, thresh, method='model', weights=None)[source]

Bases: object

RANK class for ranking outlier detection and thresholding methods.

Use the RANK class to rank outlier detection and thresholding methods by their ability to provide the best Matthews correlation coefficient with respect to the selected threshold method.

Parameters:
  • od_models ({list of pyod.model classes}) – The outlier detection models to be ranked.

  • thresh ({pythresh.threshold class, float, int, list of pythresh.threshold classes, list of floats, list of ints}) – The thresholding method(s) to evaluate; floats or ints may be passed in place of thresholding classes.

  • method ({'model', 'native'}, optional (default='model')) – Ranking method: 'model' uses a trained LambdaMART ranking model, while 'native' combines the three proxy-metric rankings directly (see Notes).

  • weights (list of shape 3, optional (default=None)) – These weights are applied to the combined rank score. The first is for the cdf rankings, the second for the clust rankings, and the third for the consensus rankings. The default applies equal weightings to all proxy-metrics. Only applies when method='native'.

cdf_rank_
Type: list of tuples of shape (2, n_od_models) with the cdf-based rankings

clust_rank_
Type: list of tuples of shape (2, n_od_models) with the cluster-based rankings

consensus_rank_
Type: list of tuples of shape (2, n_od_models) with the consensus-based rankings
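
If method='native' is used, these per-proxy rankings can be inspected after calling eval() (a hedged illustration; ranker is assumed to be a RANK instance that has already been evaluated, as in the example below, and attribute availability may depend on the chosen method):

# Inspect the individual proxy-metric rankings after eval() has run
print(ranker.cdf_rank_)        # cdf-based rankings
print(ranker.clust_rank_)      # cluster-based rankings
print(ranker.consensus_rank_)  # consensus-based rankings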

Notes

The RANK class ranks the outlier detection methods by evaluating three distinct proxy-metrics. The first proxy-metric looks at the outlier likelihood scores by class and measures the cumulative distribution separation using the Wasserstein distance and the Exponential Euclidean Bregman distance. The second proxy-metric looks at the relationship between the fitted features (X) and the evaluated classes (y) using the Calinski-Harabasz score, and between the outlier likelihood scores and the evaluated classes using the McClain-Rao index. The third proxy-metric evaluates the class difference for each outlier detection and thresholding method with respect to consensus-based metrics of all the evaluated outlier detection class labels. This is done using the mean contamination deviation based on TruncatedSVD-decomposed scores and a Gaussian Naive-Bayes trained consensus score.
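
To make the first proxy-metric concrete, the sketch below measures how well the inlier and outlier score distributions separate using the Wasserstein distance (a minimal illustration under stated assumptions, not the class internals; score_separation is a hypothetical helper name):

# Minimal sketch of the first proxy-metric: cumulative distribution
# separation between the inlier (0) and outlier (1) score distributions
import numpy as np
from scipy.stats import wasserstein_distance

def score_separation(scores, labels):
    # A larger distance means the two score distributions are better separated
    scores, labels = np.asarray(scores), np.asarray(labels)
    return wasserstein_distance(scores[labels == 0], scores[labels == 1])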

Each proxy-metric is ranked separately; with the 'native' method, a final ranking is computed from all three proxy-metrics to yield a single ranked result for each outlier detection and thresholding method. The 'model' method instead uses a trained LambdaMART ranking model that takes all the proxy-metrics as input.

Please note that the data is standardized using pyod.utils.utility.standardizer during this ranking process.
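
For reference, the same utility can be applied manually (a short sketch; the random data is purely illustrative):

# The standardizer used internally during ranking, applied manually
import numpy as np
from pyod.utils.utility import standardizer

X_raw = np.random.default_rng(0).normal(size=(100, 5))  # illustrative data
X_norm = standardizer(X_raw)  # zero mean, unit variance per feature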

Examples

# Import libraries
from pyod.models.knn import KNN
from pyod.models.iforest import IForest
from pyod.models.pca import PCA
from pyod.models.mcd import MCD
from pyod.models.qmcd import QMCD
from pythresh.thresholds.filter import FILTER
from pythresh.utils.rank import RANK

# Initialize models
clfs = [KNN(), IForest(), PCA(), MCD(), QMCD()]
thres = FILTER()

# Get rankings (X is the input data, e.g. of shape (n_samples, n_features))
ranker = RANK(clfs, thres)
rankings = ranker.eval(X)
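
A variant using the 'native' ranking method with custom proxy-metric weights (a sketch built from the documented parameters; the weight values are arbitrary):

# 'native' ranking with custom weights for the three proxy-metrics
# (the weight values below are arbitrary, for illustration only)
ranker = RANK(clfs, thres, method='native', weights=[0.5, 0.25, 0.25])
rankings = ranker.eval(X)
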
eval(X)[source]

Outlier detection and thresholding method ranking.

Parameters:

X (np.array or list of shape (n_samples, 1) or (n_samples, n_features)) – The input data

Returns:

rankings – Each combination of outlier detection model and thresholder, ranked from best to worst in terms of performance

Return type:

list of tuples of shape (2, n_od_models)
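
Since the returned rankings are ordered from best to worst, the top pairing can be read off directly (a hedged illustration of inspecting the output):

# The first entry is the best-performing combination of
# outlier detection model and thresholder
best_pair = rankings[0]
print(best_pair)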

pythresh.utils.conf module

class pythresh.utils.conf.CONF(thresh, alpha=0.05, split=0.25, n_test=100, random_state=1234)[source]

Bases: object

CONF class for calculating the confidence of thresholding.

Use the CONF class to evaluate the confidence of thresholding methods based on confidence-interval bounds, finding datapoints that lie within the bounds and are therefore difficult to classify as true inliers or outliers for the selected confidence level.

Parameters:
  • thresh ({pythresh.threshold class}) – The thresholding method

  • alpha (float, optional (default=0.05)) – Confidence level corresponding to the Student's t-distribution map to sample

  • split (float, optional (default=0.25)) – The test size used for each thresholding test

  • n_test (int, optional (default=100)) – The number of thresholding tests to build the confidence region

  • random_state (int, optional (default=1234)) – Random seed for the random number generator used for the test splits. Can also be set to None.

Notes

The CONF class is designed for evaluating the confidence of thresholding methods within the context of outlier detection. It assesses the confidence of thresholding, a critical step in the outlier detection process. By sampling and testing different thresholds evaluated by the selected thresholding method, the class provides a confidence region for the selected threshold method. After building the confidence region, uncertain data points are identified. These are data points that lie within the confidence-interval bounds and may be challenging to classify as outliers or inliers.
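
The confidence-region idea can be sketched as follows: repeatedly re-threshold random subsamples of the decision scores and form a Student's t interval around the implied thresholds (a simplified, hypothetical re-implementation for intuition only, not the exact internals of CONF; threshold_interval is a made-up helper):

# Simplified sketch: build a t-based confidence interval for a threshold
# by re-thresholding many random subsamples of the decision scores
import numpy as np
from scipy.stats import t

def threshold_interval(scores, thres, alpha=0.05, split=0.25,
                       n_test=100, seed=1234):
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    n = int(len(scores) * split)
    thresholds = []
    for _ in range(n_test):
        sample = rng.choice(scores, size=n, replace=False)
        labels = thres.eval(sample)  # re-threshold the subsample
        # The smallest score flagged as an outlier approximates the threshold
        thresholds.append(sample[labels == 1].min() if labels.sum()
                          else sample.max())
    thresholds = np.asarray(thresholds)
    half = (t.ppf(1 - alpha / 2, df=n_test - 1)
            * thresholds.std(ddof=1) / np.sqrt(n_test))
    return thresholds.mean() - half, thresholds.mean() + half

Datapoints whose scores fall inside such an interval are the “uncertain” points that eval() reports.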

Examples

# Import libraries
from pyod.models.knn import KNN
from pythresh.thresholds.filter import FILTER
from pythresh.utils.conf import CONF

# Initialize models
clf = KNN()
thres = FILTER()

clf.fit(X)
scores = clf.decision_scores_
labels = thres.eval(scores)

# Get indices of datapoints that lie within the confidence bounds (uncertain)
confidence = CONF(thres)
uncertains = confidence.eval(scores)
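
One possible follow-up, assuming the uncertain datapoints should be excluded from the hard labels (a usage sketch; using -1 as an “uncertain” marker is just a convention chosen here):

# Flag uncertain datapoints in the thresholded labels
# (-1 as an 'uncertain' marker is an arbitrary convention for this sketch)
import numpy as np

labels = np.asarray(labels)
labels[uncertains] = -1
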
eval(decision)[source]

Outlier detection and thresholding method confidence interval bounds.

Parameters:

decision (np.array or list of shape (n_samples)) – The decision scores from an outlier detection model.

Returns:

uncertains – Indices of all datapoints that lie within the confidence-interval bounds and can be classified as “uncertain” datapoints

Return type:

list