Utility Functions
pythresh.utils.rank module
- class pythresh.utils.rank.RANK(od_models, thresh, method='model', weights=None)[source]
Bases:
object
RANK class for ranking outlier detection and thresholding methods.
Use the RANK class to rank outlier detection and thresholding method combinations by their ability to provide the best Matthews correlation coefficient with respect to the selected threshold method.
- Parameters:
od_models ({list of pyod.model classes}) – The list of initialized outlier detection models to rank
thresh ({pythresh.threshold class, float, int, list of pythresh.threshold classes, list of floats, list of ints}) – The thresholding method(s) or contamination level(s) to apply to each outlier detection model
method ({'model', 'native'}, optional (default='model')) – The ranking method: 'model' uses the trained LambdaMART ranking model, while 'native' combines the weighted proxy-metric rankings directly
weights (list of shape 3, optional (default=None)) – These weights are applied to the combined rank score. The first is for the cdf rankings, the second for the clust rankings, and the third for the mode rankings. Default applies equal weightings to all proxy-metrics. Only applies when method = ‘native’.
Notes
The RANK class ranks the outlier detection methods by evaluating three distinct proxy-metrics. The first proxy-metric looks at the outlier likelihood scores by class and measures the cumulative distribution separation using the Wasserstein distance and the Exponential Euclidean Bregman distance. The second proxy-metric looks at the relationship between the fitted features (X) and the evaluated classes (y) using the Calinski-Harabasz score, and between the outlier likelihood scores and the evaluated classes using the McClain-Rao index. The third proxy-metric evaluates the class difference for each outlier detection and thresholding method with respect to consensus-based metrics of all the evaluated outlier detection class labels. This is done using the mean contamination deviation based on TruncatedSVD-decomposed scores and a Gaussian Naive Bayes trained consensus score.
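The idea behind the first proxy-metric can be illustrated with SciPy's `wasserstein_distance`: the further apart the score distributions of the two predicted classes are, the better the detector/threshold pair separates inliers from outliers. This is a minimal sketch of the concept, not pythresh's actual implementation, and the score distributions below are synthetic.

```python
# Sketch of the cumulative-distribution-separation idea (not pythresh's code):
# measure how far apart the inlier and outlier score distributions are.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
inlier_scores = rng.normal(0.2, 0.05, 900)   # hypothetical inlier likelihoods
outlier_scores = rng.normal(0.8, 0.10, 100)  # hypothetical outlier likelihoods

# A larger distance indicates better class separation for that
# detector/threshold combination
sep = wasserstein_distance(inlier_scores, outlier_scores)
```

With the two synthetic distributions centred at 0.2 and 0.8, the distance comes out near 0.6; a poorly separating detector would produce overlapping distributions and a distance near zero.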
Each proxy-metric is ranked separately, and with the 'native' method a final ranking is applied using all three proxy-metrics to produce a single ranked result for each outlier detection and thresholding method. The 'model' method instead uses a trained LambdaMART ranking model that takes all the proxy-metrics as input.
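The 'native' combination step can be sketched as a weighted sum of the three per-metric rankings, with the weights parameter applied in the order described above (cdf, clust, mode). This is a simplified illustration under the assumption that the combined score is a plain weighted sum of ranks; pythresh's internal aggregation may differ in detail.

```python
import numpy as np

# Hypothetical per-method ranks from the three proxy-metrics
# (rank 0 = best). One entry per outlier detection method.
cdf_rank = np.array([0, 2, 1])
clust_rank = np.array([1, 0, 2])
mode_rank = np.array([0, 1, 2])

weights = [1.0, 1.0, 1.0]  # equal weighting, as in the default

# Weighted combined score; lower is better
combined = (weights[0] * cdf_rank
            + weights[1] * clust_rank
            + weights[2] * mode_rank)

final_order = np.argsort(combined)  # method indices, best first
print(final_order)  # -> [0 1 2]
```

Raising one weight biases the final ranking toward the corresponding proxy-metric, which is what the weights parameter controls when method='native'.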
Please note that the data is standardized using
from pyod.utils.utility import standardizer
during this ranking process.
Examples
# Import libraries
from pyod.models.knn import KNN
from pyod.models.iforest import IForest
from pyod.models.pca import PCA
from pyod.models.mcd import MCD
from pyod.models.qmcd import QMCD

from pythresh.thresholds.filter import FILTER
from pythresh.utils.rank import RANK

# Initialize models
clfs = [KNN(), IForest(), PCA(), MCD(), QMCD()]
thres = FILTER()

# Get rankings
ranker = RANK(clfs, thres)
rankings = ranker.eval(X)
pythresh.utils.conf module
- class pythresh.utils.conf.CONF(thresh, alpha=0.05, split=0.25, n_test=100, random_state=1234)[source]
Bases:
object
CONF class for calculating the confidence of thresholding.
Use the CONF class to evaluate the confidence of thresholding methods based on confidence-interval bounds, finding data points that lie within the bounds and are therefore difficult to classify as true inliers or outliers at the selected confidence level.
- Parameters:
thresh ({pythresh.threshold class}) – The thresholding method
alpha (float, optional (default=0.05)) – Confidence level corresponding to the Student's t-distribution map to sample
split (float, optional (default=0.25)) – The test size for each thresholding test
n_test (int, optional (default=100)) – The number of thresholding tests to build the confidence region
random_state (int, optional (default=1234)) – Random seed for the starting random number generators of the test split. Can also be set to None.
Notes
The CONF class is designed for evaluating the confidence of thresholding methods within the context of outlier detection. It assesses the confidence of thresholding, a critical step in the outlier detection process. By sampling and testing different thresholds evaluated by the selected thresholding method, the class provides a confidence region for the selected threshold method. After building the confidence region, uncertain data points are identified. These are data points that lie within the confidence-interval bounds and may be challenging to classify as outliers or inliers.
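The mechanism described above can be sketched as follows: repeatedly threshold random subsamples of the scores, then build a Student's t confidence interval around the resulting thresholds. This is a hypothetical illustration of the idea, not pythresh's implementation; the 95th-percentile rule stands in for an actual thresholding method.

```python
# Sketch of the CONF idea (hypothetical, not pythresh's code): build a
# confidence region for the threshold from repeated subsampled tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1234)
scores = rng.normal(0, 1, 1000)  # synthetic outlier likelihood scores

n_test, split, alpha = 100, 0.25, 0.05
thresholds = []
for _ in range(n_test):
    # Threshold a random test split of the scores
    sample = rng.choice(scores, size=int(len(scores) * split), replace=False)
    thresholds.append(np.percentile(sample, 95))  # stand-in threshold rule

thresholds = np.asarray(thresholds)

# Two-sided (1 - alpha) confidence interval using the Student's t-distribution
lo, hi = stats.t.interval(1 - alpha, df=n_test - 1,
                          loc=thresholds.mean(),
                          scale=stats.sem(thresholds))

# Points whose scores fall inside [lo, hi] are the "uncertain" ones
uncertain = np.where((scores >= lo) & (scores <= hi))[0]
```

Points below the lower bound are confidently inliers and points above the upper bound confidently outliers; only the points inside the interval are flagged as uncertain.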
Examples
# Import libraries
from pyod.models.knn import KNN

from pythresh.thresholds.filter import FILTER
from pythresh.utils.conf import CONF

# Initialize models
clf = KNN()
thres = FILTER()

clf.fit(X)
scores = clf.decision_scores_
labels = thres.eval(scores)

# Get indices of datapoints that lie within the confidence bounds
confidence = CONF(thres)
uncertains = confidence.eval(scores)
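A common follow-up to the example above is to flag the uncertain points so downstream code can treat them specially, for example by excluding them from evaluation. The arrays below are hypothetical stand-ins for a thresholder's binary labels and the index array returned by CONF.eval.

```python
import numpy as np

# Hypothetical outputs: binary labels from a thresholding method and the
# indices of points that fell inside the confidence bounds
labels = np.array([0, 0, 1, 0, 1, 0])
uncertains = np.array([2, 5])

# Mark uncertain points with -1 so they can be handled separately
flagged = labels.copy()
flagged[uncertains] = -1
print(flagged.tolist())  # -> [0, 0, -1, 0, 1, -1]
```

Keeping the original labels intact and working on a copy avoids silently discarding the thresholder's decisions.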