Ranking
Introduction
Outlier detection in unsupervised settings is difficult to evaluate: without ground truth there is no direct way to tell whether the selected outlier detection and thresholding methods are correct, or the best choice for the dataset. Often, additional post-evaluation relies on visual inspection or domain knowledge to make a final call. This process can be tedious and is often just as complex as the outlier detection itself.
However, there are robust statistical methods that can indicate, or at least provide a guide to, which outlier detection and thresholding methods perform better than others. This is a ranking problem: the evaluated outlier detection and thresholding methods can be ordered by their performance relative to one another.
To rank the capability of multiple outlier detection and thresholding methods to effectively label a given dataset, it is important to define what is meant by "capability". Here it refers to how well the predicted labels produced by an outlier detection and thresholding method compare to the true labels, as measured by, for example, the F1 score or the Matthews correlation coefficient (MCC). However, since unsupervised tasks lack true labels, the best option is to use other robust statistical metrics that correlate strongly with these scores.
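When true labels are available, these reference scores are straightforward to compute with scikit-learn. A minimal sketch, using small illustrative label arrays:

from sklearn.metrics import f1_score, matthews_corrcoef

# Hypothetical true and predicted binary outlier labels (1 = outlier)
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 0, 0]

print(f"F1:  {f1_score(y_true, y_pred):.3f}")
print(f"MCC: {matthews_corrcoef(y_true, y_pred):.3f}")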
To find these proxy-metrics, a good starting point is to know what data is available to compute them. In unsupervised outlier detection there are essentially three components: the explanatory variables (X), the predicted outlier likelihood scores, and the thresholded binary labels. With these, criteria can be set for proxy-metrics whose values can then be ranked, yielding a list of the best to worst performing outlier detection and thresholding methods with respect to those metrics.
Proxy-Metrics
Different types of proxy-metrics can be used as stand-ins for the F1 score and MCC. Four categories are defined below.
Clustering Based
Since the dataset is assumed to contain outliers, a clear distinction should exist between the inliers and outliers. Clustering based metrics measure the similarity or distance between centroids or between individual datapoints, and can provide a single score of an outlier detection method's ability to distinguish inliers from outliers; a higher quality labeling should show greater separation between the inlier and outlier clusters. For these metrics, the quality of the labeled data is evaluated using either the explanatory variables or the outlier likelihood scores together with their assigned labels (see Automatic Selection of Outlier Detection Techniques).
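For instance, the Silhouette and Calinski-Harabasz scores can be computed with scikit-learn by treating the predicted binary labels as a two-cluster assignment. A minimal sketch, where the data matrix X and the labels are illustrative assumptions:

import numpy as np
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(42)

# Illustrative data: a dense inlier cloud plus a few scattered outliers
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (10, 2))])
labels = np.array([0] * 200 + [1] * 10)  # hypothetical thresholded labels

# Higher values indicate a cleaner inlier/outlier separation
print(f"Silhouette:        {silhouette_score(X, labels):.3f}")
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.3f}")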
Statistical Distances
Statistical distances quantify the distance between probability measures, producing a single-value difference between two probability distributions. This hints at which data from the unsupervised outlier detection task should be used: two distinct probability distributions can be computed from the outlier likelihood scores, one for each labeled class.
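For example, the Wasserstein distance between the score distributions of the predicted inliers and predicted outliers can be computed with SciPy. A minimal sketch, with illustrative scores and labels:

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical OD likelihood scores and their thresholded labels
scores = np.concatenate([rng.normal(0.2, 0.1, 200), rng.normal(0.8, 0.1, 10)])
labels = np.array([0] * 200 + [1] * 10)

# Distance between the inlier and outlier score distributions; a larger
# separation suggests a cleaner inlier/outlier split
dist = wasserstein_distance(scores[labels == 0], scores[labels == 1])
print(f"Wasserstein distance: {dist:.3f}")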
Curve Based
Curve based methods, like clustering based methods, can be used to compare the performance of unsupervised anomaly detection algorithms. This is done by leveraging the density of the OD likelihood scores and evaluating their distributions, with particular interest in the tails, as this is typically where the outliers lie (see How to Evaluate the Quality of Unsupervised Anomaly Detection Algorithms?).
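A rough sketch of one such metric, the Mass-Volume curve, is given below. Following the Monte Carlo approach of the paper cited above, the volume of each score level set is approximated by uniform sampling over the bounding box of the data; the sign flip on the PyOD scores and the alpha grid are assumptions of this sketch, not a fixed recipe:

import numpy as np
from pyod.models.knn import KNN

def mass_volume_curve(clf, X, n_mc=10000, alphas=np.linspace(0.90, 0.99, 20)):
    # Uniform Monte Carlo sample over the bounding box of X
    lo, hi = X.min(axis=0), X.max(axis=0)
    volume_support = np.prod(hi - lo)
    U = np.random.uniform(lo, hi, size=(n_mc, X.shape[1]))
    # PyOD scores are higher for outliers; negate so higher = more normal
    s_data = -clf.decision_function(X)
    s_unif = -clf.decision_function(U)
    mv = []
    for alpha in alphas:
        # Threshold whose level set captures at least alpha of the data mass
        t = np.quantile(s_data, 1 - alpha)
        # Estimated volume of that level set (smaller is better at fixed mass)
        mv.append(volume_support * np.mean(s_unif >= t))
    return np.asarray(mv)

X = np.random.default_rng(1).normal(size=(300, 2))
clf = KNN().fit(X)
print(mass_volume_curve(clf, X).mean())  # summary value; lower is better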
Consensus Based
Since the other proxy-metrics evaluate each outlier detection method individually, consensus based proxy-metrics instead compare the OD methods against one another. This is done by creating a consensus baseline from all of the OD methods' results and measuring each OD method's deviation from that baseline. This yields a consensus based score for each OD method with respect to the other tested OD methods (see A Large-scale Study on Unsupervised Outlier Model Selection: Do Internal Strategies Suffice? and Consensus outlier detection in survival analysis using the rank product test).
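The simplest consensus baseline is the per-datapoint majority vote (the mode, for binary labels) across all OD methods, with each method scored by how far its labels deviate from that baseline. A minimal sketch, where label_matrix is an illustrative stand-in for one row of thresholded labels per OD method:

import numpy as np

rng = np.random.default_rng(7)

# Hypothetical thresholded labels: 5 OD methods x 100 datapoints
label_matrix = (rng.random((5, 100)) < 0.1).astype(int)

# Consensus baseline: per-datapoint majority vote across all methods
consensus = (label_matrix.mean(axis=0) >= 0.5).astype(int)

# Score each method by its mean deviation from the consensus (lower is better)
deviation = np.mean(label_matrix != consensus, axis=1)
print(deviation)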
Proxy-Metrics Correlation Tests
The scores of 35 proxy-metrics were correlated, using Pearson's correlation, against the MCC between the true and the thresholded labels. This was done for all passing combinations of the PyOD outlier detection methods LODA, QMCD, MCD, GMM, KNN, KDE, PCA, COPOD, HBOS, and IForest on the AD benchmark datasets annthyroid, breastw, cardio, Cardiotocography, fault, glass, Hepatitis, Ionosphere, landsat, letter, Lymphography, mnist, musk, optdigits, PageBlocks, pendigits, Pima, satellite, satimage-2, SpamBase, Stamps, thyroid, vertebral, vowels, Waveform, WBC, WDBC, Wilt, wine, WPBC, and yeast, available from the ADBench dataset collection, applying the FILTER, CLF, DSN, OCSVM, KARCH, HIST, and META thresholders. A total of 198357 passing combinations (dropping failed fits, NaNs, and infs) were tested for this evaluation. Tabulated below is a meta-analysis of the proxy-metrics' Pearson's score statistics calculated over all combinations across the datasets.
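Each entry in the table below summarizes Pearson coefficients computed between a proxy-metric's scores and the corresponding MCC values over the tested combinations. A minimal sketch of a single such correlation with SciPy, using illustrative arrays:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)

# Hypothetical per-combination proxy-metric scores and MCC scores
proxy_scores = rng.normal(size=50)
mcc_scores = 0.5 * proxy_scores + rng.normal(scale=0.5, size=50)

r, p = pearsonr(proxy_scores, mcc_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")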
For clustering based proxy-metrics between the explanatory variables and the predicted labels, the following were tested.
CAL: Calinski-Harabasz
DAV: Davies-Bouldin score
SIL: Silhouette score
For clustering based proxy-metrics between the outlier likelihood scores and the predicted labels, the following were tested.
BH: Ball-Hall Index
BR: Banfield-Raftery
CAL_sc: Calinski-Harabasz score
DAV_sc: Davies-Bouldin score
DR: Det Ratio Index
Dunn: Dunn's index
Hubert: Hubert index
Iind: Local Moran’s I index
MR: Mclain Rao Index
PB: Point biserial Index
RL: Ratkowsky Lance Index
RT: Ray-Turi Index
SIL_sc: Silhouette score
SDBW: S_Dbw validity index
SS: Scott-Symons Index
WG: Wemmert-Gancarski Index
XBS: Xie-Beni index
For statistical distances, the following proxy-metrics were tested.
AMA: Amari distance
BHT: Bhattacharyya distance
BREG: Exponential Euclidean Bregman distance
COR: Correlation distance
ENG: Energy distance
JS: Jensen-Shannon distance
MAH: Mahalanobis distance
LK: Lukaszyk-Karmowski metric for normal distributions
WS: Wasserstein or Earth Movers distance
For curve based methods, the following proxy-metrics were tested.
EM: Excess-Mass curves
MV: Mass-Volume curves
For consensus based methods, the following proxy-metrics were tested.
Contam: Mean contamination deviation based on TruncatedSVD decomposed scores
GNB: Gaussian Naive-Bayes trained consensus score
HITS: Deviation from the HITS consensus authority based labels
Mode: Deviation from the mode of the predicted labels
Thresh: Label deviation from the TruncatedSVD decomposed thresholded labels
| Label | Mean | Median | Standard Deviation |
|---|---|---|---|
| CAL | 0.1872 | 0.2027 | 0.4303 |
| DAV | -0.1257 | -0.0474 | 0.4523 |
| SIL | 0.1262 | 0.0589 | 0.4744 |
| BH | 0.0039 | 0.0391 | 0.4836 |
| BR | 0.0260 | 0.0300 | 0.4750 |
| CAL_sc | 0.0158 | 0.0111 | 0.5083 |
| DAV_sc | -0.0483 | -0.0944 | 0.5230 |
| DR | -0.0157 | -0.0111 | 0.5084 |
| Dunn | 0.0270 | 0.0207 | 0.5353 |
| Hubert | 0.0641 | 0.1632 | 0.4674 |
| Iind | 0.0648 | 0.1527 | 0.4763 |
| MR | 0.0789 | 0.1710 | 0.4973 |
| PB | 0.0367 | 0.0715 | 0.5089 |
| RL | -0.0314 | -0.0569 | 0.5081 |
| RT | -0.0034 | 0.0524 | 0.4884 |
| SIL_sc | 0.0473 | 0.0801 | 0.4885 |
| SDBW | -0.0641 | -0.0627 | 0.4736 |
| SS | -0.0034 | 0.0524 | 0.4884 |
| WG | -0.0244 | -0.0210 | 0.5221 |
| XBS | -0.0535 | 0.0032 | 0.5280 |
| AMA | 0.0543 | 0.0088 | 0.5046 |
| BHT | 0.0239 | 0.0163 | 0.5050 |
| BREG | 0.1022 | 0.1546 | 0.5006 |
| COR | 0.0173 | 0.0364 | 0.5114 |
| ENG | 0.1120 | 0.1054 | 0.5030 |
| JS | 0.0566 | 0.1282 | 0.5045 |
| MAH | 0.0749 | 0.0300 | 0.5075 |
| LK | 0.0749 | 0.1048 | 0.5070 |
| WS | 0.1133 | 0.1766 | 0.5001 |
| EM | 0.0261 | 0.0752 | 0.4332 |
| MV | 0.0094 | -0.0164 | 0.4361 |
| Contam | -0.2003 | -0.1498 | 0.5297 |
| GNB | -0.1931 | -0.2902 | 0.4768 |
| HITS | -0.0449 | -0.1235 | 0.5998 |
| Mode | -0.1505 | -0.0840 | 0.5377 |
| Thresh | -0.1696 | -0.1940 | 0.6031 |
From the table above it can be seen that most proxy-metrics performed sub-optimally. However, six proxy-metrics performed generally better than the others:
The Calinski-Harabasz score
The Mclain Rao Index
The Exponential Euclidean Bregman distance
The Wasserstein distance
The mean contamination deviation based on TruncatedSVD decomposed scores
The Gaussian Naive-Bayes trained consensus score
In order to get a better understanding of how these six proxy-metrics performed overall, the joyplots below show the distributions of their Pearson's correlation with the MCC scores for each statistic.
Rank OD Methods
The ranking process involves ordering the proxy-metric scores by performance: each proxy-metric is ordered highest-to-lowest or lowest-to-highest depending on its performance criterion. The proxy-metrics are then combined as follows: the statistical distance based metrics are combined using equal weighting of their ordered ranks to compute a single ranked list, and the same is done for the clustering based and consensus based metrics. Finally, an overall combined rank is computed from the combined statistical, clustering, and consensus based rankings, either with equal weightings for the three or with a weight list passed based on preference.
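The combination step can be sketched with scipy.stats.rankdata: each metric's scores are converted to ranks according to its criterion, and the ranks are averaged with optional weights. This is an illustrative reconstruction of the procedure described above, not the exact PyThresh implementation:

import numpy as np
from scipy.stats import rankdata

def combine_ranks(metric_scores, higher_is_better, weights=None):
    # metric_scores: (n_metrics, n_candidates) proxy-metric values
    # higher_is_better: per-metric flag giving each metric's criterion
    ranks = []
    for scores, hib in zip(metric_scores, higher_is_better):
        # rankdata ranks ascending, so negate metrics where higher is better
        ranks.append(rankdata(-scores if hib else scores))
    ranks = np.asarray(ranks)
    weights = np.ones(len(ranks)) if weights is None else np.asarray(weights)
    # Weighted mean rank per candidate; a lower combined rank is better
    return np.average(ranks, axis=0, weights=weights)

# Hypothetical scores for 3 metrics over 4 candidate method combinations
scores = np.array([[0.9, 0.4, 0.7, 0.1],
                   [0.2, 0.8, 0.5, 0.9],
                   [0.3, 0.1, 0.6, 0.2]])
combined = combine_ranks(scores, higher_is_better=[True, False, True])
print(np.argsort(combined))  # candidate indices, best first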
The RANK method in PyThresh applies the procedure above to rank the performance of the outlier detection and thresholding methods against each other.
To evaluate the performance of RANK, the ranking evaluation measure RankDCG was employed, as described in RankDCG: Rank-Ordering Evaluation Measure and its GitHub implementation, ranking.
The RANK method was tested on the same datasets and OD and thresholding methods as were used for selecting the proxy-metrics. Additionally, a fine-tuned LambdaMART model using XGBoost was trained on the selected proxy-metrics and ranks from the same datasets to further evaluate the RANK performance. A randomly ordered selection rank method was also tested as a baseline for the RankDCG results.
The joyplots below show the performance of the methods, aggregated across each dataset.
As can be seen, the RANK method performed better on average than random selection. However, the trained LambdaMART model achieved significantly better scores than RANK. To test that the model was not overfitted and is able to generalize, the datasets mammography, skin, and smtp were evaluated, as the model had not been trained on them.
On these datasets the RANK method performed worse than random selection, while the trained LambdaMART model performed marginally better than random selection. The lower performance of the model may be due to the proxy-metrics correlating less strongly with the MCC than on the training datasets, as well as the OD and thresholding methods performing poorly in general on these datasets; this is reiterated by the RANK method's performance. However, the model remains valid with respect to its ability to rank OD and thresholding problems.
Example
Below is a simple example of how to apply the RANK method:
# Import libraries
from pyod.models.knn import KNN
from pyod.models.iforest import IForest
from pyod.models.pca import PCA
from pyod.models.mcd import MCD
from pyod.models.qmcd import QMCD
from pyod.utils.data import generate_data
from pythresh.thresholds.filter import FILTER
from pythresh.utils.ranking import RANK

# Generate example data with 10% outliers
X, _ = generate_data(n_train=500, n_features=3, contamination=0.1, train_only=True)

# Initialize models
clfs = [KNN(), IForest(), PCA(), MCD(), QMCD()]
thres = FILTER()  # or a list of thresholding methods

# Get rankings of the OD and thresholding combinations
ranker = RANK(clfs, thres)
rankings = ranker.eval(X)
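The returned rankings list the evaluated OD and thresholding combinations from predicted best to predicted worst, so the leading entries are the suggested methods to inspect first.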
Final Notes
While the RANK method is a useful tool to assist in selecting the best possible outlier detection and thresholding method to use, it is not infallible. The trained LambdaMART model is used by default; however, the standard RANK method can also be employed. In the tests above it was noted that, in general, the ranked results tended to return the best-to-worst performing outlier detection and thresholding methods in the correct order. However, they were not perfect: at times they exhibited incorrect orders, often the best performing OD and thresholding method appeared in the top three rather than at the top of the list, and sometimes well performing OD and thresholding methods were ranked poorly.
Future work will test how effective this method is relative to other OD and thresholding ranking methods, as well as how much better it performs than simply picking the IForest OD method.
Note that this method is under active development, and the model and methods will be updated or upgraded in the future.
The RANK method should be used with discretion, but should hopefully provide more clarity on which OD and thresholding method to select.