######### Ranking ######### ************** Introduction ************** Outlier detection for unsupervised tasks is difficult. The difficulty arises with how to evaluate whether the selected methods for outlier detection and thresholding are correct or the best for the dataset. Often, additional post-evaluation involves visual or domain knowledge based decisions to make a final call. This process can be tedious and is often just as complex as the methods used to do the outlier detection in the first place. However, there are some robust statistical methods that can be used to indicate or at least provide a guide of which outlier detection and thresholding methods perform better than others. This process is that of a ranking problem and can be used to list in order the evaluated outlier detection and thresholding methods in terms of their respective performance with other methods. ---- In order to rank the multiple outlier detection and thresholding methods' capabilities of effectively labeling a given dataset, it is important to note what is defined by "capabilities". In this case, this refers to how well the predicted labels solved by the outlier detection and thresholding method compares to the true labels. This can be done using the F1 score or Matthews correlation coefficient (MCC). But, since unsupervised tasks means there is lack of true labels, the best option is to use other robust statistical metrics that have a strong correlation to the above mentioned scores. To find these proxy-metrics a good starting point is to know what data can be used to compute them. In the case of unsupervised outlier detection, there are essentially three main components: the exploratory variables (X), the predicted outlier likelihood scores, and the thresholded binary labels. With these, criteria can be set to apply proxy-metrics that can then be ranked to provide a list of best to worst performing outlier detection and thresholding methods with respect to these metrics. *************** Proxy-Metrics *************** Different types of proxy-metrics can be used with respect to the F1 and MCC. Below four catagories have been defined. Clustering Based ================ Since the dataset is considered to contain outliers, a clear distinction should exist between the inliers and outliers. By using clustering based metrics, the measure of similarity or distance between centroids or each datapoints can provide a single score of outlier detection method's to effectively distinguish inliers from outliers. This should also indicate greater distinction between the inlier and outlier clusters. For these clustering based metrics the quality of the labeled data will be evaluated using either the exploratory variables or the outlier likelihood scores and their assigned labels `Automatic Selection of Outlier Detection Techniques `_ Statistical Distances ===================== Statistical distances quantify the measure of distances between probability based measures. These distances can be used to compute a single value difference between two probability distributions. This hints to what data from the unsupervised outlier detection method task should be used. Two distinct probability distributions can be computed for the outlier likelihood scores with respected to their labeled class. Curve Based =========== Curve based method like clustering based methods can be used to compare the performance of unsupervised anomaly detection algorithms. This is done by leveraging the density of OD likelihood scores and evaluating their distributions with particular interest on their tails as this is typically where the outlier will lie `How to Evaluate the Quality of Unsupervised Anomaly Detection Algorithms? `_ Consensus Based =============== Since the other proxy-metrics evaluate each outlier detection method individually, consensus based proxy-metrics rather compare the OD methods against one another. This is done be creating a consensus baseline by using all the OD method's results and comparing the deviation of each OD method from this baseline. This allows for a consensus based score to be assigned to each OD method with respect to the other tested OD methods `A Large-scale Study on Unsupervised Outlier Model Selection: Do Internal Strategies Suffice? `_ `Consensus outlier detection in survival analysis using the rank product test `_ ********************************* Proxy-Metrics Correlation Tests ********************************* 35 proxy-metrics scores were correlated against the true versus the thresholded labels MCC score by applying Pearson's correlation. This was done for all passing possible combinations of the ``PyOD`` outlier detection methods ``LODA, QMCD, MCD, GMM, KNN, KDE, PCA, COPOD, HBOS,`` and ``IForest`` on the AD benchmark datasets: ``annthyroid, breastw, cardio, Cardiotocography, fault, glass, Hepatitis, Ionosphere, landsat, letter, Lymphography, mnist, musk, optdigits, PageBlocks, pendigits, Pima, satellite, satimage-2, SpamBase, Stamps, thyroid, vertebral, vowels, Waveform, WBC, WDBC, Wilt, wine, WPBC,`` and ``yeast`` available at `ADBench dataset `_ and applying the ``FILTER, CLF, DSN, OCSVM, KARCH, HIST,`` and ``META`` thresholders. A total of 198357 possible passing (dropped error fits, nans, and infs) combinations were tested for this evaluation. Tabulated below is a meta-analysis of the proxy-metrics Pearson's score statistics calculated for all the combination across the datasets. For clustering based between the exploratory variables and the predicted labels, the following proxy-metrics were tested. - CAL: Calinski-Harabasz - DAV: Davies-Bouldin score - SIL: Silhouette score For clustering based between the outlier likelihood scores and the predicted labels, the following proxy-metrics were tested. - BH: Ball-Hall Index - BR: Banfeld-Raftery - CAL_sc: Calinski-Harabasz score - DAV_sc: Davies-Bouldin score - DR: Det Ratio Index Dunn: Dunn's index - Hubert: Hubert index - Iind: Local Moran's I index - MR: Mclain Rao Index - PB: Point biserial Index - RL: Ratkowsky Lance Index - RT: Ray-Turi Index - SIL_sc: Silhouette score - SDBW: S_Dbw validity index - SS: Scott-Symons Index - WG: Wemmert-Gancarski Index - XBS: Xie-Beni index For statistical distances, the following proxy-metrics were tested. - AMA: Amari distance - BHT: Bhattacharyya distance - BREG: Exponential Euclidean Bregman distance - COR: Correlation distance - ENG: Energy distance - JS: Jensen-Shannon distance - MAH: Mahalanobis distance - LK: Lukaszyk-Karmowski metric for normal distributions - WS: Wasserstein or Earth Movers distance For curve based, the following proxy-metrics were tested. - EM: Excess-Mass curves - MV: Mass-Volume curves For consensus based, the following proxy-metrics were tested. - Contam: Mean contamination deviation based on TruncatedSVD decomposed scores - GNB: Gaussian Naive-Bayes trained consensus score - HITS: Deviation from the HITS consensus authority based labels - Mode: Deviation from the mode of the predicted labels - Thresh: Label deviation from the TruncatedSVD decomposed thresholded labels .. csv-table:: :header-rows: 1 :file: tables/RankingCorr.csv :class: sphinx-datatable From the table above it can be seen that most proxy-metrics performed sub-optimally. However, there were six proxy-metrics that performed generally better than the others. These are the six proxy-metrics - The Calinski-Harabasz score - The Mclain Rao Index - The Exponential Euclidean Bregman distance - The Wasserstein distance - The mean contamination deviation based on TruncatedSVD decomposed scores - The Gaussian Naive-Bayes trained consensus score In order to get a better understanding on how these six proxy-metrics performed overall, joyplots below demonstrate the distributions of their Pearson's correlation with respect to the MCC scores for each statistic. .. thumbnail:: figs/Rank1.png :alt: Total Average Pearson's score Figure 1: Total average Pearson's score of selected proxy metrics. .. thumbnail:: figs/Rank2.png :alt: Mean Pearson's score Across Datasets Figure 2: Mean Pearson's score across datasets for selected proxy metrics. .. thumbnail:: figs/Rank3.png :alt: Median Pearson's score Across Datasets Figure 3: Median Pearson's score across datasets for selected proxy metrics. .. thumbnail:: figs/Rank4.png :alt: Standard Deviation Pearson's score Across Datasets Figure 4: Standard deviation Pearson's score across datasets for selected proxy metrics. ***************** Rank OD Methods ***************** The ranking process involves ordering the proxy-metric scores with respect to their performance. The proxy-metrics are ordered highest-to-lowest or lowest-to-highest based on their performance criterion. The proxy-metrics are combined as follows: the statistical based distances are combined using equal weighting on their ordered ranks to compute a single ranked list. The same method is applied to the clustering based, and consensus based metrics. Finally, an overall combined rank is computed using the combined statistical based ranking, the combined clustering based ranking, and the combined consensus based ranking. This final combined ranking can either be computed using equal weightings for each three combined rankings or a weight list can be parsed based on preference. The ``RANK`` method in ``PyThresh`` applies the method above to rank the performance of the outlier detection and thresholding methods against each other. To evaluate the performance of ``RANK`` the ranking evaluation measure was employed as described in `RankDCG: Rank-Ordering Evaluation Measure `_ and GitHub implementation `ranking `_ The ``RANK`` method was tested on the same dataset and OD and thresholding methods as was used for the selected proxy-metrics. Additionally, a fine-tuned LambdaMART model using XGBoost was also trained on the selected proxy-metrics and ranks from the same dataset with which to further evaluate the ``RANK`` performance. Also a random ordered selection rank method was tested with which to compare the RankDCG results to. The joyplots below indicate the performance between the methods with respect aggregation across each dataset. .. thumbnail:: figs/Rank5.png :alt: RankDCG scores for each test Figure 5: RankDCG scores for each tested combination. .. thumbnail:: figs/Rank6.png :alt: Mean RankDCG scores per dataset Figure 6: Mean RankDCG scores for each tested combination agregated per dataset. As can be seen, the ``RANK`` method performed on average better than random selection. However, the trained LambdaMART model achieved significantly better scores than ``RANK``. To test that the model was not overfitted and is able to generalize, the datasets ``mammography, skin``, and ``smtp`` were tested as the model had not been trained on them. .. thumbnail:: figs/Rank7.png :alt: Mean RankDCG scores per test dataset Figure 7: Mean RankDCG scores for each tested combination aggregated per test dataset. On these datasets the ``RANK`` method performed worse than random selection. However, the trained LambdaMART model performed marginally better than random selection. The lower performance of the model may be due to the proxy-metrics' lower than training datasets correlation to the MCC as well as the OD and thresholding methods performing poorly in general on these dataset. This is reiterated as seen by the ``RANK`` method's performance. However, the model is still valid with respect to its ability to rank OD and thresholding problems. ********* Example ********* Below is a simple example of how to apply the ``RANK`` method: .. code:: python # Import libraries from pyod.models.knn import KNN from pyod.models.iforest import IForest from pyod.models.pca import PCA from pyod.models.mcd import MCD from pyod.models.qmcd import QMCD from pythresh.thresholds.filter import FILTER from pythresh.utils.rank import RANK # Initialize models clfs = [KNN(), IForest(), PCA(), MCD(), QMCD()] thres = FILTER() # or list of thresholder methods # Get rankings ranker = RANK(clfs, thres) rankings = ranker.eval(X) ************* Final Notes ************* While the ``RANK`` method is a useful tool to assist in selecting the possible best outlier detection and thresholding method to use, it is not infallible. The use of the trained LambdaMART model will be used by default however the standard ``RANK`` method can also be employed. It has been noted from the tests above, that in general the ranked results tended to return the best-to-worst performing outlier detection and thresholding methods in the correct order. However, they were not perfect. They at times exhibited incorrect orders and often the best performing OD and thresholding method was in the top three rather than being the top of the list. Additionally, some times well performing OD and thresholding methods was ranked poorly. Future work will test to see how effective this method is with regards to other ranking OD and thresholding methods as well as how much better it performs than simply picking the ``IForest`` OD method. Note that this method is under active development and the model and methods will be updated or upgraded in the future The ``RANK`` method should be used with discretion but hopefully provide more clarity on which OD and thresholding method to select.