#########
 Ranking
#########

**************
 Introduction
**************

Outlier detection for unsupervised tasks is difficult. The difficulty
arises with how to evaluate whether the selected methods for outlier
detection and thresholding are correct or the best for the dataset.
Often, additional post-evaluation involves visual or domain knowledge
based decisions to make a final call. This process can be tedious and is
often just as complex as the methods used to do the outlier detection in
the first place.

However, there are some robust statistical methods that can be used to
indicate or at least provide a guide of which outlier detection and
thresholding methods perform better than others. This process is that of
a ranking problem and can be used to list in order the evaluated outlier
detection and thresholding methods in terms of their respective
performance with other methods.

----

In order to rank the multiple outlier detection and thresholding
methods' capabilities of effectively labeling a given dataset, it is
important to note what is defined by "capabilities". In this case, this
refers to how well the predicted labels solved by the outlier detection
and thresholding method compares to the true labels. This can be done
using the F1 score or Matthews correlation coefficient (MCC). But, since
unsupervised tasks means there is lack of true labels, the best option
is to use other robust statistical metrics that have a strong
correlation to the above mentioned scores.

To find these proxy-metrics a good starting point is to know what data
can be used to compute them. In the case of unsupervised outlier
detection, there are essentially three main components: the exploratory
variables (X), the predicted outlier likelihood scores, and the
thresholded binary labels. With these, criteria can be set to apply
proxy-metrics that can then be ranked to provide a list of best to worst
performing outlier detection and thresholding methods with respect to
these metrics.

***************
 Proxy-Metrics
***************

Different types of proxy-metrics can be used with respect to the F1 and
MCC. Below four catagories have been defined.

Clustering Based
================

Since the dataset is considered to contain outliers, a clear distinction
should exist between the inliers and outliers. By using clustering based
metrics, the measure of similarity or distance between centroids or each
datapoints can provide a single score of outlier detection method's to
effectively distinguish inliers from outliers. This should also indicate
greater distinction between the inlier and outlier clusters. For these
clustering based metrics the quality of the labeled data will be
evaluated using either the exploratory variables or the outlier
likelihood scores and their assigned labels `Automatic Selection of
Outlier Detection Techniques
<https://pure.tue.nl/ws/portalfiles/portal/109406381/CSE663_Vishal_Chouksey_31_aug.pdf>`_

Statistical Distances
=====================

Statistical distances quantify the measure of distances between
probability based measures. These distances can be used to compute a
single value difference between two probability distributions. This
hints to what data from the unsupervised outlier detection method task
should be used. Two distinct probability distributions can be computed
for the outlier likelihood scores with respected to their labeled class.

Curve Based
===========

Curve based method like clustering based methods can be used to compare
the performance of unsupervised anomaly detection algorithms. This is
done by leveraging the density of OD likelihood scores and evaluating
their distributions with particular interest on their tails as this is
typically where the outlier will lie `How to Evaluate the Quality of
Unsupervised Anomaly Detection Algorithms?
<https://arxiv.org/abs/1607.01152>`_

Consensus Based
===============

Since the other proxy-metrics evaluate each outlier detection method
individually, consensus based proxy-metrics rather compare the OD
methods against one another. This is done be creating a consensus
baseline by using all the OD method's results and comparing the
deviation of each OD method from this baseline. This allows for a
consensus based score to be assigned to each OD method with respect to
the other tested OD methods `A Large-scale Study on Unsupervised Outlier
Model Selection: Do Internal Strategies Suffice?
<https://arxiv.org/abs/2104.01422>`_ `Consensus outlier detection in
survival analysis using the rank product test
<https://www.biorxiv.org/content/10.1101/421917v1.full>`_

*********************************
 Proxy-Metrics Correlation Tests
*********************************

35 proxy-metrics scores were correlated against the true versus the
thresholded labels MCC score by applying Pearson's correlation. This was
done for all passing possible combinations of the ``PyOD`` outlier
detection methods ``LODA, QMCD, MCD, GMM, KNN, KDE, PCA, COPOD, HBOS,``
and ``IForest`` on the AD benchmark datasets: ``annthyroid, breastw,
cardio, Cardiotocography, fault, glass, Hepatitis, Ionosphere, landsat,
letter, Lymphography, mnist, musk, optdigits, PageBlocks, pendigits,
Pima, satellite, satimage-2, SpamBase, Stamps, thyroid, vertebral,
vowels, Waveform, WBC, WDBC, Wilt, wine, WPBC,`` and ``yeast`` available
at `ADBench dataset
<https://github.com/Minqi824/ADBench/tree/main/adbench/datasets/Classical>`_
and applying the ``FILTER, CLF, DSN, OCSVM, KARCH, HIST,`` and ``META``
thresholders. A total of 198357 possible passing (dropped error fits,
nans, and infs) combinations were tested for this evaluation. Tabulated
below is a meta-analysis of the proxy-metrics Pearson's score statistics
calculated for all the combination across the datasets.

For clustering based between the exploratory variables and the predicted
labels, the following proxy-metrics were tested.

-  CAL: Calinski-Harabasz
-  DAV: Davies-Bouldin score
-  SIL: Silhouette score

For clustering based between the outlier likelihood scores and the
predicted labels, the following proxy-metrics were tested.

-  BH: Ball-Hall Index
-  BR: Banfeld-Raftery
-  CAL_sc: Calinski-Harabasz score
-  DAV_sc: Davies-Bouldin score
-  DR: Det Ratio Index Dunn: Dunn's index
-  Hubert: Hubert index
-  Iind: Local Moran's I index
-  MR: Mclain Rao Index
-  PB: Point biserial Index
-  RL: Ratkowsky Lance Index
-  RT: Ray-Turi Index
-  SIL_sc: Silhouette score
-  SDBW: S_Dbw validity index
-  SS: Scott-Symons Index
-  WG: Wemmert-Gancarski Index
-  XBS: Xie-Beni index

For statistical distances, the following proxy-metrics were tested.

-  AMA: Amari distance
-  BHT: Bhattacharyya distance
-  BREG: Exponential Euclidean Bregman distance
-  COR: Correlation distance
-  ENG: Energy distance
-  JS: Jensen-Shannon distance
-  MAH: Mahalanobis distance
-  LK: Lukaszyk-Karmowski metric for normal distributions
-  WS: Wasserstein or Earth Movers distance

For curve based, the following proxy-metrics were tested.

-  EM: Excess-Mass curves
-  MV: Mass-Volume curves

For consensus based, the following proxy-metrics were tested.

-  Contam: Mean contamination deviation based on TruncatedSVD decomposed
   scores
-  GNB: Gaussian Naive-Bayes trained consensus score
-  HITS: Deviation from the HITS consensus authority based labels
-  Mode: Deviation from the mode of the predicted labels
-  Thresh: Label deviation from the TruncatedSVD decomposed thresholded
   labels

.. csv-table::
    :header-rows: 1
    :file: tables/RankingCorr.csv
    :class: sphinx-datatable

From the table above it can be seen that most proxy-metrics performed
sub-optimally. However, there were six proxy-metrics that performed
generally better than the others. These are the six proxy-metrics

-  The Calinski-Harabasz score
-  The Mclain Rao Index
-  The Exponential Euclidean Bregman distance
-  The Wasserstein distance
-  The mean contamination deviation based on TruncatedSVD decomposed
   scores
-  The Gaussian Naive-Bayes trained consensus score

In order to get a better understanding on how these six proxy-metrics
performed overall, joyplots below demonstrate the distributions of their
Pearson's correlation with respect to the MCC scores for each statistic.

.. thumbnail:: figs/Rank1.png
   :alt: Total Average Pearson's score

   Figure 1: Total average Pearson's score of selected proxy metrics.

.. thumbnail:: figs/Rank2.png
   :alt: Mean Pearson's score Across Datasets

   Figure 2: Mean Pearson's score across datasets for selected proxy metrics.

.. thumbnail:: figs/Rank3.png
   :alt: Median Pearson's score Across Datasets

   Figure 3: Median Pearson's score across datasets for selected proxy metrics.

.. thumbnail:: figs/Rank4.png
   :alt: Standard Deviation Pearson's score Across Datasets

   Figure 4: Standard deviation Pearson's score across datasets for selected proxy metrics.

*****************
 Rank OD Methods
*****************

The ranking process involves ordering the proxy-metric scores with
respect to their performance. The proxy-metrics are ordered
highest-to-lowest or lowest-to-highest based on their performance
criterion. The proxy-metrics are combined as follows: the statistical
based distances are combined using equal weighting on their ordered
ranks to compute a single ranked list. The same method is applied to the
clustering based, and consensus based metrics. Finally, an overall
combined rank is computed using the combined statistical based ranking,
the combined clustering based ranking, and the combined consensus based
ranking. This final combined ranking can either be computed using equal
weightings for each three combined rankings or a weight list can be
parsed based on preference.

The ``RANK`` method in ``PyThresh`` applies the method above to rank the
performance of the outlier detection and thresholding methods against
each other.

To evaluate the performance of ``RANK`` the ranking evaluation measure
was employed as described in `RankDCG: Rank-Ordering Evaluation Measure
<https://arxiv.org/abs/1803.00719>`_ and GitHub implementation `ranking
<https://github.com/dkaterenchuk/ranking_measures>`_

The ``RANK`` method was tested on the same dataset and OD and
thresholding methods as was used for the selected proxy-metrics.
Additionally, a fine-tuned LambdaMART model using XGBoost was also
trained on the selected proxy-metrics and ranks from the same dataset
with which to further evaluate the ``RANK`` performance. Also a random
ordered selection rank method was tested with which to compare the
RankDCG results to.

The joyplots below indicate the performance between the methods with
respect aggregation across each dataset.

.. thumbnail:: figs/Rank5.png
   :alt: RankDCG scores for each test

   Figure 5: RankDCG scores for each tested combination.

.. thumbnail:: figs/Rank6.png
   :alt: Mean RankDCG scores per dataset

   Figure 6: Mean RankDCG scores for each tested combination agregated per dataset.

As can be seen, the ``RANK`` method performed on average better than
random selection. However, the trained LambdaMART model achieved
significantly better scores than ``RANK``. To test that the model was
not overfitted and is able to generalize, the datasets ``mammography,
skin``, and ``smtp`` were tested as the model had not been trained on
them.

.. thumbnail:: figs/Rank7.png
   :alt: Mean RankDCG scores per test dataset

   Figure 7: Mean RankDCG scores for each tested combination aggregated per test dataset.

On these datasets the ``RANK`` method performed worse than random
selection. However, the trained LambdaMART model performed marginally
better than random selection. The lower performance of the model may be
due to the proxy-metrics' lower than training datasets correlation to
the MCC as well as the OD and thresholding methods performing poorly in
general on these dataset. This is reiterated as seen by the ``RANK``
method's performance. However, the model is still valid with respect to
its ability to rank OD and thresholding problems.

*********
 Example
*********

Below is a simple example of how to apply the ``RANK`` method:

.. code:: python

   # Import libraries
   from pyod.models.knn import KNN
   from pyod.models.iforest import IForest
   from pyod.models.pca import PCA
   from pyod.models.mcd import MCD
   from pyod.models.qmcd import QMCD
   from pythresh.thresholds.filter import FILTER
   from pythresh.utils.rank import RANK

   # Initialize models
   clfs = [KNN(), IForest(), PCA(), MCD(), QMCD()]
   thres = FILTER() # or list of thresholder methods

   # Get rankings
   ranker = RANK(clfs, thres)
   rankings = ranker.eval(X)

*************
 Final Notes
*************

While the ``RANK`` method is a useful tool to assist in selecting the
possible best outlier detection and thresholding method to use, it is
not infallible. The use of the trained LambdaMART model will be used by
default however the standard ``RANK`` method can also be employed. It
has been noted from the tests above, that in general the ranked results
tended to return the best-to-worst performing outlier detection and
thresholding methods in the correct order. However, they were not
perfect. They at times exhibited incorrect orders and often the best
performing OD and thresholding method was in the top three rather than
being the top of the list. Additionally, some times well performing OD
and thresholding methods was ranked poorly.

Future work will test to see how effective this method is with regards
to other ranking OD and thresholding methods as well as how much better
it performs than simply picking the ``IForest`` OD method.

Note that this method is under active development and the model and
methods will be updated or upgraded in the future

The ``RANK`` method should be used with discretion but hopefully provide
more clarity on which OD and thresholding method to select.