Frequently Asked Questions
What outlier thresholding method should I use?
With many methods to choose from, selecting an outlier thresholding
method depends on three things: the type of data you are working with,
the selected outlier detection method, and the resulting distribution of
the outlier likelihood scores. The last factor is particularly important
because the scores are the only information passed directly to the
thresholding method. A good rule of thumb is to start with the best
benchmarked methods, although these may not be the best choice for every
dataset. Another option is to combine multiple methods using the COMB
thresholder. Either option should give a reasonable, if not optimal,
threshold for the dataset, and at the least a good starting point for
further investigation. Another best practice is to use a thresholding
method that is similar to the outlier detection method, which may
improve compatibility between the two and give better results. A simple
example is to follow the PCA outlier detection method with the DECOMP
thresholder. This generally works, but not always. Finally, the
distribution of the outlier likelihood scores is often the most helpful
guide to selecting a thresholding method. Understanding the profile and
complexity of the distribution is important (e.g. is the distribution
Gaussian-like?). The API reference page has notes on most thresholding
methods that can help match them to the distribution of the likelihood
scores.
Remember that these are unsupervised methods, and sometimes interpretability matters more than the best result, which is why the available thresholding methods span a range of complexity and variety. Another factor to consider is whether the priority is to catch every outlier or to strike the best balance between missing outliers and removing as few inliers as possible.
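The idea of combining several thresholding methods can be sketched with a simple majority vote. This is only an illustration that uses the Python standard library: the three threshold rules, the function names, and the example scores are assumptions made for this sketch, and it is not how the COMB thresholder is implemented.

```python
import statistics


def zscore_threshold(scores, k=2.0):
    """Threshold at mean + k standard deviations."""
    return statistics.fmean(scores) + k * statistics.pstdev(scores)


def mad_threshold(scores, k=3.0):
    """Threshold at median + k scaled median absolute deviations."""
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    return med + k * 1.4826 * mad


def iqr_threshold(scores, k=1.5):
    """Threshold at the upper Tukey fence, Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return q3 + k * (q3 - q1)


def combine_by_vote(scores, threshold_fns):
    """Label a point 1 (outlier) when most thresholds flag it, else 0."""
    thresholds = [fn(scores) for fn in threshold_fns]
    labels = []
    for s in scores:
        votes = sum(s > t for t in thresholds)
        labels.append(1 if votes * 2 > len(thresholds) else 0)
    return labels


# Hypothetical likelihood scores with one clearly separated point.
scores = [0.10, 0.12, 0.11, 0.13, 0.09, 0.14, 0.10, 0.12, 0.95, 0.11]
labels = combine_by_vote(scores, [zscore_threshold, mad_threshold, iqr_threshold])
print(labels)  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
```

A vote like this trades some sensitivity for stability: a point is flagged only when most of the individual rules agree.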
Why are inliers and outliers incorrectly labeled even when the correct contamination level is applied?
Most thresholding methods follow the assumption that the higher a
likelihood score is, the more likely it is to be an outlier. Therefore,
if the outlier detection method incorrectly gives a high/low score to
the wrong class, this misclassification will be carried over to the
thresholding method. This makes selecting the best outlier detection
method even more important than the thresholding method. That said, the
META thresholder was explicitly constructed to provide the best possible
threshold regardless of which outlier detection method is used. Even so,
this approach has clear limitations.
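A small standard-library sketch makes the point concrete. The ground truth, the scores, and the function name below are hypothetical, and the thresholding rule is a plain top-fraction cut, not any specific PyThresh method. Even with the exact contamination level, a wrong ranking from the detector produces wrong labels.

```python
def threshold_by_contamination(scores, contamination):
    """Label the top `contamination` fraction of scores as outliers (1)."""
    n_outliers = round(len(scores) * contamination)
    # Indices ordered from highest score to lowest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [1 if i in flagged else 0 for i in range(len(scores))]


# Hypothetical ground truth: the last two points are the true outliers.
truth = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical detector output: one inlier (index 2) is scored above a
# true outlier (index 9), so the ranking itself is wrong.
scores = [0.10, 0.20, 0.80, 0.15, 0.10, 0.25, 0.20, 0.10, 0.90, 0.40]

labels = threshold_by_contamination(scores, contamination=0.2)
errors = [i for i, (t, p) in enumerate(zip(truth, labels)) if t != p]
print(labels)  # [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
print(errors)  # [2, 9]
```

No threshold placed on these scores can separate index 2 from index 9 correctly, because the detector ranked them in the wrong order.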
How do I accurately threshold outliers from a test dataset with respect to a training dataset?
There are a few ways to threshold test data with respect to the
training dataset. A good approach is to compute the outlier likelihood
scores of the test data with respect to the training data. Many outlier
detection methods support this (e.g. the decision_function method of a
fitted PyOD model). Note that not all outlier detection methods
implement this functionality correctly, so it is best to check. The
thresholding method can then be fitted on the training dataset and
applied to the outlier likelihood scores computed for any new data
using the predict function.
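The workflow can be sketched with two stand-in classes written with the standard library only. Both classes are hypothetical: they mimic the fit / decision_function / predict pattern described above on one-dimensional data, and they are not the PyOD or PyThresh implementations.

```python
import statistics


class MedianDistanceDetector:
    """Hypothetical 1-D detector: score is the distance from the training median."""

    def fit(self, X):
        self.center_ = statistics.median(X)
        self.decision_scores_ = [abs(x - self.center_) for x in X]
        return self

    def decision_function(self, X):
        # New data is scored against the center learned from the training data.
        return [abs(x - self.center_) for x in X]


class MADThresholder:
    """Hypothetical thresholder: median + k scaled MADs of the training scores."""

    def __init__(self, k=3.0):
        self.k = k
        self.thresh_ = None

    def fit(self, scores):
        med = statistics.median(scores)
        mad = statistics.median(abs(s - med) for s in scores)
        self.thresh_ = med + self.k * 1.4826 * mad
        return self

    def predict(self, scores):
        # Apply the threshold learned during fit; nothing is re-estimated here.
        return [1 if s > self.thresh_ else 0 for s in scores]


X_train = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.1, 0.9, 5.0]
X_test = [1.0, 4.9, 1.2, 6.0]

detector = MedianDistanceDetector().fit(X_train)
thresholder = MADThresholder().fit(detector.decision_scores_)

train_labels = thresholder.predict(detector.decision_scores_)
test_labels = thresholder.predict(detector.decision_function(X_test))
print(test_labels)  # [0, 1, 0, 1]
```

The key design point is that both the detector and the threshold are learned once from the training data, so test points are never allowed to shift the decision boundary.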
How can I visualize the results?
There are a few ways to visualize the labeled classes. One method
involves applying a 2D or 3D PCA transformation to the dataset and
scatter plotting the transformed variables while setting the colors to
the binary label output of the thresholder. Please note that a PCA
transformation introduces its own bias when visualizing the results, and
sometimes it may look as if the outlier detection and thresholding have
not worked well at all (this is especially true for data with a highly
non-linear relationship between the classes). In this case a non-linear
or more robust decomposition method may be a better choice for
visualizing the results. Another way to visualize the labeled classes is
to generate a kernel density estimate of the outlier likelihood scores
and plot a vertical line at the threshold point. This point can be
obtained from the thresh_ attribute after evaluating the likelihood
scores.
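The kernel density approach can be sketched without a plotting library. The example below is a minimal illustration with assumed names, a hand-picked bandwidth, and a hypothetical threshold value: it evaluates a Gaussian kernel density estimate on a grid and renders it as text, marking the grid point closest to the threshold. With a plotting library, the same grid and density values would be drawn as a curve with a vertical line at the threshold.

```python
import math


def gaussian_kde(scores, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `scores` at each grid point."""
    norm = 1.0 / (len(scores) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in scores)
        for g in grid
    ]


def text_density_plot(scores, threshold, bandwidth=0.05, n_points=21, width=40):
    """Return text rows: one bar per grid point, with the threshold row marked."""
    lo, hi = min(scores), max(scores)
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    density = gaussian_kde(scores, grid, bandwidth)
    peak = max(density)
    # The grid point closest to the threshold gets the marker.
    marked = min(range(n_points), key=lambda i: abs(grid[i] - threshold))
    rows = []
    for i, (g, d) in enumerate(zip(grid, density)):
        bar = "#" * round(width * d / peak)
        marker = " <-- threshold" if i == marked else ""
        rows.append(f"{g:5.2f} | {bar}{marker}")
    return rows


scores = [0.10, 0.12, 0.11, 0.13, 0.09, 0.14, 0.10, 0.12, 0.95, 0.11]
threshold = 0.5  # hypothetical value; in practice read it from thresh_
rows = text_density_plot(scores, threshold)
print("\n".join(rows))
```

A threshold that sits in a low-density gap between two modes of the scores, as here, is usually a sign that the split is reasonable.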
Can thresholders do multiclass thresholding?
The short answer is: kind of. PyThresh thresholding involves only
binary classification. However, if you want multiclass outlier
classification (e.g. inliers, uncertains, outliers), the CONF utility
available in pythresh.utils.conf can be used to return the indices of
datapoints that fall within the confidence bounds of uncertainty for a
selected confidence level.
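The three-class idea can be illustrated with a deliberately simplified sketch. It assumes a fixed margin around the threshold, which is a stand-in chosen for this example and not how CONF derives its confidence bounds; the function name, scores, threshold, and margin are all hypothetical.

```python
def three_way_labels(scores, threshold, margin):
    """Label scores as 'inlier', 'uncertain', or 'outlier'.

    Scores within `margin` of the threshold are treated as uncertain.
    """
    labels = []
    for s in scores:
        if abs(s - threshold) <= margin:
            labels.append("uncertain")
        elif s > threshold:
            labels.append("outlier")
        else:
            labels.append("inlier")
    return labels


scores = [0.10, 0.45, 0.52, 0.90, 0.30, 0.58]
labels = three_way_labels(scores, threshold=0.5, margin=0.06)
# Indices of the uncertain points, analogous to what a confidence utility returns.
uncertain_idx = [i for i, lab in enumerate(labels) if lab == "uncertain"]
print(labels)
print(uncertain_idx)  # [1, 2]
```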
Contributing
Anyone is welcome to contribute to PyThresh:
Please share your ideas and ask questions by opening an issue.
To contribute, first check the Issue list for the “help wanted” tag and comment on the one that you are interested in. The issue will then be assigned to you.
If the bug, feature, or documentation change is novel (not in the Issue list), you can either log a new issue or create a pull request for the new changes.
To start, fork the main branch and add your improvement/modification/fix.
To make sure the code follows the same style and standard, please refer to qmcd.py as an example.
Create a pull request to the main branch and follow the PR template.
Please make sure that all code changes are accompanied by new or updated test functions. Automatic tests will be triggered, and all of them must pass before the pull request can be merged.