###################################
 Welcome to PyThresh Documentation
###################################

**Deployment, Stats, & License**

|badge_pypi| |badge_anaconda| |badge_docs| |badge_testing| |badge_coverage| |badge_maintainability| |badge_stars| |badge_downloads| |badge_versions| |badge_licence| |badge_citation|

.. |badge_pypi| image:: https://img.shields.io/pypi/v/pythresh.svg?color=brightgreen&logo=pypi&logoColor=white
   :alt: PyPI version
   :target: https://pypi.org/project/pythresh/

.. |badge_anaconda| image:: https://img.shields.io/conda/vn/conda-forge/pythresh?color=brightgreen&logo=conda-forge&logoColor=white
   :alt: Anaconda version
   :target: https://anaconda.org/conda-forge/pythresh

.. |badge_docs| image:: https://img.shields.io/readthedocs/pythresh.svg?version=latest&logo=read-the-docs&logoColor=white
   :alt: Documentation status
   :target: http://pythresh.readthedocs.io/?badge=latest

.. |badge_testing| image:: https://github.com/KulikDM/pythresh/actions/workflows/ci.yml/badge.svg
   :alt: testing
   :target: https://github.com/KulikDM/pythresh/actions/workflows/ci.yml

.. |badge_coverage| image:: https://codecov.io/gh/KulikDM/pythresh/branch/main/graph/badge.svg?token=8ZAPXTLW9Y
   :alt: Codecov
   :target: https://codecov.io/gh/KulikDM/pythresh

.. |badge_maintainability| image:: https://api.codeclimate.com/v1/badges/3e2de42b48701c731ef6/maintainability
   :alt: Maintainability
   :target: https://codeclimate.com/github/KulikDM/pythresh/maintainability

.. |badge_stars| image:: https://img.shields.io/github/stars/KulikDM/pythresh.svg?logo=github&logoColor=white&style=flat
   :alt: GitHub stars
   :target: https://github.com/KulikDM/pythresh/stargazers

.. |badge_downloads| image:: https://img.shields.io/badge/dynamic/xml?url=https%3A%2F%2Fstatic.pepy.tech%2Fbadge%2Fpythresh&query=%2F%2F*%5Blocal-name()%20%3D%20%27text%27%5D%5Blast()%5D&logo=data%3Aimage%2Fsvg%2Bxml%3Bbase64%2CPHN2ZyBzdHlsZT0iZW5hYmxlLWJhY2tncm91bmQ6bmV3IDAgMCAyNCAyNDsiIHZlcnNpb249IjEuMSIgdmlld0JveD0iMCAwIDI0IDI0IiB4bWw6c3BhY2U9InByZXNlcnZlIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIj48ZyBpZD0iaW5mbyIvPjxnIGlkPSJpY29ucyI%2BPGcgaWQ9InNhdmUiPjxwYXRoIGQ9Ik0xMS4yLDE2LjZjMC40LDAuNSwxLjIsMC41LDEuNiwwbDYtNi4zQzE5LjMsOS44LDE4LjgsOSwxOCw5aC00YzAsMCwwLjItNC42LDAtN2MtMC4xLTEuMS0wLjktMi0yLTJjLTEuMSwwLTEuOSwwLjktMiwyICAgIGMtMC4yLDIuMywwLDcsMCw3SDZjLTAuOCwwLTEuMywwLjgtMC44LDEuNEwxMS4yLDE2LjZ6IiBmaWxsPSIjZWJlYmViIi8%2BPHBhdGggZD0iTTE5LDE5SDVjLTEuMSwwLTIsMC45LTIsMnYwYzAsMC42LDAuNCwxLDEsMWgxNmMwLjYsMCwxLTAuNCwxLTF2MEMyMSwxOS45LDIwLjEsMTksMTksMTl6IiBmaWxsPSIjZWJlYmViIi8%2BPC9nPjwvZz48L3N2Zz4%3D&label=downloads
   :alt: Downloads
   :target: https://pepy.tech/project/pythresh

.. |badge_versions| image:: https://img.shields.io/pypi/pyversions/pythresh.svg?logo=python&logoColor=white
   :alt: Python versions
   :target: https://pypi.org/project/pythresh/

.. |badge_licence| image:: https://img.shields.io/github/license/KulikDM/pythresh.svg?logo=data:image/svg+xml;base64,PHN2ZyBoZWlnaHQ9IjMyIiBpZD0iaWNvbiIgdmlld0JveD0iMCAwIDMyIDMyIiB3aWR0aD0iMzIiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyI+PGRlZnMgZmlsbD0iI2ViZjJlZSI+PHN0eWxlPgogICAgICAuY2xzLTEgewogICAgICAgIGZpbGw6IG5vbmU7CiAgICAgIH0KICAgIDwvc3R5bGU+PC9kZWZzPjxyZWN0IGhlaWdodD0iMiIgd2lkdGg9IjEyIiB4PSI4IiB5PSI2IiBmaWxsPSIjZWJmMmVlIi8+PHJlY3QgaGVpZ2h0PSIyIiB3aWR0aD0iMTIiIHg9IjgiIHk9IjEwIiBmaWxsPSIjZWJmMmVlIi8+PHJlY3QgaGVpZ2h0PSIyIiB3aWR0aD0iNiIgeD0iOCIgeT0iMTQiIGZpbGw9IiNlYmYyZWUiLz48cmVjdCBoZWlnaHQ9IjIiIHdpZHRoPSI0IiB4PSI4IiB5PSIyNCIgZmlsbD0iI2ViZjJlZSIvPjxwYXRoIGQ9Ik0yOS43MDcsMTkuMjkzbC0zLTNhLjk5OTQuOTk5NCwwLDAsMC0xLjQxNCwwTDE2LDI1LjU4NTlWMzBoNC40MTQxbDkuMjkyOS05LjI5M0EuOTk5NC45OTk0LDAsMCwwLDI5LjcwNywxOS4yOTNaTTE5LjU4NTksMjhIMThWMjYuNDE0MWw1LTVMMjQuNTg1OSwyM1pNMjYsMjEuNTg1OSwyNC40MTQxLDIwLDI2LDE4LjQxNDEsMjcuNTg1OSwyMFoiIGZpbGw9IiNlYmYyZWUiLz48cGF0aCBkPSJNMTIsMzBINmEyLjAwMjEsMi4wMDIxLDAsMCwxLTItMlY0QTIuMDAyMSwyLjAwMjEsMCwwLDEsNiwySDIyYTIuMDAyMSwyLjAwMjEsMCwwLDEsMiwyVjE0SDIyVjRINlYyOGg2WiIgZmlsbD0iI2ViZjJlZSIvPjxyZWN0IGNsYXNzPSJjbHMtMSIgZGF0YS1uYW1lPSImbHQ7VHJhbnNwYXJlbnQgUmVjdGFuZ2xlJmd0OyIgaGVpZ2h0PSIzMiIgaWQ9Il9UcmFuc3BhcmVudF9SZWN0YW5nbGVfIiB3aWR0aD0iMzIiIGZpbGw9IiNlYmYyZWUiLz48L3N2Zz4=
   :alt: License
   :target: https://github.com/KulikDM/pythresh/blob/main/LICENSE

.. |badge_citation| image:: https://zenodo.org/badge/497683169.svg
   :alt: Zenodo DOI
   :target: https://zenodo.org/badge/latestdoi/497683169

----

PyThresh is a comprehensive and scalable **Python toolkit** for
**thresholding outlier detection likelihood scores** in
univariate/multivariate data. It has been written to work in tandem with
PyOD and has similar syntax and data structures. However, it is not
limited to this single library.

PyThresh is meant to threshold likelihood scores generated by an outlier
detector.
It thresholds these likelihood scores and replaces the need to set a
contamination level or have the user guess the number of outliers that
may exist in the dataset beforehand. These non-parametric methods were
written to reduce user input and guesswork and to rely on statistics
instead to threshold outlier likelihood scores. For thresholding to be
applied correctly, the outlier detection likelihood scores must follow
this rule: the higher the score, the higher the probability that it is
an outlier in the dataset. All threshold functions return a binary array
where inliers and outliers are represented by 0 and 1, respectively.

PyThresh includes more than 30 thresholding algorithms. These algorithms
range from simple statistical analysis like the Z-score to more complex
mathematical methods that involve graph theory and topology.

**API Demo**:

.. code:: python

   # train the KNN detector
   from pyod.models.knn import KNN
   from pythresh.thresholds.clust import CLUST

   # X_train: training data of shape (n_samples, n_features), defined elsewhere
   clf = KNN()
   clf.fit(X_train)

   # get outlier likelihood scores
   decision_scores = clf.decision_scores_

   # get outlier labels
   thres = CLUST()
   thres.fit(decision_scores)
   labels = thres.labels_  # or thres.predict(decision_scores)

----

**************************
 Benchmarking & Utilities
**************************

Benchmarking has been done on all the thresholders and it was found that
the ``MIXMOD`` thresholder performed best, while the ``CLF`` thresholder
provided the smallest uncertainty about its mean and is the most robust
(its least accurate prediction was still the best among the
thresholders). However, for interpretability and general performance the
``MIXMOD``, ``FILTER``, and ``META`` thresholders are good fits.
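
As a rough illustration of how two thresholding rules can be compared against
known labels, the sketch below applies two hypothetical rules to a toy set of
scores and scores each result with F1. The helper names, the toy data, the
z-score cutoff, the assumed contamination, and the choice of F1 are all
assumptions made for this example; it is not PyThresh's implementation and not
the procedure used in its benchmark. It only relies on the contract stated
earlier: higher scores mean more likely outliers, and labels are 0 for inliers
and 1 for outliers.

```python
from statistics import mean, pstdev


def normalize(scores):
    """Min-max scale scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def zscore_labels(scores, cutoff=1.0):
    """Hypothetical rule: label 1 where the z-score exceeds `cutoff`."""
    norm = normalize(scores)
    mu, sigma = mean(norm), pstdev(norm)
    if sigma == 0:
        return [0] * len(norm)
    return [1 if (s - mu) / sigma > cutoff else 0 for s in norm]


def percentile_labels(scores, contamination=0.1):
    """Hypothetical rule: label the top `contamination` fraction as 1."""
    k = max(1, round(contamination * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(order[:k])
    return [1 if i in top else 0 for i in range(len(scores))]


def f1_score(y_true, y_pred):
    """F1 score for the outlier class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy likelihood scores: eight inliers followed by two clear outliers.
scores = [0.10, 0.12, 0.11, 0.13, 0.09, 0.14, 0.10, 0.12, 0.95, 0.90]
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

print(round(f1_score(y_true, zscore_labels(scores)), 3))      # 1.0
print(round(f1_score(y_true, percentile_labels(scores)), 3))  # 0.667
```

On this toy data the guessed contamination of 0.1 flags only one of the two
outliers, which is the kind of guesswork a data-driven threshold is meant to
avoid.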

Further utilities are available for assisting in the selection of the
optimal outlier detection and thresholding methods `ranking `_ as well
as determining the confidence with regard to the selected thresholding
method `thresholding confidence `_.

----

************************
 External Feature Cases
************************

**Towards Data Science**: `Thresholding Outlier Detection Scores with PyThresh `_

**Towards Data Science**: `When Outliers are Significant: Weighted Linear Regression `_

**ArXiv**: `Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection `_

----

***********************************
 Available Thresholding Algorithms
***********************************

+-----------+----------------------------------------------------------------+-----------------------------------+
| Abbr      | Description                                                    | References                        |
+===========+================================================================+===================================+
| AUCP      | Area Under Curve Percentage                                    | :cite:`ren2018aucp`               |
+-----------+----------------------------------------------------------------+-----------------------------------+
| BOOT      | Bootstrapping                                                  | :cite:`martin2006boot`            |
+-----------+----------------------------------------------------------------+-----------------------------------+
| CHAU      | Chauvenet's Criterion                                          | :cite:`bolshev2016chau`           |
+-----------+----------------------------------------------------------------+-----------------------------------+
| CLF       | Trained Linear Classifier                                      | :cite:`aggarwal2017clf`           |
+-----------+----------------------------------------------------------------+-----------------------------------+
| CLUST     | Clustering Based                                               | :cite:`klawonn2008clust`          |
+-----------+----------------------------------------------------------------+-----------------------------------+
| CPD       | Change Point Detection                                         | :cite:`fearnhead2016cpd`          |
+-----------+----------------------------------------------------------------+-----------------------------------+
| DECOMP    | Decomposition                                                  | :cite:`boente2002decomp`          |
+-----------+----------------------------------------------------------------+-----------------------------------+
| DSN       | Distance Shift from Normal                                     | :cite:`amagata2021dsn`            |
+-----------+----------------------------------------------------------------+-----------------------------------+
| EB        | Elliptical Boundary                                            | :cite:`friendly2013eb`            |
+-----------+----------------------------------------------------------------+-----------------------------------+
| FGD       | Fixed Gradient Descent                                         | :cite:`qi2021fgd`                 |
+-----------+----------------------------------------------------------------+-----------------------------------+
| FILTER    | Filtering Based                                                | :cite:`hashemi2019filter`         |
+-----------+----------------------------------------------------------------+-----------------------------------+
| FWFM      | Full Width at Full Minimum                                     | :cite:`joneidi2013fwfm`           |
+-----------+----------------------------------------------------------------+-----------------------------------+
| GAMGMM    | Bayesian Gamma GMM                                             | :cite:`perini2023gamgmm`          |
+-----------+----------------------------------------------------------------+-----------------------------------+
| GESD      | Generalized Extreme Studentized Deviate                        | :cite:`alrawashdeh2021gesd`       |
+-----------+----------------------------------------------------------------+-----------------------------------+
| HIST      | Histogram Based                                                | :cite:`thanammal2015hist`         |
+-----------+----------------------------------------------------------------+-----------------------------------+
| IQR       | Inter-Quartile Region                                          | :cite:`bardet2015iqr`             |
+-----------+----------------------------------------------------------------+-----------------------------------+
| KARCH     | Karcher mean (Riemannian Center of Mass)                       | :cite:`afsari2011karch`           |
+-----------+----------------------------------------------------------------+-----------------------------------+
| MAD       | Median Absolute Deviation                                      | :cite:`archana2015mad`            |
+-----------+----------------------------------------------------------------+-----------------------------------+
| MCST      | Monte Carlo Shapiro Tests                                      | :cite:`coin2008mcst`              |
+-----------+----------------------------------------------------------------+-----------------------------------+
| META      | Metamodel Trained Classifier                                   | :cite:`zhao2022meta`              |
+-----------+----------------------------------------------------------------+-----------------------------------+
| MIXMOD    | Normal & Non-Normal Mixture Models                             | :cite:`veluw2023mixmod`           |
+-----------+----------------------------------------------------------------+-----------------------------------+
| MOLL      | Friedrichs' Mollifier                                          | :cite:`keyzer1997moll`            |
+-----------+----------------------------------------------------------------+-----------------------------------+
| MTT       | Modified Thompson Tau Test                                     | :cite:`rengasamy2020mtt`          |
+-----------+----------------------------------------------------------------+-----------------------------------+
| OCSVM     | One-Class Support Vector Machine                               | :cite:`barbado2022ocsvm`          |
+-----------+----------------------------------------------------------------+-----------------------------------+
| QMCD      | Quasi-Monte Carlo Discrepancy                                  | :cite:`iouchtchenko2019qmcd`      |
+-----------+----------------------------------------------------------------+-----------------------------------+
| REGR      | Regression Based                                               | :cite:`aggarwal2017clf`           |
+-----------+----------------------------------------------------------------+-----------------------------------+
| VAE       | Variational Autoencoder                                        | :cite:`xiao2020vae`               |
+-----------+----------------------------------------------------------------+-----------------------------------+
| WIND      | Topological Winding Number                                     | :cite:`jacobson2013wind`          |
+-----------+----------------------------------------------------------------+-----------------------------------+
| YJ        | Yeo-Johnson Transformation                                     | :cite:`raymaekers2021yj`          |
+-----------+----------------------------------------------------------------+-----------------------------------+
| ZSCORE    | Z-score                                                        | :cite:`bagdonavicius2020zscore`   |
+-----------+----------------------------------------------------------------+-----------------------------------+
| COMB      | Thresholder Combination                                        |                                   |
+-----------+----------------------------------------------------------------+-----------------------------------+
| DUMMY     | Dummy Percentile Based                                         |                                   |
+-----------+----------------------------------------------------------------+-----------------------------------+

**Tutorial Notebooks**

+---------------------------+------------------------------------------------------------------------------------------------+
| Notebook                  | Description                                                                                    |
+===========================+================================================================================================+
| `Introduction `_          | Basic introduction to outlier thresholding                                                     |
+---------------------------+------------------------------------------------------------------------------------------------+
| `Advanced Thresholding `_ | Additional thresholding options for more advanced use                                          |
+---------------------------+------------------------------------------------------------------------------------------------+
| `Threshold Confidence `_  | Calculating the confidence levels around the threshold point                                   |
+---------------------------+------------------------------------------------------------------------------------------------+
| `Outlier Ranking `_       | Assisting in selecting the best performing outlier and thresholding method combo using ranking |
+---------------------------+------------------------------------------------------------------------------------------------+

**The comparison among implemented models** is made available below:

.. thumbnail:: figs/All.png
   :alt: Comparison of selected models

############################
 API Cheatsheet & Reference
############################

The following APIs are applicable for all thresholders for easy use.

- :func:`pythresh.thresholds.base.BaseThresholder.eval`: evaluate a
  single outlier or multiple outlier detection likelihood score set
  (Legacy method).

- :func:`pythresh.thresholds.base.BaseThresholder.fit`: fit a
  thresholder for a single outlier or multiple outlier detection
  likelihood score set.

- :func:`pythresh.thresholds.base.BaseThresholder.predict`: predict the
  binary labels using the fitted thresholder on a single outlier or
  multiple outlier detection likelihood score set.

Key attributes of a fitted thresholder:

- :attr:`pythresh.thresholds.base.BaseThresholder.thresh_`: Return the
  threshold value that separates inliers from outliers. All values above
  this threshold are considered outliers. Note that the threshold value
  is derived from likelihood scores normalized between 0 and 1.

- :attr:`pythresh.thresholds.base.BaseThresholder.labels_`: Return a
  binary array of labels for the fitted thresholder on the fitted
  dataset.

- :attr:`pythresh.thresholds.base.BaseThresholder.confidence_interval_`:
  Return the lower and upper confidence interval of the contamination
  level. Only applies to the COMB thresholder.

- :attr:`pythresh.thresholds.base.BaseThresholder.dscores_`: 1D array of
  the TruncatedSVD decomposed decision scores if multiple outlier
  detector score sets are passed.

- :attr:`pythresh.thresholds.mixmod.MIXMOD.mixture_`: fitted mixture
  model class of the selected model used for thresholding. Only applies
  to MIXMOD. Attributes include: components, weights, params. Functions
  include: fit, loglikelihood, pdf, and posterior.

----

.. toctree::
   :maxdepth: 2
   :hidden:
   :caption: Getting Started

   install
   example
   benchmark
   ranking
   confidence

.. toctree::
   :maxdepth: 2
   :hidden:
   :caption: Documentation

   api_cc
   pythresh

.. toctree::
   :maxdepth: 2
   :hidden:
   :caption: Additional Information

   FAQ

----

.. rubric:: References

.. bibliography::
   :cited:
   :labelprefix:
   :keyprefix: a-