Version: 0.2.4 PyPI

https://github.com/hfawal/clustering_leakage_analysis

LeakyBlobs is a machine learning package for Python ≥ 3.11 that provides tools for evaluating and analyzing the quality of clustered data. Its main purpose is to measure the 'leakage' between clusters using the predicted probabilities of a classification model.

LeakyBlobs provides an alternative to traditional ways of evaluating the quality of a clustering, such as the Elbow Method, Silhouette Score, and Gap Statistic. These methods tend to oversimplify cluster evaluation by reducing it to a single number that is difficult for a person to interpret, which often leads to highly subjective choices of clustering hyperparameters, such as the number of clusters in algorithms like K-Means.

Instead, LeakyBlobs is based on the idea that a good clustering is a predictable clustering. The package provides tools to train simple classifiers that predict cluster labels, and tools to analyze the predicted probabilities of those models to see the extent to which clusters 'leak' into each other.
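The idea can be illustrated without the package itself. The following is a minimal sketch using only scikit-learn and NumPy; it is not the LeakyBlobs API. The synthetic data, the choice of logistic regression, and the `leakage` matrix are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data and a clustering to evaluate (illustrative only).
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=2.0, random_state=0)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hold out a test set so that the probabilities are out-of-sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, clusters, test_size=0.3, random_state=0, stratify=clusters
)

# A simple classifier trained to predict the cluster labels.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # shape: (n_test_points, n_clusters)

# Hypothetical "leakage" summary: row i holds the mean predicted
# probability of each cluster over the test points that belong to cluster i.
# Large off-diagonal values suggest cluster i leaks into another cluster.
leakage = np.vstack([proba[y_test == k].mean(axis=0) for k in clf.classes_])
print(np.round(leakage, 3))
```

Each row of the printed matrix sums to 1. A well-separated clustering would put most of the probability mass on the diagonal.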

Installation

LeakyBlobs is freely available through the Python Package Index. To install it, simply run the following in the command line within your Python environment:

pip install leakyblobs

LeakyBlobs requires the following packages, which pip should resolve automatically when you run the command above.

numpy>=1.26.1
pandas>=2.0.0
openpyxl>=3.1.5
pyvis>=0.3.2
plotly>=5.20.0
scipy>=1.14.0
setuptools>=72.1.0
scikit-learn>=1.5.1

Main Features

LeakyBlobs consists of two independent components: the ClusterPredictor class and the ClusterEvaluator class. To produce cluster evaluation metrics, ClusterEvaluator requires the targets, predictions, and predicted probabilities of a model evaluated on a test set. The model should be trained to predict cluster labels, using as features the same data that was clustered. To use the LeakyBlobs metrics from ClusterEvaluator, you have two options:

Important Points to Keep in Mind

  1. When using LeakyBlobs, it is critical that the metrics are computed on a test set or other labeled out-of-sample data. Because a model generally performs better on the data it was trained on, metrics calculated from the training data will overstate the strength of the clustering.
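The gap between in-sample and out-of-sample results can be seen in a small experiment. This is a hedged sketch using scikit-learn only, not LeakyBlobs code. The overlapping synthetic blobs and the unpruned decision tree are assumptions, chosen because such a tree can memorize its training data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Overlapping blobs, clustered with K-Means (illustrative data only).
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=3.0, random_state=1)
clusters = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, clusters, test_size=0.3, random_state=1, stratify=clusters
)

# An unpruned tree is flexible enough to fit the training labels very closely.
model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)  # in-sample
test_accuracy = model.score(X_test, y_test)     # out-of-sample

print(f"in-sample accuracy:     {train_accuracy:.3f}")
print(f"out-of-sample accuracy: {test_accuracy:.3f}")
```

Any metric derived from the in-sample predictions would inherit the same optimism, which is why the evaluation should use the held-out split.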