Version: 0.2.4 PyPI
https://github.com/hfawal/clustering_leakage_analysis
LeakyBlobs is a machine learning package for Python ≥ 3.11 which provides tools for evaluating and analyzing the quality of clustered data. Its main purpose is to measure the ‘leakage’ between clusters by using the predicted probabilities of a classification model.
LeakyBlobs provides a sensible alternative to traditional ways of evaluating the quality of a clustering, such as the Elbow Method, Silhouette Score, and Gap Statistic. These methods tend to oversimplify cluster evaluation by reducing it to a single number that is hard for a person to interpret, which often leads to highly subjective choices for clustering hyperparameters such as the number of clusters in algorithms like K-Means.
Instead, LeakyBlobs is based on the idea that a good clustering is a predictable clustering. The package provides tools to train simple classifiers to predict clusters and tools to analyze model probability outputs in order to see the extent to which clusters 'leak' into each other.
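To make the idea concrete, here is a minimal sketch written with scikit-learn only. It does not use the LeakyBlobs API and is not the package's metric: the synthetic data, the K-Means settings, and the helper `mean_probability_matrix` are all assumptions made for illustration. It clusters some data, trains a simple classifier to predict the cluster labels, and averages the predicted probabilities per true cluster so that off-diagonal mass hints at how much one cluster "leaks" into another.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, deliberately overlapping data, clustered with K-Means (illustrative only).
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=2.0, random_state=0)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Train a simple classifier to predict the cluster labels from the clustered features.
X_train, X_test, y_train, y_test = train_test_split(
    X, clusters, test_size=0.3, stratify=clusters, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The three ingredients an evaluation would start from, taken on the test set.
targets = y_test
predictions = clf.predict(X_test)
probabilities = clf.predict_proba(X_test)  # shape: (n_test, n_clusters)


def mean_probability_matrix(targets, probabilities, labels):
    """Hypothetical helper (not part of LeakyBlobs): row i holds the mean
    predicted probabilities over test points whose true cluster is labels[i]."""
    return np.vstack([probabilities[targets == k].mean(axis=0) for k in labels])


# Columns of predict_proba follow clf.classes_, so rows and columns share one order.
leak = mean_probability_matrix(targets, probabilities, clf.classes_)
print(np.round(leak, 3))
```

Each row of the printed matrix sums to 1. A well-separated clustering would put nearly all of each row's mass on the diagonal, while large off-diagonal entries point to pairs of clusters that the classifier struggles to tell apart.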
LeakyBlobs is freely available through the Python Package Index. To install it, simply run the following in the command line within your Python environment:
pip install leakyblobs
LeakyBlobs requires the following packages, which should be automatically resolved if you are using the pip command above.
numpy>=1.26.1
pandas>=2.0.0
openpyxl>=3.1.5
pyvis>=0.3.2
plotly>=5.20.0
scipy>=1.14.0
setuptools>=72.1.0
scikit-learn>=1.5.1
LeakyBlobs consists of two independent components: the ClusterPredictor class and the ClusterEvaluator class. To produce cluster evaluation metrics, the ClusterEvaluator requires targets, predictions, and probabilities from a model on a test set. The model should be trained to predict clusters, using as inputs/features the same data that was clustered. To use the LeakyBlobs metrics from ClusterEvaluator, you have two options:
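Train your own classification model and pass its test set targets, predictions, and probabilities to a ClusterEvaluator.
Use a ClusterPredictor and pass its test set predictions directly to a ClusterEvaluator.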
The ClusterPredictor class is effectively a wrapper around a logistic regression model. The class also has a parameter that lets the model learn non-linear decision boundaries by using Random Fourier Features; this is enabled by default.
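For readers unfamiliar with Random Fourier Features, the sketch below shows the general technique using scikit-learn's `RBFSampler` in front of a logistic regression. It is not the ClusterPredictor implementation, whose internals and parameter names are not described here. The two-moons data, the `gamma` and `n_components` values, and the variable names are assumptions chosen for illustration, and the labels `y` merely stand in for cluster assignments.

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two interleaved half-moons: a case where no straight line separates the groups.
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Plain logistic regression: linear decision boundary.
linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Same classifier on top of random Fourier features, which approximate an RBF
# kernel and therefore allow a non-linear boundary in the original space.
rff = make_pipeline(
    StandardScaler(),
    RBFSampler(gamma=1.0, n_components=200, random_state=0),
    LogisticRegression(max_iter=1000),
)

linear.fit(X_train, y_train)
rff.fit(X_train, y_train)

linear_acc = linear.score(X_test, y_test)
rff_acc = rff.score(X_test, y_test)
rff_probabilities = rff.predict_proba(X_test)  # per-class probabilities

print(f"linear accuracy: {linear_acc:.3f}")
print(f"random Fourier features accuracy: {rff_acc:.3f}")
```

On data like this, the random-feature pipeline would typically score higher than the purely linear one, which is the motivation for enabling such features when clusters are not linearly separable.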