Clustering

K-Means Clustering

class graspologic.cluster.KMeansCluster
ari_: List[float] | None

KMeans Cluster.

It fits every candidate model from two clusters up to max_clusters. When the true labels are known, the best model is the one with the highest adjusted Rand index (ARI). Otherwise, the best model is the one with the highest silhouette score.
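The selection rule above can be sketched in plain Python. The function name `select_n_clusters` and the score values are hypothetical and serve only to illustrate the rule; they are not part of graspologic.

```python
from typing import List, Optional


def select_n_clusters(
    silhouette: List[float],
    ari: Optional[List[float]] = None,
    first_k: int = 2,
) -> int:
    """Return the cluster count with the highest score.

    Scores are assumed to be ordered by cluster count, starting at first_k.
    ARI is used when it is available (labels were given); otherwise the
    silhouette score is used.
    """
    scores = ari if ari is not None else silhouette
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return first_k + best_index


# Made-up scores for k = 2, 3, 4.
silhouette_scores = [0.41, 0.63, 0.52]
ari_scores = [0.70, 0.65, 0.90]

k_without_labels = select_n_clusters(silhouette_scores)           # 3
k_with_labels = select_n_clusters(silhouette_scores, ari_scores)  # 4
```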

Parameters:
max_clusters: int, default=2.

The maximum number of clusters to consider. Must be >=2.

random_state: int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Attributes:
n_clusters_: int

Optimal number of clusters. If y is given, it is based on the largest ARI. Otherwise, it is based on the highest silhouette score.

model_: KMeans object

KMeans object fitted with n_clusters_ clusters.

silhouette_: list

List of silhouette scores computed for every number of clusters in range(2, max_clusters + 1).
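For readers unfamiliar with the silhouette score, here is a minimal pure-Python sketch of the standard definition with Euclidean distances. It is an illustration only, not the routine graspologic calls; the name `mean_silhouette` and the sample points are made up.

```python
from math import dist
from typing import Sequence


def mean_silhouette(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient with Euclidean distances."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    coefficients = []
    for i, point in enumerate(points):
        # Distances from this point to every other point, grouped by cluster.
        by_cluster = {c: [] for c in clusters}
        for j, other in enumerate(points):
            if j != i:
                by_cluster[labels[j]].append(dist(point, other))

        own = by_cluster.pop(labels[i])
        if not own:
            coefficients.append(0.0)  # singleton cluster: coefficient is 0 by convention
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        coefficients.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(coefficients) / len(coefficients)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = mean_silhouette(points, [0, 1, 0, 1])   # clusters mixed across the gap
```

A labelling that matches the geometry scores close to 1, while a labelling that cuts across it scores below 0.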

ari_: list

Only computed when y is given. List of ARI values computed for every number of clusters in range(2, max_clusters + 1).
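As a reminder of what ARI measures, the following is a minimal pure-Python sketch of the standard adjusted Rand index formula. It is illustrative only and is not the implementation graspologic uses; the function name is an assumption.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]) -> float:
    """Adjusted Rand index between two labelings of the same samples."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: labelings agree trivially

    # Pairs of samples that share a cell of the contingency table, a true class,
    # or a predicted cluster.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_true * same_pred / total_pairs
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0
    return (same_both - expected) / (maximum - expected)


perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # 1.0, label names do not matter
opposite = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # -0.5
```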

__init__(max_clusters=2, random_state=None)
Parameters:
  • max_clusters (int) --

  • random_state (int | RandomState | None) --

fit(X, y=None)

Fits a k-means model to the data.

Parameters:
X: array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y: array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
self
Parameters:
  • X (ndarray) --

  • y (ndarray | None) --

Return type:

KMeansCluster

fit_predict(X, y=None)

Fit the models and predict clusters based on best model.

Parameters:
X: array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y: array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
labels: array, shape (n_samples,)

Component labels.

Parameters:
  • X (ndarray) --

  • y (Any | None) --

Return type:

ndarray

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing: MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params: dict

Parameter names mapped to their values.

predict(X, y=None)

Predict clusters based on best model.

Parameters:
X: array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y: array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
labels: array, shape (n_samples,)

Component labels.

Parameters:
  • X (ndarray) --

  • y (Any | None) --

Return type:

ndarray

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
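The `<component>__<parameter>` naming can be illustrated with two toy classes. This is a simplified sketch of the idea, not scikit-learn's implementation; `Component` and `Wrapper` are hypothetical names.

```python
class Component:
    """Toy inner estimator with a single parameter."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self


class Wrapper:
    """Toy nested object that forwards 'inner__<parameter>' to its component."""

    def __init__(self, inner=None, scale=1.0):
        self.inner = inner if inner is not None else Component()
        self.scale = scale

    def set_params(self, **params):
        for key, value in params.items():
            # Split "inner__alpha" into the component name and its parameter.
            name, delimiter, sub_key = key.partition("__")
            if delimiter:
                getattr(self, name).set_params(**{sub_key: value})
            else:
                setattr(self, name, value)
        return self


wrapper = Wrapper().set_params(scale=2.0, inner__alpha=0.5)
```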

Parameters:
**params: dict

Estimator parameters.

Returns:
self: estimator instance

Estimator instance.

Gaussian Mixture Models Clustering

class graspologic.cluster.GaussianCluster

Gaussian Mixture Model (GMM)

Representation of a Gaussian mixture model probability distribution. This class estimates the parameters of a Gaussian mixture distribution. It fits every candidate model from min_components to max_components components. The best model is the one with the lowest BIC score.

Parameters:
min_components: int, default=2.

The minimum number of mixture components to consider (unless max_components is None, in which case this is the maximum number of components to consider). If max_components is not None, min_components must be less than or equal to max_components.

max_components: int or None, default=None.

The maximum number of mixture components to consider. Must be greater than or equal to min_components.

covariance_type: {'all' (default), 'full', 'tied', 'diag', 'spherical'}, optional

String or list/array describing the type of covariance parameters to use. If a string, it must be one of:

  • 'all'

    considers all covariance structures in ['spherical', 'diag', 'tied', 'full']

  • 'full'

    each component has its own general covariance matrix

  • 'tied'

    all components share the same general covariance matrix

  • 'diag'

    each component has its own diagonal covariance matrix

  • 'spherical'

    each component has its own single variance

If a list/array, it must be a list/array of strings containing only 'spherical', 'tied', 'diag', and/or 'full'.
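The four covariance structures differ in how many free parameters they add to the model, which is what an information criterion such as BIC penalizes. The sketch below counts parameters using the standard textbook formulas; the helper name `n_free_parameters` is hypothetical and is not part of graspologic.

```python
def n_free_parameters(n_components: int, n_features: int, covariance_type: str) -> int:
    """Count the free parameters of a Gaussian mixture model."""
    # A symmetric n_features x n_features matrix has this many distinct entries.
    full_matrix = n_features * (n_features + 1) // 2
    covariance_params = {
        "full": n_components * full_matrix,   # one general matrix per component
        "tied": full_matrix,                  # one general matrix shared by all
        "diag": n_components * n_features,    # one diagonal per component
        "spherical": n_components,            # one variance per component
    }[covariance_type]
    mean_params = n_components * n_features
    weight_params = n_components - 1          # mixture weights sum to 1
    return covariance_params + mean_params + weight_params


counts = {
    name: n_free_parameters(n_components=3, n_features=2, covariance_type=name)
    for name in ["spherical", "diag", "tied", "full"]
}
# counts == {"spherical": 11, "diag": 14, "tied": 11, "full": 17}
```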

tol: float, defaults to 1e-3.

The convergence threshold. EM iterations will stop when the lower bound average gain is below this threshold.

reg_covar: float, defaults to 1e-6.

Non-negative regularization added to the diagonal of the covariance. This ensures that the covariance matrices are all positive.
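A small sketch of the effect, using plain nested lists: adding reg_covar to the diagonal turns a singular 2 x 2 covariance matrix into one with a strictly positive determinant. The helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def regularize(covariance: Matrix, reg_covar: float = 1e-6) -> Matrix:
    """Return a copy of the matrix with reg_covar added to each diagonal entry."""
    return [
        [value + (reg_covar if i == j else 0.0) for j, value in enumerate(row)]
        for i, row in enumerate(covariance)
    ]


def det2(matrix: Matrix) -> float:
    """Determinant of a 2 x 2 matrix."""
    return matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]


# Two perfectly correlated features give a singular covariance matrix.
singular = [[1.0, 1.0], [1.0, 1.0]]
regularized = regularize(singular)
```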

max_iter: int, defaults to 100.

The number of EM iterations to perform.
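How tol and max_iter interact can be sketched as a stopping rule over a sequence of per-iteration lower bounds. The bound values below are made up, and `run_until_converged` is a hypothetical helper, not a graspologic or scikit-learn function.

```python
from typing import Sequence, Tuple


def run_until_converged(
    lower_bounds: Sequence[float], tol: float = 1e-3, max_iter: int = 100
) -> Tuple[int, bool]:
    """Return (iterations used, whether the tolerance was reached)."""
    previous = float("-inf")
    for n_iter, current in enumerate(lower_bounds[:max_iter], start=1):
        # Stop as soon as the gain in the lower bound falls below tol.
        if abs(current - previous) < tol:
            return n_iter, True
        previous = current
    return min(len(lower_bounds), max_iter), False


bounds = [-5.0, -3.2, -3.05, -3.0495, -3.0494]
converged_run = run_until_converged(bounds)            # (4, True)
capped_run = run_until_converged(bounds, max_iter=3)   # (3, False)
```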

n_init: int, defaults to 1.

The number of initializations to perform. The best results are kept.

init_params: {'kmeans', 'random'}, defaults to 'kmeans'.

The method used to initialize the weights, the means and the precisions. Must be one of:

  • 'kmeans'

    responsibilities are initialized using kmeans

  • 'random'

    responsibilities are initialized randomly

random_state: int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Attributes:
n_components_: int

Optimal number of components based on BIC.

covariance_type_: str

Optimal covariance type based on BIC.

model_: GaussianMixture object

GaussianMixture object fitted with the optimal number of components and the optimal covariance structure.

bic_: pandas.DataFrame

A pandas DataFrame of BIC values computed for all possible number of clusters given by range(min_components, max_components + 1) and all covariance structures given by covariance_type.
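The selection over this grid can be sketched in plain Python with the standard BIC formula, BIC = k ln(n) - 2 ln(L). The log-likelihoods below are made up, and a dictionary stands in for the DataFrame; none of these names come from graspologic.

```python
from math import log


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian Information Criterion; lower is better."""
    return n_params * log(n_samples) - 2.0 * log_likelihood


n_samples = 200

# (n_components, covariance_type) -> (total log-likelihood, free parameters)
# for 2-D data. The log-likelihood values are invented for illustration.
fits = {
    (2, "spherical"): (-520.0, 7),
    (2, "full"): (-505.0, 11),
    (3, "spherical"): (-500.0, 11),
    (3, "full"): (-498.0, 17),
}

bic_table = {key: bic(ll, k, n_samples) for key, (ll, k) in fits.items()}
best_n_components, best_covariance_type = min(bic_table, key=bic_table.get)
```

Here the three-component full model has the highest likelihood, but its extra parameters cost more than they gain, so the three-component spherical model is selected.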

ari_: pandas.DataFrame

Only computed when y is given. A pandas DataFrame of ARI values computed for all possible number of clusters given by range(min_components, max_components + 1) and all covariance structures given by covariance_type.

__init__(min_components=2, max_components=None, covariance_type='all', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', random_state=None)
Parameters:
  • min_components (int) --

  • max_components (int | None) --

  • covariance_type (typing_extensions.Literal[all, spherical, diag, tied, full] | List[typing_extensions.Literal[spherical, diag, tied, full]] | ndarray) --

  • tol (float) --

  • reg_covar (float) --

  • max_iter (int) --

  • n_init (int) --

  • init_params (typing_extensions.Literal[random, kmeans]) --

  • random_state (int | RandomState | None) --

fit(X, y=None)

Fits a Gaussian mixture model to the data. Estimates model parameters with the EM algorithm.

Parameters:
X: array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y: array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
self
Parameters:
  • X (ndarray) --

  • y (ndarray | None) --

Return type:

GaussianCluster

fit_predict(X, y=None)

Fit the models and predict clusters based on best model.

Parameters:
X: array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y: array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
labels: array, shape (n_samples,)

Component labels.

Parameters:
  • X (ndarray) --

  • y (Any | None) --

Return type:

ndarray

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing: MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params: dict

Parameter names mapped to their values.

predict(X, y=None)

Predict clusters based on best model.

Parameters:
X: array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y: array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
labels: array, shape (n_samples,)

Component labels.

Parameters:
  • X (ndarray) --

  • y (Any | None) --

Return type:

ndarray

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:
**params: dict

Estimator parameters.

Returns:
self: estimator instance

Estimator instance.

class graspologic.cluster.AutoGMMCluster

Automatic Gaussian Mixture Model (GMM) selection.

Clustering algorithm that applies hierarchical agglomerative clustering and then fits a Gaussian mixture model (GMM). Different combinations of agglomeration, GMM, and cluster numbers are tried, and the clustering with the best selection criterion (bic/aic) is chosen.

Parameters:
min_components: int, default=2.

The minimum number of mixture components to consider (unless max_components is None, in which case this is the maximum number of components to consider). If max_components is not None, min_components must be less than or equal to max_components. If label_init is given, min_components must match the number of unique labels in label_init.

max_components: int or None, default=10.

The maximum number of mixture components to consider. Must be greater than or equal to min_components. If label_init is given, max_components must match the number of unique labels in label_init.

affinity: {'euclidean','manhattan','cosine','none', 'all' (default)}, optional

String or list/array describing the type of affinities to use in agglomeration. If a string, it must be one of:

  • 'euclidean'

    L2 norm

  • 'manhattan'

    L1 norm

  • 'cosine'

    cosine similarity

  • 'none'

    no agglomeration - GMM is initialized with k-means

  • 'all'

    considers all affinities in ['euclidean','manhattan','cosine','none']

If a list/array, it must be a list/array of strings containing only 'euclidean', 'manhattan', 'cosine', and/or 'none'.

Note that cosine similarity only works when no row is the zero vector. If the input matrix has a zero row, cosine similarity is skipped and a warning is issued.
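The zero-row rule can be sketched as a small filter over the requested affinities. This is an illustration of the behaviour described here, not graspologic's code; `usable_affinities` and the warning text are hypothetical.

```python
import warnings
from typing import List, Sequence


def usable_affinities(X: Sequence[Sequence[float]], affinities: Sequence[str]) -> List[str]:
    """Drop 'cosine' (with a warning) when X contains an all-zero row."""
    has_zero_row = any(all(value == 0 for value in row) for row in X)
    if has_zero_row and "cosine" in affinities:
        warnings.warn("X contains a zero row; skipping the 'cosine' affinity")
        return [name for name in affinities if name != "cosine"]
    return list(affinities)


X = [[1.0, 2.0], [0.0, 0.0], [3.0, 1.0]]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = usable_affinities(X, ["euclidean", "manhattan", "cosine", "none"])
```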

linkage: {'ward','complete','average','single', 'all' (default)}, optional

String or list/array describing the type of linkages to use in agglomeration. If a string, it must be one of:

  • 'ward'

    Ward's clustering, which can only be used with euclidean affinity

  • 'complete'

    complete linkage

  • 'average'

    average linkage

  • 'single'

    single linkage

  • 'all'

    considers all linkages in ['ward','complete','average','single']

If a list/array, it must be a list/array of strings containing only 'ward', 'complete', 'average', and/or 'single'.
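The constraint between affinity and linkage can be made concrete by enumerating the valid pairs. In this sketch, treating 'none' as a single option without a linkage is an assumption based on the description of 'none' as "no agglomeration"; the function name is hypothetical.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

AFFINITIES = ["euclidean", "manhattan", "cosine", "none"]
LINKAGES = ["ward", "complete", "average", "single"]


def agglomeration_options(
    affinities: Sequence[str] = AFFINITIES, linkages: Sequence[str] = LINKAGES
) -> List[Tuple[str, Optional[str]]]:
    """List the (affinity, linkage) pairs that are valid to try."""
    options: List[Tuple[str, Optional[str]]] = []
    for affinity, linkage in product(affinities, linkages):
        if affinity == "none":
            continue  # added once below, since no agglomeration means no linkage
        if linkage == "ward" and affinity != "euclidean":
            continue  # ward is only defined for euclidean affinity
        options.append((affinity, linkage))
    if "none" in affinities:
        options.append(("none", None))
    return options


all_options = agglomeration_options()
```

With both parameters set to 'all', this yields 4 euclidean pairs, 3 manhattan pairs, 3 cosine pairs, and the single 'none' option.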

covariance_type: {'full', 'tied', 'diag', 'spherical', 'all' (default)}, optional

String or list/array describing the type of covariance parameters to use. If a string, it must be one of:

  • 'full'

    each component has its own general covariance matrix

  • 'tied'

    all components share the same general covariance matrix

  • 'diag'

    each component has its own diagonal covariance matrix

  • 'spherical'

    each component has its own single variance

  • 'all'

    considers all covariance structures in ['spherical', 'diag', 'tied', 'full']

If a list/array, it must be a list/array of strings containing only 'spherical', 'tied', 'diag', and/or 'full'.

random_state: int, RandomState instance or None, optional (default=None)

There is randomness in k-means initialization of sklearn.mixture.GaussianMixture. This parameter is passed to GaussianMixture to control the random state. If int, random_state is used as the random number generator seed; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

label_init: array-like, shape (n_samples,), optional (default=None)

List of labels for samples if available. Used to initialize the model. If provided, min_components and max_components must match the number of unique labels given here.

kmeans_n_init: int, optional (default = 1)

If kmeans_n_init is larger than 1 and label_init is None, an additional kmeans_n_init - 1 runs of sklearn.mixture.GaussianMixture initialized with k-means will be performed for all covariance parameters in covariance_type.

max_iter: int, optional (default = 100).

The maximum number of EM iterations to perform.

selection_criteria: str {"bic" or "aic"}, optional (default="bic")

Select the best model based on the Bayesian Information Criterion (bic) or the Akaike Information Criterion (aic).

verbose: int, optional (default = 0)

Enable verbose output. If 1, it prints the current initialization and each iteration step. If greater than 1, it also prints the log probability and the time needed for each step.

max_agglom_size: int or None, optional (default = 2000)

The maximum number of datapoints on which to do agglomerative clustering as the initialization to GMM. If the number of datapoints is larger than this value, a random subset of the data is used for agglomerative initialization. If None, all data is used for agglomerative clustering for initialization.
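A minimal sketch of the subsampling rule, using the standard library's random module in place of whatever sampler the library uses. The function name and the seed argument are hypothetical.

```python
import random
from typing import List, Optional


def agglomeration_subset(
    n_samples: int, max_agglom_size: Optional[int] = 2000, seed: Optional[int] = None
) -> List[int]:
    """Return the indices of the rows used for agglomerative initialization."""
    indices = list(range(n_samples))
    # None, or a dataset small enough, means every row is used.
    if max_agglom_size is None or n_samples <= max_agglom_size:
        return indices
    # Otherwise draw a random subset without replacement.
    return sorted(random.Random(seed).sample(indices, max_agglom_size))


small = agglomeration_subset(100)
large = agglomeration_subset(5000, seed=0)
everything = agglomeration_subset(5000, max_agglom_size=None)
```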

n_jobs: int or None, optional (default = None)

The number of jobs to use for the computation. This works by computing each of the initialization runs in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See https://scikit-learn.org/stable/glossary.html#term-n-jobs for more details.

Attributes:
results_: pandas.DataFrame

Contains exhaustive information about all the clustering runs. Columns are:

'model': GaussianMixture object

GMM clustering fit to the data

'bic/aic': float

Bayesian or Akaike Information Criterion, depending on selection_criteria

'ari': float or nan

Adjusted Rand Index between the GMM classification and the true classification, nan if y is not given

'n_components': int

number of clusters

'affinity': {'euclidean','manhattan','cosine','none'}

affinity used in Agglomerative Clustering

'linkage': {'ward','complete','average','single'}

linkage used in Agglomerative Clustering

'covariance_type': {'full', 'tied', 'diag', 'spherical'}

covariance type used in GMM

'reg_covar': float

regularization used in GMM

criter_: float

the best (lowest) value of the selection criterion (bic/aic)

n_components_: int

number of clusters in the model with the best bic/aic

covariance_type_: str

covariance type in the model with the best bic/aic

affinity_: str

affinity used in the model with the best bic/aic

linkage_: str

linkage used in the model with the best bic/aic

reg_covar_: float

regularization used in the model with the best bic/aic

ari_: float

ARI from the model with the best bic/aic, nan if no y is given

model_: sklearn.mixture.GaussianMixture

object with the best bic/aic

Notes

This algorithm was strongly inspired by mclust, a clustering package in R.

References

[1]

Jeffrey D. Banfield and Adrian E. Raftery. Model-based gaussian and non-gaussian clustering. Biometrics, 49:803–821, 1993.

[2]

Abhijit Dasgupta and Adrian E. Raftery. Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93(441):294–302, 1998.

__init__(min_components=2, max_components=10, affinity='all', linkage='all', covariance_type='all', random_state=None, label_init=None, kmeans_n_init=1, max_iter=100, verbose=0, selection_criteria='bic', max_agglom_size=2000, n_jobs=None)
Parameters:
  • min_components (int) --

  • max_components (int | None) --

  • affinity (str | ndarray | List[str]) --

  • linkage (str | ndarray | List[str]) --

  • covariance_type (str | ndarray | List[str]) --

  • random_state (int | RandomState | None) --

  • label_init (ndarray | List[int] | None) --

  • kmeans_n_init (int) --

  • max_iter (int) --

  • verbose (int) --

  • selection_criteria (str) --

  • max_agglom_size (int | None) --

  • n_jobs (int | None) --

fit(X, y=None)

Fits a Gaussian mixture model to the data. Initializes with agglomerative clustering, then estimates model parameters with the EM algorithm.

Parameters:
X: array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y: array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
self: object

Returns an instance of self.

Parameters:
  • X (ndarray) --

  • y (ndarray | None) --

Return type:

AutoGMMCluster

fit_predict(X, y=None)

Fit the models and predict clusters based on best model.

Parameters:
X: array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y: array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
labels: array, shape (n_samples,)

Component labels.

Parameters:
  • X (ndarray) --

  • y (Any | None) --

Return type:

ndarray

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing: MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params: dict

Parameter names mapped to their values.

predict(X, y=None)

Predict clusters based on best model.

Parameters:
X: array-like, shape (n_samples, n_features)

List of n_features-dimensional data points. Each row corresponds to a single data point.

y: array-like, shape (n_samples,), optional (default=None)

List of labels for X if available. Used to compute ARI scores.

Returns:
labels: array, shape (n_samples,)

Component labels.

Parameters:
  • X (ndarray) --

  • y (Any | None) --

Return type:

ndarray

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:
**params: dict

Estimator parameters.

Returns:
self: estimator instance

Estimator instance.

Hierarchical Clustering

class graspologic.cluster.DivisiveCluster

Recursively clusters data based on a chosen clustering algorithm. This algorithm implements a "divisive" or "top-down" approach.

Parameters:
cluster_method: str {"gmm", "kmeans"}, defaults to "gmm".

The underlying clustering method to apply. If "gmm", AutoGMMCluster is used. If "kmeans", KMeansCluster is used.

min_components: int, defaults to 1.

The minimum number of mixture components/clusters to consider for the first split if cluster_method is "gmm"; it is set to 1 for later splits. If cluster_method is "kmeans", it is set to 2 for all splits.

max_components: int, defaults to 2.

The maximum number of mixture components/clusters to consider at each split.

min_split: int, defaults to 1.

The minimum size a cluster must have to be considered for a further split.

max_level: int, defaults to 4.

The maximum number of times to recursively cluster the data.
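The roles of min_split and max_level can be sketched with a toy recursive splitter on 1-D data. This is only an illustration of the top-down recursion: the real class fits AutoGMMCluster or KMeansCluster at each node and may decline to split, whereas the toy `split_at_largest_gap` below (a hypothetical stand-in) always splits in two.

```python
from typing import List


def split_at_largest_gap(values: List[float]) -> List[int]:
    """Toy two-way splitter for 1-D data: cut at the widest gap between sorted values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    gaps = [values[order[i + 1]] - values[order[i]] for i in range(len(order) - 1)]
    cut = gaps.index(max(gaps)) + 1
    left = set(order[:cut])
    return [0 if i in left else 1 for i in range(len(values))]


def divisive_paths(values: List[float], min_split: int = 1, max_level: int = 4) -> List[List[int]]:
    """Return, for each sample, its list of labels from the first split downwards."""
    paths: List[List[int]] = [[] for _ in values]

    def recurse(indices: List[int], level: int) -> None:
        # Stop at the depth limit or when the cluster is too small to split.
        if level >= max_level or len(indices) < max(min_split, 2):
            return
        sub_labels = split_at_largest_gap([values[i] for i in indices])
        for child in (0, 1):
            child_indices = [i for i, lab in zip(indices, sub_labels) if lab == child]
            for i in child_indices:
                paths[i].append(child)
            recurse(child_indices, level + 1)

    recurse(list(range(len(values))), 0)
    return paths


data = [0.0, 0.1, 5.0, 5.2, 9.0]
one_level = divisive_paths(data, max_level=1)
two_levels = divisive_paths(data, min_split=2, max_level=2)
```

With max_level=1 the data is split once; with max_level=2 each of the two resulting clusters is split again.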

delta_criter: float, non-negative, defaults to 0.

The smallest difference between selection criterion values of a new model and the current model that is required to accept the new model. Applicable only if cluster_method is "gmm".
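As a sketch of this threshold, assuming that lower criterion values are better and that the improvement must strictly exceed delta_criter (both are assumptions, and the function name is hypothetical):

```python
def accept_new_model(criter_current: float, criter_new: float, delta_criter: float = 0.0) -> bool:
    """Accept the new model only if it improves the criterion by more than delta_criter."""
    improvement = criter_current - criter_new  # positive when the new model is better
    return improvement > delta_criter


accepted = accept_new_model(1000.0, 990.0)                      # improvement of 10 > 0
rejected = accept_new_model(1000.0, 990.0, delta_criter=20.0)   # improvement of 10 <= 20
```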

cluster_kws: dict, defaults to {}

Keyword arguments (except min_components and max_components) for chosen clustering method.

Attributes:
model_: GaussianMixture or KMeans object

Fitted clustering object based on which cluster_method was used.

Notes

This class inherits from anytree.node.nodemixin.NodeMixin, a lightweight class for doing various simple operations on trees.

This algorithm was strongly inspired by maggotcluster, a divisive clustering algorithm in https://github.com/neurodata/maggot_models and the algorithm for estimating a hierarchical stochastic block model presented in [2].

References

[1]

Athey, T. L., & Vogelstein, J. T. (2019). AutoGMM: Automatic Gaussian Mixture Modeling in Python. arXiv preprint arXiv:1909.02688.

[2]

Lyzinski, V., Tang, M., Athreya, A., Park, Y., & Priebe, C. E. (2016). Community detection and classification in hierarchical stochastic blockmodels. IEEE Transactions on Network Science and Engineering, 4(1), 13-26.

__init__(cluster_method='gmm', min_components=1, max_components=2, cluster_kws={}, min_split=1, max_level=4, delta_criter=0)
Parameters:
  • cluster_method (typing_extensions.Literal[gmm, kmeans]) --

  • min_components (int) --

  • max_components (int) --

  • cluster_kws (Dict[str, Any]) --

  • min_split (int) --

  • max_level (int) --

  • delta_criter (float) --

fit(X)

Fits clustering models to the data as well as to the resulting clusters.

Parameters:
X: array-like, shape (n_samples, n_features)
Returns:
self: object

Returns an instance of self.

Parameters:

X (ndarray) --

Return type:

DivisiveCluster

fit_predict(X, fcluster=False, level=None)

Fits clustering models to the data as well as to the resulting clusters, then uses the fitted models to predict a hierarchy of labels.

Parameters:
X: array-like, shape (n_samples, n_features)
fcluster: bool, default=False

If True, the returned labels are renumbered so that each column of labels represents a flat clustering at the current level, and each label corresponds to a cluster indexed the same as the corresponding node in the overall clustering dendrogram.

level: int, optional (default=None)

The level of a single flat clustering to generate. Only available if fcluster is True.

Returns:
labels: array-like, shape (n_samples, n_levels)

If no level is specified; otherwise, shape (n_samples,).
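One plausible way to turn hierarchical labels into flat clusterings is to number the distinct label prefixes at each level. The sketch below illustrates that idea only; it is not graspologic's relabelling code, `flatten_hierarchy` is a hypothetical name, and its `level` argument is 0-based by assumption.

```python
from typing import List, Optional, Sequence, Union


def flatten_hierarchy(
    paths: Sequence[Sequence[int]], level: Optional[int] = None
) -> Union[List[int], List[List[int]]]:
    """Renumber per-split labels so each level is a flat clustering.

    paths[i] holds the labels sample i received at successive splits.
    """

    def flat_at(lvl: int) -> List[int]:
        # Samples share a flat label when their label prefixes up to lvl agree.
        prefixes = [tuple(path[: lvl + 1]) for path in paths]
        index = {prefix: i for i, prefix in enumerate(sorted(set(prefixes)))}
        return [index[prefix] for prefix in prefixes]

    if level is not None:
        return flat_at(level)  # shape (n_samples,)
    n_levels = max(len(path) for path in paths)
    columns = [flat_at(lvl) for lvl in range(n_levels)]
    return [list(row) for row in zip(*columns)]  # shape (n_samples, n_levels)


paths = [[0, 0], [0, 1], [1, 0], [1, 0], [1, 1]]
all_levels = flatten_hierarchy(paths)
deepest = flatten_hierarchy(paths, level=1)
```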

Parameters:
  • X (ndarray) --

  • fcluster (bool) --

  • level (int | None) --

Return type:

ndarray

predict(X, fcluster=False, level=None)

Predicts a hierarchy of labels based on the fitted models.

Parameters:
X: array-like, shape (n_samples, n_features)
fcluster: bool, default=False

If True, the returned labels are renumbered so that each column of labels represents a flat clustering at the current level, and each label corresponds to a cluster indexed the same as the corresponding node in the overall clustering dendrogram.

level: int, optional (default=None)

The level of a single flat clustering to generate. Only available if fcluster is True.

Returns:
labels: array-like, shape (n_samples, n_levels)

If no level is specified; otherwise, shape (n_samples,).

Parameters:
  • X (ndarray) --

  • fcluster (bool) --

  • level (int | None) --

Return type:

ndarray

set_predict_request(*, fcluster='$UNCHANGED$', level='$UNCHANGED$')

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
fcluster: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for fcluster parameter in predict.

level: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for level parameter in predict.

Returns:
self: object

The updated object.

Return type:

DivisiveCluster