Clustering

K-Means Clustering

class graspologic.cluster.KMeansCluster

KMeans Cluster.

It fits a model for every number of clusters from 2 to max_clusters. When the true labels are known, the best model is the one with the highest adjusted Rand index (ARI). Otherwise, the best model is the one with the highest silhouette score.

Parameters:

max_clusters (int, default=2): The maximum number of clusters to consider. Must be >= 2.
random_state (int, RandomState instance or None, default=None): If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

Attributes:

n_clusters_ (int): Optimal number of clusters. If y is given, it is the number with the largest ARI. Otherwise, it is the number with the highest silhouette score.
model_ (KMeans object): KMeans object fitted with n_clusters_ clusters.
silhouette_ (list): Silhouette scores computed for each number of clusters in range(2, max_clusters + 1).
ari_ (list): Only computed when y is given. ARI values computed for each number of clusters in range(2, max_clusters + 1).
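
The selection rule used when y is not given can be sketched with scikit-learn directly. This is a minimal illustration, not graspologic's implementation: the data come from make_blobs, the inclusive upper bound on the candidate range is an assumption of the sketch, and the names candidates, silhouettes and best_k are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=150,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

max_clusters = 5  # hypothetical upper bound for this sketch
candidates = list(range(2, max_clusters + 1))  # assumed inclusive of max_clusters

# Fit one k-means model per candidate and score it with the silhouette.
silhouettes = []
for k in candidates:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes.append(silhouette_score(X, labels))

# The candidate with the highest silhouette score wins.
best_k = candidates[int(np.argmax(silhouettes))]
print(best_k)
```

When labels y are available, the same loop could score each candidate with the ARI instead of the silhouette.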

__init__(max_clusters=2, random_state=None)

Parameters:

max_clusters (int)
random_state (int | RandomState | None)

fit(X, y=None)

Fits the k-means models to the data.

Parameters:

X (array-like, shape (n_samples, n_features)): List of n_features-dimensional data points. Each row corresponds to a single data point.
y (array-like, shape (n_samples,), default=None): List of labels for X if available. Used to compute ARI scores.

Returns:

self

Parameters:

X (ndarray)
y (ndarray | None)

Return type:

KMeansCluster
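
The ARI used to score a fit against y compares groupings rather than label names. The short sketch below uses scikit-learn's adjusted_rand_score on made-up label vectors (y_true, y_perm and y_diff are hypothetical) to show that relabelling the clusters does not change the score.

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_perm = [2, 2, 0, 0, 1, 1]  # same grouping, different label names
y_diff = [0, 1, 0, 1, 0, 1]  # a grouping unrelated to y_true

ari_perm = adjusted_rand_score(y_true, y_perm)  # perfect agreement
ari_diff = adjusted_rand_score(y_true, y_diff)  # poor agreement
print(ari_perm, ari_diff)
```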

fit_predict(X, y=None)

Fit the models and predict clusters based on the best model.

Parameters:

X (array-like, shape (n_samples, n_features)): List of n_features-dimensional data points. Each row corresponds to a single data point.
y (array-like, shape (n_samples,), default=None): List of labels for X if available. Used to compute ARI scores.

Returns:

labels (array, shape (n_samples,)): Component labels.

Parameters:

X (ndarray)
y (Any | None)

Return type:

ndarray

get_metadata_routing()

Get metadata routing of this object.

Please check the scikit-learn User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest): A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict): Parameter names mapped to their values.

predict(X, y=None)

Predict clusters based on the best model.

Parameters:

X (array-like, shape (n_samples, n_features)): List of n_features-dimensional data points. Each row corresponds to a single data point.
y (array-like, shape (n_samples,), default=None): List of labels for X if available. Used to compute ARI scores.

Returns:

labels (array, shape (n_samples,)): Component labels.

Parameters:

X (ndarray)
y (Any | None)

Return type:

ndarray

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:

**params (dict): Estimator parameters.

Returns:

self (estimator instance): Estimator instance.
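
The <component>__<parameter> form is easiest to see on a small scikit-learn Pipeline. The sketch below is illustrative only: the step names "scale" and "km" are hypothetical, and a plain KMeans estimator stands in for any nested component.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A nested object: the pipeline contains two named components.
pipe = Pipeline([("scale", StandardScaler()), ("km", KMeans(n_clusters=2))])

# "<component>__<parameter>" reaches into the "km" step.
pipe.set_params(km__n_clusters=4)

# get_params exposes the same nested names.
n_clusters = pipe.get_params()["km__n_clusters"]
print(n_clusters)
```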

Gaussian Mixture Models Clustering

class graspologic.cluster.GaussianCluster

Gaussian Mixture Model (GMM)

Representation of a Gaussian mixture model probability distribution. This class estimates the parameters of a Gaussian mixture distribution. It fits a model for every number of components in the range set by min_components and max_components, and for every requested covariance structure. The best model is the one with the lowest BIC score.

Parameters:

min_components (int, default=2)

The minimum number of mixture components to consider (unless max_components is None, in which case this is the maximum number of components to consider). If max_components is not None, min_components must be less than or equal to max_components.

max_components (int or None, default=None)

The maximum number of mixture components to consider. Must be greater than or equal to min_components.

covariance_type ({'all', 'full', 'tied', 'diag', 'spherical'}, default='all')

String or list/array describing the type of covariance parameters to use. If a string, it must be one of:

'all' : considers all covariance structures in ['spherical', 'diag', 'tied', 'full']
'full' : each component has its own general covariance matrix
'tied' : all components share the same general covariance matrix
'diag' : each component has its own diagonal covariance matrix
'spherical' : each component has its own single variance

If a list/array, it must be a list/array of strings containing only: 'spherical', 'tied', 'diag', and/or 'full'.
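
To make the four covariance structures concrete, the sketch below fits scikit-learn's GaussianMixture once per covariance type and records the shape of the fitted covariances_ array. The data and the shapes dictionary are illustrative assumptions; GaussianCluster itself is not used here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two separated groups in three dimensions (illustrative data).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(-5, 1, size=(100, 3)),
    rng.normal(5, 1, size=(100, 3)),
])

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gm = GaussianMixture(n_components=2, covariance_type=cov_type, random_state=0)
    gm.fit(X)
    shapes[cov_type] = gm.covariances_.shape

# full:      one (3, 3) matrix per component
# tied:      a single (3, 3) matrix shared by all components
# diag:      one length-3 vector of variances per component
# spherical: one scalar variance per component
print(shapes)
```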

tol (float, default=1e-3)

The convergence threshold. EM iterations will stop when the lower bound average gain is below this threshold.

reg_covar (float, default=1e-6)

Non-negative regularization added to the diagonal of the covariance matrices. Ensures that the covariance matrices are all positive.
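
The effect of this kind of diagonal regularization can be checked with NumPy alone. The sketch below is an illustration under stated assumptions, not the library's code: it builds a singular covariance matrix from two perfectly correlated features and shows that adding reg_covar to the diagonal pushes the smallest eigenvalue above zero.

```python
import numpy as np

# Two perfectly correlated features give a singular covariance matrix.
x = np.arange(10, dtype=float)
X = np.column_stack([x, 2 * x])
cov = np.cov(X, rowvar=False)

reg_covar = 1e-6
cov_reg = cov + reg_covar * np.eye(cov.shape[0])  # add to the diagonal only

min_eig = np.linalg.eigvalsh(cov).min()          # numerically zero
min_eig_reg = np.linalg.eigvalsh(cov_reg).min()  # strictly positive
print(min_eig, min_eig_reg)
```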

max_iter (int, default=100)

The number of EM iterations to perform.

n_init (int, default=1)

The number of initializations to perform. The best results are kept.

init_params ({'kmeans', 'random'}, default='kmeans')

The method used to initialize the weights, the means and the precisions. Must be one of:

'kmeans' : responsibilities are initialized using kmeans.
'random' : responsibilities are initialized randomly.

random_state (int, RandomState instance or None, default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Attributes:

n_components_ (int): Optimal number of components based on BIC.
covariance_type_ (str): Optimal covariance type based on BIC.
model_ (GaussianMixture object): GaussianMixture object fitted with the optimal number of components and the optimal covariance structure.
bic_ (pandas.DataFrame): BIC values computed for each number of components in range(min_components, max_components + 1) and each covariance structure given by covariance_type.
ari_ (pandas.DataFrame): Only computed when y is given. ARI values computed for each number of components in range(min_components, max_components + 1) and each covariance structure given by covariance_type.
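
A table like bic_ can be reproduced in spirit with scikit-learn and pandas. The following is a minimal sketch, not GaussianCluster's implementation: the data, the component range, and the names bic, best_cov and best_k are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Two separated groups in two dimensions (illustrative data).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(-6, 1, size=(150, 2)),
    rng.normal(6, 1, size=(150, 2)),
])

min_components, max_components = 1, 4  # hypothetical search range
cov_types = ["spherical", "diag", "tied", "full"]
ks = list(range(min_components, max_components + 1))

# Rows: number of components. Columns: covariance structure.
bic = pd.DataFrame(index=ks, columns=cov_types, dtype=float)
for k in ks:
    for cov_type in cov_types:
        gm = GaussianMixture(n_components=k, covariance_type=cov_type, random_state=0)
        bic.loc[k, cov_type] = gm.fit(X).bic(X)

# Lowest BIC over the whole grid picks both the structure and the size.
best_cov = bic.min(axis=0).idxmin()
best_k = bic[best_cov].idxmin()
print(bic.round(1))
print(best_k, best_cov)
```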

__init__(min_components=2, max_components=None, covariance_type='all', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', random_state=None)

Parameters:

min_components (int)
max_components (int | None)
covariance_type (typing_extensions.Literal[all, spherical, diag, tied, full] | List[typing_extensions.Literal[spherical, diag, tied, full]] | ndarray)
tol (float)
reg_covar (float)
max_iter (int)
n_init (int)
init_params (typing_extensions.Literal[random, kmeans])
random_state (int | RandomState | None)

fit(X, y=None)

Fits the Gaussian mixture models to the data. Model parameters are estimated with the EM algorithm.

Parameters:

X (array-like, shape (n_samples, n_features)): List of n_features-dimensional data points. Each row corresponds to a single data point.
y (array-like, shape (n_samples,), default=None): List of labels for X if available. Used to compute ARI scores.

Returns:

self

Parameters:

X (ndarray)
y (ndarray | None)

Return type:

GaussianCluster

fit_predict(X, y=None)

Fit the models and predict clusters based on the best model.

Parameters:

X (array-like, shape (n_samples, n_features)): List of n_features-dimensional data points. Each row corresponds to a single data point.
y (array-like, shape (n_samples,), default=None): List of labels for X if available. Used to compute ARI scores.

Returns:

labels (array, shape (n_samples,)): Component labels.

Parameters:

X (ndarray)
y (Any | None)

Return type:

ndarray

get_metadata_routing()

Get metadata routing of this object.

Please check the scikit-learn User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest): A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict): Parameter names mapped to their values.

predict(X, y=None)

Predict clusters based on the best model.

Parameters:

X (array-like, shape (n_samples, n_features)): List of n_features-dimensional data points. Each row corresponds to a single data point.
y (array-like, shape (n_samples,), default=None): List of labels for X if available. Used to compute ARI scores.

Returns:

labels (array, shape (n_samples,)): Component labels.

Parameters:

X (ndarray)
y (Any | None)

Return type:

ndarray

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:

**params (dict): Estimator parameters.

Returns:

self (estimator instance): Estimator instance.

class graspologic.cluster.AutoGMMCluster

Automatic Gaussian Mixture Model (GMM) selection.

Clustering algorithm that runs hierarchical agglomerative clustering and then fits a Gaussian mixture model (GMM) initialized from it. Different combinations of agglomeration, GMM, and cluster numbers are tried, and the clustering with the best selection criterion (bic/aic) is chosen.

Parameters:

min_components (int, default=2)

The minimum number of mixture components to consider (unless max_components is None, in which case this is the maximum number of components to consider). If max_components is not None, min_components must be less than or equal to max_components. If label_init is given, min_components must match the number of unique labels in label_init.

max_components (int or None, default=10)

The maximum number of mixture components to consider. Must be greater than or equal to min_components. If label_init is given, max_components must match the number of unique labels in label_init.

affinity ({'euclidean', 'manhattan', 'cosine', 'none', 'all'}, default='all')

String or list/array describing the type of affinities to use in agglomeration. If a string, it must be one of:

'euclidean' : L2 norm
'manhattan' : L1 norm
'cosine' : cosine similarity
'none' : no agglomeration - GMM is initialized with k-means
'all' : considers all affinities in ['euclidean', 'manhattan', 'cosine', 'none']

If a list/array, it must be a list/array of strings containing only 'euclidean', 'manhattan', 'cosine', and/or 'none'.

Note that cosine similarity only works when no row is the zero vector. If the input matrix has a zero row, cosine similarity is skipped and a warning is issued.
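
The zero-row condition is simple to test before fitting. The sketch below is a hypothetical pre-check written with NumPy, not code taken from AutoGMMCluster: it detects all-zero rows (whose norm is zero, so cosine similarity is undefined) and drops 'cosine' from a candidate list.

```python
import numpy as np

X = np.array([
    [1.0, 0.0],
    [0.0, 0.0],  # zero vector: cosine similarity is undefined for this row
    [0.5, 2.0],
])

# True if any row consists entirely of zeros.
has_zero_row = bool(np.any(np.all(X == 0, axis=1)))

affinities = ["euclidean", "manhattan", "cosine", "none"]
if has_zero_row:
    # Mirror the documented behaviour by skipping cosine.
    affinities = [a for a in affinities if a != "cosine"]
print(has_zero_row, affinities)
```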

linkage ({'ward', 'complete', 'average', 'single', 'all'}, default='all')

String or list/array describing the type of linkages to use in agglomeration. If a string, it must be one of:

'ward' : Ward's clustering, can only be used with euclidean affinity
'complete' : complete linkage
'average' : average linkage
'single' : single linkage
'all' : considers all linkages in ['ward', 'complete', 'average', 'single']

If a list/array, it must be a list/array of strings containing only 'ward', 'complete', 'average', and/or 'single'.
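
The constraints stated for affinity and linkage limit which pairs are meaningful. The sketch below enumerates them in plain Python under two assumptions drawn from the descriptions above: 'ward' pairs only with 'euclidean', and 'none' involves no agglomeration, so no linkage applies. How AutoGMMCluster enumerates its runs internally may differ.

```python
from itertools import product

affinities = ["euclidean", "manhattan", "cosine", "none"]
linkages = ["ward", "complete", "average", "single"]

combos = []
for affinity, linkage in product(affinities, linkages):
    if affinity == "none":
        continue  # no agglomeration, so a linkage does not apply
    if linkage == "ward" and affinity != "euclidean":
        continue  # Ward's clustering requires euclidean affinity
    combos.append((affinity, linkage))

# The k-means-initialized option is counted once, without a linkage.
combos.append(("none", None))

for combo in combos:
    print(combo)
```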

covariance_type ({'full', 'tied', 'diag', 'spherical', 'all'}, default='all')

String or list/array describing the type of covariance parameters to use. If a string, it must be one of:

'full' : each component has its own general covariance matrix
'tied' : all components share the same general covariance matrix
'diag' : each component has its own diagonal covariance matrix
'spherical' : each component has its own single variance
'all' : considers all covariance structures in ['spherical', 'diag', 'tied', 'full']

If a list/array, it must be a list/array of strings containing only 'spherical', 'tied', 'diag', and/or 'full'.

random_state (int, RandomState instance or None, default=None)

There is randomness in k-means initialization of sklearn.mixture.GaussianMixture. This parameter is passed to GaussianMixture to control the random state. If int, random_state is used as the random number generator seed; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

label_init (array-like, shape (n_samples,), default=None)

List of labels for samples if available. Used to initialize the model. If provided, min_components and max_components must match the number of unique labels given here.

kmeans_n_init (int, default=1)

If kmeans_n_init is larger than 1 and label_init is None, additional kmeans_n_init-1 runs of sklearn.mixture.GaussianMixture initialized with k-means will be performed for all covariance parameters in covariance_type.

max_iter (int, default=100)

The maximum number of EM iterations to perform.

selection_criteria (str, {"bic", "aic"}, default="bic")

Select the best model based on the Bayesian Information Criterion (bic) or the Akaike Information Criterion (aic).
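
The two criteria penalize model size differently, so they can disagree. The sketch below uses the standard definitions (BIC = -2 log L + k ln n, AIC = -2 log L + 2k, lower is better) on made-up numbers. The helper functions and the two candidate models are hypothetical and only illustrate the trade-off.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: penalty grows with ln(n_samples)."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: fixed penalty of 2 per parameter."""
    return -2.0 * log_likelihood + 2.0 * n_params

n_samples = 500
# Model A: fewer parameters, slightly worse fit.
bic_a, aic_a = bic(-1200.0, 5, n_samples), aic(-1200.0, 5)
# Model B: more parameters, slightly better fit.
bic_b, aic_b = bic(-1190.0, 11, n_samples), aic(-1190.0, 11)

print("BIC prefers", "A" if bic_a < bic_b else "B")
print("AIC prefers", "A" if aic_a < aic_b else "B")
```

With these numbers BIC favors the smaller model while AIC favors the larger one, because ln(500) is greater than 2.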

verbose (int, default=0)

Enable verbose output. If 1, it prints the current initialization and each iteration step. If greater than 1, it also prints the log probability and the time needed for each step.

max_agglom_size (int or None, default=2000)

The maximum number of datapoints on which to do agglomerative clustering as the initialization to GMM. If the number of datapoints is larger than this value, a random subset of the data is used for agglomerative initialization. If None, all data is used for agglomerative clustering for initialization.
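
The subsampling rule can be sketched in a few lines of NumPy. This is an illustration of the described behavior, not the library's code; the array sizes and the names idx and X_sub are assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(5000, 2))  # more rows than max_agglom_size
max_agglom_size = 2000

if max_agglom_size is not None and X.shape[0] > max_agglom_size:
    # Draw a random subset of rows without replacement.
    idx = rng.choice(X.shape[0], size=max_agglom_size, replace=False)
    X_sub = X[idx]
else:
    # None, or a small enough dataset: use every row.
    idx = np.arange(X.shape[0])
    X_sub = X

print(X_sub.shape)
```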

n_jobs (int or None, default=None)

The number of jobs to use for the computation. This works by computing each of the initialization runs in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See https://scikit-learn.org/stable/glossary.html#term-n-jobs for more details.

Attributes:

results_ (pandas.DataFrame)

Contains exhaustive information about all the clustering runs. Columns are:

'model' (GaussianMixture object): GMM clustering fit to the data
'bic/aic' (float): Bayesian or Akaike Information Criterion, depending on selection_criteria
'ari' (float or nan): Adjusted Rand Index between the GMM classification and the true classification, nan if y is not given
'n_components' (int): number of clusters
'affinity' ({'euclidean', 'manhattan', 'cosine', 'none'}): affinity used in Agglomerative Clustering
'linkage' ({'ward', 'complete', 'average', 'single'}): linkage used in Agglomerative Clustering
'covariance_type' ({'full', 'tied', 'diag', 'spherical'}): covariance type used in GMM
'reg_covar' (float): regularization used in GMM

criter_

the best (lowest) value of the selection criterion (bic/aic)

n_components_ (int)

number of clusters in the model with the best bic/aic

covariance_type_ (str)

covariance type in the model with the best bic/aic

affinity_ (str)

affinity used in the model with the best bic/aic

linkage_ (str)

linkage used in the model with the best bic/aic

reg_covar_ (float)

regularization used in the model with the best bic/aic

ari_ (float)

ARI from the model with the best bic/aic, nan if no y is given

model_ (sklearn.mixture.GaussianMixture)

object with the best bic/aic

Hierarchical Clustering

class graspologic.cluster.DivisiveCluster

Recursively clusters data based on a chosen clustering algorithm. This algorithm implements a "divisive" or "top-down" approach.

Parameters:

cluster_method (str, {"gmm", "kmeans"}, default="gmm"): The underlying clustering method to apply. If "gmm", AutoGMMCluster is used. If "kmeans", KMeansCluster is used.
min_components (int, default=1): The minimum number of mixture components/clusters to consider for the first split if cluster_method is "gmm"; it is set to 1 for later splits. If cluster_method is "kmeans", it is set to 2 for all splits.
max_components (int, default=2): The maximum number of mixture components/clusters to consider at each split.
min_split (int, default=1): The minimum size a cluster must have to be considered for another split.
max_level (int, default=4): The maximum number of times to recursively cluster the data.
delta_criter (float, non-negative, default=0): The smallest difference between the selection criterion values of a new model and the current model that is required to accept the new model. Applicable only if cluster_method is "gmm".
cluster_kws (dict, default={}): Keyword arguments (except min_components and max_components) for the chosen clustering method.

Attributes:

model_ (GaussianMixture or KMeans object): Fitted clustering object, depending on which cluster_method was used.
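
The top-down idea can be sketched as a short recursion. The code below is a simplified stand-in, not DivisiveCluster's implementation: it always splits in two with scikit-learn's KMeans, ignores delta_criter, and encodes each point's position in the hierarchy as a tuple of split choices. The function divisive_labels and its defaults are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def divisive_labels(X, max_level=2, min_split=4, level=0, random_state=0):
    """Return one path per row, e.g. (0, 1) = first split 0, then split 1."""
    n = X.shape[0]
    # Stop at the depth limit or when the cluster is too small to split.
    if level >= max_level or n < max(min_split, 2):
        return [()] * n
    split = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(X)
    paths = [None] * n
    for c in (0, 1):
        idx = np.flatnonzero(split == c)
        # Recurse into each child cluster and prepend this split's label.
        sub = divisive_labels(X[idx], max_level, min_split, level + 1, random_state)
        for i, tail in zip(idx, sub):
            paths[i] = (c,) + tail
    return paths

# Four tight groups at the corners of a square (illustrative data).
X, _ = make_blobs(
    n_samples=200,
    centers=[[-10, -10], [-10, 10], [10, -10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)
paths = divisive_labels(X, max_level=2)
print(sorted(set(paths)))
```

Each additional level refines the clusters found at the previous one, which is what distinguishes this approach from fitting a single flat model with more components.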