Clustering¶
K-Means Clustering¶
- class graspologic.cluster.KMeansCluster[source]¶
- ari_: List[float] | None¶
KMeans Cluster.
It computes all possible models from two clusters to max_clusters. When the true labels are known, the best model is the one with the highest adjusted Rand index (ARI). Otherwise, the best model is the one with the highest silhouette score.
- Parameters:
- max_clustersint, default=2.
The maximum number of clusters to consider. Must be >=2.
- random_stateint, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- Attributes:
- n_clusters_int
Optimal number of clusters. If y is given, it is based on largest ARI. Otherwise, it is based on highest silhouette score.
- model_KMeans object
KMeans object fitted with n_clusters_.
- silhouette_list
List of silhouette scores computed for all possible number of clusters given by range(2, max_clusters).
- ari_list
Only computed when y is given. List of ARI values computed for all possible number of clusters given by range(2, max_clusters).
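The silhouette-based selection described above can be sketched in plain Python. This is a minimal illustration, not graspologic's implementation: the helper `silhouette_score`, the toy `points`, and the hand-written `candidates` labelings are all assumptions made for the example (in the library, the labelings would come from fitted KMeans models).

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the other members of every cluster.
        mean_dist = {}
        for c in set(labels):
            others = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if others:
                mean_dist[c] = sum(math.dist(p, q) for q in others) / len(others)
        own = labels[i]
        if own not in mean_dist:
            # Singleton cluster: the usual convention assigns a score of 0.
            scores.append(0.0)
            continue
        a = mean_dist[own]  # cohesion: mean distance within the own cluster
        b = min(d for c, d in mean_dist.items() if c != own)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: two well-separated groups on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]

# Hypothetical labelings for k=2 and k=3 (hand-written, not produced by KMeans).
candidates = {
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
}

silhouettes = {k: silhouette_score(points, lab) for k, lab in candidates.items()}
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

The number of clusters with the highest mean silhouette plays the role of n_clusters_ when y is not given.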
- fit(X, y=None)[source]¶
Fits kmeans model to the data.
- Parameters:
- Xarray-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- yarray-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
- Returns:
- self
- Parameters:
X (ndarray) --
y (ndarray | None) --
- Return type:
- fit_predict(X, y=None)¶
Fit the models and predict clusters based on best model.
- Parameters:
- Xarray-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- yarray-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
- Returns:
- labelsarray, shape (n_samples,)
Component labels.
- Parameters:
X (ndarray) --
y (Any | None) --
- Return type:
ndarray
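For readers unfamiliar with the ARI scores that y enables, the following is a small stdlib-only sketch of the adjusted Rand index computed from pair counts. The function name `adjusted_rand_index` is hypothetical; it illustrates the standard formula and is not the routine the library calls.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency counts of two labelings."""
    n = len(labels_true)
    # Number of samples in each (true label, predicted label) cell.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case (e.g. both labelings put everything in one cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions up to label names give a perfect score.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that cuts across the true one scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, crossed)
```

Because the index is invariant to label permutation, predicted cluster labels do not need to match the numbering used in y.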
- get_metadata_routing()¶
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routingMetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters:
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- paramsdict
Parameter names mapped to their values.
- predict(X, y=None)¶
Predict clusters based on best model.
- Parameters:
- Xarray-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- yarray-like, shape (n_samples, ), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
- Returns:
- labelsarray, shape (n_samples,)
Component labels.
- Parameters:
X (ndarray) --
y (Any | None) --
- Return type:
ndarray
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **paramsdict
Estimator parameters.
- Returns:
- selfestimator instance
Estimator instance.
Gaussian Mixture Models Clustering¶
- class graspologic.cluster.GaussianCluster[source]¶
Gaussian Mixture Model (GMM)
Representation of a Gaussian mixture model probability distribution. This class estimates the parameters of a Gaussian mixture distribution. It computes all possible models from min_components to max_components. The best model is the one with the lowest BIC score.
- Parameters:
- min_componentsint, default=2.
The minimum number of mixture components to consider (unless max_components is None, in which case this is the maximum number of components to consider). If max_components is not None, min_components must be less than or equal to max_components.
- max_componentsint or None, default=None.
The maximum number of mixture components to consider. Must be greater than or equal to min_components.
- covariance_type{'all' (default), 'full', 'tied', 'diag', 'spherical'}, optional
String or list/array describing the type of covariance parameters to use. If a string, it must be one of:
- 'all'
considers all covariance structures in ['spherical', 'diag', 'tied', 'full']
- 'full'
each component has its own general covariance matrix
- 'tied'
all components share the same general covariance matrix
- 'diag'
each component has its own diagonal covariance matrix
- 'spherical'
each component has its own single variance
- If a list/array, it must be a list/array of strings containing only
'spherical', 'tied', 'diag', and/or 'full'.
- tolfloat, defaults to 1e-3.
The convergence threshold. EM iterations will stop when the lower bound average gain is below this threshold.
- reg_covarfloat, defaults to 1e-6.
Non-negative regularization added to the diagonal of covariance. Ensures that the covariance matrices are all positive.
- max_iterint, defaults to 100.
The number of EM iterations to perform.
- n_initint, defaults to 1.
The number of initializations to perform. The best results are kept.
- init_params{'kmeans', 'random'}, defaults to 'kmeans'.
The method used to initialize the weights, the means and the precisions. Must be one of:
- 'kmeans'
responsibilities are initialized using kmeans
- 'random'
responsibilities are initialized randomly
- random_stateint, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- Attributes:
- n_components_int
Optimal number of components based on BIC.
- covariance_type_str
Optimal covariance type based on BIC.
- model_GaussianMixture object
GaussianMixture object fitted with the optimal number of components and optimal covariance structure.
- bic_pandas.DataFrame
A pandas DataFrame of BIC values computed for all possible number of clusters given by range(min_components, max_components + 1) and all covariance structures given by covariance_type.
- ari_pandas.DataFrame
Only computed when y is given. A pandas DataFrame containing ARI values computed for all possible number of clusters given by range(min_components, max_components + 1) and all covariance structures given by covariance_type.
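As a rough illustration of how a BIC table over component counts and covariance structures leads to a choice of model, the sketch below computes BIC from a log-likelihood and a free-parameter count, then takes the minimum. The helper names and the log-likelihood values are made up for the example; in practice those values come from fitted GaussianMixture models.

```python
import math


def gmm_n_parameters(n_components, n_features, covariance_type):
    """Free parameters of a GMM: covariances + means + mixing weights."""
    k, d = n_components, n_features
    cov_params = {
        "full": k * d * (d + 1) // 2,   # one symmetric matrix per component
        "diag": k * d,                  # one diagonal per component
        "tied": d * (d + 1) // 2,       # one shared symmetric matrix
        "spherical": k,                 # one variance per component
    }[covariance_type]
    return cov_params + k * d + (k - 1)


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


n_samples, n_features = 100, 2

# Made-up total log-likelihoods, for illustration only.
log_likelihoods = {
    (2, "full"): -100.0,
    (2, "spherical"): -101.0,
    (3, "full"): -99.5,
}

scores = {
    (k, cov): bic(ll, gmm_n_parameters(k, n_features, cov), n_samples)
    for (k, cov), ll in log_likelihoods.items()
}
best = min(scores, key=scores.get)
print(best)
```

In this toy table the simpler spherical model wins because its slightly lower likelihood is outweighed by the smaller parameter penalty, which is the trade-off that n_components_ and covariance_type_ reflect.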
- __init__(min_components=2, max_components=None, covariance_type='all', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', random_state=None)[source]¶
- Parameters:
min_components (int) --
max_components (int | None) --
covariance_type (typing_extensions.Literal[all, spherical, diag, tied, full] | List[typing_extensions.Literal[spherical, diag, tied, full]] | ndarray) --
tol (float) --
reg_covar (float) --
max_iter (int) --
n_init (int) --
init_params (typing_extensions.Literal[random, kmeans]) --
random_state (int | RandomState | None) --
- fit(X, y=None)[source]¶
Fits Gaussian mixture model to the data. Estimates model parameters with the EM algorithm.
- Parameters:
- Xarray-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- yarray-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
- Returns:
- self
- Parameters:
X (ndarray) --
y (ndarray | None) --
- Return type:
- fit_predict(X, y=None)¶
Fit the models and predict clusters based on best model.
- Parameters:
- Xarray-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- yarray-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
- Returns:
- labelsarray, shape (n_samples,)
Component labels.
- Parameters:
X (ndarray) --
y (Any | None) --
- Return type:
ndarray
- get_metadata_routing()¶
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routingMetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters:
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- paramsdict
Parameter names mapped to their values.
- predict(X, y=None)¶
Predict clusters based on best model.
- Parameters:
- Xarray-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- yarray-like, shape (n_samples, ), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
- Returns:
- labelsarray, shape (n_samples,)
Component labels.
- Parameters:
X (ndarray) --
y (Any | None) --
- Return type:
ndarray
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **paramsdict
Estimator parameters.
- Returns:
- selfestimator instance
Estimator instance.
- class graspologic.cluster.AutoGMMCluster[source]¶
Automatic Gaussian Mixture Model (GMM) selection.
Clustering algorithm that uses hierarchical agglomerative clustering followed by Gaussian mixture model (GMM) fitting. Different combinations of agglomeration, GMM, and cluster numbers are tried, and the clustering with the best selection criterion (bic/aic) is chosen.
- Parameters:
- min_componentsint, default=2.
The minimum number of mixture components to consider (unless max_components is None, in which case this is the maximum number of components to consider). If max_components is not None, min_components must be less than or equal to max_components. If label_init is given, min_components must match the number of unique labels in label_init.
- max_componentsint or None, default=10.
The maximum number of mixture components to consider. Must be greater than or equal to min_components. If label_init is given, max_components must match the number of unique labels in label_init.
- affinity{'euclidean','manhattan','cosine','none', 'all' (default)}, optional
String or list/array describing the type of affinities to use in agglomeration. If a string, it must be one of:
- 'euclidean'
L2 norm
- 'manhattan'
L1 norm
- 'cosine'
cosine similarity
- 'none'
no agglomeration - GMM is initialized with k-means
- 'all'
considers all affinities in ['euclidean','manhattan','cosine','none']
If a list/array, it must be a list/array of strings containing only 'euclidean', 'manhattan', 'cosine', and/or 'none'.
Note that cosine similarity only works when no row is the zero vector. If the input matrix has a zero row, cosine similarity will be skipped and a warning will be issued.
- linkage{'ward','complete','average','single', 'all' (default)}, optional
String or list/array describing the type of linkages to use in agglomeration. If a string, it must be one of:
- 'ward'
ward's clustering, can only be used with euclidean affinity
- 'complete'
complete linkage
- 'average'
average linkage
- 'single'
single linkage
- 'all'
considers all linkages in ['ward','complete','average','single']
If a list/array, it must be a list/array of strings containing only 'ward', 'complete', 'average', and/or 'single'.
- covariance_type{'full', 'tied', 'diag', 'spherical', 'all' (default)} , optional
String or list/array describing the type of covariance parameters to use. If a string, it must be one of:
- 'full'
each component has its own general covariance matrix
- 'tied'
all components share the same general covariance matrix
- 'diag'
each component has its own diagonal covariance matrix
- 'spherical'
each component has its own single variance
- 'all'
considers all covariance structures in ['spherical', 'diag', 'tied', 'full']
If a list/array, it must be a list/array of strings containing only 'spherical', 'tied', 'diag', and/or 'full'.
- random_stateint, RandomState instance or None, optional (default=None)
There is randomness in the k-means initialization of sklearn.mixture.GaussianMixture. This parameter is passed to GaussianMixture to control the random state. If int, random_state is used as the random number generator seed; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- label_initarray-like, shape (n_samples,), optional (default=None)
List of labels for samples if available. Used to initialize the model. If provided, min_components and max_components must match the number of unique labels given here.
- kmeans_n_initint, optional (default = 1)
If kmeans_n_init is larger than 1 and label_init is None, additional kmeans_n_init-1 runs of sklearn.mixture.GaussianMixture initialized with k-means will be performed for all covariance parameters in covariance_type.
- max_iterint, optional (default = 100).
The maximum number of EM iterations to perform.
- selection_criteriastr {"bic" or "aic"}, optional, (default="bic")
Select the best model based on Bayesian Information Criterion (bic) or Akaike Information Criterion (aic).
- verboseint, optional (default = 0)
Enable verbose output. If 1, it prints the current initialization and each iteration step. If greater than 1, it also prints the log probability and the time needed for each step.
- max_agglom_sizeint or None, optional (default = 2000)
The maximum number of datapoints on which to do agglomerative clustering as the initialization to GMM. If the number of datapoints is larger than this value, a random subset of the data is used for agglomerative initialization. If None, all data is used for agglomerative clustering for initialization.
- n_jobsint or None, optional (default = None)
The number of jobs to use for the computation. This works by computing each of the initialization runs in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See https://scikit-learn.org/stable/glossary.html#term-n-jobs for more details.
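To make the size of the search space concrete, the sketch below enumerates the combinations implied by the parameter descriptions above when everything is left at 'all': ward linkage is paired only with euclidean affinity, and affinity 'none' has no linkage because no agglomeration is performed. How the library actually iterates over this grid is an assumption here; the function `initializations` is hypothetical.

```python
from itertools import product

AFFINITIES = ["euclidean", "manhattan", "cosine", "none"]
LINKAGES = ["ward", "complete", "average", "single"]
COVARIANCE_TYPES = ["spherical", "diag", "tied", "full"]


def initializations(affinities, linkages):
    """Valid (affinity, linkage) pairs according to the documented constraints."""
    combos = []
    for affinity in affinities:
        if affinity == "none":
            # No agglomeration: the GMM is initialized with k-means instead.
            combos.append(("none", None))
            continue
        for linkage in linkages:
            if linkage == "ward" and affinity != "euclidean":
                continue  # ward can only be used with euclidean affinity
            combos.append((affinity, linkage))
    return combos


inits = initializations(AFFINITIES, LINKAGES)

# Default component range: min_components=2 through max_components=10.
grid = [
    (affinity, linkage, cov, k)
    for (affinity, linkage), cov, k in product(inits, COVARIANCE_TYPES, range(2, 11))
]
print(len(inits), len(grid))
```

Under these assumptions the defaults imply a few hundred candidate fits, which is the motivation for n_jobs and for capping the agglomeration step with max_agglom_size.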
- Attributes:
- results_pandas.DataFrame
Contains exhaustive information about all the clustering runs. Columns are:
- 'model'GaussianMixture object
GMM clustering fit to the data
- 'bic/aic'float
Bayesian Information Criterion
- 'ari'float or nan
Adjusted Rand Index between GMM classification, and true classification, nan if y is not given
- 'n_components'int
number of clusters
- 'affinity'{'euclidean','manhattan','cosine','none'}
affinity used in Agglomerative Clustering
- 'linkage'{'ward','complete','average','single'}
linkage used in Agglomerative Clustering
- 'covariance_type'{'full', 'tied', 'diag', 'spherical'}
covariance type used in GMM
- 'reg_covar'float
regularization used in GMM
- criter_
the best (lowest) Bayesian Information Criterion
- n_components_int
number of clusters in the model with the best bic/aic
- covariance_type_str
covariance type in the model with the best bic/aic
- affinity_str
affinity used in the model with the best bic/aic
- linkage_str
linkage used in the model with the best bic/aic
- reg_covar_float
regularization used in the model with the best bic/aic
- ari_float
ARI from the model with the best bic/aic, nan if no y is given
- model_
sklearn.mixture.GaussianMixture
object with the best bic/aic
Notes
This algorithm was strongly inspired by mclust, a clustering package in R
References
[1]Jeffrey D. Banfield and Adrian E. Raftery. Model-based gaussian and non-gaussian clustering. Biometrics, 49:803–821, 1993.
[2]Abhijit Dasgupta and Adrian E. Raftery. Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93(441):294–302, 1998.
- __init__(min_components=2, max_components=10, affinity='all', linkage='all', covariance_type='all', random_state=None, label_init=None, kmeans_n_init=1, max_iter=100, verbose=0, selection_criteria='bic', max_agglom_size=2000, n_jobs=None)[source]¶
- fit(X, y=None)[source]¶
Fits gaussian mixture model to the data. Initialize with agglomerative clustering then estimate model parameters with EM algorithm.
- Parameters:
- Xarray-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- yarray-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
- Returns:
- selfobject
Returns an instance of self.
- Parameters:
X (ndarray) --
y (ndarray | None) --
- Return type:
- fit_predict(X, y=None)¶
Fit the models and predict clusters based on best model.
- Parameters:
- Xarray-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- yarray-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
- Returns:
- labelsarray, shape (n_samples,)
Component labels.
- Parameters:
X (ndarray) --
y (Any | None) --
- Return type:
ndarray
- get_metadata_routing()¶
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routingMetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters:
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- paramsdict
Parameter names mapped to their values.
- predict(X, y=None)¶
Predict clusters based on best model.
- Parameters:
- Xarray-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- yarray-like, shape (n_samples, ), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
- Returns:
- labelsarray, shape (n_samples,)
Component labels.
- Parameters:
X (ndarray) --
y (Any | None) --
- Return type:
ndarray
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **paramsdict
Estimator parameters.
- Returns:
- selfestimator instance
Estimator instance.
Hierarchical Clustering¶
- class graspologic.cluster.DivisiveCluster[source]¶
Recursively clusters data based on a chosen clustering algorithm. This algorithm implements a "divisive" or "top-down" approach.
- Parameters:
- cluster_methodstr {"gmm", "kmeans"}, defaults to "gmm".
The underlying clustering method to apply. If "gmm", will use AutoGMMCluster. If "kmeans", will use KMeansCluster.
- min_componentsint, defaults to 1.
The minimum number of mixture components/clusters to consider for the first split if "gmm" is selected as cluster_method; it is set to 1 for later splits. If cluster_method is "kmeans", it is set to 2 for all splits.
- max_componentsint, defaults to 2.
The maximum number of mixture components/clusters to consider at each split.
- min_splitint, defaults to 1.
The minimum size of a cluster for it to be considered to be split again.
- max_levelint, defaults to 4.
The maximum number of times to recursively cluster the data.
- delta_criterfloat, non-negative, defaults to 0.
The smallest difference between selection criterion values of a new model and the current model that is required to accept the new model. Applicable only if cluster_method is "gmm".
- cluster_kwsdict, defaults to {}
Keyword arguments (except min_components and max_components) for chosen clustering method.
- Attributes:
- model_GaussianMixture or KMeans object
Fitted clustering object based on which cluster_method was used.
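The top-down recursion controlled by max_level and min_split can be pictured with a toy example. This is only a sketch of the general divisive idea: the splitter `split_at_largest_gap` is a hypothetical stand-in for the GMM or k-means step, it always splits into two parts, and the stopping rules are simplified relative to the library.

```python
def split_at_largest_gap(values):
    """Hypothetical stand-in splitter: cut sorted 1-D values at the widest gap."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


def divisive(values, max_level=4, min_split=1, level=0):
    """Return a nested dict describing the top-down cluster tree."""
    node = {"members": sorted(values), "children": []}
    can_split = (
        level < max_level                       # depth limit
        and len(values) >= max(min_split, 2)    # cluster large enough to split
        and len(set(values)) > 1                # something to separate
    )
    if can_split:
        for part in split_at_largest_gap(values):
            node["children"].append(divisive(part, max_level, min_split, level + 1))
    return node


def depth(node):
    """Number of split levels below this node."""
    if not node["children"]:
        return 0
    return 1 + max(depth(child) for child in node["children"])


tree = divisive([0.0, 0.2, 5.0, 5.1, 9.0], max_level=2, min_split=2)
print(depth(tree))
```

Raising min_split prunes small clusters from further splitting, while max_level bounds the depth of the resulting tree regardless of cluster size.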
Notes
This class inherits from anytree.node.nodemixin.NodeMixin, a lightweight class for doing various simple operations on trees.
This algorithm was strongly inspired by maggotcluster, a divisive clustering algorithm in https://github.com/neurodata/maggot_models and the algorithm for estimating a hierarchical stochastic block model presented in [2].
References
[1]Athey, T. L., & Vogelstein, J. T. (2019). AutoGMM: Automatic Gaussian Mixture Modeling in Python. arXiv preprint arXiv:1909.02688.
[2]Lyzinski, V., Tang, M., Athreya, A., Park, Y., & Priebe, C. E (2016). Community detection and classification in hierarchical stochastic blockmodels. IEEE Transactions on Network Science and Engineering, 4(1), 13-26.
- __init__(cluster_method='gmm', min_components=1, max_components=2, cluster_kws={}, min_split=1, max_level=4, delta_criter=0)[source]¶
- fit(X)[source]¶
Fits clustering models to the data as well as resulting clusters
- Parameters:
- Xarray-like, shape (n_samples, n_features)
- Returns:
- selfobject
Returns an instance of self.
- Parameters:
X (ndarray) --
- Return type:
- fit_predict(X, fcluster=False, level=None)[source]¶
Fits clustering models to the data as well as to the resulting clusters, then uses the fitted models to predict a hierarchy of labels.
- Parameters:
- Xarray-like, shape (n_samples, n_features)
- fcluster: bool, default=False
If True, returned labels will be re-numbered so that each column of labels represents a flat clustering at the current level, and each label corresponds to a cluster indexed the same as the corresponding node in the overall clustering dendrogram.
- level: int, optional (default=None)
The level of a single flat clustering to generate. Only available if fcluster is True.
- Returns:
- labelsarray-like, shape (n_samples, n_levels)
if no level specified; otherwise, shape (n_samples,)
- Parameters:
- Return type:
ndarray
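The re-numbering that fcluster describes can be illustrated with a short sketch. It assumes, purely for the example, that the hierarchical labels are given per level relative to the parent cluster, and it numbers flat clusters by sorted label prefix; the exact numbering graspologic uses (tied to dendrogram nodes) may differ. The function `flat_labels` is hypothetical.

```python
def flat_labels(hier_labels):
    """Turn per-level, parent-relative labels into one flat clustering per level."""
    n_levels = len(hier_labels[0])
    flat = [[0] * n_levels for _ in hier_labels]
    for level in range(n_levels):
        # A flat cluster at this level is identified by the path from the root.
        prefixes = sorted({tuple(row[: level + 1]) for row in hier_labels})
        ids = {prefix: i for i, prefix in enumerate(prefixes)}
        for r, row in enumerate(hier_labels):
            flat[r][level] = ids[tuple(row[: level + 1])]
    return flat


# Four samples, two levels; second-level labels restart at 0 inside each parent.
hier = [[0, 0], [0, 1], [1, 0], [1, 0]]
flat = flat_labels(hier)

# Selecting one column mimics asking for a single level.
level_1 = [row[1] for row in flat]
print(flat, level_1)
```

Each column of the result is a valid flat clustering on its own, which is the property described for fcluster=True; picking one column corresponds to passing a level.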
- predict(X, fcluster=False, level=None)[source]¶
Predicts a hierarchy of labels based on fitted models
- Parameters:
- Xarray-like, shape (n_samples, n_features)
- fcluster: bool, default=False
If True, returned labels will be re-numbered so that each column of labels represents a flat clustering at the current level, and each label corresponds to a cluster indexed the same as the corresponding node in the overall clustering dendrogram.
- level: int, optional (default=None)
The level of a single flat clustering to generate. Only available if fcluster is True.
- Returns:
- labelsarray-like, shape (n_samples, n_levels)
if no level specified; otherwise, shape (n_samples,)
- Parameters:
- Return type:
ndarray
- set_predict_request(*, fcluster='$UNCHANGED$', level='$UNCHANGED$')¶
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
- fclusterstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for fcluster parameter in predict.
- levelstr, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for level parameter in predict.
- Returns:
- selfobject
The updated object.
- Parameters:
self (DivisiveCluster) --
- Return type: