Pipeline¶
The pipeline
module includes a collection of higher level API abstractions from
the functionality exposed elsewhere in graspologic
. The classes, functions, and
modules elsewhere in graspologic
are intended to provide fine-grained, expert-level
control over the features they implement. These building blocks provide an excellent
backbone of utility, for researchers in mathematics and science, especially as they
hew so closely to scikit-learn
's programming paradigms and object model.
But for software engineers and datascientists, there is a certain ritualistic cost to preparing a graph, setting up the objects for use, and tearing them down afterwards.
pipeline
is intended to smooth the transition between a common developer and
a graph machine learning subject matter expert. We make a presumption that most
programmers are software developers first, and dabbling in ML second, and our intention
is to bridge this gap.
GraphBuilder¶
- class graspologic.pipeline.GraphBuilder[source]¶
GraphBuilder is a simple builder for networkx Graphs. To use less memory, it automatically maps all node ids of any hashable type to
int
.In other words, if you can use it as a key in a dictionary, it will work.
By default, the main method it provides,
add_edge
, will sum edge weights if the edge already exists.- Parameters:
- directedbool (default=False)
Used to create either a
networkx.Graph
ornetworkx.DiGraph
object.
- add_edge(source, target, weight=1.0, sum_weight=True, **attributes)[source]¶
Adds a weighted edge between the provided source and target. The source and target id are converted to a unique
int
.If no weight is provided, a default weight of
1.0
is used.If an edge between the source and target already exists, and if the
sum_weight
argument isTrue
, then the weights are summed.Otherwise, the last weight provided will be used as the edge's weight.
Any other attributes specified will be added to the edge's data dictionary.
- Parameters:
- sourceAny
source node id
- targetAny
target node id
- weightUnion[int, float] (default=1.0)
The weight for the edge. If none is provided, the weight is defaulted to 1.
- sum_weightbool (default=True)
If an edge between the
source
andtarget
already exist, should we sum the edge weights or overwrite the edge weight with the providedweight
value.- attributeskwargs
The attributes kwargs are presumed to be attributes that should be added to the edge dictionary for
source
andtarget
.
- Parameters:
- Return type:
None
- build()[source]¶
- Returns:
- Tuple[Union[nx.Graph, nx.DiGraph], Dict[Any, int], List[Any]]
The returned tuple is either an undirected or directed graph, depending on the constructor argument
directed
. The second value in the tuple is a dictionary of original node ids to their assigned integer ids. The third and final value in the tuple is a List of original node ids, where the index corresponds to the assigned integer and the value is the corresponding original ID.
- Return type:
Embed¶
The embed module of graspologic.pipeline.embed
is intended to provide faster
application development support. The functions provided in it reflect common call
patterns used when developing data processing pipelines and future consumption
by nearest neighbor services and visualization routines.
- class graspologic.pipeline.embed.embeddings.Embeddings[source]¶
Embeddings
is an iterable, indexed interface over the parallel numpy arrays that are generated by our embedding functions.>>> import numpy as np >>> raw_embeddings = np.array([[1,2,3,4,5], [6,4,2,0,8]]) >>> labels = np.array(["monotonic", "not_monotonic"]) >>> embeddings_obj = Embeddings(labels, raw_embeddings) >>> embeddings_obj[0] ('monotonic', array([1, 2, 3, 4, 5])) >>> len(embeddings_obj) 2 >>> list(iter(embeddings_obj)) [('monotonic', array([1, 2, 3, 4, 5])), ('not_monotonic', array([6, 4, 2, 0, 8]))] >>> embeddings_lookup = embeddings_obj.as_dict() >>> embeddings_lookup["not_monotonic"] array([6, 4, 2, 0, 8])
- Parameters:
- labelsnp.ndarray
The node labels that are positionally correlated with the embeddings. The dtype of labels is any object stored in a networkx Graph object, though type uniformity will be required
- embeddingsnp.ndarray
The embedded values generated by the embedding technique.
- Raises:
- beartype.roar.BeartypeCallHintParamViolation if the types are invalid
- ValueError if the row count of labels does not equal the row count of embeddings
- __init__(labels, embeddings)[source]¶
Embeddings
is an iterable, indexed interface over the parallel numpy arrays that are generated by our embedding functions.>>> import numpy as np >>> raw_embeddings = np.array([[1,2,3,4,5], [6,4,2,0,8]]) >>> labels = np.array(["monotonic", "not_monotonic"]) >>> embeddings_obj = Embeddings(labels, raw_embeddings) >>> embeddings_obj[0] ('monotonic', array([1, 2, 3, 4, 5])) >>> len(embeddings_obj) 2 >>> list(iter(embeddings_obj)) [('monotonic', array([1, 2, 3, 4, 5])), ('not_monotonic', array([6, 4, 2, 0, 8]))] >>> embeddings_lookup = embeddings_obj.as_dict() >>> embeddings_lookup["not_monotonic"] array([6, 4, 2, 0, 8])
- Parameters:
- labelsnp.ndarray
The node labels that are positionally correlated with the embeddings. The dtype of labels is any object stored in a networkx Graph object, though type uniformity will be required
- embeddingsnp.ndarray
The embedded values generated by the embedding technique.
- Raises:
- beartype.roar.BeartypeCallHintParamViolation if the types are invalid
- ValueError if the row count of labels does not equal the row count of embeddings
- Parameters:
labels (ndarray) --
embeddings (ndarray) --
- graspologic.pipeline.embed.adjacency_spectral_embedding(graph, dimensions=100, elbow_cut=None, svd_solver_algorithm='randomized', svd_solver_iterations=5, svd_seed=None, weight_attribute='weight')[source]¶
Given a directed or undirected networkx graph (not multigraph), generate an Embeddings object.
Adjacency spectral embeddings are extremely egocentric, implying that results are slanted toward the core-periphery of each node. This is in contrast to Laplacian spectral embeddings, which look further into the latent space when it captures change.
Adjacency Spectral Embedding Tutorial
Graphs will always have their diagonal augmented. In other words, a self-loop will be created for each node with a weight corresponding to the weighted degree.
Lastly, all weights will be rescaled based on their relative rank in the graph, which is beneficial in minimizing anomalous results if some edge weights are extremely atypical of the rest of the graph.
- Parameters:
- graphUnion[nx.Graph, nx.OrderedGraph, nx.DiGraph, nx.OrderedDiGraph]
An undirected or directed graph. The graph must:
be fully numerically weighted (every edge must have a real, numeric weight or else it will be treated as an unweighted graph)
be a basic graph (meaning it should not be a multigraph; if you have a multigraph you must first decide how you want to handle the weights of the edges between two nodes, whether summed, averaged, last-wins, maximum-weight-only, etc)
- dimensionsint (default=100)
Dimensions to use for the svd solver. For undirected graphs, if
elbow_cut==None
, you will receive an embedding that hasnodes
rows anddimensions
columns. For directed graphs, ifelbow_cut==None
, you will receive an embedding that hasnodes
rows and2*dimensions
columns. Ifelbow_cut
is specified to be notNone
, we will cut the embedding atelbow_cut
elbow, but the provideddimensions
will be used in the creation of the SVD.- elbow_cutOptional[int] (default=None)
Using a process described by Zhu & Ghodsi in their paper "Automatic dimensionality selection from the scree plot via the use of profile likelihood", truncate the dimensionality of the return on the
elbow_cut
-th elbow. By default this value isNone
but can be used to reduce the dimensionality of the returned tensors.- svd_solver_algorithmstr (default="randomized")
allowed values: {'randomized', 'full', 'truncated'}
SVD solver to use:
- 'randomized'
Computes randomized svd using
sklearn.utils.extmath.randomized_svd()
- 'full'
Computes full svd using
scipy.linalg.svd()
Does not supportgraph
input of type scipy.sparse.csr_array
- 'truncated'
Computes truncated svd using
scipy.sparse.linalg.svds()
- svd_solver_iterationsint (default=5)
Number of iterations for randomized SVD solver. Not used by 'full' or 'truncated'. The default is larger than the default in randomized_svd to handle sparse matrices that may have large slowly decaying spectrum.
- svd_seedOptional[int] (default=None)
Used to seed the PRNG used in the
randomized
svd solver algorithm.- weight_attributestr (default="weight")
The edge dictionary key that contains the weight of the edge.
- Returns:
- Embeddings
- Raises:
- beartype.roar.BeartypeCallHintParamViolation if parameters do not match type hints
- ValueError if values are not within appropriate ranges or allowed values
- Parameters:
- Return type:
See also
graspologic.pipeline.embed.Embeddings
graspologic.embed.AdjacencySpectralEmbed
graspologic.embed.select_svd
Notes
The singular value decomposition:
\[A = U \Sigma V^T\]is used to find an orthonormal basis for a matrix, which in our case is the adjacency matrix of the graph. These basis vectors (in the matrices U or V) are ordered according to the amount of variance they explain in the original matrix. By selecting a subset of these basis vectors (through our choice of dimensionality reduction) we can find a lower dimensional space in which to represent the graph.
References
[1]Sussman, D.L., Tang, M., Fishkind, D.E., Priebe, C.E. "A Consistent Adjacency Spectral Embedding for Stochastic Blockmodel Graphs," Journal of the American Statistical Association, Vol. 107(499), 2012
[2]Levin, K., Roosta-Khorasani, F., Mahoney, M. W., & Priebe, C. E. (2018). Out-of-sample extension of graph adjacency spectral embedding. PMLR: Proceedings of Machine Learning Research, 80, 2975-2984.
[3]Zhu, M. and Ghodsi, A. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis, 51(2), pp.918-930.
- graspologic.pipeline.embed.laplacian_spectral_embedding(graph, form='R-DAD', dimensions=100, elbow_cut=None, svd_solver_algorithm='randomized', svd_solver_iterations=5, svd_seed=None, weight_attribute='weight', regularizer=None)[source]¶
Given a directed or undirected networkx graph (not multigraph), generate an Embeddings object.
The laplacian spectral embedding process is similar to the adjacency spectral embedding process, with the key differentiator being that the LSE process looks further into the latent space when it captures changes, whereas the ASE process is egocentric and focused on immediate differentiators in a node's periphery.
All weights will be rescaled based on their relative rank in the graph, which is beneficial in minimizing anomalous results if some edge weights are extremely atypical of the rest of the graph.
- Parameters:
- graphUnion[nx.Graph, nx.OrderedGraph, nx.DiGraph, nx.OrderedDiGraph]
An undirected or directed graph. The graph must:
be fully numerically weighted (every edge must have a real, numeric weight or else it will be treated as an unweighted graph)
be a basic graph (meaning it should not be a multigraph; if you have a multigraph you must first decide how you want to handle the weights of the edges between two nodes, whether summed, averaged, last-wins, maximum-weight-only, etc)
- formstr (default="R-DAD")
Specifies the type of Laplacian normalization to use. Allowed values are: { "DAD", "I-DAD", "R-DAD" }. See
to_laplacian()
for more details regarding form.- dimensionsint (default=100)
Dimensions to use for the svd solver. For undirected graphs, if
elbow_cut==None
, you will receive an embedding that hasnodes
rows anddimensions
columns. For directed graphs, ifelbow_cut==None
, you will receive an embedding that hasnodes
rows and2*dimensions
columns. Ifelbow_cut
is specified to be notNone
, we will cut the embedding atelbow_cut
elbow, but the provideddimensions
will be used in the creation of the SVD.- elbow_cutOptional[int] (default=None)
Using a process described by Zhu & Ghodsi in their paper "Automatic dimensionality selection from the scree plot via the use of profile likelihood", truncate the dimensionality of the return on the
elbow_cut
-th elbow. By default this value isNone
but can be used to reduce the dimensionality of the returned tensors.- svd_solver_algorithmstr (default="randomized")
allowed values: {'randomized', 'full', 'truncated'}
SVD solver to use:
- 'randomized'
Computes randomized svd using
sklearn.utils.extmath.randomized_svd()
- 'full'
Computes full svd using
scipy.linalg.svd()
Does not supportgraph
input of type scipy.sparse.csr_array
- 'truncated'
Computes truncated svd using
scipy.sparse.linalg.svds()
- svd_solver_iterationsint (default=5)
Number of iterations for randomized SVD solver. Not used by 'full' or 'truncated'. The default is larger than the default in randomized_svd to handle sparse matrices that may have large slowly decaying spectrum.
- svd_seedOptional[int] (default=None)
Used to seed the PRNG used in the
randomized
svd solver algorithm.- weight_attributestr (default="weight")
The edge dictionary key that contains the weight of the edge.
- regularizerOptional[numbers.Real] (default=None)
Only used when form="R-DAD". Must be None or nonnegative. Constant to be added to the diagonal of degree matrix. If None, average node degree is added. If int or float, must be >= 0.
- Returns:
- Embeddings
- Raises:
- beartype.roar.BeartypeCallHintParamViolation if parameters do not match type hints
- ValueError if values are not within appropriate ranges or allowed values
- Parameters:
- Return type:
See also
graspologic.pipeline.embed.Embeddings
graspologic.embed.LaplacianSpectralEmbed
graspologic.embed.select_svd
graspologic.utils.to_laplacian
Notes
The singular value decomposition:
\[A = U \Sigma V^T\]is used to find an orthonormal basis for a matrix, which in our case is the Laplacian matrix of the graph. These basis vectors (in the matrices U or V) are ordered according to the amount of variance they explain in the original matrix. By selecting a subset of these basis vectors (through our choice of dimensionality reduction) we can find a lower dimensional space in which to represent the graph.
References
[1]Sussman, D.L., Tang, M., Fishkind, D.E., Priebe, C.E. "A Consistent Adjacency Spectral Embedding for Stochastic Blockmodel Graphs," Journal of the American Statistical Association, Vol. 107(499), 2012.
[2]Von Luxburg, Ulrike. "A tutorial on spectral clustering," Statistics and computing, Vol. 17(4), pp. 395-416, 2007.
[3]Rohe, Karl, Sourav Chatterjee, and Bin Yu. "Spectral clustering and the high-dimensional stochastic blockmodel," The Annals of Statistics, Vol. 39(4), pp. 1878-1915, 2011.
[4]Zhu, M. and Ghodsi, A. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis, 51(2), pp.918-930.
- graspologic.pipeline.embed.omnibus_embedding_pairwise(graphs, dimensions=100, elbow_cut=None, svd_solver_algorithm='randomized', svd_solver_iterations=5, svd_seed=None, weight_attribute='weight', use_laplacian=False)[source]¶
Generates a pairwise omnibus embedding for each pair of graphs in a list of graphs using the adjacency matrix. If given graphs A, B, and C, the embeddings will be computed for A, B and B, C.
If the node labels differ between each pair of graphs, then those nodes will only be found in the resulting embedding if they exist in the largest connected component of the union of all edges across all graphs in the time series.
Graphs will always have their diagonal augmented. In other words, a self-loop will be created for each node with a weight corresponding to the weighted degree.
Lastly, all weights will be rescaled based on their relative rank in the graph, which is beneficial in minimizing anomalous results if some edge weights are extremely atypical of the rest of the graph.
- Parameters:
- graphsList[Union[nx.Graph, nx.OrderedGraph, nx.DiGraph, nx.OrderedDiGraph]]
A list of undirected or directed graphs. The graphs must:
be fully numerically weighted (every edge must have a real, numeric weight or else it will be treated as an unweighted graph)
be a basic graph (meaning it should not be a multigraph; if you have a multigraph you must first decide how you want to handle the weights of the edges between two nodes, whether summed, averaged, last-wins, maximum-weight-only, etc)
- dimensionsint (default=100)
Dimensions to use for the svd solver. For undirected graphs, if
elbow_cut==None
, you will receive an embedding that hasnodes
rows anddimensions
columns. For directed graphs, ifelbow_cut==None
, you will receive an embedding that hasnodes
rows and2*dimensions
columns. Ifelbow_cut
is specified to be notNone
, we will cut the embedding atelbow_cut
elbow, but the provideddimensions
will be used in the creation of the SVD.- elbow_cutOptional[int] (default=None)
Using a process described by Zhu & Ghodsi in their paper "Automatic dimensionality selection from the scree plot via the use of profile likelihood", truncate the dimensionality of the return on the
elbow_cut
-th elbow. By default this value isNone
but can be used to reduce the dimensionality of the returned tensors.- svd_solver_algorithmstr (default="randomized")
allowed values: {'randomized', 'full', 'truncated'}
SVD solver to use:
- 'randomized'
Computes randomized svd using
sklearn.utils.extmath.randomized_svd()
- 'full'
Computes full svd using
scipy.linalg.svd()
Does not supportgraph
input of type scipy.sparse.csr_array
- 'truncated'
Computes truncated svd using
scipy.sparse.linalg.svds()
- svd_solver_iterationsint (default=5)
Number of iterations for randomized SVD solver. Not used by 'full' or 'truncated'. The default is larger than the default in randomized_svd to handle sparse matrices that may have large slowly decaying spectrum.
- svd_seedOptional[int] (default=None)
Used to seed the PRNG used in the
randomized
svd solver algorithm.- weight_attributestr (default="weight")
The edge dictionary key that contains the weight of the edge.
- use_laplacianbool (default=False)
Determine whether to use the Laplacian matrix of each graph in order to calculate the omnibus embedding using the Laplacian spectral embedding technique.
- Returns:
- List[Tuple[Embeddings, Embeddings]]
- Raises:
- beartype.roar.BeartypeCallHintParamViolation if parameters do not match type hints
- ValueError if values are not within appropriate ranges or allowed values
- Parameters:
- Return type:
See also
graspologic.pipeline.embed.Embeddings
graspologic.embed.OmnibusEmbed
graspologic.embed.AdjacencySpectralEmbed
graspologic.embed.select_svd
References
[1]Levin, K., Athreya, A., Tang, M., Lyzinski, V., & Priebe, C. E. (2017, November). A central limit theorem for an omnibus embedding of multiple random dot product graphs. In Data Mining Workshops (ICDMW), 2017 IEEE International Conference on (pp. 964-967). IEEE.
[2]Sussman, D.L., Tang, M., Fishkind, D.E., Priebe, C.E. "A Consistent Adjacency Spectral Embedding for Stochastic Blockmodel Graphs," Journal of the American Statistical Association, Vol. 107(499), 2012
[3]Levin, K., Roosta-Khorasani, F., Mahoney, M. W., & Priebe, C. E. (2018). Out-of-sample extension of graph adjacency spectral embedding. PMLR: Proceedings of Machine Learning Research, 80, 2975-2984.
[4]Zhu, M. and Ghodsi, A. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis, 51(2), pp.918-930.