Pipeline

The pipeline module includes a collection of higher level API abstractions from the functionality exposed elsewhere in graspologic. The classes, functions, and modules elsewhere in graspologic are intended to provide fine-grained, expert-level control over the features they implement. These building blocks provide an excellent backbone of utility, for researchers in mathematics and science, especially as they hew so closely to scikit-learn's programming paradigms and object model.

But for software engineers and datascientists, there is a certain ritualistic cost to preparing a graph, setting up the objects for use, and tearing them down afterwards.

pipeline is intended to smooth the transition between a common developer and a graph machine learning subject matter expert. We make a presumption that most programmers are software developers first, and dabbling in ML second, and our intention is to bridge this gap.

GraphBuilder

class graspologic.pipeline.GraphBuilder[source]

GraphBuilder is a simple builder for networkx Graphs. To use less memory, it automatically maps all node ids of any hashable type to int.

In other words, if you can use it as a key in a dictionary, it will work.

By default, the main method it provides, add_edge, will sum edge weights if the edge already exists.

Parameters:
directedbool (default=False)

Used to create either a networkx.Graph or networkx.DiGraph object.

__init__(directed=False)[source]
Parameters:

directed (bool) --

add_edge(source, target, weight=1.0, sum_weight=True, **attributes)[source]

Adds a weighted edge between the provided source and target. The source and target id are converted to a unique int.

If no weight is provided, a default weight of 1.0 is used.

If an edge between the source and target already exists, and if the sum_weight argument is True, then the weights are summed.

Otherwise, the last weight provided will be used as the edge's weight.

Any other attributes specified will be added to the edge's data dictionary.

Parameters:
sourceAny

source node id

targetAny

target node id

weightUnion[int, float] (default=1.0)

The weight for the edge. If none is provided, the weight is defaulted to 1.

sum_weightbool (default=True)

If an edge between the source and target already exist, should we sum the edge weights or overwrite the edge weight with the provided weight value.

attributeskwargs

The attributes kwargs are presumed to be attributes that should be added to the edge dictionary for source and target.

Parameters:
Return type:

None

build()[source]
Returns:
Tuple[Union[nx.Graph, nx.DiGraph], Dict[Any, int], List[Any]]

The returned tuple is either an undirected or directed graph, depending on the constructor argument directed. The second value in the tuple is a dictionary of original node ids to their assigned integer ids. The third and final value in the tuple is a List of original node ids, where the index corresponds to the assigned integer and the value is the corresponding original ID.

Return type:

Tuple[Graph | DiGraph, Dict[Any, int], List[Any]]

Embed

The embed module of graspologic.pipeline.embed is intended to provide faster application development support. The functions provided in it reflect common call patterns used when developing data processing pipelines and future consumption by nearest neighbor services and visualization routines.

class graspologic.pipeline.embed.embeddings.Embeddings[source]

Embeddings is an iterable, indexed interface over the parallel numpy arrays that are generated by our embedding functions.

>>> import numpy as np
>>> raw_embeddings = np.array([[1,2,3,4,5], [6,4,2,0,8]])
>>> labels = np.array(["monotonic", "not_monotonic"])
>>> embeddings_obj = Embeddings(labels, raw_embeddings)
>>> embeddings_obj[0]
('monotonic', array([1, 2, 3, 4, 5]))
>>> len(embeddings_obj)
2
>>> list(iter(embeddings_obj))
[('monotonic', array([1, 2, 3, 4, 5])), ('not_monotonic', array([6, 4, 2, 0, 8]))]
>>> embeddings_lookup = embeddings_obj.as_dict()
>>> embeddings_lookup["not_monotonic"]
array([6, 4, 2, 0, 8])
Parameters:
labelsnp.ndarray

The node labels that are positionally correlated with the embeddings. The dtype of labels is any object stored in a networkx Graph object, though type uniformity will be required

embeddingsnp.ndarray

The embedded values generated by the embedding technique.

Raises:
beartype.roar.BeartypeCallHintParamViolation if the types are invalid
ValueError if the row count of labels does not equal the row count of embeddings
__init__(labels, embeddings)[source]

Embeddings is an iterable, indexed interface over the parallel numpy arrays that are generated by our embedding functions.

>>> import numpy as np
>>> raw_embeddings = np.array([[1,2,3,4,5], [6,4,2,0,8]])
>>> labels = np.array(["monotonic", "not_monotonic"])
>>> embeddings_obj = Embeddings(labels, raw_embeddings)
>>> embeddings_obj[0]
('monotonic', array([1, 2, 3, 4, 5]))
>>> len(embeddings_obj)
2
>>> list(iter(embeddings_obj))
[('monotonic', array([1, 2, 3, 4, 5])), ('not_monotonic', array([6, 4, 2, 0, 8]))]
>>> embeddings_lookup = embeddings_obj.as_dict()
>>> embeddings_lookup["not_monotonic"]
array([6, 4, 2, 0, 8])
Parameters:
labelsnp.ndarray

The node labels that are positionally correlated with the embeddings. The dtype of labels is any object stored in a networkx Graph object, though type uniformity will be required

embeddingsnp.ndarray

The embedded values generated by the embedding technique.

Raises:
beartype.roar.BeartypeCallHintParamViolation if the types are invalid
ValueError if the row count of labels does not equal the row count of embeddings
Parameters:
  • labels (ndarray) --

  • embeddings (ndarray) --

labels()[source]
Return type:

ndarray

embeddings()[source]
Return type:

ndarray

as_dict()[source]
Return type:

EmbeddingsView

graspologic.pipeline.embed.adjacency_spectral_embedding(graph, dimensions=100, elbow_cut=None, svd_solver_algorithm='randomized', svd_solver_iterations=5, svd_seed=None, weight_attribute='weight')[source]

Given a directed or undirected networkx graph (not multigraph), generate an Embeddings object.

Adjacency spectral embeddings are extremely egocentric, implying that results are slanted toward the core-periphery of each node. This is in contrast to Laplacian spectral embeddings, which look further into the latent space when it captures change.

Adjacency Spectral Embedding Tutorial

Graphs will always have their diagonal augmented. In other words, a self-loop will be created for each node with a weight corresponding to the weighted degree.

Lastly, all weights will be rescaled based on their relative rank in the graph, which is beneficial in minimizing anomalous results if some edge weights are extremely atypical of the rest of the graph.

Parameters:
graphUnion[nx.Graph, nx.OrderedGraph, nx.DiGraph, nx.OrderedDiGraph]

An undirected or directed graph. The graph must:

  • be fully numerically weighted (every edge must have a real, numeric weight or else it will be treated as an unweighted graph)

  • be a basic graph (meaning it should not be a multigraph; if you have a multigraph you must first decide how you want to handle the weights of the edges between two nodes, whether summed, averaged, last-wins, maximum-weight-only, etc)

dimensionsint (default=100)

Dimensions to use for the svd solver. For undirected graphs, if elbow_cut==None, you will receive an embedding that has nodes rows and dimensions columns. For directed graphs, if elbow_cut==None, you will receive an embedding that has nodes rows and 2*dimensions columns. If elbow_cut is specified to be not None, we will cut the embedding at elbow_cut elbow, but the provided dimensions will be used in the creation of the SVD.

elbow_cutOptional[int] (default=None)

Using a process described by Zhu & Ghodsi in their paper "Automatic dimensionality selection from the scree plot via the use of profile likelihood", truncate the dimensionality of the return on the elbow_cut-th elbow. By default this value is None but can be used to reduce the dimensionality of the returned tensors.

svd_solver_algorithmstr (default="randomized")

allowed values: {'randomized', 'full', 'truncated'}

SVD solver to use:

svd_solver_iterationsint (default=5)

Number of iterations for randomized SVD solver. Not used by 'full' or 'truncated'. The default is larger than the default in randomized_svd to handle sparse matrices that may have large slowly decaying spectrum.

svd_seedOptional[int] (default=None)

Used to seed the PRNG used in the randomized svd solver algorithm.

weight_attributestr (default="weight")

The edge dictionary key that contains the weight of the edge.

Returns:
Embeddings
Raises:
beartype.roar.BeartypeCallHintParamViolation if parameters do not match type hints
ValueError if values are not within appropriate ranges or allowed values
Parameters:
  • graph (Graph | DiGraph) --

  • dimensions (int) --

  • elbow_cut (int | None) --

  • svd_solver_algorithm (typing_extensions.Literal[full, truncated, randomized, eigsh]) --

  • svd_solver_iterations (int) --

  • svd_seed (int | None) --

  • weight_attribute (str) --

Return type:

Embeddings

See also

graspologic.pipeline.embed.Embeddings
graspologic.embed.AdjacencySpectralEmbed
graspologic.embed.select_svd

Notes

The singular value decomposition:

\[A = U \Sigma V^T\]

is used to find an orthonormal basis for a matrix, which in our case is the adjacency matrix of the graph. These basis vectors (in the matrices U or V) are ordered according to the amount of variance they explain in the original matrix. By selecting a subset of these basis vectors (through our choice of dimensionality reduction) we can find a lower dimensional space in which to represent the graph.

References

[1]

Sussman, D.L., Tang, M., Fishkind, D.E., Priebe, C.E. "A Consistent Adjacency Spectral Embedding for Stochastic Blockmodel Graphs," Journal of the American Statistical Association, Vol. 107(499), 2012

[2]

Levin, K., Roosta-Khorasani, F., Mahoney, M. W., & Priebe, C. E. (2018). Out-of-sample extension of graph adjacency spectral embedding. PMLR: Proceedings of Machine Learning Research, 80, 2975-2984.

[3]

Zhu, M. and Ghodsi, A. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis, 51(2), pp.918-930.

graspologic.pipeline.embed.laplacian_spectral_embedding(graph, form='R-DAD', dimensions=100, elbow_cut=None, svd_solver_algorithm='randomized', svd_solver_iterations=5, svd_seed=None, weight_attribute='weight', regularizer=None)[source]

Given a directed or undirected networkx graph (not multigraph), generate an Embeddings object.

The laplacian spectral embedding process is similar to the adjacency spectral embedding process, with the key differentiator being that the LSE process looks further into the latent space when it captures changes, whereas the ASE process is egocentric and focused on immediate differentiators in a node's periphery.

All weights will be rescaled based on their relative rank in the graph, which is beneficial in minimizing anomalous results if some edge weights are extremely atypical of the rest of the graph.

Parameters:
graphUnion[nx.Graph, nx.OrderedGraph, nx.DiGraph, nx.OrderedDiGraph]

An undirected or directed graph. The graph must:

  • be fully numerically weighted (every edge must have a real, numeric weight or else it will be treated as an unweighted graph)

  • be a basic graph (meaning it should not be a multigraph; if you have a multigraph you must first decide how you want to handle the weights of the edges between two nodes, whether summed, averaged, last-wins, maximum-weight-only, etc)

formstr (default="R-DAD")

Specifies the type of Laplacian normalization to use. Allowed values are: { "DAD", "I-DAD", "R-DAD" }. See to_laplacian() for more details regarding form.

dimensionsint (default=100)

Dimensions to use for the svd solver. For undirected graphs, if elbow_cut==None, you will receive an embedding that has nodes rows and dimensions columns. For directed graphs, if elbow_cut==None, you will receive an embedding that has nodes rows and 2*dimensions columns. If elbow_cut is specified to be not None, we will cut the embedding at elbow_cut elbow, but the provided dimensions will be used in the creation of the SVD.

elbow_cutOptional[int] (default=None)

Using a process described by Zhu & Ghodsi in their paper "Automatic dimensionality selection from the scree plot via the use of profile likelihood", truncate the dimensionality of the return on the elbow_cut-th elbow. By default this value is None but can be used to reduce the dimensionality of the returned tensors.

svd_solver_algorithmstr (default="randomized")

allowed values: {'randomized', 'full', 'truncated'}

SVD solver to use:

svd_solver_iterationsint (default=5)

Number of iterations for randomized SVD solver. Not used by 'full' or 'truncated'. The default is larger than the default in randomized_svd to handle sparse matrices that may have large slowly decaying spectrum.

svd_seedOptional[int] (default=None)

Used to seed the PRNG used in the randomized svd solver algorithm.

weight_attributestr (default="weight")

The edge dictionary key that contains the weight of the edge.

regularizerOptional[numbers.Real] (default=None)

Only used when form="R-DAD". Must be None or nonnegative. Constant to be added to the diagonal of degree matrix. If None, average node degree is added. If int or float, must be >= 0.

Returns:
Embeddings
Raises:
beartype.roar.BeartypeCallHintParamViolation if parameters do not match type hints
ValueError if values are not within appropriate ranges or allowed values
Parameters:
  • graph (Graph | DiGraph) --

  • form (typing_extensions.Literal[I-DAD, DAD, R-DAD]) --

  • dimensions (int) --

  • elbow_cut (int | None) --

  • svd_solver_algorithm (typing_extensions.Literal[full, truncated, randomized, eigsh]) --

  • svd_solver_iterations (int) --

  • svd_seed (int | None) --

  • weight_attribute (str) --

  • regularizer (Real | None) --

Return type:

Embeddings

Notes

The singular value decomposition:

\[A = U \Sigma V^T\]

is used to find an orthonormal basis for a matrix, which in our case is the Laplacian matrix of the graph. These basis vectors (in the matrices U or V) are ordered according to the amount of variance they explain in the original matrix. By selecting a subset of these basis vectors (through our choice of dimensionality reduction) we can find a lower dimensional space in which to represent the graph.

References

[1]

Sussman, D.L., Tang, M., Fishkind, D.E., Priebe, C.E. "A Consistent Adjacency Spectral Embedding for Stochastic Blockmodel Graphs," Journal of the American Statistical Association, Vol. 107(499), 2012.

[2]

Von Luxburg, Ulrike. "A tutorial on spectral clustering," Statistics and computing, Vol. 17(4), pp. 395-416, 2007.

[3]

Rohe, Karl, Sourav Chatterjee, and Bin Yu. "Spectral clustering and the high-dimensional stochastic blockmodel," The Annals of Statistics, Vol. 39(4), pp. 1878-1915, 2011.

[4]

Zhu, M. and Ghodsi, A. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis, 51(2), pp.918-930.

graspologic.pipeline.embed.omnibus_embedding_pairwise(graphs, dimensions=100, elbow_cut=None, svd_solver_algorithm='randomized', svd_solver_iterations=5, svd_seed=None, weight_attribute='weight', use_laplacian=False)[source]

Generates a pairwise omnibus embedding for each pair of graphs in a list of graphs using the adjacency matrix. If given graphs A, B, and C, the embeddings will be computed for A, B and B, C.

If the node labels differ between each pair of graphs, then those nodes will only be found in the resulting embedding if they exist in the largest connected component of the union of all edges across all graphs in the time series.

Graphs will always have their diagonal augmented. In other words, a self-loop will be created for each node with a weight corresponding to the weighted degree.

Lastly, all weights will be rescaled based on their relative rank in the graph, which is beneficial in minimizing anomalous results if some edge weights are extremely atypical of the rest of the graph.

Parameters:
graphsList[Union[nx.Graph, nx.OrderedGraph, nx.DiGraph, nx.OrderedDiGraph]]

A list of undirected or directed graphs. The graphs must:

  • be fully numerically weighted (every edge must have a real, numeric weight or else it will be treated as an unweighted graph)

  • be a basic graph (meaning it should not be a multigraph; if you have a multigraph you must first decide how you want to handle the weights of the edges between two nodes, whether summed, averaged, last-wins, maximum-weight-only, etc)

dimensionsint (default=100)

Dimensions to use for the svd solver. For undirected graphs, if elbow_cut==None, you will receive an embedding that has nodes rows and dimensions columns. For directed graphs, if elbow_cut==None, you will receive an embedding that has nodes rows and 2*dimensions columns. If elbow_cut is specified to be not None, we will cut the embedding at elbow_cut elbow, but the provided dimensions will be used in the creation of the SVD.

elbow_cutOptional[int] (default=None)

Using a process described by Zhu & Ghodsi in their paper "Automatic dimensionality selection from the scree plot via the use of profile likelihood", truncate the dimensionality of the return on the elbow_cut-th elbow. By default this value is None but can be used to reduce the dimensionality of the returned tensors.

svd_solver_algorithmstr (default="randomized")

allowed values: {'randomized', 'full', 'truncated'}

SVD solver to use:

svd_solver_iterationsint (default=5)

Number of iterations for randomized SVD solver. Not used by 'full' or 'truncated'. The default is larger than the default in randomized_svd to handle sparse matrices that may have large slowly decaying spectrum.

svd_seedOptional[int] (default=None)

Used to seed the PRNG used in the randomized svd solver algorithm.

weight_attributestr (default="weight")

The edge dictionary key that contains the weight of the edge.

use_laplacianbool (default=False)

Determine whether to use the Laplacian matrix of each graph in order to calculate the omnibus embedding using the Laplacian spectral embedding technique.

Returns:
List[Tuple[Embeddings, Embeddings]]
Raises:
beartype.roar.BeartypeCallHintParamViolation if parameters do not match type hints
ValueError if values are not within appropriate ranges or allowed values
Parameters:
  • graphs (List[Graph | DiGraph]) --

  • dimensions (int) --

  • elbow_cut (int | None) --

  • svd_solver_algorithm (typing_extensions.Literal[full, truncated, randomized, eigsh]) --

  • svd_solver_iterations (int) --

  • svd_seed (int | None) --

  • weight_attribute (str) --

  • use_laplacian (bool) --

Return type:

List[Tuple[Embeddings, Embeddings]]

References

[1]

Levin, K., Athreya, A., Tang, M., Lyzinski, V., & Priebe, C. E. (2017, November). A central limit theorem for an omnibus embedding of multiple random dot product graphs. In Data Mining Workshops (ICDMW), 2017 IEEE International Conference on (pp. 964-967). IEEE.

[2]

Sussman, D.L., Tang, M., Fishkind, D.E., Priebe, C.E. "A Consistent Adjacency Spectral Embedding for Stochastic Blockmodel Graphs," Journal of the American Statistical Association, Vol. 107(499), 2012

[3]

Levin, K., Roosta-Khorasani, F., Mahoney, M. W., & Priebe, C. E. (2018). Out-of-sample extension of graph adjacency spectral embedding. PMLR: Proceedings of Machine Learning Research, 80, 2975-2984.

[4]

Zhu, M. and Ghodsi, A. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis, 51(2), pp.918-930.