Pipeline¶

The pipeline module includes a collection of higher level API abstractions from the functionality exposed elsewhere in graspologic. The classes, functions, and modules elsewhere in graspologic are intended to provide fine-grained, expert-level control over the features they implement. These building blocks provide an excellent backbone of utility, for researchers in mathematics and science, especially as they hew so closely to scikit-learn's programming paradigms and object model.

But for software engineers and datascientists, there is a certain ritualistic cost to preparing a graph, setting up the objects for use, and tearing them down afterwards.

pipeline is intended to smooth the transition between a common developer and a graph machine learning subject matter expert. We make a presumption that most programmers are software developers first, and dabbling in ML second, and our intention is to bridge this gap.

GraphBuilder¶

class graspologic.pipeline.GraphBuilder[source]¶

GraphBuilder is a simple builder for networkx Graphs. To use less memory, it automatically maps all node ids of any hashable type to int.

In other words, if you can use it as a key in a dictionary, it will work.

By default, the main method it provides, add_edge, will sum edge weights if the edge already exists.

Parameters:

directedbool (default=False): Used to create either a networkx.Graph or networkx.DiGraph object.

__init__(directed=False)[source]¶

Parameters:: directed (bool) --

add_edge(source, target, weight=1.0, sum_weight=True, **attributes)[source]¶

Adds a weighted edge between the provided source and target. The source and target id are converted to a unique int.

If no weight is provided, a default weight of 1.0 is used.

If an edge between the source and target already exists, and if the sum_weight argument is True, then the weights are summed.

Otherwise, the last weight provided will be used as the edge's weight.

Any other attributes specified will be added to the edge's data dictionary.

Parameters:

sourceAny: source node id
targetAny: target node id
weightUnion[int, float] (default=1.0): The weight for the edge. If none is provided, the weight is defaulted to 1.
sum_weightbool (default=True): If an edge between the source and target already exist, should we sum the edge weights or overwrite the edge weight with the provided weight value.
attributeskwargs: The attributes kwargs are presumed to be attributes that should be added to the edge dictionary for source and target.

Parameters:

source (Any) --
target (Any) --
weight (int | float) --
sum_weight (bool) --
attributes (Any) --

Return type:

None

build()[source]¶

Returns:

Tuple[Union[nx.Graph, nx.DiGraph], Dict[Any, int], List[Any]]: The returned tuple is either an undirected or directed graph, depending on the constructor argument directed. The second value in the tuple is a dictionary of original node ids to their assigned integer ids. The third and final value in the tuple is a List of original node ids, where the index corresponds to the assigned integer and the value is the corresponding original ID.

Return type:

Tuple[Graph | DiGraph, Dict[Any, int], List[Any]]

Embed¶

The embed module of graspologic.pipeline.embed is intended to provide faster application development support. The functions provided in it reflect common call patterns used when developing data processing pipelines and future consumption by nearest neighbor services and visualization routines.

class graspologic.pipeline.embed.embeddings.Embeddings[source]¶

Embeddings is an iterable, indexed interface over the parallel numpy arrays that are generated by our embedding functions.

>>> import numpy as np
>>> raw_embeddings = np.array([[1,2,3,4,5], [6,4,2,0,8]])
>>> labels = np.array(["monotonic", "not_monotonic"])
>>> embeddings_obj = Embeddings(labels, raw_embeddings)
>>> embeddings_obj[0]
('monotonic', array([1, 2, 3, 4, 5]))
>>> len(embeddings_obj)
2
>>> list(iter(embeddings_obj))
[('monotonic', array([1, 2, 3, 4, 5])), ('not_monotonic', array([6, 4, 2, 0, 8]))]
>>> embeddings_lookup = embeddings_obj.as_dict()
>>> embeddings_lookup["not_monotonic"]
array([6, 4, 2, 0, 8])

Parameters:

labelsnp.ndarray: The node labels that are positionally correlated with the embeddings. The dtype of labels is any object stored in a networkx Graph object, though type uniformity will be required
embeddingsnp.ndarray: The embedded values generated by the embedding technique.

Raises:

beartype.roar.BeartypeCallHintParamViolation if the types are invalid
ValueError if the row count of labels does not equal the row count of embeddings

__init__(labels, embeddings)[source]¶

Embeddings is an iterable, indexed interface over the parallel numpy arrays that are generated by our embedding functions.

>>> import numpy as np
>>> raw_embeddings = np.array([[1,2,3,4,5], [6,4,2,0,8]])
>>> labels = np.array(["monotonic", "not_monotonic"])
>>> embeddings_obj = Embeddings(labels, raw_embeddings)
>>> embeddings_obj[0]
('monotonic', array([1, 2, 3, 4, 5]))
>>> len(embeddings_obj)
2
>>> list(iter(embeddings_obj))
[('monotonic', array([1, 2, 3, 4, 5])), ('not_monotonic', array([6, 4, 2, 0, 8]))]
>>> embeddings_lookup = embeddings_obj.as_dict()
>>> embeddings_lookup["not_monotonic"]
array([6, 4, 2, 0, 8])

Parameters:

labelsnp.ndarray: The node labels that are positionally correlated with the embeddings. The dtype of labels is any object stored in a networkx Graph object, though type uniformity will be required
embeddingsnp.ndarray: The embedded values generated by the embedding technique.

Raises:

beartype.roar.BeartypeCallHintParamViolation if the types are invalid
ValueError if the row count of labels does not equal the row count of embeddings

Parameters:

labels (ndarray) --
embeddings (ndarray) --

labels()[source]¶

Return type:: ndarray

embeddings()[source]¶

Return type:: ndarray

as_dict()[source]¶

Return type:: EmbeddingsView

graspologic.pipeline.embed.adjacency_spectral_embedding(graph, dimensions=100, elbow_cut=None, svd_solver_algorithm='randomized', svd_solver_iterations=5, svd_seed=None, weight_attribute='weight')[source]¶

Given a directed or undirected networkx graph (not multigraph), generate an Embeddings object.

Adjacency spectral embeddings are extremely egocentric, implying that results are slanted toward the core-periphery of each node. This is in contrast to Laplacian spectral embeddings, which look further into the latent space when it captures change.

Adjacency Spectral Embedding Tutorial

Graphs will always have their diagonal augmented. In other words, a self-loop will be created for each node with a weight corresponding to the weighted degree.

Lastly, all weights will be rescaled based on their relative rank in the graph, which is beneficial in minimizing anomalous results if some edge weights are extremely atypical of the rest of the graph.

Parameters:

graphUnion[nx.Graph, nx.OrderedGraph, nx.DiGraph, nx.OrderedDiGraph]

An undirected or directed graph. The graph must:

be fully numerically weighted (every edge must have a real, numeric weight or else it will be treated as an unweighted graph)
be a basic graph (meaning it should not be a multigraph; if you have a multigraph you must first decide how you want to handle the weights of the edges between two nodes, whether summed, averaged, last-wins, maximum-weight-only, etc)

dimensionsint (default=100)

Dimensions to use for the svd solver. For undirected graphs, if elbow_cut==None, you will receive an embedding that has nodes rows and dimensions columns. For directed graphs, if elbow_cut==None, you will receive an embedding that has nodes rows and 2*dimensions columns. If elbow_cut is specified to be not None, we will cut the embedding at elbow_cut elbow, but the provided dimensions will be used in the creation of the SVD.

elbow_cutOptional[int] (default=None)

Using a process described by Zhu & Ghodsi in their paper "Automatic dimensionality selection from the scree plot via the use of profile likelihood", truncate the dimensionality of the return on the elbow_cut-th elbow. By default this value is None but can be used to reduce the dimensionality of the returned tensors.

svd_solver_algorithmstr (default="randomized")

allowed values: {'randomized', 'full', 'truncated'}

SVD solver to use:

'randomized'
Computes randomized svd using sklearn.utils.extmath.randomized_svd()

'full'
Computes full svd using scipy.linalg.svd() Does not support graph input of type scipy.sparse.csr_array

'truncated'
Computes truncated svd using scipy.sparse.linalg.svds()

svd_solver_iterationsint (default=5)

Number of iterations for randomized SVD solver. Not used by 'full' or 'truncated'. The default is larger than the default in randomized_svd to handle sparse matrices that may have large slowly decaying spectrum.

svd_seedOptional[int] (default=None)

Used to seed the PRNG used in the randomized svd solver algorithm.

weight_attributestr (default="weight")

The edge dictionary key that contains the weight of the edge.

Returns:

Embeddings

Raises:

beartype.roar.BeartypeCallHintParamViolation if parameters do not match type hints
ValueError if values are not within appropriate ranges or allowed values

Parameters:

graph (Graph | DiGraph) --
dimensions (int) --
elbow_cut (int | None) --
svd_solver_algorithm (typing_extensions.Literal[full, truncated, randomized, eigsh]) --
svd_solver_iterations (int) --
svd_seed (int | None) --
weight_attribute (str) --

Return type:

Embeddings