causica.data_generation.generate_data

Module Contents

Functions

get_graph_distribution(...)

Get a graph distribution from a graph type.

get_functional_relationship_sampler(...)

Get a functional relationship sampler from a function type.

get_noise_module_sampler(...)

Get a noise module sampler from a list of variables.

get_variable_type_dict(→ dict[str, ...)

Get a dictionary mapping variable groups to their types.

get_size_dict(→ dict[str, int])

Get a dictionary mapping variable groups to their sizes.

sample_treatment_and_effect(→ tuple[str, list[str]])

Sample a treatment and effects from a graph.

sample_dataset(...)

Sample a new dataset and return it as a CausalDataset object.

sample_intervention_dict(→ tensordict.TensorDict)

Sample an intervention from a given SEM.

sample_intervention(...)

Sample an intervention and its sample mean from a given SEM.

sample_counterfactual(...)

Sample a counterfactual from a given SEM.

get_variable_definitions(...)

Get the variable definitions.

generate_sem_sampler(...)

Generate a SEM sampler according to the given specifications.

plot_dataset(→ None)

Plot the data distribution.

generate_save_plot_synthetic_data(→ None)

Generate, save and plot synthetic data.

main(→ None)

Attributes

sns

causica.data_generation.generate_data.sns[source]
causica.data_generation.generate_data.get_graph_distribution(graph_type: str, num_nodes: int | None = None, num_edges: int | None = None, probs: float | None = None, graph_file: str | None = None, **storage_options) causica.distributions.adjacency.AdjacencyDistribution[source]

Get a graph distribution from a graph type.

Parameters:
graph_type: str

The type of graph to generate. Either “er” for Erdos-Renyi or “numpy” for a numpy graph.

num_nodes: int | None = None

The number of nodes in the graph. Not used if using a numpy graph.

num_edges: int | None = None

The number of edges in the graph. Not used if using a numpy graph.

probs: float | None = None

The probability that an edge exists between two nodes.

graph_file: str | None = None

The path to a graph file if using a numpy graph.

**storage_options

The storage options to pass to fsspec.

Returns:

The adjacency distribution of the graph.
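To illustrate what the “er” graph type and the probs parameter refer to, here is a minimal standard-library sketch of sampling an Erdos-Renyi style DAG. It is not the Causica implementation: the helper name sample_er_dag and the convention that adjacency[i][j] == 1 means an edge from node i to node j are assumptions made for this example.

```python
import random


def sample_er_dag(num_nodes: int, prob: float, rng: random.Random) -> list[list[int]]:
    """Sample a DAG adjacency matrix in which each admissible edge exists with probability `prob`."""
    # Draw a random node ordering. Edges only point from earlier to later
    # nodes in this ordering, so the result cannot contain a cycle.
    order = list(range(num_nodes))
    rng.shuffle(order)
    adjacency = [[0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            if rng.random() < prob:
                adjacency[order[i]][order[j]] = 1
    return adjacency


adjacency = sample_er_dag(num_nodes=5, prob=0.5, rng=random.Random(0))
for row in adjacency:
    print(row)
```

With prob=1.0 every admissible pair is connected, and with prob=0.0 the graph is empty.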

causica.data_generation.generate_data.get_functional_relationship_sampler(function_type: str, shapes_dict: dict[str, torch.Size]) causica.data_generation.samplers.functional_relationships_sampler.FunctionalRelationshipsSampler[source]

Get a functional relationship sampler from a function type.

Parameters:
function_type: str

The type of function to generate. Either “linear” or “rff”.

shapes_dict: dict[str, torch.Size]

A dictionary mapping variable names to their shapes.

Returns:

The functional relationship sampler.
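Assuming “rff” stands for random Fourier features, the sketch below shows the general technique of drawing a random smooth function as a weighted sum of cosines with random frequencies and phases. The helper make_rff_function, its parameters, and the chosen distributions are hypothetical and only illustrate the idea; they do not describe how the Causica sampler is implemented.

```python
import math
import random


def make_rff_function(input_dim, num_features=100, lengthscale=1.0, rng=None):
    """Return a random function f(x) = sqrt(2/M) * sum_m a_m * cos(w_m . x + b_m)."""
    rng = rng or random.Random()
    # Random frequencies, phases and output weights define one function draw.
    frequencies = [
        [rng.gauss(0.0, 1.0 / lengthscale) for _ in range(input_dim)]
        for _ in range(num_features)
    ]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    weights = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def function(x):
        total = 0.0
        for w, b, a in zip(frequencies, phases, weights):
            total += a * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
        return scale * total

    return function


f = make_rff_function(input_dim=2, num_features=50, rng=random.Random(0))
value = f([0.3, -1.2])
print(value)
```

A “linear” relationship would instead be a weighted sum of the parent values, with no cosine features.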

causica.data_generation.generate_data.get_noise_module_sampler(variables: causica.datasets.causica_dataset_format.VariablesMetadata) causica.data_generation.samplers.noise_dist_sampler.JointNoiseModuleSampler[source]

Get a noise module sampler from a list of variables.

Parameters:
variables: causica.datasets.causica_dataset_format.VariablesMetadata

Variable specifications to generate noise modules for.

Returns:

The noise module sampler.

causica.data_generation.generate_data.get_variable_type_dict(variables: causica.datasets.causica_dataset_format.VariablesMetadata) dict[str, causica.datasets.variable_types.VariableTypeEnum][source]

Get a dictionary mapping variable groups to their types.

This also ensures that every variable in a group has the same type.

Parameters:
variables: causica.datasets.causica_dataset_format.VariablesMetadata

Variable specifications to get the types for.

Returns:

The variable type dictionary.
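The consistency check described above can be sketched with plain dictionaries. This is an illustration only: the helper variable_type_dict, the use of dicts with "name", "group_name" and "type" keys, and the string type names are assumptions for this example rather than the actual VariablesMetadata API.

```python
def variable_type_dict(variables: list[dict[str, str]]) -> dict[str, str]:
    """Map each group name to its type, rejecting groups that mix types."""
    types: dict[str, str] = {}
    for variable in variables:
        group = variable["group_name"]
        var_type = variable["type"]
        # setdefault stores the first type seen for a group and returns the stored one.
        if types.setdefault(group, var_type) != var_type:
            raise ValueError(f"Group {group!r} mixes types {types[group]!r} and {var_type!r}")
    return types


example_variables = [
    {"name": "x_0", "group_name": "x", "type": "continuous"},
    {"name": "x_1", "group_name": "x", "type": "continuous"},
    {"name": "y", "group_name": "y", "type": "binary"},
]
type_dict = variable_type_dict(example_variables)
print(type_dict)
```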

causica.data_generation.generate_data.get_size_dict(variables: causica.datasets.causica_dataset_format.VariablesMetadata) dict[str, int][source]

Get a dictionary mapping variable groups to their sizes.

Parameters:
variables: causica.datasets.causica_dataset_format.VariablesMetadata

Variable specifications to get the sizes for.

Returns:

The size dictionary.
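As a rough sketch, and assuming that the size of a group is the number of variables that belong to it, the mapping can be computed by counting group names. The helper variable_size_dict and the dict-based variable records are hypothetical stand-ins for the real metadata objects.

```python
from collections import Counter


def variable_size_dict(variables: list[dict[str, str]]) -> dict[str, int]:
    """Count how many variables belong to each group."""
    return dict(Counter(variable["group_name"] for variable in variables))


example_variables = [
    {"name": "x_0", "group_name": "x", "type": "continuous"},
    {"name": "x_1", "group_name": "x", "type": "continuous"},
    {"name": "y", "group_name": "y", "type": "binary"},
]
size_dict = variable_size_dict(example_variables)
print(size_dict)
```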

causica.data_generation.generate_data.sample_treatment_and_effect(graph: torch.Tensor, node_names: collections.abc.Sequence[str], ensure_effect: bool = True, num_effects: int = 1) tuple[str, list[str]][source]

Sample a treatment and effects from a graph.

Parameters:
graph: torch.Tensor

The adjacency matrix of the graph.

node_names: collections.abc.Sequence[str]

The names of the nodes in the graph.

ensure_effect: bool = True

Whether to ensure that there is a path from the treatment to the effect.

num_effects: int = 1

The number of effect nodes to sample.

Returns:

The treatment and effects.
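One way to guarantee a directed path from the treatment to every effect is to sample the effects from the descendants of the treatment. The sketch below does this with plain Python lists. The helper names, the edge convention adjacency[i][j] == 1 for an edge from i to j, and the error raised when no valid treatment exists are assumptions of this example, not documented behaviour of the library function.

```python
import random


def descendants(adjacency, node):
    """Return the indices of all nodes reachable from `node` along directed edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child, has_edge in enumerate(adjacency[current]):
            if has_edge and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def sample_treatment_and_effects(adjacency, node_names, num_effects=1, rng=None):
    """Pick a treatment with enough descendants, then sample effects among them."""
    rng = rng or random.Random()
    candidates = [
        i for i in range(len(node_names))
        if len(descendants(adjacency, i)) >= num_effects
    ]
    if not candidates:
        raise ValueError("No node has enough descendants to act as a treatment.")
    treatment = rng.choice(candidates)
    effects = rng.sample(sorted(descendants(adjacency, treatment)), num_effects)
    return node_names[treatment], [node_names[e] for e in effects]


# Chain graph a -> b -> c.
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
treatment, effects = sample_treatment_and_effects(chain, ["a", "b", "c"], rng=random.Random(0))
print(treatment, effects)
```

In the chain graph, "c" has no descendants, so it can never be chosen as the treatment.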

causica.data_generation.generate_data.sample_dataset(sem: causica.sem.structural_equation_model.SEM, sample_dataset_size: torch.Size, num_interventions: int = 0, num_intervention_samples: int = 1000, sample_interventions: bool = False, sample_counterfactuals: bool = False) causica.datasets.causal_dataset.CausalDataset[source]

Sample a new dataset and return it as a CausalDataset object.

Parameters:
sem: causica.sem.structural_equation_model.SEM

The SEM to sample from.

sample_dataset_size: torch.Size

The size of the dataset to sample from the SEM.

num_interventions: int = 0

The number of interventions to sample per dataset. If 0, no interventions are sampled.

num_intervention_samples: int = 1000

The number of interventional samples to sample.

sample_interventions: bool = False

Whether to sample interventions.

sample_counterfactuals: bool = False

Whether to sample counterfactuals.

Returns:

A CausalDataset object holding the data, graph and potential interventions and counterfactuals.

causica.data_generation.generate_data.sample_intervention_dict(tensordict_data: tensordict.TensorDict, treatment: str | None = None) tensordict.TensorDict[source]

Sample an intervention from a given SEM.

This samples a random value for the treatment variable from the data. The value is sampled uniformly from the range of the treatment variable in the data.

The treatment variable is chosen randomly across all nodes if not specified.

Parameters:
tensordict_data: tensordict.TensorDict

Base data for sampling an intervention value.

treatment: str | None = None

The name of the treatment variable. If None, a random variable is chosen across the tensordict keys.

Returns:

A TensorDict holding the intervention value.
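The sampling rule described above (a random treatment variable, then a value drawn uniformly between the smallest and largest observed value of that variable) can be sketched as follows. A plain dict of lists stands in for the TensorDict, and the helper name sample_intervention_value is hypothetical.

```python
import random


def sample_intervention_value(data, treatment=None, rng=None):
    """Return {treatment: value} with value drawn uniformly from the observed range."""
    rng = rng or random.Random()
    if treatment is None:
        # Choose the treatment uniformly at random across all variables.
        treatment = rng.choice(sorted(data))
    values = data[treatment]
    return {treatment: rng.uniform(min(values), max(values))}


data = {"x": [0.0, 1.0, 2.0], "y": [10.0, 12.0]}
intervention = sample_intervention_value(data, rng=random.Random(0))
print(intervention)
```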

causica.data_generation.generate_data.sample_intervention(sem: causica.sem.structural_equation_model.SEM, tensordict_data: tensordict.TensorDict, num_intervention_samples: int, treatment: str | None = None) causica.datasets.interventional_data.InterventionData[source]

Sample an intervention and its sample mean from a given SEM.

Parameters:
sem: causica.sem.structural_equation_model.SEM

SEM to sample interventional data from.

tensordict_data: tensordict.TensorDict

Base data for sampling an intervention value.

num_intervention_samples: int

The number of samples to draw from the interventional distribution.

treatment: str | None = None

The name of the treatment variable. If None, a random variable is chosen.

Returns:

An intervention data object.

causica.data_generation.generate_data.sample_counterfactual(sem: causica.sem.structural_equation_model.SEM, factual_data: tensordict.TensorDict, noise: tensordict.TensorDict, treatment: str | None = None) causica.datasets.interventional_data.CounterfactualData[source]

Sample a counterfactual from a given SEM.

Parameters:
sem: causica.sem.structural_equation_model.SEM

SEM to sample counterfactual data from.

factual_data: tensordict.TensorDict

Base data for sampling a counterfactual value.

noise: tensordict.TensorDict

Base noise for sampling a counterfactual value.

treatment: str | None = None

The name of the treatment variable. If None, a random variable is chosen.

Returns:

A counterfactual data object.
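The role of the noise argument can be seen in a toy example: a counterfactual reuses the noise that produced the factual sample and only changes the intervened variable. The two-variable linear SEM below (x = n_x, y = 2x + n_y) and the helper structural_equations are made up for illustration and are unrelated to the SEM class used by the library.

```python
def structural_equations(noise, intervention=None):
    """Evaluate the toy SEM x = n_x, y = 2 * x + n_y, overriding intervened variables."""
    intervention = intervention or {}
    x = intervention.get("x", noise["x"])
    y = intervention.get("y", 2.0 * x + noise["y"])
    return {"x": x, "y": y}


noise = {"x": 1.0, "y": 0.5}

# Factual sample: no intervention, so both variables follow their equations.
factual = structural_equations(noise)

# Counterfactual sample: same noise, but x is forced to 3.0.
counterfactual = structural_equations(noise, intervention={"x": 3.0})

print(factual)         # {'x': 1.0, 'y': 2.5}
print(counterfactual)  # {'x': 3.0, 'y': 6.5}
```

Because the noise on y is held fixed at 0.5, the change in y is caused entirely by the change in x.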

causica.data_generation.generate_data.get_variable_definitions(variable_json_path: str | None = None, num_nodes: int | None = None, storage_options: dict[str, str] | None = None) causica.datasets.causica_dataset_format.VariablesMetadata[source]

Get the variable definitions.

Parameters:
variable_json_path: str | None = None

The path to a json file containing the variables. This can be any fsspec compatible url.

num_nodes: int | None = None

The number of nodes in the graph. Not used if using a numpy graph.

storage_options: dict[str, str] | None = None

The storage options to pass to fsspec.

Returns:

The variables.

causica.data_generation.generate_data.generate_sem_sampler(variables: causica.datasets.causica_dataset_format.VariablesMetadata, graph_file: str | None, num_edges: int | None, graph_type: str, function_type: str, probs: float | None = None, storage_options: dict[str, str] | None = None) causica.data_generation.samplers.sem_sampler.SEMSampler[source]

Generate a SEM sampler according to the given specifications.

Parameters:
variables: causica.datasets.causica_dataset_format.VariablesMetadata

Variable specifications to build the SEM sampler for.

graph_file: str | None

The path to a graph file if using a numpy graph. This can be any fsspec compatible url.

num_edges: int | None

The number of edges in the graph. Not used if using a numpy graph.

probs: float | None = None

The probability that an edge exists between two nodes.

graph_type: str

The type of graph to generate. Either “er” for Erdos-Renyi or “numpy” for a numpy graph.

function_type: str

The type of function to generate. Either “linear” or “rff”.

storage_options: dict[str, str] | None = None

The storage options to pass to fsspec.

Returns:

The SEM sampler.

causica.data_generation.generate_data.plot_dataset(data: tensordict.TensorDict, variables: causica.datasets.causica_dataset_format.VariablesMetadata, plot_kind: str = 'kde', plot_num: int = 10, datadir: str = '', storage_options: dict[str, str] | None = None) None[source]

Plot the data distribution.

Parameters:
data: tensordict.TensorDict

The data to plot.

variables: causica.datasets.causica_dataset_format.VariablesMetadata

Variable specifications.

plot_kind: str = 'kde'

Type of joint plot to create.

plot_num: int = 10

Maximum number of variables to plot.

datadir: str = ''

The directory to save the dataset to. This can be any fsspec compatible url.

storage_options: dict[str, str] | None = None

The storage options to pass to fsspec.

causica.data_generation.generate_data.generate_save_plot_synthetic_data(graph_type: str, function_type: str, datadir: str, num_samples_train: int, num_samples_test: int, num_interventions: int, variable_json_path: str | None = None, num_nodes: int | None = None, graph_file: str | None = None, num_edges: int | None = None, probs: float | None = None, overwrite: bool = False, plot_kind: str = '', plot_num: int = 10, storage_options: dict[str, str] | None = None) None[source]

Generate, save and plot synthetic data.

This will sample a SEM from the given specifications, generate data from it, and save it to disk following the Causica dataset format. It optionally plots the joint distribution over the variables.

Parameters:
graph_type: str

The type of graph to generate. Either “er” for Erdos-Renyi or “numpy” for a numpy graph.

function_type: str

The type of function to generate. Either “linear” or “rff”.

datadir: str

The directory to save the dataset to. This can be any fsspec compatible url.

num_samples_train: int

The number of training samples to generate.

num_samples_test: int

The number of validation and test samples to generate.

num_interventions: int

The number of interventions and counterfactuals to generate.

variable_json_path: str | None = None

The path to a json file containing the variables. This can be any fsspec compatible url.

num_nodes: int | None = None

The number of nodes in the graph. Not used if using a numpy graph.

graph_file: str | None = None

The path to a graph file if using a numpy graph. This can be any fsspec compatible url.

num_edges: int | None = None

The number of edges in the graph. Not used if using a numpy graph.

probs: float | None = None

The probability that an edge exists between two nodes.

overwrite: bool = False

Whether to overwrite the dataset if it already exists.

plot_kind: str = ''

Type of joint plot to create.

plot_num: int = 10

Maximum number of variables to plot.

storage_options: dict[str, str] | None = None

The storage options to pass to fsspec.

causica.data_generation.generate_data.main() None[source]