`causica.data_generation.generate_data`¶

Module Contents¶

Functions¶

`get_graph_distribution`(...)	Get a graph from a graph type.
`get_functional_relationship_sampler`(...)	Get a functional relationship from a function type.
`get_noise_module_sampler`(...)	Get a noise module sampler from a list of variables.
`get_variable_type_dict`(→ dict[str, ...)	Get a dictionary mapping variable groups to their types.
`get_size_dict`(→ dict[str, int])	Get a dictionary mapping variable groups to their sizes.
`sample_treatment_and_effect`(→ tuple[str, list[str]])	Sample a treatment and effects from a graph.
`sample_dataset`(...)	Sample a new dataset and return it as a CausalDataset object.
`sample_intervention_dict`(→ tensordict.TensorDict)	Sample an intervention from a given SEM.
`sample_intervention`(...)	Sample an intervention and its sample mean from a given SEM.
`sample_counterfactual`(...)	Sample a counterfactual from a given SEM.
`get_variable_definitions`(...)	Get the variable definitions.
`generate_sem_sampler`(...)	Generate a SEM sampler according to the given specifications.
`plot_dataset`(→ None)	Plot the data distribution.
`generate_save_plot_synthetic_data`(→ None)	Generate, save and plot synthetic data.
`main`(→ None)

Attributes¶

`sns`

causica.data_generation.generate_data.sns[source]¶

causica.data_generation.generate_data.get_graph_distribution(graph_type: str, num_nodes: int | None = None, num_edges: int | None = None, probs: float | None = None, graph_file: str | None = None, **storage_options) → causica.distributions.adjacency.AdjacencyDistribution[source]¶

Get a graph from a graph type.

Parameters:¶

graph_type: str¶: The type of graph to generate. Either “er” for Erdos-Renyi or “numpy” for a numpy graph.
num_nodes: int | None = None¶: The number of nodes in the graph. Not used if using a numpy graph.
num_edges: int | None = None¶: The number of edges in the graph. Not used if using a numpy graph.
probs: float | None = None¶: The probability that an edge exists between two nodes.
graph_file: str | None = None¶: The path to a graph file if using a numpy graph.
**storage_options¶: The storage options to pass to fsspec.

Returns:¶

The adjacency distribution for the requested graph type.
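To make the “er” option concrete, here is a minimal sketch of Erdos-Renyi sampling of a directed acyclic graph using only the standard library. It is not the library's implementation: the function name `sample_er_dag` and its `prob` and `seed` parameters are hypothetical, and the sketch returns a plain adjacency matrix rather than an `AdjacencyDistribution`.

```python
import random


def sample_er_dag(num_nodes, prob, seed=0):
    """Sample an adjacency matrix where adj[i][j] == 1 means an edge i -> j.

    Hypothetical helper for illustration only, not part of causica.
    """
    rng = random.Random(seed)
    # A random node ordering guarantees acyclicity:
    # edges are only allowed from earlier to later nodes in the ordering.
    order = list(range(num_nodes))
    rng.shuffle(order)
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for pos, src in enumerate(order):
        for dst in order[pos + 1:]:
            # Each admissible edge is included independently with probability `prob`.
            if rng.random() < prob:
                adj[src][dst] = 1
    return adj


graph = sample_er_dag(num_nodes=5, prob=0.5)
```

With `prob=1.0` every admissible edge is present, so a graph on n nodes has n(n-1)/2 edges; with `prob=0.0` the graph is empty.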

causica.data_generation.generate_data.get_functional_relationship_sampler(function_type: str, shapes_dict: dict[str, torch.Size]) → causica.data_generation.samplers.functional_relationships_sampler.FunctionalRelationshipsSampler[source]¶

Get a functional relationship from a function type.

Parameters:¶

function_type: str¶: The type of function to generate. Either “linear” or “rff”.
shapes_dict: dict[str, torch.Size]¶: A dictionary mapping variable names to their shapes.

Returns:¶

The functional relationship.

causica.data_generation.generate_data.get_noise_module_sampler(variables: causica.datasets.causica_dataset_format.VariablesMetadata) → causica.data_generation.samplers.noise_dist_sampler.JointNoiseModuleSampler[source]¶

Get a noise module sampler from a list of variables.

Parameters:¶

variables: causica.datasets.causica_dataset_format.VariablesMetadata¶: Variable specifications to generate noise modules for.

Returns:¶

The noise module sampler.

causica.data_generation.generate_data.get_variable_type_dict(variables: causica.datasets.causica_dataset_format.VariablesMetadata) → dict[str, causica.datasets.variable_types.VariableTypeEnum][source]¶

Get a dictionary mapping variable groups to their types.

This also ensures that every variable in a group has the same type.

Parameters:¶

variables: causica.datasets.causica_dataset_format.VariablesMetadata¶: Variable specifications to get the types for.

Returns:¶

The variable type dictionary.
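The consistency check described above can be sketched in plain Python. This is an illustration under assumptions, not the library's code: variables are represented here as hypothetical `(group_name, variable_type)` pairs with string types instead of `VariablesMetadata` and `VariableTypeEnum`.

```python
def group_types(variables):
    """Map each group name to its type, rejecting groups with mixed types.

    `variables` is an iterable of (group_name, variable_type) pairs,
    a simplified stand-in for the real metadata objects.
    """
    types = {}
    for group, var_type in variables:
        # setdefault stores the first type seen for a group and returns the stored one.
        if types.setdefault(group, var_type) != var_type:
            raise ValueError(
                f"Group {group!r} mixes types {types[group]!r} and {var_type!r}."
            )
    return types


type_dict = group_types([("x", "continuous"), ("x", "continuous"), ("y", "binary")])
```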

causica.data_generation.generate_data.get_size_dict(variables: causica.datasets.causica_dataset_format.VariablesMetadata) → dict[str, int][source]¶

Get a dictionary mapping variable groups to their sizes.

Parameters:¶

variables: causica.datasets.causica_dataset_format.VariablesMetadata¶: Variable specifications to get the sizes for.

Returns:¶

The size dictionary.
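As a rough illustration, and assuming that a group's size is simply the number of variables listed under that group (the source does not spell this out), the mapping could be computed as follows. The helper name `group_sizes` is hypothetical.

```python
from collections import Counter


def group_sizes(group_names):
    """Count how many variables belong to each group.

    `group_names` holds one group name per variable; this assumes
    size == number of variables in the group, which is an assumption.
    """
    return dict(Counter(group_names))


size_dict = group_sizes(["x", "x", "y"])
```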

causica.data_generation.generate_data.sample_treatment_and_effect(graph: torch.Tensor, node_names: collections.abc.Sequence[str], ensure_effect: bool = True, num_effects: int = 1) → tuple[str, list[str]][source]¶

Sample a treatment and effects from a graph.

Parameters:¶

graph: torch.Tensor¶: The adjacency matrix of the graph.
node_names: collections.abc.Sequence[str]¶: The names of the nodes in the graph.
ensure_effect: bool = True¶: Whether to ensure that there is a path from the treatment to the effect.
num_effects: int = 1¶: The number of effect nodes to sample.

Returns:¶

The treatment and effects.
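One way to satisfy the `ensure_effect` requirement is to restrict effects to descendants of the treatment. The sketch below shows that idea on a nested-list adjacency matrix; it assumes the graph is acyclic and that `adj[i][j] == 1` means an edge from node i to node j. The helper names are hypothetical, and the library's own selection strategy may differ.

```python
import random


def descendants(adj, source):
    """Return the set of nodes reachable from `source` along directed edges."""
    seen = set()
    stack = [source]
    while stack:
        node = stack.pop()
        for child, has_edge in enumerate(adj[node]):
            if has_edge and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def pick_treatment_and_effects(adj, node_names, num_effects=1, seed=0):
    """Pick a treatment with enough descendants, then pick effects among them."""
    rng = random.Random(seed)
    candidates = [
        i for i in range(len(node_names)) if len(descendants(adj, i)) >= num_effects
    ]
    if not candidates:
        raise ValueError("No node has enough descendants.")
    treatment = rng.choice(candidates)
    effects = rng.sample(sorted(descendants(adj, treatment)), num_effects)
    return node_names[treatment], [node_names[i] for i in effects]


# Chain a -> b -> c: only "a" has two descendants.
chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
treatment, effects = pick_treatment_and_effects(chain, ["a", "b", "c"], num_effects=2)
```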

causica.data_generation.generate_data.sample_dataset(sem: causica.sem.structural_equation_model.SEM, sample_dataset_size: torch.Size, num_interventions: int = 0, num_intervention_samples: int = 1000, sample_interventions: bool = False, sample_counterfactuals: bool = False) → causica.datasets.causal_dataset.CausalDataset[source]¶

Sample a new dataset and return it as a CausalDataset object.

Parameters:¶

sem: causica.sem.structural_equation_model.SEM¶: The SEM to sample from.
sample_dataset_size: torch.Size¶: The size of the dataset to sample from the SEM.
num_interventions: int = 0¶: The number of interventions to sample per dataset. If 0, no interventions are sampled.
num_intervention_samples: int = 1000¶: The number of interventional samples to sample.
sample_interventions: bool = False¶: Whether to sample interventions.
sample_counterfactuals: bool = False¶: Whether to sample counterfactuals.

Returns:¶

A CausalDataset object holding the data, graph and potential interventions and counterfactuals.

causica.data_generation.generate_data.sample_intervention_dict(tensordict_data: tensordict.TensorDict, treatment: str | None = None) → tensordict.TensorDict[source]¶

Sample an intervention from a given SEM.

This samples a random value for the treatment variable from the data. The value is sampled uniformly from the range of the treatment variable in the data.

The treatment variable is chosen randomly across all nodes if not specified.

Parameters:¶

tensordict_data: tensordict.TensorDict¶: Base data for sampling an intervention value.
treatment: str | None = None¶: The name of the treatment variable. If None, a random variable is chosen across the tensordict keys.

Returns:¶

A TensorDict holding the intervention value.
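The sampling rule described above (a random treatment if none is given, then a value drawn uniformly between the observed minimum and maximum) can be sketched with plain dictionaries. This is a simplified stand-in: it uses a dict of scalar lists instead of a `TensorDict`, and the function name and `seed` parameter are hypothetical.

```python
import random


def sample_intervention_value(data, treatment=None, seed=0):
    """Return {treatment: value} with value uniform over the observed range.

    `data` maps variable names to lists of observed scalar values.
    """
    rng = random.Random(seed)
    if treatment is None:
        # No treatment specified: choose one at random among the available keys.
        treatment = rng.choice(sorted(data))
    low, high = min(data[treatment]), max(data[treatment])
    return {treatment: rng.uniform(low, high)}


observations = {"x": [0.0, 1.0, 2.0], "y": [10.0, 20.0]}
intervention = sample_intervention_value(observations, treatment="y")
```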

causica.data_generation.generate_data.sample_intervention(sem: causica.sem.structural_equation_model.SEM, tensordict_data: tensordict.TensorDict, num_intervention_samples: int, treatment: str | None = None) → causica.datasets.interventional_data.InterventionData[source]¶

Sample an intervention and its sample mean from a given SEM.

Parameters:¶

sem: causica.sem.structural_equation_model.SEM¶: SEM to sample interventional data from.
tensordict_data: tensordict.TensorDict¶: Base data for sampling an intervention value.
num_intervention_samples: int¶: The number of samples to draw from the interventional distribution.
treatment: str | None = None¶: The name of the treatment variable. If None, a random variable is chosen.

Returns:¶

An intervention data object.

causica.data_generation.generate_data.sample_counterfactual(sem: causica.sem.structural_equation_model.SEM, factual_data: tensordict.TensorDict, noise: tensordict.TensorDict, treatment: str | None = None) → causica.datasets.interventional_data.CounterfactualData[source]¶

Sample a counterfactual from a given SEM.

Parameters:¶

sem: causica.sem.structural_equation_model.SEM¶: SEM to sample counterfactual data from.
factual_data: tensordict.TensorDict¶: Base data for sampling a counterfactual value.
noise: tensordict.TensorDict¶: Base noise for sampling a counterfactual value.
treatment: str | None = None¶: The name of the treatment variable. If None, a random variable is chosen.

Returns:¶

A counterfactual data object.
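The role of the `noise` argument can be illustrated with a toy two-variable SEM: a counterfactual reuses the noise that produced the factual sample and changes only the intervened variable. The sketch below assumes a hypothetical linear model `x = n_x`, `y = weight * x + n_y`; it is not how the library represents SEMs.

```python
def simulate(noise, intervention=None, weight=2.0):
    """Evaluate the toy SEM x = n_x, y = weight * x + n_y.

    Variables named in `intervention` are fixed to the given values
    instead of being computed from their parents and noise.
    """
    intervention = intervention or {}
    x = intervention.get("x", noise["x"])
    y = intervention.get("y", weight * x + noise["y"])
    return {"x": x, "y": y}


noise = {"x": 1.0, "y": 0.5}
factual = simulate(noise)
# Same noise, but x is set to 3.0: this is the counterfactual outcome.
counterfactual = simulate(noise, intervention={"x": 3.0})
```

Here the factual sample is `x = 1.0, y = 2.5`, and the counterfactual under `x = 3.0` is `y = 6.5`, because the noise term `0.5` on `y` is carried over unchanged.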

causica.data_generation.generate_data.get_variable_definitions(variable_json_path: str | None = None, num_nodes: int | None = None, storage_options: dict[str, str] | None = None) → causica.datasets.causica_dataset_format.VariablesMetadata[source]¶

Get the variable definitions.

Parameters:¶

variable_json_path: str | None = None¶: The path to a json file containing the variables. This can be any fsspec compatible url.
num_nodes: int | None = None¶: The number of nodes in the graph. Not used if using a numpy graph.
storage_options: dict[str, str] | None = None¶: The storage options to pass to fsspec.

Returns:¶

The variables.

causica.data_generation.generate_data.generate_sem_sampler(variables: causica.datasets.causica_dataset_format.VariablesMetadata, graph_file: str | None, num_edges: int | None, graph_type: str, function_type: str, probs: float | None = None, storage_options: dict[str, str] | None = None) → causica.data_generation.samplers.sem_sampler.SEMSampler[source]¶

Generate a SEM sampler according to the given specifications.

Parameters:¶

variables: causica.datasets.causica_dataset_format.VariablesMetadata¶: Variable specifications.
graph_file: str | None¶: The path to a graph file if using a numpy graph. This can be any fsspec compatible url.
num_edges: int | None¶: The number of edges in the graph. Not used if using a numpy graph.
probs: float | None = None¶: The probability that an edge exists between two nodes.
graph_type: str¶: The type of graph to generate. Either “er” for Erdos-Renyi or “numpy” for a numpy graph.
function_type: str¶: The type of function to generate. Either “linear” or “rff”.
storage_options: dict[str, str] | None = None¶: The storage options to pass to fsspec.

Returns:¶

The SEM sampler.

causica.data_generation.generate_data.plot_dataset(data: tensordict.TensorDict, variables: causica.datasets.causica_dataset_format.VariablesMetadata, plot_kind: str = 'kde', plot_num: int = 10, datadir: str = '', storage_options: dict[str, str] | None = None) → None[source]¶

Plot the data distribution.

Parameters:¶

data: tensordict.TensorDict¶: The data to plot.
variables: causica.datasets.causica_dataset_format.VariablesMetadata¶: Variable specifications.
plot_kind: str = 'kde'¶: Type of joint plot to create.
plot_num: int = 10¶: Maximum number of variables to plot.
datadir: str = ''¶: The directory to save the dataset to. This can be any fsspec compatible url.
storage_options: dict[str, str] | None = None¶: The storage options to pass to fsspec.

causica.data_generation.generate_data.generate_save_plot_synthetic_data(graph_type: str, function_type: str, datadir: str, num_samples_train: int, num_samples_test: int, num_interventions: int, variable_json_path: str | None = None, num_nodes: int | None = None, graph_file: str | None = None, num_edges: int | None = None, probs: float | None = None, overwrite: bool = False, plot_kind: str = '', plot_num: int = 10, storage_options: dict[str, str] | None = None) → None[source]¶

Generate, save and plot synthetic data.

This will sample a SEM from the given specifications, generate data from it, and save it to disk following the Causica dataset format. It optionally plots the joint distribution over the variables.

Parameters:¶

graph_type: str¶: The type of graph to generate. Either “er” for Erdos-Renyi or “numpy” for a numpy graph.
function_type: str¶: The type of function to generate. Either “linear” or “rff”.
datadir: str¶: The directory to save the dataset to. This can be any fsspec compatible url.
num_samples_train: int¶: The number of training samples to generate.
num_samples_test: int¶: The number of validation and test samples to generate.
num_interventions: int¶: The number of interventions and counterfactuals to generate.
variable_json_path: str | None = None¶: The path to a json file containing the variables. This can be any fsspec compatible url.
num_nodes: int | None = None¶: The number of nodes in the graph. Not used if using a numpy graph.
graph_file: str | None = None¶: The path to a graph file if using a numpy graph. This can be any fsspec compatible url.
num_edges: int | None = None¶: The number of edges in the graph. Not used if using a numpy graph.
probs: float | None = None¶: The probability that an edge exists between two nodes.
overwrite: bool = False¶: Whether to overwrite the dataset if it already exists.
plot_kind: str = ''¶: Type of joint plot to create.
plot_num: int = 10¶: Maximum number of variables to plot.
storage_options: dict[str, str] | None = None¶: The storage options to pass to fsspec.

causica.data_generation.generate_data.main() → None[source]¶
