causica.data_generation.generate_data
Module Contents
Functions
- get_graph_distribution: Get a graph from a graph type.
- get_functional_relationship_sampler: Get a functional relationship from a function type.
- get_noise_module_sampler: Get a noise module sampler from a list of variables.
- get_variable_type_dict: Get a dictionary mapping variable groups to their types.
- get_size_dict: Get a dictionary mapping variable groups to their sizes.
- sample_treatment_and_effect: Sample a treatment and effects from a graph.
- sample_dataset: Sample a new dataset and return it as a CausalDataset object.
- sample_intervention_dict: Sample an intervention from a given SEM.
- sample_intervention: Sample an intervention and its sample mean from a given SEM.
- sample_counterfactual: Sample a counterfactual from a given SEM.
- get_variable_definitions: Get the variable definitions.
- generate_sem_sampler: Generate a SEM sampler according to the specifications.
- plot_dataset: Plot the data distribution.
- generate_save_plot_synthetic_data: Generate, save and plot synthetic data.
- causica.data_generation.generate_data.get_graph_distribution(graph_type: str, num_nodes: int | None = None, num_edges: int | None = None, probs: float | None = None, graph_file: str | None = None, **storage_options) -> causica.distributions.adjacency.AdjacencyDistribution
Get a graph from a graph type.
- Parameters:
- graph_type: str
The type of graph to generate. Either “er” for Erdos-Renyi or “numpy” for a numpy graph.
- num_nodes: int | None = None
The number of nodes in the graph. Not used if using a numpy graph.
- num_edges: int | None = None
The number of edges in the graph. Not used if using a numpy graph.
- probs: float | None = None
The probability that an edge exists between two nodes.
- graph_file: str | None = None
The path to a graph file if using a numpy graph.
- **storage_options
The storage options to pass to fsspec.
- Returns:
The adjacency distribution of the graph.
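To make the role of `probs` concrete, the following is a minimal sketch of Erdos-Renyi style DAG sampling in plain Python. It is not causica's implementation: the helper `sample_er_dag` is hypothetical, and the random node ordering used to guarantee acyclicity is an assumption made for illustration.

```python
import random


def sample_er_dag(num_nodes: int, probs: float, rng: random.Random) -> list:
    """Sample a DAG adjacency matrix where adj[i][j] == 1 means an edge i -> j."""
    # Assumption: a random node ordering guarantees acyclicity,
    # because edges only ever point forward in that ordering.
    order = list(range(num_nodes))
    rng.shuffle(order)
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for pos, source in enumerate(order):
        for target in order[pos + 1:]:
            # Each admissible edge is included independently with probability `probs`.
            if rng.random() < probs:
                adj[source][target] = 1
    return adj


graph = sample_er_dag(num_nodes=5, probs=0.5, rng=random.Random(0))
num_edges = sum(map(sum, graph))  # number of edges actually drawn
```

With `probs=0.0` the sketch yields an empty graph, and with `probs=1.0` it yields a fully connected DAG with `num_nodes * (num_nodes - 1) / 2` edges.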
- causica.data_generation.generate_data.get_functional_relationship_sampler(function_type: str, shapes_dict: dict[str, torch.Size]) -> causica.data_generation.samplers.functional_relationships_sampler.FunctionalRelationshipsSampler
Get a functional relationship from a function type.
- causica.data_generation.generate_data.get_noise_module_sampler(variables: causica.datasets.causica_dataset_format.VariablesMetadata) -> causica.data_generation.samplers.noise_dist_sampler.JointNoiseModuleSampler
Get a noise module sampler from a list of variables.
- Parameters:
- variables: causica.datasets.causica_dataset_format.VariablesMetadata
Variable specifications to generate noise modules for.
- Returns:
The noise module sampler.
- causica.data_generation.generate_data.get_variable_type_dict(variables: causica.datasets.causica_dataset_format.VariablesMetadata) -> dict[str, causica.datasets.variable_types.VariableTypeEnum]
Get a dictionary mapping variable groups to their types.
This also ensures that every variable in a group has the same type.
- Parameters:
- variables: causica.datasets.causica_dataset_format.VariablesMetadata
Variable specifications to get the types for.
- Returns:
The variable type dictionary.
- causica.data_generation.generate_data.get_size_dict(variables: causica.datasets.causica_dataset_format.VariablesMetadata) -> dict[str, int]
Get a dictionary mapping variable groups to their sizes.
- Parameters:
- variables: causica.datasets.causica_dataset_format.VariablesMetadata
Variable specifications to get the sizes for.
- Returns:
The size dictionary.
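The two group-level dictionaries described above can be illustrated with a small, self-contained sketch. The `Variable` dataclass is a hypothetical stand-in for one entry of the variable specifications, the string type labels replace the real enum, and treating a group's size as the number of variables in it is an assumption, not something the reference confirms.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    """Hypothetical stand-in for one entry of the variable specifications."""
    group_name: str
    name: str
    type: str


def group_types(variables: list) -> dict:
    """Map each group to its type, raising if a group mixes types."""
    types = {}
    for var in variables:
        known = types.setdefault(var.group_name, var.type)
        if known != var.type:
            raise ValueError(
                f"Group {var.group_name!r} mixes types {known!r} and {var.type!r}"
            )
    return types


def group_sizes(variables: list) -> dict:
    """Map each group to the number of variables it contains (assumed meaning of size)."""
    sizes = {}
    for var in variables:
        sizes[var.group_name] = sizes.get(var.group_name, 0) + 1
    return sizes


variables = [
    Variable("position", "x", "continuous"),
    Variable("position", "y", "continuous"),
    Variable("smoker", "smoker", "binary"),
]
type_dict = group_types(variables)  # {"position": "continuous", "smoker": "binary"}
size_dict = group_sizes(variables)  # {"position": 2, "smoker": 1}
```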
- causica.data_generation.generate_data.sample_treatment_and_effect(graph: torch.Tensor, node_names: collections.abc.Sequence[str], ensure_effect: bool = True, num_effects: int = 1) -> tuple[str, list[str]]
Sample a treatment and effects from a graph.
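The reference does not spell out how the treatment and effects are chosen, so the following is only one plausible reading, sketched in plain Python. It assumes that `ensure_effect=True` means every effect must be a descendant of the treatment; the helpers `descendants` and `pick_treatment_and_effects` are hypothetical and use nested lists instead of a `torch.Tensor`.

```python
import random


def descendants(graph, node):
    """Return all nodes reachable from `node`, where graph[i][j] == 1 means i -> j."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child, has_edge in enumerate(graph[current]):
            if has_edge and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def pick_treatment_and_effects(graph, node_names, ensure_effect=True, num_effects=1, rng=None):
    """Pick a treatment name and a list of effect names from an adjacency matrix."""
    rng = rng or random.Random()
    nodes = range(len(node_names))
    if ensure_effect:
        # Assumption: only nodes with enough descendants may act as treatment.
        candidates = [n for n in nodes if len(descendants(graph, n)) >= num_effects]
        if not candidates:
            raise ValueError("No node has enough descendants.")
        treatment = rng.choice(candidates)
        pool = sorted(descendants(graph, treatment))
    else:
        # Without the guarantee, any other node may be reported as an effect.
        treatment = rng.choice(list(nodes))
        pool = [n for n in nodes if n != treatment]
    effects = rng.sample(pool, num_effects)
    return node_names[treatment], [node_names[e] for e in effects]


# Chain graph a -> b -> c: only "a" has two descendants.
chain = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
treatment, effects = pick_treatment_and_effects(
    chain, ["a", "b", "c"], num_effects=2, rng=random.Random(0)
)
```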
- causica.data_generation.generate_data.sample_dataset(sem: causica.sem.structural_equation_model.SEM, sample_dataset_size: torch.Size, num_interventions: int = 0, num_intervention_samples: int = 1000, sample_interventions: bool = False, sample_counterfactuals: bool = False) -> causica.datasets.causal_dataset.CausalDataset
Sample a new dataset and return it as a CausalDataset object.
- Parameters:
- sem: causica.sem.structural_equation_model.SEM
The SEM to sample from.
- sample_dataset_size: torch.Size
The size of the dataset to sample from the SEM.
- num_interventions: int = 0
The number of interventions to sample per dataset. If 0, no interventions are sampled.
- num_intervention_samples: int = 1000
The number of interventional samples to sample.
- sample_interventions: bool = False
Whether to sample interventions.
- sample_counterfactuals: bool = False
Whether to sample counterfactuals.
- Returns:
A CausalDataset object holding the data, graph and potential interventions and counterfactuals.
- causica.data_generation.generate_data.sample_intervention_dict(tensordict_data: tensordict.TensorDict, treatment: str | None = None) -> tensordict.TensorDict
Sample an intervention from a given SEM.
This samples a random value for the treatment variable from the data. The value is sampled uniformly from the range of the treatment variable in the data.
The treatment variable is chosen randomly across all nodes if not specified.
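The sampling rule described above (uniform over the observed range of the treatment, with a random treatment when none is given) can be sketched as follows. This is an illustration under simplifying assumptions: a plain `dict` of scalar columns stands in for the `TensorDict`, and the helper `sample_intervention_value` is hypothetical.

```python
import random
from typing import Optional


def sample_intervention_value(data: dict, treatment: Optional[str] = None, rng=None) -> dict:
    """Return {treatment: value} with value drawn uniformly from the observed range."""
    rng = rng or random.Random()
    if treatment is None:
        # No treatment specified: choose one at random across all variables.
        treatment = rng.choice(sorted(data))
    column = data[treatment]
    # Uniform draw between the smallest and largest observed value.
    value = rng.uniform(min(column), max(column))
    return {treatment: value}


data = {"x": [0.0, 1.0, 2.0], "y": [10.0, 20.0]}
fixed = sample_intervention_value(data, treatment="y", rng=random.Random(0))
free = sample_intervention_value(data, rng=random.Random(0))
```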
- causica.data_generation.generate_data.sample_intervention(sem: causica.sem.structural_equation_model.SEM, tensordict_data: tensordict.TensorDict, num_intervention_samples: int, treatment: str | None = None) -> causica.datasets.interventional_data.InterventionData
Sample an intervention and its sample mean from a given SEM.
- Parameters:
- sem: causica.sem.structural_equation_model.SEM
SEM to sample interventional data from.
- tensordict_data: tensordict.TensorDict
Base data for sampling an intervention value.
- num_intervention_samples: int
The number of samples to draw from the interventional distribution.
- treatment: str | None = None
The name of the treatment variable. If None, a random variable is chosen.
- Returns:
An intervention data object.
- causica.data_generation.generate_data.sample_counterfactual(sem: causica.sem.structural_equation_model.SEM, factual_data: tensordict.TensorDict, noise: tensordict.TensorDict, treatment: str | None = None) -> causica.datasets.interventional_data.CounterfactualData
Sample a counterfactual from a given SEM.
- Parameters:
- sem: causica.sem.structural_equation_model.SEM
SEM to sample counterfactual data from.
- factual_data: tensordict.TensorDict
Base data for sampling a counterfactual value.
- noise: tensordict.TensorDict
Base noise for sampling a counterfactual value.
- treatment: str | None = None
The name of the treatment variable. If None, a random variable is chosen.
- Returns:
A counterfactual data object.
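The reason this function takes both factual data and noise is that a counterfactual reuses the noise that produced the factual sample and only changes the treatment. The sketch below shows that idea on a hand-written linear SEM; it is not causica's SEM class, and `linear_sem_forward`, the weight matrix and the noise values are all made up for illustration.

```python
def linear_sem_forward(weights, noise, intervention=None):
    """Evaluate x_j = sum_i weights[i][j] * x_i + noise[j], node by node.

    Assumes the nodes are already in topological order, i.e. `weights`
    is strictly upper triangular.
    """
    intervention = intervention or {}
    values = []
    for j in range(len(noise)):
        if j in intervention:
            # do(x_j = value): the intervention overrides the mechanism.
            values.append(intervention[j])
        else:
            values.append(sum(weights[i][j] * values[i] for i in range(j)) + noise[j])
    return values


# Chain x0 -> x1 -> x2 with edge weights 2 and -1.
weights = [[0.0, 2.0, 0.0],
           [0.0, 0.0, -1.0],
           [0.0, 0.0, 0.0]]
noise = [1.0, 0.5, 0.25]

factual = linear_sem_forward(weights, noise)                   # [1.0, 2.5, -2.25]
counterfactual = linear_sem_forward(weights, noise, {1: 0.0})  # [1.0, 0.0, 0.25]
```

The non-descendant `x0` keeps its factual value, while `x2` changes only through the altered `x1` because its own noise term is held fixed.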
- causica.data_generation.generate_data.get_variable_definitions(variable_json_path: str | None = None, num_nodes: int | None = None, storage_options: dict[str, str] | None = None) -> causica.datasets.causica_dataset_format.VariablesMetadata
Get the variable definitions.
- Parameters:
- variable_json_path: str | None = None
The path to a json file containing the variables. This can be any fsspec compatible url.
- num_nodes: int | None = None
The number of nodes in the graph. Not used if using a numpy graph.
- storage_options: dict[str, str] | None = None
The storage options to pass to fsspec.
- Returns:
The variables.
- causica.data_generation.generate_data.generate_sem_sampler(variables: causica.datasets.causica_dataset_format.VariablesMetadata, graph_file: str | None, num_edges: int | None, graph_type: str, function_type: str, probs: float | None = None, storage_options: dict[str, str] | None = None) -> causica.data_generation.samplers.sem_sampler.SEMSampler
Generate a SEM sampler according to the specifications.
- Parameters:
- variables: causica.datasets.causica_dataset_format.VariablesMetadata
Variable specifications.
- graph_file: str | None
The path to a graph file if using a numpy graph. This can be any fsspec compatible url.
- num_edges: int | None
The number of edges in the graph. Not used if using a numpy graph.
- probs: float | None = None
The probability that an edge exists between two nodes.
- graph_type: str
The type of graph to generate. Either “er” for Erdos-Renyi or “numpy” for a numpy graph.
- function_type: str
The type of function to generate. Either “linear” or “rff”.
- storage_options: dict[str, str] | None = None
The storage options to pass to fsspec.
- Returns:
The SEM sampler.
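To give a feel for the two `function_type` values, here is a rough sketch contrasting a linear mechanism with a nonlinear one. It assumes that “rff” stands for random Fourier features, which this page does not state, and the helpers `make_linear` and `make_rff` as well as all distributions used for the random parameters are illustrative choices rather than causica's samplers.

```python
import math
import random


def make_linear(num_inputs, rng):
    """Return f(x) = w . x with randomly drawn weights."""
    weights = [rng.gauss(0.0, 1.0) for _ in range(num_inputs)]
    return lambda x: sum(w * xi for w, xi in zip(weights, x))


def make_rff(num_inputs, num_features, rng):
    """Return f(x) = sqrt(2 / K) * sum_k c_k * cos(w_k . x + b_k), a random smooth function."""
    freqs = [[rng.gauss(0.0, 1.0) for _ in range(num_inputs)] for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    coeffs = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def f(x):
        return scale * sum(
            c * math.cos(sum(w * xi for w, xi in zip(ws, x)) + b)
            for c, ws, b in zip(coeffs, freqs, phases)
        )

    return f


linear_fn = make_linear(num_inputs=2, rng=random.Random(0))
rff_fn = make_rff(num_inputs=2, num_features=16, rng=random.Random(0))
parents = [0.3, -1.2]
linear_value = linear_fn(parents)  # scales proportionally with the inputs
rff_value = rff_fn(parents)        # bounded, nonlinear in the inputs
```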
- causica.data_generation.generate_data.plot_dataset(data: tensordict.TensorDict, variables: causica.datasets.causica_dataset_format.VariablesMetadata, plot_kind: str = 'kde', plot_num: int = 10, datadir: str = '', storage_options: dict[str, str] | None = None) -> None
Plot the data distribution.
- Parameters:
- data: tensordict.TensorDict
The data to plot.
- variables: causica.datasets.causica_dataset_format.VariablesMetadata
Variable specifications.
- plot_kind: str = 'kde'
Type of joint plot to create.
- plot_num: int = 10
Maximum number of variables to plot.
- datadir: str = ''
The directory to save the dataset to. This can be any fsspec compatible url.
- storage_options: dict[str, str] | None = None
The storage options to pass to fsspec.
- causica.data_generation.generate_data.generate_save_plot_synthetic_data(graph_type: str, function_type: str, datadir: str, num_samples_train: int, num_samples_test: int, num_interventions: int, variable_json_path: str | None = None, num_nodes: int | None = None, graph_file: str | None = None, num_edges: int | None = None, probs: float | None = None, overwrite: bool = False, plot_kind: str = '', plot_num: int = 10, storage_options: dict[str, str] | None = None) -> None
Generate, save and plot synthetic data.
This will sample a SEM from the given specifications, generate data from it, and save it to disk following the Causica dataset format. It optionally plots the joint distribution over the variables.
- Parameters:
- graph_type: str
The type of graph to generate. Either “er” for Erdos-Renyi or “numpy” for a numpy graph.
- function_type: str
The type of function to generate. Either “linear” or “rff”.
- datadir: str
The directory to save the dataset to. This can be any fsspec compatible url.
- num_samples_train: int
The number of training samples to generate.
- num_samples_test: int
The number of validation and test samples to generate.
- num_interventions: int
The number of interventions and counterfactuals to generate.
- variable_json_path: str | None = None
The path to a json file containing the variables. This can be any fsspec compatible url.
- num_nodes: int | None = None
The number of nodes in the graph. Not used if using a numpy graph.
- graph_file: str | None = None
The path to a graph file if using a numpy graph. This can be any fsspec compatible url.
- num_edges: int | None = None
The number of edges in the graph. Not used if using a numpy graph.
- probs: float | None = None
The probability that an edge exists between two nodes.
- overwrite: bool = False
Whether to overwrite the dataset if it already exists.
- plot_kind: str = ''
Type of joint plot to create.
- plot_num: int = 10
Maximum number of variables to plot.
- storage_options: dict[str, str] | None = None
The storage options to pass to fsspec.