Generate synthetic data for benchmarking
Objectives
By following this tutorial, you will be able to:
- generate multiple synthetic regression datasets to benchmark lightgbm training or inferencing
Requirements
To enjoy this tutorial, you need to:
- have an existing AzureML workspace with the relevant compute resources.
- have set up your local environment to run our benchmarking pipelines.
Check out the generation configuration
Open the file under `conf/experiments/data-generation.yaml`. It contains in particular a section `data_generation_config` that we will look at more closely in this section.
The following YAML section contains the parameters to run a pipeline that will automatically generate synthetic data for various tasks, at various sizes.
```yaml
experiment:
  name: "data_generation_dev"

data_generation_config:
  # name of your particular benchmark
  benchmark_name: "benchmark-dev" # override this with a unique name

  # DATA
  tasks:
    - task: "regression"
      train_samples: 100000
      test_samples: 10000
      inferencing_samples: 10000
      n_features: 10
      n_informative: 10
    - task: "regression"
      train_samples: 100000
      test_samples: 10000
      inferencing_samples: 10000
      n_features: 100
      n_informative: 100
    - task: "regression"
      train_samples: 100000
      test_samples: 10000
      inferencing_samples: 10000
      n_features: 1000
      n_informative: 1000

  register_outputs: false
  register_outputs_prefix: "data-synthetic"
```
In particular, the configuration consists of a list of tasks, each made of key data generation arguments:
```yaml
- task: <regression or classification>
  train_samples: <number of training rows>
  test_samples: <number of testing rows>
  inferencing_samples: <number of inferencing rows>
  n_features: <number of features>
  n_informative: <how many features are informative>
```
Note
This list of arguments corresponds to a dataclass `data_generation_task` detailed in `src/common/tasks.py` (see the sketch below). All of these fields are required except `n_informative`.
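To give a sense of what that dataclass looks like, here is a minimal sketch based solely on the arguments listed above; the actual definition in `src/common/tasks.py` may differ (field types and defaults here are assumptions).

```python
from dataclasses import dataclass
from typing import Optional

# hypothetical sketch of the task dataclass, mirroring the arguments above;
# the real definition lives in src/common/tasks.py and may differ
@dataclass
class data_generation_task:
    task: str                     # "regression" or "classification"
    train_samples: int            # number of training rows
    test_samples: int             # number of testing rows
    inferencing_samples: int      # number of inferencing rows
    n_features: int               # total number of features
    n_informative: Optional[int] = None  # informative features (optional)
```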
The current list corresponds to the default settings required to run the training benchmark pipeline and the inferencing benchmark pipeline.
The option `register_outputs` can be set to `true` if you want the pipeline to automatically register its outputs using the naming convention `{prefix}-{task}-{n_features}cols-{samples}samples-{train|test|inference}`, which will be used in the next steps.
```yaml
register_outputs: false
register_outputs_prefix: "data-synthetic"
```
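For instance, with the default prefix `data-synthetic` and the first task above, and assuming `{samples}` refers to the row count of each split, the registered datasets would be named along the lines of:

- `data-synthetic-regression-10cols-100000samples-train`
- `data-synthetic-regression-10cols-10000samples-test`
- `data-synthetic-regression-10cols-10000samples-inference`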
Run the pipeline
Warning
For this section, we'll use `custom` as the name of the AzureML reference config files you created during local setup.
Running the data generation pipeline consists of launching a Python script with the pipeline configuration file.
```bash
python pipelines/azureml/pipelines/data_generation.py --exp-config pipelines/azureml/conf/experiments/data-generation.yaml
```
The Python script will build a pipeline based on a collection of scripts, each running in its own Python environment. The parameters for each script are provided by the configuration file `conf/experiments/data-generation.yaml`.
Running the Python command should open a browser window to your workspace, showing the experiment view.
To activate output registration, you can either modify `register_outputs` in `data-generation.yaml`, or override it from the command line:
```bash
python pipelines/azureml/pipelines/data_generation.py --exp-config pipelines/azureml/conf/experiments/data-generation.yaml data_generation.register_outputs=True
```
To find the resulting datasets, go into your workspace under the Datasets tab.
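If you prefer checking from code, here is a minimal sketch using the AzureML Python SDK (v1); it assumes `azureml-core` is installed in your local environment, that a workspace `config.json` is available locally, and it uses a hypothetical dataset name following the convention above.

```python
from azureml.core import Workspace, Dataset

# connect to the workspace using the local config.json from your setup
ws = Workspace.from_config()

# look up one of the registered synthetic datasets by name
# (replace with a name matching your benchmark configuration)
dataset = Dataset.get_by_name(ws, name="data-synthetic-regression-10cols-100000samples-train")
print(dataset)
```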
Next Steps
Once the pipeline completes and you can see the registered datasets in your workspace, you are ready to run the training benchmark pipeline.