Run LightGBM Training on your own data in AzureML
Objectives - By following this tutorial, you will be able to:
- run lightgbm training pipeline on your own train/test data in AzureML
Requirements - To enjoy this tutorial, you need to:
- have installed the local python requirements.
- have an existing AzureML workspace with relevant compute resource.
- have edited your config files to run the pipelines in your workspace.
Get your data into AzureML
There are two simple ways to get your data into your AzureML workspace:
- Option A: Upload your local data into AzureML
- Option B: Create a dataset from an existing storage
With either option, you need to create File datasets for your training and testing data, each provided as a single file.
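Before uploading, a quick local check can confirm that each folder holds exactly one data file with a consistent number of columns. This is only a minimal sketch: the helper names `find_single_data_file` and `count_columns` are hypothetical, and it assumes comma-separated data with no header row and the label in column 0, as in the example config.

```python
import csv
import tempfile
from pathlib import Path


def find_single_data_file(folder):
    """Return the only file in `folder`; raise if there are zero or several."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    if len(files) != 1:
        raise ValueError(f"{folder}: expected exactly 1 file, found {len(files)}")
    return files[0]


def count_columns(path, delimiter=","):
    """Return the number of columns, raising if rows have different widths."""
    widths = set()
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if row:  # ignore blank lines
                widths.add(len(row))
    if len(widths) != 1:
        raise ValueError(f"{path}: inconsistent column counts {sorted(widths)}")
    return widths.pop()


# Demo on a throwaway folder standing in for your local training data.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "train.csv"
    # Assumption: label in column 0, no header row (label_column: "0", header: false).
    sample.write_text("1.5,0.1,0.2\n2.5,0.3,0.4\n")
    data_file = find_single_data_file(folder)
    n_columns = count_columns(data_file)

print(data_file.name, n_columns)
```

Running the same two checks on your train and test folders helps catch a stray extra file or a ragged row before the data reaches the pipeline.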
Run training on your train/test datasets
1. Check out the file `conf/experiments/lightgbm_training/cpu.yaml` (see below):
# to execute:
# > python src/pipelines/azureml/lightgbm_training.py --exp-config conf/experiments/lightgbm_training/cpu.yaml
defaults:
  - aml: custom
  - compute: custom
### CUSTOM PARAMETERS ###
experiment:
  name: "lightgbm_training_dev"
  description: "something interesting to say about this"
lightgbm_training_config:
  # name of your particular benchmark
  benchmark_name: "benchmark-dev" # override this with a unique name
  # list all the train/test pairs to train on
  tasks:
    - train:
        name: "data-synthetic-regression-100cols-100000samples-train"
      test:
        name: "data-synthetic-regression-100cols-10000samples-test"
      task_key: "synthetic-regression-100cols" # optional, used to register outputs
  # NOTE: this example uses only 1 training (reference)
  # see other config files for creating training variants
  reference:
    framework: lightgbm_python
    # input parameters
    data:
      auto_partitioning: True # inserts partitioning to match expected number of partitions (if nodes*processes > 1)
      pre_convert_to_binary: False # inserts conversion of train/test data into binary to speed up training (not compatible with auto_partitioning yet)
      header: false
      label_column: "0"
      group_column: null
    # lightgbm training parameters
    training:
      objective: "regression"
      metric: "rmse"
      boosting: "gbdt"
      tree_learner: "data"
      num_iterations: 100
      num_leaves: 31
      min_data_in_leaf: 20
      learning_rate: 0.1
      max_bin: 255
      feature_fraction: 1.0
      # compute parameters
      device_type: "cpu"
      # you can add anything under custom_params, it will be sent as a dictionary
      # to the lightgbm training module to override its parameters (see lightgbm docs for list)
      custom_params:
        deterministic: True
        use_two_round_loading: True
    # compute parameters
    runtime:
      #target: null # optional: force target for this training job
      nodes: 1
      processes: 1
    # model registration
    # naming convention: "{register_model_prefix}-{task_key}-{num_iterations}trees-{num_leaves}leaves-{register_model_suffix}"
    output:
      register_model: False
      #register_model_prefix: "model"
      #register_model_suffix: null
2. Modify the lines below to reflect the name of your input train/test datasets:
  # list all the train/test pairs to train on
  tasks:
    - train:
        name: "NAME OF YOUR TRAINING DATASET HERE"
      test:
        name: "NAME OF YOUR TESTING DATASET HERE"
Hint
tasks is actually a list; if you provide multiple entries, the pipeline will train one model per train/test pair.
3. If you want the pipeline to save your model as a dataset, set register_model to True and uncomment the lines below to name the output according to the naming convention:
lightgbm_training_config:
  reference:
    # model registration
    # naming convention: "{register_model_prefix}-{task_key}-{num_iterations}trees-{num_leaves}leaves-{register_model_suffix}"
    output:
      register_model: False # set to True to register the model
      #register_model_prefix: "model"
      #register_model_suffix: null
Hint
You can also register the output of the pipeline manually later from the AzureML portal.
4. Run the training pipeline:
python src/pipelines/azureml/lightgbm_training.py --exp-config conf/experiments/lightgbm_training/cpu.yaml
That's it.
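To see what a registered model name would look like, the naming convention quoted in the config comments can be reproduced with a plain format string. This is only an illustrative sketch: the function name is hypothetical, the suffix value "cpu" is made up for the example, and how the pipeline treats a null prefix or suffix is not covered here.

```python
def registered_model_name(prefix, task_key, num_iterations, num_leaves, suffix):
    """Format a name following the convention quoted in the config comments:
    {register_model_prefix}-{task_key}-{num_iterations}trees-{num_leaves}leaves-{register_model_suffix}
    """
    return f"{prefix}-{task_key}-{num_iterations}trees-{num_leaves}leaves-{suffix}"


# Values taken from the example config, except the suffix, which is hypothetical.
example_name = registered_model_name(
    prefix="model",
    task_key="synthetic-regression-100cols",
    num_iterations=100,
    num_leaves=31,
    suffix="cpu",
)
print(example_name)
# model-synthetic-regression-100cols-100trees-31leaves-cpu
```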
Options to modify the training parameters
The benchmark training pipeline is entirely configurable. A few key parameters in the YAML config file enable interesting training scenarios. We've provided a couple of typical setups in separate config files. Feel free to explore all of them and come up with your own set of parameters.
Scalable multi-node training using MPI
Hint
Check out example config file conf/experiments/lightgbm_training/cpu.yaml.
To enable multi-node training, simply modify the number of nodes under runtime:
lightgbm_training_config:
  reference:
    runtime:
      nodes: 1 # set to a value greater than 1 for multi-node training
When you run the pipeline, it automatically partitions the data to match the number of nodes and provisions the required number of nodes for multi-node training.
python src/pipelines/azureml/lightgbm_training.py --exp-config conf/experiments/lightgbm_training/cpu.yaml
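To make the partitioning idea concrete, here is a minimal sketch of splitting rows into nodes*processes partitions. The round-robin strategy and the function names are assumptions for illustration only; the pipeline's actual partitioning method is not described in this tutorial.

```python
def expected_partitions(nodes, processes):
    """Number of partitions, following the config comment (nodes * processes)."""
    return nodes * processes


def partition_rows(rows, n_partitions):
    """Distribute rows round-robin into n_partitions lists (assumed strategy)."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")
    partitions = [[] for _ in range(n_partitions)]
    for index, row in enumerate(rows):
        partitions[index % n_partitions].append(row)
    return partitions


# Example: 2 nodes with 2 processes each, 10 rows of data.
n_parts = expected_partitions(nodes=2, processes=2)
parts = partition_rows(list(range(10)), n_parts)
print(n_parts, parts)
# 4 [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```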
GPU training (experimental)
Hint
Check out example config file conf/experiments/lightgbm_training/gpu.yaml.
To enable GPU training, modify the options below to build a GPU-ready docker image and turn on GPU in LightGBM training:
lightgbm_training_config:
  reference:
    training:
      device_type: "gpu"
    runtime:
      build: "docker/lightgbm-v3.3.0/linux_gpu_pip.dockerfile"
When you run the pipeline, it will automatically run on the GPU cluster you've named in your compute/custom.yaml file.
python src/pipelines/azureml/lightgbm_training.py --exp-config conf/experiments/lightgbm_training/gpu.yaml
Running a custom lightgbm build (experimental)
Hint
Check out example config file conf/experiments/lightgbm_training/cpu-custom.yaml.
To enable training on a custom build, modify the options below:
lightgbm_training_config:
  reference:
    runtime:
      build: "dockers/lightgbm_cpu_mpi_custom.dockerfile" # relative to lightgbm_python folder
When running the pipeline, it will build the container from this custom dockerfile and use it to run your job.
python src/pipelines/azureml/lightgbm_training.py --exp-config conf/experiments/lightgbm_training/cpu-custom.yaml
Hyperparameter search using Sweep
AzureML has a feature to tune model hyperparameters, and we've implemented it in this training pipeline.
Hint
Check out example config file conf/experiments/lightgbm_training/sweep.yaml.
To enable parameter sweep, just change the "sweepable" parameters (see below) to use the syntax allowed by AzureML sweep:
lightgbm_training_config:
  reference:
    training:
      # "sweepable" training parameters
      num_iterations: "choice(100, 200)"
      num_leaves: "choice(10,20,30)"
      min_data_in_leaf: 20
      learning_rate: 0.1
      max_bin: 255
      feature_fraction: 1.0
Running the pipeline with this config will automatically try multiple values for the parameters and return the best model.
python src/pipelines/azureml/lightgbm_training.py --exp-config conf/experiments/lightgbm_training/sweep.yaml
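To get a feel for the search space that the choice(...) expressions describe, the sketch below parses them and enumerates every combination. It is not how AzureML Sweep is implemented (with a random algorithm, Sweep samples trials rather than enumerating them); the function names are hypothetical and only the choice(...) form is handled.

```python
import itertools
import re

CHOICE_PATTERN = re.compile(r"^\s*choice\((.*)\)\s*$")


def _to_number(token):
    """Convert a token such as '100' or '0.1' to int or float."""
    token = token.strip()
    try:
        return int(token)
    except ValueError:
        return float(token)


def parse_choices(value):
    """Return the candidate values of a parameter; fixed values yield one candidate."""
    if isinstance(value, str):
        match = CHOICE_PATTERN.match(value)
        if match:
            return [_to_number(token) for token in match.group(1).split(",")]
    return [value]


def enumerate_search_space(training_params):
    """List every combination of candidate values as a dict of parameters."""
    names = list(training_params)
    candidates = [parse_choices(training_params[name]) for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*candidates)]


training = {
    "num_iterations": "choice(100, 200)",
    "num_leaves": "choice(10,20,30)",
    "min_data_in_leaf": 20,
    "learning_rate": 0.1,
}
trials = enumerate_search_space(training)
print(len(trials))
# 6
```

With two candidates for num_iterations and three for num_leaves, the space holds 6 distinct combinations.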
You can also modify the parameters of Sweep itself; see the AzureML documentation for the role of each of these settings:
lightgbm_training_config:
  reference:
    sweep:
      #primary_metric: "node_0/valid_0.rmse" # if you comment it out, will use "node_0/valid_0.METRIC"
      goal: "minimize"
      algorithm: "random"
      early_termination:
        policy_type: "median_stopping"
        evaluation_interval: 1
        delay_evaluation: 5
        truncation_percentage: 20
      limits:
        max_total_trials: 100
        max_concurrent_trials: 10
        timeout_minutes: 60
Running multiple variants of training parameters
The training pipeline allows you to benchmark multiple variants of the training parameters.
The structure of the lightgbm_training_config settings relies on 3 main sections:
- tasks: a list of train/test dataset pairs
- reference: parameters used as reference for lightgbm training
- variants: a list of parameter overrides that apply on top of the reference parameters.
So you can create as many tasks and variants as you'd like and run them all in a single pipeline.
An example use case is training on CPU versus GPU; see the example file training-cpu-vs-gpu.yaml. In the example below, each variant just overrides num_iterations:
lightgbm_training_config:
  benchmark_name: "benchmark-cpu-num-trees"
  # list all the train/test pairs to train on
  tasks:
    - train:
        name: "data-synthetic-regression-10cols-100000samples-train"
      test:
        name: "data-synthetic-regression-10cols-10000samples-test"
      task_key: "synthetic-regression-10cols" # optional, used to register outputs
    - train:
        name: "data-synthetic-regression-100cols-100000samples-train"
      test:
        name: "data-synthetic-regression-100cols-10000samples-test"
      task_key: "synthetic-regression-100cols" # optional, used to register outputs
    - train:
        name: "data-synthetic-regression-1000cols-100000samples-train"
      test:
        name: "data-synthetic-regression-1000cols-10000samples-test"
      task_key: "synthetic-regression-1000cols" # optional, used to register outputs
  # reference settings for the benchmark
  # all variants will be based on this
  reference:
    # lots of other params here
    training:
      num_iterations: 100
  # variant settings override what is in reference
  variants:
    - training:
        num_iterations: 10
    - training:
        num_iterations: 1000
    - training:
        num_iterations: 5000
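The way a variant overrides the reference parameters can be pictured as a recursive dictionary merge. The sketch below is only an assumption about the merge semantics, not the pipeline's actual code; `apply_override` is a hypothetical helper, and the reference values are borrowed from the example config.

```python
import copy


def apply_override(reference, override):
    """Return a copy of `reference` with `override` merged on top, key by key."""
    merged = copy.deepcopy(reference)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections (e.g. training) are merged, not replaced wholesale.
            merged[key] = apply_override(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


reference = {
    "training": {"num_iterations": 100, "num_leaves": 31},
    "runtime": {"nodes": 1},
}
variants = [
    {"training": {"num_iterations": 10}},
    {"training": {"num_iterations": 1000}},
    {"training": {"num_iterations": 5000}},
]

configs = [apply_override(reference, variant) for variant in variants]
print([config["training"]["num_iterations"] for config in configs])
# [10, 1000, 5000]
```

Under this assumption, each variant keeps every reference setting it does not mention (here num_leaves and runtime), and the reference itself is left unchanged.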