
Run benchmark manually

Objectives - By following this tutorial, you will be able to:

  • generate synthetic data for running LightGBM
  • run LightGBM training and inferencing scripts to measure wall time

Requirements - To follow this tutorial, you need the Python dependencies installed locally (see instructions).

Generate synthetic data

To generate a synthetic dataset with scikit-learn, run one of the following (Bash or PowerShell):

Bash:

python src/scripts/data_processing/generate_data/generate.py \
    --train_samples 30000 \
    --test_samples 3000 \
    --inferencing_samples 30000 \
    --n_features 4000 \
    --n_informative 400 \
    --random_state 5 \
    --output_train ./data/synthetic/train/ \
    --output_test ./data/synthetic/test/ \
    --output_inference ./data/synthetic/inference/ \
    --type regression

PowerShell:

python src/scripts/data_processing/generate_data/generate.py `
    --train_samples 30000 `
    --test_samples 3000 `
    --inferencing_samples 30000 `
    --n_features 4000 `
    --n_informative 400 `
    --random_state 5 `
    --output_train ./data/synthetic/train/ `
    --output_test ./data/synthetic/test/ `
    --output_inference ./data/synthetic/inference/ `
    --type regression

Note

Running the synthetic data generation script with these parameter values requires at least 4 GB of available RAM and generates a 754 MB training dataset, a 75 MB testing dataset, and a 744 MB inferencing dataset.
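For reference, the generation step boils down to calling scikit-learn's make_regression with the parameters above and writing the resulting arrays to the output folders. The sketch below only illustrates that idea; the output file name and CSV layout are assumptions, not what generate.py actually writes.

# Illustrative sketch only: scikit-learn's make_regression with the tutorial's
# parameters. The output file name and layout are assumptions, not the actual
# format written by generate.py.
import os
import numpy as np
from sklearn.datasets import make_regression

def generate_split(n_samples, n_features, n_informative, random_state, output_dir):
    X, y = make_regression(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_informative,
        random_state=random_state,
    )
    os.makedirs(output_dir, exist_ok=True)
    # Label in the first column, features in the remaining columns (assumed layout).
    np.savetxt(os.path.join(output_dir, "data.txt"),
               np.column_stack([y, X]), delimiter=",")

# Training split with the same values as the command above.
generate_split(30000, 4000, 400, 5, "./data/synthetic/train/")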

Run training on synthetic data

Bash:

python src/scripts/training/lightgbm_python/train.py \
    --train ./data/synthetic/train/ \
    --test ./data/synthetic/test/ \
    --export_model ./data/models/synthetic-100trees-4000cols/ \
    --objective regression \
    --boosting_type gbdt \
    --tree_learner serial \
    --metric rmse \
    --num_trees 100 \
    --num_leaves 100 \
    --min_data_in_leaf 400 \
    --learning_rate 0.3 \
    --max_bin 16 \
    --feature_fraction 0.15 \
    --device_type cpu

PowerShell:

python src/scripts/training/lightgbm_python/train.py `
    --train ./data/synthetic/train/ `
    --test ./data/synthetic/test/ `
    --export_model ./data/models/synthetic-100trees-4000cols/ `
    --objective regression `
    --boosting_type gbdt `
    --tree_learner serial `
    --metric rmse `
    --num_trees 100 `
    --num_leaves 100 `
    --min_data_in_leaf 400 `
    --learning_rate 0.3 `
    --max_bin 16 `
    --feature_fraction 0.15 `
    --device_type cpu

Note

--device_type cpu is optional here; if you're running on a GPU, you can use --device_type gpu instead.
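Under the hood, train.py presumably wraps LightGBM's Python API. The sketch below shows how the command-line flags above map onto a lgb.train call; it is an illustration of the parameter mapping under assumed NumPy inputs, not the repository's actual training script.

# Illustration of the flag-to-parameter mapping for LightGBM's Python API.
# The .npy input files are assumptions; train.py handles its own data loading,
# timing, and model export.
import lightgbm as lgb
import numpy as np

X_train, y_train = np.load("X_train.npy"), np.load("y_train.npy")  # hypothetical inputs
X_test, y_test = np.load("X_test.npy"), np.load("y_test.npy")      # hypothetical inputs

params = {
    "objective": "regression",
    "boosting_type": "gbdt",
    "tree_learner": "serial",
    "metric": "rmse",
    "num_leaves": 100,
    "min_data_in_leaf": 400,
    "learning_rate": 0.3,
    "max_bin": 16,
    "feature_fraction": 0.15,
    "device_type": "cpu",
}

train_set = lgb.Dataset(X_train, label=y_train)
test_set = lgb.Dataset(X_test, label=y_test, reference=train_set)

# --num_trees 100 corresponds to the number of boosting rounds.
booster = lgb.train(params, train_set, num_boost_round=100, valid_sets=[test_set])
booster.save_model("./data/models/synthetic-100trees-4000cols/model.txt")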

Run inferencing on synthetic data (LightGBM Python)

Bash:

python src/scripts/inferencing/lightgbm_python/score.py \
    --data ./data/synthetic/inference/ \
    --model ./data/models/synthetic-100trees-4000cols/ \
    --output ./data/outputs/predictions/ \
    --num_threads 1

PowerShell:

python src/scripts/inferencing/lightgbm_python/score.py `
    --data ./data/synthetic/inference/ `
    --model ./data/models/synthetic-100trees-4000cols/ `
    --output ./data/outputs/predictions/ `
    --num_threads 1
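
As with training, score.py presumably relies on LightGBM's Python API for prediction. The sketch below shows the general shape of that step, including a wall-time measurement; the .npy input file and output file name are assumptions.

# Illustration of loading a saved LightGBM model and timing prediction.
# Input and output file names are assumptions; score.py handles its own I/O.
import os
import time
import lightgbm as lgb
import numpy as np

X_inference = np.load("X_inference.npy")  # hypothetical inferencing features

booster = lgb.Booster(model_file="./data/models/synthetic-100trees-4000cols/model.txt")

start = time.perf_counter()
# Extra keyword arguments are forwarded as prediction parameters (here, num_threads).
predictions = booster.predict(X_inference, num_threads=1)
elapsed = time.perf_counter() - start
print(f"predicted {len(predictions)} rows in {elapsed:.3f} s")

os.makedirs("./data/outputs/predictions/", exist_ok=True)
np.savetxt("./data/outputs/predictions/predictions.txt", predictions)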