Skip to main content
Version: 0.10.2

LightGBM on Apache Spark


LightGBM is an open-source, distributed, high-performance gradient boosting (GBDT, GBRT, GBM, or MART) framework. This framework specializes in creating high-quality and GPU enabled decision tree algorithms for ranking, classification, and many other machine learning tasks. LightGBM is part of Microsoft's DMTK project.

Advantages of LightGBM

  • Composability: LightGBM models can be incorporated into existing SparkML Pipelines, and used for batch, streaming, and serving workloads.
  • Performance: LightGBM on Spark is 10-30% faster than SparkML on the Higgs dataset, and achieves a 15% increase in AUC. Parallel experiments have verified that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.
  • Functionality: LightGBM offers a wide array of tunable parameters, that one can use to customize their decision tree system. LightGBM on Spark also supports new types of problems such as quantile regression.
  • Cross platform LightGBM on Spark is available on Spark, PySpark, and SparklyR


In PySpark, you can run the LightGBMClassifier via:

from import LightGBMClassifier
model = LightGBMClassifier(learningRate=0.3,

Similarly, you can run the LightGBMRegressor by setting the application and alpha parameters:

from import LightGBMRegressor
model = LightGBMRegressor(application='quantile',

For an end to end application, check out the LightGBM notebook example.


SynapseML exposes getters/setters for many common LightGBM parameters. In python, you can use the properties as shown above, or in Scala use the fluent setters.

val classifier = new LightGBMClassifier()

LightGBM has far more parameters than SynapseML exposes. For cases where you need to set some parameters that SyanpseML does not expose a setter for, use passThroughArgs. This is just a free string that you can use to add extra parameters to the command SynapseML sends to configure LightGBM.

In python:

from import LightGBMClassifier
model = LightGBMClassifier(passThroughArgs="force_row_wise=true min_sum_hessian_in_leaf=2e-3",

In Scala:

val classifier = new LightGBMClassifier()
.setPassThroughArgs("force_row_wise=true min_sum_hessian_in_leaf=2e-3")

For formatting options and specific argument documentation, see LightGBM docs. Some parameters SynapseML will set specifically for the Spark distributed environment and should not be changed. Some parameters are for cli mode only, and will not work within Spark.

Note that you can mix passThroughArgs and explicit args, as shown above. SynapseML will merge them to create one argument string to send to LightGBM. If you set a parameter in both places, the passThroughArgs will take precedence.


LightGBM on Spark uses the Simple Wrapper and Interface Generator (SWIG) to add Java support for LightGBM. These Java Binding use the Java Native Interface call into the distributed C++ API.

We initialize LightGBM by calling LGBM_NetworkInit with the Spark executors within a MapPartitions call. We then pass each workers partitions into LightGBM to create the in-memory distributed dataset for LightGBM. We can then train LightGBM to produce a model that can then be used for inference.

The LightGBMClassifier and LightGBMRegressor use the SparkML API, inherit from the same base classes, integrate with SparkML pipelines, and can be tuned with SparkML's cross validators.

Models built can be saved as SparkML pipeline with native LightGBM model using saveNativeModel(). Additionally, they are fully compatible with PMML and can be converted to PMML format through the JPMML-SparkML-LightGBM plugin.

Barrier Execution Mode

By default LightGBM uses regular spark paradigm for launching tasks and communicates with the driver to coordinate task execution. The driver thread aggregates all task host:port information and then communicates the full list back to the workers in order for NetworkInit to be called. This requires the driver to know how many tasks there are, and if the expected number of tasks is different from actual this will cause the initialization to deadlock. There is a new UseBarrierExecutionMode flag, which when activated uses the barrier() stage to block all tasks. The barrier execution mode simplifies the logic to aggregate host:port information across all tasks. To use it in scala, you can call setUseBarrierExecutionMode(true), for example:

val lgbm = new LightGBMClassifier()
<train classifier>