Version: 0.10.1

Regression - Auto Imports

This sample notebook is based on the Gallery Sample 6: Train, Test, Evaluate for Regression: Auto Imports Dataset for AzureML Studio. This experiment demonstrates how to build a regression model to predict the automobile's price. The process includes training, testing, and evaluating the model on the Automobile Imports data set.

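This sample demonstrates the use of several members of the synapseml library, including SummarizeData, CleanMissingData, TrainRegressor, ComputeModelStatistics, and ComputePerInstanceStatistics.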

First, create a Spark session and read the data file into a Spark DataFrame using spark.read.parquet().

from pyspark.sql import SparkSession

# Bootstrap Spark Session
spark = SparkSession.builder.getOrCreate()
data = spark.read.parquet(
    "wasbs://publicwasb@mmlspark.blob.core.windows.net/AutomobilePriceRaw.parquet"
)

To learn more about the data that was just read into the DataFrame, summarize it using SummarizeData and print the summary. For each column of the DataFrame, SummarizeData reports summary statistics in the following subcategories:

  • Feature name
  • Counts
    • Count
    • Unique Value Count
    • Missing Value Count
  • Quantiles
    • Min
    • 1st Quartile
    • Median
    • 3rd Quartile
    • Max
  • Sample Statistics
    • Sample Variance
    • Sample Standard Deviation
    • Sample Skewness
    • Sample Kurtosis
  • Percentiles
    • P0.5
    • P1
    • P5
    • P95
    • P99
    • P99.5

Note that several columns have missing values (normalized-losses, bore, stroke, horsepower, peak-rpm, price). This summary can be very useful during the initial phases of data discovery and characterization.

from synapse.ml.stages import SummarizeData

summary = SummarizeData().transform(data)
summary.toPandas()
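To make a few of the listed statistics concrete, here is a minimal plain-Python sketch that computes them for a single column. It is not how SummarizeData is implemented: the helper name `summarize_column`, the toy values, the choice to count all rows (including missing ones) under "count", and the quartile interpolation method are all assumptions made for illustration.

```python
import statistics


def summarize_column(values):
    """Summarize one column; None marks a missing value (hypothetical helper)."""
    present = [v for v in values if v is not None]
    # Quartiles via linear interpolation; the library may use another method.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(values),  # assumption: counts every row, missing or not
        "unique_value_count": len(set(present)),
        "missing_value_count": len(values) - len(present),
        "min": min(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(present),
        "sample_variance": statistics.variance(present),
        "sample_std_dev": statistics.stdev(present),
    }


# Made-up values, not taken from the Automobile Imports data set.
toy_column = [1.0, 2.0, 2.0, 3.0, None, 4.0, None, 5.0]
toy_summary = summarize_column(toy_column)
print(toy_summary)
```
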

Split the dataset into train and test datasets.

# split the data into training and testing datasets
train, test = data.randomSplit([0.6, 0.4], seed=123)
train.limit(10).toPandas()
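A weighted random split of this kind generally assigns each row independently, so the resulting sizes are only approximately 60/40, while a fixed seed makes the split repeatable. The plain-Python sketch below illustrates that idea. The function `random_split` is a hypothetical helper, not Spark's implementation, and it will not reproduce the rows Spark selects for seed 123.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.6, 1.0] for weights [0.6, 0.4].
    upper_bounds = [w / total for w in accumulate(weights)]
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(upper_bounds):
            if draw < bound:
                splits[index].append(row)
                break
        else:
            # Guards against floating-point rounding in the last bound.
            splits[-1].append(row)
    return splits


toy_rows = list(range(100))
toy_train, toy_test = random_split(toy_rows, [0.6, 0.4], seed=123)
print(len(toy_train), len(toy_test))
```
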

Now use the CleanMissingData API to replace the missing values in the dataset with something more useful or meaningful. Specify a list of columns to be cleaned and the corresponding output column names, which are not required to be the same as the input column names. CleanMissingData offers the options of "Mean", "Median", or "Custom" for the replacement value. For "Custom", the user also specifies the value to use via the "customValue" parameter. In this example, we will replace missing values in numeric columns with the median value for the column. We will define the model here, then use it as a Pipeline stage when we train our regression models and make our predictions in the following steps.

from synapse.ml.featurize import CleanMissingData

cols = ["normalized-losses", "stroke", "bore", "horsepower", "peak-rpm", "price"]
cleanModel = (
    CleanMissingData().setCleaningMode("Median").setInputCols(cols).setOutputCols(cols)
)
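The idea behind median cleaning can be sketched in plain Python: compute the median from the non-missing training values, then use that same value to fill gaps in both the training and the test rows. This is an illustration only, not the CleanMissingData implementation. The helpers `fit_median` and `fill_missing` and the toy rows are hypothetical.

```python
import statistics


def fit_median(rows, column):
    """Median of the non-missing values in one column (None marks missing)."""
    present = [r[column] for r in rows if r[column] is not None]
    return statistics.median(present)


def fill_missing(rows, column, value):
    """Return new rows with missing entries in `column` replaced by `value`."""
    return [{**r, column: value if r[column] is None else r[column]} for r in rows]


# Made-up rows, not taken from the Automobile Imports data set.
toy_train_rows = [{"bore": 3.0}, {"bore": None}, {"bore": 3.5}, {"bore": 4.0}]
toy_test_rows = [{"bore": None}, {"bore": 3.2}]

# The replacement value is learned from the training rows only...
bore_median = fit_median(toy_train_rows, "bore")
# ...and then applied to both splits.
train_clean = fill_missing(toy_train_rows, "bore", bore_median)
test_clean = fill_missing(toy_test_rows, "bore", bore_median)
print(bore_median, test_clean)
```

Learning the replacement value from the training rows alone mirrors what happens when the cleaning stage is fit as part of a Pipeline on the training dataset.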

Now we will create two Regressor models for comparison: Poisson Regression and Random Forest. PySpark has several regressors implemented:

  • LinearRegression
  • IsotonicRegression
  • DecisionTreeRegressor
  • RandomForestRegressor
  • GBTRegressor (Gradient-Boosted Trees)
  • AFTSurvivalRegression (Accelerated Failure Time Model Survival)
  • GeneralizedLinearRegression -- fit a generalized linear model by giving a symbolic description of the linear predictor (link function) and a description of the error distribution (family). The following families are supported:
    • Gaussian
    • Binomial
    • Poisson
    • Gamma
    • Tweedie -- power link function specified through linkPower

Refer to the PySpark API documentation for more details.

TrainRegressor creates a model based on the regressor and other parameters that are supplied to it, then trains the model on the data.

In this next step, create a Poisson regression model using the GeneralizedLinearRegression API from Spark, and build a Pipeline with CleanMissingData and TrainRegressor as stages to create and train the model. Note that because TrainRegressor expects a labelCol to be set, there is no need to set linkPredictionCol when setting up the GeneralizedLinearRegression. Fitting the pipeline on the training dataset trains the model. Applying the fitted pipeline's transform() to the test dataset creates the predictions.

# train Poisson Regression Model
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml import Pipeline
from synapse.ml.train import TrainRegressor

glr = GeneralizedLinearRegression(family="poisson", link="log")
poissonModel = TrainRegressor().setModel(glr).setLabelCol("price").setNumFeatures(256)
poissonPipe = Pipeline(stages=[cleanModel, poissonModel]).fit(train)
poissonPrediction = poissonPipe.transform(test)
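With family="poisson" and link="log", the model relates the expected label to the features through an exponential: the linear predictor is computed first, and the predicted mean is its exponential, so predictions are always positive. The sketch below shows only that relationship. The function `poisson_mean` and the coefficient values are hypothetical and are not fitted to the automobile data.

```python
import math


def poisson_mean(intercept, coefficients, features):
    """Predicted mean under a log link: exp(intercept + coefficients . features)."""
    linear_predictor = intercept + sum(c * x for c, x in zip(coefficients, features))
    return math.exp(linear_predictor)


# Hypothetical coefficients chosen for illustration.
example_mean = poisson_mean(2.0, [0.5, -1.0], [2.0, 1.0])
print(example_mean)

# Even a strongly negative linear predictor yields a positive prediction.
small_mean = poisson_mean(-10.0, [1.0], [0.0])
print(small_mean)
```
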

Next, repeat these steps to create a Random Forest regression model using the RandomForestRegressor API from Spark.

# train Random Forest regression on the same training data:
from pyspark.ml.regression import RandomForestRegressor

rfr = RandomForestRegressor(maxDepth=30, maxBins=128, numTrees=8, minInstancesPerNode=1)
# Pass the untrained TrainRegressor as a pipeline stage so that it is fit
# on the cleaned data, mirroring the Poisson pipeline above.
randomForestModel = TrainRegressor(model=rfr, labelCol="price", numFeatures=256)
randomForestPipe = Pipeline(stages=[cleanModel, randomForestModel]).fit(train)
randomForestPrediction = randomForestPipe.transform(test)

After the models have been trained and scored, compute some basic statistics to evaluate the predictions. The following statistics are calculated for regression models:

  • Mean squared error
  • Root mean squared error
  • R^2
  • Mean absolute error

Use the ComputeModelStatistics API to compute basic statistics for the Poisson and the Random Forest models.

from synapse.ml.train import ComputeModelStatistics

poissonMetrics = ComputeModelStatistics().transform(poissonPrediction)
print("Poisson Metrics")
poissonMetrics.toPandas()
randomForestMetrics = ComputeModelStatistics().transform(randomForestPrediction)
print("Random Forest Metrics")
randomForestMetrics.toPandas()
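For reference, the four statistics listed above follow their standard textbook definitions, sketched here in plain Python. This is not the ComputeModelStatistics implementation. The function `regression_metrics` and the toy labels and predictions are made up for illustration.

```python
import math


def regression_metrics(labels, predictions):
    """Standard regression metrics computed from paired labels and predictions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    sum_squared_error = sum(e * e for e in errors)
    mean_label = sum(labels) / n
    total_sum_of_squares = sum((y - mean_label) ** 2 for y in labels)
    mse = sum_squared_error / n
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "R^2": 1.0 - sum_squared_error / total_sum_of_squares,
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
    }


# Made-up values, not model output.
toy_labels = [10.0, 20.0, 30.0, 40.0]
toy_predictions = [12.0, 18.0, 33.0, 39.0]
toy_metrics = regression_metrics(toy_labels, toy_predictions)
print(toy_metrics)
```
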

We can also compute per-instance statistics for poissonPrediction:

from synapse.ml.train import ComputePerInstanceStatistics


def demonstrateEvalPerInstance(pred):
    return (
        ComputePerInstanceStatistics()
        .transform(pred)
        .select("price", "prediction", "L1_loss", "L2_loss")
        .limit(10)
        .toPandas()
    )


demonstrateEvalPerInstance(poissonPrediction)

and with randomForestPrediction:

demonstrateEvalPerInstance(randomForestPrediction)
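The L1_loss and L2_loss columns selected above are per-row losses. Assuming they follow the usual definitions (absolute error and squared error, respectively), each row's values can be reproduced with the plain-Python sketch below. The helper `per_instance_losses` and the toy rows are hypothetical, and the library's exact definitions should be checked against its documentation.

```python
def per_instance_losses(label, prediction):
    """Per-row losses under the usual definitions (an assumption, see above)."""
    difference = label - prediction
    return {"L1_loss": abs(difference), "L2_loss": difference * difference}


# Made-up (price, prediction) pairs, not model output.
toy_rows = [(13495.0, 13000.0), (16500.0, 17250.0)]
toy_losses = [per_instance_losses(price, prediction) for price, prediction in toy_rows]
for (price, prediction), losses in zip(toy_rows, toy_losses):
    print(price, prediction, losses)
```
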