For the Data Scientist

This page describes the SQL Server solution.

For the Data Scientist - Develop with R

Loan ChargeOff Prediction

Analytical Dataset Preprocessing and Feature Engineering

Model Development

Computing ChargeOff Predictions

Deploy and Visualize Results

System Requirements

Template Contents

SQL Server ML Services takes advantage of the power of SQL Server and RevoScaleR (Microsoft ML Server package) by allowing R to run on the same server as the database. It includes a database service that runs outside the SQL Server process and communicates securely with the R runtime.

This solution package shows how to pre-process data (cleaning and feature engineering), train prediction models, and perform scoring on the SQL Server machine.

Data scientists who are testing and developing solutions can work from the convenience of their R IDE on their client machine, while setting the computation context to SQL (see the R folder for code). They can also deploy the completed solutions to SQL Server (2016 or higher) by embedding calls to R in stored procedures (see the SQLR folder for code). These solutions can then be further automated with SQL Server Integration Services and SQL Server Agent; a PowerShell script (.ps1 file) automates the running of the SQL code.

Loan ChargeOff Prediction

This solution template shows an end-to-end solution, including platform deployment scripts and R scripts that cover data transformation, feature selection, training, scoring, and operationalization. The template focuses on predicting loan chargeoff risk using simulated data. A loan officer can look at the top N loans with the highest chargeoff probability and formulate incentive plans that encourage the loan holders to continue paying off their loans.

Data scientists can read the R scripts to understand the process involved in feature engineering and in evaluating model performance. The scripts train five models: rxLogisticRegression, rxFastTrees, rxFastForest, rxFastLinear, and rxNeuralNet. The model with the best performance is selected and used for scoring.

The simulated data is loaded into SQL tables for training, testing, and scoring. The training output, including model performance details, selected features, and predictions, is also stored in SQL Server tables. This template runs training, testing, and scoring on 10,000 loans from the simulated datasets. Scripts in D:\LoanChargeOffSolution\Source\SQLR also let data scientists run training, testing, and scoring on 100,000 and 1 million records. In this solution, the final scored table in SQL Server holds the loan chargeoff predictions. This data is then visualized in PowerBI.

To try this out yourself, visit the Quick Start page.

The sections below describe in more detail what happens in each step: dataset creation, model development, prediction, and deployment.

Analytical Dataset Preprocessing and Feature Engineering

In this step, raw CSV data is uploaded to the SQL database and processed to create denormalized views and tables that contain the features and labels. An optional feature selection script is also included to show how feature selection algorithms in the MicrosoftML package can be used; feature selection is applied again later as part of the modelling step. See the following scripts:

step1_create_tables.sql : creates the tables required for importing raw data, as well as for storing models and predictions.
step2_features_label_view.sql : creates views with feature and label columns based on the raw data tables and persists them into tables for faster processing. The views and tables split the data into training, testing, and scoring sets (the scoring set is used to demonstrate batch scoring in a later step).
step2a_optional_feature_selection.sql : demonstrates feature selection using a logistic regression model and stores the selected features in a table.
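The split into training, testing, and scoring sets can be pictured with a short sketch. The template performs the split in T-SQL and this page does not describe its rule, so everything below is an assumption made for illustration: the hash-based assignment, the 70/20/10 proportions, and the loan id format are all hypothetical. Python is used only to keep the example self-contained.

```python
import hashlib

def assign_split(loan_id: str, train_pct: int = 70, test_pct: int = 20) -> str:
    """Deterministically map a loan id to 'train', 'test', or 'score'.

    The percentages are illustrative assumptions, not the template's values.
    """
    digest = hashlib.sha256(loan_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + test_pct:
        return "test"
    return "score"

# Hypothetical loan ids standing in for the 10,000 simulated loans.
loan_ids = [f"loan_{i:05d}" for i in range(10_000)]
counts = {"train": 0, "test": 0, "score": 0}
for loan_id in loan_ids:
    counts[assign_split(loan_id)] += 1
print(counts)
```

Hashing the id rather than drawing random numbers means a loan always lands in the same set, so the split can be recomputed later without storing it.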

Model Development

In this step, a stored procedure 'train_model' is created. It trains a model with the requested algorithm, evaluates the model, and stores the resulting statistics along with the model binary. During deployment, five MicrosoftML algorithms are used for modelling, and the resulting model binaries and evaluation statistics are stored in the 'loan_chargeoff_models_10k' table. Modelling includes feature selection and categorization using the 'categorical' and 'selectFeatures' transforms from MicrosoftML. See the following script: step3_train_test_model.sql
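The idea of keeping a serialized model next to its evaluation statistics in one table row can be sketched as follows. This is not the template's code: it uses Python with an in-memory SQLite database as a stand-in for SQL Server, and the table name, column names, toy model, and statistics are all hypothetical placeholders.

```python
import pickle
import sqlite3

# In-memory SQLite stands in for SQL Server; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE models_demo ("
    "model_name TEXT PRIMARY KEY, model BLOB, auc REAL, tpr REAL, tnr REAL)"
)

def save_model(conn, name, model, stats):
    """Serialize the model and store it together with its evaluation stats."""
    blob = pickle.dumps(model)
    conn.execute(
        "INSERT OR REPLACE INTO models_demo VALUES (?, ?, ?, ?, ?)",
        (name, blob, stats["auc"], stats["tpr"], stats["tnr"]),
    )
    conn.commit()

def load_model(conn, name):
    """Return the deserialized model, or None if the name is unknown."""
    row = conn.execute(
        "SELECT model FROM models_demo WHERE model_name = ?", (name,)
    ).fetchone()
    return pickle.loads(row[0]) if row else None

# Toy model and placeholder statistics; neither comes from the template.
toy_model = {"intercept": -2.0, "weights": {"interest_rate": 0.3}}
save_model(conn, "toy_logistic", toy_model, {"auc": 0.0, "tpr": 0.0, "tnr": 0.0})
restored = load_model(conn, "toy_logistic")
```

Storing the binary and the statistics in the same row lets a later scoring step pick a model by its metrics and load it with a single query.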

Five models (Fast Forest, Logistic Regression, Fast Tree, Fast Linear, and Neural Network) are developed to predict which loans will be charged off in the next three months. The R code that develops these models and evaluates their performance on the testing dataset is in step3_training_evaluation.R. After the training and testing sets are created in step1 and the names of the selected features are obtained in step2, five prediction models are built on the training set using the selected features. Once the models are trained, the AUC, TPR (true positive rate), and TNR (true negative rate) of each model are calculated on the testing set.

The R script draws the ROC curve for each prediction model. The curve shows the model's true positive rate against its false positive rate as the decision threshold varies. The AUC is a number between 0 and 1 that corresponds to the area under the ROC curve. It measures how well the model separates the two classes (charge-off loans and non-charge-off loans) across decision thresholds applied to the predicted probabilities. The closer the AUC is to 1, the better the model. Because we are not searching for an optimal decision threshold, the AUC is more representative of prediction performance than the overall accuracy.

The training data is heavily skewed (there are many more non-charge-off loans than charge-off loans), and on unbalanced data the AUC can be high even when the TPR is low and the TNR is high. TPR and TNR are therefore calculated as well and considered together with the AUC, to check that the model with the best AUC also has acceptable TPR and TNR. The model with the best AUC is selected as the champion model and is used for prediction. Given the training and testing datasets and the names of the selected features, this script can be run manually.

Note that all the algorithms in this step take the selected features as input. The step can also be changed to pass all the features in the training set to every algorithm, let each algorithm select the best features using the selectFeatures function in MicrosoftML, and then compare the performance of the models on the same testing set.
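The relationship between AUC, TPR, and TNR on skewed data can be seen in a small worked example. The sketch below is plain Python, not the template's R code; the labels and scores are made up, the model names are hypothetical, and the 0.5 threshold is an assumption. AUC is computed in its rank form: the fraction of (charge-off, non-charge-off) pairs in which the charge-off loan receives the higher score, with ties counted as half.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def tpr_tnr(labels, scores, threshold=0.5):
    """True positive rate and true negative rate at a fixed threshold."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Skewed toy test set: 8 non-charge-off loans (0) and 2 charge-off loans (1).
labels = [0] * 8 + [1] * 2
model_scores = {
    "model_a": [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.35, 0.45],
    "model_b": [0.10, 0.20, 0.30, 0.40, 0.60, 0.20, 0.10, 0.30, 0.70, 0.35],
}

results = {}
for name, scores in model_scores.items():
    tpr, tnr = tpr_tnr(labels, scores)
    results[name] = {"auc": auc(labels, scores), "tpr": tpr, "tnr": tnr}

# Champion selection by best AUC, as described in the text.
champion = max(results, key=lambda name: results[name]["auc"])
print(champion, results[champion])
```

In this toy data, model_a ranks every charge-off loan above every non-charge-off loan, so its AUC is 1.0, yet at the 0.5 threshold it flags no loan at all (TPR 0.0, TNR 1.0). This is why TPR and TNR are worth reporting next to the AUC when the classes are unbalanced.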

Computing ChargeOff Predictions

In this step, two stored procedures are created: 'predict_chargeoff' and 'predict_chargeoff_ondemand'. The 'predict_chargeoff' procedure performs batch scoring on the data split created in the preprocessing step and stores the predictions in the 'loan_chargeoff_prediction_10k' table. The 'predict_chargeoff_ondemand' procedure is for ad-hoc scoring: it is called with a single record and returns a single prediction to the caller. See the following scripts:

step4_chargeoff_batch_prediction.sql
step4a_chargeoff_ondemand_prediction.sql
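The difference between batch and on-demand scoring, and the top-N view that a loan officer might use, can be sketched in a few lines. This is an illustration in Python, not the stored procedures themselves: the logistic coefficients, the feature names (interest_rate, missed_payments), and the loan records are all hypothetical.

```python
import math

# Hypothetical logistic model; coefficients are for illustration only.
MODEL = {
    "intercept": -3.0,
    "weights": {"interest_rate": 0.2, "missed_payments": 0.8},
}

def predict_one(model, loan):
    """On-demand scoring: one record in, one chargeoff probability out."""
    z = model["intercept"] + sum(
        weight * loan[name] for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))

def predict_batch(model, loans):
    """Batch scoring: score every record and keep the loan id with each result."""
    return [
        {"loan_id": loan["loan_id"], "probability": predict_one(model, loan)}
        for loan in loans
    ]

def top_n(predictions, n):
    """Return the n predictions with the highest chargeoff probability."""
    return sorted(predictions, key=lambda p: p["probability"], reverse=True)[:n]

# Made-up loan records.
loans = [
    {"loan_id": "L1", "interest_rate": 5.0, "missed_payments": 0},
    {"loan_id": "L2", "interest_rate": 10.0, "missed_payments": 3},
    {"loan_id": "L3", "interest_rate": 7.5, "missed_payments": 1},
]

predictions = predict_batch(MODEL, loans)
riskiest = top_n(predictions, 2)
print([p["loan_id"] for p in riskiest])
```

Building the batch path on top of the single-record function keeps the two code paths consistent: a loan scored on demand gets the same probability it would receive in the batch run.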

Deploy and Visualize Results

The chargeoff prediction results are stored in a SQL Server table. The final step is to connect a PowerBI report to SQL Server and visualize the scoring results. A sample PowerBI report is shipped with this solution template, and users can customize it according to their business needs.

You can access this dashboard in either of the following ways:

Open the PowerBI file from the D:\LoanChargeOffSolution\Reports directory on the deployed VM desktop.
Install PowerBI Desktop on your computer and download and open the Loan ChargeOff Prediction Dashboard
Install PowerBI Desktop on your computer and download and open the Loan ChargeOff Prediction HDI Dashboard

System Requirements

The following are required to run the scripts in this solution:

SQL Server (2016 or higher) with Microsoft ML Server (version 9.1.0) installed and configured.
A SQL user name and password, with the user configured properly to execute R scripts in-memory.
A SQL database on which the user has permission to write and to execute stored procedures.
For more information about SQL Server and ML Services, please visit: https://docs.microsoft.com/en-us/sql/advanced-analytics/what-s-new-in-sql-server-machine-learning-services

Template Contents

View the contents of this solution template.

To try this out yourself:

View the Quick Start.
