Loan ChargeOff Prediction

For the Data Scientist - Develop with R


SQL Server ML Services takes advantage of the power of SQL Server and RevoScaleR (Microsoft ML Server package) by allowing R to run on the same server as the database. It includes a database service that runs outside the SQL Server process and communicates securely with the R runtime.

This solution package shows how to pre-process data (cleaning and feature engineering), train prediction models, and perform scoring on the SQL Server machine.
Data scientists who are testing and developing solutions can work from the convenience of their R IDE on their client machine, while setting the computation context to SQL (see R folder for code). They can also deploy the completed solutions to SQL Server (2016 or higher) by embedding calls to R in stored procedures (see SQLR folder for code). These solutions can then be further automated by the use of SQL Server Integration Services and SQL Server agent: a PowerShell script (.ps1 file) automates the running of the SQL code.

Loan ChargeOff Prediction


This solution template is showing an end to end solution including platform deployment scripts, R scripts that include data transformation, feature selection, training, scoring and operationalization. This template focused on predicting loan chargeoff risk using simulated data. Loan officer can look at the top N of the loans that have the highest chargeoff probability and formulate incentive plans to encourage the loan holders to continue to payoff the loan.

Data scientist will be able to look at R script that understand the process involved in feature engineering and model performance. The script uses 5 models to train:- rxLogisticRegression, rxFastTrees, rxFastForest, rxFastLinear and rxNeuralNet. The script will pick the model with the best performance and use it in scoring.

The simulated data is loaded to SQL table for training, testing and scoring. The training output with model performance details, selected features and prediction tables are also stored as tables in SQL Server. This template will run training, testing and scoring on 10,000 loans from the simulated datasets. There are also scripts stored in D:\LoanChargeOffSolution\Source\SQLR that data scientist can also experience running the training, testing and scoring using 100,000 and 1 million records. In this solution, the final scored database table in SQL Server gives the loan chargeoff predictions. This data is then visualized in PowerBI.

To try this out yourself, visit the Quick Start page.

Below is a description of what happens in each of the steps: dataset creation, model development, recommendations, and deployment in more detail.

Analytical Dataset Preprocessing and Feature Engineering


In this step, raw csv data is uploaded to SQL database and processed to create denormalized views/tables to include the features and labels. Optionally a feature selection script is also included for understanding how feature selection algorithms can be used in MicrosoftML package. We use feature selection as part of modelling step later. See the following scripts: step1_create_tables.sql : creates tables required for importing raw data, as well as storing models and predictions. step2_features_label_view.sql : creates views with feature and label columns based on raw data tables and persists them into tables for faster processing. The views and tables are created splitting the data into training, testing and scoring (to demonstrate batch scoring in a later step). step2a_optional_feature_selection.sql : demonstrate feature selection using logistic regression model and stores selected features in a table.

Model Development


In this step a stored procedure 'train_model' is created to train a model based on requested algorithm and the model is evaluated and resulting stats are stored along with model binary. During deployment we use five MicrosoftML algorithms for modelling and store the resulting model binary and evaluation stats in 'loan_chargeoff_models_10k' table. Modelling includes feature selection and categorization using 'categorical' and 'selectFeatures' transforms from MicrosoftML. See the following script : step3_train_test_model.sql

Computing ChargeOff Predictions


In this step two stored procedures are created 'predict_chargeoff' and 'predict_chargeoff_ondemand'. 'predict_chargeoff' procedures performs batch scoring on the data split created in preprocessing step and stores the predictions in 'loan_chargeoff_prediction_10k' table. 'predict_chargeoff_ondemand' stored procedure is created for ad-hoc scoring wherein it can be called with a single record and a single prediction result is returned to the caller. See the following scripts: step4_chargeoff_batch_prediction.sql step4a_chargeoff_ondemand_prediction.sql

Deploy and Visualize Results


Chargeoff prediction result stores in SQL Server table. The final step is to connect PowerBI report to SQL Server and visualize the scoring result. The sample PowerBI is shipped in this solution template. However, user can customize the PowerBI report according to their business needs.

You can access this dashboard in either of the following ways:

System Requirements

The following are required to run the scripts in this solution:

Template Contents


View the contents of this solution template.

To try this out yourself:

< Home