Loan ChargeOff Prediction

Template Contents


The following is the directory structure for this template:

  • Data This contains the copy of the simulated input data with 100K unique customers.
  • R This contains the R code to simulate the input datasets, pre-process them, create the analytical datasets, train the models, identify the champion model and provide predictions.
  • SQLR This contains T-SQL code to pre-process the datasets, train the models, identify the champion model and provide predictions. It also contains a PowerShell script to automate the entire process, including loading the data into the database (not included in the T-SQL code).
  • HDI This contains the R code to pre-process the datasets, train the models, identify the champion model and provide predictions on a Spark cluster.

In this template with SQL Server ML Services, two versions of the SQL implementation and another version for HDInsight implementation:

  1. Model Development in R IDE . Run the R code in R IDE (e.g., RStudio, R Tools for Visual Studio).
  2. Operationalize in SQL. Run the SQL code in SQL Server using SQLR scripts from SSMS or from the PowerShell script.
  3. HDInsight Solution on Spark Cluster. Run this R code in RStudio on the edge node of the Spark cluster.

Copy of Input Datasets


File Description
loan_info.csv Raw data about each loan of a lending institution
member_info.csv Raw data about each member of a lending institution
payments_info.csv Raw data about loan payment history
File Description
Loan_Data1000.csv Raw data about loan payment history for 1000 members
Loan_Data10000.csv Raw data about loan payment history for 10000 members
Loan_Data100000.csv Raw data about loan payment history for 100000 members

Model Development in R


These files in the *SQLR directory for the SQL solution.

File Description
step2a_optional_feature_selection.sql R script that performs feature selection
step3_train_test_model.sqlR script that performs data training using 5 different models and select the best performant model and do testing using the model

Operationalize in SQL


These files are in the SQLR directory.

File Description
step4_chargeoff_batch_prediction.sql R script that performs scoring using best model on the data split created in Step 2 and store the predictions in [dbo].[loan_chargeoff_prediction_10k] table.
step4a_chargeoff_ondemand_prediction.sql chargeoff_ondemand_prediction stored procedure is created for ad-hoc scoring wherein it can be called with a single record and a single prediction result is returned to the caller.
Loan_chargeoff.ps1 Loads the input data into the SQL server and automates the running of all .sql files

HDInsight Solution on Spark Cluster


These files are in the HDI/RSparkCluster directory.

File Description
Copy_Dev2Prod.R Copies a model from the dev folder to the prod folder for production use
loanchargeoff_main.R is used to define the data and directories and then run all of the steps to process data, perform feature engineering, training, and scoring.
loanchargeoff_scoring.Ruses the previously trained model and invokes the steps to process data, perform feature engineering and scoring.
loanchargeoff_deployment.R create a web service and test it on the edge node
loanchargeoff_web_scoring.R access the web service on any computer with Microsoft ML Server installed
loan_main.R Main R script that executes the rest of the R scripts
loan_scoring.R Perform loan scoring using the model with the best performance
step1_get_training_testing_data.R Read input data which contains all the history information for all the loans from HDFS. Extract training/testing data based on process date (paydate) from the input data.
step2_feature_engineering.R Use MicrosoftML to do feature selection. Code can be added in this file to create some new features based on existing features. Open source package such as Caret can also be used to do feature selection here. Best features are selected using AUC.
step3_training_evaluation.R This script trains five different models and evaluate each.
step4_prepare_new_data.R This script creates a new data which contains all the opened loans on a pay date which we do not know the status in next three month, the loans in this new data are not included in the training and testing dataset and have the same features as the loans used in training/testing dataset.
step5_loan_prediction.R This script takes the new data created in the step4 and the champion model created in step3, output the predicted label and probability to be charge-off for each loan in next three months.

< Home