Typical Workflow

This page describes the typical workflow for the solution in both its SQL Server VM and Azure HDInsight Spark versions.

Step 1: Server Setup and Configuration

Step 2: Data Prep and Modeling

Step 3: Operationalize

Step 4: Deploy and Visualize

Lending institutions gain several benefits from loan chargeoff prediction data. Charging off a loan is the last resort for a bank dealing with a severely delinquent loan. With predictive data at hand, a loan officer can offer personalized incentives, such as a lower interest rate or a longer repayment period, to help customers keep making loan payments and thus prevent the loan from being charged off. To obtain this kind of prediction data, credit unions and banks often compile it by hand from customers' past payment history and perform simple statistical regression analysis. This method is highly prone to data compilation errors and is not statistically sound.

This solution template demonstrates an end-to-end solution that runs predictive analytics on loan data and scores each loan's chargeoff probability. A PowerBI report also walks through the analysis and trends of credit loans and the predicted chargeoff probability.

To demonstrate a typical workflow, we’ll introduce you to a few personas. You can follow along by performing the same steps for each persona.

Step 1: Server Setup and Configuration with Danny the DB Analyst / Ivan the IT Administrator

Let me introduce you to Danny, the Database Analyst. Danny is the main contact for SQL Server database administration and application integration. Danny was responsible for installing and configuring SQL Server. He has added a user with all the permissions needed to execute R scripts on the server and modify the LoanChargeOff database. This was done through the createuser.sql file. This step has already been done on the VM you deployed using the 'Deploy to Azure' button on the Quick start page. Alternatively, Danny could run LoanChargeOff.ps1 to run the end-to-end workflow, which includes setting up the SQL Server user login, importing raw data into SQL Server tables, creating views, training and testing, and prediction.

This step has already been done on your 'Deploy to Azure' VM.

Step 2: Data Prep and Modeling with Debra the Data Scientist

Now let’s meet Debra, the Data Scientist. Debra’s job is to use loan payment data to predict loan chargeoff risk. Debra’s preferred languages for developing models are R and SQL. She uses Microsoft ML Services with SQL Server 2017 because it can process large datasets and is not constrained by the memory restrictions of open source R.

After analyzing the data, she opted to create multiple models and choose the best one. She will create five machine learning models and compare them, use the one she likes best to compute a prediction for each loan, and then select the loan with the highest probability of chargeoff.
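
The compare-then-score pattern described above can be sketched in base R. This is only an illustration under loud assumptions: the data is synthetic, the column names (interest_rate, days_past_due, charge_off) are hypothetical, and two plain glm logistic regressions stand in for the five models the solution actually trains.

```r
set.seed(42)  # reproducible synthetic data

# Hypothetical loan features; these column names are illustrative only.
n <- 400
loans <- data.frame(
  loan_id       = seq_len(n),
  interest_rate = runif(n, 3, 20),
  days_past_due = rpois(n, 10)
)
# Synthetic label: higher rates and longer delinquency raise chargeoff risk.
risk <- plogis(-4 + 0.15 * loans$interest_rate + 0.1 * loans$days_past_due)
loans$charge_off <- rbinom(n, 1, risk)

train <- loans[1:300, ]
test  <- loans[301:400, ]

# Two stand-in candidate models (assumption: not the solution's learners).
candidates <- list(
  rate_only = glm(charge_off ~ interest_rate, data = train, family = binomial),
  rate_and_delinquency = glm(charge_off ~ interest_rate + days_past_due,
                             data = train, family = binomial)
)

# AUC via the rank (Mann-Whitney) formulation.
auc <- function(labels, scores) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Evaluate each candidate on held-out data and keep the best one.
aucs <- sapply(candidates, function(m) {
  auc(test$charge_off, predict(m, newdata = test, type = "response"))
})
champion <- names(which.max(aucs))

# Score every loan with the champion and find the riskiest one.
loans$probability <- predict(candidates[[champion]], newdata = loans,
                             type = "response")
riskiest <- loans[which.max(loans$probability), c("loan_id", "probability")]
print(aucs)
print(riskiest)
```

Evaluating on rows held out from training keeps the comparison honest; the solution's own scripts may split and evaluate differently.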

Debra will work on her own machine, using R Client to execute these R scripts. R Client is already installed on the VM. She will also use an IDE to run R.

On your VM, R Tools for Visual Studio is installed. You will however have to either log in or create a new account for using this tool. If you prefer, you can download and install RStudio on your VM instead.

Debra will develop her R scripts in the Open Source Edition of RStudio Server, installed on her cluster's edge node. You can follow along on your own cluster deployed by using the 'Deploy to Azure' button on the Quick Start page. Access RStudio by using a URL of the form:
http://CLUSTERNAME.azurehdinsight.net/rstudio.

When you first visit the URL to access RStudio, you will see two different login prompts. Use the username and password you created when you deployed the HDInsight solution for both of them.

After logging in to RStudio, you will need to upload the files that are used in this solution, if you have not already done so during your deployment. To obtain the files, execute the following code in RStudio:

library(RevoScaleR)
library(MicrosoftML)
library(xgboost)

# spark cc object
sparkContext <- rxSparkConnect(consoleOutput = TRUE, reset = TRUE)
  
# set compute context to local
rxSetComputeContext('local')

# Copy model rds files to local dev folder from HDFS
LocalDir <- paste("/var/RevoShare/", Sys.info()[["user"]], "/LoanChargeOff/dev/model/", sep="" )
if(!dir.exists(LocalDir)){
   system(paste("mkdir -p -m 777 ", LocalDir, sep="")) # create a new directory
}
RemoteFiles <- "/LoanChargeOff/model/*.rds"
rxHadoopCopyToLocal(source = RemoteFiles, dest = LocalDir)

# clean up 
rm(list = ls())

OPTIONAL: You can execute the R code on your local computer if you wish, but you must first prepare both the VM and your computer. Additionally you can view and execute the R code in a Jupyter Notebook on the VM.

Now you're ready to follow along with Debra as she creates the scripts needed for this solution. If you are using Visual Studio, you will see these files in the Solution Explorer tab on the right. In RStudio, the files can be found in the Files tab, also on the right.

Now that Debra's environment is set up, she opens her IDE and creates a Project. To follow along with her, open the D:\LoanChargeOff\R directory on the VM desktop or your computer. Debra can follow the steps listed in For the Database Analyst. To understand each step, Debra should execute all of them, including the optional feature selection step.

Step 1: Creating Tables
Step 2: Creating Views with Features and Labels
Step 2a: Demonstrate feature selection using MicrosoftML package
Step 3: Training and Testing Model
Step 4: Chargeoff Prediction (batch)
Step 4a: Chargeoff Prediction (OnDemand)

Details of the above steps: SQL Workflow Automation

Below is a summary of the individual steps used for this solution.

step1_get_training_testing_data.R: Reads the input data, which contains the full history of all loans, from HDFS. Extracts training and testing data from the input based on the process date (paydate), and saves both sets in the HDFS working directory.
step2_feature_engineering.R: Uses MicrosoftML to perform feature selection. You can add code to this file to create new features from existing ones. Open source packages such as caret can also be used for feature selection here. The best features are selected using AUC.
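
As a rough illustration of a date-based split like the one in step 1, the base R sketch below divides a small in-memory data frame on paydate. It is not the solution's script, which reads from and writes to HDFS; the loan_id and charge_off columns, the sample rows, and the cutoff date are all assumptions made for the example.

```r
# Hypothetical loan history; only the paydate name comes from the text above.
loans <- data.frame(
  loan_id    = 1:6,
  paydate    = as.Date(c("2017-01-12", "2017-02-12", "2017-03-12",
                         "2017-04-12", "2017-05-12", "2017-06-12")),
  charge_off = c(0, 0, 1, 0, 1, 0)
)

# Assumed cutoff: rows paid before it train the model, later rows test it.
cutoff <- as.Date("2017-05-01")
train  <- loans[loans$paydate <  cutoff, ]
test   <- loans[loans$paydate >= cutoff, ]

print(nrow(train))  # 4
print(nrow(test))   # 2
```

Splitting on a date rather than at random keeps later payments out of the training set, which mirrors how the model is used: trained on the past, applied to the future.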

You can run these scripts if you wish, but you may also skip them if you want to get right to the modeling. The data that these scripts create already exists in the SQL database.

To run all the scripts described above as well as those in the next few steps, open and execute the file loanchargeoff_main.R.

In both Visual Studio and RStudio, there are multiple ways to execute the code from the R Script window. The fastest way for both IDEs is to use Ctrl-Enter on a single line or a selection. Learn more about R Tools for Visual Studio or RStudio.

Now she is ready to train the models, using step3_training_evaluation.R. This step trains two different models and evaluates each.
The R script draws the ROC (Receiver Operating Characteristic) curve for each prediction model. The curve shows the model's true positive rate against its false positive rate as the decision threshold varies.

The AUC is a number between 0 and 1 that corresponds to the area under the ROC curve. It measures how well the model separates the two classes (charged-off loans vs. loans that are not charged off) across the decision thresholds applied to the predicted probabilities. The closer the AUC is to 1, the better the model. Because we are not looking for an optimal decision threshold, the AUC is more representative of prediction performance than accuracy, which depends on the threshold.
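
To make the relationship between the ROC curve and the AUC concrete, here is a minimal base R sketch that builds the ROC points by sweeping the threshold over the sorted scores and integrates them with the trapezoid rule. The helper names (roc_points, auc_trapezoid) and the six sample loans are hypothetical, and the sketch assumes there are no tied scores; the solution's own scripts may compute these quantities differently.

```r
# Build ROC points from labels (1 = charged off) and predicted scores.
# Assumes no tied scores; each sorted score acts as one decision threshold.
roc_points <- function(labels, scores) {
  ord    <- order(scores, decreasing = TRUE)
  labels <- labels[ord]
  tpr <- cumsum(labels == 1) / sum(labels == 1)  # true positive rate
  fpr <- cumsum(labels == 0) / sum(labels == 0)  # false positive rate
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))   # start the curve at (0, 0)
}

# Area under the ROC curve by the trapezoid rule.
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Hypothetical scored loans.
labels <- c(1, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.3, 0.1)

roc <- roc_points(labels, scores)
auc <- auc_trapezoid(roc)
print(round(auc, 3))  # 0.889
```

The result matches the pairwise interpretation of the AUC: of the 9 possible (charged-off, not charged-off) pairs, the charged-off loan has the higher score in 8.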

Debra will use the AUC to select the champion model to use in the next step.
step4_prepare_new_data.R creates a new dataset containing all loans that are open on a given pay date and whose status over the next three months is unknown. These loans are not included in the training and testing datasets, but they have the same features as the loans in those datasets.
step5_loan_prediction.R takes the new data created in step 4 and the champion model created in step 3, and outputs the predicted label and the probability of chargeoff within the next three months for each loan.
loanchargeoff_xgboost.R: This step is optional. For more details, see Using XGBoost package in HDInsight Spark Cluster for Loan ChargeOff Prediction.
After creating the model, Debra runs Copy_Dev2Prod.R to copy the model information from the dev folder to the prod folder, then runs loanchargeoff_scoring.R to create predictions for her new data.
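
The dev-to-prod promotion can be pictured with a small base R sketch. This is not the content of Copy_Dev2Prod.R, which works against the cluster's storage; the temporary folders, the model.rds file name, and the placeholder model object are all assumptions for illustration.

```r
# Hypothetical stand-ins for the dev and prod model folders.
root     <- tempfile("LoanChargeOff_")
dev_dir  <- file.path(root, "dev", "model")
prod_dir <- file.path(root, "prod", "model")
dir.create(dev_dir, recursive = TRUE)
dir.create(prod_dir, recursive = TRUE)

# Save a placeholder "champion model" object in the dev folder.
saveRDS(list(name = "champion"), file.path(dev_dir, "model.rds"))

# Promote every .rds file from dev to prod.
rds_files <- list.files(dev_dir, pattern = "\\.rds$", full.names = TRUE)
copied    <- file.copy(rds_files, prod_dir, overwrite = TRUE)

# Scoring code would then load the model from the prod folder.
prod_model <- readRDS(file.path(prod_dir, "model.rds"))
print(prod_model$name)
```

Keeping separate dev and prod folders lets Debra retrain and experiment without changing the model that production scoring reads.
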
Once all the above code has been executed, Debra will use PowerBI to visualize the predictions created from her model.
You can access this dashboard in any of the following ways:
- Open the PowerBI file from the D:\LoanChargeOffSolution\Reports directory on the deployed VM desktop.
- Install PowerBI Desktop on your computer, then download and open the Loan ChargeOff Prediction Dashboard.
- Install PowerBI Desktop on your computer, then download and open the Loan ChargeOff Prediction HDI Dashboard.
She uses an ODBC connection to connect to the data, so that the dashboard always shows the most recently modeled and scored data.
If you want to refresh data in your PowerBI Dashboard, make sure to follow these instructions to set up and use an ODBC connection to the dashboard.
A summary of this process and all the files involved is described in more detail here.

Step 3: Operationalize with Debra and Danny

Debra has completed her tasks. She has connected to the SQL database, executed code from her R IDE that pushed (in part) execution to the SQL machine to clean the data, create new features, train five models and select the champion model. She has scored data, created predictions, and also created a summary report which she will hand off to Bernie - see below.

While this task is complete for the current set of loans, our company will want to perform these actions for each new loan payment. Instead of going back to Debra each time, Danny can operationalize the code in TSQL files which he can then run himself each month for the newest loan payments.

Debra hands over her scripts to Danny who adds the code to the database as stored procedures, using embedded R code, or SQL queries. You can see these procedures by logging into SSMS and opening the Programmability>Stored Procedures section of the LoanChargeOff database.

Log into SSMS using SQL Server Authentication with the username and password provided during deployment.

You can find this script in the SQLR directory, and execute it yourself by following the PowerShell Instructions. As noted earlier, this was already executed when your VM was first created.

Step 4: Deploy and Visualize with Bernie the Business Analyst

Now that the predictions are created and saved, we will meet our last persona - Bernie, the Business Analyst. Bernie will use the Power BI Dashboard to learn more about the loan chargeoff predictions (second tab). He will also review summaries of the loan data used to create the model (first tab).

You can access this dashboard in any of the following ways:

- Open the PowerBI file from the D:\LoanChargeOffSolution\Reports directory on the deployed VM desktop.
- Install PowerBI Desktop on your computer, then download and open the Loan ChargeOff Prediction Dashboard.
- Install PowerBI Desktop on your computer, then download and open the Loan ChargeOff Prediction HDI Dashboard.

Bernie will then let the Lending Institution know about the loan chargeoff predictions. The data in the loanchargeoff_predictions table contains the Score and Probability for each loan payment. The team uses these scores to take further business actions.

Remember that before the data in this dashboard can be refreshed to use your scored data, you must configure the dashboard as Debra did in step 2 of this workflow.
