Typical Workflow
This solution template demonstrates an end-to-end solution that runs predictive analytics on loan data and scores each loan's chargeoff probability. A PowerBI report also walks through the analysis and trends of credit loans and the predicted chargeoff probability.
To demonstrate a typical workflow, we’ll introduce you to a few personas. You can follow along by performing the same steps for each persona.
Step 1: Server Setup and Configuration with Ivan the IT Administrator
The cluster has been created and data loaded for you when you used the Deploy button on the Quick Start page. Once you complete the walkthrough, you will want to delete this cluster as it incurs expense whether it is in use or not - see HDInsight Cluster Maintenance for more details.
Step 2: Data Prep and Modeling with Debra the Data Scientist
Now let’s meet Debra, the Data Scientist. Debra’s job is to use loan payment data to predict loan chargeoff risk. Debra will develop these models using HDInsight, the managed cloud Hadoop solution with integration to Microsoft ML Server.
After analyzing the data, she opted to create multiple models and choose the best one. She will create five machine learning models, compare them, use the one she likes best to compute a prediction for each loan, and then select the loan with the highest probability of chargeoff.
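The selection logic in that paragraph can be sketched in base R. This is only an illustration, not the template's code: the model names, AUC values, the `loanId` column, and the sample rows are all hypothetical placeholders.

```r
# Hypothetical AUC values for five candidate models (illustrative numbers only).
model_auc <- c(model_a = 0.81, model_b = 0.86, model_c = 0.84,
               model_d = 0.79, model_e = 0.83)

# The champion is the candidate with the largest AUC.
champion <- names(model_auc)[which.max(model_auc)]

# Hypothetical predictions from the champion model, one row per loan.
predictions <- data.frame(
  loanId      = c("L001", "L002", "L003", "L004"),
  Probability = c(0.12, 0.67, 0.05, 0.91),
  stringsAsFactors = FALSE
)

# Rank loans from highest to lowest predicted chargeoff probability.
ranked <- predictions[order(predictions$Probability, decreasing = TRUE), ]
riskiest_loan <- ranked$loanId[1]

print(champion)
print(riskiest_loan)
```

In the actual solution the AUC values would come from evaluating each trained model on the test data, and the predictions would be read from the scored output rather than typed in by hand.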
Debra will develop her R scripts in the Open Source Edition of RStudio Server, installed on her cluster's edge node. You can follow along on your own cluster, deployed by using the 'Deploy to Azure' button on the Quick Start page. Access RStudio with a URL of the form:
http://CLUSTERNAME.azurehdinsight.net/rstudio
    library(RevoScaleR)
    library(MicrosoftML)
    library(xgboost)

    # spark cc object
    sparkContext <- rxSparkConnect(consoleOutput = TRUE, reset = TRUE)

    # set compute context to local
    rxSetComputeContext('local')

    # Copy model rds files to local dev folder from HDFS
    LocalDir <- paste("/var/RevoShare/", Sys.info()[["user"]], "/LoanChargeOff/dev/model/", sep = "")
    if (!dir.exists(LocalDir)) {
      system(paste("mkdir -p -m 777 ", LocalDir, sep = "")) # create a new directory
    }
    RemoteFiles <- "/LoanChargeOff/model/*.rds"
    rxHadoopCopyToLocal(source = RemoteFiles, dest = LocalDir)

    # clean up
    rm(list = ls())

Now you're ready to follow along with Debra as she creates the scripts needed for this solution.
- loanchargeoff_main.R is used to define the data and directories and then run all of the steps to process data, perform feature engineering, training, and scoring.
The default input for this script uses 100,000 loans for training models, and will split this into train and test data. After running this script you will see data files in the /LoanChargeOff/dev/temp directory on your storage account. Models are stored in the /LoanChargeOff/dev/model directory on your storage account. The Hive table loanchargeoff_predictions contains the 100,000 records with predictions (Score, Probability) created from the best model.
- Copy_Dev2Prod.R copies the model information from the dev folder to the prod folder to be used for production. This script must be executed once after loanchargeoff_main.R completes, before running loanchargeoff_scoring.R. It can then be used again as desired to update the production model. After running this script, models created during loanchargeoff_main.R are copied into the /var/RevoShare/user/LoanChargeOff/prod/model directory.
- loanchargeoff_scoring.R uses the previously trained model and invokes the steps to process data, perform feature engineering, and scoring. Use this script after first executing loanchargeoff_main.R and Copy_Dev2Prod.R.
The input to this script defaults to 10,000 loans to be scored with the model in the prod directory. After running this script the Hive table loanchargeoff_predictions now contains the predictions.
- step1_get_training_testing_data.R: Reads the input data, which contains the full history for all loans, from HDFS. Extracts training/testing data from the input data based on process date (paydate), and saves the training/testing data in the HDFS working directory.
- step2_feature_engineering.R: Here we use MicrosoftML to do feature selection. Code can be added in this file to create new features based on existing features. Open source packages such as caret can also be used to do feature selection here. The best features are selected using AUC.
- Now she is ready to train the models, using step3_training_evaluation.R. This step trains each candidate model and evaluates it.
The R script draws the ROC (Receiver Operating Characteristic) curve for each prediction model. The curve shows the performance of the model in terms of true positive rate and false positive rate as the decision threshold varies.
The AUC is a number between 0 and 1 that corresponds to the area under the ROC curve. It measures how well the model separates the two classes (loans that charge off vs. loans that do not) across all decision thresholds applied to the predicted probabilities. The closer the AUC is to 1, the better the model. Since we are not looking for an optimal decision threshold, the AUC is more representative of prediction performance than accuracy, which depends on the threshold.
Debra will use the AUC to select the champion model to use in the next step.
- step4_prepare_new_data.R creates a new dataset containing all loans open on a pay date whose status over the next three months is unknown. These loans are not included in the training and testing datasets, and they have the same features as the loans used for training and testing.
- step5_loan_prediction.R takes the new data created in step4 and the champion model created in step3, and outputs the predicted label and the probability of charge-off in the next three months for each loan.
- loanchargeoff_xgboost.R: This step is optional. For more details, see Using XGBoost package in HDInsight Spark Cluster for Loan ChargeOff Prediction.
- After creating the model, Debra runs Copy_Dev2Prod.R to copy the model information from the dev folder to the prod folder, then runs loanchargeoff_scoring.R to create predictions for her new data.
- Once all the above code has been executed, Debra will use PowerBI to visualize the predictions created from her model.
You can access this dashboard in either of the following ways:
- Install PowerBI Desktop on your computer and download and open the Loan ChargeOff Prediction HDI Dashboard.
If you want to refresh data in your PowerBI Dashboard, make sure to follow these instructions to set up and use an ODBC connection to the dashboard.
- A summary of this process and all the files involved is described in more detail here.
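To make the AUC discussion under step3_training_evaluation.R more concrete, here is a minimal base R sketch. It is not the template's evaluation code: `compute_auc` and `rates_at` are hypothetical helper names, and the labels and scores are made-up values. The AUC is computed with the rank-sum identity, which equals the fraction of (charge-off, non-charge-off) pairs that the scores order correctly.

```r
# AUC via the rank-sum identity: the share of positive/negative pairs
# in which the positive example receives the higher score.
compute_auc <- function(labels, scores) {
  stopifnot(length(labels) == length(scores))
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # tied scores receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# True positive rate and false positive rate at a single decision threshold.
rates_at <- function(labels, scores, threshold) {
  predicted <- scores >= threshold
  c(tpr = sum(predicted & labels == 1) / sum(labels == 1),
    fpr = sum(predicted & labels == 0) / sum(labels == 0))
}

# Made-up example: 1 = charged off, 0 = not charged off.
labels <- c(1, 1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)

auc <- compute_auc(labels, scores)
rates_high <- rates_at(labels, scores, threshold = 0.5)
rates_low  <- rates_at(labels, scores, threshold = 0.35)

print(auc)
print(rates_high)
print(rates_low)
```

The two calls to `rates_at` give different true positive rates for the same scores, while the AUC is a single number that does not depend on any one threshold. That is the reason the text prefers AUC over accuracy for comparing models.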
In RStudio, there are multiple ways to execute the code from the R Script window. The fastest way is to use Ctrl-Enter on a single line or a selection.
Step 3: Operationalize with Debra
Debra has completed her tasks. She has executed code from RStudio that pushed (in part) execution to Hadoop to clean the data, create new features, train five models, and select the champion model. She has scored data, created predictions, and also created a summary report which she will hand off to Bernie - see below.
While this task is complete for the current set of loans, our company will want to perform these actions for each new loan payment.
In the steps above, we saw the first way of scoring new data, using the loanchargeoff_scoring.R script. Debra now creates an analytic web service with ML Server Operationalization that incorporates these same steps: data processing and scoring.
loanchargeoff_deployment.R will create a web service and test it on the edge node. If you wish, you can also download the file loanchargeoff_web_scoring.R and access the web service on any computer with Microsoft ML Server installed.
The service can also be used by application developers, which is not shown here.
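As a rough illustration of what "data processing and scoring" in a single entry point might look like, the base R sketch below wraps both steps in one function, which is the kind of unit a web service could expose. Everything here is an assumption for illustration: the simulated data, the column names (`interestRate`, `daysDelinquent`, `charge_off`), the `glm` stand-in model, and the `score_loans` function are hypothetical and are not taken from loanchargeoff_deployment.R.

```r
set.seed(42)

# Simulated training data with hypothetical column names.
n <- 200
train <- data.frame(
  interestRate   = runif(n, 3, 20),
  daysDelinquent = rpois(n, 10)
)
# Simulated label: longer delinquency raises the odds of chargeoff.
train$charge_off <- rbinom(n, 1, plogis(-2 + 0.15 * train$daysDelinquent))

# Stand-in model (plain logistic regression, not the template's champion model).
model <- glm(charge_off ~ interestRate + daysDelinquent,
             data = train, family = binomial())

# Single entry point that performs data processing and then scoring.
score_loans <- function(model, new_loans, threshold = 0.5) {
  # Data processing: keep only rows with complete predictor values.
  predictors <- c("interestRate", "daysDelinquent")
  clean <- new_loans[complete.cases(new_loans[, predictors]), , drop = FALSE]
  # Scoring: predicted probability plus a thresholded label.
  prob <- as.numeric(predict(model, newdata = clean, type = "response"))
  data.frame(clean,
             Probability    = prob,
             PredictedLabel = as.integer(prob >= threshold))
}

new_loans <- data.frame(interestRate   = c(5.5, NA, 12.0),
                        daysDelinquent = c(2, 5, 40))
scored <- score_loans(model, new_loans)
print(scored)
```

Keeping processing and scoring behind one function means a caller only supplies raw loan rows and receives predictions back, regardless of whether the function runs locally or behind a service.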
Step 4: Deploy and Visualize with Bernie the Business Analyst
Now that the predictions are created and saved, we will meet our last persona - Bernie, the Business Analyst. Bernie will use the Power BI Dashboard to learn more about the loan chargeoff predictions (second tab). He will also review summaries of the loan data used to create the model (first tab).
You can access this dashboard in either of the following ways:
- Install PowerBI Desktop on your computer and download and open the Loan ChargeOff Prediction HDI Dashboard.