Loan ChargeOff Prediction

For the Data Scientist - Develop with R


HDInsight is a cloud Spark and Hadoop service for the enterprise. HDInsight is also the only managed cloud Hadoop solution with integration to Microsoft ML Server.

This solution shows how to pre-process data (cleaning and feature engineering), train prediction models, and perform scoring on an HDInsight Spark cluster with Microsoft ML Server.
Data scientists who are testing and developing solutions can work from the browser-based Open Source Edition of RStudio Server on the HDInsight Spark cluster edge node, while using a compute context to control whether computation will be performed locally on the edge node, or whether it will be distributed across the nodes in the HDInsight Spark cluster.
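For example, here is a minimal sketch of switching compute contexts from the R session on the edge node (RevoScaleR function names; the cluster-specific connection settings are omitted and would need to be filled in):

    library(RevoScaleR)

    # Run subsequent rx* / MicrosoftML calls locally on the edge node...
    rxSetComputeContext("local")

    # ...or connect to Spark so the same calls are distributed across the cluster.
    cc <- rxSparkConnect(reset = TRUE)
    # ... run the solution's R scripts here ...
    rxSparkDisconnect(cc)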

Loan ChargeOff Prediction


This solution template demonstrates an end-to-end solution, including platform deployment scripts and R scripts that cover data transformation, feature selection, training, scoring, and operationalization. The template focuses on predicting loan charge-off risk using simulated data. A loan officer can review the top N loans with the highest charge-off probability and formulate incentive plans to encourage those loan holders to continue paying off their loans.

Data scientists can step through the R scripts to understand the feature engineering process and how model performance is measured. The scripts train five models: rxLogisticRegression, rxFastTrees, rxFastForest, rxFastLinear, and rxNeuralNet, and then pick the model with the best performance for scoring.

This template runs training and testing on 100,000 loans and scoring on 10,000 loans from the simulated datasets, which are stored in HDFS. In this solution, an Apache Hive table is created to expose the predictions. This data is then visualized in Power BI.
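As a rough sketch (exact data source support varies by ML Server version, and the paths and table name below are illustrative, not the template's actual ones), the scored results could be pushed from HDFS into a Hive table from a Spark compute context roughly like this:

    library(RevoScaleR)
    cc <- rxSparkConnect(reset = TRUE)

    # Hypothetical locations: an XDF of scored loans on HDFS and the Hive
    # table that Power BI reads from.
    scored_xdf <- RxXdfData("/LoanChargeOff/scored", fileSystem = RxHdfsFileSystem())
    hive_table <- RxHiveData(table = "loanchargeoff_predictions")
    rxDataStep(inData = scored_xdf, outFile = hive_table, overwrite = TRUE)

    rxSparkDisconnect(cc)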

To try this out yourself, visit the Quick Start page.

Below is a more detailed description of what happens in each of the steps: dataset creation, model development, prediction, and deployment.

Analytical Dataset Preprocessing and Feature Engineering


The raw data contains all loans from the past three years, with attributes such as loanId, memberId, loan_open_date, balance, pay date, and payment information. Based on this information, for each open loan and a specific pay date, the payment history before that pay date is computed, and the loan's status over the following three months is checked to derive the label for the input datasets. This part then reads the input data from HDFS to create the training and testing data and performs feature engineering to select the best features.

step1_get_training_testing_data.R : This script reads the input dataset from HDFS, selects training/testing samples based on the pay date, and performs missing value treatment on the input dataset.

step2_feature_engineering.R : This script performs feature selection, using the feature selection transform (selectFeatures) in the MicrosoftML package, to generate the features that are most predictive for modeling. 1. rxLogisticRegression is used for feature selection; by trying different values for the number of features to keep, you can determine how many features work best for modeling. The algorithm used for feature selection here can be changed to another one, such as rxFastForest, rxFastTrees, or rxFastLinear. This exploration can take a fairly long time, so it is optional. 2. If you do not want to run the feature selection exploration, simply set the parameter numFeaturesToKeep to a fixed value; the rxLogisticRegression algorithm will then be used to select the features and return the selected feature names for the next step.
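As a rough illustration of this feature selection step, the sketch below applies the selectFeatures transform from the MicrosoftML package inside an rxLogisticRegression call. The data frame, label column, and feature count are placeholder assumptions, not the template's actual names:

    library(MicrosoftML)

    # Hypothetical training data and label column; in the template these come
    # from step1_get_training_testing_data.R.
    label_col    <- "charge_off"
    feature_cols <- setdiff(names(train_df), c("loanId", "memberId", label_col))
    form <- as.formula(paste(label_col, "~", paste(feature_cols, collapse = " + ")))

    # Keep the top-ranked features by mutual information with the label;
    # numFeaturesToKeep is the knob referred to in the text above.
    fs_model <- rxLogisticRegression(
      formula = form,
      data = train_df,
      type = "binary",
      mlTransforms = list(
        selectFeatures(form, mode = mutualInformation(numFeaturesToKeep = 30))
      )
    )

    # The surviving feature names can then be read off the fitted model summary.
    summary(fs_model)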

Model Development


Five models, Fast Forest, Logistic Regression, Fast Tree, Fast Linear, and Neural Network, are developed to predict which loans will charge off in the next three months. The R code that develops these models and evaluates their performance on the testing dataset is in step3_training_evaluation.R. After creating the training and testing sets in step1 and obtaining the selected feature names in step2, five prediction models are built on the training set using the selected features. Once the models are trained, the AUC, TPR, and TNR of these five models are calculated on the testing set.

The R script draws the ROC curve for each prediction model. It shows the performance of the model in terms of true positive rate and false positive rate as the decision threshold varies. The AUC is a number between 0 and 1 that corresponds to the area under the ROC curve. It measures how well the model separates the two classes (charge-off loans and non-charge-off loans), given a good choice of decision threshold on the predicted probabilities. The closer the AUC is to 1, the better the model. Because we are not looking for one specific optimal decision threshold, the AUC is more representative of prediction performance than overall accuracy.

Since the training data is heavily skewed (there are far more non-charge-off loans than charge-off loans), the AUC alone can be misleading: a model can report a high AUC while having a low TPR and a high TNR. TPR and TNR are therefore combined with the AUC to measure the models' performance and to guarantee that the model with the best AUC also has acceptable TPR and TNR. The model with the best AUC is selected as the champion model and is used for prediction.

Given the training/testing datasets and the selected feature names, this script can be run manually. Note that all the algorithms in this step take the selected features as input; the step can also be changed to pass in all the features in the training set, let each algorithm select the best features itself using the selectFeatures function in MicrosoftML, and then compare the models' performance on the same testing set.
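A minimal sketch of how such an evaluation could be written with rxPredict, rxRoc, and rxAuc from RevoScaleR; the model list, test data frame, and label column (assumed to be coded 0/1) are placeholders rather than the template's actual objects:

    library(RevoScaleR)

    evaluate_model <- function(model, test_df, label = "charge_off") {
      scored <- rxPredict(model, data = test_df, extraVarsToWrite = label)
      # MicrosoftML binary classifiers typically emit PredictedLabel and
      # Probability columns; adjust the names if a learner differs.
      roc <- rxRoc(actualVarName = label, predVarNames = "Probability", data = scored)
      cm  <- table(actual = scored[[label]], predicted = scored$PredictedLabel)
      c(AUC = rxAuc(roc),
        TPR = cm["1", "1"] / sum(cm["1", ]),   # true positive rate
        TNR = cm["0", "0"] / sum(cm["0", ]))   # true negative rate
    }

    # models is a named list of the five trained models from step3.
    metrics  <- t(sapply(models, evaluate_model, test_df = test_df))
    champion <- models[[which.max(metrics[, "AUC"])]]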

Computing ChargeOff Predictions


The champion model is used to predict, for all open loans, which loans will charge off and with what probability. The R code that produces the predictions is in the step4_prepare_new_data.R and step5_loan_prediction.R scripts.

step4_prepare_new_data.R : This script creates a new dataset containing all loans that are open on a pay date and whose status over the next three months is not yet known. The loans in this new dataset are not included in the training and testing datasets, but they have the same features as the loans used for training and testing.

step5_loan_prediction.R : This script takes the new data created in step4 and the champion model created in step3, and outputs the predicted label and the probability of charge-off over the next three months for each loan.
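A minimal sketch of the scoring step, assuming the champion model from step3 and the open-loan data prepared in step4 are available (object and column names are placeholders):

    # Score the open loans with the champion model, keeping the loan identifier
    # so results can be joined back to the source data.
    scored_loans <- rxPredict(
      champion_model,
      data = open_loans_df,
      extraVarsToWrite = c("loanId")
    )

    # Rank loans by predicted charge-off probability so a loan officer can
    # review the top N riskiest loans (output column names may vary by learner).
    top_n <- head(scored_loans[order(-scored_loans$Probability), ], 10)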

Deploy and Visualize Results


Deploy

The script loanchargeoff_deployment.R creates and tests an analytic web service. The web service can then be used from another application to score future data. The file loanchargeoff_web_scoring.R can be downloaded to invoke this web service locally on any computer with Microsoft ML Server installed.
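As a rough sketch of what such a deployment script typically does with the mrsdeploy package (the endpoint URL, credentials, and service name below are placeholders, not the template's actual values):

    library(mrsdeploy)

    # Authenticate against the ML Server operationalization endpoint.
    remoteLogin("http://localhost:12800",
                username = "admin",
                password = "<password>",
                session = FALSE)

    # Hypothetical scoring function wrapping the champion model.
    score_loans <- function(loan_data) {
      rxPredict(champion_model, data = loan_data)
    }

    # Publish the function and model as a versioned web service that other
    # applications can call.
    api <- publishService(
      "loanChargeOffService",
      code    = score_loans,
      model   = champion_model,
      inputs  = list(loan_data = "data.frame"),
      outputs = list(scores = "data.frame"),
      v       = "v1.0.0"
    )

    remoteLogout()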

Visualize

The final step of this solution visualizes these predictions.

You can access this dashboard in either of the following ways:

Template Contents


View the contents of this solution template.

To try this out yourself:
