Loan ChargeOff Prediction

For the IT Administrator


As lending institutions come to recognize the power of their data, machine learning has become an essential tool for growth. In particular, lending institutions can learn payment patterns from their data to predict loan charge off risk.

The key variables to learn from the data include loan payments, past-due information, and remaining balance, from which a given loan can be flagged as a potential charge off. This template provides a lending institution with an analytics tool that helps predict the likelihood of loans being charged off and runs a report on the analytics results stored in HDFS and Hive tables.

While this solution demonstrates the code with 100,000 loans for developing the model, using HDInsight Spark clusters makes it simple to extend to larger datasets, for both training and scoring. The only things that change are the size of the data and the number of clusters; the code remains exactly the same.

System Requirements


This solution uses:

Cluster Maintenance


HDInsight Spark cluster billing starts once a cluster is created and stops when the cluster is deleted. See these instructions for important information about deleting a cluster and re-using your files on a new cluster.

Workflow Automation


Access RStudio on the cluster edge node at a URL of the form http://CLUSTERNAME.azurehdinsight.net/rstudio, then run the script loanchargeoff_main.R to perform all the steps of the solution.
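
As a small illustration of the URL form above, the following Python sketch builds the RStudio address from a cluster name. The helper name rstudio_url and the cluster name "mycluster" are hypothetical placeholders; only the URL pattern itself comes from this page, and the solution's own code is the R script named above.

```python
def rstudio_url(cluster_name: str) -> str:
    """Return the RStudio URL for an HDInsight cluster edge node.

    Follows the form given on this page:
    http://CLUSTERNAME.azurehdinsight.net/rstudio
    """
    name = cluster_name.strip()
    if not name:
        raise ValueError("cluster_name must not be empty")
    return f"http://{name}.azurehdinsight.net/rstudio"


if __name__ == "__main__":
    # "mycluster" is a placeholder, not a real cluster.
    print(rstudio_url("mycluster"))
```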

Data Files


The following data files are available in the LoanChargeOff/Data directory in the storage account associated with the cluster:

File                   Description
Loan_Data1000.csv      Raw data about loan payments for 1,000 members
Loan_Data10000.csv     Raw data about loan payments for 10,000 members
Loan_Data100000.csv    Raw data about loan payments for 100,000 members
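
The file names follow a single pattern, so a script that needs one of these files can derive the path from the sample size. The Python sketch below assumes the directory and file names listed above; the helper name loan_data_path is hypothetical, and the returned path is relative to the storage account rather than a full storage URL.

```python
from pathlib import PurePosixPath

# Directory and sample sizes as listed on this page.
DATA_DIR = PurePosixPath("LoanChargeOff/Data")
AVAILABLE_SIZES = (1000, 10000, 100000)


def loan_data_path(size: int) -> str:
    """Return the relative path of the raw loan data file for `size` members."""
    if size not in AVAILABLE_SIZES:
        raise ValueError(
            f"no data file for size {size}; choose one of {AVAILABLE_SIZES}"
        )
    return str(DATA_DIR / f"Loan_Data{size}.csv")


if __name__ == "__main__":
    # Print the relative path of every available data file.
    for n in AVAILABLE_SIZES:
        print(loan_data_path(n))
```

Rejecting unknown sizes up front keeps a typo such as 100 from silently producing a path to a file that does not exist.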