For the IT Administrator
This solution demonstrates the modeling code with 1,000,000 simulated borrowers. Using HDInsight Spark clusters makes it simple to scale to much larger data, both for training and scoring. As you increase the data size, you may want to add more nodes, but the code itself remains exactly the same.
System Requirements
This solution uses:
Cluster Maintenance
HDInsight Spark cluster billing starts once a cluster is created and stops when the cluster is deleted. See these instructions for important information about deleting a cluster and re-using your files on a new cluster.
Workflow Automation
Access RStudio on the cluster edge node by browsing to a URL of the form `http://CLUSTERNAME.azurehdinsight.net/rstudio`.
Run the script `development_main.R` followed by `deployment_main.R` to perform all the steps of the solution.
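The two scripts above must run in order, since deployment depends on the model produced during development. A minimal sketch of automating this from a shell on the edge node (assuming the scripts are in the current working directory and `Rscript` is on the PATH; adjust paths for your cluster):

```shell
#!/bin/sh
# Run the development and deployment pipelines in order.
# Assumes the solution scripts are in the working directory
# and R (Rscript) is installed on the edge node.
set -e

for script in development_main.R deployment_main.R; do
  if [ ! -f "$script" ] || ! command -v Rscript >/dev/null 2>&1; then
    echo "Skipping $script (script or Rscript not available)"
    continue
  fi
  echo "Running $script"
  Rscript "$script"
done
```

With `set -e`, a failure in `development_main.R` stops the run before deployment is attempted, which is usually what you want for a scheduled job.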
Data Files
The following data files are available in the Loans/Data directory in the storage account associated with the cluster:
| File | Description |
| --- | --- |
Loan.csv | Loan data with 100K rows of the simulated data used to build the end-to-end Loan Credit Risk for SQL solution. (Larger data is generated via script for the HDInsight solution.) |
Borrower.csv | Borrower data with 100K rows of the simulated data used to build the end-to-end Loan Credit Risk for SQL solution. (Larger data is generated via script for HDInsight solution.) |
Loan_Prod.csv | Loan data with 22 rows of the simulated data used in the Production pipeline |
Borrower_Prod.csv | Borrower data with 22 rows of the simulated data used in the Production pipeline |
LoanCreditRisk_Data_Dictionary.xlsx | Schema and description of the input tables and variables |
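The loan and borrower tables are combined before modeling by joining on a shared borrower key. The sketch below illustrates that join with miniature in-memory stand-ins for Loan.csv and Borrower.csv; the column names (`loanId`, `memberId`, `loanAmount`, `annualIncome`) are hypothetical examples here — the actual schema is documented in LoanCreditRisk_Data_Dictionary.xlsx.

```python
import pandas as pd
from io import StringIO

# Hypothetical miniature stand-ins for Loan.csv and Borrower.csv;
# real column names are listed in LoanCreditRisk_Data_Dictionary.xlsx.
loan_csv = StringIO(
    "loanId,memberId,loanAmount\n"
    "1,101,5000\n"
    "2,102,12000\n"
    "3,101,7500\n"
)
borrower_csv = StringIO(
    "memberId,annualIncome\n"
    "101,48000\n"
    "102,91000\n"
)

loans = pd.read_csv(loan_csv)
borrowers = pd.read_csv(borrower_csv)

# Left-join each loan to its borrower so every loan row keeps its
# borrower attributes, mirroring how the two tables are combined
# before feature engineering.
merged = loans.merge(borrowers, on="memberId", how="left")
print(merged.shape)  # (3, 4): three loans, borrower income attached
```

A left join is the natural choice here: every loan must survive the join even if a borrower record were missing, in which case its borrower columns would simply be null.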