Loan Credit Risk

Template Contents


The following is the directory structure for this template:

  • Data This contains the copy of the simulated input data with 100K unique customers.
  • R This contains the R code to simulate the input datasets, pre-process them, create the analytical datasets, train the models, identify the champion model and provide recommendations.
  • Resources This directory contains other resources for the solution package.
  • SQLR This contains T-SQL code to pre-process the datasets, train the models, identify the champion model and provide recommendations. It also contains a PowerShell script to automate the entire process, including loading the data into the database (not included in the T-SQL code).
  • RSparkCluster This contains the R code to pre-process the datasets, train the models, identify the champion model and provide recommendations on a Spark cluster.

In this template with SQL Server R Services, two versions of the SQL implementation, and another version for HDInsight implementation:

  1. Model Development in R IDE. Run the R code in R IDE (e.g., RStudio, R Tools for Visual Studio).

  2. Operationalize in SQL. Run the SQL code in SQL Server using SQLR scripts from SSMS or from the PowerShell script.

  3. HDInsight Solution on Spark Cluster. Run this R code in RStudio on the edge node of the Spark cluster.

Copy of Input Datasets


File Description
Loan.csv Loan data with 100K rows of the simulated data used to build the end-to-end Loan Credit Risk Loan solution for SQL solution. (Larger data is generated via script for HDInsight solution.)
Borrower.csvBorrower data with 100K rows of the simulated data used to build the end-to-end Loan Credit Risk for SQL solution. (Larger data is generated via script for HDInsight solution.)
Loan_Prod.csvLoan data with 22 rows of the simulated data used in the Production pipeline
Borrower_Prod.csvBorrower data with 22 rows of the simulated data used in the Production pipeline
LoanCreditRisk_Data_Dictionary.xlsx Schema and description of the input tables and variables

Model Development in R


These files in the R directory for the SQL solution.

File Description
LoanCreditRisk.rproj Project file for RStudio or Visual Studio
LoanCreditRisk.rxproj Used with the Visual Studio Solution File
LoanCreditRisk.sln Visual Studio Solution File
Loan_Credit_Risk_Notebook.ipynb Contains the Jupyter Notebook file that runs all the .R scripts
modeling_main.R Specifies input paramaters and calls the different functions for the Modeling Pipeline
step1_preprocessing.RCleans the merged data sets: replaces NAs with the global mean (numeric variables) or global mode (character variables)
step2_feature_engineering.R Performs Feature Engineering
step3_train_score_evaluate.R Builds the logistic regression classification model, scores the test data and evaluates
step4_operational_metrics.R Computes operational metrics and performs scores transformations

Operationalize in SQL Server 2016


These files are in the SQLR directory.

File Description
Load_Data.ps1 Loads initial data into SQL Server
Loan_Credit_Risk.ps1 Automates execution of all .sql files and creates stored procedures
create_tables_prod.sql Creates the production tables
create_user.sql Used during initial SQL Server setup to create the user and password and grant permissions
modeling_proc.sql Stored procedure for the modeling/development pipeline
production_proc.sql Stored procedure for the production pipeline
step1_data_processing.sql Replaces Missing values in dataset with the modes or means
step2a_splitting.sql Splits the analytical dataset into Train and Test
step2b_feature_engineering.sql Performs Feature Engineering and creates the Analytical Dataset
step3a_training.sql Trains a Logistic Regression model
step3b_scoring.sql Scores data using the Logistic Regression model
step3c_evaluating.sql Evaluates the model
step4_operational_metrics.sql Computes operational metrics and performs scores transformations

HDInsight Solution on Spark Cluster


These files are in the RSparkCluster directory.

File Description
copy_dev_to_prod.RDefines function, copy_to_prod, used in development_main.R
data_generation.RUsed to generate data, used in development_main.R
deployment_main.RDeploys web scoring function as a web service
development_main.R Full development process
in_memory_scoring.RPerforms in-memory scoring for batch scoring or for scoring remotely with a web service
production_main.R Scores new data using subset of development steps
step0_directories_creation.RCreates initial directories
step1_preprocessing.RMerges data and then cleans the merged data sets: replace NAs with the global mean (numeric variables) or global mode (character variables), used in both development and production
step2_feature_engineering.RPerforms Feature Engineering, used in both development and production
step3_train_score_evaluate.RBuilds the logistic regression classification model, scores the test data and evaluates, used in both development and production
step4_operational_metrics.R Computes operational metrics and performs scores transformations in development, uses metrics and performs score transformations in production

Resources for the Solution Package


File Description
Images Directory of images used for the Readme.md in this package

< Home