Template Contents
The following is the directory structure for this template:
- **Data**: Contains a copy of the simulated input data with 100K unique customers.
- **R**: Contains the R code to simulate the input datasets, pre-process them, create the analytical datasets, train the models, identify the champion model, and provide recommendations.
- **Resources**: Contains other resources for the solution package.
- **SQLR**: Contains the T-SQL code to pre-process the datasets, train the models, identify the champion model, and provide recommendations. It also contains a PowerShell script to automate the entire process, including loading the data into the database (which is not covered by the T-SQL code itself).
- **RSparkCluster**: Contains the R code to pre-process the datasets, train the models, identify the champion model, and provide recommendations on a Spark cluster.
This template includes two versions of the implementation with SQL Server ML Services, plus a version for HDInsight:

- **Model Development in R IDE**. Run the R code in an R IDE (e.g., RStudio, R Tools for Visual Studio). A minimal sketch of launching this pipeline from the IDE follows this list.
- **Operationalize in SQL**. Run the SQL code in SQL Server using the SQLR scripts, either from SSMS or from the PowerShell script.
- **HDInsight Solution on Spark Cluster**. Run the R code in RStudio on the edge node of the Spark cluster.
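For the first option, the snippet below is a minimal sketch of launching the modeling pipeline from an R IDE. The directory path is a placeholder for your local copy of the R folder, and it assumes that sourcing modeling_main.R (which sets the input parameters and calls the step scripts, as described further down) runs the whole pipeline.

```r
# Minimal sketch: run the modeling pipeline from RStudio or R Tools for Visual
# Studio. The path is a placeholder; modeling_main.R sets its input parameters
# and calls the step1-step4 scripts (see the file descriptions further down).
setwd("C:/Solutions/LoanCreditRisk/R")  # placeholder path to the R directory
source("modeling_main.R")               # runs the end-to-end modeling pipeline
```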
Copy of Input Datasets
| File | Description |
|---|---|
| Loan.csv | Loan data with 100K rows of simulated data, used to build the end-to-end Loan Credit Risk solution for the SQL solution. (Larger data is generated via script for the HDInsight solution.) See the loading sketch after this table. |
| Borrower.csv | Borrower data with 100K rows of simulated data, used to build the end-to-end Loan Credit Risk solution for the SQL solution. (Larger data is generated via script for the HDInsight solution.) |
| Loan_Prod.csv | Loan data with 22 rows of the simulated data used in the Production pipeline |
| Borrower_Prod.csv | Borrower data with 22 rows of the simulated data used in the Production pipeline |
| LoanCreditRisk_Data_Dictionary.xlsx | Schema and description of the input tables and variables |
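For orientation, here is a hedged sketch of reading the two main input files and joining loans to borrowers before preprocessing. The file paths and the join key (memberId) are assumptions for illustration; check LoanCreditRisk_Data_Dictionary.xlsx for the actual column names.

```r
# Hypothetical sketch: load the simulated input data and attach borrower
# attributes to each loan. Paths and the join key ("memberId") are assumptions;
# consult LoanCreditRisk_Data_Dictionary.xlsx for the real schema.
loan     <- read.csv("Data/Loan.csv", stringsAsFactors = FALSE)
borrower <- read.csv("Data/Borrower.csv", stringsAsFactors = FALSE)

# One row per loan, enriched with the matching borrower's attributes.
merged <- merge(loan, borrower, by = "memberId", all.x = TRUE)
str(merged)
```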
Model Development in R
These files are in the R directory for the SQL solution.
| File | Description |
|---|---|
| LoanCreditRisk.rproj | Project file for RStudio or Visual Studio |
| LoanCreditRisk.rxproj | Used with the Visual Studio Solution File |
| LoanCreditRisk.sln | Visual Studio Solution File |
| Loan_Credit_Risk_Notebook.ipynb | Contains the Jupyter Notebook file that runs all the .R scripts |
| modeling_main.R | Specifies input parameters and calls the different functions for the Modeling Pipeline |
| step1_preprocessing.R | Cleans the merged data sets: replaces NAs with the global mean (numeric variables) or global mode (character variables); a sketch of this imputation appears just after this table |
| step2_feature_engineering.R | Performs Feature Engineering |
| step3_train_score_evaluate.R | Builds the logistic regression classification model, scores the test data, and evaluates the model (see the train/score/evaluate sketch after the notes below) |
| step4_operational_metrics.R | Computes operational metrics and performs score transformations |
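The imputation described for step1_preprocessing.R can be pictured with the stand-alone sketch below: numeric columns have their NAs replaced by the column's global mean, character columns by their most frequent value. This is a simplified illustration, not the template's exact code.

```r
# Simplified illustration of the NA treatment in step1_preprocessing.R:
# numeric variables are imputed with the global mean, character variables with
# the global mode. A stand-alone sketch, not the template's exact code.
impute_na <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)       # global mean
    } else {
      df[[col]][missing] <- names(which.max(table(df[[col]])))  # global mode
    }
  }
  df
}

# Example (using the merged data frame from the earlier sketch):
# clean <- impute_na(merged)
```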
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
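To make the train/score/evaluate loop in step3_train_score_evaluate.R concrete, the sketch below walks through the same three stages with base R's glm on a toy dataset. The column names, formula, and data are illustrative stand-ins, not the template's actual schema or implementation.

```r
# Stand-alone illustration of the train / score / evaluate stages described for
# step3_train_score_evaluate.R, using base R's glm and toy data as stand-ins.
set.seed(42)
n <- 1000
clean <- data.frame(
  loan_amount   = runif(n, 1e3, 5e4),
  interest_rate = runif(n, 5, 25),
  dti_ratio     = runif(n, 0, 40)
)
clean$is_bad <- rbinom(n, 1, plogis(-4 + 0.1 * clean$interest_rate + 0.03 * clean$dti_ratio))

# Split into training and test sets.
idx   <- sample(nrow(clean), size = 0.7 * nrow(clean))
train <- clean[idx, ]
test  <- clean[-idx, ]

# Train: logistic regression classifier.
model <- glm(is_bad ~ loan_amount + interest_rate + dti_ratio,
             data = train, family = binomial())

# Score: predicted probability that a loan is bad.
test$score <- predict(model, newdata = test, type = "response")

# Evaluate: e.g., AUC with the pROC package (commented out to avoid the dependency).
# auc <- pROC::auc(test$is_bad, test$score)
```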
Operationalize in SQL Server
These files are in the SQLR directory.
| File | Description |
|---|---|
| Load_Data.ps1 | Loads initial data into SQL Server |
| Loan_Credit_Risk.ps1 | Automates execution of all .sql files and creates stored procedures |
| create_tables_prod.sql | Creates the production tables |
| create_user.sql | Used during initial SQL Server setup to create the user and password and grant permissions |
| modeling_proc.sql | Stored procedure for the modeling/development pipeline |
| production_proc.sql | Stored procedure for the production pipeline |
| step1_data_processing.sql | Replaces missing values in the dataset with the modes or means |
| step2a_splitting.sql | Splits the analytical dataset into Train and Test |
| step2b_feature_engineering.sql | Performs Feature Engineering and creates the Analytical Dataset |
| step3a_training.sql | Trains a Logistic Regression model (see the in-database sketch at the end of this section) |
| step3b_scoring.sql | Scores data using the Logistic Regression model |
| step3c_evaluating.sql | Evaluates the model |
| step4_operational_metrics.sql | Computes operational metrics and performs score transformations |
- See For the Database Analyst for more information about these files.
- Follow the PowerShell Instructions to execute the PowerShell script which automates the running of all these .sql files.
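As background on what those stored procedures do under the hood, the sketch below shows the general pattern of in-database R with SQL Server ML Services and RevoScaleR: the data stays in a SQL table and the computation runs inside the server. The connection string, table name, and formula are placeholders and are not taken from this template.

```r
# Hypothetical sketch of the in-database R pattern that stored procedures such
# as step3a_training.sql build on. Connection string, table name, and formula
# are placeholders, not values from this template.
library(RevoScaleR)

connection_string <- "Driver=SQL Server;Server=localhost;Database=Loans;Trusted_Connection=True;"

# Reference a SQL Server table instead of a local data frame.
train_sql <- RxSqlServerData(table = "Loan_Train", connectionString = connection_string)

# Run the computation inside SQL Server rather than in the local R session.
rxSetComputeContext(RxInSqlServer(connectionString = connection_string, wait = TRUE))

# Train a logistic regression on the in-database data.
model <- rxLogit(is_bad ~ loan_amount + interest_rate + dti_ratio, data = train_sql)

rxSetComputeContext("local")  # switch back to local execution when done
```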
HDInsight Solution on Spark Cluster
These files are in the RSparkCluster directory.
| File | Description |
|---|---|
| copy_dev_to_prod.R | Defines the copy_to_prod function used in development_main.R |
| data_generation.R | Generates the simulated input data; used in development_main.R |
| deployment_main.R | Deploys the scoring function as a web service (a minimal deployment sketch appears after the notes below) |
| development_main.R | Runs the full development process |
| in_memory_scoring.R | Performs in-memory scoring for batch scoring or for scoring remotely with a web service |
| production_main.R | Scores new data using a subset of the development steps |
| step0_directories_creation.R | Creates initial directories |
| step1_preprocessing.R | Merges the data and then cleans the merged data sets: replaces NAs with the global mean (numeric variables) or global mode (character variables); used in both development and production |
| step2_feature_engineering.R | Performs Feature Engineering; used in both development and production |
| step3_train_score_evaluate.R | Builds the logistic regression classification model, scores the test data, and evaluates the model; used in both development and production |
| step4_operational_metrics.R | Computes operational metrics and performs score transformations in development; applies those metrics and performs score transformations in production |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
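To make the deployment step concrete, below is a minimal, hypothetical sketch of the pattern deployment_main.R and in_memory_scoring.R follow: wrap the scoring logic in a function and publish it as a web service with the mrsdeploy package. The server URL, credentials, service name, model object, and input/output schema are all placeholders.

```r
# Hypothetical sketch of publishing an in-memory scoring function as a web
# service with mrsdeploy. Every concrete value below (URL, credentials, service
# name, schema) is a placeholder, and `model` stands for a previously trained
# logistic regression.
library(mrsdeploy)

# Scoring function: takes a data frame of new loans, returns it with scores.
score_loans <- function(new_data) {
  new_data$score <- predict(model, newdata = new_data, type = "response")
  new_data
}

remoteLogin("http://edge-node:12800", username = "admin", password = "***")  # placeholders

api <- publishService(
  name    = "loan_credit_risk_scoring",
  code    = score_loans,
  model   = model,
  inputs  = list(new_data = "data.frame"),
  outputs = list(new_data = "data.frame"),
  v       = "v1.0.0"
)
```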
Resources for the Solution Package
| File | Description |
|---|---|
| Images | Directory of images used for the Readme.md in this package |