Template Contents
The following is the directory structure for this template:
- Data This contains the copy of the simulated input data with 100K unique customers.
- R This contains the R code to simulate the input datasets, pre-process them, create the analytical datasets, train the models, identify the champion model and provide recommendations.
- Resources This directory contains other resources for the solution package.
- SQLR This contains T-SQL code to pre-process the datasets, train the models, identify the champion model and provide recommendations. It also contains a PowerShell script to automate the entire process, including loading the data into the database (not included in the T-SQL code).
- RSparkCluster This contains the R code to pre-process the datasets, train the models, identify the champion model and provide recommendations on a Spark cluster.
In this template with SQL Server ML Services, two versions of the SQL implementation, and another version for HDInsight implementation:
- 
    Model Development in R IDE. Run the R code in R IDE (e.g., RStudio, R Tools for Visual Studio). 
- 
    Operationalize in SQL. Run the SQL code in SQL Server using SQLR scripts from SSMS or from the PowerShell script. 
- 
    HDInsight Solution on Spark Cluster. Run this R code in RStudio on the edge node of the Spark cluster. 
Copy of Input Datasets
| File | Description | 
|---|---|
| Loan.csv | Loan data with 100K rows of the simulated data used to build the end-to-end Loan Credit Risk Loan solution for SQL solution. (Larger data is generated via script for HDInsight solution.) | 
| Borrower.csv | Borrower data with 100K rows of the simulated data used to build the end-to-end Loan Credit Risk for SQL solution. (Larger data is generated via script for HDInsight solution.) | 
| Loan_Prod.csv | Loan data with 22 rows of the simulated data used in the Production pipeline | 
| Borrower_Prod.csv | Borrower data with 22 rows of the simulated data used in the Production pipeline | 
| LoanCreditRisk_Data_Dictionary.xlsx | Schema and description of the input tables and variables | 
Model Development in R
These files in the R directory for the SQL solution.
| File | Description | 
|---|---|
| LoanCreditRisk.rproj | Project file for RStudio or Visual Studio | 
| LoanCreditRisk.rxproj | Used with the Visual Studio Solution File | 
| LoanCreditRisk.sln | Visual Studio Solution File | 
| Loan_Credit_Risk_Notebook.ipynb | Contains the Jupyter Notebook file that runs all the .R scripts | 
| modeling_main.R | Specifies input paramaters and calls the different functions for the Modeling Pipeline | 
| step1_preprocessing.R | Cleans the merged data sets: replaces NAs with the global mean (numeric variables) or global mode (character variables) | 
| step2_feature_engineering.R | Performs Feature Engineering | 
| step3_train_score_evaluate.R | Builds the logistic regression classification model, scores the test data and evaluates | 
| step4_operational_metrics.R | Computes operational metrics and performs scores transformations | 
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
Operationalize in SQL Server
These files are in the SQLR directory.
| File | Description | 
|---|---|
| Load_Data.ps1 | Loads initial data into SQL Server | 
| Loan_Credit_Risk.ps1 | Automates execution of all .sql files and creates stored procedures | 
| create_tables_prod.sql | Creates the production tables | 
| create_user.sql | Used during initial SQL Server setup to create the user and password and grant permissions | 
| modeling_proc.sql | Stored procedure for the modeling/development pipeline | 
| production_proc.sql | Stored procedure for the production pipeline | 
| step1_data_processing.sql | Replaces Missing values in dataset with the modes or means | 
| step2a_splitting.sql | Splits the analytical dataset into Train and Test | 
| step2b_feature_engineering.sql | Performs Feature Engineering and creates the Analytical Dataset | 
| step3a_training.sql | Trains a Logistic Regression model | 
| step3b_scoring.sql | Scores data using the Logistic Regression model | 
| step3c_evaluating.sql | Evaluates the model | 
| step4_operational_metrics.sql | Computes operational metrics and performs scores transformations | 
- See For the Database Analyst for more information about these files.
- Follow the PowerShell Instructions to execute the PowerShell script which automates the running of all these .sql files.
HDInsight Solution on Spark Cluster
These files are in the RSparkCluster directory.
| File | Description | 
|---|---|
| copy_dev_to_prod.R | Defines function, copy_to_prod, used in development_main.R | 
| data_generation.R | Used to generate data, used in development_main.R | 
| deployment_main.R | Deploys web scoring function as a web service | 
| development_main.R | Full development process | 
| in_memory_scoring.R | Performs in-memory scoring for batch scoring or for scoring remotely with a web service | 
| production_main.R | Scores new data using subset of development steps | 
| step0_directories_creation.R | Creates initial directories | 
| step1_preprocessing.R | Merges data and then cleans the merged data sets: replace NAs with the global mean (numeric variables) or global mode (character variables), used in both development and production | 
| step2_feature_engineering.R | Performs Feature Engineering, used in both development and production | 
| step3_train_score_evaluate.R | Builds the logistic regression classification model, scores the test data and evaluates, used in both development and production | 
| step4_operational_metrics.R | Computes operational metrics and performs scores transformations in development, uses metrics and performs score transformations in production | 
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
Resources for the Solution Package
| File | Description | 
|---|---|
| Images | Directory of images used for the Readme.md in this package | 
 
          