Template Contents
The following is the directory structure for this template:
- **Data**: Contains a copy of the simulated input data with 100K unique customers.
- **R**: Contains the R code to simulate the input datasets, pre-process them, create the analytical datasets, train the models, identify the champion model, and provide recommendations.
- **Resources**: Contains other resources for the solution package.
- **SQLR**: Contains T-SQL code to pre-process the datasets, train the models, identify the champion model, and provide recommendations. It also contains a PowerShell script to automate the entire process, including loading the data into the database (a step not covered by the T-SQL code).
- **RSparkCluster**: Contains the R code to pre-process the datasets, train the models, identify the champion model, and provide recommendations on a Spark cluster.
This template provides two versions of the implementation with SQL Server ML Services, plus a version for HDInsight:

- **Model Development in R IDE**. Run the R code in an R IDE (e.g., RStudio, R Tools for Visual Studio); a launch sketch follows this list.
- **Operationalize in SQL**. Run the SQL code in SQL Server using the SQLR scripts from SSMS, or from the PowerShell script.
- **HDInsight Solution on Spark Cluster**. Run the R code in RStudio on the edge node of the Spark cluster.
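For the first option, modeling_main.R is the entry point. A minimal launch sketch, assuming the template has been copied locally (the path below is hypothetical):

```R
# Minimal sketch: run the R modeling pipeline from an IDE.
# The local path is hypothetical; point it at your copy of the template.
setwd("C:/Solutions/LoanCreditRisk/R")
source("modeling_main.R")   # sets input parameters and calls the step scripts
```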
Copy of Input Datasets
File | Description |
---|---|
Loan.csv | Loan data with 100K rows of simulated data used to build the end-to-end Loan Credit Risk solution for SQL Server. (Larger data is generated via script for the HDInsight solution.) A loading sketch follows this table. |
Borrower.csv | Borrower data with 100K rows of simulated data used to build the end-to-end Loan Credit Risk solution for SQL Server. (Larger data is generated via script for the HDInsight solution.) |
Loan_Prod.csv | Loan data with 22 rows of the simulated data used in the Production pipeline |
Borrower_Prod.csv | Borrower data with 22 rows of the simulated data used in the Production pipeline |
LoanCreditRisk_Data_Dictionary.xlsx | Schema and description of the input tables and variables |
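As a quick orientation to these inputs, here is a minimal sketch of loading and joining the two development files in R. The join key `memberId` is an assumption; confirm the actual key columns in LoanCreditRisk_Data_Dictionary.xlsx.

```R
# Minimal sketch: read the two input files and join them on the shared
# borrower key. "memberId" is an assumed column name; check the data
# dictionary for the real schema.
loan     <- read.csv("Data/Loan.csv",     stringsAsFactors = FALSE)
borrower <- read.csv("Data/Borrower.csv", stringsAsFactors = FALSE)

# One row per loan, enriched with that borrower's attributes.
merged <- merge(loan, borrower, by = "memberId", all.x = TRUE)
str(merged)
```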
Model Development in R
These files are in the R directory for the SQL solution.
File | Description |
---|---|
LoanCreditRisk.rproj | Project file for RStudio or Visual Studio |
LoanCreditRisk.rxproj | Used with the Visual Studio Solution File |
LoanCreditRisk.sln | Visual Studio Solution File |
Loan_Credit_Risk_Notebook.ipynb | Contains the Jupyter Notebook file that runs all the .R scripts |
modeling_main.R | Specifies input parameters and calls the different functions for the Modeling Pipeline
step1_preprocessing.R | Cleans the merged data sets: replaces NAs with the global mean (numeric variables) or global mode (character variables); sketched below
step2_feature_engineering.R | Performs Feature Engineering |
step3_train_score_evaluate.R | Builds the logistic regression classification model, scores the test data, and evaluates the model
step4_operational_metrics.R | Computes operational metrics and performs score transformations
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
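To make the step 1 cleaning rule concrete, here is a minimal, self-contained sketch; the toy merged data frame and its columns are stand-ins, not the template's schema.

```R
# Minimal sketch of the step1_preprocessing.R rule: numeric NAs take the
# column's global mean, character NAs its global mode. `merged` is a tiny
# synthetic stand-in for the joined Loan/Borrower data.
merged <- data.frame(loanAmount = c(1000, NA, 3000),
                     purpose    = c("car", NA, "car"),
                     stringsAsFactors = FALSE)

global_mode <- function(x) names(which.max(table(x)))  # most frequent value

clean <- as.data.frame(lapply(merged, function(col) {
  if (is.numeric(col)) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
  } else {
    col[is.na(col)] <- global_mode(col)
  }
  col
}), stringsAsFactors = FALSE)
clean
```

The shipped script applies the same rule at scale; treat this purely as an illustration.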
Operationalize in SQL Server
These files are in the SQLR directory.
File | Description |
---|---|
Load_Data.ps1 | Loads initial data into SQL Server |
Loan_Credit_Risk.ps1 | Automates execution of all .sql files and creates stored procedures |
create_tables_prod.sql | Creates the production tables |
create_user.sql | Used during initial SQL Server setup to create the user and password and grant permissions |
modeling_proc.sql | Stored procedure for the modeling/development pipeline |
production_proc.sql | Stored procedure for the production pipeline |
step1_data_processing.sql | Replaces missing values in the dataset with the modes or means |
step2a_splitting.sql | Splits the analytical dataset into Train and Test |
step2b_feature_engineering.sql | Performs Feature Engineering and creates the Analytical Dataset |
step3a_training.sql | Trains a Logistic Regression model |
step3b_scoring.sql | Scores data using the Logistic Regression model |
step3c_evaluating.sql | Evaluates the model |
step4_operational_metrics.sql | Computes operational metrics and performs score transformations (illustrated below) |
- See For the Database Analyst for more information about these files.
- Follow the PowerShell Instructions to execute the PowerShell script which automates the running of all these .sql files.
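Since this template runs its analytics through SQL Server ML Services, step4_operational_metrics.sql mirrors the logic of the R step 4 script. As an illustration only, with synthetic placeholder scores and outcomes (both assumptions), one common form of operational metric, the observed bad rate per score bin, looks like this in R:

```R
# Illustration only: bin predicted probabilities and measure the observed
# bad rate per bin, so a business user can choose a score cutoff.
# All data here is synthetic placeholder data.
set.seed(1)
scores <- runif(1000)              # stand-in for predicted probabilities
isBad  <- rbinom(1000, 1, scores)  # stand-in for observed loan outcomes

bins    <- cut(scores, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
metrics <- aggregate(isBad, by = list(scoreBin = bins), FUN = mean)
names(metrics)[2] <- "badRate"
metrics
```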
HDInsight Solution on Spark Cluster
These files are in the RSparkCluster directory.
File | Description |
---|---|
copy_dev_to_prod.R | Defines the copy_to_prod function used in development_main.R
data_generation.R | Generates the larger simulated datasets; used in development_main.R
deployment_main.R | Deploys web scoring function as a web service |
development_main.R | Full development process |
in_memory_scoring.R | Performs in-memory scoring for batch scoring or for scoring remotely with a web service |
production_main.R | Scores new data using a subset of the development steps
step0_directories_creation.R | Creates initial directories |
step1_preprocessing.R | Merges the data sets and then cleans the merged data: replaces NAs with the global mean (numeric variables) or global mode (character variables); used in both development and production
step2_feature_engineering.R | Performs Feature Engineering, used in both development and production |
step3_train_score_evaluate.R | Builds the logistic regression classification model, scores the test data, and evaluates the model (sketched below); used in both development and production
step4_operational_metrics.R | Computes operational metrics and performs score transformations in development; applies those metrics to transform scores in production
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
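To ground the step 3 train/score/evaluate pattern, here is a minimal base-R sketch. The template's scripts use scalable functions so the same logic can run on SQL Server or a Spark cluster; glm() stands in here, and the synthetic data frame, the isBad label, and the 70/30 split are all assumptions.

```R
# Minimal sketch of the step 3 pattern: train a logistic regression,
# score the test set, and evaluate with AUC. Data is synthetic so the
# sketch is runnable; column names are hypothetical.
set.seed(42)
n     <- 2000
clean <- data.frame(income = rnorm(n, 50, 15), dti = runif(n, 0, 40))
p     <- plogis(-4 + 0.1 * clean$dti - 0.02 * clean$income)
clean$isBad <- rbinom(n, 1, p)

idx   <- sample(n, size = 0.7 * n)   # assumed 70/30 train/test split
train <- clean[idx, ]
test  <- clean[-idx, ]

model      <- glm(isBad ~ ., data = train, family = binomial)
test$score <- predict(model, newdata = test, type = "response")

# AUC via the rank formulation, with no extra packages.
pos <- test$score[test$isBad == 1]
neg <- test$score[test$isBad == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc
```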
Resources for the Solution Package
File | Description |
---|---|
Images | Directory of images used for the Readme.md in this package |