Template Contents
The following is the directory structure for this template:
- **Data**: Contains a copy of the simulated input data with 100K unique customers.
- **R**: Contains the R code to simulate the input datasets, pre-process them, create the analytical datasets, train the models, identify the champion model, and provide recommendations.
- **Resources**: Contains other resources for the solution package.
- **SQLR**: Contains the T-SQL code to pre-process the datasets, train the models, identify the champion model, and provide recommendations. It also contains a PowerShell script to automate the entire process, including loading the data into the database (which is not covered by the T-SQL code itself).
- **RSparkCluster**: Contains the R code to pre-process the datasets, train the models, identify the champion model, and provide recommendations on a Spark cluster.
This template includes two versions of the implementation with SQL Server ML Services, plus a version for HDInsight:

- **Model Development in R IDE**. Run the R code in an R IDE (e.g., RStudio, R Tools for Visual Studio). A minimal sketch of launching this pipeline from the IDE follows this list.
- **Operationalize in SQL**. Run the SQL code in SQL Server using the SQLR scripts, either from SSMS or from the PowerShell script.
- **HDInsight Solution on Spark Cluster**. Run the R code in RStudio on the edge node of the Spark cluster.
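For the first option, the snippet below is a minimal sketch of launching the modeling pipeline from an R IDE. The directory path is a placeholder for your local copy of the R folder, and it assumes that sourcing modeling_main.R (which sets the input parameters and calls the step scripts, as described further down) runs the whole pipeline.

```r
# Minimal sketch: run the modeling pipeline from RStudio or R Tools for Visual
# Studio. The path is a placeholder; modeling_main.R sets its input parameters
# and calls the step1-step4 scripts (see the file descriptions further down).
setwd("C:/Solutions/LoanCreditRisk/R")  # placeholder path to the R directory
source("modeling_main.R")               # runs the end-to-end modeling pipeline
```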
Copy of Input Datasets
| File | Description |
|---|---|
| Loan.csv | Loan data with 100K rows of simulated data, used to build the end-to-end Loan Credit Risk solution for the SQL solution. (Larger data is generated via script for the HDInsight solution.) See the loading sketch after this table. |
| Borrower.csv | Borrower data with 100K rows of simulated data, used to build the end-to-end Loan Credit Risk solution for the SQL solution. (Larger data is generated via script for the HDInsight solution.) |
| Loan_Prod.csv | Loan data with 22 rows of the simulated data used in the Production pipeline |
| Borrower_Prod.csv | Borrower data with 22 rows of the simulated data used in the Production pipeline |
| LoanCreditRisk_Data_Dictionary.xlsx | Schema and description of the input tables and variables |
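For orientation, here is a hedged sketch of reading the two main input files and joining loans to borrowers before preprocessing. The file paths and the join key (memberId) are assumptions for illustration; check LoanCreditRisk_Data_Dictionary.xlsx for the actual column names.

```r
# Hypothetical sketch: load the simulated input data and attach borrower
# attributes to each loan. Paths and the join key ("memberId") are assumptions;
# consult LoanCreditRisk_Data_Dictionary.xlsx for the real schema.
loan     <- read.csv("Data/Loan.csv", stringsAsFactors = FALSE)
borrower <- read.csv("Data/Borrower.csv", stringsAsFactors = FALSE)

# One row per loan, enriched with the matching borrower's attributes.
merged <- merge(loan, borrower, by = "memberId", all.x = TRUE)
str(merged)
```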
Model Development in R
These files are in the R directory for the SQL solution.
| File | Description |
|---|---|
| LoanCreditRisk.rproj | Project file for RStudio or Visual Studio |
| LoanCreditRisk.rxproj | Used with the Visual Studio Solution File |
| LoanCreditRisk.sln | Visual Studio Solution File |
| Loan_Credit_Risk_Notebook.ipynb | Contains the Jupyter Notebook file that runs all the .R scripts |
| modeling_main.R | Specifies input parameters and calls the different functions for the Modeling Pipeline |
| step1_preprocessing.R | Cleans the merged data sets: replaces NAs with the global mean (numeric variables) or global mode (character variables); a sketch of this imputation appears just after this table |
| step2_feature_engineering.R | Performs Feature Engineering |
| step3_train_score_evaluate.R | Builds the logistic regression classification model, scores the test data, and evaluates the model (see the train/score/evaluate sketch after the notes below) |
| step4_operational_metrics.R | Computes operational metrics and performs score transformations |
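The imputation described for step1_preprocessing.R can be pictured with the stand-alone sketch below: numeric columns have their NAs replaced by the column's global mean, character columns by their most frequent value. This is a simplified illustration, not the template's exact code.

```r
# Simplified illustration of the NA treatment in step1_preprocessing.R:
# numeric variables are imputed with the global mean, character variables with
# the global mode. A stand-alone sketch, not the template's exact code.
impute_na <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)       # global mean
    } else {
      df[[col]][missing] <- names(which.max(table(df[[col]])))  # global mode
    }
  }
  df
}

# Example (using the merged data frame from the earlier sketch):
# clean <- impute_na(merged)
```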
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
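To make the train/score/evaluate loop in step3_train_score_evaluate.R concrete, the sketch below walks through the same three stages with base R's glm on a toy dataset. The column names, formula, and data are illustrative stand-ins, not the template's actual schema or implementation.

```r
# Stand-alone illustration of the train / score / evaluate stages described for
# step3_train_score_evaluate.R, using base R's glm and toy data as stand-ins.
set.seed(42)
n <- 1000
clean <- data.frame(
  loan_amount   = runif(n, 1e3, 5e4),
  interest_rate = runif(n, 5, 25),
  dti_ratio     = runif(n, 0, 40)
)
clean$is_bad <- rbinom(n, 1, plogis(-4 + 0.1 * clean$interest_rate + 0.03 * clean$dti_ratio))

# Split into training and test sets.
idx   <- sample(nrow(clean), size = 0.7 * nrow(clean))
train <- clean[idx, ]
test  <- clean[-idx, ]

# Train: logistic regression classifier.
model <- glm(is_bad ~ loan_amount + interest_rate + dti_ratio,
             data = train, family = binomial())

# Score: predicted probability that a loan is bad.
test$score <- predict(model, newdata = test, type = "response")

# Evaluate: e.g., AUC with the pROC package (commented out to avoid the dependency).
# auc <- pROC::auc(test$is_bad, test$score)
```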
Operationalize in SQL Server
These files are in the SQLR directory.
| File | Description |
|---|---|
| Load_Data.ps1 | Loads initial data into SQL Server |
| Loan_Credit_Risk.ps1 | Automates execution of all .sql files and creates stored procedures |
| create_tables_prod.sql | Creates the production tables |
| create_user.sql | Used during initial SQL Server setup to create the user and password and grant permissions |
| modeling_proc.sql | Stored procedure for the modeling/development pipeline |
| production_proc.sql | Stored procedure for the production pipeline |
| step1_data_processing.sql | Replaces missing values in the dataset with the modes or means |
| step2a_splitting.sql | Splits the analytical dataset into Train and Test |
| step2b_feature_engineering.sql | Performs Feature Engineering and creates the Analytical Dataset |
| step3a_training.sql | Trains a Logistic Regression model (see the in-database sketch at the end of this section) |
| step3b_scoring.sql | Scores data using the Logistic Regression model |
| step3c_evaluating.sql | Evaluates the model |
| step4_operational_metrics.sql | Computes operational metrics and performs score transformations |
- See For the Database Analyst for more information about these files.
- Follow the PowerShell Instructions to execute the PowerShell script which automates the running of all these .sql files.
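As background on what those stored procedures do under the hood, the sketch below shows the general pattern of in-database R with SQL Server ML Services and RevoScaleR: the data stays in a SQL table and the computation runs inside the server. The connection string, table name, and formula are placeholders and are not taken from this template.

```r
# Hypothetical sketch of the in-database R pattern that stored procedures such
# as step3a_training.sql build on. Connection string, table name, and formula
# are placeholders, not values from this template.
library(RevoScaleR)

connection_string <- "Driver=SQL Server;Server=localhost;Database=Loans;Trusted_Connection=True;"

# Reference a SQL Server table instead of a local data frame.
train_sql <- RxSqlServerData(table = "Loan_Train", connectionString = connection_string)

# Run the computation inside SQL Server rather than in the local R session.
rxSetComputeContext(RxInSqlServer(connectionString = connection_string, wait = TRUE))

# Train a logistic regression on the in-database data.
model <- rxLogit(is_bad ~ loan_amount + interest_rate + dti_ratio, data = train_sql)

rxSetComputeContext("local")  # switch back to local execution when done
```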
HDInsight Solution on Spark Cluster
These files are in the RSparkCluster directory.
| File | Description |
|---|---|
| copy_dev_to_prod.R | Defines the copy_to_prod function used in development_main.R |
| data_generation.R | Generates the simulated input data; used in development_main.R |
| deployment_main.R | Deploys the scoring function as a web service (a minimal deployment sketch appears after the notes below) |
| development_main.R | Runs the full development process |
| in_memory_scoring.R | Performs in-memory scoring for batch scoring or for scoring remotely with a web service |
| production_main.R | Scores new data using a subset of the development steps |
| step0_directories_creation.R | Creates initial directories |
| step1_preprocessing.R | Merges the data and then cleans the merged data sets: replaces NAs with the global mean (numeric variables) or global mode (character variables); used in both development and production |
| step2_feature_engineering.R | Performs Feature Engineering; used in both development and production |
| step3_train_score_evaluate.R | Builds the logistic regression classification model, scores the test data, and evaluates the model; used in both development and production |
| step4_operational_metrics.R | Computes operational metrics and performs score transformations in development; applies those metrics and performs score transformations in production |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
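To make the deployment step concrete, below is a minimal, hypothetical sketch of the pattern deployment_main.R and in_memory_scoring.R follow: wrap the scoring logic in a function and publish it as a web service with the mrsdeploy package. The server URL, credentials, service name, model object, and input/output schema are all placeholders.

```r
# Hypothetical sketch of publishing an in-memory scoring function as a web
# service with mrsdeploy. Every concrete value below (URL, credentials, service
# name, schema) is a placeholder, and `model` stands for a previously trained
# logistic regression.
library(mrsdeploy)

# Scoring function: takes a data frame of new loans, returns it with scores.
score_loans <- function(new_data) {
  new_data$score <- predict(model, newdata = new_data, type = "response")
  new_data
}

remoteLogin("http://edge-node:12800", username = "admin", password = "***")  # placeholders

api <- publishService(
  name    = "loan_credit_risk_scoring",
  code    = score_loans,
  model   = model,
  inputs  = list(new_data = "data.frame"),
  outputs = list(new_data = "data.frame"),
  v       = "v1.0.0"
)
```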
Resources for the Solution Package
| File | Description |
|---|---|
| Images | Directory of images used for the Readme.md in this package |