Template Contents
The following is the directory structure for this template:
- Data: Contains a copy of the simulated input data with 100K unique customers.
- R: Contains the R code to simulate the input datasets, pre-process them, create the analytical datasets, train the models, identify the champion model, and provide predictions.
- SQLR: Contains the T-SQL code to pre-process the datasets, train the models, identify the champion model, and provide predictions. It also contains a PowerShell script to automate the entire process, including loading the data into the database (not included in the T-SQL code).
- HDI: Contains the R code to pre-process the datasets, train the models, identify the champion model, and provide predictions on a Spark cluster.
This template provides two versions of the SQL Server ML Services implementation, plus a version for HDInsight:
- Model Development in R IDE. Run the R code in an R IDE (e.g., RStudio, R Tools for Visual Studio).
- Operationalize in SQL. Run the SQL code in SQL Server using the SQLR scripts from SSMS or via the PowerShell script.
- HDInsight Solution on Spark Cluster. Run this R code in RStudio on the edge node of the Spark cluster.
Copy of Input Datasets
File | Description |
---|---|
loan_info.csv | Raw data about each loan of a lending institution |
member_info.csv | Raw data about each member of a lending institution |
payments_info.csv | Raw data about loan payment history |
File | Description |
---|---|
Loan_Data1000.csv | Raw data about loan payment history for 1000 members |
Loan_Data10000.csv | Raw data about loan payment history for 10000 members |
Loan_Data100000.csv | Raw data about loan payment history for 100000 members |
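For a quick look at these inputs before running the pipeline, a minimal R sketch such as the one below works; it assumes the files sit in the template's Data directory, so adjust the paths to your checkout.

```r
# Minimal sketch for inspecting the input data; assumes the CSV files
# sit in the template's Data directory -- adjust paths to your checkout.
loan_info   <- read.csv("Data/loan_info.csv", stringsAsFactors = FALSE)
member_info <- read.csv("Data/member_info.csv", stringsAsFactors = FALSE)
payments    <- read.csv("Data/payments_info.csv", stringsAsFactors = FALSE)

str(loan_info)     # column types and a preview of values
nrow(member_info)  # should reflect the 100K unique customers
head(payments)
```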
Model Development in R
These files are in the SQLR directory for the SQL solution.
File | Description |
---|---|
step2a_optional_feature_selection.sql | SQL script with embedded R that performs feature selection |
step3_train_test_model.sql | SQL script with embedded R that trains five different models, selects the best-performing one, and tests it |
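The champion-selection pattern behind step3_train_test_model.sql looks roughly like the R sketch below. It is illustrative only: the learner list is truncated to two of the five models, and the dataset and column names (train, test, charge_off, and the predictors) are hypothetical.

```r
# Illustrative champion-selection pattern, not the template's exact code.
# Assumes MicrosoftML is installed and that `train`/`test` data frames with
# a binary `charge_off` label exist; all column names here are hypothetical.
library(MicrosoftML)

formula <- charge_off ~ loan_amount + interest_rate + payment_ratio

models <- list(
  logistic   = rxLogisticRegression(formula, data = train),
  fast_trees = rxFastTrees(formula, data = train)
  # ...the template trains five learners in total
)

# Score the test split with each model and keep the one with the best AUC.
aucs <- sapply(models, function(m) {
  scores <- rxPredict(m, data = test, extraVarsToWrite = "charge_off")
  rxAuc(rxRoc("charge_off", "Probability", scores))
})
champion <- models[[which.max(aucs)]]
```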
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
Operationalize in SQL
These files are in the SQLR directory.
File | Description |
---|---|
step4_chargeoff_batch_prediction.sql | SQL script with embedded R that scores the data split created in Step 2 with the champion model and stores the predictions in the [dbo].[loan_chargeoff_prediction_10k] table. |
step4a_chargeoff_ondemand_prediction.sql | Creates the chargeoff_ondemand_prediction stored procedure for ad-hoc scoring: it can be called with a single record and returns a single prediction to the caller. |
Loan_chargeoff.ps1 | Loads the input data into SQL Server and automates the running of all the .sql files |
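Once the stored procedure exists, any SQL client can call it. Below is a hedged sketch from R via RODBC; the connection string and the @loanId parameter are placeholders, so check step4a_chargeoff_ondemand_prediction.sql for the procedure's actual signature.

```r
# Hedged sketch: invoking the ad-hoc scoring procedure from R via RODBC.
# The connection string and parameter list are placeholders; see
# step4a_chargeoff_ondemand_prediction.sql for the real signature.
library(RODBC)

ch <- odbcDriverConnect(
  "Driver=SQL Server;Server=localhost;Database=loan_chargeoff;Trusted_Connection=yes"
)
# Pass one record's identifying values; a single prediction row comes back.
result <- sqlQuery(ch, "EXEC chargeoff_ondemand_prediction @loanId = 12345")
print(result)
odbcClose(ch)
```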
- See For the Database Analyst for more information about these files.
- Follow the PowerShell Instructions to execute the PowerShell script which automates the running of all these .sql files.
HDInsight Solution on Spark Cluster
These files are in the HDI/RSparkCluster directory.
File | Description |
---|---|
Copy_Dev2Prod.R | Copies a model from the dev folder to the prod folder for production use |
loanchargeoff_main.R | Defines the data and directories, then runs all of the steps: data processing, feature engineering, training, and scoring. |
loanchargeoff_scoring.R | Uses the previously trained model and invokes the steps for data processing, feature engineering, and scoring. |
loanchargeoff_deployment.R | Creates a web service and tests it on the edge node |
loanchargeoff_web_scoring.R | Accesses the web service from any computer with Microsoft ML Server installed |
loan_main.R | Main R script that executes the rest of the R scripts |
loan_scoring.R | Performs loan scoring using the best-performing model |
step1_get_training_testing_data.R | Reads the input data, which contains the history for all loans, from HDFS, and extracts training/testing data from it based on the process date (paydate). |
step2_feature_engineering.R | Uses MicrosoftML to perform feature selection. Code can be added to this file to create new features from existing ones; open-source packages such as caret can also be used for feature selection here. The best features are selected by AUC. |
step3_training_evaluation.R | Trains five different models and evaluates each. |
step4_prepare_new_data.R | Creates a new dataset containing all loans open on a pay date whose status over the next three months is unknown. These loans are not included in the training and testing datasets, but have the same features as the loans used there. |
step5_loan_prediction.R | Takes the new data created in step 4 and the champion model from step 3, and outputs the predicted label and charge-off probability for each loan over the next three months. |
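On HDInsight these scripts run from the edge node under a Spark compute context. A minimal sketch of that setup follows; it assumes Microsoft R Server's RevoScaleR is available, and loanchargeoff_main.R defines the real paths and parameters.

```r
# Minimal sketch of driving the pipeline from the Spark cluster's edge node.
# Assumes Microsoft R Server (RevoScaleR) is installed; the step scripts are
# the ones listed above, and loanchargeoff_main.R holds the real setup.
library(RevoScaleR)

cc <- rxSparkConnect(reset = TRUE)  # start (or attach to) a Spark session

# Run the numbered steps in order, as loanchargeoff_main.R does.
source("step1_get_training_testing_data.R")
source("step2_feature_engineering.R")
source("step3_training_evaluation.R")
source("step4_prepare_new_data.R")
source("step5_loan_prediction.R")

rxSparkDisconnect(cc)
```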
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.