For the Database Analyst - Operationalize with SQL
With the simulated data and R scripts contained in this solution, application analyst is able to use SQL Server 2017 to evaluate the solution end to end, including the steps needed to deploy machine learning model in SQL Server and consumed by business application. This template deploys a Data Science Virtual Machine (DSVM) that has SQL Server 2017 with Microsoft ML Server installed.
For businesses that prefer an on-prem solution, the implementation with SQL Server ML Services is a great option, which takes advantage of the power of SQL Server and RevoScaleR (Microsoft ML Server). In this template, we implemented all steps in SQL stored procedures: data preprocessing, and feature engineering are implemented in pure SQL, while data cleaning, and the model training, scoring and evaluation steps are implemented with SQL stored procedures calling R (Microsoft ML Server) code.
All the steps can be executed on SQL Server client environment (SQL Server Management Studio). We provide a Windows PowerShell script which invokes the SQL scripts and demonstrates the end-to-end modeling process.
System Requirements
The following are required to run the scripts in this solution:
- SQL Server (2016 or higher) with Microsoft ML Server installed and configured.
- The SQL user name and password, and the user configured properly to execute R scripts in-memory.
- SQL Database which the user has write permission and execute stored procedures.
- For more information about SQL server and ML Services, please visit: https://docs.microsoft.com/en-us/sql/advanced-analytics/what-s-new-in-sql-server-machine-learning-services
</ul>
Workflow Automation
Follow the PowerShell instructions to execute all the scripts described below. Click here to view the details all tables created in this solution.
Step 1: Creating Tables
The following data are provided in the D:\LoanChargeOffSolution\Data directory:
File | Description |
---|---|
loan_info_10k.csv | Raw data about loan and loan details |
member_info_10k.csv | Raw demographics and financial data about each customer |
payments_info_10k.csv | Raw data about loan payment history |
In this step, we create six tables: member_info_10k, loan_info_10k, payment_info_10k, loan_chargeoff_models_10k, selected_features_10k and loan_chargeoff_prediction_10k in a SQL Server database, and the data is uploaded to these tables using bcp command in the PowerShell script.
Input:
- Raw data: loan_info_10k.csv, member_info_10k.csv, and payments_info_10k.csv
Output:
- 6 Tables filled with the raw data:
member_info_10k
,loan_info_10k
,payment_info_10k
,loan_chargeoff_models_10k
,selected_features_10k
andloan_chargeoff_prediction_10k
. In order for them to be filled with the data, power shell script namedLoan_ChargeOff.ps1
should be run
Related files:
- step1_create_tables.sql
Step 2: Creating Views with Features and Labels
In this step, we create 3 views for training, testing, and scoring by selecting features and assign with labels from payments_info
and joined with loan_info
and member_info
table using payment_date as the criterion.
Views | Description |
---|---|
vw_loan_chargeoff_train_10k | For training, we select features from the table joints with payment date between '2016-09-12' and '2016-12-12' and charge_off status = 1 or (payment_date = '2017-01-12') |
vw_loan_chargeoff_test_10k | For testing, we create view from the joined tables with payment date = '2017-02-12' |
vw_loan_chargeoff_score_10k | For scoring, we create view from the joined tables with the payment date > '2017-02-12' |
These three views will get persisted in tables to get faster scoring results.
Input:
- 3 Tables filled with the raw data:
member_info_10k
,loan_info_10k
,payment_info_10k
Output:
- 3 Views created:
vw_loan_chargeoff_train_10k
,vw_loan_chargeoff_test_10k
andvw_loan_chargeoff_score_10k
- 3 tables created:
loan_chargeoff_train_10k
,loan_chargeoff_test_10k
andloan_chargeoff_score_10k
Related files:
- step2_features_label_view.sql
Step 2a (optional): Demonstrate feature selection using MicrosoftML package
In this step, we create a table [dbo].[selected_features]
that stores the feature names from feature selection using MicrosoftML package:
- Select features from training_set
- Remove biased features like memberId, loanId, payment_date, loan_open_date and charge_off
- Demonstrate feature selection using logistic regression model
- Store selected features in a table.
Input:
- Enter [Training_set_table] as parameter
- Enter [Test_set_table] as parameter
- Enter [selected_features_table] as parameter
- Enter [connectionString] as parameter
Output:
Selected_features_10k
table containing features that are selected by applying categorical and selectFeatures transforms from MicrosoftML package.
Related files:
- Loan_ChargeOff.ps1
- step2a_optional_feature_selection.sql
Step 3: Training and Testing Model
In this step, we create a stored procedure for training of models using MicrosoftML algorithms. This also evaluates the models and stores the model stats along with serialized model binary, accuracy, auc, precision, recall, f1score. We will be using 5 algorithms to train :
- rxLogisticRegression
- rxFastTrees
- rxFastForest
- rxFastLinear
- rxNeuralNet
The performance result from each of this model will get stored in Loan_chargeoff_models_10k
table.
Input:
Loan_chargeoff_train_10k
table.
Output:
Loan_chargeoff_models_10k
table containing model name, auc, accuracy, precision, recall and f1score
Related files:
- step3_train_test_model.sql
Step 4: Chargeoff Prediction (batch)
In this step, we create a stored procedure [dbo].[chargeoff_batch_prediction]
that do scoring using best model on the data split created in Step 2 and store the predictions in [dbo].[loan_chargeoff_prediction_10k]
table.
Input:
loan_chargeoff_score_10k
table
Output:
[dbo].[loan_chargeoff_prediction_10k]
table
Related files:
- Chargeoff_batch_prediction.sql
Step 4a: Chargeoff OnDemand
In this step, the business application can call this stored procedure for adhoc scoring scenario. predict_chargeoff_ondemand
stored procedure is created for ad-hoc scoring wherein it can be called with a single record and a single prediction result is returned to the caller.
Input:
- Please see input parameters in
predict_chargeoff_ondemand
stored procedure
Output:
LoanId, Payment_date, predicatedLabel, Score.1, Probability.1
table containing the RF and GBT trained models.
Related files:
- Predict_chargeoff_ondemand.sql