For the Database Analyst - Operationalize with SQL
In this template, all steps are implemented as SQL stored procedures: data preprocessing is done in pure SQL, while the feature engineering, model training, scoring, and evaluation steps are implemented as SQL stored procedures with embedded R (Microsoft ML Server) code. This implementation with SQL Server ML Services is equivalent to the Azure ML template for Online Fraud Detection.
For businesses that prefer an on-premises solution, the implementation with SQL Server ML Services is a great option: it takes advantage of the power of SQL Server and RevoScaleR (Microsoft ML Server).
All the steps can be executed in a SQL Server client environment (such as SQL Server Management Studio). We provide a Windows PowerShell script, SQLR-Fraud-Detection.ps1, which invokes the SQL scripts and demonstrates the end-to-end modeling process.
System Requirements
The following are required to run the scripts in this solution:
- SQL Server (2016 or higher) with Microsoft ML Server (version 9.1.0 or later) installed and configured.
- A SQL login (user name and password), with the user configured to execute R scripts in-memory.
- A SQL database in which the user has permission to write and to execute stored procedures.
- For more information about SQL Server 2017 and ML Services, please visit: https://docs.microsoft.com/en-us/sql/advanced-analytics/what-s-new-in-sql-server-machine-learning-services
Workflow Automation
Follow the PowerShell instructions to execute all the scripts described below. Click here to view the details of all tables created in this solution.
Step 0: Data Preparation
The following data are provided in the Data directory:
File | Description |
---|---|
Account_Info.csv | Customer account data |
Fraud_Transactions.csv | Raw fraud transaction data |
Untagged_Transactions.csv | Raw transaction data without fraud tag |
In this step, we create four tables. The first three, Untagged_Transactions, Account_Info, and Fraud_Transactions, correspond to the three data sets in the Data folder. Once the tables have been created, data is uploaded to them with the bcp command in the PowerShell script. The fourth table, Transaction_History, stores the historical transactions used to calculate aggregates; it is filled in later steps.
Input:
- untagged data: Untagged_Transactions.csv
- fraud data: Fraud_Transactions.csv
- account data: Account_Info.csv
Output:
- Untagged_Transactions table in SQL Server
- Fraud_Transactions table in SQL Server
- Account_Info table in SQL Server
- Transaction_History table in SQL Server
Step 1: Merging with Account Information
In this step, we merge the Untagged_Transactions table with the Account_Info table by accountID to get the account information for each transaction. Before merging, we create utility functions, and the Account_Info table is sorted in descending order of accountID and transactionDateTime.
Input:
- Untagged_Transactions table
- Account_Info table
Output:
- Untagged_Transactions_Acct table
Related Files:
- UtilityFunctions.sql: Creates utility functions, e.g., to convert the transactionTime field to a uniform 6-digit format.
- SortAcctTable.sql: Creates a SQL stored procedure named sortAcctTable to sort the Account_Info table.
- Step1_MergeAcctInfo.sql: Creates a stored procedure named MergeAcctInfo to merge the Untagged_Transactions table with the Account_Info table.
Example:
EXEC sortAcctTable 'Account_Info'
EXEC MergeAcctInfo 'Untagged_Transactions'
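The actual merge logic lives in Step1_MergeAcctInfo.sql; the sketch below only illustrates the idea, picking for each transaction the latest Account_Info record at or before the transaction time. The account columns shown (accountOwnerName, accountCountry) are hypothetical stand-ins for the template's real schema.

```sql
-- Illustrative merge: attach the most recent account snapshot to each transaction.
SELECT t.*,
       a.accountOwnerName,                      -- hypothetical account column
       a.accountCountry                         -- hypothetical account column
INTO Untagged_Transactions_Acct
FROM Untagged_Transactions AS t
OUTER APPLY (
    SELECT TOP 1 ai.accountOwnerName, ai.accountCountry
    FROM Account_Info AS ai
    WHERE ai.accountID = t.accountID
      AND ai.transactionDateTime <= t.transactionDateTime
    ORDER BY ai.transactionDateTime DESC        -- latest snapshot at or before the transaction
) AS a;
```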
Step 2: Tagging
In this step, we tag the untagged data at the account level based on the fraud data. The tagging logic is as follows. In the fraud data, we group by account ID and sort by time to get the fraud time period for each fraud account. For each transaction in the untagged data: if the account ID is not in the fraud data, the transaction is labeled as non-fraud (label = 0); if the account ID is in the fraud data and the transaction time is within the fraud time period of this account, the transaction is labeled as fraud (label = 1); if the account ID is in the fraud data but the transaction time is outside the fraud time period of this account, the transaction is labeled as pre-fraud or unknown (label = 2) and will be removed later. We also re-format some columns; for example, the transactionTime field is made uniform to 6 digits.
Input:
- Untagged_Transactions_Acct table
- Fraud_Transactions table
Output:
- Tagged table
Related Files:
- Step2_Tagging.sql: Creates a SQL stored procedure named Tagging to tag the data at the account level.
Example:
EXEC Tagging 'Untagged_Transactions_Acct', 'Fraud_Transactions'
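The real implementation is in Step2_Tagging.sql; as a rough sketch of the logic just described, one can derive each fraud account's fraud time period and label every transaction against it:

```sql
-- Illustrative tagging: the fraud period of an account is assumed to span
-- from its first to its last fraudulent transaction.
;WITH FraudPeriod AS (
    SELECT accountID,
           MIN(transactionDateTime) AS fraudStart,
           MAX(transactionDateTime) AS fraudEnd
    FROM Fraud_Transactions
    GROUP BY accountID
)
SELECT t.*,
       CASE
           WHEN f.accountID IS NULL THEN 0      -- account not in fraud data: non-fraud
           WHEN t.transactionDateTime BETWEEN f.fraudStart AND f.fraudEnd
                THEN 1                          -- within the fraud period: fraud
           ELSE 2                               -- outside the fraud period: pre-fraud/unknown
       END AS label
INTO Tagged
FROM Untagged_Transactions_Acct AS t
LEFT JOIN FraudPeriod AS f
       ON t.accountID = f.accountID;
```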
Step 3: Splitting Data
In this step, we hash accountID into 100 different hash codes and split the whole data set into training (70%) and testing (30%) based on the hash code, i.e., training = hash code <= 70 and testing = hash code > 70.
Input:
- Tagged table
Output:
- Tagged_Training table
- Tagged_Testing table
Related Files:
- Step3_SplitData.sql: Creates a SQL stored procedure named SplitData to split the data into training and testing sets.
Example:
EXEC SplitData 'Tagged'
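A minimal sketch of such a hash-based split, assuming T-SQL's CHECKSUM as the hash function (the template's actual hashing may differ):

```sql
-- Bucket each accountID into a hash code in [0, 99]; an account's transactions
-- all land on the same side of the split.
SELECT *
INTO Tagged_Training
FROM Tagged
WHERE ABS(CHECKSUM(accountID)) % 100 <= 70;     -- ~70% of accounts for training

SELECT *
INTO Tagged_Testing
FROM Tagged
WHERE ABS(CHECKSUM(accountID)) % 100 > 70;      -- remaining ~30% for testing
```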
Step 4: Preprocessing
In this step, we clean the tagged training data: missing values are filled with 0, and transactions with an invalid transaction time or amount are removed.
Input:
- Tagged_Training table
Output:
- Tagged_Training_Processed view
Related Files:
- Step4_Preprocess.sql: Creates a SQL stored procedure named Preprocess to perform the preprocessing.
Example:
EXEC Preprocess 'Tagged_Training'
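A minimal sketch of what the resulting view might look like; the column names here (localHour, transactionAmountUSD, transactionTime) are assumptions for illustration, not the template's actual column list:

```sql
-- Illustrative preprocessing view over the tagged training data.
CREATE VIEW Tagged_Training_Processed AS
SELECT accountID,
       transactionTime,
       transactionAmountUSD,
       ISNULL(localHour, 0) AS localHour,       -- fill missing values with 0 (hypothetical column)
       label
FROM Tagged_Training
WHERE transactionAmountUSD >= 0                 -- remove invalid amounts
  AND transactionTime IS NOT NULL;              -- remove invalid transaction times
```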
Step 5: Saving Transactions to Historical Table
In this step, we save the transactions to the Transaction_History table, which will be used for calculating aggregates.
Input:
- Tagged_Training_Processed view
Output:
- Transaction_History table (filled in this step)
Related Files:
- Step5_Save2History.sql: Creates a SQL stored procedure named Save2TransactionHistory to save transactions to the historical table. A flag controls whether the historical table is truncated first, e.g., EXEC Save2TransactionHistory 'Tagged_Training_Processed', '1' truncates the historical table and then saves the transactions from the Tagged_Training_Processed table to it.
Example:
EXEC Save2TransactionHistory 'Tagged_Training_Processed' ,'1'
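A minimal sketch of the truncate-and-append behavior, assuming the procedure takes the source table name and the flag as parameters (the real definition is in Step5_Save2History.sql):

```sql
-- Hypothetical name to avoid clashing with the template's real procedure.
CREATE PROCEDURE Save2TransactionHistorySketch
    @inputTable NVARCHAR(128),
    @truncateFlag CHAR(1)
AS
BEGIN
    IF @truncateFlag = '1'
        TRUNCATE TABLE Transaction_History;     -- '1' = start from an empty history

    -- Append the processed transactions to the historical table.
    DECLARE @sql NVARCHAR(MAX) =
        N'INSERT INTO Transaction_History SELECT * FROM ' + QUOTENAME(@inputTable) + N';';
    EXEC sp_executesql @sql;
END;
```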
Step 6: Create Risk Tables
In this step, we create risk tables for a number of categorical variables, such as the location-related variables. This is related to the method called “weight of evidence”.
The risk table stores the risk (log of smoothed odds ratio) for each level of one categorical variable. For example, suppose variable X has two levels: A and B. For level A, we compute the following:
- The total number of good transactions, n_good(A).
- The total number of bad transactions, n_bad(A).
- The smoothed odds, odds(A) = (n_bad(A)+10)/(n_bad(A)+n_good(A)+100).
- The risk of level A, Risk(A) = log(odds(A)/(1-odds(A))).
Similarly, we can compute the risk value for level B. Thus, the risk table of variable X is constructed as follows:
X | Risk |
---|---|
A | Risk(A) |
B | Risk(B) |
With the risk table, we can assign the corresponding risk value to each level. This is how we transform a categorical variable into a numerical one.
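To make the formula concrete, here is a sketch that builds the risk table for one hypothetical categorical variable, localCountry, directly from the tagged data (the template's CreateRiskTable procedure generalizes this to any variable):

```sql
-- Risk(A) = log(odds(A)/(1-odds(A))), with odds(A) = (n_bad(A)+10)/(n_bad(A)+n_good(A)+100).
SELECT localCountry AS X,
       LOG(odds / (1 - odds)) AS Risk
INTO Risk_localCountry
FROM (
    SELECT localCountry,
           (SUM(CASE WHEN label = 1 THEN 1.0 ELSE 0 END) + 10)
               / (COUNT(*) + 100) AS odds       -- n_bad(A) + n_good(A) = COUNT(*)
    FROM Tagged_Training_Processed
    GROUP BY localCountry
) AS t;
```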
Input:
- Tagged_Training_Processed view
Output:
- Risk_Var table: stores the names of the variables to be converted and the names of the corresponding risk tables
- Risk_xxx tables: the risk table for each variable xxx
Related Files:
- CreateRiskTable.sql: Creates a SQL stored procedure named CreateRiskTable to generate the risk table for a single variable.
- Step6_CreateRiskTables.sql: Creates a SQL stored procedure named CreateRiskTable_ForAll to generate risk tables for all required variables.
Example:
EXEC CreateRiskTable_ForAll
Step 7: Feature Engineering
This step performs feature engineering on the training data set. We generate three groups of new features:
- Binary variables. For example, address mismatch flags.
- Numerical risk variables transformed from categorical variables based on the risk tables.
- Aggregates. For example, the number of transactions completed by a customer in the past 30 days.
Input:
- Tagged_Training_Processed view
- Risk_Var table
- Risk_xxx tables
Output:
- Tagged_Training_Processed_Features view
Related Files:
- Step7_FeatureEngineer.sql: Creates a SQL stored procedure named FeatureEngineer to perform the feature engineering.
Example:
EXEC FeatureEngineer 'Tagged_Training_Processed'
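For illustration, the sketch below builds one feature from each of the three groups; all column names other than the risk-table lookup are hypothetical:

```sql
SELECT t.*,
       CASE WHEN t.paymentBillingPostalCode <> t.shippingPostalCode
            THEN 1 ELSE 0 END AS addressMismatchFlag,   -- binary feature
       r.Risk AS localCountryRisk,                      -- risk feature looked up from the risk table
       agg.transCount30d                                -- aggregate feature
FROM Tagged_Training_Processed AS t
LEFT JOIN Risk_localCountry AS r
       ON t.localCountry = r.X
OUTER APPLY (
    SELECT COUNT(*) AS transCount30d                    -- transactions in the past 30 days
    FROM Transaction_History AS h
    WHERE h.accountID = t.accountID
      AND h.transactionDateTime >= DATEADD(DAY, -30, t.transactionDateTime)
      AND h.transactionDateTime <  t.transactionDateTime
) AS agg;
```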
Step 8: Model Training
In this step, we train a gradient boosting tree model with the training data set.
Input:
- Tagged_Training_Processed_Features view
Output:
- Trained_Model table: stores the trained model object
Related Files:
- Step8_Training.sql: Creates a SQL stored procedure named TrainModelR to train a gradient boosting tree model.
Example:
EXEC TrainModelR 'Tagged_Training_Processed_Features'
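The actual R code is embedded in Step8_Training.sql; the following sketches the general pattern, assuming Trained_Model has a single varbinary(max) column named model. rxBTrees is RevoScaleR's boosted-tree learner, and the model formula here is purely illustrative:

```sql
DECLARE @model VARBINARY(MAX);

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # Illustrative formula; the template trains on the engineered features.
        model <- rxBTrees(label ~ transactionAmountUSD + localCountryRisk,
                          data = InputDataSet,
                          lossFunction = "bernoulli")
        trained_model <- as.raw(serialize(model, connection = NULL))
    ',
    @input_data_1 = N'SELECT * FROM Tagged_Training_Processed_Features',
    @params = N'@trained_model VARBINARY(MAX) OUTPUT',
    @trained_model = @model OUTPUT;

INSERT INTO Trained_Model (model) VALUES (@model);      -- persist the serialized model
```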
Step 9: Batch Scoring
In this step, we do batch scoring on the testing data set, including:
- Merging with the Account_Info table if account information doesn’t exist
- Preprocessing
- Feature engineering
- Scoring based on the model trained in the last step
Input:
- Trained_Model table
- Tagged_Testing table
Output:
- Predict_Score table: stores the predicted scores
Related Files:
- Step9_Prediction.sql: Creates a SQL stored procedure named PredictR to do the merging, preprocessing, feature engineering, and scoring for newly arriving transactions.
Example:
EXEC PredictR 'Tagged_Testing', 'Predict_Score', '0'
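A sketch of the scoring portion only, assuming the testing data has already been featurized into a hypothetical Tagged_Testing_Features table; PredictR itself also performs the merge, preprocessing, and feature-engineering steps. This also assumes Predict_Score's columns match the returned data frame:

```sql
DECLARE @model VARBINARY(MAX) = (SELECT TOP 1 model FROM Trained_Model);

INSERT INTO Predict_Score
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        model <- unserialize(model_bin)                  # restore the trained model
        OutputDataSet <- rxPredict(model, data = InputDataSet,
                                   type = "response",    # predicted fraud probability
                                   extraVarsToWrite = c("accountID", "label"))
    ',
    @input_data_1 = N'SELECT * FROM Tagged_Testing_Features',
    @params = N'@model_bin VARBINARY(MAX)',
    @model_bin = @model;
```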
Step 10: Evaluation
This step evaluates the performance at both the account level and the transaction level.
Input:
- Predict_Score table
Output:
- Performance table: stores the account-level metrics
- Performance_Auc table: stores the transaction-level metric, the AUC of the ROC curve
Related Files:
- Step10A_Evaluation.sql: Creates a SQL stored procedure named EvaluateR to evaluate performance at the account level.
- Step10B_Evaluation_AUC.sql: Creates a SQL stored procedure named EvaluateR_auc to evaluate performance at the transaction level.
Example:
EXEC EvaluateR 'Predict_Score'
EXEC EvaluateR_auc 'Predict_Score'
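For the transaction-level metric, a minimal sketch of the AUC computation using RevoScaleR's rxRoc and rxAuc, assuming Predict_Score exposes label and score columns:

```sql
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        roc <- rxRoc(actualVarName = "label", predVarName = "score", data = InputDataSet)
        OutputDataSet <- data.frame(AUC = rxAuc(roc))
    ',
    @input_data_1 = N'SELECT label, score FROM Predict_Score'
WITH RESULT SETS ((AUC FLOAT));
```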
Step 11: Production Scoring
In this step, we showcase how to score a single raw transaction, mimicking the real-time scoring case. This procedure is called from our example website when a transaction occurs. See Typical Workflow for more information.
Input:
- One hard-coded raw transaction
Output:
- Predict_Score_Single_Transaction table: stores the score of the input transaction
Example:
EXEC ScoreOneTrans 'C34F7C20-6203-42F5-A41B-AF26177345BE,A1055521358474530,2405.33,2405.33,USD,NULL,20130409,102958,14,A,P,NULL,NULL,NULL,92.97,dubayy,0,ae,FALSE,NULL,en-US,CREDITCARD,AMEX,NULL,NULL,NULL,33071,FL,US,NULL,NULL,NULL,NULL,NULL,NULL,M,NULL,0,4,NULL'