For the Database Analyst - Operationalize with SQL
In this template, all steps are implemented as SQL stored procedures: data preprocessing is done in pure SQL, while the feature engineering, model training, scoring, and evaluation steps are implemented as SQL stored procedures with embedded R (Microsoft ML Server) code. This implementation with SQL Server ML Services is equivalent to the Azure ML template for Online Fraud Detection.
For businesses that prefer an on-premises solution, the implementation with SQL Server ML Services is a great option: it takes advantage of the power of SQL Server and RevoScaleR (Microsoft ML Server).
All the steps can be executed in a SQL Server client environment (such as SQL Server Management Studio). We provide a Windows PowerShell script, SQLR-Fraud-Detection.ps1, which invokes the SQL scripts and demonstrates the end-to-end modeling process.
System Requirements
The following are required to run the scripts in this solution:
- SQL Server (2016 or higher) with Microsoft ML Server (version 9.1.0 or later) installed and configured.
- A SQL login (user name and password), with the user configured to execute R scripts in-memory.
- A SQL database in which the user has permission to write and to execute stored procedures.
- For more information about SQL Server 2017 and ML Services, please visit: https://docs.microsoft.com/en-us/sql/advanced-analytics/what-s-new-in-sql-server-machine-learning-services
Workflow Automation
Follow the PowerShell instructions to execute all the scripts described below. Click here to view the details of all tables created in this solution.
Step 0: Data Preparation
The following data are provided in the Data directory:
File | Description |
---|---|
Account_Info.csv | Customer account data |
Fraud_Transactions.csv | Raw fraud transaction data |
Untagged_Transactions.csv | Raw transaction data without fraud tag |
In this step, we create four tables. The first three, Untagged_Transactions, Account_Info, and Fraud_Transactions, correspond to the three data sets in the Data folder. Once the tables have been created, data is uploaded to them with the bcp command in the PowerShell script. The fourth table, Transaction_History, stores the historical transactions used to calculate aggregates; it is filled in later steps.
Input:
- untagged data: Untagged_Transactions.csv
- fraud data: Fraud_Transactions.csv
- account data: Account_Info.csv
Output:
- Untagged_Transactions table in SQL Server
- Fraud_Transactions table in SQL Server
- Account_Info table in SQL Server
- Transaction_History table in SQL Server
Step 1: Merging with Account Information
In this step, we merge the Untagged_Transactions table with the Account_Info table by accountID to get the account information for each transaction. Before merging, we create utility functions, and the Account_Info table is sorted in descending order of accountID and transactionDateTime.
Input:
- Untagged_Transactions table
- Account_Info table
Output:
- Untagged_Transactions_Acct table
Related Files:
- UtilityFunctions.sql: Creates utility functions, e.g., to convert the transactionTime field to a uniform 6-digit format.
- SortAcctTable.sql: Creates a SQL stored procedure named sortAcctTable to sort the Account_Info table.
- Step1_MergeAcctInfo.sql: Creates a stored procedure named MergeAcctInfo to merge the Untagged_Transactions table with the Account_Info table.
Example:
EXEC sortAcctTable 'Account_Info'
EXEC MergeAcctInfo 'Untagged_Transactions'
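The actual merge logic lives in Step1_MergeAcctInfo.sql; the sketch below only illustrates the idea, picking for each transaction the latest Account_Info record at or before the transaction time. The account columns shown (accountOwnerName, accountCountry) are hypothetical stand-ins for the template's real schema.

```sql
-- Illustrative merge: attach the most recent account snapshot to each transaction.
SELECT t.*,
       a.accountOwnerName,                      -- hypothetical account column
       a.accountCountry                         -- hypothetical account column
INTO Untagged_Transactions_Acct
FROM Untagged_Transactions AS t
OUTER APPLY (
    SELECT TOP 1 ai.accountOwnerName, ai.accountCountry
    FROM Account_Info AS ai
    WHERE ai.accountID = t.accountID
      AND ai.transactionDateTime <= t.transactionDateTime
    ORDER BY ai.transactionDateTime DESC        -- latest snapshot at or before the transaction
) AS a;
```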
Step 2: Tagging
In this step, we tag the untagged data at the account level based on the fraud data. The tagging logic is as follows. In the fraud data, we group by account ID and sort by time to get the fraud time period for each fraud account. For each transaction in the untagged data: if the account ID is not in the fraud data, the transaction is labeled as non-fraud (label = 0); if the account ID is in the fraud data and the transaction time is within the fraud time period of this account, the transaction is labeled as fraud (label = 1); if the account ID is in the fraud data but the transaction time is outside the fraud time period of this account, the transaction is labeled as pre-fraud or unknown (label = 2) and will be removed later. We also re-format some columns; for example, the transactionTime field is made uniform to 6 digits.
Input:
- Untagged_Transactions_Acct table
- Fraud_Transactions table
Output:
- Tagged table
Related Files:
- Step2_Tagging.sql: Creates a SQL stored procedure named Tagging to tag the data at the account level.
Example:
EXEC Tagging 'Untagged_Transactions_Acct', 'Fraud_Transactions'
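The real implementation is in Step2_Tagging.sql; as a rough sketch of the logic just described, one can derive each fraud account's fraud time period and label every transaction against it:

```sql
-- Illustrative tagging: the fraud period of an account is assumed to span
-- from its first to its last fraudulent transaction.
;WITH FraudPeriod AS (
    SELECT accountID,
           MIN(transactionDateTime) AS fraudStart,
           MAX(transactionDateTime) AS fraudEnd
    FROM Fraud_Transactions
    GROUP BY accountID
)
SELECT t.*,
       CASE
           WHEN f.accountID IS NULL THEN 0      -- account not in fraud data: non-fraud
           WHEN t.transactionDateTime BETWEEN f.fraudStart AND f.fraudEnd
                THEN 1                          -- within the fraud period: fraud
           ELSE 2                               -- outside the fraud period: pre-fraud/unknown
       END AS label
INTO Tagged
FROM Untagged_Transactions_Acct AS t
LEFT JOIN FraudPeriod AS f
       ON t.accountID = f.accountID;
```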
Step 3: Splitting Data
In this step, we hash accountID into 100 different hash codes and split the whole data set into training (70%) and testing (30%) based on the hash code, i.e., training = hash code <= 70 and testing = hash code > 70.
Input:
- Tagged table
Output:
- Tagged_Training table
- Tagged_Testing table
Related Files:
- Step3_SplitData.sql: Creates a SQL stored procedure named SplitData to split the data into training and testing sets.
Example:
EXEC SplitData 'Tagged'
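A minimal sketch of such a hash-based split, assuming T-SQL's CHECKSUM as the hash function (the template's actual hashing may differ):

```sql
-- Bucket each accountID into a hash code in [0, 99]; an account's transactions
-- all land on the same side of the split.
SELECT *
INTO Tagged_Training
FROM Tagged
WHERE ABS(CHECKSUM(accountID)) % 100 <= 70;     -- ~70% of accounts for training

SELECT *
INTO Tagged_Testing
FROM Tagged
WHERE ABS(CHECKSUM(accountID)) % 100 > 70;      -- remaining ~30% for testing
```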
Step 4: Preprocessing
In this step, we clean the tagged training data: missing values are filled with 0, and transactions with an invalid transaction time or amount are removed.
Input:
- Tagged_Training table
Output:
- Tagged_Training_Processed view
Related Files:
- Step4_Preprocess.sql: Creates a SQL stored procedure named Preprocess to perform the preprocessing.
Example:
EXEC Preprocess 'Tagged_Training'
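A minimal sketch of what the resulting view might look like; the column names here (localHour, transactionAmountUSD, transactionTime) are assumptions for illustration, not the template's actual column list:

```sql
-- Illustrative preprocessing view over the tagged training data.
CREATE VIEW Tagged_Training_Processed AS
SELECT accountID,
       transactionTime,
       transactionAmountUSD,
       ISNULL(localHour, 0) AS localHour,       -- fill missing values with 0 (hypothetical column)
       label
FROM Tagged_Training
WHERE transactionAmountUSD >= 0                 -- remove invalid amounts
  AND transactionTime IS NOT NULL;              -- remove invalid transaction times
```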
Step 5: Saving Transactions to Historical Table
In this step, we save the transactions to the Transaction_History table, which will be used for calculating aggregates.
Input:
- Tagged_Training_Processed view
Output:
- Transaction_History table (filled in this step)
Related Files:
- Step5_Save2History.sql: Creates a SQL stored procedure named Save2TransactionHistory to save transactions to the historical table. A flag controls whether the historical table is truncated first, e.g., EXEC Save2TransactionHistory 'Tagged_Training_Processed', '1' truncates the historical table and then saves the transactions from the Tagged_Training_Processed table to it.
Example:
EXEC Save2TransactionHistory 'Tagged_Training_Processed' ,'1'
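A minimal sketch of the truncate-and-append behavior, assuming the procedure takes the source table name and the flag as parameters (the real definition is in Step5_Save2History.sql):

```sql
-- Hypothetical name to avoid clashing with the template's real procedure.
CREATE PROCEDURE Save2TransactionHistorySketch
    @inputTable NVARCHAR(128),
    @truncateFlag CHAR(1)
AS
BEGIN
    IF @truncateFlag = '1'
        TRUNCATE TABLE Transaction_History;     -- '1' = start from an empty history

    -- Append the processed transactions to the historical table.
    DECLARE @sql NVARCHAR(MAX) =
        N'INSERT INTO Transaction_History SELECT * FROM ' + QUOTENAME(@inputTable) + N';';
    EXEC sp_executesql @sql;
END;
```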
Step 6: Create Risk Tables
In this step, we create risk tables for a number of categorical variables, such as the location-related variables. This is related to the method called “weight of evidence”.
The risk table stores the risk (log of smoothed odds ratio) for each level of one categorical variable. For example, suppose variable X has two levels: A and B. For level A, we compute the following:
- The total number of good transactions, n_good(A).
- The total number of bad transactions, n_bad(A).
- The smoothed odds, odds(A) = (n_bad(A)+10)/(n_bad(A)+n_good(A)+100).
- The risk of level A, Risk(A) = log(odds(A)/(1-odds(A))).
Similarly, we can compute the risk value for level B. Thus, the risk table of variable X is constructed as follows:
X | Risk |
---|---|
A | Risk(A) |
B | Risk(B) |
With the risk table, we can assign the corresponding risk value to each level. This is how we transform a categorical variable into a numerical one.
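To make the formula concrete, here is a sketch that builds the risk table for one hypothetical categorical variable, localCountry, directly from the tagged data (the template's CreateRiskTable procedure generalizes this to any variable):

```sql
-- Risk(A) = log(odds(A)/(1-odds(A))), with odds(A) = (n_bad(A)+10)/(n_bad(A)+n_good(A)+100).
SELECT localCountry AS X,
       LOG(odds / (1 - odds)) AS Risk
INTO Risk_localCountry
FROM (
    SELECT localCountry,
           (SUM(CASE WHEN label = 1 THEN 1.0 ELSE 0 END) + 10)
               / (COUNT(*) + 100) AS odds       -- n_bad(A) + n_good(A) = COUNT(*)
    FROM Tagged_Training_Processed
    GROUP BY localCountry
) AS t;
```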
Input:
- Tagged_Training_Processed view
Output:
- Risk_Var table: stores the names of the variables to be converted and the names of the corresponding risk tables
- Risk_xxx tables: the risk table for each variable xxx
Related Files:
- CreateRiskTable.sql: Creates a SQL stored procedure named CreateRiskTable to generate the risk table for a single variable.
- Step6_CreateRiskTables.sql: Creates a SQL stored procedure named CreateRiskTable_ForAll to generate risk tables for all required variables.
Example:
EXEC CreateRiskTable_ForAll
Step 7: Feature Engineering
This step performs feature engineering on the training data set. We generate three groups of new features:
- Binary variables. For example, address mismatch flags.
- Numerical risk variables transformed from categorical variables based on the risk tables.
- Aggregates. For example, the number of transactions completed by a customer in the past 30 days.
Input:
- Tagged_Training_Processed view
- Risk_Var table
- Risk_xxx tables
Output:
- Tagged_Training_Processed_Features view
Related Files:
- Step7_FeatureEngineer.sql: Creates a SQL stored procedure named FeatureEngineer to perform the feature engineering.
Example:
EXEC FeatureEngineer 'Tagged_Training_Processed'
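For illustration, the sketch below builds one feature from each of the three groups; all column names other than the risk-table lookup are hypothetical:

```sql
SELECT t.*,
       CASE WHEN t.paymentBillingPostalCode <> t.shippingPostalCode
            THEN 1 ELSE 0 END AS addressMismatchFlag,   -- binary feature
       r.Risk AS localCountryRisk,                      -- risk feature looked up from the risk table
       agg.transCount30d                                -- aggregate feature
FROM Tagged_Training_Processed AS t
LEFT JOIN Risk_localCountry AS r
       ON t.localCountry = r.X
OUTER APPLY (
    SELECT COUNT(*) AS transCount30d                    -- transactions in the past 30 days
    FROM Transaction_History AS h
    WHERE h.accountID = t.accountID
      AND h.transactionDateTime >= DATEADD(DAY, -30, t.transactionDateTime)
      AND h.transactionDateTime <  t.transactionDateTime
) AS agg;
```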
Step 8: Model Training
In this step, we train a gradient boosting tree model with the training data set.
Input:
- Tagged_Training_Processed_Features view
Output:
- Trained_Model table: stores the trained model object
Related Files:
- Step8_Training.sql: Creates a SQL stored procedure named TrainModelR to train a gradient boosting tree model.
Example:
EXEC TrainModelR 'Tagged_Training_Processed_Features'
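The actual R code is embedded in Step8_Training.sql; the following sketches the general pattern, assuming Trained_Model has a single varbinary(max) column named model. rxBTrees is RevoScaleR's boosted-tree learner, and the model formula here is purely illustrative:

```sql
DECLARE @model VARBINARY(MAX);

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # Illustrative formula; the template trains on the engineered features.
        model <- rxBTrees(label ~ transactionAmountUSD + localCountryRisk,
                          data = InputDataSet,
                          lossFunction = "bernoulli")
        trained_model <- as.raw(serialize(model, connection = NULL))
    ',
    @input_data_1 = N'SELECT * FROM Tagged_Training_Processed_Features',
    @params = N'@trained_model VARBINARY(MAX) OUTPUT',
    @trained_model = @model OUTPUT;

INSERT INTO Trained_Model (model) VALUES (@model);      -- persist the serialized model
```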
Step 9: Batch Scoring
In this step, we do batch scoring on the testing data set, including:
- Merging with the Account_Info table if account information doesn’t exist
- Preprocessing
- Feature engineering
- Scoring based on the model trained in the last step
Input:
- Trained_Model table
- Tagged_Testing table
Output:
- Predict_Score table: stores the predicted scores
Related Files:
- Step9_Prediction.sql: Creates a SQL stored procedure named PredictR to do the merging, preprocessing, feature engineering, and scoring for newly arriving transactions.
Example:
EXEC PredictR 'Tagged_Testing', 'Predict_Score', '0'
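A sketch of the scoring portion only, assuming the testing data has already been featurized into a hypothetical Tagged_Testing_Features table; PredictR itself also performs the merge, preprocessing, and feature-engineering steps. This also assumes Predict_Score's columns match the returned data frame:

```sql
DECLARE @model VARBINARY(MAX) = (SELECT TOP 1 model FROM Trained_Model);

INSERT INTO Predict_Score
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        model <- unserialize(model_bin)                  # restore the trained model
        OutputDataSet <- rxPredict(model, data = InputDataSet,
                                   type = "response",    # predicted fraud probability
                                   extraVarsToWrite = c("accountID", "label"))
    ',
    @input_data_1 = N'SELECT * FROM Tagged_Testing_Features',
    @params = N'@model_bin VARBINARY(MAX)',
    @model_bin = @model;
```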
Step 10: Evaluation
This step evaluates the performance at both the account level and the transaction level.
Input:
- Predict_Score table
Output:
- Performance table: stores the account-level metrics
- Performance_Auc table: stores the transaction-level metric, the AUC of the ROC curve
Related Files:
- Step10A_Evaluation.sql: Creates a SQL stored procedure named EvaluateR to evaluate performance at the account level.
- Step10B_Evaluation_AUC.sql: Creates a SQL stored procedure named EvaluateR_auc to evaluate performance at the transaction level.
Example:
EXEC EvaluateR 'Predict_Score'
EXEC EvaluateR_auc 'Predict_Score'
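For the transaction-level metric, a minimal sketch of the AUC computation using RevoScaleR's rxRoc and rxAuc, assuming Predict_Score exposes label and score columns:

```sql
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        roc <- rxRoc(actualVarName = "label", predVarName = "score", data = InputDataSet)
        OutputDataSet <- data.frame(AUC = rxAuc(roc))
    ',
    @input_data_1 = N'SELECT label, score FROM Predict_Score'
WITH RESULT SETS ((AUC FLOAT));
```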
Step 11: Production Scoring
In this step, we showcase how to score a single raw transaction, mimicking the real-time scoring case. This procedure is called from our example website when a transaction occurs. See Typical Workflow for more information.
Input:
- One hard-coded raw transaction
Output:
- Predict_Score_Single_Transaction table: stores the score of the input transaction
Example:
EXEC ScoreOneTrans 'C34F7C20-6203-42F5-A41B-AF26177345BE,A1055521358474530,2405.33,2405.33,USD,NULL,20130409,102958,14,A,P,NULL,NULL,NULL,92.97,dubayy,0,ae,FALSE,NULL,en-US,CREDITCARD,AMEX,NULL,NULL,NULL,33071,FL,US,NULL,NULL,NULL,NULL,NULL,NULL,M,NULL,0,4,NULL'