Fraud Detection

Template Contents


The following is the directory structure for this template:

  • Data This contains the copy of the input data.
  • R This contains the R code to upload the input datasets, pre-process them, create the analytical datasets, train and evaluate a model, and score data.
  • Resources This directory contains other resources for the solution package.
  • SQLR This contains T-SQL code to pre-process the datasets, train the models, and score data. It also contains a PowerShell script to automate the entire process, including loading the data into the database (not included in the T-SQL code).
  • RSparkCluster This contains the R code to pre-process the datasets, train ttrain and evaluate a model, and score data on a Spark cluster.

In this template with SQL Server ML Services, two versions of the SQL implementation, and another version for HDInsight implementation:

  1. Model Development in R IDE. Run the R code in R IDE (e.g., RStudio, R Tools for Visual Studio).

  2. Operationalize in SQL. Run the SQL code in SQL Server using SQLR scripts from SSMS or from the PowerShell script.

  3. HDInsight Solution on Spark Cluster. Run this R code in RStudio on the edge node of the Spark cluster.

Copy of Input Datasets


File Description
Account_Info.csv Customer account data
Fraud_Transactions.csv Raw fraud transaction data
Untagged_Transactions.csv Raw transaction data without fraud tag

Model Development in R


These files are in the R directory for development of the model.

File Description
FraudDetection.rproj Project file for RStudio or Visual Studio
FraudDetection.rxproj Used with the Visual Studio Solution File
FraudDetection.sln Visual Studio Solution File
Fraud_Detection_Notebook.ipynb Contains the Jupyter Notebook file that runs all the .R scripts
modeling_main.R Defines parameters and sources the different scripts for the Development Stage
step1_tagging.R Tags transactions on account level
step2_splitting_preprocessing.R Splits the tagged data set into a Training and a Testing set, cleans the training set and performs preprocessing
step3_feature_engineering.R Performs feature engineering
step4_training_evaluation.R Trains a boosted tree classification model on the training set, scores and evaluates on testing set

Operationalize in SQL Server


These files are in the SQLR directory.

File Description
CreateRiskTable.sql Stored procedure to create risk table for each input variable
.\OnlineFraudDetection.ps1 Automates execution of all .sql files and creates stored procedures
ParseString.sql Stored procedure to parse a string and to a sql table
example_user.sql Used during initial SQL Server setup to create the user and password and grant permissions
ScoreOneTrans.sql Stored procedure to score one transaction
SortAcctTable.sql Stored procedure to create recordDateTime column for Account_Info table and sort the table
Step0_CreateTables.sql Creates initial tables from .csv files
Step1_MergeAcctInfo.sql Stored procedure to merge untagged transactions with account level infomation
Step2_Tagging.sql Stored procedure to tag transactions on account level
Step3_SplitData.sql Stored procedure to split data on account level
Step4_Preprocess.sql Stored procedure to clean data and remove prefraud transactions
Step5_Save2History.sql Stored procedure to save transactions to historical table
Step6_CreateRiskTables.sql Stored procedure to create all risk tables
Step7_FeatureEngineer.sql Stored procedure to perform feature engineering
Step8_Training.sql Stored procedure to train and save a gradient boosted tree model
Step9_Prediction.sql Stored procedure to score and save results to a sql table
Step10A_Evaluation.sql Stored procedure to generate fraud account level metrics
Step10B_Evaluation_AUC.sql Stored procedure to calculate AUC
UtilityFunctions.sql Creates functions which will be used

HDInsight Solution on Spark Cluster


These files are in the RSparkCluster directory.

File Description
copy_dev_to_prod.RDefines function, copy_to_prod, used in development_main.R
data_generation.RUsed to generate data, used in development_main.R
in_memory_scoring.RPerforms in-memory scoring for batch scoring or for scoring remotely with a web service
production_main.R Scores new data using subset of development steps
step0_directories_creation.RCreates initial directories
step1_merge_account_info.RMerges the two tables Untagged_Transaction and Account_Info
step2_tagging.RTags transactions on account level
step3_splitting.RSplits the tagged data set into a Training and a Testing set
step4_preprocessing.R Performs preprocessing on an input data
step5_create_risk_tables.R Creates the risk tables for various character variables
step6_feature_engineering.R Performs feature engineering
step7_training.R Trains a gradient boosted trees (GBT) model on input data
step8_prediction.R Performs scoring
step9_evaluation.R Performs evaluation on a scored data set
web_scoring_main.R Defines and publishes the main web scoring function

Resources for the Solution Package


Directory Description
ActionScripts Contains scripts that are only meant to be run from the solution template deployment process

< Home