Template Contents
The following is the directory structure for this template:
- **Data**: Contains the copy of the input data.
- **R**: Contains the R code to upload the input datasets, pre-process them, create the analytical datasets, train and evaluate a model, and score data.
- **Resources**: Contains other resources for the solution package.
- **SQLR**: Contains the T-SQL code to pre-process the datasets, train the models, and score data. It also contains a PowerShell script that automates the entire process, including loading the data into the database (a step not covered by the T-SQL code itself).
- **RSparkCluster**: Contains the R code to pre-process the datasets, train and evaluate a model, and score data on a Spark cluster.
This template includes two versions of the implementation with SQL Server ML Services, plus another version for an HDInsight Spark cluster:
- **Model Development in R IDE**. Run the R code in an R IDE (e.g., RStudio, R Tools for Visual Studio).
- **Operationalize in SQL**. Run the SQL code in SQL Server using the SQLR scripts, either from SSMS or through the PowerShell script.
- **HDInsight Solution on Spark Cluster**. Run the R code in RStudio on the edge node of the Spark cluster.
Copy of Input Datasets
| File | Description |
|---|---|
| Account_Info.csv | Customer account data |
| Fraud_Transactions.csv | Raw fraud transaction data |
| Untagged_Transactions.csv | Raw transaction data without fraud tag |
Model Development in R
These files are in the R directory for development of the model.
| File | Description |
|---|---|
| FraudDetection.rproj | Project file for RStudio or Visual Studio |
| FraudDetection.rxproj | Used with the Visual Studio Solution File |
| FraudDetection.sln | Visual Studio Solution File |
| Fraud_Detection_Notebook.ipynb | Jupyter Notebook that runs all the .R scripts |
| modeling_main.R | Defines parameters and sources the different scripts for the Development Stage |
| step1_tagging.R | Tags transactions at the account level |
| step2_splitting_preprocessing.R | Splits the tagged data set into a Training and a Testing set, cleans the training set and performs preprocessing |
| step3_feature_engineering.R | Performs feature engineering |
| step4_training_evaluation.R | Trains a boosted tree classification model on the training set, then scores and evaluates it on the testing set |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
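As the table above notes, modeling_main.R defines the parameters and sources the step scripts in order. The sketch below is only an illustration of that flow, not the template's actual code; the working directory and variable names are assumptions.

```r
## Illustrative sketch of the development flow driven by modeling_main.R.
## Paths and variable names are assumptions, not the template's exact code.
setwd("R")                                 # directory containing the step scripts

data_dir <- "../Data"                      # location of the copied input .csv files

source("step1_tagging.R")                  # tag transactions at the account level
source("step2_splitting_preprocessing.R")  # split into training/testing sets, clean, preprocess
source("step3_feature_engineering.R")      # add engineered features
source("step4_training_evaluation.R")      # train the boosted tree model and evaluate it
```

Running the Jupyter Notebook Fraud_Detection_Notebook.ipynb executes the same .R scripts end to end.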
Operationalize in SQL Server
These files are in the SQLR directory.
| File | Description |
|---|---|
| CreateRiskTable.sql | Stored procedure to create risk table for each input variable |
| OnlineFraudDetection.ps1 | Automates execution of all .sql files and creates stored procedures |
| ParseString.sql | Stored procedure to parse an input string into a SQL table |
| example_user.sql | Used during initial SQL Server setup to create the user and password and grant permissions |
| ScoreOneTrans.sql | Stored procedure to score one transaction |
| SortAcctTable.sql | Stored procedure to create recordDateTime column for Account_Info table and sort the table |
| Step0_CreateTables.sql | Creates initial tables from .csv files |
| Step1_MergeAcctInfo.sql | Stored procedure to merge untagged transactions with account-level information |
| Step2_Tagging.sql | Stored procedure to tag transactions at the account level |
| Step3_SplitData.sql | Stored procedure to split the data at the account level |
| Step4_Preprocess.sql | Stored procedure to clean data and remove prefraud transactions |
| Step5_Save2History.sql | Stored procedure to save transactions to historical table |
| Step6_CreateRiskTables.sql | Stored procedure to create all risk tables |
| Step7_FeatureEngineer.sql | Stored procedure to perform feature engineering |
| Step8_Training.sql | Stored procedure to train and save a gradient boosted tree model |
| Step9_Prediction.sql | Stored procedure to score and save results to a SQL table |
| Step10A_Evaluation.sql | Stored procedure to generate fraud account level metrics |
| Step10B_Evaluation_AUC.sql | Stored procedure to calculate AUC |
| UtilityFunctions.sql | Creates utility functions used by the other .sql scripts |
- See For the Database Analyst for more information about these files.
- Follow the PowerShell Instructions to execute the PowerShell script which automates the running of all these .sql files.
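Once the stored procedures have been created (by the PowerShell script or by running the .sql files from SSMS), they can be invoked from any SQL client. As an illustration only, the sketch below calls the ScoreOneTrans procedure from R over ODBC; the server name, database name, and parameter name are assumptions and must be adapted to your deployment.

```r
## Illustration only: invoking a stored procedure created by the SQLR scripts
## from R over ODBC. Server, database, and parameter names are assumptions.
library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",
                 Server   = "localhost",
                 Database = "Fraud",                # assumed database name
                 Trusted_Connection = "Yes")

# One raw transaction record passed as a single string; ParseString.sql is what
# the template uses server-side to turn such a string into a SQL table.
one_transaction <- "field1,field2,..."              # placeholder record

# The parameter name @inputString is a placeholder, not the procedure's actual signature.
stmt <- sqlInterpolate(con, "EXEC ScoreOneTrans @inputString = ?trans", trans = one_transaction)
dbExecute(con, stmt)

dbDisconnect(con)
```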
HDInsight Solution on Spark Cluster
These files are in the RSparkCluster directory.
| File | Description |
|---|---|
| copy_dev_to_prod.R | Defines the copy_to_prod function used in development_main.R |
| data_generation.R | Generates data; used in development_main.R |
| in_memory_scoring.R | Performs in-memory scoring for batch scoring or for scoring remotely with a web service |
| production_main.R | Scores new data using subset of development steps |
| step0_directories_creation.R | Creates initial directories |
| step1_merge_account_info.R | Merges the two tables Untagged_Transaction and Account_Info |
| step2_tagging.R | Tags transactions at the account level |
| step3_splitting.R | Splits the tagged data set into a Training and a Testing set |
| step4_preprocessing.R | Performs preprocessing on an input data set |
| step5_create_risk_tables.R | Creates the risk tables for various character variables |
| step6_feature_engineering.R | Performs feature engineering |
| step7_training.R | Trains a gradient boosted trees (GBT) model on input data |
| step8_prediction.R | Performs scoring |
| step9_evaluation.R | Performs evaluation on a scored data set |
| web_scoring_main.R | Defines and publishes the main web scoring function |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
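The template's own driver scripts on the cluster orchestrate these steps, but as a rough illustration, running the development steps from RStudio on the edge node amounts to connecting to Spark and sourcing the step scripts in order. The sketch below assumes RevoScaleR (Microsoft ML Server) is installed on the cluster; the working directory is a placeholder.

```r
## Rough illustration, not the template's actual driver: run the development
## steps on the Spark cluster from the edge node. Paths are placeholders.
library(RevoScaleR)

cc <- rxSparkConnect(reset = TRUE)       # make Spark the active compute context

setwd("RSparkCluster")
source("step0_directories_creation.R")   # create the working directories
source("step1_merge_account_info.R")     # merge Untagged_Transactions with Account_Info
source("step2_tagging.R")                # tag transactions at the account level
source("step3_splitting.R")              # split into training and testing sets
source("step4_preprocessing.R")          # clean the input data
source("step5_create_risk_tables.R")     # build risk tables for the character variables
source("step6_feature_engineering.R")    # add engineered features
source("step7_training.R")               # train the gradient boosted trees model
source("step8_prediction.R")             # score
source("step9_evaluation.R")             # evaluate the scored data set

rxSparkDisconnect(cc)
```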
Resources for the Solution Package
| Directory | Description |
|---|---|
| ActionScripts | Contains scripts that are only meant to be run from the solution template deployment process |