Template Contents
The following is the directory structure for this template:
- Data This contains the copy of the input data.
- R This contains the R code to upload the input datasets, pre-process them, create the analytical datasets, train and evaluate a model, and score data.
- Resources This directory contains other resources for the solution package.
- SQLR This contains T-SQL code to pre-process the datasets, train the models, and score data. It also contains a PowerShell script to automate the entire process, including loading the data into the database (not included in the T-SQL code).
- RSparkCluster This contains the R code to pre-process the datasets, train ttrain and evaluate a model, and score data on a Spark cluster.
In this template with SQL Server ML Services, two versions of the SQL implementation, and another version for HDInsight implementation:
-
Model Development in R IDE. Run the R code in R IDE (e.g., RStudio, R Tools for Visual Studio).
-
Operationalize in SQL. Run the SQL code in SQL Server using SQLR scripts from SSMS or from the PowerShell script.
-
HDInsight Solution on Spark Cluster. Run this R code in RStudio on the edge node of the Spark cluster.
Copy of Input Datasets
File | Description |
---|---|
Account_Info.csv | Customer account data |
Fraud_Transactions.csv | Raw fraud transaction data |
Untagged_Transactions.csv | Raw transaction data without fraud tag |
Model Development in R
These files are in the R directory for development of the model.
File | Description |
---|---|
FraudDetection.rproj | Project file for RStudio or Visual Studio |
FraudDetection.rxproj | Used with the Visual Studio Solution File |
FraudDetection.sln | Visual Studio Solution File |
Fraud_Detection_Notebook.ipynb | Contains the Jupyter Notebook file that runs all the .R scripts |
modeling_main.R | Defines parameters and sources the different scripts for the Development Stage |
step1_tagging.R | Tags transactions on account level |
step2_splitting_preprocessing.R | Splits the tagged data set into a Training and a Testing set, cleans the training set and performs preprocessing |
step3_feature_engineering.R | Performs feature engineering |
step4_training_evaluation.R | Trains a boosted tree classification model on the training set, scores and evaluates on testing set |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
Operationalize in SQL Server
These files are in the SQLR directory.
File | Description |
---|---|
CreateRiskTable.sql | Stored procedure to create risk table for each input variable |
.\OnlineFraudDetection.ps1 | Automates execution of all .sql files and creates stored procedures |
ParseString.sql | Stored procedure to parse a string and to a sql table |
example_user.sql | Used during initial SQL Server setup to create the user and password and grant permissions |
ScoreOneTrans.sql | Stored procedure to score one transaction |
SortAcctTable.sql | Stored procedure to create recordDateTime column for Account_Info table and sort the table |
Step0_CreateTables.sql | Creates initial tables from .csv files |
Step1_MergeAcctInfo.sql | Stored procedure to merge untagged transactions with account level infomation |
Step2_Tagging.sql | Stored procedure to tag transactions on account level |
Step3_SplitData.sql | Stored procedure to split data on account level |
Step4_Preprocess.sql | Stored procedure to clean data and remove prefraud transactions |
Step5_Save2History.sql | Stored procedure to save transactions to historical table |
Step6_CreateRiskTables.sql | Stored procedure to create all risk tables |
Step7_FeatureEngineer.sql | Stored procedure to perform feature engineering |
Step8_Training.sql | Stored procedure to train and save a gradient boosted tree model |
Step9_Prediction.sql | Stored procedure to score and save results to a sql table |
Step10A_Evaluation.sql | Stored procedure to generate fraud account level metrics |
Step10B_Evaluation_AUC.sql | Stored procedure to calculate AUC |
UtilityFunctions.sql | Creates functions which will be used |
- See For the Database Analyst for more information about these files.
- Follow the PowerShell Instructions to execute the PowerShell script which automates the running of all these .sql files.
HDInsight Solution on Spark Cluster
These files are in the RSparkCluster directory.
File | Description |
---|---|
copy_dev_to_prod.R | Defines function, copy_to_prod, used in development_main.R |
data_generation.R | Used to generate data, used in development_main.R |
in_memory_scoring.R | Performs in-memory scoring for batch scoring or for scoring remotely with a web service |
production_main.R | Scores new data using subset of development steps |
step0_directories_creation.R | Creates initial directories |
step1_merge_account_info.R | Merges the two tables Untagged_Transaction and Account_Info |
step2_tagging.R | Tags transactions on account level |
step3_splitting.R | Splits the tagged data set into a Training and a Testing set |
step4_preprocessing.R | Performs preprocessing on an input data |
step5_create_risk_tables.R | Creates the risk tables for various character variables |
step6_feature_engineering.R | Performs feature engineering |
step7_training.R | Trains a gradient boosted trees (GBT) model on input data |
step8_prediction.R | Performs scoring |
step9_evaluation.R | Performs evaluation on a scored data set |
web_scoring_main.R | Defines and publishes the main web scoring function |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
Resources for the Solution Package
Directory | Description |
---|---|
ActionScripts | Contains scripts that are only meant to be run from the solution template deployment process |