Template Contents
The following is the directory structure for this template:
- **Data**: Contains the copy of the input data.
- **R**: Contains the R code to upload the input datasets, pre-process them, create the analytical datasets, train and evaluate a model, and score data.
- **Resources**: Contains other resources for the solution package.
- **SQLR**: Contains the T-SQL code to pre-process the datasets, train the models, and score data. It also contains a PowerShell script that automates the entire process, including loading the data into the database (a step not covered by the T-SQL code itself).
- **RSparkCluster**: Contains the R code to pre-process the datasets, train and evaluate a model, and score data on a Spark cluster.
This template includes two versions of the implementation with SQL Server ML Services, plus another version for an HDInsight Spark cluster:
- **Model Development in R IDE**. Run the R code in an R IDE (e.g., RStudio, R Tools for Visual Studio).
- **Operationalize in SQL**. Run the SQL code in SQL Server using the SQLR scripts, either from SSMS or through the PowerShell script.
- **HDInsight Solution on Spark Cluster**. Run the R code in RStudio on the edge node of the Spark cluster.
Copy of Input Datasets
| File | Description |
|---|---|
| Account_Info.csv | Customer account data |
| Fraud_Transactions.csv | Raw fraud transaction data |
| Untagged_Transactions.csv | Raw transaction data without fraud tag |
Model Development in R
These files are in the R directory for development of the model.
| File | Description |
|---|---|
| FraudDetection.rproj | Project file for RStudio or Visual Studio |
| FraudDetection.rxproj | Used with the Visual Studio Solution File |
| FraudDetection.sln | Visual Studio Solution File |
| Fraud_Detection_Notebook.ipynb | Jupyter Notebook that runs all the .R scripts |
| modeling_main.R | Defines parameters and sources the different scripts for the Development Stage |
| step1_tagging.R | Tags transactions at the account level |
| step2_splitting_preprocessing.R | Splits the tagged data set into a Training and a Testing set, cleans the training set and performs preprocessing |
| step3_feature_engineering.R | Performs feature engineering |
| step4_training_evaluation.R | Trains a boosted tree classification model on the training set, then scores and evaluates it on the testing set |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
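As the table above notes, modeling_main.R defines the parameters and sources the step scripts in order. The sketch below is only an illustration of that flow, not the template's actual code; the working directory and variable names are assumptions.

```r
## Illustrative sketch of the development flow driven by modeling_main.R.
## Paths and variable names are assumptions, not the template's exact code.
setwd("R")                                 # directory containing the step scripts

data_dir <- "../Data"                      # location of the copied input .csv files

source("step1_tagging.R")                  # tag transactions at the account level
source("step2_splitting_preprocessing.R")  # split into training/testing sets, clean, preprocess
source("step3_feature_engineering.R")      # add engineered features
source("step4_training_evaluation.R")      # train the boosted tree model and evaluate it
```

Running the Jupyter Notebook Fraud_Detection_Notebook.ipynb executes the same .R scripts end to end.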
Operationalize in SQL Server
These files are in the SQLR directory.
| File | Description |
|---|---|
| CreateRiskTable.sql | Stored procedure to create risk table for each input variable |
| OnlineFraudDetection.ps1 | Automates execution of all .sql files and creates stored procedures |
| ParseString.sql | Stored procedure to parse an input string into a SQL table |
| example_user.sql | Used during initial SQL Server setup to create the user and password and grant permissions |
| ScoreOneTrans.sql | Stored procedure to score one transaction |
| SortAcctTable.sql | Stored procedure to create recordDateTime column for Account_Info table and sort the table |
| Step0_CreateTables.sql | Creates initial tables from .csv files |
| Step1_MergeAcctInfo.sql | Stored procedure to merge untagged transactions with account-level information |
| Step2_Tagging.sql | Stored procedure to tag transactions at the account level |
| Step3_SplitData.sql | Stored procedure to split the data at the account level |
| Step4_Preprocess.sql | Stored procedure to clean data and remove prefraud transactions |
| Step5_Save2History.sql | Stored procedure to save transactions to historical table |
| Step6_CreateRiskTables.sql | Stored procedure to create all risk tables |
| Step7_FeatureEngineer.sql | Stored procedure to perform feature engineering |
| Step8_Training.sql | Stored procedure to train and save a gradient boosted tree model |
| Step9_Prediction.sql | Stored procedure to score and save results to a SQL table |
| Step10A_Evaluation.sql | Stored procedure to generate fraud account level metrics |
| Step10B_Evaluation_AUC.sql | Stored procedure to calculate AUC |
| UtilityFunctions.sql | Creates utility functions used by the other .sql scripts |
- See For the Database Analyst for more information about these files.
- Follow the PowerShell Instructions to execute the PowerShell script which automates the running of all these .sql files.
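Once the stored procedures have been created (by the PowerShell script or by running the .sql files from SSMS), they can be invoked from any SQL client. As an illustration only, the sketch below calls the ScoreOneTrans procedure from R over ODBC; the server name, database name, and parameter name are assumptions and must be adapted to your deployment.

```r
## Illustration only: invoking a stored procedure created by the SQLR scripts
## from R over ODBC. Server, database, and parameter names are assumptions.
library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",
                 Server   = "localhost",
                 Database = "Fraud",                # assumed database name
                 Trusted_Connection = "Yes")

# One raw transaction record passed as a single string; ParseString.sql is what
# the template uses server-side to turn such a string into a SQL table.
one_transaction <- "field1,field2,..."              # placeholder record

# The parameter name @inputString is a placeholder, not the procedure's actual signature.
stmt <- sqlInterpolate(con, "EXEC ScoreOneTrans @inputString = ?trans", trans = one_transaction)
dbExecute(con, stmt)

dbDisconnect(con)
```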
HDInsight Solution on Spark Cluster
These files are in the RSparkCluster directory.
| File | Description |
|---|---|
| copy_dev_to_prod.R | Defines the copy_to_prod function used in development_main.R |
| data_generation.R | Generates data; used in development_main.R |
| in_memory_scoring.R | Performs in-memory scoring for batch scoring or for scoring remotely with a web service |
| production_main.R | Scores new data using subset of development steps |
| step0_directories_creation.R | Creates initial directories |
| step1_merge_account_info.R | Merges the two tables Untagged_Transaction and Account_Info |
| step2_tagging.R | Tags transactions at the account level |
| step3_splitting.R | Splits the tagged data set into a Training and a Testing set |
| step4_preprocessing.R | Performs preprocessing on an input data set |
| step5_create_risk_tables.R | Creates the risk tables for various character variables |
| step6_feature_engineering.R | Performs feature engineering |
| step7_training.R | Trains a gradient boosted trees (GBT) model on input data |
| step8_prediction.R | Performs scoring |
| step9_evaluation.R | Performs evaluation on a scored data set |
| web_scoring_main.R | Defines and publishes the main web scoring function |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
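The template's own driver scripts on the cluster orchestrate these steps, but as a rough illustration, running the development steps from RStudio on the edge node amounts to connecting to Spark and sourcing the step scripts in order. The sketch below assumes RevoScaleR (Microsoft ML Server) is installed on the cluster; the working directory is a placeholder.

```r
## Rough illustration, not the template's actual driver: run the development
## steps on the Spark cluster from the edge node. Paths are placeholders.
library(RevoScaleR)

cc <- rxSparkConnect(reset = TRUE)       # make Spark the active compute context

setwd("RSparkCluster")
source("step0_directories_creation.R")   # create the working directories
source("step1_merge_account_info.R")     # merge Untagged_Transactions with Account_Info
source("step2_tagging.R")                # tag transactions at the account level
source("step3_splitting.R")              # split into training and testing sets
source("step4_preprocessing.R")          # clean the input data
source("step5_create_risk_tables.R")     # build risk tables for the character variables
source("step6_feature_engineering.R")    # add engineered features
source("step7_training.R")               # train the gradient boosted trees model
source("step8_prediction.R")             # score
source("step9_evaluation.R")             # evaluate the scored data set

rxSparkDisconnect(cc)
```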
Resources for the Solution Package
| Directory | Description |
|---|---|
| ActionScripts | Contains scripts that are only meant to be run from the solution template deployment process |