For the Data Scientist - Develop with R
- Step 0: Intermediate Directories Creation
- Step 1: Merge with Account Information
- Step 2: Tagging
- Step 3: Splitting
- Step 4: Preprocessing
- Step 5: Create Risk Tables
- Step 6: Feature Engineering
- Step 7: Training
- Step 8: Prediction
- Step 9: Evaluation
Fraud Detection
Fraud detection is one of the earliest industrial applications of data mining and machine learning. This solution shows how to build and deploy a machine learning model for online retailers to detect fraudulent purchase transactions.
View more information about the data.
The final scored table Predict_Scores is created in SQL Server. This data is then visualized in PowerBI.
To try this out yourself, visit the Quick Start page.
The file modeling_main.R (development_main.R for the Spark/HDFS workflow) enables the user to define the inputs and call all the steps. For the SQL Server workflow, the inputs are: paths to the raw data files, database name, server name, username and password.
The database is created if it does not already exist, and the connection string as well as the SQL compute context are defined.
For the Spark/HDFS workflow, development_main.R:
- Opens the Spark connection.
- Lets the user specify the paths to the working directories on the edge node and HDFS. We assume they already exist.
- Creates a directory, LocalModelsDir, that will store the model and other tables for use in the Production or Web Scoring stages (inside the fraud_dev main function).
- Updates the tables of the Production stage directory, ProdModelDir, with the contents of LocalModelsDir (inside the fraud_dev main function).
System Requirements
The following are required to run the scripts in this solution:
- SQL Server (2016 or higher) with Microsoft ML Server (version 9.1.0 or later) installed and configured.
- A SQL user name and password, with the user configured properly to execute R scripts in-memory.
- A SQL database on which the user has permission to write and to execute stored procedures.
- For more information about SQL Server 2017 and ML Services, please visit: https://docs.microsoft.com/en-us/sql/advanced-analytics/what-s-new-in-sql-server-machine-learning-services
Step 1: Tagging
In this step, the raw data is loaded into SQL in three tables called Untagged_Transactions, Account_Info, and Fraud_Transactions. The datetime variable transactionDateTime is created during this upload.
After sorting the table Account_Info into Account_Info_Sort in decreasing order of recordDateTime for each accountID, we merge the two tables Untagged_Transactions and Account_Info_Sort into Untagged_Transactions_Account. (SQL queries are used here instead of the rxMerge function of RevoScaleR because it is not yet available for SQL data sources.) We then remove duplicate observations with another SQL query executed through rxExecuteSQLddl.
Finally, we create labels for the untagged transactions by using the Fraud table. This is done by:
- Aggregating the Fraud table on the account level, creating a start and end datetime.
- Joining this data with the Untagged_Transactions_Account data with a left join. Start and end times are NULL for non-fraud accounts.
- Tagging the data: 0 for non-fraud, 1 for fraud, 2 for pre-fraud.
The tagging logic is the following: the Fraud data is grouped by account ID and sorted by time, which gives the fraud time period for each fraud account. For each transaction in the untagged data (see the sketch after this list):
- if the account ID is not in the fraud data, the transaction is labeled as non-fraud (label = 0).
- if the account ID is in the fraud data and the transaction time is within the fraud time period of this account, the transaction is labeled as fraud (label = 1).
- if the account ID is in the fraud data and the transaction time is outside the fraud time period of this account, the transaction is labeled as pre-fraud or unknown (label = 2) and will be removed later.
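As an illustration only, this rule could be sketched in R as follows. The solution implements it with SQL queries; the fraud_periods table and its startTime/endTime columns are hypothetical names for the per-account fraud windows.

```r
# Hypothetical sketch of the tagging rule; the solution implements it in SQL.
tag_transactions <- function(untagged, fraud_periods) {
  # fraud_periods: one row per fraud account, with columns accountID, startTime, endTime.
  tagged <- merge(untagged, fraud_periods, by = "accountID", all.x = TRUE)
  tagged$label <- ifelse(
    is.na(tagged$startTime), 0,                             # account not in the fraud data
    ifelse(tagged$transactionDateTime >= tagged$startTime &
           tagged$transactionDateTime <= tagged$endTime,
           1,                                               # inside the fraud time period
           2))                                              # pre-fraud / unknown, removed later
  tagged
}
```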
Input:
- Raw data: Untagged_Transactions.csv, Account_Info.csv, and Fraud_Transactions.csv.
Output:
- Tagged SQL table.
Related files:
- step1_tagging.R
Step 2: Splitting and Preprocessing
Because feature engineering will require us to compute risk values, and in order to avoid label leakage, the risk values should be computed on a training set. This is why splitting the Tagged data into a training and a testing set is performed before the feature engineering step.
This is done by randomly selecting 70% of the accountID values to be part of the training set. In order to ensure repeatability, and to make sure that the same accountID always ends up in the same data set, accountID values are mapped to integers through a hash function, with the mapping and accountID written to the Hash_Id SQL table. We create the Hash_Id table through a SQL query in order to use the same hash function as the SQL stored procedures, for coherence.
We then create a pointer to the training set, which, at the same time, removes the pre-fraud labels (label = 2), variables not used in the next steps, observations with ID variables missing, and observations where the transaction amount in USD is negative.
After creating this pointer, we apply the clean_preprocess function on it. After detecting the variables with missing values by using the rxSummary function, it wraps the function preprocessing into rxDataStep, and acts on the data as follows:
- It replaces missing observations with 0 (or -99 for the variable localHour, since 0 already represents a valid value for that variable).
- It fixes some data entries.
- It converts a few variables to numeric to ensure correct computations in the following steps.
This function will later be applied to the testing set as well.
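For illustration, here is a minimal sketch of the kind of transform that could be wrapped into rxDataStep; the list of columns with missing values is an assumption (the solution derives it from rxSummary).

```r
# Sketch of a cleaning transform passed to rxDataStep via transformFunc.
# The columns listed here are illustrative; the real script derives them from rxSummary.
preprocessing_sketch <- function(data_list) {
  # localHour: 0 is a valid hour, so missing values are flagged with -99 instead.
  data_list$localHour[is.na(data_list$localHour)] <- -99
  # Other numeric columns: convert to numeric and replace missing values with 0.
  for (col in c("transactionAmountUSD", "digitalItemCount", "physicalItemCount")) {
    data_list[[col]] <- as.numeric(data_list[[col]])
    data_list[[col]][is.na(data_list[[col]])] <- 0
  }
  data_list
}

# Applied on a SQL table pointer, for example:
# rxDataStep(inData = Tagged_Training_sql, outFile = Tagged_Training_Processed_sql,
#            overwrite = TRUE, transformFunc = preprocessing_sketch)
```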
Input:
- Tagged SQL table.
Output:
- Tagged_Training_Processed SQL table containing the cleaned and preprocessed training set.
- Hash_Id SQL table containing the accountID and the mapping through the hash function.
Related files:
- step2_splitting_preprocessing.R
Step 3: Feature Engineering
For feature engineering, we want to design new features:
- Risk values for various numeric variables.
- Flags for mismatches between addresses, and flags for high amount transactions.
- Aggregates corresponding to the number and amount of transactions in the past day and 30 days for every transaction per accountID.
The risk table stores the risk (log of smoothed odds ratio) for each level of one categorical variable. For example, assume variable X has two levels: A and B. For level A, we compute the following:
- Total number of good transactions, n_good(A).
- Total number of bad transactions, n_bad(A).
- The smoothed odds, odds(A) = (n_bad(A) + 10) / (n_bad(A) + n_good(A) + 100).
- The risk of level A, Risk(A) = log(odds(A) / (1 - odds(A))).
Similarly, we can compute the risk value for level B. Thus, the risk table of variable X is constructed as follows:

| X | Risk    |
|---|---------|
| A | Risk(A) |
| B | Risk(B) |
With the risk table, we can assign the risk value to each level. This is how we transform a categorical variable into a numerical one.
We first compute the risk values on the training set with the function create_risk_table. For a given variable of interest, it uses rxSummary to get the proportion of fraudulent and non-fraudulent transactions for every level of that variable.
The risk value for a level of the variable is the log of the smoothed fraud odds. The risks are written to SQL tables to be used in feature engineering on the training and testing sets, or for batch scoring.
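A minimal sketch of this computation, using the smoothing constants from the formula above (the solution parameterizes them and reads the per-level counts from the rxSummary output):

```r
# Sketch: compute the risk table for one categorical variable from per-level counts.
compute_risk_sketch <- function(levels, n_bad, n_good, smooth1 = 10, smooth2 = 100) {
  odds <- (n_bad + smooth1) / (n_bad + n_good + smooth2)
  data.frame(Level = levels, Risk = log(odds / (1 - odds)))
}

# Example for a variable X with levels A and B (counts are made up):
compute_risk_sketch(levels = c("A", "B"), n_bad = c(5, 1), n_good = c(200, 500))
```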
Then, the function assign_risk_and_flags will apply the function assign_risk, wrapped into rxDataStep, on the training set. This assigns the risk values for every variable of interest, using the previously created Risk tables. At the same time, rxDataStep creates the address mismatch flags and a flag for high amounts. The output is written to SQL Server in the table Tagged_Training_Processed_Features1. This function will later be applied to the testing set as well.
Finally, we create the aggregates with the function compute_aggregates. They correspond to the number and USD amount of transactions in the past day and past 30 days for every transaction, per accountID. Although a SQL query could be used, we use R code here for illustrative purposes. The computation follows a standard split-apply-combine process:
- Load the intermediate data set Tagged_Training_Processed_Features1 in memory.
- Split the data set into a list of data frames, one per accountID.
- Compute the aggregates for each accountID with the function aggregates_account_level.
- Combine the results, using zero values when there are no aggregates, and write the result back to SQL Server.
On a given data frame of transactions corresponding to one accountID, aggregates_account_level works as follows (a sketch is given below):
- Perform a cross-apply of the data frame on itself, keeping for each transactionID only the other transactions that occurred in the past 30 days.
- Split the result in two:
  - z1day contains the transactions and their corresponding transactions that happened in the past 1 day.
  - z30day contains the transactions and their corresponding transactions that happened in the past 30 days.
- For each transaction in each of z1day and z30day, compute the number and total USD amount of the previous transactions.
- The aggregated result is returned as a data frame and is the output of the aggregates_account_level function.
The function compute_aggregates will later be used on the testing set as well.
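The per-account logic could be sketched as below; this is an illustrative re-implementation, not the solution's cross-apply formulation, and the amount column name is an assumption.

```r
# Illustrative per-account aggregation; aggregates_account_level differs in detail.
aggregates_account_level_sketch <- function(df) {
  df <- df[order(df$transactionDateTime), ]
  agg <- t(sapply(seq_len(nrow(df)), function(i) {
    now  <- df$transactionDateTime[i]
    past <- df[df$transactionDateTime < now, ]
    in1  <- past[difftime(now, past$transactionDateTime, units = "days") <= 1, ]
    in30 <- past[difftime(now, past$transactionDateTime, units = "days") <= 30, ]
    c(transactionCount1Day   = nrow(in1),
      transactionAmount1Day  = sum(in1$transactionAmountUSD),
      transactionCount30Day  = nrow(in30),
      transactionAmount30Day = sum(in30$transactionAmountUSD))
  }))
  cbind(df, agg)
}
```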
The final data is written to the SQL table Tagged_Training_Processed_Features. Using stringsAsFactors = TRUE, we convert the character variables to factors and get their levels information in the column_info list, to be used for the testing set as well.
Input:
- Tagged_Training_Processed SQL table.
Output:
- Tagged_Training_Processed_Features SQL table containing the new features.
- Various Risk SQL tables containing the risk values for each level of the variables.
- column_info list, to be used on the training and testing sets to specify the types of the variables and, in particular, the levels of the factors.
Related files:
- step3_feature_engineering.R
Step 4: Training, Testing and Evaluating
After pointing to the training set with the correct variable types (using column_info), we write the formula to be used for the classification.
We build a gradient boosted trees (GBT) model with the rxFastTrees algorithm from the MicrosoftML library. The argument unbalancedSets = TRUE helps deal with the class imbalance that is observed in this data set.
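A sketch of the training call; the feature list and the data-source name below are illustrative, while the real formula is built from the engineered feature set.

```r
library(MicrosoftML)

# Illustrative formula; the real one is built from the engineered feature columns.
training_formula <- label ~ transactionAmountUSD + localHour + isHighAmount +
                            transactionCount30Day + transactionAmount30Day

boosted_fit <- rxFastTrees(formula = training_formula,
                           data = Tagged_Training_Processed_Features_sql,  # RxSqlServerData pointer
                           type = "binary",
                           unbalancedSets = TRUE)  # compensates for the rare fraud class
```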
The trained model is serialized and uploaded to the SQL table Models through an ODBC connection, in case it is needed later.
We then point with a query to the raw testing set, and using the previously defined functions clean_preprocess, assign_risk_and_flags, and compute_aggregates, we get the testing set Tagged_Testing_Processed_Features ready for scoring.
Finally, we compute predictions on the testing set, written to the SQL table Predict_Scores. This table is then loaded in memory, and various performance metrics are computed:
- AUC (Area Under the Curve) for the ROC. This represents how well the model can differentiate between non-fraudulent and fraudulent transactions given a good decision threshold in the testing set. We draw the ROC curve, representing the true positive rate as a function of the false positive rate for various possible cutoffs.
- Various account-level metrics and graphs:
  - ADR: Fraud Account Detection Rate. The percentage of detected fraud accounts among all fraud accounts.
  - VDR: Value Detection Rate. The percentage of monetary savings, assuming the current fraud transaction triggered a blocking action on subsequent transactions, over all fraud losses.
  - AFPR: Account False Positive Ratio. The ratio of detected false positive accounts over detected fraud accounts.
Input:
- Tagged_Training_Processed_Features SQL table containing the new features and the preprocessed training set.
- Hash_Id SQL table containing the accountID and the mapping through the hash function.
- Various Risk SQL tables containing the risk values for each level of the variables, to be used for feature engineering on the testing set.
- column_info list, to be used on the training and testing sets to specify the types of the variables and, in particular, the levels of the factors.
Output:
- Models SQL table containing the serialized GBT model.
- Tagged_Testing_Processed_Features SQL table containing the new features and the preprocessed testing set.
- Predict_Score SQL table containing the predictions made on the testing set.
- Performance metrics and graphs.
Related files:
- step4_training_evaluation.R
Updating the Production Stage Directory (“Copy Dev to Prod”)
At the end of the main function of the script development_main.R, the copy_dev_to_prod.R script is invoked in order to copy (or overwrite, if they already exist) the model, statistics and other data from the Development stage to a directory of the Production or Web Scoring stage.
If you do not wish to overwrite the model currently in use in a Production stage, you can either save it to a different directory, or set update_prod_flag to 0 inside the main function.
Production Stage
Visualize Results
The predictions made on the testing set are stored in the table Predict_Score of the Fraud database. The test data itself is in the Tagged_Testing table. The next step of this solution is to visualize both tables in PowerBI.
- See For the Business Manager for details of the PowerBI dashboard.
Step 0: Intermediate Directories Creation
In this step, we create or clean intermediate directories both on the edge node and HDFS. These directories will hold all the intermediate processed data sets in subfolders.
Related Files
- step0_directories_creation.R
Step 1: Merge with Account Information
In this step, we convert the untagged transaction data and the account info data into Hive tables and merge them on the account ID using a Hive command, so that every transaction gets its corresponding account information. Duplicate rows are removed as well. If this step is used in the “Production scoring” or “Web scoring” stage, a fake label column is added to the data so that the rxPredict function works properly later.
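As an illustration, the merge amounts to a Hive join of the two tables. The query below is a simplified sketch: the source table and column names are assumptions, and the real step1_merge_account_info.R also keeps only the latest account record and deduplicates.

```r
# Simplified sketch; the actual Hive query in step1_merge_account_info.R is more complete.
merge_query <- "
CREATE TABLE IF NOT EXISTS UntaggedTransactionsAccountUnique AS
SELECT DISTINCT t.*, a.accountOwnerName, a.accountPostalCode
FROM UntaggedTransactions t
LEFT JOIN AccountInfo a ON t.accountID = a.accountID"

# Run the query through the hive CLI on the edge node (assumed to be on the PATH).
system(paste("hive -e", shQuote(merge_query)))
```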
Input
- Two data files: Untagged_Transactions and Account_Info.
- The working directory on HDFS.
- Stage: “Dev” for development, “Prod” for batch scoring, “Web” for web scoring.
Output
- Hive table: UntaggedTransactionsAccountUnique (Stage = “Dev”) or TaggedProd (Stage = “Prod” or “Web”).
Related Files
- step1_merge_account_info.R
Step 2: Tagging
In this step, we tag the untagged data at the account level based on the fraud data. The tagging logic is the following: the fraud data is grouped by account ID and sorted by time, which gives the fraud time period for each fraud account. For each transaction in the untagged data, if the account ID is not in the fraud data, the transaction is labeled as non-fraud (label = 0); if the account ID is in the fraud data and the transaction time is within the fraud time period of this account, the transaction is labeled as fraud (label = 1); if the account ID is in the fraud data and the transaction time is outside the fraud time period of this account, the transaction is labeled as pre-fraud or unknown (label = 2) and will be removed later.
Input
- Input_Hive_Table: name of the Hive table from the merging step with the untagged transactions and account info.
- Path to the csv file with the raw fraud data, Fraud_Transactions.csv.
- HDFSWorkDir: working directory on HDFS.
Output
- Tagged data.
Related Files
- step2_tagging.R
Step 3: Splitting
In this step, we hash accountID into 100 different hash codes and split the whole data into training (70%) and testing (30%) sets based on the hash code, e.g., training = hash code <= 70 and testing = hash code > 70. At the same time, transactions with label = 2 are removed.
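A sketch of this split using rxDataStep, assuming a column hashCode in 1-100 has already been derived from accountID (the data-source names are illustrative):

```r
# Keep 70% of the hash codes for training and the rest for testing,
# dropping the pre-fraud transactions (label = 2) at the same time.
rxDataStep(inData = tagged_xdf, outFile = training_xdf, overwrite = TRUE,
           rowSelection = (hashCode <= 70 & label != 2))

rxDataStep(inData = tagged_xdf, outFile = testing_xdf, overwrite = TRUE,
           rowSelection = (hashCode > 70 & label != 2))
```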
Input
- Tagged data set.
Output
- Training and Testing sets.
Related Files
- step3_splitting.R
Step 4: Preprocessing
In this step, we use the rxSummary function to get the missing-value information. Missing values of localHour are filled with -99. Missing values for the remaining columns are filled with 0. We also fix some data entries and convert a few variables to numeric.
Input
- HDFSWorkDir: working directory on HDFS.
- HiveTable: name of the Hive table with the input data to be preprocessed.
Output
- Hive table with preprocessed data.
Related Files
- step4_preprocessing.R
Step 5: Create Risk Tables
In this step, we create risk tables for categorical variables, such as location related variables.
The risk table stores the risk (log of smoothed odds ratio) for each level of one categorical variable. For example, assume variable X has two levels: A and B. For level A, we compute the following:
- Total number of good transactions, n_good(A).
- Total number of bad transactions, n_bad(A).
- The smoothed odds, odds(A) = (n_bad(A) + 10) / (n_bad(A) + n_good(A) + 100).
- The risk of level A, Risk(A) = log(odds(A) / (1 - odds(A))).
Similarly, we can compute the risk value for level B. Thus, the risk table of variable X is constructed as follows:

| X | Risk    |
|---|---------|
| A | Risk(A) |
| B | Risk(B) |
With the risk table, we can assign the risk value to each level. This is how we transform a categorical variable into a numerical one.
The rxSummary function is used to get the counts of fraud and non-fraud transactions for the variables that need to be converted. Then, for each variable, we combine the fraud and non-fraud counts to calculate the risk table. All risk tables are put into one list and saved to the model directory on the edge node.
Input
- LocalWorkDir and HDFSWorkDir: working directories on the local edge node and on HDFS.
- HiveTable: name of the Hive table containing the preprocessed training set used to create the risk tables.
- smooth1 and smooth2: smoothing parameters used to compute the risk values.
Output
- Risk tables embedded in a list, Risk_list, saved on the edge node.
Related Files
- step5_create_risk_tables.R
Step 6: Feature Engineering
This step does feature engineering for an input data set. We will generate three groups of new features:
- Numerical risk variables transformed from categorical variables based on the risk tables generated in step 5.
- Binary variables. For example, high amount flag and address mismatch flags.
- Aggregates. For example, a customer's completed transactions in the past 1 day and 30 days.
To calculate the aggregates, since there is no existing rx function we can use, we use regular R functions. In order to make this scalable to big data, we hash the account ID and split the data into small chunks by hash code. Chunks are account-ID exclusive, so we can safely apply the aggregate calculation to each chunk in parallel. Finally, the output chunks are combined into one xdf file (a rough sketch of this scheme is given below).
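The chunking scheme could look like the following; the helper reuses the per-account sketch shown in the feature-engineering description above, and the in-memory data frame, the number of chunks and the use of rxExec for parallelism are illustrative assumptions.

```r
# Chunks are exclusive in accountID, so each can be aggregated independently.
account_ids <- unique(in_memory_df$accountID)
chunks      <- split(account_ids, seq_along(account_ids) %% 10)  # 10 illustrative chunks

aggregate_chunk <- function(ids, full_df) {
  chunk <- full_df[full_df$accountID %in% ids, ]
  do.call(rbind, lapply(split(chunk, chunk$accountID), aggregates_account_level_sketch))
}

# rxExec runs one call per chunk, in parallel under the current compute context.
results    <- rxExec(aggregate_chunk, ids = rxElemArg(chunks), full_df = in_memory_df)
aggregated <- do.call(rbind, results)
```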
Input
- LocalWorkDir and HDFSWorkDir: working directories on the local edge node and on HDFS.
- HiveTable: name of the Hive table containing the preprocessed data set to which the new features will be added.
- Stage: “Dev” for development, “Prod” for batch scoring, “Web” for web scoring.
Output
- Preprocessed xdf file with new features and correct variable types.
Related Files
- step6_feature_engineering.R
Step 7: Training
In this step, we train a GBT model with the training data. Note that the label is imbalanced (many more non-fraud than fraud transactions); this is handled by the unbalancedSets argument of the rxFastTrees function. The trained model is saved to the model directory on the local edge node.
Input
- LocalWorkDir and HDFSWorkDir: working directories on the local edge node and on HDFS.
- Input_Data_Xdf: training data.
Output
- Trained GBT model object.
Related Files
- step7_training.R
Step 8: Prediction
In this step, we do prediction (scoring) with the model created in step 7. If Stage is “Dev” or “Prod”, the model object is loaded from the model directory on the edge node. If Stage is “Web”, the model object is passed directly. In the “Dev” stage, we also create a Hive table for the scored data set; this table will be ingested by PowerBI for visualization.
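A sketch of the scoring call; the model file name and the data-source names are illustrative.

```r
library(MicrosoftML)

# Load the GBT model saved by step 7 (file name is illustrative).
boosted_fit <- readRDS(file.path(LocalModelsDir, "gbt_model.rds"))

# Score the engineered feature set and keep a few identifying columns alongside the scores.
rxPredict(modelObject = boosted_fit,
          data = features_xdf,
          outData = scored_xdf,
          overwrite = TRUE,
          extraVarsToWrite = c("accountID", "transactionDateTime", "label"))
```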
Input
- LocalWorkDir and HDFSWorkDir: working directories on the local edge node and on HDFS.
- Input_Data_Xdf: name of the xdf file with the input data to be scored.
- Stage: “Dev” for development, “Prod” for batch scoring, “Web” for web scoring.
Output
- Scored data set.
Related Files
- step8_prediction.R
Step 9: Evaluation
In this step, we evaluate the scored data set if the ground truth label exists (so this is only performed in the development stage). We create both transaction-level and account-level metrics. For the transaction-level metrics, the rxRoc function is used to get the ROC curve and the rxAuc function is used to calculate the AUC. For the account-level metrics, we import the data in memory and use a customized function to get the results.
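A sketch of the transaction-level evaluation; the name of the predicted-probability column is an assumption.

```r
# ROC and AUC on the scored data set; "Probability" is assumed to be the
# predicted fraud probability column written by rxPredict.
roc <- rxRoc(actualVarName = "label",
             predVarName   = "Probability",
             data = scored_xdf,
             numBreaks = 1000)

plot(roc)    # plotted ROC curve
rxAuc(roc)   # area under the ROC curve
```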
Input
- HDFSWorkDir: working directory on HDFS.
- Scored_Data_Xdf: scored data set.
Output
- AUC.
- Plotted ROC curve.
- Account-level metrics and plots.
Related Files
- step9_evaluation.R
Updating the Production Stage Directory (“Copy Dev to Prod”)
At the end of the main function of the script development_main.R, the copy_dev_to_prod.R script is invoked in order to copy (or overwrite, if they already exist) the model, statistics and other data from the Development stage to a directory of the Production or Web Scoring stage.
If you do not wish to overwrite the model currently in use in a Production stage, you can either save it to a different directory, or set update_prod_flag to 0. If you are running the solution for the very first time, make sure to set the flag to 1.
Production Stage
In the Production stage, the goal is to perform batch scoring.
The script production_main.R will complete this task by invoking the scripts described above. The batch scoring can be done either:
- In-memory : The input should be provided as data frames. All the preprocessing and scoring steps are done in-memory on the edge node (local compute context). In this case, the main batch scoring function calls the R script in_memory_scoring.R.
- Using data stored on HDFS: The input should be provided as paths to the Production data sets. All the preprocessing and scoring steps are done on HDFS in the Spark compute context.
When the data set to be scored is relatively small and fits in memory on the edge node, in-memory scoring is recommended: the overhead of using Spark would make the scoring much slower.
The script:
- Lets the user specify the paths to the Production working directories on the edge node and HDFS (only used for Spark compute context).
- Lets the user specify the paths to the Production data sets Untagged_Transactions and Account_Info (Spark Compute Context) or point to them if they are data frames loaded in memory on the edge node (In-memory scoring).
The computations described in the Development stage are performed, with the following differences:
- No tagging step. A fake label column is added instead.
- No splitting into a training and a testing set, since the whole data set is used for scoring.
- No creation of risk tables. The risk tables generated in the Development stage are loaded for use.
- No training. The GBT model created in the Development stage is loaded and used for predictions.
- No model evaluation, since we usually do not have a ground truth label for newly arriving data.
Warning: if you get the following error: “Error: file.exists(inData1) is not TRUE”, you should reset your R session with Ctrl + Shift + F10 (or Session -> Restart R) and try running it again.
Deploy as a Web Service
In the script web_scoring_main.R, we define a scoring function and deploy it as a web service so that customers can score their own data sets locally/remotely through the API. Again, the scoring can be done either:
- In-memory : The input should be provided as data frames. All the preprocessing and scoring steps are done in-memory on the edge node (local compute context). In this case, the main batch scoring function calls the R script in_memory_scoring.R.
- Using data stored on HDFS: The input should be provided as paths to the Production data sets. All the preprocessing and scoring steps are done on HDFS in the Spark compute context.
When the data set to be scored is relatively small and fits in memory on the edge node, in-memory scoring is recommended: the overhead of using Spark would make the scoring much slower.
This is done in the following way:
- Log into the ML Server that hosts the web services as admin. Note that even if you are already on the edge node, you still need to perform this step for authentication purposes.
- Specify the paths to the working directories on the edge node and HDFS.
- Specify the paths to the input data sets Untagged_Transactions and Account_Info or point to them if they are data frames loaded in memory on the edge node.
- Load the static .rds files needed for scoring, created in the Development stage. They are wrapped into a list called model_objects, which will be published along with the scoring function.
- Define the web scoring function, which calls the steps as in the Production stage.
- Publish it as a web service using the publishService function (see the sketch after this list). Two web services are published: one for a string input (Spark compute context) and one for a data frame input (in-memory scoring in the local compute context). To update an existing web service, use the updateService function. Note that you cannot publish a new web service with the same name and version twice, so you might have to change the version number.
- Verification:
  - Verify the API locally: call the API from the edge node.
  - Verify the API remotely: call the API from your local machine. You still need to remote login as admin from your local machine first. Connecting directly to the edge node that hosts the service from other machines is not allowed; the workaround is to open an ssh session on port 12800 and keep this session open. Then, you can remote login. Use the getService function to get the published API and call it from your local R console.
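A minimal sketch of the publishing call with mrsdeploy; the service name, version, credentials and input/output signature are placeholders, and fraud_web_scoring stands for the scoring function defined in web_scoring_main.R.

```r
library(mrsdeploy)

# Authenticate against the ML Server that hosts the web services.
remoteLogin("http://localhost:12800", username = "admin",
            password = "<admin-password>", session = FALSE)

# Publish the scoring function together with the model objects it needs.
api <- publishService(
  name   = "fraud_batch_scoring_spark",        # placeholder service name
  code   = fraud_web_scoring,                  # scoring function defined in the script
  model  = model_objects,                      # list of .rds objects loaded above
  inputs = list(Untagged_Transactions_str = "character",
                Account_Info_str          = "character"),
  outputs = list(answer = "data.frame"),
  v = "v1.0.0"
)

# updateService() refreshes an existing name/version instead of publishing a new one.
```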
Using Your Own Data Set
A new data set can be used for either or both the Modeling and the Production pipeline. However, for this template to work as is, it should follow these requirements:
- The data schema should match that of the files Untagged_Transactions, Account_Info and Fraud_Transactions.
Template Contents
To try this out yourself:
- View the Quick Start.