Fraud Detection

For the Data Scientist - Develop with R


This implementation on Azure SQL Server ML Services takes advantage of the power of SQL Server and RevoScaleR (Microsoft ML Server).
Data scientists who are testing and developing solutions can work from the convenience of their R IDE on their client machine, while setting the computation context to SQL (see the R folder for code). They can also deploy the completed solutions to SQL Server (2016 or higher) by embedding calls to R in stored procedures (see the SQLR folder for code). These solutions can then be further automated with SQL Server Integration Services and SQL Server Agent: a PowerShell script (.ps1 file) automates the running of the SQL code.

Fraud Detection


Fraud detection is one of the earliest industrial applications of data mining and machine learning. This solution shows how to build and deploy a machine learning model for online retailers to detect fraudulent purchase transactions.

This solution shows how to build a model that detects potentially fraudulent transactions so that they can be rejected. This implementation with SQL Server ML Services is equivalent to the Azure ML template for Online Fraud Detection.

View more information about the data.

In this solution, the final scored database table Predict_Score is created in SQL Server. This data is then visualized in PowerBI.

To try this out yourself, visit the Quick Start page.

The file modeling_main.R enables the user to define the inputs and call all the steps. The inputs are the paths to the raw data files, the database name, the server name, and the username and password.

The database is created if it does not already exist, and the connection string as well as the SQL compute context are defined.
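A minimal sketch of this setup (server, database, and credential values are placeholders, not the template's defaults):

    # Sketch only: define the connection string and switch the compute context to SQL Server.
    library(RevoScaleR)
    connection_string <- "Driver=SQL Server;Server=myserver;Database=Fraud;UID=myuser;PWD=mypassword"

    # Compute context pointing at the SQL Server machine; wait = TRUE runs jobs synchronously.
    sql_context <- RxInSqlServer(connectionString = connection_string, wait = TRUE)
    rxSetComputeContext(sql_context)

    # Switch back to local computations when needed:
    # rxSetComputeContext("local")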

Step 1: Tagging


In this step, the raw data is loaded into SQL in three tables called Untagged_Transactions, Account_Info, and Fraud_Transactions. The date time variable transactionDateTime is created during this upload.

After sorting the table Account_Info into Account_Info_Sort in decreasing order of recordDateTime for each accountID, we merge the two tables Untagged_Transactions and Account_Info_Sort into Untagged_Transactions_Account. (SQL queries are used here instead of the rxMerge function of RevoScaleR because rxMerge is not yet available for SQL Server data sources.) We then remove duplicate observations with another SQL query executed through rxExecuteSQLDDL.
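A hedged sketch of how such a query can be issued from R (the SQL text and column list are illustrative only; the real merge also matches each transaction to the most recent preceding Account_Info record, and the exact statements live in step1_tagging.R):

    # Sketch: run a SQL statement from R through an ODBC data source,
    # assuming connection_string is defined as in modeling_main.R.
    merge_query <- "
      SELECT ut.*, ai.accountPostalCode            -- illustrative column only
      INTO Untagged_Transactions_Account
      FROM Untagged_Transactions ut
      JOIN Account_Info_Sort ai ON ut.accountID = ai.accountID"
    odbc_source <- RxOdbcData(table = "Untagged_Transactions_Account",
                              connectionString = connection_string)
    rxOpen(odbc_source, "w")
    rxExecuteSQLDDL(odbc_source, sSQLString = merge_query)
    rxClose(odbc_source)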

Finally, we create labels for the untagged transactions by using the Fraud table. This is done by:

  1. Aggregating the Fraud table on the account level, creating a start and end datetime.

  2. Joining this data with the Untagged_Transactions_Account data with a left join. The start and end times are NULL for non-fraud accounts.

  3. Tagging the data: 0 for non-fraud, 1 for fraud, 2 for pre-fraud.

The tagging logic is as follows: the Fraud data is grouped by account ID and sorted by time, which gives the fraud time period for each fraud account. For each transaction in the untagged data (see the sketch after this list):

  • if the account ID is not in the fraud data, the transaction is labeled as non-fraud (label = 0).
  • if the account ID is in the fraud data and the transaction time is within the fraud time period of this account, the transaction is labeled as fraud (label = 1).
  • if the account ID is in the fraud data and the transaction time is outside the fraud time period of this account, the transaction is labeled as pre-fraud or unknown (label = 2), which will be removed later.
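A minimal sketch of this labeling rule in plain R (argument names are assumptions; the template implements the rule with a SQL join and CASE logic rather than an R function):

    # Hypothetical sketch of the tagging rule; fraud_start and fraud_end come from the
    # account-level aggregation of the Fraud table (NA when the account has no fraud).
    tag_transaction <- function(transaction_time, fraud_start, fraud_end) {
      if (is.na(fraud_start)) return(0)                 # account not in the fraud data: non-fraud
      if (transaction_time >= fraud_start &&
          transaction_time <= fraud_end) return(1)      # within the fraud period: fraud
      return(2)                                         # outside the fraud period: pre-fraud / unknown
    }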

Input:

  • Raw data: Untagged_Transactions.csv, Account_Info.csv, and Fraud_Transactions.csv.

Output:

  • Tagged SQL table.

Related files:

  • step1_tagging.R

Step 2: Splitting and Preprocessing


Feature engineering requires computing risk values and, to avoid label leakage, these risk values should be computed on the training set only. This is why the tagged data is split into a training set and a testing set before the feature engineering step.

This is done by randomly selecting 70% of the accountID values to be part of the training set. To ensure repeatability, and so that a given accountID always ends up in the same data set, accountID values are mapped to integers through a hash function, and each accountID is written with its mapping to the Hash_Id SQL table. We create the Hash_Id table through a SQL query in order to use the same hash function as the SQL stored procedures, for consistency.
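The account-level split can be sketched in R as follows (the digest package and the crc32 hash are illustrative assumptions; the template performs the hashing in SQL precisely so that R and the stored procedures agree):

    # Conceptual sketch only: map each accountID to an integer in [0, 99] and keep
    # the accounts hashing below 70 for training (roughly a 70/30 account-level split).
    library(digest)                                   # assumption: digest package available
    hash_to_int <- function(id) {
      strtoi(substr(digest(id, algo = "crc32"), 1, 4), base = 16L) %% 100
    }
    account_ids <- c("A100", "A101", "A102")          # illustrative accountID values
    in_training <- sapply(account_ids, hash_to_int) < 70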

We then create a pointer to the training set, which, at the same time, removes the pre-fraud labels (label = 2), variables not used in the next steps, observations with ID variables missing, and observations where the transaction amount in USD is negative.

After creating this pointer, we apply the clean_preprocess function to it. This function detects the variables with missing values using rxSummary, then wraps the preprocessing function into rxDataStep, which acts on the data as follows (a sketch is given below):

  • It replaces the missing observations with 0 (or -99 for the variable localHour since 0 already represents a valid value for this variable).
  • It fixes some data entries.
  • It converts a few variables to numeric to ensure correct computations in the following steps.

This function will later be applied to the testing set as well.
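A minimal sketch of that transform, with assumed column names other than localHour:

    # Hedged sketch of the kind of transform clean_preprocess wraps into rxDataStep:
    # replace missing values with 0, except localHour which uses -99.
    preprocess_sketch <- function(data_chunk) {
      filled_with_zero <- c("isProxyIP", "cvvVerifyResult")   # hypothetical variable names
      for (v in filled_with_zero) {
        data_chunk[[v]][is.na(data_chunk[[v]])] <- 0
      }
      data_chunk$localHour[is.na(data_chunk$localHour)] <- -99
      data_chunk$transactionAmountUSD <- as.numeric(data_chunk$transactionAmountUSD)
      return(data_chunk)
    }
    # Applied through rxDataStep on the SQL data sources, for example:
    # rxDataStep(inData = tagged_training_sql, outFile = tagged_training_processed_sql,
    #            overwrite = TRUE, transformFunc = preprocess_sketch)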

Input:

  • Tagged SQL table.

Output:

  • Tagged_Training_Processed SQL table containing the cleaned and preprocessed training set.
  • Hash_Id SQL table containing the accountID and the mapping through the hash function.

Related files:

  • step2_splitting_preprocessing.R

Step 3: Feature Engineering


For feature engineering, we want to design new features:

  • Risk values for various numeric variables.
  • Flags for mismatches between addresses, and flags for high amount transactions.
  • Aggregates corresponding to the number and amount of transactions in the past day and 30 days for every transaction per accountID.

The risk table stores risk (log of smoothed odds ratio) for each level of one categorical variable. For example, variable X has two levels: A and B. For level A, we compute the following:

  • The total number of good transactions, n_good(A).
  • The total number of bad transactions, n_bad(A).
  • The smoothed odds: odds(A) = (n_bad(A) + 10)/(n_bad(A) + n_good(A) + 100).
  • The risk of level A: Risk(A) = log(odds(A)/(1 - odds(A))).

Similarly, we can compute the risk value for level B. The risk table of variable X is thus constructed as follows:

X Risk
A Risk(A)
B Risk(B)

With the risk table, we can assign the risk value to each level. This is how we transform a categorical variable into a numerical variable.
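For instance, with illustrative counts n_good(A) = 1000 and n_bad(A) = 5, the formula above gives:

    # Worked example of the risk formula with made-up counts for level A.
    n_good <- 1000
    n_bad  <- 5
    odds   <- (n_bad + 10)/(n_bad + n_good + 100)   # smoothed odds = 15/1105, about 0.0136
    risk   <- log(odds/(1 - odds))                  # Risk(A) is about -4.29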

We first compute the risk values, using the training set, with the function create_risk_table. For a given variable of interest, it uses rxSummary to get the proportion of fraudulent and non-fraudulent transactions for every level of that variable. The risk value for a level of the variable is the log of the smoothed odds of fraud for that level. The risks are written to SQL tables to be used in feature engineering on the training and testing sets or for batch scoring.

Then, the function assign_risk_and_flags applies the function assign_risk, wrapped into rxDataStep, to the training set. This assigns the risk values for every variable of interest, using the previously created Risk tables. At the same time, rxDataStep creates the address mismatch flags and a flag for high amounts. The output is written to SQL Server in the table Tagged_Training_Processed_Features1. This function will later be applied to the testing set as well.
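A hedged sketch of the lookup that assign_risk performs (the risk table layout here is an assumption):

    # Sketch only: replace each level of a categorical variable with its risk value,
    # defaulting to 0 for levels that were not seen in the training set.
    assign_risk_sketch <- function(values, risk_table) {
      risk <- risk_table$risk[match(values, risk_table$level)]
      risk[is.na(risk)] <- 0
      risk
    }
    # Example with a toy risk table:
    # assign_risk_sketch(c("A", "B", "C"), data.frame(level = c("A", "B"), risk = c(-4.29, -3.80)))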

Finally, we create the aggregates with the function compute_aggregates. They correspond to the number and USD amount of transactions in the past day and 30 days for every transaction per accountID. Although a SQL query can be used, we used R code here for illustrative purposes. The computation follows a standard split-apply-combine process.

  1. Load the intermediate data set Tagged_Training_Processed_Features1 in memory.
  2. Split the data set into a list of data frames for each accountID.
  3. Compute aggregates for each accountID with the function aggregates_account_level.
  4. Combine the results, using zero values where there is nothing to aggregate, and write the result back to SQL Server (see the sketch after this list).
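In R terms, the outline of compute_aggregates looks roughly like this (the data source names are assumptions):

    # Sketch of the split-apply-combine outline (not the template's exact code).
    tagged <- rxImport(tagged_training_features1_sql)           # 1. load into memory
    by_account <- split(tagged, tagged$accountID)                # 2. one data frame per accountID
    agg_list <- lapply(by_account, aggregates_account_level)     # 3. per-account aggregates
    aggregates <- do.call(rbind, agg_list)                       # 4. combine; the template then
                                                                 #    fills missing aggregates with 0
    # rxDataStep(inData = aggregates, outFile = tagged_training_features_sql, overwrite = TRUE)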

On a given data frame of transactions corresponding to one accountID, aggregates_account_level works as follows (a simplified sketch follows the list):

  • Perform a cross-apply of the data frame with itself, keeping, for each transactionID, only the other transactions that occurred in the past 30 days.
  • Split the table in 2:
    • z1day will contain the transactions and their corresponding transactions that happened in the past 1 day.
    • z30day will contain the transactions and their corresponding transactions that happened in the past 30 days.
  • For each transaction in each of z1day and z30day, we compute the number and total USD amount of the previous transactions.
  • The aggregated result is returned as a data frame and is the output of the aggregates_account_level function.
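A simplified per-account sketch of that computation (column names are assumptions, and only the 1-day window is shown; the real function also produces the 30-day figures via the cross-apply described above):

    # Simplified sketch: for each transaction of one account, count and sum the USD amount
    # of that account's transactions in the previous 24 hours (transactionDateTime is POSIXct).
    aggregates_sketch <- function(df) {
      df <- df[order(df$transactionDateTime), ]
      past_day_count  <- numeric(nrow(df))
      past_day_amount <- numeric(nrow(df))
      for (i in seq_len(nrow(df))) {
        in_window <- df$transactionDateTime <  df$transactionDateTime[i] &
                     df$transactionDateTime >= df$transactionDateTime[i] - 86400   # last 24 hours
        past_day_count[i]  <- sum(in_window)
        past_day_amount[i] <- sum(df$transactionAmountUSD[in_window])
      }
      cbind(df, past_day_count, past_day_amount)
    }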

The function compute_aggregates will later be used on the testing set as well.

The final data is written to the SQL table Tagged_Training_Processed_Features. Using stringsAsFactors = TRUE, we convert the character variables to factors and capture their level information in the column_info list, to be used for the testing set as well.
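A hedged sketch of how the factor-level information can be captured for reuse (assuming rxCreateColInfo is available in the installed RevoScaleR version):

    # Sketch: point at the final training table with stringsAsFactors = TRUE so that character
    # columns become factors, then capture the column types and factor levels for later reuse.
    features_sql <- RxSqlServerData(table = "Tagged_Training_Processed_Features",
                                    connectionString = connection_string,
                                    stringsAsFactors = TRUE)
    column_info <- rxCreateColInfo(features_sql)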

Input:

  • Tagged_Training_Processed SQL table.

Output:

  • Tagged_Training_Processed_Features SQL table containing new features.
  • Various Risk SQL tables containing the risk values for each level of the variables.
  • column_info list to be used on the training and testing sets to specify the variable types and, in particular, the factor levels.

Related files:

  • step3_feature_engineering.R

Step 4: Training, Testing and Evaluating


After pointing to the training set with the correct variable types (using column_info), we write the formula to be used for the classification. We build a gradient boosted trees (GBT) model with the rxFastTrees algorithm from the MicrosoftML library. The argument unbalancedSets = TRUE helps deal with the class imbalance observed in this data set. The trained model is serialized and uploaded through an ODBC connection to the SQL table Models, in case it is needed later.
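A hedged sketch of the training call (the feature list is a small assumed subset, not the template's full formula):

    # Sketch: build the formula from a vector of feature names and train a boosted-trees
    # binary classifier, with unbalancedSets = TRUE to account for the class imbalance.
    library(MicrosoftML)
    features <- c("transactionAmountUSD", "localHour", "isProxyIP")   # illustrative subset
    training_formula <- as.formula(paste("label ~", paste(features, collapse = " + ")))
    training_sql <- RxSqlServerData(table = "Tagged_Training_Processed_Features",
                                    connectionString = connection_string,
                                    colInfo = column_info)
    boosted_fit <- rxFastTrees(formula = training_formula, data = training_sql,
                               type = "binary", unbalancedSets = TRUE)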

We then point with a query to the raw testing set, and using the previously defined functions clean_preprocess, assign_risk_and_flags, and compute_aggregates, we get the testing set Tagged_Testing_Processed_Features ready for scoring.

Finally, we compute predictions on the testing set and write them to the SQL table Predict_Score. The scored data is then loaded into memory, and various performance metrics are computed (a scoring sketch follows the list below).

  • AUC (Area Under the Curve) for the ROC. This represents how well the model can differentiate between the non-fraudulent and the fraudulent transactions given a good decision threshold in the testing set. We draw the ROC curve, representing the true positive rate as a function of the false positive rate for various possible cutoffs.

  • Various account level metrics and graphs.
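A hedged sketch of the scoring and the transaction-level ROC/AUC computation (the Probability column name is what rxPredict typically emits for a binary MicrosoftML model; outData is omitted for brevity):

    # Sketch: score the prepared testing set and compute the transaction-level ROC and AUC.
    testing_sql <- RxSqlServerData(table = "Tagged_Testing_Processed_Features",
                                   connectionString = connection_string,
                                   colInfo = column_info)
    predictions <- rxPredict(boosted_fit, data = testing_sql,
                             extraVarsToWrite = c("accountID", "label", "transactionAmountUSD"))
    roc <- rxRoc(actualVarName = "label", predVarNames = "Probability", data = predictions)
    plot(roc)       # ROC curve: true positive rate vs. false positive rate
    rxAuc(roc)      # area under the ROC curve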

The metric used for assessing accuracy (performance) depends on how the original cases are processed. If each case is processed on a transaction-by-transaction basis, you can use a standard performance metric, such as the transaction-based ROC curve or AUC. For fraud detection, however, account-level metrics are typically used, based on the assumption that once a transaction is discovered to be fraudulent (for example, via customer contact), an action will be taken to block all subsequent transactions. A major difference between account-level and transaction-level metrics is that an account confirmed as a false positive (that is, fraudulent activity was predicted where it did not exist) will typically not be contacted again for a short period of time, to avoid inconveniencing the customer. The industry-standard fraud detection metrics are ADR vs. AFPR and VDR vs. AFPR for account-level performance, along with the transaction-level performance, as defined here:

  • ADR - Fraud Account Detection Rate. The percentage of detected fraud accounts among all fraud accounts.
  • VDR - Value Detection Rate. The percentage of monetary savings, assuming the current fraud transaction triggered a blocking action on subsequent transactions, over all fraud losses.
  • AFPR - Account False Positive Ratio. The ratio of detected false positive accounts over detected fraud accounts.

These plots can also be seen in the Plots pane after running modeling_main.R.

Input:

  • Tagged_Training_Processed_Features SQL table containing new features and preprocessed training set.
  • Hash_Id SQL table containing the accountID and the mapping through the hash function.
  • Various Risk SQL tables containing the risk values for each level of the variables, to be used for feature engineering on the testing set.
  • column_info list to be used on the training and testing sets to specify the variable types and, in particular, the factor levels.

Output:

  • Models SQL table containing the serialized GBT model.
  • Tagged_Testing_Processed_Features SQL table containing new features and preprocessed testing set.
  • Predict_Score SQL table containing the predictions made on the testing set.
  • Performance metrics and graphs.

Related files:

  • step4_training_evaluation.R

Production Stage


Visualize Results


The final scores for the test data reside in the table Predict_Score of the Fraud database. The test data itself is in the Tagged_Testing table. The next step of this solution is to visualize both tables in PowerBI.

Template Contents


To try this out yourself, visit the Quick Start page.
