Fraud Detection

Typical Workflow


Fraud detection is one of the earliest industrial applications of data mining and machine learning. This solution shows how to build and deploy a machine learning model for online retailers to detect fraudulent purchase transactions.

This solution package shows how to pre-process data (cleaning and feature engineering), train prediction models, and perform scoring on the cluster.

To demonstrate a typical workflow, we’ll introduce you to a few personas. You can follow along by performing the same steps for each persona.

Step 1: Server Setup and Configuration with Ivan the IT Administrator


Let me introduce you to Ivan, the IT Administrator. Ivan is responsible for implementation as well as ongoing administration of the Hadoop infrastructure at his company, which runs Hadoop on the Microsoft Azure cloud. Ivan created the HDInsight cluster with ML Server for Debra. He also uploaded the data onto the storage account associated with the cluster.

The cluster was created and the data loaded for you when you used the 'Deploy to Azure' button on the Quick Start page. Once you complete the walkthrough, you will want to delete this cluster, as it incurs expense whether or not it is in use - see HDInsight Cluster Maintenance for more details.

Step 2: Data Prep and Modeling with Debra the Data Scientist


Now let’s meet Debra, the Data Scientist. Debra’s job is to use historical transaction data to build a model that detects fraud. Debra will develop these models using HDInsight, the managed cloud Hadoop solution integrated with Microsoft ML Server.

Debra will develop her R scripts in the Open Source Edition of RStudio Server, installed on her cluster's edge node. You can follow along on your own cluster, deployed by using the 'Deploy to Azure' button on the Quick Start page. Access RStudio by using a URL of the form:
http://CLUSTERNAME.azurehdinsight.net/rstudio.

You are ready to follow along with Debra as she creates the model needed for this solution.

The steps to create and evaluate the model are described in detail on the For the Data Scientist page. The script development_main.R defines the inputs and invokes each of these steps; open and execute it to run them all. You may see some warnings regarding strptime and rxClose; these can be safely ignored.
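
For orientation, here is a minimal sketch of how such a driver script typically wires the stages together on ML Server. The step-script file names and the Spark compute context setup below are illustrative assumptions, not the solution's actual code.

```r
# Illustrative sketch only - the sourced file names are hypothetical.
library(RevoScaleR)

# Shift execution from the edge node to the Hadoop/Spark cluster.
cc <- rxSparkConnect(reset = TRUE)

source("step1_data_processing.R")      # cleaning and pre-processing
source("step2_feature_engineering.R")  # feature creation
source("step3_training_evaluation.R")  # model training and evaluation

# Return to a local compute context when finished.
rxSparkDisconnect(cc)
```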

After executing this code, you can examine the ROC curve for the Gradient Boosted Tree model in the Plots pane. This gives a transaction-level view of model performance.

The metric used for assessing accuracy (performance) depends on how the original cases are processed. If each case is processed on a transaction-by-transaction basis, you can use a standard performance metric, such as a transaction-based ROC curve or AUC.
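
If you want to reproduce a transaction-level ROC curve or AUC yourself, RevoScaleR's rxRocCurve, rxRoc, and rxAuc functions can compute them from scored output. The data set and variable names below are assumptions for illustration.

```r
# Hedged sketch: transaction-level ROC and AUC from scored data.
# `scored_xdf`, `fraudLabel`, and `predictedScore` are assumed names.
library(RevoScaleR)

# Plot the ROC curve for the predicted scores against the true labels.
rxRocCurve(actualVarName = "fraudLabel",
           predVarNames  = "predictedScore",
           data          = scored_xdf)

# Compute the area under the curve as a single summary number.
roc <- rxRoc(actualVarName = "fraudLabel",
             predVarNames  = "predictedScore",
             data          = scored_xdf)
rxAuc(roc)
```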

However, for fraud detection, typically account-level metrics are used, based on the assumption that once a transaction is discovered to be fraudulent (for example, via customer contact), an action will be taken to block all subsequent transactions.

A major difference between account-level and transaction-level metrics is that an account confirmed as a false positive (that is, fraudulent activity was predicted where none existed) typically will not be contacted again for a short period of time, to avoid inconveniencing the customer.

The industry-standard fraud detection metrics are the account-level performance curves ADR vs. AFPR and VDR vs. AFPR, defined as follows:

  • ADR – Fraud Account Detection Rate. The percentage of detected fraud accounts among all fraud accounts.
  • VDR – Value Detection Rate. The percentage of monetary savings, assuming the current fraudulent transaction triggered a blocking action on all subsequent transactions, over all fraud losses.
  • AFPR – Account False Positive Ratio. The ratio of detected false-positive accounts to detected fraud accounts.

These plots also appear in the Plots pane after running development_main.R.
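
To make these definitions concrete, here is a minimal sketch of computing ADR and AFPR from scored transactions collapsed to the account level. The data frame name, its column names, and the 0.5 threshold are assumptions for illustration; VDR would additionally weight detections by the monetary value of the blocked transactions.

```r
# Hedged sketch - `scored` is an assumed local data frame with columns
# accountID, label (1 = fraud), and score; the threshold is illustrative.
threshold <- 0.5
scored$flag <- as.numeric(scored$score > threshold)

# An account is flagged if any of its transactions exceeds the threshold,
# and is a fraud account if any of its transactions is labeled fraud.
acct <- aggregate(cbind(flag, label) ~ accountID, data = scored, FUN = max)

detected <- sum(acct$label == 1 & acct$flag == 1)  # true-positive accounts
ADR  <- detected / sum(acct$label == 1)            # fraud account detection rate
AFPR <- sum(acct$label == 0 & acct$flag == 1) / detected  # false positives per detected fraud account
```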

Step 3: Operationalize with Debra


Debra has completed her tasks. She has executed code from RStudio that pushed execution, in part, to Hadoop to create the fraud model. She has preprocessed the data, created features, and built and evaluated a model. Finally, she created a summary dashboard which she will hand off to Bernie - see below.

Now that we have evaluated the model, it is time to put it to use in predicting fraud during an online transaction. Debra now creates an analytic web service with ML Server Operationalization that incorporates these same steps: data processing, feature engineering, and scoring.

The script web_scoring_main.R creates this web service and tests it on the edge node.
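
As a rough illustration of what such a script does, here is a hedged sketch of publishing a scoring function with the mrsdeploy package. The endpoint, credentials, model object, and function names are placeholders, not the solution's actual code, and a real service would also reproduce the data processing and feature engineering steps inside the scoring function.

```r
# Illustrative sketch only - names and credentials are placeholders.
library(mrsdeploy)

# Authenticate against the ML Server operationalization endpoint.
remoteLogin("http://localhost:12800",
            username = "admin", password = "<password>", session = FALSE)

# A hypothetical scoring function wrapping a trained model `fraud_model`.
score_transactions <- function(newData) {
  rxPredict(fraud_model, data = newData)
}

# Publish the function and model as a versioned web service.
api <- publishService(
  name    = "fraud_scoring",
  code    = score_transactions,
  model   = fraud_model,
  inputs  = list(newData = "data.frame"),
  outputs = list(answer  = "data.frame"),
  v       = "v1.0.0"
)

remoteLogout()
```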

Step 4: Deploy and Visualize with Bernie the Business Analyst


Now that the predictions are created, we will meet our last persona - Bernie, the Business Analyst. Bernie will use the Power BI dashboard to examine the data and assess the model predictions on the test data.

You can access this dashboard in any of the following ways: