For the IT Administrator

This page describes the HDInsight Spark solution.

Fraud detection is one of the earliest industrial applications of data mining and machine learning. This solution shows how to build and deploy a machine learning model for online retailers to detect fraudulent purchase transactions. View more information about the data.

This solution demonstrates the code with approximately 200,000 transactions. Because it runs on an HDInsight Spark cluster, the same code extends to much larger data sets, both for training and for scoring. As the data size increases you may want to add more nodes to the cluster, but the code itself remains exactly the same.

System Requirements

This solution uses:

ML Server for HDInsight

Cluster Maintenance

HDInsight Spark cluster billing starts once a cluster is created and stops when the cluster is deleted. See these instructions for important information about deleting a cluster and re-using your files on a new cluster.

Workflow Automation

Access RStudio on the cluster edge node at a URL of the form http://CLUSTERNAME.azurehdinsight.net/rstudio. Run the script development_main.R, followed by web_scoring_main.R, to perform all the steps of the solution.
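As an illustration only, the URL pattern and script order described here can be captured in a small admin-side helper. This is a minimal Python sketch, not part of the solution itself: the names rstudio_url and SCRIPT_ORDER are hypothetical, and the whitespace check is an assumed sanity check rather than the actual HDInsight cluster naming rules.

```python
# Hypothetical helper: builds the RStudio URL for a cluster and records
# the order in which the solution's two R scripts are meant to be run.

# Script order as stated on this page.
SCRIPT_ORDER = ("development_main.R", "web_scoring_main.R")


def rstudio_url(cluster_name: str) -> str:
    """Return the RStudio URL for the given cluster name.

    The check below is only a basic sanity check (an assumption), not the
    full set of naming rules that HDInsight enforces.
    """
    name = cluster_name.strip()
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("cluster name must be non-empty and contain no whitespace")
    return f"http://{name}.azurehdinsight.net/rstudio"


if __name__ == "__main__":
    # Replace CLUSTERNAME with the name of your own cluster.
    print(rstudio_url("CLUSTERNAME"))
    for step, script in enumerate(SCRIPT_ORDER, start=1):
        print(f"{step}. {script}")
```

The scripts themselves are still run from within RStudio on the edge node; the helper only documents where to go and in what order.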

Data Files

The following data files are available in the Fraud/Data directory in the storage account associated with the cluster:

File	Description
Account_Info.csv	Customer account data
Fraud_Transactions.csv	Raw fraud transaction data
Untagged_Transactions.csv	Raw transaction data without fraud tag
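
Before running the solution against a copy of the data, it can be useful to confirm that all three files are present. The following is a minimal Python sketch under stated assumptions: it checks a local directory (for example, a downloaded copy of Fraud/Data), it does not connect to the storage account, and the function name missing_data_files is hypothetical.

```python
from pathlib import Path

# File names and descriptions taken from the table above.
EXPECTED_FILES = {
    "Account_Info.csv": "Customer account data",
    "Fraud_Transactions.csv": "Raw fraud transaction data",
    "Untagged_Transactions.csv": "Raw transaction data without fraud tag",
}


def missing_data_files(data_dir):
    """Return the sorted names of expected files not found in data_dir.

    data_dir is assumed to be a local copy of the Fraud/Data directory.
    """
    root = Path(data_dir)
    return sorted(name for name in EXPECTED_FILES if not (root / name).is_file())
```

An empty list means every file in the table was found; any names returned are the files to restore before running the scripts.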

Fraud Detection