Campaign Optimization - Predicting How and When to Contact Leads

For the Data Scientist - Develop with R


HDInsight is a cloud Spark and Hadoop service for the enterprise. HDInsight is also the only managed cloud Hadoop solution that integrates with Microsoft ML Server.

This solution shows how to pre-process data (cleaning and feature engineering), train prediction models, and perform scoring on an HDInsight Spark cluster with Microsoft ML Server.
Data scientists who are testing and developing solutions can work from the browser-based Open Source Edition of RStudio Server on the HDInsight Spark cluster edge node, using a compute context to control whether computation is performed locally on the edge node or distributed across the nodes of the HDInsight Spark cluster.
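
Switching between the two is a single call to rxSetComputeContext. Below is a minimal sketch, assuming the RevoScaleR package that ships with Microsoft ML Server; the arguments shown are illustrative:

    # Minimal sketch of switching compute contexts with RevoScaleR.
    library(RevoScaleR)

    # Develop and debug locally on the edge node:
    rxSetComputeContext("local")

    # Distribute computation across the Spark cluster:
    rxSetComputeContext(RxSpark(consoleOutput = TRUE))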

Campaign Optimization


This template is focused on marketing campaign optimization. In particular, customer-oriented businesses can learn patterns from their data to intelligently design acquisition campaigns and convert the highest possible number of customers.

Among the key variables to learn from the data are the best communication channel (e.g. SMS, Email, Call) and the day of the week and time of day at which a given potential customer should be targeted by a marketing campaign.

In this solution, an Apache Hive table will be created to show the predicted recommendations for how and when to contact each lead. This data is then visualized in Power BI.

To try this out yourself, visit the Quick Start page.

Below is a more detailed description of what happens in each of the steps: dataset creation, model development, recommendations, and deployment.

Analytical Dataset Preprocessing and Feature Engineering


This part simulates input data and performs preprocessing and feature engineering to create the analytical dataset.

One of the key functions used is rxExec. This function is similar to the foreach function in open source R, but it can parallelize computation across different nodes under the Spark compute context. In this implementation, we split the whole data set into multiple subsets by Lead_Id and apply transformation functions to each subset with rxExec. The computational time is significantly reduced, since the computation is fully parallelized across the nodes and cores of the cluster.
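
The pattern looks roughly like the following sketch; the data frame and function names here are illustrative, not the template's actual code:

    # Sketch of parallelizing per-lead transformations with rxExec.
    # 'analytical_df' and 'process_subset' are hypothetical names.
    library(RevoScaleR)

    rxSetComputeContext(RxSpark())      # run across the cluster

    process_subset <- function(subset_df) {
      # ... per-lead feature transformations go here ...
      subset_df
    }

    # One subset per Lead_Id; rxExec runs process_subset on each
    # subset in parallel, one list element per computation.
    subsets  <- split(analytical_df, analytical_df$Lead_Id)
    results  <- rxExec(process_subset, rxElemArg(subsets))
    combined <- do.call(rbind, results)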

The script campaign_main.R is used to run each of the scripts step1*-step4*, described in the sections below.

step1_data_processing.R

This script reads the 4 input data sets, merges them into a single raw table, and then performs missing value treatment on the merged table.
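
A hedged sketch of this step follows; the file names, the Lead_Id join key, and the treatment rules are assumptions for illustration, not the template's actual inputs:

    # Sketch of step 1: read, merge, and treat missing values.
    library(RevoScaleR)

    files  <- c("input1.csv", "input2.csv", "input3.csv", "input4.csv")
    tables <- lapply(files, function(f) rxImport(inData = f))

    # Merge the four data sets on the lead identifier.
    merged <- Reduce(function(x, y) merge(x, y, by = "Lead_Id"), tables)

    # Simple missing value treatment: mean for numeric columns,
    # most frequent value for categorical columns.
    replace_missing <- function(v) {
      if (!anyNA(v)) return(v)
      if (is.numeric(v)) v[is.na(v)] <- mean(v, na.rm = TRUE)
      else v[is.na(v)] <- names(which.max(table(v)))
      v
    }
    merged[] <- lapply(merged, replace_missing)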

step2_feature_engineering.R

This script performs feature engineering to generate the Analytical Dataset with the following features:

  1. SMS_Count, Call_Count, Email_Count: number of times every customer (Lead_Id) has been contacted through every Channel.

  2. Previous_Channel: the previous channel used towards every customer for every campaign activity.

Finally, only the latest campaign activity for every customer is kept.
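
In base R terms, these two features and the final filtering might look like the sketch below, applied to one Lead_Id subset at a time; the Activity_Order column is a hypothetical stand-in for whatever column orders the campaign activities:

    # Sketch of the step 2 features for one Lead_Id subset;
    # 'Activity_Order' is a hypothetical ordering column.
    engineer_subset <- function(d) {
      d <- d[order(d$Activity_Order), ]
      d$SMS_Count   <- cumsum(d$Channel == "SMS")
      d$Call_Count  <- cumsum(d$Channel == "Call")
      d$Email_Count <- cumsum(d$Channel == "Email")
      d$Previous_Channel <- c(NA, head(as.character(d$Channel), -1))
      d[nrow(d), , drop = FALSE]   # keep only the latest activity
    }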

Model Development


Two models, Random Forest and Gradient Boosted Trees, are developed to model campaign responses. The R code to develop these models is included in the step3_training_evaluation.R script.

step3_training_evaluation.R

In this step, after splitting the analytical data set into a training set and a testing set, two prediction models (Random Forest and Gradient Boosted Trees) are built on the training set. Once the models are trained, the AUC of each model is calculated using the testing set.

The R script also draws the ROC (Receiver Operating Characteristic) curve for each prediction model, showing the performance of the model in terms of true positive rate and false positive rate as the decision threshold varies.

The AUC is a number between 0 and 1 corresponding to the area under the ROC curve. It measures how well the model separates the two classes (converted vs. non-converted clients) across all possible decision thresholds on the predicted probabilities. The closer the AUC is to 1, the better the model. Because we are not looking for one optimal decision threshold, the AUC is more representative of prediction performance than accuracy (which depends on a specific threshold).

The model with the best AUC is selected as the champion model and will be used for recommendations.
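
Put together, the training and evaluation flow might look like the following hedged sketch; the formula, column names, and parameter values are assumptions, not the template's exact settings:

    # Sketch of step 3 with RevoScaleR; names are illustrative.
    library(RevoScaleR)

    model_formula <- Conversion_Flag ~ Channel + Day_Of_Week + Time_Of_Day

    forest_model <- rxDForest(model_formula, data = train_set, nTree = 40)
    btrees_model <- rxBTrees(model_formula, data = train_set, nTree = 40)

    # Score the testing set with each model; the prediction column
    # names below are illustrative.
    scores <- rxPredict(forest_model, data = test_set, type = "prob",
                        predVarNames = c("Forest_Prob0", "Forest_Prob1"),
                        writeModelVars = TRUE,
                        extraVarsToWrite = "Conversion_Flag")
    scores <- rxPredict(btrees_model, data = scores,
                        predVarNames = "Btrees_Prob1",
                        extraVarsToWrite = c("Conversion_Flag", "Forest_Prob1"))

    # Plot the ROC curves and compare AUCs; the larger AUC wins.
    roc <- rxRoc("Conversion_Flag", c("Forest_Prob1", "Btrees_Prob1"),
                 data = scores)
    rxAuc(roc)
    rxRocCurve("Conversion_Flag", c("Forest_Prob1", "Btrees_Prob1"),
               data = scores)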

Computing Recommendations


The champion model is used to provide recommendations about how and when to contact each customer. The R code to provide the recommendations is included in the step4_campaign_recommendations.R script.

step4_campaign_recommendations.R

This script creates a full table with 63 rows for each customer, corresponding to the 63 possible combinations of day, time, and channel (7 days x 3 times x 3 channels = 63). Each combination is scored with the champion model, and for each customer the combination with the highest conversion probability becomes the recommendation for that lead. Results from this step are stored in the Recommendations Hive table.
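
A hedged sketch of building and scoring this grid follows; the day, time, and channel encodings, the data frame names, and the assumption that the champion is the Random Forest (hence type = "prob") are all illustrative:

    # Sketch of step 4: score all 63 combinations per lead.
    combos <- expand.grid(
      Day_Of_Week = 1:7,
      Time_Of_Day = c("Morning", "Afternoon", "Evening"),
      Channel     = c("SMS", "Email", "Call")
    )                                   # 7 x 3 x 3 = 63 rows

    # Cross join: replicate all 63 combinations for every lead.
    full_table <- merge(leads_df, combos, by = NULL)

    # Score with the champion model and keep, for each lead, the
    # combination with the highest predicted conversion probability.
    scored <- rxPredict(champion_model, data = full_table, type = "prob",
                        predVarNames = c("Prob0", "Prob1"),
                        extraVarsToWrite = c("Lead_Id", "Day_Of_Week",
                                             "Time_Of_Day", "Channel"))
    recommendations <- do.call(rbind, lapply(
      split(scored, scored$Lead_Id),
      function(d) d[which.max(d$Prob1), ]))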

The script campaign_scoring.R can be used to score new data - it goes through step1_data_processing, step2_feature_engineering, and step4_campaign_recommendations to create recommendations for the new data.

Deploy and Visualize Results


Deploy

The script campaign_deployment.R creates and tests an analytic web service. The web service can then be used from another application to score future data. The file web_scoring.R can be downloaded to invoke this web service locally on any computer with Microsoft ML Server 9.0.1 or later installed.
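
Publishing such a service might look like the following minimal sketch with the mrsdeploy package; the endpoint, credentials, and service name are placeholders, not values from the template:

    # Sketch of publishing a scoring web service with mrsdeploy.
    library(mrsdeploy)

    remoteLogin("http://localhost:12800",
                username = "admin", password = "<password>")

    # Hypothetical scoring function wrapping the champion model.
    score_campaign <- function(new_data) {
      rxPredict(champion_model, data = new_data, type = "prob")
    }

    api <- publishService(
      name    = "campaign_scoring_service",
      code    = score_campaign,
      model   = champion_model,
      inputs  = list(new_data = "data.frame"),
      outputs = list(answer = "data.frame"),
      v       = "v1.0.0"
    )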

Visualize

The final step of this solution visualizes these recommendations in a Power BI dashboard.


Template Contents


View the contents of this solution template.

To try this out yourself, visit the Quick Start page.
