Campaign Optimization - Predicting How and When to Contact Leads

For the Data Scientist - Develop with R


SQL Server ML Services takes advantage of the power of SQL Server and RevoScaleR (Microsoft ML Server package) by allowing R to run on the same server as the database. It includes a database service that runs outside the SQL Server process and communicates securely with the R runtime.

This solution package shows how to pre-process data (cleaning and feature engineering), train prediction models, and perform scoring on the SQL Server machine.
HDInsight is a cloud Spark and Hadoop service for the enterprise. HDInsight is also the only managed cloud Hadoop solution with integration to Microsoft ML Server.

This solution shows how to pre-process data (cleaning and feature engineering), train prediction models, and perform scoring on an HDInsight Spark cluster with Microsoft ML Server.
Data scientists who are testing and developing solutions can work from the convenience of their R IDE on their client machine, while setting the computation context to SQL (see R folder for code). They can also deploy the completed solutions to SQL Server 2016 by embedding calls to R in stored procedures (see SQLR folder for code). These solutions can then be further automated with SQL Server Integration Services and SQL Server Agent: a PowerShell script (.ps1 file) automates the running of the SQL code.
Data scientists who are testing and developing solutions can work from the browser-based Open Source Edition of RStudio Server on the HDInsight Spark cluster edge node, while using a compute context to control whether computation will be performed locally on the edge node, or whether it will be distributed across the nodes in the HDInsight Spark cluster.
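The compute-context switch described above might look like the following sketch; the server name, connection string, and database name are placeholders, not part of the solution:

```r
# Sketch: switching compute contexts from an R client or edge node.
library(RevoScaleR)

# SQL Server: push computation into the database.
# (Connection string values below are illustrative placeholders.)
connStr <- "Driver=SQL Server;Server=myserver;Database=Campaign;Trusted_Connection=yes"
rxSetComputeContext(RxInSqlServer(connectionString = connStr))

# HDInsight: distribute computation across the Spark cluster instead.
# rxSetComputeContext(RxSpark())

# Return to local computation on the client / edge node.
rxSetComputeContext(RxLocalSeq())
```

Switching the compute context changes where `rx*` functions execute without changing the analysis code itself.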

Campaign Optimization


This template is focused on marketing campaign optimization. In particular, customer-oriented businesses can learn patterns from their data to intelligently design acquisition campaigns and convert the highest possible number of customers.

Among the key variables to learn from data are the best communication channel (e.g. SMS, Email, Call), the day of the week, and the time of day at which a given potential customer should be targeted by a marketing campaign.

In this solution, the final scored database table in SQL Server gives the recommendations for how and when to contact each lead. This data is then visualized in PowerBI.

In this solution, an Apache Hive table will be created to show predicted recommendations for how and when to contact each lead. This data is then visualized in PowerBI.

To try this out yourself, visit the Quick Start page.

Below is a more detailed description of what happens in each of the steps: dataset creation, model development, recommendations, and deployment.

Analytical Dataset Preprocessing and Feature Engineering


This part simulates input data and performs preprocessing and feature engineering to create the analytical dataset.

The R code to perform these steps can be run from an R client.
One of the key functions used is rxExec. It is similar to the foreach function in open source R, but under the Spark compute context it can parallelize computation across different nodes. In this implementation, we split the whole data set into multiple subsets by Lead_Id and apply the transformation functions to each subset with rxExec. Computation time is significantly reduced because the work is fully parallelized across the nodes and cores of the cluster.
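In outline, the rxExec pattern looks like the sketch below; `leadData` and `featurizeSubset` are hypothetical stand-ins for the solution's own data set and transformation functions:

```r
# Sketch of splitting by Lead_Id and transforming each shard in parallel.
library(RevoScaleR)

# Toy data standing in for the real lead-level data set.
leadData <- data.frame(Lead_Id = 1:1000,
                       Age     = sample(20:70, 1000, replace = TRUE))

featurizeSubset <- function(subset) {
  # ... per-subset cleaning / feature engineering would go here ...
  subset
}

# Split the data into shards by Lead_Id, one rxExec task per shard.
shards <- split(leadData, leadData$Lead_Id %% 8)

# Under a Spark compute context each task runs on a different node/core;
# rxElemArg passes a different shard to each invocation of the function.
results  <- rxExec(featurizeSubset, subset = rxElemArg(shards))
combined <- do.call(rbind, results)
```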

The script campaign_main.R is used to run each of the scripts step1*-step4*, described in the sections below.

step1_data_processing.R

This script exports the 4 input data sets to SQL tables, merges them, and then performs missing value treatment on the raw table in-database.
This script reads the 4 input data sets, merges them, and then performs missing value treatment on the raw table in-database.
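In outline, the merge and missing-value treatment could look like this base-R sketch on toy versions of two of the input tables; table and column names are illustrative, not the solution's actual schema:

```r
# Toy stand-ins for two of the four input data sets.
leads  <- data.frame(Lead_Id = 1:3, Age = c(30, NA, 45))
market <- data.frame(Lead_Id = 1:3, Channel = c("SMS", NA, "Email"))

# Merge the inputs on the shared key.
merged <- merge(leads, market, by = "Lead_Id")

# Missing value treatment: mean for numeric columns,
# mode (most frequent level) for categorical columns.
merged$Age[is.na(merged$Age)] <- mean(merged$Age, na.rm = TRUE)
merged$Channel[is.na(merged$Channel)] <-
  names(which.max(table(merged$Channel)))
```

In the deployed solution the equivalent logic runs in-database (or on the cluster) rather than on in-memory data frames.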

step2_feature_engineering.R

This script performs feature engineering in-database to generate the Analytical Dataset:

  1. SMS_Count, Call_Count, Email_Count: number of times every customer (Lead_Id) has been contacted through every Channel.

  2. Previous_Channel: the previous channel used towards every customer for every campaign activity.

Finally, only the latest campaign activity for every customer is kept.
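The three steps above can be sketched in base R on a toy campaign-activity table (column names are illustrative):

```r
# Toy campaign-activity history: one row per contact attempt.
activity <- data.frame(
  Lead_Id       = c(1, 1, 1, 2, 2),
  Channel       = c("SMS", "Email", "Call", "SMS", "SMS"),
  Activity_Time = c(1, 2, 3, 1, 2)
)

# 1. Per-lead contact counts by channel (columns ~ SMS_Count, etc.).
counts <- as.data.frame.matrix(table(activity$Lead_Id, activity$Channel))

# 2. Previous channel used for each activity (NA for the first contact).
activity <- activity[order(activity$Lead_Id, activity$Activity_Time), ]
activity$Previous_Channel <- ave(activity$Channel, activity$Lead_Id,
                                 FUN = function(x) c(NA, head(x, -1)))

# 3. Keep only the latest campaign activity per lead.
latest <- activity[!duplicated(activity$Lead_Id, fromLast = TRUE), ]
```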

The corresponding SQL stored procedures can be run manually after loading the data into tables with PowerShell. They can be found in the scripts step1_data_processing.sql and step2_feature_engineering.sql.

Model Development


Two models, Random Forest and Gradient Boosted Trees, are developed to model Campaign Responses. The R code to develop these models is included in the step3_training_evaluation.R script.

step3_training_evaluation.R

In this step, after splitting the analytical data set into a training and a testing set, two prediction models are built (Random Forest and Gradient Boosted Trees) on the training set. Once the models are trained, the AUC of both models is calculated using the testing set.
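A minimal sketch of this step with RevoScaleR, assuming the analytical set has already been split into `train` and `test` data frames with a binary `Conversion_Flag` label (the formula and column names are illustrative):

```r
# Train the two candidate models and compare them on AUC.
library(RevoScaleR)

form <- Conversion_Flag ~ Age + SMS_Count + Call_Count + Email_Count

forest <- rxDForest(form, data = train, nTree = 40)
btrees <- rxBTrees(form, data = train, lossFunction = "bernoulli", nTree = 40)

# Score the test set; carry the actual label along for the ROC computation.
scores <- rxPredict(forest, data = test, type = "prob",
                    extraVarsToWrite = "Conversion_Flag")

# The predicted-probability column name depends on the factor levels
# (e.g. "1_prob" for the positive class); adjust predVarNames accordingly.
roc <- rxRoc(actualVarName = "Conversion_Flag",
             predVarNames  = "1_prob",
             data = scores)
rxAuc(roc)
```

The same scoring and `rxRoc`/`rxAuc` comparison is repeated for the boosted-trees model, and the larger area decides the champion.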

The R script draws the ROC (Receiver Operating Characteristic) curve for each prediction model. It shows the performance of the model in terms of true positive rate and false positive rate as the decision threshold varies.

The AUC is a number between 0 and 1 corresponding to the area under the ROC curve. It is a performance metric that reflects how well the model separates the two classes (converted vs. non-converted clients) over all possible decision thresholds. The closer the AUC is to 1, the better the model. Because we are not looking for a single optimal decision threshold, the AUC is more representative of the prediction performance than the accuracy (which depends on the threshold).

The model with the best AUC is selected as the champion model and will be used for recommendations.

The corresponding SQL stored procedures can be run manually. They can be found in the scripts step3a_splitting.sql, step3b_train_model.sql, and step3c_test_model.sql. Note that the SQL stored procedures use a normalized version of the analytical dataset in order to speed up the computations. The normalization step can be found in step3_normalization.sql, and should be run before steps 3a, 3b, and 3c.

Computing Recommendations


The champion model is used to provide recommendations about how and when to contact each customer. The R code to provide the recommendations is included in the step4_campaign_recommendations.R script.

step4_campaign_recommendations.R

This script creates a full table with 63 rows for each customer, corresponding to the 63 possible combinations of day, time, and channel (7 Days x 3 Times x 3 Channels = 63). Each combination is scored with the champion model and, for each customer, the combination with the highest conversion probability becomes the recommendation for that lead. Results from this step are stored in the Recommendations Hive database table.
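The scoring-grid construction can be sketched as follows; the day/time/channel levels are illustrative, and `runif` stands in for the champion model's `rxPredict` scores:

```r
# Build the 7 x 3 x 3 = 63 combinations of day, time, and channel.
combos <- expand.grid(
  Day_Of_Week = 1:7,
  Time_Of_Day = c("Morning", "Afternoon", "Evening"),
  Channel     = c("SMS", "Email", "Call")
)
stopifnot(nrow(combos) == 63)

# Cross-join with the leads: 63 candidate rows per customer.
leads <- data.frame(Lead_Id = c("L1", "L2"))
grid  <- merge(leads, combos)

# Placeholder scores; in the solution rxPredict with the champion
# model fills this column instead.
grid$Conversion_Prob <- runif(nrow(grid))

# Keep the highest-probability combination per lead as the recommendation.
best <- do.call(rbind, lapply(split(grid, grid$Lead_Id),
                              function(g) g[which.max(g$Conversion_Prob), ]))
```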

The corresponding SQL stored procedures can be run manually in step4_campaign_recommendations.sql.
The script campaign_scoring.R can be used to score new data - it goes through step1_data_processing, step2_feature_engineering, and step4_campaign_recommendations to create recommendations for the new data.

Deploy and Visualize Results


The deployed data resides in a newly created database table, showing recommendations for each lead. The final step of this solution visualizes these recommendations, and once the new campaigns have been completed we can also visualize a summary of how well the model worked.

Deploy

The script campaign_deployment.R creates and tests an analytic web service. The web service can then be used from another application to score future data. The file web_scoring.R can be downloaded to invoke this web service locally on any computer with Microsoft ML Server 9.0.1 or later installed.
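With the mrsdeploy package, publishing and consuming such a service might look like the sketch below; the endpoint, credentials, service name, and the `campaign_scoring_fn` / `newData` objects are hypothetical placeholders:

```r
# Sketch of publishing and consuming an analytic web service.
library(mrsdeploy)

# Authenticate against the ML Server operationalization endpoint
# (URL and credentials below are placeholders).
remoteLogin("http://myserver:12800",
            username = "admin", password = "***", session = FALSE)

# Publish a scoring function as a versioned web service.
api <- publishService(
  "campaign_recommendations",
  code    = campaign_scoring_fn,          # hypothetical scoring function
  inputs  = list(newData = "data.frame"),
  outputs = list(recommendations = "data.frame"),
  v       = "v1.0.0"
)

# From any client with ML Server installed, retrieve and invoke it;
# the service exposes the published function under its own name.
svc    <- getService("campaign_recommendations", "v1.0.0")
result <- svc$campaign_scoring_fn(newData)
```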

Visualize

The final step of this solution visualizes these recommendations.

You can access this dashboard in either of the following ways:

System Requirements

The following are required to run the scripts in this solution:

Template Contents


View the contents of this solution template.

To try this out yourself:
