For the Database Analyst - Operationalize with SQL
Among the key variables to learn from data are the best communication channel (e.g. SMS, Email, Call), as well as the day of the week and the time of day at which a potential customer should be targeted by a marketing campaign. This template provides a customer-oriented business with an analytics tool that helps determine the best combination of these three variables for each customer, based on financial and demographic data, among other factors.
For businesses that prefer an on-premises solution, the implementation with SQL Server ML Services is a great option that takes advantage of the power of SQL Server and RevoScaleR (Microsoft ML Server). In this template, all steps are implemented as SQL stored procedures: data preprocessing and feature engineering are implemented in pure SQL, while data cleaning and the model training, scoring, and evaluation steps are implemented with SQL stored procedures calling R (Microsoft ML Server) code.
All the steps can be executed from a SQL Server client environment (such as SQL Server Management Studio). We provide a Windows PowerShell script that invokes the SQL scripts and demonstrates the end-to-end modeling process.
System Requirements
The following are required to run the scripts in this solution:
- SQL Server 2016 or higher, with Microsoft ML Server installed and configured.
- A SQL user name and password, with the user configured properly to execute R scripts in-memory.
- A SQL database on which the user has write permission and can execute stored procedures.
- For more information about SQL Server 2017 and ML Services, please visit: https://docs.microsoft.com/en-us/sql/advanced-analytics/what-s-new-in-sql-server-machine-learning-services
Workflow Automation
Follow the PowerShell instructions to execute all the scripts described below. Click here to view the details of all tables created in this solution.
Step 0: Creating Tables
The following data are provided in the Data directory:
| File | Description |
| --- | --- |
| Campaign_Detail.csv | Raw data about each marketing campaign that occurred |
| Lead_Demography.csv | Raw demographic and financial data about each customer |
| Market_Touchdown.csv | Raw channel-day-time data for every customer of Lead_Demography, for every campaign in which the customer was targeted |
| Product.csv | Raw data about the product marketed in each campaign |
In this step, we create four tables, `Campaign_Detail`, `Lead_Demography`, `Market_Touchdown`, and `Product`, in a SQL Server database, and the data is uploaded to these tables using the bcp command in the PowerShell script. One other empty table is created and will be filled in the next steps. An illustrative table definition is sketched below.
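For illustration only, a table definition in step0_create_tables.sql has roughly the following shape; the columns shown here are hypothetical, and the actual schemas live in the script.

```sql
-- Minimal sketch of one table definition; columns are hypothetical.
-- See step0_create_tables.sql for the actual schemas.
DROP TABLE IF EXISTS [dbo].[Campaign_Detail];
GO
CREATE TABLE [dbo].[Campaign_Detail] (
    [Campaign_Id]   varchar(15),
    [Product_Id]    varchar(15),
    [Campaign_Name] varchar(50)   -- hypothetical column
);
GO
```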
Input:
- Raw data: Campaign_Detail.csv, Lead_Demography.csv, Market_Touchdown.csv, and Product.csv
Output:
- 4 tables filled with the raw data: `Campaign_Detail`, `Lead_Demography`, `Market_Touchdown`, and `Product`. In order for them to be filled with the data, step 0 of the PowerShell script should be run.
- 1 table, `CM_AD`, that will be filled in the next steps.
Related files:
- step0_create_tables.sql
Step 1: Pre-Processing and Cleaning
In this step, we first merge the 4 raw data tables that we filled in Step 0 through inner joins: `Campaign_Detail` and `Product` are merged into an intermediate table, `Campaign_Product`, through an inner join on `Product_Id`. `Lead_Demography` and `Market_Touchdown` are merged into an intermediate table, `Market_Lead`, through an inner join on `Lead_Id`. Finally, the two intermediate tables are merged into one, `CM_AD0`, through an inner join on `Campaign_Id`. The intermediate tables are then deleted from the database. This is implemented through a stored procedure, `[dbo].[merging_raw_tables]`; a minimal sketch of the joins follows.
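The sketch below illustrates the general shape of these joins; the explicit column lists of the real procedure are replaced here with a few hypothetical columns.

```sql
-- Sketch of the merging logic in [dbo].[merging_raw_tables]; columns are illustrative.
SELECT cd.[Campaign_Id], cd.[Product_Id], p.[Category]         -- hypothetical columns
INTO [dbo].[Campaign_Product]
FROM [dbo].[Campaign_Detail] AS cd
INNER JOIN [dbo].[Product] AS p
    ON cd.[Product_Id] = p.[Product_Id];

SELECT ld.[Lead_Id], ld.[Age], mt.[Campaign_Id], mt.[Channel]  -- hypothetical columns
INTO [dbo].[Market_Lead]
FROM [dbo].[Lead_Demography] AS ld
INNER JOIN [dbo].[Market_Touchdown] AS mt
    ON ld.[Lead_Id] = mt.[Lead_Id];

SELECT ml.*, cp.[Product_Id], cp.[Category]
INTO [dbo].[CM_AD0]
FROM [dbo].[Market_Lead] AS ml
INNER JOIN [dbo].[Campaign_Product] AS cp
    ON ml.[Campaign_Id] = cp.[Campaign_Id];

-- The intermediate tables are no longer needed.
DROP TABLE [dbo].[Campaign_Product];
DROP TABLE [dbo].[Market_Lead];
```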
Then, the raw data in the merged table is cleaned through the SQL stored procedure `[dbo].[fill_NA_all]`. This assumes that the ID variables (`Lead_Id`, `Product_Id`, `Campaign_Id`, `Comm_Id`) as well as dates and phone numbers do not contain blanks. For every variable that might contain missing values, we compute the mode (the most frequent value) and replace the missing values with it. This stored procedure outputs the table `CM_AD0` with the missing values filled in; the mode imputation is sketched below.
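As an illustration, filling one variable could be written as in the following sketch; the real `[dbo].[fill_NA_all]` applies the same logic to every eligible variable, and the column used here is only an example.

```sql
-- Sketch of mode imputation for a single column (illustrative).
DECLARE @mode varchar(50);

-- Find the most frequent non-missing value of the column.
SELECT TOP(1) @mode = [Age]
FROM [dbo].[CM_AD0]
WHERE [Age] IS NOT NULL AND [Age] <> ''
GROUP BY [Age]
ORDER BY COUNT(*) DESC;

-- Replace the missing values with the mode.
UPDATE [dbo].[CM_AD0]
SET [Age] = @mode
WHERE [Age] IS NULL OR [Age] = '';
```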
Input:
- 4 tables filled with the raw data: `Campaign_Detail`, `Lead_Demography`, `Market_Touchdown`, and `Product`.
Output:
- `CM_AD0` table containing the merged data set without missing values.
Related files:
- step1_data_processing.sql
Example:
```sql
exec [dbo].[merging_raw_tables]
exec [dbo].[fill_NA_all]
```
Step 2: Feature Engineering
In this step, we create a stored procedure `[dbo].[feature_engineering]` that designs new features by following these steps:
- `SMS_Count`, `Call_Count`, `Email_Count`: created by counting the number of times every customer (`Lead_Id`) has been contacted through every channel. The new data is stored in an intermediate table called `Intermediate`.
- `Previous_Channel`: the previous channel used to reach every customer for every campaign activity. It is created by adding a lag variable on `Channel`; the first record for each customer should thus be disregarded. In this SQL statement, we also order the records of every customer by putting the latest campaign activity on top. The resulting data is stored in an intermediate table called `Intermediate2`.
- Keeping the last record: only the latest campaign activity for every customer is kept. The resulting analytical data is stored in the table `CM_AD`, and the intermediate tables are dropped. A condensed sketch of these transformations is shown below.
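The following sketch condenses the three steps; `[Comm_Date]` is a hypothetical column used here to order each customer's campaign activities.

```sql
-- Sketch of the feature engineering steps; [Comm_Date] is hypothetical.

-- 1. Per-channel contact counts for every customer.
SELECT *,
       SUM(CASE WHEN [Channel] = 'SMS'   THEN 1 ELSE 0 END) OVER (PARTITION BY [Lead_Id]) AS [SMS_Count],
       SUM(CASE WHEN [Channel] = 'Call'  THEN 1 ELSE 0 END) OVER (PARTITION BY [Lead_Id]) AS [Call_Count],
       SUM(CASE WHEN [Channel] = 'Email' THEN 1 ELSE 0 END) OVER (PARTITION BY [Lead_Id]) AS [Email_Count]
INTO [dbo].[Intermediate]
FROM [dbo].[CM_AD0];

-- 2. Lag variable on Channel, with the latest activity ranked first.
SELECT *,
       LAG([Channel]) OVER (PARTITION BY [Lead_Id] ORDER BY [Comm_Date] ASC)  AS [Previous_Channel],
       ROW_NUMBER()   OVER (PARTITION BY [Lead_Id] ORDER BY [Comm_Date] DESC) AS [Row_Num]
INTO [dbo].[Intermediate2]
FROM [dbo].[Intermediate];

-- 3. Keep only the latest campaign activity per customer.
SELECT *
INTO [dbo].[CM_AD]
FROM [dbo].[Intermediate2]
WHERE [Row_Num] = 1;

DROP TABLE [dbo].[Intermediate];
DROP TABLE [dbo].[Intermediate2];
```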
Input:
- `CM_AD0` table.
Output:
- `CM_AD` table containing new features and one record per customer.
Related files:
- step2_feature_engineering.sql
Example:
```sql
exec [dbo].[feature_engineering]
```
Step 3: Normalization
In this step, we normalize the data by converting the character levels of the factor variables to numeric character levels. For example, the levels "Young", "Middle Age", and "Senior Citizen" of the variable `Age` would become "1", "2", and "3". Although the factor levels are still characters, using shorter ones increases the speed of the modeling and testing steps. A sketch of this mapping follows.
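As an illustration, the mapping for `Age` could be written with a `CASE` expression, as in this sketch (the real `[dbo].[normalization]` procedure handles every factor variable):

```sql
-- Sketch of level normalization for one factor variable (illustrative).
SELECT *,
       CASE [Age]
            WHEN 'Young'          THEN '1'
            WHEN 'Middle Age'     THEN '2'
            WHEN 'Senior Citizen' THEN '3'
       END AS [Age_N]
INTO [dbo].[CM_AD_N]
FROM [dbo].[CM_AD];
```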
Input:
- `CM_AD` table.
Output:
- `CM_AD_N` table containing normalized factor variables.
Related files:
- step3_normalization.sql
Example:
```sql
exec [dbo].[normalization]
```
Step 3a: Splitting the data set
In this step, we create a stored procedure `[dbo].[splitting]` that splits the data into a training set and a testing set. The user has to specify a splitting percentage. For example, if the splitting percentage is 70, 70% of the data will be put in the training set, while the other 30% will be assigned to the testing set. The leads that will end up in the training set are stored in the table `Train_Id`; a minimal sketch of such a split is shown below.
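One simple way to implement a percentage split in T-SQL is to draw a pseudo-random number per lead, as in this sketch (the actual procedure may use a different sampling mechanism):

```sql
-- Sketch of a percentage-based split; the real procedure may sample differently.
DECLARE @splitting_percent int = 70;

SELECT [Lead_Id]
INTO [dbo].[Train_Id]
FROM [dbo].[CM_AD_N]
WHERE ABS(CHECKSUM(NEWID())) % 100 < @splitting_percent;  -- keeps ~70% of the leads
```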
Input:
- `CM_AD_N` table.
Output:
- `Train_Id` table containing the `Lead_Id` of the customers that will end up in the training set.
Related files:
- step3a_splitting.sql
Example:
```sql
exec [dbo].[splitting] @splitting_percent = 70
```
Step 3b: Training
In this step, we create a stored procedure `[dbo].[train_model]` that trains a Random Forest (RF) or a Gradient Boosted Trees (GBT) model on the training set. The trained models are serialized and then stored in a table called `Campaign_Models`. The PowerShell script automatically calls the procedure twice in order to train both models. A sketch of the training call is shown below.
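Inside the procedure, training typically goes through `sp_execute_external_script`. The sketch below shows the general shape with a Random Forest; the formula, the target variable `Conversion_Flag`, and the columns of `Campaign_Models` are illustrative assumptions, and the real procedure builds its formula from the data set.

```sql
-- Sketch of in-database training; formula, target, and table columns are illustrative.
DECLARE @model varbinary(max);

EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
        ## Train a Random Forest with RevoScaleR (illustrative formula).
        model <- rxDForest(Conversion_Flag ~ Age + SMS_Count + Call_Count + Email_Count,
                           data = InputDataSet)
        ## Serialize the model so it can be stored in a SQL table.
        model_raw <- as.raw(serialize(model, NULL))
    ',
    @input_data_1 = N'
        SELECT cm.*
        FROM [dbo].[CM_AD_N] AS cm
        INNER JOIN [dbo].[Train_Id] AS t ON cm.[Lead_Id] = t.[Lead_Id]
    ',
    @params = N'@model_raw varbinary(max) OUTPUT',
    @model_raw = @model OUTPUT;

INSERT INTO [dbo].[Campaign_Models] ([model_name], [model])
VALUES ('RF', @model);
```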
Input:
- `CM_AD_N` and `Train_Id` tables.
Output:
- `Campaign_Models` table containing the RF and GBT trained models.
Related files:
- step3b_train_model.sql
Example:
```sql
exec [dbo].[train_model] @modelName = 'RF'
exec [dbo].[train_model] @modelName = 'GBT'
```
Step 3c: Predicting (Scoring)
In this step, we create a stored procedure `[dbo].[test_evaluate_models]` that scores the two trained models on the testing set and then computes performance metrics (AUC and accuracy, among others). The performance metrics are written to `Metrics_Table`, and the name of the best model based on AUC is written to the table `Best_Model`.
Before proceeding to the next step, the user might want to inspect `Metrics_Table` in order to select the model to be used for campaign recommendations themselves. Since the AUC does not depend on the chosen decision threshold for the models, it is wiser to use it for model comparison. The PowerShell script uses the best model based on AUC by default but gives the user the option of choosing the other model. Selecting the best model amounts to a query like the sketch below.
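For example, once `Metrics_Table` is populated, the best model can be picked with a query like this sketch (the column names are illustrative):

```sql
-- Sketch: store the name of the model with the highest AUC (illustrative columns).
SELECT TOP(1) [model_name]
INTO [dbo].[Best_Model]
FROM [dbo].[Metrics_Table]
ORDER BY [AUC] DESC;
```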
Input:
- `CM_AD_N` and `Train_Id` tables.
Output:
- `Metrics_Table` table containing the performance metrics of the two models.
- `Best_Model` table containing the name of the model that gave the highest AUC on the testing set.
Related files:
- step3c_test_evaluate_models.sql
Example:
```sql
exec [dbo].[test_evaluate_models]
```
Step 4: Channel-day-time Recommendations
In this step, we create two stored procedures that use the selected prediction model to provide channel-day-time recommendations for every customer.
The first one, `[dbo].[scoring]`, scores the full data set through ML Services, using the selected model stored in the table `Campaign_Models`. The results are written to a table called `Prob_Id`, containing part of the full table in addition to the scored probability. The full data set is created on the fly through a SQL query: every row (every customer `Lead_Id`) of `CM_AD_N` is replicated 63 times, each copy corresponding to a different combination of `Channel`, `Time_Of_Day`, and `Day_Of_Week`.
The second one, `[dbo].[campaign_recommendation]`, is the only one called by the PowerShell script. It first executes `[dbo].[scoring]`, and then selects from `Prob_Id` the rows corresponding to the channel-day-time combination with the highest predicted conversion probability. If, for a given customer, two or more combinations give the highest predicted probability, one is selected randomly. The result is stored in an intermediate table called `Recommended_Combinations`. Finally, an inner join adds the demographic and financial variables back to the previous table, and the result is stored in `Recommendations`. Denormalization is performed through this query for the variables of interest. This table will be used for visualization in PowerBI. A sketch of the replication and selection logic follows.
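The sketch below illustrates both the on-the-fly replication and the selection of the best combination per customer. The distinct-value subqueries and column names are illustrative; 63 combinations would correspond, for instance, to 3 channels × 7 days × 3 times of day.

```sql
-- Sketch: replicate each lead over all channel-day-time combinations (illustrative).
SELECT cm.*,
       c.[Channel]     AS [Channel_R],
       d.[Day_Of_Week] AS [Day_R],
       t.[Time_Of_Day] AS [Time_R]
INTO [dbo].[Full_Data]
FROM [dbo].[CM_AD_N] AS cm
CROSS JOIN (SELECT DISTINCT [Channel]     FROM [dbo].[CM_AD_N]) AS c
CROSS JOIN (SELECT DISTINCT [Day_Of_Week] FROM [dbo].[CM_AD_N]) AS d
CROSS JOIN (SELECT DISTINCT [Time_Of_Day] FROM [dbo].[CM_AD_N]) AS t;

-- After scoring fills Prob_Id, keep the highest-probability combination per lead,
-- breaking ties randomly with NEWID().
SELECT [Lead_Id], [Channel], [Day_Of_Week], [Time_Of_Day], [Probability]
INTO [dbo].[Recommended_Combinations]
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY [Lead_Id]
                              ORDER BY [Probability] DESC, NEWID()) AS [rn]
    FROM [dbo].[Prob_Id]
) AS ranked
WHERE [rn] = 1;
```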
Input:
- `Campaign_Models` table containing the trained models.
- `CM_AD_N` table containing the analytical data set.
Output:
- `Recommendations` table containing demographic and financial data as well as the recommended channel, day, and time to target every customer.
Related files:
- step4_campaign_recommendations.sql
Example:
```sql
exec [dbo].[campaign_recommendation] @bestModel = 'RF'
```