Template Contents
The following is the directory structure for this template:
- Data: A copy of the simulated input data, with 100K unique customers.
- R: The R code to simulate the input datasets, pre-process them, create the analytical datasets, train the models, identify the champion model and provide recommendations.
- Resources: Other resources for the solution package.
- SQLR: T-SQL code to pre-process the datasets, train the models, identify the champion model and provide recommendations. It also contains a PowerShell script to automate the entire process, including loading the data into the database (which is not covered by the T-SQL code).
- RSparkCluster: The R code to pre-process the datasets, train the models, identify the champion model and provide recommendations on a Spark cluster.
This template with SQL Server ML Services provides two versions of the SQL implementation, plus a version for HDInsight:
- Model Development in R IDE. Run the R code in an R IDE (e.g., RStudio, R Tools for Visual Studio).
- Operationalize in SQL. Run the SQL code in SQL Server using the SQLR scripts, either from SSMS or from the PowerShell script.
- HDInsight Solution on Spark Cluster. Run the R code in RStudio on the edge node of the Spark cluster.
Copy of Input Datasets
| File | Description |
|---|---|
| Campaign_Detail.csv | Raw data about each marketing campaign that occurred |
| Lead_Demography.csv | Raw demographics and financial data about each customer |
| Market_Touchdown.csv | Raw channel, day, and time data for each customer in Lead_Demography, for every campaign in which they were targeted |
| Product.csv | Raw data about the product marketed in each campaign |
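Step 1 of each implementation inner-joins these four files into a single dataset. The sketch below is illustrative only: the join keys (`Lead_Id`, `Campaign_Id`, `Product_Id`) are assumptions; see Campaign_Data_Dictionnary.xlsx for the actual schema.

```r
# Illustrative only -- join keys are assumptions, not the template's exact schema
campaign  <- read.csv("Data/Campaign_Detail.csv")
lead      <- read.csv("Data/Lead_Demography.csv")
touchdown <- read.csv("Data/Market_Touchdown.csv")
product   <- read.csv("Data/Product.csv")

# Inner joins, mirroring the pre-processing described for step1_data_processing
joined <- merge(touchdown, lead,     by = "Lead_Id")
joined <- merge(joined,    campaign, by = "Campaign_Id")
joined <- merge(joined,    product,  by = "Product_Id")
```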
Model Development in R
These files are in the R directory for the SQL solution.
| File | Description |
|---|---|
| Campaign Optimization R Notebook.ipynb | Contains the Jupyter Notebook file that runs all the .R scripts |
| sql_connection.R | Contains details of connection to SQL Server used in all other scripts |
| step0_data_generation.R | Simulates the 4 input datasets, not needed unless you wish to regenerate data |
| step1_data_processing.R | Uploads .csv files to SQL and performs data preprocessing steps such as inner joins and missing value treatment |
| step2_feature_engineering.R | Performs Feature Engineering and creates the Analytical Dataset |
| step3_training_evaluation.R | Builds the Random Forest & Gradient Boosting models, identifies the champion model |
| step4_campaign_recommendations.R | Builds final recommendations by scoring 63 combinations per lead and selecting the combination with the highest conversion probability |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
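Because sql_connection.R centralizes the connection details reused by all the other steps, the pattern looks roughly like the following sketch with RevoScaleR. The server, database, and credentials shown are placeholders, not the template's actual values.

```r
library(RevoScaleR)

# Placeholders -- substitute your own server, database, and credentials
connection_string <- paste0("Driver=SQL Server;Server=localhost;",
                            "Database=Campaign;Uid=rdemo;Pwd=<password>")

# Run subsequent rx* functions inside SQL Server rather than locally
sql_cc <- RxInSqlServer(connectionString = connection_string)
rxSetComputeContext(sql_cc)
```

With the compute context set, the training and scoring steps execute in-database without further connection handling in each script.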
Operationalize in SQL Server 2017
These files are in the SQLR directory.
| File | Description |
|---|---|
| step0_create_tables.sql | SQL Script to create empty tables in SQL. PowerShell script should be used to load the input data |
| step1_data_processing.sql | Replaces Missing values in dataset with the modes |
| step2_feature_engineering.sql | Performs Feature Engineering and creates the Analytical Dataset |
| step3a_splitting.sql | Splits the analytical dataset into Train and Test |
| step3b_train_model.sql | Trains either RF or GBT model, depending on input parameter |
| step3c_test_evaluate_models.sql | Tests both RF and GBT models |
| step4_campaign_recommendations.sql | Scores data with best model and outputs recommendations |
| execute_yourself.sql | Executes every stored procedure after running all the other .sql files |
| Campaign_Optimization.ps1 | Loads the input data into SQL Server and automates the running of all .sql files |
| Readme.md | Describes the stored procedures in more detail |
- See For the Database Analyst for more information about these files.
- Follow the PowerShell Instructions to execute the PowerShell script which automates the running of all these .sql files.
HDInsight Solution on Spark Cluster
These files are in the RSparkCluster directory.
| File | Description |
|---|---|
| Copy_Dev2Prod.R | Copies a model from the dev folder to the prod folder for production use |
| Create_LeadDemo_MarketTouch.R | Generates the Lead_Demography and Market_Touchdown tables; called from step0. Not needed unless you wish to regenerate data |
| campaign_deployment.R | Publishes the scoring function as an analytic web service |
| campaign_main.R | Runs all steps of the solution |
| campaign_scoring.R | Scores new data using a model developed from campaign_main.R |
| campaign_web_scoring.R | Uses the scoring function created by campaign_deployment.R |
| step0_data_generation.R | Simulates the 4 input datasets, not needed unless you wish to regenerate data |
| step1_data_processing.R | Uploads .csv files to SQL and performs data preprocessing steps such as inner joins and missing value treatment |
| step2_feature_engineering.R | Performs Feature Engineering and creates the Analytical Dataset |
| step3_training_evaluation.R | Builds the Random Forest & Gradient Boosting models, identifies the champion model |
| step4_campaign_recommendations.R | Builds final recommendations by scoring 63 combinations per lead and selecting the combination with the highest conversion probability |
| step5_create_hive_table.R | Stores recommendations in a Hive table for use in PowerBI |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
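The 63 combinations scored in step4 correspond to the cross-product of the contact variables (for example, 3 channels × 7 days × 3 times of day = 63; the actual levels are defined in the input data, not here). The selection logic can be sketched as follows, where `score_model` is a hypothetical stand-in for scoring with the champion model.

```r
# Hypothetical levels -- the real ones come from the input data
combos <- expand.grid(Channel     = c("Email", "Cold Call", "SMS"),
                      Day_Of_Week = 1:7,
                      Time_Of_Day = c("Morning", "Afternoon", "Evening"))
nrow(combos)  # 63

recommend <- function(lead_row, model) {
  # Score every channel/day/time combination for this lead ...
  scored <- cbind(lead_row, combos)
  scored$prob <- score_model(model, scored)  # stand-in for predicting with the champion model
  # ... and keep the combination with the highest conversion probability
  scored[which.max(scored$prob), ]
}
```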
Resources for the Solution Package
| File | Description |
|---|---|
| createuser.sql | Used during initial SQL Server setup to create the user and password and grant permissions |
| Campaign_Data_Dictionnary.xlsx | Schema and description of the 4 input tables and variables |
| Images | Directory of images used for the Readme.md in this package |