Template Contents
The following is the directory structure for this template:
- Data This contains the copy of the simulated input data with 100K unique customers.
- R This contains the R code to simulate the input datasets, pre-process them, create the analytical datasets, train the models, identify the champion model and provide recommendations.
- Resources This directory contains other resources for the solution package.
- SQLR This contains T-SQL code to pre-process the datasets, train the models, identify the champion model and provide recommendations. It also contains a PowerShell script to automate the entire process, including loading the data into the database (not included in the T-SQL code).
- RSparkCluster This contains the R code to pre-process the datasets, train the models, identify the champion model and provide recommendations on a Spark cluster.
In this template with SQL Server ML Services, two versions of the SQL implementation and another version for HDInsight implementation:
- Model Development in R IDE . Run the R code in R IDE (e.g., RStudio, R Tools for Visual Studio).
- Operationalize in SQL. Run the SQL code in SQL Server using SQLR scripts from SSMS or from the PowerShell script.
- HDInsight Solution on Spark Cluster. Run this R code in RStudio on the edge node of the Spark cluster.
Copy of Input Datasets
File | Description |
---|---|
Campaign_Detail.csv | Raw data about each marketing campaign that occurred |
Lead_Demography.csv | Raw demographics and financial data about each customer |
Market_Touchdown.csv | Raw channel-day-time data used for every customer of Lead_Demography in every campaign he was targeted |
Product.csv | Raw data about the product marketed in each campaign |
Model Development in R
These files in the R directory for the SQL solution.
File | Description |
---|---|
Campaign Optimization R Notebook.ipynb | Contains the Jupyter Notebook file that runs all the .R scripts |
sql_connection.R | Contains details of connection to SQL Server used in all other scripts |
step0_data_generation.R | Simulates the 4 input datasets, not needed unless you wish to regenerate data |
step1_data_processing.R | Uploads .csv files to SQL and performs data preprocessing steps such as inner joins and missing value treatment |
step2_feature_engineering.R | Performs Feature Engineering and creates the Analytical Dataset |
step3_training_evaluation.R | Builds the Random Forest & Gradient Boosting models, identifies the champion model |
step4_campaign_recommendations.R | Builds final recommendations from scoring 63 combinations per lead and selecting combo with highest conversion probability |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
Operationalize in SQL 2017
These files are in the SQLR directory.
File | Description |
---|---|
step0_create_tables.sql | SQL Script to create empty tables in SQL. PowerShell script should be used to load the input data |
step1_data_processing.sql | Replaces Missing values in dataset with the modes |
step2_feature_engineering.sql | Performs Feature Engineering and creates the Analytical Dataset |
step3a_splitting.sql | Splits the analytical dataset into Train and Test |
step3b_train_model.sql | Trains either RF or GBT model, depending on input parameter |
step3c_test_evaluate_models.sql | Tests both RF and GBT models |
step4_campaign_recommendations.sql | Scores data with best model and outputs recommendations |
execute_yourself.sql | Executes every stored procedure after running all the other .sql files |
Campaign_Optimization.ps1 | Loads the input data into the SQL server and automates the running of all .sql files |
Readme.md | Describes the stored procedures in more detail |
- See For the Database Analyst for more information about these files.
- Follow the PowerShell Instructions to execute the PowerShell script which automates the running of all these .sql files.
HDInsight Solution on Spark Cluster
These files are in the RSparkCluster directory.
File | Description |
---|---|
Copy_Dev2Prod.R | Copies a model from the dev folder to the prod folder for production use |
Create_LeadDemo_MarketTouch.R | Generates Lead_Demography and Market_touchdown tables, used from step0. Not needed unless you wish to regenerate data |
campaign_deployment.R | Publishes the scoring function as an analytic web service |
campaign_main.R | Runs all steps of the solution |
campaign_scoring.R | Scores new data using a model developed from campaign_main.R |
campaign_web_scoring.R | Uses the scoring funtion created by campaign_deployment.R. |
step0_data_generation.R | Simulates the 4 input datasets, not needed unless you wish to regenerate data |
step1_data_processing.R | Uploads .csv files to SQL and performs data preprocessing steps such as inner joins and missing value treatment |
step2_feature_engineering.R | Performs Feature Engineering and creates the Analytical Dataset |
step3_training_evaluation.R | Builds the Random Forest & Gradient Boosting models, identifies the champion model |
step4_campaign_recommendations.R | Builds final recommendations from scoring 63 combinations per lead and selecting combo with highest conversion probability |
step5_create_hive_table.R | Stores recommendations in a Hive table for use in PowerBI |
- See For the Data Scientist for more details about these files.
- See Typical Workflow for more information about executing these scripts.
Resources for the Solution Package
File | Description |
---|---|
createuser.sql | Used during initial SQL Server setup to create the user and password and grant permissions |
Campaign_Data_Dictionnary.xlsx | Schema and description of the 4 input tables and variables |
Images | Directory of images used for the Readme.md in this package |