<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# AzureML Pipeline, AutoML, AKS Deployment for Sentence Similarity

This notebook builds on the [AutoML Local Deployment ACI](automl_local_deployment_aci.ipynb) notebook and demonstrates how to use [Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning-service/) pipelines and Automated Machine Learning ([AutoML](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-automated-ml)) to streamline the creation of a machine learning workflow for predicting sentence similarity. The pipeline contains two steps:   
1. PythonScriptStep: embeds sentences using a popular sentence embedding model, Google Universal Sentence Encoder
2. AutoMLStep: demonstrates how to use Automated Machine Learning (AutoML) to automate model selection for predicting sentence similarity (regression)

After creating the pipeline, the notebook demonstrates the deployment of our sentence similarity model using Azure Kubernetes Service ([AKS](https://docs.microsoft.com/en-us/azure/aks/intro-kubernetes)).

This notebook showcases how to use the following AzureML features:  
- AzureML Pipelines (PythonScriptStep and AutoMLStep)
- Automated Machine Learning
- AmlCompute
- Datastore
- Logging

## Table of Contents
1. [Introduction](#1.-Introduction)  
    * 1.1 [What are AzureML Pipelines?](#1.1-What-are-AzureML-Pipelines?)  
    * 1.2 [What is Azure AutoML?](#1.2-What-is-Azure-AutoML?)  
    * 1.3 [Modeling Problem](#1.3-Modeling-Problem)  
2. [Data Preparation](#2.-Data-Preparation)  
3. [AzureML Setup](#3.-AzureML-Setup)  
    * 3.1 [Link to or create a `Workspace`](#3.1-Link-to-or-create-a-Workspace)  
    * 3.2 [Set up an `Experiment` and Logging](#3.2-Set-up-an-Experiment-and-Logging)  
    * 3.3 [Link `AmlCompute` compute target](#3.3-Link-AmlCompute-compute-target)  
    * 3.4 [Upload data to `Datastore`](#3.4-Upload-data-to-Datastore)  
4. [Create AzureML Pipeline](#4.-Create-AzureML-Pipeline)  
    * 4.1 [Set up run configuration file](#4.1-Set-up-run-configuration-file)  
    * 4.2 [PythonScriptStep](#4.2-PythonScriptStep)  
        * 4.2.1 [Define python script to run](#4.2.1-Define-python-script-to-run)
        * 4.2.2 [Create PipelineData object](#4.2.2-Create-PipelineData-object)
        * 4.2.3 [Create PythonScriptStep](#4.2.3-Create-PythonScriptStep)
    * 4.3 [AutoMLStep](#4.3-AutoMLStep)
        * 4.3.1 [Define get_data script to load data](#4.3.1-Define-get_data-script-to-load-data)
        * 4.3.2 [Create AutoMLConfig object](#4.3.2-Create-AutoMLConfig-object)
        * 4.3.3 [Create AutoMLStep](#4.3.3-Create-AutoMLStep)    
5. [Run Pipeline](#5.-Run-Pipeline)  
6. [Deploy Sentence Similarity Model](#6.-Deploy-Sentence-Similarity-Model)
    * 6.1 [Register/Retrieve AutoML and Google Universal Sentence Encoder Models for Deployment](#6.1-Register/Retrieve-AutoML-and-Google-Universal-Sentence-Encoder-Models-for-Deployment)  
    * 6.2 [Create Scoring Script](#6.2-Create-Scoring-Script)
    * 6.3 [Create a YAML File for the Environment](#6.3-Create-a-YAML-File-for-the-Environment)   
    * 6.4 [Image Creation](#6.4-Image-Creation) 
    * 6.5 [Provision the AKS Cluster](#6.5-Provision-the-AKS-Cluster)   
    * 6.6 [Deploy the image as a Web Service to Azure Kubernetes Service](#6.6-Deploy-the-image-as-a-Web-Service-to-Azure-Kubernetes-Service)  
    * 6.7 [Test Deployed Model](#6.7-Test-Deployed-Webservice)  
    


## 1. Introduction

### 1.1 What are AzureML Pipelines?

[AzureML Pipelines](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines) define reusable machine learning workflows that can be used as a template for your machine learning scenarios. Pipelines allow you to optimize your workflow and spend time on machine learning rather than infrastructure. A Pipeline is defined by a series of steps; the following steps are available: AdlaStep, AutoMLStep, AzureBatchStep, DataTransferStep, DatabricksStep, EstimatorStep, HyperDriveStep, ModuleStep, MpiStep, and PythonScriptStep (see [here](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/?view=azure-ml-py) for details of each step). When the pipeline is run, cached results are used for all steps that have not changed, optimizing the run time. Data sources and intermediate data can be used across multiple steps in a pipeline, saving time and resources. Below we see an example of an AzureML pipeline.

![](https://nlpbp.blob.core.windows.net/images/pipelines.png)
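The reuse of cached results for unchanged steps can be pictured with a toy cache. This is only a sketch under stated assumptions: the `StepCache` class and its fingerprinting scheme are hypothetical and are not how AzureML implements step reuse internally.

```python
import hashlib


class StepCache:
    """Toy cache keyed on a step's script text and its inputs (hypothetical helper)."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # counts how many times a step body actually ran

    def run(self, script_source, inputs, step_fn):
        # Fingerprint the step definition together with its inputs.
        payload = (script_source + repr(inputs)).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        if key not in self._results:
            # Nothing cached for this fingerprint: execute the step.
            self.executions += 1
            self._results[key] = step_fn(inputs)
        return self._results[key]


def double(values):
    return [2 * v for v in values]


cache = StepCache()
first = cache.run("double-v1", [1, 2, 3], double)   # executes the step
second = cache.run("double-v1", [1, 2, 3], double)  # unchanged: served from the cache
third = cache.run("double-v1", [4, 5, 6], double)   # new inputs: executes again
print(first, third, cache.executions)
```

Only two of the three calls execute the step body, which is the effect the `allow_reuse` option of a pipeline step aims for.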

### 1.2 What is Azure AutoML?

Automated machine learning ([AutoML](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-automated-ml)) is a capability of Microsoft's [Azure Machine Learning service](https://azure.microsoft.com/en-us/services/machine-learning-service/). The goal of AutoML is to improve the productivity of data scientists and democratize AI by allowing for the rapid development and deployment of machine learning models. To achieve this goal, AutoML automates the process of selecting an ML model and tuning it. All the user is required to provide is a dataset (suitable for a classification, regression, or time-series forecasting problem) and a metric to optimize when choosing the model and hyperparameters. The user can also set time and cost constraints for the model selection and tuning.

![](automl.png)

The AutoML model selection and tuning process can be easily tracked through the Azure portal or directly in python notebooks through the use of widgets. AutoML quickly selects a high quality machine learning model tailored for your prediction problem. In this notebook, we walk through the steps of preparing data, setting up an AutoML experiment, and evaluating the results of our best model. More information about running AutoML experiments in Python can be found [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train). 

### 1.3 Modeling Problem

The regression problem we will demonstrate is predicting sentence similarity scores on the STS Benchmark dataset. The [STS Benchmark dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#STS_benchmark_dataset_and_companion_dataset) contains a selection of English datasets that were used in Semantic Textual Similarity (STS) tasks 2012-2017. The dataset contains 8,628 sentence pairs, each with a human-annotated score representing the sentences' similarity (ranging from 0, for no meaning overlap, to 5, for meaning equivalence).

For each sentence in the sentence pair, we will use Google's pretrained Universal Sentence Encoder (details provided below) to generate a $512$-dimensional embedding. Both embeddings in the sentence pair will be concatenated and the resulting $1024$-dimensional vector will be used as features in our regression problem. Our target variable is the sentence similarity score.
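A minimal sketch of how the feature vector is assembled, assuming a stand-in encoder: `fake_embedding` below is a hypothetical placeholder that returns a deterministic pseudo-random vector, not the Universal Sentence Encoder. Only the shapes (512 per sentence, 1024 per pair) come from the text above.

```python
import random

EMBEDDING_DIM = 512  # per-sentence embedding size stated above


def fake_embedding(sentence, dim=EMBEDDING_DIM):
    """Hypothetical stand-in for the encoder: a deterministic pseudo-random vector."""
    rng = random.Random(sentence)  # seed with the sentence so results are repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def pair_features(sentence1, sentence2):
    """Concatenate the two sentence embeddings into one feature vector."""
    return fake_embedding(sentence1) + fake_embedding(sentence2)


features = pair_features("a plane is taking off.", "an air plane is taking off.")
print(len(features))  # 1024
```

The first 512 entries describe the first sentence and the last 512 describe the second, so the regression model sees both sentences side by side in a single row.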

In [1]:
# Set the environment path to find NLP
import sys

sys.path.append("../../")
import time
import logging
import csv
import os
import pandas as pd
import shutil
import numpy as np
from scipy.stats import pearsonr
from scipy.spatial import distance
from sklearn.externals import joblib
import json

# Import utils
from utils_nlp.azureml import azureml_utils
from utils_nlp.dataset import stsbenchmark
from utils_nlp.dataset.preprocess import (
    to_lowercase,
    to_spacy_tokens,
    rm_spacy_stopwords,
)
from utils_nlp.common.timer import Timer

# Google Universal Sentence Encoder loader
import tensorflow_hub as hub

# AzureML packages
import azureml as aml
from azureml.telemetry import set_diagnostics_collection

set_diagnostics_collection(send_diagnostics=True)
from azureml.core import Datastore, Experiment, Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.webservice import AksWebservice, Webservice
from azureml.core.compute import AksCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.image import ContainerImage
from azureml.core.model import Model
from azureml.train.automl import AutoMLStep, AutoMLStepRun, AutoMLConfig
from azureml.pipeline.core import Pipeline, PipelineData, TrainingOutput
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.data_reference import DataReference
from azureml.widgets import RunDetails

print("System version: {}".format(sys.version))
print("Azure ML SDK Version:", aml.core.VERSION)
print("Pandas version: {}".format(pd.__version__))



Turning diagnostics collection on. 
System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
Azure ML SDK Version: 1.0.48
Pandas version: 0.24.2


In [2]:
BASE_DATA_PATH = "../../data"

In [3]:
automl_settings = {
    "task": 'regression',  # type of task: classification, regression or forecasting
    "iteration_timeout_minutes": 15,  # How long each iteration can take before moving on
    "iterations": 50,  # Number of algorithm options to try
    "primary_metric": "spearman_correlation",  # Metric to optimize
    "preprocess": True,  # Whether dataset preprocessing should be applied
    "verbosity": logging.INFO,
    "blacklist_models": ['XGBoostRegressor'] #this model is blacklisted due to installation issues
}

config_path = (
    "./.azureml"
)  # Path to the directory containing config.json with azureml credentials

# Azure resources
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "YOUR_RESOURCE_GROUP_NAME"  
workspace_name = "YOUR_WORKSPACE_NAME"  
workspace_region = "YOUR_WORKSPACE_REGION" #Possible values eastus, eastus2 and so on.

# 2. Data Preparation

**STS Benchmark Dataset**

As described above, the STS Benchmark dataset contains 8.6K sentence pairs along with a human-annotated score for how similar the two sentences are. We will load the training, development (validation), and test sets provided by STS Benchmark and preprocess the data (lowercase the text, drop irrelevant columns, and rename the remaining columns) using the utils contained in this repo. Each dataset will ultimately have three columns: _sentence1_ and _sentence2_ which contain the text of the sentences in the sentence pair, and _score_ which contains the human-annotated similarity score of the sentence pair.

In [4]:
# Load in the raw datasets as pandas dataframes
train_raw = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split="train")
dev_raw = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split="dev")
test_raw = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split="test")

100%|██████████████████████████████████████████████████████████████████████████████████| 401/401 [00:02<00:00, 198KB/s]


Data downloaded to ../../data\raw\stsbenchmark


100%|██████████████████████████████████████████████████████████████████████████████████| 401/401 [00:02<00:00, 174KB/s]


Data downloaded to ../../data\raw\stsbenchmark


100%|██████████████████████████████████████████████████████████████████████████████████| 401/401 [00:02<00:00, 148KB/s]


Data downloaded to ../../data\raw\stsbenchmark


In [5]:
# Clean each dataset by lowercasing text, removing irrelevant columns,
# and renaming the remaining columns
train_clean = stsbenchmark.clean_sts(train_raw)
dev_clean = stsbenchmark.clean_sts(dev_raw)
test_clean = stsbenchmark.clean_sts(test_raw)

In [6]:
# Convert all text to lowercase
train = to_lowercase(train_clean)
dev = to_lowercase(dev_clean)
test = to_lowercase(test_clean)

In [7]:
print("Training set has {} sentences".format(len(train)))
print("Development set has {} sentences".format(len(dev)))
print("Testing set has {} sentences".format(len(test)))

Training set has 5749 sentences
Development set has 1500 sentences
Testing set has 1379 sentences


In [8]:
train.head()

| | score | sentence1 | sentence2 |
|---|---|---|---|
| 0 | 5.0 | a plane is taking off. | an air plane is taking off. |
| 1 | 3.8 | a man is playing a large flute. | a man is playing a flute. |
| 2 | 3.8 | a man is spreading shreded cheese on a pizza. | a man is spreading shredded cheese on an uncoo... |
| 3 | 2.6 | three men are playing chess. | two men are playing chess. |
| 4 | 4.25 | a man is playing the cello. | a man seated is playing the cello. |


In [9]:
# Save the cleaned data
if not os.path.isdir("data"):
    os.mkdir("data")

train.to_csv("data/train.csv", index=False)
test.to_csv("data/test.csv", index=False)
dev.to_csv("data/dev.csv", index=False)

# 3. AzureML Setup

Now, we set up the necessary components for running this as an AzureML experiment:
1. Create or link to an existing `Workspace`
2. Set up an `Experiment` with `logging`
3. Create or attach existing `AmlCompute`
4. Upload our data to a `Datastore`

## 3.1 Link to or create a Workspace

The following cell sets up the connection to your [Azure Machine Learning service Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace). You can choose to connect to an existing workspace or create a new one. 

**To access an existing workspace:**
1. If you have a `config.json` file, you do not need to provide the workspace information; you only need to point the `config_path` variable defined above to the directory that contains the file.
2. Otherwise, you will need to supply the following:
    * The name of your workspace
    * Your subscription id
    * The resource group name

**To create a new workspace:**

Set the following information:
* A name for your workspace
* Your subscription id
* The resource group name
* [Azure region](https://azure.microsoft.com/en-us/global-infrastructure/regions/) to create the workspace in, such as `eastus2`. 

This will automatically create a new resource group for you in the region provided if a resource group with the name given does not already exist. 
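For the `config.json` route mentioned above, the sketch below writes such a file by hand. It assumes the usual three-key layout (`subscription_id`, `resource_group`, `workspace_name`); the helper `write_workspace_config` is hypothetical, and a temporary directory stands in for the `./.azureml` folder used by `config_path`.

```python
import json
import os
import tempfile


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three workspace identifiers (hypothetical helper)."""
    os.makedirs(directory, exist_ok=True)
    config_file = os.path.join(directory, "config.json")
    with open(config_file, "w") as f:
        json.dump(
            {
                "subscription_id": subscription_id,
                "resource_group": resource_group,
                "workspace_name": workspace_name,
            },
            f,
            indent=4,
        )
    return config_file


# A temporary directory is used here so the sketch does not touch a real config.
demo_dir = os.path.join(tempfile.mkdtemp(), ".azureml")
config_file = write_workspace_config(
    demo_dir, "YOUR_SUBSCRIPTION_ID", "YOUR_RESOURCE_GROUP_NAME", "YOUR_WORKSPACE_NAME"
)

with open(config_file) as f:
    loaded = json.load(f)
print(sorted(loaded))
```

The same file can also be downloaded from the workspace page in the Azure portal, which avoids typing the identifiers by hand.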

In [10]:
ws = azureml_utils.get_or_create_workspace(
    config_path=config_path,
    subscription_id=subscription_id,
    resource_group=resource_group,
    workspace_name=workspace_name,
    workspace_region=workspace_region,
)

Performing interactive authentication. Please follow the instructions on the terminal.




Interactive authentication successfully completed.


In [11]:
print(
    "Workspace name: " + ws.name,
    "Azure region: " + ws.location,
    "Subscription id: " + ws.subscription_id,
    "Resource group: " + ws.resource_group,
    sep="\n",
)

## 3.2 Set up an Experiment and Logging

In [12]:
# Make a folder for the project
project_folder = "./automl-sentence-similarity"
os.makedirs(project_folder, exist_ok=True)

# Set up an experiment
experiment_name = "NLP-SS-googleUSE"
experiment = Experiment(ws, experiment_name)

# Add logging to our experiment
run = experiment.start_logging()

## 3.3 Link AmlCompute compute target

To use AzureML Pipelines we need to link a compute target, as pipelines cannot be run locally. The options include AmlCompute, Azure Databricks, remote VMs, etc. The table of [compute options](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#supported-compute-targets) lists all of them, with details about whether each option works with automated ML, pipelines, and GPUs. For the following example, we will use an AmlCompute target because it supports AzureML Pipelines and GPUs. 

In [13]:
# choose your cluster
cluster_name = "gpu-test"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target.")
except ComputeTargetException:
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6", max_nodes=4
    )

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current AmlCompute.
print(compute_target.get_status().serialize())

Found existing compute target.
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-07-16T01:06:19.136000+00:00', 'errors': None, 'creationTime': '2019-07-09T16:20:30.625908+00:00', 'modifiedTime': '2019-07-09T16:20:46.601973+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


## 3.4 Upload data to Datastore

This step uploads our local data to a `Datastore` so that the data is accessible from the remote compute target, and creates a `DataReference` that points to the location of the data on the Datastore. A Datastore is backed by either Azure File Storage (the default option) or Azure Blob Storage ([how to decide between these options](https://docs.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks)), and data is made accessible by mounting or copying it to the compute target. `ws.datastores` lists the datastores available in the workspace, and `ds.account_name` gets the name of the storage account behind the datastore, which can be used to find it in the Azure portal.

In [14]:
# Select a specific datastore or you can call ws.get_default_datastore()
datastore_name = "workspacefilestore"
ds = ws.datastores[datastore_name]

# Upload files in data folder to the datastore
ds.upload(
    src_dir="./data",
    target_path="stsbenchmark_data",
    overwrite=True,
    show_progress=True,
)

Uploading an estimated of 3 files
Uploading ./data\dev.csv
Uploading ./data\test.csv
Uploading ./data\train.csv
Uploaded ./data\dev.csv, 1 files out of an estimated total of 3
Uploaded ./data\test.csv, 2 files out of an estimated total of 3
Uploaded ./data\train.csv, 3 files out of an estimated total of 3
Uploaded 3 files


$AZUREML_DATAREFERENCE_6a3eb209d2a04cc6b66a68fa25213980

We also set up a `DataReference` object that points to the data we just uploaded into the stsbenchmark_data folder. DataReference objects point to data that is accessible from a datastore, and this one will be used as an input to our pipeline.

In [15]:
input_data = DataReference(
    datastore=ds,
    data_reference_name="stsbenchmark",
    path_on_datastore="stsbenchmark_data/",
    overwrite=False,
)

# 4. Create AzureML Pipeline

Now we set up our pipeline which is made of two steps:  
1. `PythonScriptStep`: takes each sentence pair from the data in the `Datastore` and concatenates the Google USE embeddings for each sentence into one vector. This step saves the embedding feature matrix back to our `Datastore` and uses a `PipelineData` object to represent this intermediate data.  
2. `AutoMLStep`: takes the intermediate data produced by the previous step and passes it to an `AutoMLConfig` which performs the automatic model selection

## 4.1 Set up run configuration file

First we set up a `RunConfiguration` object which configures the execution environment for an experiment (sets up the conda dependencies, etc.)

In [16]:
# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to AmlCompute
conda_run_config.target = compute_target

conda_run_config.environment.docker.enabled = True
conda_run_config.environment.docker.base_image = aml.core.runconfig.DEFAULT_CPU_IMAGE

# Specify our own conda dependencies for the execution environment
conda_run_config.environment.python.user_managed_dependencies = False
conda_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=[
        "azureml-sdk[automl]==1.0.48",
        "azureml-dataprep==1.1.8",
        "azureml-train-automl==1.0.48",
    ],
    conda_packages=[
        "numpy",
        "py-xgboost<=0.80",
        "pandas",
        "tensorflow",
        "tensorflow-hub",
        "scikit-learn",
    ],
    pin_sdk_version=False,
)

print("run config is ready")

run config is ready


## 4.2 PythonScriptStep

`PythonScriptStep` is a step that runs a user-defined Python script (see the [documentation](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep?view=azure-ml-py)). In this `PythonScriptStep`, we will convert our sentences into a numerical representation in order to use them in our machine learning model. We will embed both sentences using the Google Universal Sentence Encoder (provided by tensorflow-hub) and concatenate their representations into a $1024$-dimensional vector to use as features for AutoML.

**Google Universal Sentence Encoder:**
We'll use a popular sentence encoder called Google Universal Sentence Encoder (see [original paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf)). Google provides two pretrained models based on different design goals: a Transformer model (targets high accuracy at the cost of greater model complexity) and a Deep Averaging Network model (DAN; targets efficient inference). Both models are trained on a variety of web sources (Wikipedia, news, question-answer pages, and discussion forums) and produce 512-dimensional embeddings. This notebook uses the Transformer-based encoding model, which can be downloaded [here](https://tfhub.dev/google/universal-sentence-encoder-large/3), because of its better performance relative to the DAN model on the STS Benchmark dataset (see Table 2 in Google Research's [paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf)).

The Transformer model produces sentence embeddings using the "encoding sub-graph of the transformer architecture" (original architecture introduced [here](https://arxiv.org/abs/1706.03762)). "This sub-graph uses attention to compute context aware representations of words in a sentence that take into account both the ordering and identity of all the other words. The context aware word representations are converted to a fixed length sentence encoding vector by computing the element-wise sum of the representations at each word position." The input to the model is lowercase PTB-tokenized strings, and the model is designed to be useful for multiple different tasks by using multi-task learning. More details about the model can be found in the [paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf) by Google Research.

### 4.2.1 Define python script to run

Define the script (called embed.py) that the `PythonScriptStep` will execute:

In [17]:
%%writefile $project_folder/embed.py
import argparse
import os
import azureml.core
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

tf.logging.set_verbosity(tf.logging.ERROR)  # reduce logging output


def google_encoder(dataset):
    """ Function that embeds sentences using the Google Universal
    Sentence Encoder pretrained model
    
    Parameters:
    ----------
    dataset: pandas dataframe with sentences and scores
    
    Returns:
    -------
    emb1: 512-dimensional representation of sentence1
    emb2: 512-dimensional representation of sentence2
    """
    sts_input1 = tf.placeholder(tf.string, shape=(None))
    sts_input2 = tf.placeholder(tf.string, shape=(None))

    # Apply the embedding model and L2-normalize the output embeddings
    sts_encode1 = tf.nn.l2_normalize(embedding_model(sts_input1), axis=1)
    sts_encode2 = tf.nn.l2_normalize(embedding_model(sts_input2), axis=1)

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())
        emb1, emb2 = session.run(
            [sts_encode1, sts_encode2],
            feed_dict={
                sts_input1: dataset["sentence1"],
                sts_input2: dataset["sentence2"],
            },
        )
    return emb1, emb2


def feature_engineering(dataset):
    """Extracts embedding features from the dataset and returns
    features and target in a dataframe
    
    Parameters:
    ----------
    dataset: pandas dataframe with sentences and scores
    
    Returns:
    -------
    df: pandas dataframe with embedding features
    scores: list of target variables
    """
    google_USE_emb1, google_USE_emb2 = google_encoder(dataset)
    n_google = google_USE_emb1.shape[1]  # length of the embeddings
    df = np.concatenate((google_USE_emb1, google_USE_emb2), axis=1)
    names = ["USEEmb1_" + str(i) for i in range(n_google)] + [
        "USEEmb2_" + str(i) for i in range(n_google)
    ]
    df = pd.DataFrame(df, columns=names)
    return df, dataset["score"]


def write_output(df, path, name):
    """Write dataframes to correct path"""
    os.makedirs(path, exist_ok=True)
    print("%s created" % path)
    df.to_csv(path + "/" + name, index=False)


# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("--sentence_data", type=str)
parser.add_argument("--embedded_data", type=str)
args = parser.parse_args()

# Import the Universal Sentence Encoder's TF Hub module
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embedding_model = hub.Module(module_url)

# Read data
train = pd.read_csv(args.sentence_data + "/train.csv")
dev = pd.read_csv(args.sentence_data + "/dev.csv")

# Get Google USE features
training_data, training_scores = feature_engineering(train)
validation_data, validation_scores = feature_engineering(dev)

# Write out training data to Datastore
write_output(training_data, args.embedded_data, "X_train.csv")
write_output(
    pd.DataFrame(training_scores, columns=["score"]), args.embedded_data, "y_train.csv"
)

# Write out validation data to Datastore
write_output(validation_data, args.embedded_data, "X_dev.csv")
write_output(
    pd.DataFrame(validation_scores, columns=["score"]), args.embedded_data, "y_dev.csv"
)

Writing ./automl-sentence-similarity/embed.py
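The `google_encoder` function above passes each embedding through `tf.nn.l2_normalize`. As a rough illustration of what that normalization does, here is a plain-Python sketch; the helper names are hypothetical and the small epsilon guard is an assumption made to avoid dividing by zero.

```python
import math


def l2_normalize(vector, epsilon=1e-12):
    """Scale a vector to unit Euclidean length (epsilon guards against a zero norm)."""
    norm = max(math.sqrt(sum(x * x for x in vector)), epsilon)
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])

print(u)          # each component divided by the original length 5
print(dot(u, u))  # a normalized vector has (approximately) unit squared length
print(dot(u, v))  # for unit vectors the dot product equals the cosine similarity
```

Because every embedding ends up with unit length, all 1024 features fed to AutoML lie in the range from -1 to 1.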


### 4.2.2 Create PipelineData object

`PipelineData` objects represent a piece of intermediate data in a pipeline. Generally they are produced by one step (as an output) and then consumed by the next step (as an input), introducing an implicit order between steps in a pipeline. We create a PipelineData object that can represent the data produced by our first pipeline step that will be consumed by our second pipeline step.

In [18]:
embedded_data = PipelineData("embedded_data", datastore=ds)
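The implicit ordering that intermediate data introduces can be sketched without AzureML at all. In the toy example below, the `steps` dictionary and the `execution_order` helper are hypothetical; the step and data names simply mirror the ones used in this notebook.

```python
# Hypothetical description of the pipeline: step name -> data consumed and produced
steps = {
    "AutoML": {"inputs": ["embedded_data"], "outputs": ["metrics_data", "model_data"]},
    "Embed": {"inputs": ["stsbenchmark"], "outputs": ["embedded_data"]},
}


def execution_order(steps):
    """Order steps so that every step runs after the steps producing its inputs."""
    producer = {out: name for name, spec in steps.items() for out in spec["outputs"]}
    remaining = dict(steps)
    available = set()  # intermediate data produced so far
    order = []
    while remaining:
        # A step is ready when each input is external or has already been produced.
        ready = [
            name
            for name, spec in remaining.items()
            if all(i in available or i not in producer for i in spec["inputs"])
        ]
        if not ready:
            raise ValueError("cyclic dependency between steps")
        for name in sorted(ready):
            order.append(name)
            available.update(remaining.pop(name)["outputs"])
    return order


order = execution_order(steps)
print(order)  # ['Embed', 'AutoML']
```

Even though `AutoML` is listed first, it is scheduled second because it consumes `embedded_data`, which only `Embed` produces.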

### 4.2.3 Create PythonScriptStep

This step defines the `PythonScriptStep`. We give the step a name, tell the step which python script to run (embed.py) and what directory that script is located in (source_directory). 

We also link the compute target and run configuration that we made previously. Our input is the `DataReference` object (input_data) where our raw sentence data was uploaded, and our output is the `PipelineData` object (embedded_data) where the embedded data produced by this step will be stored. These are also passed in as arguments so that the script has access to the correct data paths.

In [19]:
embed_step = PythonScriptStep(
    name="Embed",
    script_name="embed.py",
    arguments=["--embedded_data", embedded_data, "--sentence_data", input_data],
    inputs=[input_data],
    outputs=[embedded_data],
    compute_target=compute_target,
    runconfig=conda_run_config,
    source_directory=project_folder,
    allow_reuse=True,
)

## 4.3 AutoMLStep

`AutoMLStep` creates an AutoML step in a pipeline (see [documentation](https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlstep?view=azure-ml-py) and [basic example](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-automated-machine-learning-step.ipynb)). When using AutoML on remote compute, rather than passing our data directly into the `AutoMLConfig` object as we did in the local example, we must define a get_data.py script with a get_data() function to pass as the data_script argument. This workflow can be used for both local and remote executions (see [details](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-auto-train-remote)). 



### 4.3.1 Define get_data script to load data

Define the get_data.py file and get_data() function that the `AutoMLStep` will execute to collect data. When AutoML is used with remote compute, the data cannot be passed directly as parameters. Rather, a get_data function must be defined to access the data (see [this resource](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-auto-train-remote) for further details). Note that we can directly access the path of the intermediate data (called embedded_data) through `os.environ['AZUREML_DATAREFERENCE_embedded_data']`. This is necessary because the AutoMLStep does not accept additional parameters like the PythonScriptStep does with `arguments`.

In [20]:
%%writefile $project_folder/get_data.py

import os
import pandas as pd

# get location of the embedded_data for future use
EMBEDDED_DATA_REF = os.environ["AZUREML_DATAREFERENCE_embedded_data"]

def get_data():
    """Function needed to load data for use on remote AutoML experiments"""
    X_train = pd.read_csv(EMBEDDED_DATA_REF + "/X_train.csv")
    y_train = pd.read_csv(EMBEDDED_DATA_REF + "/y_train.csv")
    X_dev = pd.read_csv(EMBEDDED_DATA_REF + "/X_dev.csv")
    y_dev = pd.read_csv(EMBEDDED_DATA_REF + "/y_dev.csv")
    return {"X": X_train.values, "y": y_train.values.flatten(), "X_valid": X_dev.values, "y_valid": y_dev.values.flatten()}

Writing ./automl-sentence-similarity/get_data.py
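To try the environment-variable lookup outside of a pipeline run, the variable can be set by hand. The sketch below is an assumption-laden local simulation: a temporary directory plays the role of the mounted intermediate data, the three scores are made up for illustration, and `load_scores` is a hypothetical helper built on the standard `csv` module instead of pandas.

```python
import csv
import os
import tempfile


def load_scores(csv_path):
    """Read the single 'score' column of a CSV file into a list of floats."""
    with open(csv_path, newline="") as f:
        return [float(row["score"]) for row in csv.DictReader(f)]


# Simulate the location of the intermediate data with a temporary directory.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "y_train.csv"), "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["score"])
    writer.writerows([[5.0], [3.8], [2.6]])  # illustrative values only

# In a pipeline run this variable is set for us; here we set it manually.
os.environ["AZUREML_DATAREFERENCE_embedded_data"] = data_dir

data_ref = os.environ["AZUREML_DATAREFERENCE_embedded_data"]
scores = load_scores(os.path.join(data_ref, "y_train.csv"))
print(scores)  # [5.0, 3.8, 2.6]
```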


### 4.3.2 Create AutoMLConfig object

Now, we specify the parameters for the `AutoMLConfig` class:

**task**  
AutoML supports the following base learners for the regression task: Elastic Net, LightGBM, Gradient Boosting, Decision Tree, K-nearest Neighbors, LARS Lasso, Stochastic Gradient Descent, Random Forest, Extremely Randomized Trees, XGBoost, DNN Regressor, and Linear Regression. In addition, AutoML supports two kinds of ensemble methods: voting (a weighted average of the output of multiple base learners) and stacking (training a second "metalearner" which uses the base algorithms' predictions to predict the target variable). Specific base learners can be included or excluded through parameters of the AutoMLConfig class (whitelist_models and blacklist_models), and the voting/stacking ensemble options can be specified as well (enable_voting_ensemble and enable_stack_ensemble).
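As a rough illustration of the voting idea described above, the sketch below averages the predictions of two base learners with fixed weights. The function name, the predictions, and the weights are all hypothetical; how AutoML chooses its ensemble members and weights is not shown here.

```python
def voting_ensemble(predictions, weights):
    """Weighted average of several learners' predictions, one value per example."""
    total_weight = sum(weights)
    n_examples = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total_weight
        for i in range(n_examples)
    ]


# Two hypothetical base learners scoring the same two sentence pairs
learner_a = [1.0, 2.0]
learner_b = [3.0, 4.0]

combined = voting_ensemble([learner_a, learner_b], weights=[1, 3])
print(combined)  # [2.5, 3.5]
```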

**preprocess**  
AutoML also has advanced preprocessing methods, eliminating the need for users to perform this manually. Data is automatically scaled and normalized but an additional parameter in the AutoMLConfig class enables the use of more advanced techniques including imputation, generating additional features, transformations, word embeddings, etc. (full list found [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-portal-experiments#preprocess)). Note that algorithm-specific preprocessing will be applied even if preprocess=False. 

**primary_metric**  
The regression metrics available are the following: Spearman Correlation (spearman_correlation), Normalized RMSE (normalized_root_mean_squared_error), Normalized MAE (normalized_mean_absolute_error), and R2 score (r2_score).
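This notebook optimizes `spearman_correlation`, which is the Pearson correlation computed on the ranks of the values instead of the values themselves. The following is a small from-scratch sketch of that definition with hypothetical helper names; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


rho = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(rho)
```

Because only the ordering matters, a model that ranks sentence pairs correctly scores well on this metric even if its predicted values are on a different scale from the human scores.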

**Constraints:**  
There is a cost_mode parameter to set cost prediction modes (see options [here](https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlconfig?view=azure-ml-py)). To set constraints on time, there are multiple parameters, including experiment_exit_score (a target score; the experiment exits once it is reached), experiment_timeout_minutes (maximum amount of time for all combined iterations), and iterations (total number of different algorithm and parameter combinations to try).
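How an iteration budget and an exit score bound a search can be sketched with a toy loop. This is not AutoML's actual search procedure: `run_search` is a hypothetical function, and the list of scores stands in for the primary metric that each tried model would achieve.

```python
def run_search(candidate_scores, iterations, experiment_exit_score=None):
    """Toy search loop: stop at the iteration budget or once the exit score is reached."""
    best = None
    tried = 0
    for score in candidate_scores[:iterations]:
        tried += 1
        if best is None or score > best:
            best = score
        if experiment_exit_score is not None and best >= experiment_exit_score:
            break  # target reached: no need to spend the remaining budget
    return best, tried


scores = [0.5, 0.7, 0.9, 0.95]  # hypothetical metric values, one per iteration

print(run_search(scores, iterations=50, experiment_exit_score=0.85))  # (0.9, 3)
print(run_search(scores, iterations=2))                               # (0.7, 2)
```

The first call stops early because the target score is met on the third model; the second stops because the iteration budget runs out.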

In [21]:
automl_config = AutoMLConfig(
    debug_log="automl_errors.log",
    path=project_folder,
    compute_target=compute_target,
    run_configuration=conda_run_config,
    # local path to the script that defines the get_data() function
    data_script=project_folder + "/get_data.py",
    **automl_settings  # where the main AutoML settings are defined
)

### 4.3.3 Create AutoMLStep

Finally, we create `PipelineData` objects for the metrics and model data (our outputs) and then create the `AutoMLStep`. The `AutoMLStep` requires an `AutoMLConfig` object, and we pass our intermediate data (`embedded_data`) in as the inputs.

In [22]:
# Create PipelineData objects for tracking AutoML metrics

metrics_data = PipelineData(
    name="metrics_data",
    datastore=ds,
    pipeline_output_name="metrics_output",
    training_output=TrainingOutput(type="Metrics"),
)
model_data = PipelineData(
    name="model_data",
    datastore=ds,
    pipeline_output_name="best_model_output",
    training_output=TrainingOutput(type="Model"),
)

In [23]:
automl_step = AutoMLStep(
    name="AutoML",
    automl_config=automl_config,  # the AutoMLConfig object created previously
    inputs=[
        embedded_data
    ],  # inputs is the PipelineData that was the output of the previous pipeline step
    outputs=[
        metrics_data,
        model_data,
    ],  # PipelineData objects to reference metric and model information
    allow_reuse=True,
)

# 5. Run Pipeline

Now we set up our pipeline which requires specifying our `Workspace` and the ordering of the steps that we created (steps parameter). We submit the pipeline and inspect the run details using a RunDetails widget. For remote runs, the execution of iterations is asynchronous.

In [24]:
pipeline = Pipeline(
    description="pipeline_embed_automl",  # short description of the pipeline
    workspace=ws,
    steps=[embed_step, automl_step],
)



In [25]:
pipeline_run = experiment.submit(pipeline)

Created step Embed [d14b211c][a0500165-a3a6-4963-9d8a-7b3dc5981478], (This step will run and generate new outputs)
Created step AutoML [b676f3ac][f37c2b71-e7be-486e-85cf-65c95d59d04f], (This step will run and generate new outputs)
Using data reference stsbenchmark for StepId [35ea3ed1][e3340790-c54f-4147-8dd0-bcb80a9b7b46], (Consumers of this data are eligible to reuse prior runs.)
Submitted pipeline run: 61516c51-af97-458b-a743-93fd3cbd7abf


In [26]:
# Inspect the run details using the provided widget
RunDetails(pipeline_run).show()

![](https://nlpbp.blob.core.windows.net/images/pipelineWidget.PNG)

Alternatively, block until the run has completed.

In [27]:
pipeline_run.wait_for_completion(
    show_output=True
)  # show console output while run is in progress

**Cancel the Run**

Interrupting or restarting the Jupyter kernel will not cancel the remote run, which can lead to wasted compute resources. To avoid this, we recommend explicitly canceling a run with the following code:

`pipeline_run.cancel()`
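One way to make cancellation automatic is to wrap the blocking wait so that a kernel interrupt also cancels the remote run. The sketch below is only a suggestion: `wait_or_cancel` is a hypothetical helper, and `FakeRun` is a stand-in used to demonstrate the behavior without submitting a real pipeline. With a real run you would call `wait_or_cancel(pipeline_run)`.

```python
def wait_or_cancel(run):
    """Block until `run` finishes; cancel it remotely if the user interrupts."""
    try:
        run.wait_for_completion(show_output=True)
    except KeyboardInterrupt:
        # Interrupting the kernel alone leaves the remote run active
        run.cancel()
        raise


class FakeRun:
    """Hypothetical stand-in for a pipeline run (offline demonstration only)."""

    def __init__(self):
        self.cancelled = False

    def wait_for_completion(self, show_output=False):
        raise KeyboardInterrupt  # simulate the user interrupting the kernel

    def cancel(self):
        self.cancelled = True


demo_run = FakeRun()
try:
    wait_or_cancel(demo_run)
except KeyboardInterrupt:
    pass
print(demo_run.cancelled)
```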

# 6. Deploy Sentence Similarity Model

Deploying an Azure Machine Learning model as a web service creates a REST API. You can send data to this API and receive the prediction returned by the model.
In general, you create a webservice by deploying a model as an image to a Compute Target.

Some of the Compute Targets are: 
1. Azure Container Instance
2. Azure Kubernetes Service
3. Local web service

The general workflow for deploying a model is as follows:
1. Register a model
2. Prepare to deploy
3. Deploy the model to the compute target
4. Test the deployed model (webservice)

In this notebook we walk you through the process of creating a webservice running on Azure Kubernetes Service ([AKS](https://docs.microsoft.com/en-us/azure/aks/intro-kubernetes)) by deploying the model as an image. AKS is good for high-scale production deployments. It provides fast response time and autoscaling of the deployed service. Cluster autoscaling is not supported through the Azure Machine Learning SDK.

You can find more information on deploying and serving models [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where)


## 6.1 Register/Retrieve AutoML and Google Universal Sentence Encoder Models for Deployment

Registering a model means registering one or more files that make up the model. Machine learning models are registered in your current Azure Machine Learning Workspace. The model can come from Azure Machine Learning or from another location, such as your local machine.

See other ways to register a model [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where)

Below we show how to register a new model and also how to retrieve and register an existing model.

### Register a new AutoML model
Register the best AutoML model from the pipeline results, or load a previously saved model.

In [28]:
automl_step_run = AutoMLStepRun(step_run=pipeline_run.find_step_run("AutoML")[0])

# register the fitted model
description = "Pipeline AutoML Model"
tags = {"area": "nlp", "type": "sentencesimilarity pipelines"}
model = automl_step_run.register_model(description=description, tags=tags)
automl_model_name = automl_step_run.model_id
print(
    automl_step_run.model_id
)  # Use this id to deploy the model as a web service in Azure.



Registering model 711e9373160c4a8best
711e9373160c4a8best


### Retrieve existing model from Azure
If you have already registered a best model, you can skip the registration step and retrieve the latest version of the model by providing its name.

In [29]:
automl_model_name = "711e9373160c4a8best"  # best fit model registered in the workspace
model = Model(ws, name=automl_model_name)
print("Found model with name", automl_model_name)

Found model with name 711e9373160c4a8best


### Register Google Universal Sentence Encoder Model
Register the Google Universal Sentence Encoder model if it is not already registered in your workspace.

In [30]:
# set location for where to download google tensorflow model
os.environ["TFHUB_CACHE_DIR"] = "./googleUSE"
# download model
hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
# register model
embedding_model = Model.register(
    model_path="googleUSE",
    model_name="googleUSEmodel",
    tags={"Model": "GoogleUSE"},
    description="Google Universal Sentence Embedding pretrained model",
    workspace=ws,
)
print("Registered googleUSEembeddings model")

Registering model googleUSEmodel
Registered googleUSEembeddings model


### Retrieve existing Google USE model from Azure

In [31]:
embedding_model = Model(ws, name="googleUSEmodel")
print("Found model with name googleUSEembeddings")

Found model with name googleUSEembeddings


## 6.2 Create Scoring Script

In this section we show an example of an entry script, which is called from the deployed webservice. `score.py` is our entry script. The script must contain:
1. init() - This function loads the model in a global object.
2. run() - This function is used for model prediction. The inputs and outputs of `run()` typically use JSON for serialization and deserialization.
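Before looking at the full script, the `init()`/`run()` contract can be sketched in isolation. The example below is a simplified illustration, not the notebook's `score.py`: it replaces the AutoML model and the embedding step with a hypothetical toy function (`_toy_model`) so that the JSON request and response shapes are easy to see.

```python
import json

model = None


def _toy_model(pairs):
    """Hypothetical stand-in model: 5.0 for identical sentences, else 0.0."""
    return [5.0 if first == second else 0.0 for first, second in pairs]


def init():
    """Called once when the service starts; loads the model into a global."""
    global model
    model = _toy_model


def run(rawdata):
    """Called per request: deserialize JSON, predict, serialize the response."""
    try:
        pairs = json.loads(rawdata)["data"]
        return json.dumps({"result": model(pairs)})
    except Exception as e:
        # Report failures in the response body instead of raising
        return json.dumps({"error": str(e)})


init()
request = json.dumps({"data": [["A cat.", "A cat."], ["A cat.", "A dog."]]})
response = run(request)
print(response)
```

The same request shape (`{"data": [[sentence1, sentence2], ...]}`) and response keys (`result` or `error`) are used by the real entry script.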

In [32]:
%%writefile score.py
import pickle
import json
import numpy as np
import azureml.train.automl
from sklearn.externals import joblib
from azureml.core.model import Model
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import os

tf.logging.set_verbosity(tf.logging.ERROR)  # reduce logging output


def google_encoder(dataset):
    """ Function that embeds sentences using the Google Universal
    Sentence Encoder pretrained model
    
    Parameters:
    ----------
    dataset: pandas dataframe with sentences and scores
    
    Returns:
    -------
    emb1: 512-dimensional representation of sentence1
    emb2: 512-dimensional representation of sentence2
    """
    global embedding_model, sess
    sts_input1 = tf.placeholder(tf.string, shape=(None))
    sts_input2 = tf.placeholder(tf.string, shape=(None))

    # Apply embedding model and normalize the input
    sts_encode1 = tf.nn.l2_normalize(embedding_model(sts_input1), axis=1)
    sts_encode2 = tf.nn.l2_normalize(embedding_model(sts_input2), axis=1)

    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    emb1, emb2 = sess.run(
        [sts_encode1, sts_encode2],
        feed_dict={sts_input1: dataset["sentence1"], sts_input2: dataset["sentence2"]},
    )
    return emb1, emb2


def feature_engineering(dataset):
    """Extracts embedding features from the dataset and returns
    features and target in a dataframe
    
    Parameters:
    ----------
    dataset: pandas dataframe with sentences and scores
    
    Returns:
    -------
    numpy array with the embeddings of sentence1 and sentence2
    concatenated column-wise (one row per sentence pair)
    """
    google_USE_emb1, google_USE_emb2 = google_encoder(dataset)
    return np.concatenate((google_USE_emb1, google_USE_emb2), axis=1)


def init():
    global model, googleUSE_dir_path
    model_path = Model.get_model_path(
        model_name="<<modelid>>"
    )  # this name is model.id of model that we want to deploy
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)

    # load the path for google USE embedding model
    googleUSE_dir_path = Model.get_model_path(model_name="googleUSEmodel")
    os.environ["TFHUB_CACHE_DIR"] = googleUSE_dir_path


def run(rawdata):
    global embedding_model, sess, googleUSE_dir_path, model
    sess = None  # lets the error path know whether a session was opened
    try:
        # load data and convert to dataframe
        data = json.loads(rawdata)["data"]
        data_df = pd.DataFrame(data, columns=["sentence1", "sentence2"])

        # begin a tensorflow session and load tensorhub module
        sess = tf.Session()
        embedding_model = hub.Module(
            googleUSE_dir_path + "/96e8f1d3d4d90ce86b2db128249eb8143a91db73"
        )

        # Embed sentences using Google USE model
        embedded_data = feature_engineering(data_df)
        # Predict using AutoML saved model
        result = model.predict(embedded_data)

    except Exception as e:
        result = str(e)
        # the failure may have happened before the session was created
        if sess is not None:
            sess.close()
        return json.dumps({"error": result})

    sess.close()
    return json.dumps({"result": result.tolist()})

Writing score.py


In [33]:
# Substitute the actual model id in the script file.
script_file_name = "score.py"

with open(script_file_name, "r") as cefr:
    content = cefr.read()

with open(script_file_name, "w") as cefw:
    cefw.write(content.replace("<<modelid>>", automl_model_name))

## 6.3 Create a YAML File for the Environment

To ensure the fit results are consistent with the training results, the SDK dependency versions need to be the same as the environment that trains the model. The following cells create a file, pipeline_env.yml, which specifies the dependencies from the run.

In [34]:
myenv = CondaDependencies.create(
    conda_packages=[
        "numpy",
        "scikit-learn",
        "py-xgboost<=0.80",
        "pandas",
        "tensorflow",
        "tensorflow-hub",
    ],
    pip_packages=["azureml-sdk[automl]==1.0.48.*"],
    python_version="3.6.8",
)

conda_env_file_name = "pipeline_env.yml"
myenv.save_to_file(".", conda_env_file_name)

'pipeline_env.yml'

## 6.4 Image Creation

In this step we create a container image, which is a wrapper containing the entry script, the YAML file with package dependencies, and the models. The created image is then deployed as a webservice in the next step. This step can take up to 10 minutes, or longer if the model is large.

In [35]:
# configure the image: entry script, runtime, and conda dependencies
image_config = ContainerImage.image_configuration(
    execution_script=script_file_name,
    runtime="python",
    conda_file=conda_env_file_name,
    description="Image with aml pipeline model",
    tags={"area": "nlp", "type": "sentencesimilarity pipeline"},
)

image = ContainerImage.create(
    name="pipeline-automl-image",
    # this is the model object
    models=[model, embedding_model],  # add both embedding and autoML models
    image_config=image_config,
    workspace=ws,
)

image.wait_for_creation(show_output=True)

Creating image
Running..................................................................................
Succeeded
Image creation operation finished for image pipeline-automl-image:1, operation "Succeeded"


If the above step fails, use the command below to see the logs.

In [None]:
# image.get_logs()

## 6.5 Provision the AKS Cluster

**Time estimate:** Approximately 20 minutes.

Creating or attaching an AKS cluster is a one time process for your workspace. You can reuse this cluster for multiple deployments. If you delete the cluster or the resource group that contains it, you must create a new cluster the next time you need to deploy. You can have multiple AKS clusters attached to your workspace.

**Note:** Check the Azure Portal to make sure that the AKS Cluster has been provisioned properly before moving forward with this notebook

In [36]:
# create AKS cluster

# Use the default configuration (can also provide parameters to customize)
prov_config = AksCompute.provisioning_configuration()

# Create the cluster
aks_target = ComputeTarget.create(
    workspace=ws, name="nlp-aks-cluster", provisioning_configuration=prov_config
)


## 6.6 Deploy the Image as a Web Service on Azure Kubernetes Service

In the case of deployment on AKS, in addition to the Docker image, we need to define computational resources. This is typically a cluster of CPUs or a cluster of GPUs. If we already have a Kubernetes-managed cluster in our workspace, we can use it, otherwise, we can create a new one.

In this notebook we will use the cluster created in the cell above.

In [37]:
# Set the web service configuration
aks_config = AksWebservice.deploy_configuration()

We are now ready to deploy our web service. We will deploy from the Docker image. It contains our AutoML model as well as the Google Universal Sentence Encoder model and the conda environment needed for the scoring script to work properly. The parameters to pass to the `Webservice.deploy_from_image()` command are similar to those used for deployment on Azure Container Instance ([ACI](https://azure.microsoft.com/en-us/services/container-instances/)). The only major difference is the compute target (`aks_target`), i.e. the CPU cluster we just spun up.

**Note:** This deployment takes a few minutes to complete.

In [38]:
# deploy image as web service
aks_service_name = "aks-pipelines-service"

aks_service = Webservice.deploy_from_image(
    workspace=ws,
    name=aks_service_name,
    image=image,
    deployment_config=aks_config,
    deployment_target=aks_target,
)
aks_service.wait_for_deployment(show_output=True)
print(aks_service.state)

Creating service
Running.........................
SucceededAKS service creation operation finished, operation "Succeeded"
Healthy


If the above step fails, use the command below to see the logs.

In [None]:
# aks_service.get_logs()

## 6.7 Test Deployed Webservice

Testing the deployed model means running the created webservice. <br>
The deployed model can be tested by passing a list of sentence pairs. The output is one score per pair, between 0 and 5, with 0 indicating no meaning overlap between the sentences and 5 indicating equivalent meaning.

The `run()` method expects input in JSON format and retrieves API keys behind the scenes to make sure that the call is authenticated. The service has a timeout (default of ~30 seconds), so the full test dataset cannot be sent in a single request. To overcome this, you can split the data into batches and send them to the service one at a time.
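A minimal sketch of such batching is shown below. The helper `score_in_batches` and the batch size are hypothetical, and `FakeService` is a stand-in that mimics only the request and response shapes of our scoring script so the example runs offline. With the deployed service you would pass `aks_service` instead.

```python
import json


def score_in_batches(service, pairs, batch_size=100):
    """Send `pairs` to the web service in chunks and concatenate the predictions."""
    predictions = []
    for start in range(0, len(pairs), batch_size):
        payload = json.dumps({"data": pairs[start:start + batch_size]})
        result = json.loads(service.run(input_data=payload))
        if "error" in result:
            raise RuntimeError(result["error"])
        predictions.extend(result["result"])
    return predictions


class FakeService:
    """Hypothetical stand-in for the deployed web service (offline demo only)."""

    def __init__(self):
        self.calls = 0

    def run(self, input_data):
        self.calls += 1
        pairs = json.loads(input_data)["data"]
        return json.dumps({"result": [0.0] * len(pairs)})


demo_service = FakeService()
demo_pairs = [["A hungry cat.", "A sleeping cat"]] * 250
demo_scores = score_in_batches(demo_service, demo_pairs, batch_size=100)
print(len(demo_scores), demo_service.calls)
```

The batch size should be small enough that each request completes within the service timeout; a suitable value depends on the model and cluster.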

In [39]:
sentences = [
    ["This is sentence1", "This is sentence1"],
    ["A hungry cat.", "A sleeping cat"],
    ["Its summer time ", "Winter is coming"],
]
data = {"data": sentences}
data = json.dumps(data)

In [40]:
# Set up a Timer to see how long the model takes to predict
t = Timer()

t.start()
score = aks_service.run(input_data=data)
t.stop()

print("Time elapsed: {}".format(t))

result = json.loads(score)
try:
    output = result["result"]
    print("Number of samples predicted: {}".format(len(output)))
    print(output)
except KeyError:  # the service returned an error instead of a result
    print(result["error"])

Time elapsed: 12.8143
Number of samples predicted: 3
[3.7827566108065453, 2.7329700382428097, 2.4850673912463717]


Finally, we'll calculate the Pearson Correlation on the test set.

**What is Pearson Correlation?**

Our evaluation metric is Pearson correlation ($\rho$) which is a measure of the linear correlation between two variables. The formula for calculating Pearson correlation is as follows:  

$$\rho_{X,Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

This metric takes a value in [-1,1] where -1 represents a perfect negative correlation, 1 represents a perfect positive correlation, and 0 represents no correlation. We utilize the Pearson correlation metric as this is the main metric that [SentEval](http://nlpprogress.com/english/semantic_textual_similarity.html), a widely-used toolkit for evaluating sentence representations, uses for the STS Benchmark dataset.
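The formula above can be computed directly from its definition. The sketch below is for illustration only (the cells that follow use `pearsonr`); the function name `pearson` and the sample values are hypothetical. The $1/n$ factors in the covariance and the standard deviations cancel, so they are omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance of x and y over the product of their std devs."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Perfectly linear relationships give the extreme values of the metric
positive = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
negative = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(positive, negative)
```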

In [41]:
# load test set sentences
data = pd.read_csv("data/test.csv")
train_y = data["score"].values.flatten()
train_x = data.drop("score", axis=1).values.tolist()
data = {"data": train_x[:500]}
data = json.dumps(data)

In [42]:
# Set up a Timer to see how long the model takes to predict
t = Timer()

t.start()
score = aks_service.run(input_data=data)
t.stop()

print("Time elapsed: {}".format(t))

result = json.loads(score)

try:
    output = result["result"]
    print("Number of sample predicted : {}".format(len(output)))
except KeyError:  # the service returned an error instead of a result
    print(result["error"])

Time elapsed: 17.3619
Number of sample predicted : 500


In [43]:
# get Pearson Correlation
print(pearsonr(output, train_y[:500])[0])

0.8706075673971211


## Conclusion

This notebook demonstrated how to use AzureML Pipelines and AutoML to streamline the creation of a machine learning workflow for predicting sentence similarity. After creating the pipeline, the notebook demonstrated the deployment of our sentence similarity model using AKS. The model results reported in this notebook (using Google USE embeddings) are much stronger than the results from using AutoML with its built-in embedding capabilities (as in [AutoML Local Deployment ACI](automl_local_deployment_aci.ipynb)). 