Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Training GenSen on AzureML with SNLI Dataset
**GenSen: Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning** [\[1\]](#References)

## Introduction
GenSen is a technique to learn general purpose, fixed-length representations of sentences via multi-task training. The model combines the benefits of diverse sentence representation learning objectives into a single multi-task framework. As described in the paper **Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning**, it is "the first large-scale reusable sentence representation model obtained by combining a set of training objectives with the level of diversity explored here, i.e. multi-lingual NMT, natural language inference, constituency parsing and skip-thought vectors" [\[1\]](#References). These representations are useful for transfer and low-resource learning. GenSen is trained on several data sources with multiple training objectives on over 100 million sentences.

GenSen yields state-of-the-art results on multiple sentence similarity datasets, such as MRPC, SICK-R, SICK-E, and STS. The reported results, compared with other models, are as follows [\[3\]](#References):

| Model | MRPC | SICK-R | SICK-E | STS |
| --- | --- | --- | --- | --- |
| GenSen (Subramanian et al., 2018) | 78.6/84.4 | 0.888 | 87.8 | 78.9/78.6 |
| [InferSent](https://arxiv.org/abs/1705.02364) (Conneau et al., 2017) | 76.2/83.1 | 0.884 | 86.3 | 75.8/75.5 |
| [TF-KLD](https://www.aclweb.org/anthology/D13-1090) (Ji and Eisenstein, 2013) | 80.4/85.9 | - | - | - |

This notebook serves as an introduction to an end-to-end NLP solution for sentence similarity by demonstrating how to train and tune GenSen on the AzureML platform. We show the advantages of AzureML when training large NLP models with GPU.

For more information on **AzureML**, see these resources:
* [Quickstart notebook](https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-create-workspace-with-python)
* [Hyperdrive](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters)

## Background: Sequence-to-Sequence Learning
![Sequence to sequence learning examples: machine translation (left) and constituent parsing (right)](https://nlpbp.blob.core.windows.net/images/seq2seq.png)**Sequence to sequence learning examples: machine translation (left) and constituent parsing (right)**

The GenSen model is known to be most similar to that of Luong et al. (2015) [\[4\]](#References), who train a many-to-many **sequence-to-sequence** model on a diverse set of weakly related tasks that includes machine translation, constituency parsing, image captioning, sequence autoencoding, and intra-sentence skip-thoughts. 

Sequence-to-sequence learning, or seq2seq, aims to directly model the conditional probability $p(y|x)$ of mapping an input sequence, $x_1,...,x_n$, into an output sequence, $y_1,...,y_m$. This is done using an encoder-decoder framework. As illustrated in the above figure, the *encoder* computes a representation $s$ for each input sequence, which the *decoder* uses to generate the output sequence. This decomposes the conditional probability as [\[4\]](#References):
$$
\log p(y|x)=\sum_{j=1}^{m} \log p(y_j|y_{<j}, x, s)
$$
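
As a minimal numeric illustration of this decomposition, the sketch below sums per-step log-probabilities for a toy output sequence. The per-step probabilities are made-up values standing in for decoder outputs; they are assumptions for illustration, not results from GenSen.

```python
import math


def sequence_log_prob(step_probs):
    """Sum of log p(y_j | y_<j, x, s) over all decoding steps j."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical probabilities the decoder assigns to the correct token at each step
step_probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(step_probs)

# The sum of the logs equals the log of the product of the per-step probabilities
print(math.isclose(math.exp(log_p), 0.5 * 0.25 * 0.8))
```

Working in log space turns the product of per-step probabilities into a sum, which is numerically more stable for long sequences.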

It is worth noting that the GenSen model deviates from Luong's seq2seq method in two key ways. First, Luong et al. use an attention mechanism, which prevents learning a fixed-length vector representation of a sentence, whereas GenSen learns fixed-length representations. Second, Luong et al. optimize for improvements on the same tasks on which the model is trained, whereas GenSen aims to learn reusable sentence representations that transfer to different tasks or domains. [\[1\]](#References)

## Azure ML Compute vs. Local
We ran a comparative study to help you choose between a GPU-enabled Azure VM and Azure ML compute. The table below shows the cost versus performance trade-off for each option. With distributed training on AzureML, the model reaches a lower training loss as nodes are added; with 8 nodes, the total time is similar to that of the single VM, at a higher cost.

* The total time in the table stands for the training time + setup time.
* Cost is the estimated cost of running the Azure ML Compute Job or the VM up-time.

**Please note:** These were the estimated costs of running these notebooks as of July 1. Please see the [Azure Pricing Calculator](https://azure.microsoft.com/en-us/pricing/calculator/) for the most up-to-date pricing information.

| |Azure VM|AML 1 Node|AML 2 Nodes|AML 4 Nodes|AML 8 Nodes|
|---|---|---|---|---|---|
|Training Loss|4.91|4.81|4.78|4.77|4.58|
|Total Time|1h 05m|1h 54m|1h 44m|1h 26m|1h 07m|
|Cost|\$1.12|\$2.71|\$4.68|\$7.9|\$12.1|
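
To make the trade-off concrete, the sketch below recomputes each configuration's difference from the Azure VM baseline using the values in the table. The dictionary layout and the helper name `compare_to_baseline` are illustrative choices, not part of the notebook's utilities.

```python
# Values copied from the table above (total time converted to minutes)
configs = {
    "Azure VM": {"loss": 4.91, "minutes": 65, "cost": 1.12},
    "AML 1 Node": {"loss": 4.81, "minutes": 114, "cost": 2.71},
    "AML 2 Nodes": {"loss": 4.78, "minutes": 104, "cost": 4.68},
    "AML 4 Nodes": {"loss": 4.77, "minutes": 86, "cost": 7.90},
    "AML 8 Nodes": {"loss": 4.58, "minutes": 67, "cost": 12.10},
}

baseline = configs["Azure VM"]


def compare_to_baseline(name):
    """Return (loss reduction, extra minutes, extra cost in USD) versus the Azure VM."""
    c = configs[name]
    return (
        round(baseline["loss"] - c["loss"], 2),
        c["minutes"] - baseline["minutes"],
        round(c["cost"] - baseline["cost"], 2),
    )


for name in configs:
    print(name, compare_to_baseline(name))
```

For example, the 8-node configuration lowers the training loss by 0.33 in about the same total time as the VM, at an additional cost of roughly \$11.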

# Table of Contents
0. [Global Settings](#0-Global-Settings)
1. [Data Loading and Preprocessing](#1-Data-Loading-and-Preprocessing)    
    * 1.1. [Load SNLI](#1.1-Load-SNLI)  
    * 1.2. [Tokenize](#1.2-Tokenize)  
    * 1.3. [Preprocess](#1.3-Preprocess)  
    * 1.4. [Upload to Azure Blob Storage](#1.4-Upload-to-Azure-Blob-Storage)  
2. [Train GenSen with Distributed Pytorch and Horovod on AzureML](#2-Train-GenSen-with-Distributed-Pytorch-and-Horovod-on-AzureML)  
    * 2.1 [Create or Attach a Remote Compute Target](#2.1-Create-or-Attach-a-Remote-Compute-Target)  
    * 2.2. [Prepare the Training Script](#2.2-Prepare-the-Training-Script)  
    * 2.3. [Define the Estimator and Experiment](#2.3-Define-the-Estimator-and-Experiment)  
        * 2.3.1 [Create a PyTorch Estimator](#2.3.1-Create-a-PyTorch-Estimator)
        * 2.3.2 [Create the Experiment](#2.3.2-Create-the-Experiment)
    * 2.4. [Submit the Training Job to the Compute Target](#2.4-Submit-the-Training-Job-to-the-Compute-Target)
        * 2.4.1 [Monitor the Run](#2.4.1-Monitor-the-Run)
        * 2.4.2 [Interpret the Training Results](#2.4.2-Interpret-the-Training-Results)
3. [Tune Model Hyperparameters](#3-Tune-Model-Hyperparameters)
    * 3.1 [Start a Hyperparameter Sweep](#3.1-Start-a-Hyperparameter-Sweep)
    * 3.2 [Monitor HyperDrive Runs](#3.2-Monitor-HyperDrive-Runs)
    * 3.3 [Find the Best Model](#3.3-Find-the-Best-Model)
- [References](#References)

# 0 Global Settings

In [1]:
import sys
import time
import os
import pandas as pd
import shutil
import papermill as pm
import scrapbook as sb

sys.path.append("../../")
from utils_nlp.dataset import snli, preprocess, Split
from utils_nlp.azureml import azureml_utils
from utils_nlp.models.gensen.preprocess_utils import gensen_preprocess

import azureml as aml
import azureml.train.hyperdrive as hd
from azureml.telemetry import set_diagnostics_collection
import azureml.data
from azureml.data.azure_storage_datastore import AzureFileDatastore
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core import Experiment, get_run
from azureml.core.runconfig import MpiConfiguration
from azureml.train.dnn import PyTorch
from azureml.train.estimator import Estimator
from azureml.train.hyperdrive import (
    RandomParameterSampling,
    BanditPolicy,
    HyperDriveConfig,
    uniform,
    PrimaryMetricGoal,
)
from azureml.widgets import RunDetails

print("System version: {}".format(sys.version))
print("Azure ML SDK Version:", aml.core.VERSION)
print("Pandas version: {}".format(pd.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
Azure ML SDK Version: 1.0.48
Pandas version: 0.24.2


In [2]:
# Model configuration
NROWS = None
CACHE_DIR = "./temp"
MAX_EPOCH = 2 # by default is None
ENTRY_SCRIPT = "utils_nlp/gensen/gensen_train.py"
TRAIN_SCRIPT = "gensen_train.py"
CONFIG_PATH = "gensen_config.json"
EXPERIMENT_NAME = "NLP-SS-GenSen-deepdive"
UTIL_NLP_PATH = "../../utils_nlp"
MAX_TOTAL_RUNS = 8
MAX_CONCURRENT_RUNS = 4

# Azure resources
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "YOUR_RESOURCE_GROUP_NAME"  
workspace_name = "YOUR_WORKSPACE_NAME"  
workspace_region = "YOUR_WORKSPACE_REGION" #Possible values eastus, eastus2 and so on.
AZUREML_CONFIG_PATH = "./.azureml"
AZUREML_VERBOSE = False

In this notebook we use the Azure Machine Learning Python SDK to facilitate remote training and computation. To get started, we must first initialize an AzureML [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace), a centralized resource for managing experiment runs, compute resources, datastores, and other machine learning artifacts on the cloud. Refer to the official [configuration](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) notebook for more information about setting up the workspace.

In [3]:
if os.path.exists(AZUREML_CONFIG_PATH):
    ws = azureml_utils.get_or_create_workspace(config_path=AZUREML_CONFIG_PATH)
else:
    ws = azureml_utils.get_or_create_workspace(
        subscription_id=subscription_id,
        resource_group=resource_group,
        workspace_name=workspace_name,
        workspace_region=workspace_region,
    )

if AZUREML_VERBOSE:
    print("Workspace name: {}".format(ws.name))
    print("Azure region: {}".format(ws.location))
    print("Subscription id: {}".format(ws.subscription_id))
    print("Resource group: {}".format(ws.resource_group))

Opt-in diagnostics for better experience, quality, and security of future releases.

In [4]:
set_diagnostics_collection(send_diagnostics=True)

Turning diagnostics collection on. 


# 1 Data Loading and Preprocessing

We use the [SNLI](https://nlp.stanford.edu/projects/snli/) dataset in this example.

Note: The datasets used in the original paper can be downloaded by running the bash script [here](https://github.com/Maluuba/gensen/blob/master/get_data.sh). Training on the original datasets will reproduce the results in the [paper](https://arxiv.org/abs/1804.00079), but **will take about 20 hours of training time**. For the purposes of this example we use SNLI, one of the datasets used in the original work, as the only training dataset.

## 1.1 Load SNLI

In [5]:
data_dir = os.path.join(CACHE_DIR, "data")
train = snli.load_pandas_df(data_dir, file_split=Split.TRAIN, nrows=NROWS)
dev = snli.load_pandas_df(data_dir, file_split=Split.DEV, nrows=NROWS)
test = snli.load_pandas_df(data_dir, file_split=Split.TEST, nrows=NROWS)

100%|████████████████████████████████████████████████████████████████████████████| 92.3k/92.3k [00:07<00:00, 11.6kKB/s]


In [6]:
train.head()

Unnamed: 0,gold_label,sentence1_binary_parse,sentence2_binary_parse,sentence1_parse,sentence2_parse,sentence1,sentence2,captionID,pairID,label1,label2,label3,label4,label5
0,neutral,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,( ( A person ) ( ( is ( ( training ( his horse...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,A person is training his horse for a competition.,3416050480.jpg#4,3416050480.jpg#4r1n,neutral,,,,
1,contradiction,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,( ( A person ) ( ( ( ( is ( at ( a diner ) ) )...,(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,"A person is at a diner, ordering an omelette.",3416050480.jpg#4,3416050480.jpg#4r1c,contradiction,,,,
2,entailment,( ( ( A person ) ( on ( a horse ) ) ) ( ( jump...,"( ( A person ) ( ( ( ( is outdoors ) , ) ( on ...",(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN o...,(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) ...,A person on a horse jumps over a broken down a...,"A person is outdoors, on a horse.",3416050480.jpg#4,3416050480.jpg#4r1e,entailment,,,,
3,neutral,( Children ( ( ( smiling and ) waving ) ( at c...,( They ( are ( smiling ( at ( their parents ) ...,(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,(ROOT (S (NP (PRP They)) (VP (VBP are) (VP (VB...,Children smiling and waving at camera,They are smiling at their parents,2267923837.jpg#2,2267923837.jpg#2r1n,neutral,,,,
4,entailment,( Children ( ( ( smiling and ) waving ) ( at c...,( There ( ( are children ) present ) ),(ROOT (NP (S (NP (NNP Children)) (VP (VBG smil...,(ROOT (S (NP (EX There)) (VP (VBP are) (NP (NN...,Children smiling and waving at camera,There are children present,2267923837.jpg#2,2267923837.jpg#2r1e,entailment,,,,


## 1.2 Tokenize

Here we clean the dataframes, do lowercase standardization, and tokenize the text using the [NLTK](https://www.nltk.org/) library.

In [7]:
def clean_and_tokenize(df):
    df = snli.clean_cols(df)
    df = snli.clean_rows(df)
    df = preprocess.to_lowercase(df)
    df = preprocess.to_nltk_tokens(df)
    return df

The `clean_and_tokenize` function can take a while: running the following cell takes around 5 to 10 minutes.

In [8]:
train = clean_and_tokenize(train)
dev = clean_and_tokenize(dev)
test = clean_and_tokenize(test)

In [9]:
train.head()

Unnamed: 0,score,sentence1,sentence2,sentence1_tokens,sentence2_tokens
0,neutral,a person on a horse jumps over a broken down a...,a person is training his horse for a competition.,"[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, training, his, horse, for, a, ..."
1,contradiction,a person on a horse jumps over a broken down a...,"a person is at a diner, ordering an omelette.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, at, a, diner, ,, ordering, an,..."
2,entailment,a person on a horse jumps over a broken down a...,"a person is outdoors, on a horse.","[a, person, on, a, horse, jumps, over, a, brok...","[a, person, is, outdoors, ,, on, a, horse, .]"
3,neutral,children smiling and waving at camera,they are smiling at their parents,"[children, smiling, and, waving, at, camera]","[they, are, smiling, at, their, parents]"
4,entailment,children smiling and waving at camera,there are children present,"[children, smiling, and, waving, at, camera]","[there, are, children, present]"


## 1.3 Preprocess
We format our data in a specific way so that the GenSen model can ingest it. We do this by
* Saving the tokens for each split in a `snli_1.0_{split}.txt.clean` file, with the sentence pairs and scores tab-separated and the tokens separated by a single space. Since some of the samples have invalid scores ("-"), we filter those out and save the filtered data separately in a `snli_1.0_{split}.txt.clean.noblank` file.
* Saving the tokenized sentences and labels separately, in the form `snli_1.0_{split}.txt.s1.tok` or `snli_1.0_{split}.txt.s2.tok` or `snli_1.0_{split}.txt.lab`.
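
The formatting described above can be sketched as follows. This is only an illustration of the layout (tab-separated fields, space-joined tokens, rows with a "-" score dropped); the helper names and the column order are assumptions and may differ from what `gensen_preprocess` actually writes.

```python
def to_clean_lines(rows):
    """Format (score, s1_tokens, s2_tokens) rows as tab-separated lines,
    joining the tokens of each sentence with a single space.
    The column order here is an assumption made for illustration."""
    return [
        "\t".join([" ".join(s1), " ".join(s2), score])
        for score, s1, s2 in rows
    ]


def drop_invalid(rows):
    """Remove samples whose score is the invalid placeholder "-"."""
    return [row for row in rows if row[0] != "-"]


rows = [
    ("neutral", ["children", "smiling"], ["they", "are", "smiling"]),
    ("-", ["a", "person"], ["a", "horse"]),
]

clean = to_clean_lines(rows)                  # analogous to the *.clean file
noblank = to_clean_lines(drop_invalid(rows))  # analogous to the *.clean.noblank file
print(len(clean), len(noblank))
```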

In [10]:
preprocessed_data_dir = gensen_preprocess(train, dev, test, data_dir)
print("Writing input data to {}".format(preprocessed_data_dir))

Writing input data to ./temp\data\clean/snli_1.0


## 1.4 Upload to Azure Blob Storage
We upload the data from the local machine into the datastore so that it can be accessed for remote training. The datastore is a reference that points to a storage account, e.g. the Azure Blob Storage service. It can be attached to an AzureML workspace to facilitate data management operations such as uploading/downloading data or interacting with data from remote compute targets.

**Note: If you already have the preprocessed files under `clean/snli_1.0/` in your default datastore, you DO NOT need to redo this section.**

In [11]:
ds = ws.get_default_datastore()

if AZUREML_VERBOSE:
    print("Datastore type: {}".format(ds.datastore_type))
    print("Datastore account: {}".format(ds.account_name))
    print("Datastore container: {}".format(ds.container_name))
    print("Data reference: {}".format(ds.as_mount()))

In [12]:
_ = ds.upload(
    src_dir=os.path.join(data_dir, "clean/snli_1.0"),
    overwrite=False,
    show_progress=AZUREML_VERBOSE,
)

# 2 Train GenSen with Distributed Pytorch and Horovod on AzureML
In this tutorial, we train a GenSen model with PyTorch on AML using distributed training across a GPU cluster.

After creating the workspace and setting up the development environment, training a model in Azure Machine Learning involves the following steps:
1. Creating a remote compute target
2. Preparing the training data and uploading it to datastore (Note that this was done in Section 1.4)
3. Preparing the training script
4. Creating Estimator and Experiment objects
5. Submitting the Estimator to an Experiment attached to the AzureML workspace

## 2.1 Create or Attach a Remote Compute Target
We create and attach a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training the model. Here we use the AzureML-managed compute target ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) as our remote training compute resource. Our cluster autoscales from 0 to 8 `STANDARD_NC6` GPU nodes.

Creating and configuring the AmlCompute cluster takes approximately 5 minutes the first time around. Once a cluster with the given configuration is created, it does not need to be created again.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Read more about the default limits and how to request more quota [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas).

In [13]:
cluster_name = "gensen-aml"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target {}".format(cluster_name))
except ComputeTargetException:
    print("Creating a new compute target {}...".format(cluster_name))
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6", max_nodes=8
    )
    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=AZUREML_VERBOSE)

if AZUREML_VERBOSE:
    print(compute_target.get_status().serialize())

Found existing compute target gensen-aml


## 2.2 Prepare the Training Script
The training process involves the following steps:
1. Create or load the dataset vocabulary
2. Train on the training dataset for each batch epoch (batch size = 48 updates)
3. Evaluate on the validation dataset every 10 epochs
4. Find the local minimum of the validation loss
5. Save the best model and stop the training process
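
Steps 3 to 5 amount to a form of early stopping on the validation loss. The sketch below illustrates the idea on a list of precomputed losses; the function name and the `patience` parameter are hypothetical and are not taken from `gensen_train.py`.

```python
def find_best_evaluation(val_losses, patience=1):
    """Return (best_index, best_loss), stopping once the validation loss has
    failed to improve for `patience` consecutive evaluations."""
    best_loss = float("inf")
    best_index = -1
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # A real training script would save the model checkpoint here
            best_loss, best_index = loss, i
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break  # local minimum reached: stop training
    return best_index, best_loss


# Made-up validation losses, one per evaluation
losses = [5.2, 4.9, 4.8, 4.85, 4.7]
print(find_best_evaluation(losses))  # (2, 4.8)
```

Note that stopping at the first local minimum can miss a later, lower loss (4.7 in this toy sequence), which is the usual trade-off of a small patience value.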

In this section, we define the training script and move all necessary dependencies to `project_folder`, which will eventually be submitted to the remote compute target. Note that the size of the folder cannot exceed 300 MB, so large dependencies such as pre-trained embeddings must be accessed from the datastore.

In [14]:
project_folder = os.path.join(CACHE_DIR, "gensen")
os.makedirs(project_folder, exist_ok=True)

The script for distributed GenSen training is provided at `./gensen_train.py`.

In this example, we use MLflow to log metrics. We also use the [AzureML-Mlflow](https://pypi.org/project/azureml-mlflow/) package to persist these metrics to the AzureML workspace. This is done with no change to the provided training script! Note that logging is done for loss *per minibatch*.

Copy the training script `gensen_train.py` and config file `gensen_config.json` into the project folder.

In [15]:
utils_folder = os.path.join(project_folder, "utils_nlp")

In [16]:
_ = shutil.copytree(UTIL_NLP_PATH, utils_folder)

In [17]:
_ = shutil.copy(TRAIN_SCRIPT, os.path.join(utils_folder, "gensen"))
_ = shutil.copy(CONFIG_PATH, os.path.join(utils_folder, "gensen"))

## 2.3 Define the Estimator and Experiment

### 2.3.1 Create a PyTorch Estimator
The Azure ML SDK's PyTorch Estimator allows us to submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch).

Note that `gensen_config.json` defines all the hyperparameters and paths used when training the GenSen model. The trained model will be saved in the `models` folder on Azure Blob Storage. **Remember to clean the `models` folder in order to save new models.**

In [30]:
script_params = {
    "--config": "utils_nlp/gensen/gensen_config.json",
    "--data_folder": ws.get_default_datastore().as_mount(),
}
# Only pass --max_epoch when a limit is set; otherwise the script uses its default
if MAX_EPOCH:
    script_params["--max_epoch"] = MAX_EPOCH

estimator = PyTorch(
    source_directory=project_folder,
    script_params=script_params,
    compute_target=compute_target,
    entry_script=ENTRY_SCRIPT,
    node_count=2,
    process_count_per_node=1,
    distributed_training=MpiConfiguration(),
    use_gpu=True,
    framework_version="1.1",
    conda_packages=["scikit-learn=0.20.3", "h5py", "nltk"],
    pip_packages=["azureml-mlflow>=1.0.43.1", "numpy>=1.16.0"],
)

This Estimator specifies that the training script will run on `2` nodes, with one worker per node. In order to execute a distributed run using GPU, we must set `use_gpu=True` and pass an `MpiConfiguration` object as `distributed_training` to use MPI/Horovod. PyTorch, Horovod, and other necessary dependencies are installed automatically. If the training script makes use of packages that are not already defined in `.azureml/conda_dependencies.yml`, we must explicitly tell the estimator to install them via the constructor's `pip_packages` or `conda_packages` parameters.

Note that if the estimator is being created for the first time, this step will take longer to run because the conda dependencies found under `.azureml/conda_dependencies.yml` must be installed from scratch. After the first run, it will use the existing conda environment and run the code directly. 

Training takes around **2 hours** with the default value `max_epoch=None`, which means that training stops once a local minimum of the validation loss has been found. You can also specify the number of training epochs explicitly.
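
The `script_params` dictionary is passed to the entry script as command-line arguments. The sketch below shows how a training script might parse them with `argparse`. It is a hypothetical illustration: the actual argument handling in `gensen_train.py` may differ, and `/mnt/data` is a placeholder for the mounted datastore path.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that the estimator passes to the entry script."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--config", required=True, help="Path to the JSON config file")
    parser.add_argument("--data_folder", required=True, help="Mounted datastore path")
    parser.add_argument(
        "--max_epoch",
        type=int,
        default=None,
        help="Maximum number of epochs; None trains until the validation loss stops improving",
    )
    return parser.parse_args(argv)


# Placeholder values for illustration only
args = parse_args(
    ["--config", "utils_nlp/gensen/gensen_config.json", "--data_folder", "/mnt/data"]
)
print(args.max_epoch)  # None, because --max_epoch was not passed
```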

**Requirements:**
- python=3.6.2
- numpy=1.15.1
- numpy-base=1.15.1
- pip=10.0.1
- python=3.6.6
- python-dateutil=2.7.3
- scikit-learn=0.20.3
- azureml-defaults
- h5py
- nltk

### 2.3.2 Create the Experiment
Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in the AzureML workspace for this tutorial.

In [31]:
experiment_name = EXPERIMENT_NAME
experiment = Experiment(ws, name=experiment_name)

## 2.4 Submit the Training Job to the Compute Target
We can run the experiment by simply submitting the Estimator object to the compute target. Note that this call is asynchronous.

In [32]:
run = experiment.submit(estimator)
if AZUREML_VERBOSE:
    print(run)

### 2.4.1 Monitor the Run
We can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes. The widget automatically plots and visualizes the loss metric that we logged to the AzureML workspace.

In [33]:
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO', 's…

In [35]:
_ = run.wait_for_completion(show_output=AZUREML_VERBOSE) # Block until the script has completed training.

### 2.4.2 Interpret the Training Results
The following table shows the model validation loss with different node configurations on AmlCompute. We find that the minimum validation loss decreases as the number of nodes increases; that is, the performance scales with the number of nodes in the cluster.

| Standard_NC6 | AML_1node | AML_2nodes | AML_4nodes | AML_8nodes |
| --- | --- | --- | --- | --- |
| min_val_loss | 4.81 | 4.78 | 4.77 | 4.58 |

We also observe common tradeoffs associated with distributed training. We make use of [Horovod](https://github.com/horovod/horovod), a distributed training tool for many popular deep learning frameworks that enables parallelization of work across the nodes in the cluster. In theory, distributed training decreases the time it takes for the model to converge, but the nodes also spend additional time communicating with each other. Note that this communication time becomes relatively small when training on larger and larger datasets, but being aware of the tradeoff is helpful for choosing the node configuration when training on smaller datasets.

# 3 Tune Model Hyperparameters
Now that we've seen how to do a simple PyTorch training run using the SDK, let's see if we can further improve the accuracy of our model. We can optimize our model's hyperparameters using Azure Machine Learning's hyperparameter tuning capabilities.

## 3.1 Start a Hyperparameter Sweep
First, we define the hyperparameter space to sweep over. Since the training script uses a learning rate schedule to decay the learning rate every several epochs, we can tune the initial learning rate parameter. In this example we will use random sampling to try different configuration sets of hyperparameters to minimize our primary metric, the best validation loss.

Then, we specify the early termination policy used to stop poorly performing runs early. Here we use the `BanditPolicy`, which terminates any run whose primary metric doesn't fall within the slack factor of the best-performing run. In this tutorial, we apply this policy every epoch (since we report the validation loss metric every epoch and `evaluation_interval=1`). Note that we explicitly define `delay_evaluation` such that the first policy evaluation does not occur until after the 10th epoch.

Refer [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-tune-hyperparameters#specify-an-early-termination-policy) for more information on the BanditPolicy and other policies available.
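
To build intuition for the slack factor, here is a simplified sketch of a bandit-style termination rule for a metric that is minimized. It assumes that a run is terminated when its best loss exceeds the overall best loss multiplied by `(1 + slack_factor)`; the function is hypothetical and only approximates the behavior of the HyperDrive `BanditPolicy`, whose exact rules are described in the linked documentation.

```python
def should_terminate(
    run_best_loss,
    overall_best_loss,
    interval,
    slack_factor=0.15,
    evaluation_interval=1,
    delay_evaluation=10,
):
    """Simplified bandit rule for a minimized metric (an approximation)."""
    if interval < delay_evaluation:
        return False  # the policy is not applied during the delay period
    if interval % evaluation_interval != 0:
        return False  # the policy is only applied at evaluation intervals
    # Assumed rule: terminate runs outside the allowed slack of the best run
    return run_best_loss > overall_best_loss * (1 + slack_factor)


# With a best loss of 5.0 and a slack factor of 0.15, the threshold is about 5.75
print(should_terminate(6.0, 5.0, interval=12))  # True: 6.0 is above the threshold
print(should_terminate(5.5, 5.0, interval=12))  # False: 5.5 is within the slack
print(should_terminate(6.0, 5.0, interval=5))   # False: still in the delay period
```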

In [36]:
param_sampling = RandomParameterSampling({"learning_rate": uniform(0.0001, 0.001)})

early_termination_policy = BanditPolicy(
    slack_factor=0.15, evaluation_interval=1, delay_evaluation=10
)

hyperdrive_config = HyperDriveConfig(
    estimator=estimator,
    hyperparameter_sampling=param_sampling,
    policy=early_termination_policy,
    primary_metric_name="min_val_loss",
    primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
    max_total_runs=MAX_TOTAL_RUNS,
    max_concurrent_runs=MAX_CONCURRENT_RUNS,
)
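
For intuition about what random sampling over `uniform(0.0001, 0.001)` produces, the sketch below draws one learning rate per run using Python's standard library. It is a stand-in for illustration, not HyperDrive's sampler; the function name and the fixed seed are assumptions.

```python
import random


def sample_learning_rates(n_runs, low=0.0001, high=0.001, seed=0):
    """Draw one learning rate per run, uniformly at random from [low, high]."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.uniform(low, high) for _ in range(n_runs)]


# One candidate learning rate for each of the 8 total runs
rates = sample_learning_rates(8)
for rate in rates:
    print("{:.5f}".format(rate))
```

Each of the `MAX_TOTAL_RUNS` runs receives one such independently sampled value, and at most `MAX_CONCURRENT_RUNS` of them execute at the same time.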

Finally, launch the hyperparameter tuning job.

In [37]:
hyperdrive_run = experiment.submit(hyperdrive_config) # Start the HyperDrive run

## 3.2 Monitor HyperDrive Runs
We can monitor the progress of the runs with a Jupyter widget, or again block until the run has completed. 

In [40]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO',…

In [39]:
_ = hyperdrive_run.wait_for_completion(show_output=AZUREML_VERBOSE) # Block until complete

### 3.2.1 Interpret the Tuning Results

The charts below show 4 runs executing in parallel with different learning rates, out of 8 total runs. We pick the best learning rate by minimizing the validation loss. The HyperDrive run automatically produces tracking charts (examples below) to help visualize the tuning process.

![Tuning](https://nlpbp.blob.core.windows.net/images/gensen_tune1.PNG)
![Tuning](https://nlpbp.blob.core.windows.net/images/gensen_tune2.PNG)

**From the results in section [2.4.2 Interpret the Training Results](#2.4.2-Interpret-the-Training-Results), the best validation loss for 1 node is 4.81, but with tuning we can achieve a better validation loss of around 4.65.**

## 3.3 Find the Best Model

Once all the runs complete, we can find the run that produced the model with the lowest loss.

In [41]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
print(
    "Best Run:\n  Validation loss: {0:.5f} \n  Learning rate: {1:.5f} \n".format(
        best_run_metrics["min_val_loss"], best_run_metrics["learning_rate"]
    )
)

Best Run:
  Validation loss: 6.23771 
  Learning rate: 0.00066 



In [42]:
# Persist properties of the run so we can access the logged metrics later
sb.glue("min_val_loss", best_run_metrics['min_val_loss'])
sb.glue("learning_rate", best_run_metrics['learning_rate'])

## References

1. Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. [*Learning general purpose distributed sentence representations via large scale multi-task learning*](https://arxiv.org/abs/1804.00079), ICLR, 2018.
2. A. Conneau and D. Kiela. [*SentEval: An Evaluation Toolkit for Universal Sentence Representations*](https://arxiv.org/abs/1803.05449).
3. [*Semantic textual similarity*](http://nlpprogress.com/english/semantic_textual_similarity.html).
4. Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. [*Multi-task sequence to sequence learning*](https://arxiv.org/abs/1511.06114), 2015.
5. Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. [*Learned in translation: Contextualized word vectors*](https://arxiv.org/abs/1708.00107), 2017.