Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Question Answering: Fine-Tune BERT on AzureML (PyTorch)
**BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding** [\[1\]](#References)


This notebook is an end-to-end walkthrough of using [Azure Machine Learning service (AzureML)](https://azure.microsoft.com/en-us/services/machine-learning-service/) to fine-tune a pretrained BERT model. It uses Hugging Face's [PyTorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) of the BERT model originally released in [Google's TensorFlow repository](https://github.com/google-research/bert).

**Note: To learn how to do pre-training on your own, please refer to the [AzureML-BERT repo](https://github.com/microsoft/AzureML-BERT) created by Microsoft.**

This notebook will walk through the following:
- Download the SQuAD dataset on a remote compute and store the dataset in Azure Blob storage
- Fine-tune BERT on the SQuAD dataset with distributed PyTorch and Horovod, using GPU clusters provided by AzureML
- Further fine-tune BERT with AzureML's hyperparameter tuning

## What is BERT?

[BERT (Bidirectional Encoder Representations from Transformers)](https://arxiv.org/abs/1810.04805) is a pre-trained language model that achieved state-of-the-art results on a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), Text Classification, and Named Entity Recognition, with only a few epochs of fine-tuning on task-specific datasets. Its key technical innovation is the bidirectional training of the Transformer, a popular attention model that learns contextual relations between words (or sub-words) in a text.

## How to fine-tune BERT for QA

The figure below shows how BERT can be fine-tuned for Question Answering (QA) tasks. BERT takes the question-passage pairs in the SQuAD dataset as input, with the special `[SEP]` token separating the question from the passage. At the output layer, it predicts `Start/End` positions that mark the answer span in the passage.
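
To make the input and output format concrete, here is a minimal sketch in plain Python. It is not the code used by the training script: `pack_inputs` and `best_span` are hypothetical helpers, the tokens are whitespace-split words rather than WordPiece tokens, and the scores are toy values standing in for the model's `Start/End` outputs.

```python
from typing import List, Tuple


def pack_inputs(question_tokens: List[str], passage_tokens: List[str]) -> Tuple[List[str], List[int]]:
    """Build the [CLS] question [SEP] passage [SEP] sequence and its segment ids."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    return tokens, segment_ids


def best_span(start_scores: List[float], end_scores: List[float],
              passage_offset: int, max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest start + end score.

    Only spans inside the passage with start <= end and bounded length are considered.
    """
    best, best_score = (passage_offset, passage_offset), float("-inf")
    last = len(start_scores) - 1  # index of the final [SEP], excluded from answers
    for start in range(passage_offset, last):
        for end in range(start, min(start + max_answer_len, last)):
            score = start_scores[start] + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


question = ["who", "wrote", "hamlet", "?"]
passage = ["hamlet", "was", "written", "by", "william", "shakespeare", "."]
tokens, segment_ids = pack_inputs(question, passage)

# Toy scores (hypothetical values standing in for the model's output layer).
start_scores = [0.0] * len(tokens)
end_scores = [0.0] * len(tokens)
start_scores[tokens.index("william")] = 5.0
end_scores[tokens.index("shakespeare")] = 5.0

passage_offset = len(question) + 2  # skip [CLS], the question, and the first [SEP]
start, end = best_span(start_scores, end_scores, passage_offset)
print(" ".join(tokens[start:end + 1]))
```

Restricting the search to `start <= end` inside the passage is what turns two independent position scores into a valid answer span.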

<img src="https://nlpbp.blob.core.windows.net/images/bertqa.PNG" height=400 width=400>

## What is the SQuAD dataset?

"Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable". [\[2\]](#References)

"SQuAD 1.1, the previous version of the SQuAD dataset, contains 100,000+ question-answer pairs on 500+ articles". [\[2\]](#References) More details from [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/).

# Table of Contents

<ol start="0">
  <li> [Prerequisites: Global Settings](#0.-Prerequisites:-Global-Settings)</li>
  <li> [Data Loading](#1.-Data-Loading)
  <ul style="list-style: none;"><li>[1.1 Default AzureML datastore](#1.1-Default-AzureML-datastore)</li>
        <li>[1.2 Download training dataset - SQuAD](#1.2-Download-training-dataset---SQuAD)</li>
        <li>[1.3 Upload to Azure blob storage](#1.3-Upload-to-Azure-blob-storage)</li>
    </ul>
  </li><br>
  <li> [Fine tuning BERT with Distributed Training by Horovod](#2.-Fine-tuning-BERT-with-Distributed-Training-by-Horovod)
    <ul style="list-style: none;">
        <li> [2.1 Create a GPU remote compute target](#2.1-Create-a-GPU-remote-compute-target)</li>
        <li> [2.2 Create a project directory](#2.2-Create-a-project-directory)  </li>
        <li> [2.3 Prepare your training script](#2.3-Prepare-your-training-script) </li> 
        <li> [2.4 Create a PyTorch estimator for fine tuning](#2.4-Create-a-PyTorch-estimator-for-fine-tuning) </li> 
        <li> [2.5 Create an experiment](#2.5-Create-an-experiment)  </li>
        <li> [2.6 Submit and Monitor your run](#2.6-Submit-and-Monitor-your-run)  </li>
      </ul>
    </li><br>
    <li>[Fine Tuning BERT with Hyperparameter Tuning](#3-Fine-Tuning-BERT-with-Hyperparameter-Tuning)
        <ul style="list-style: none;">
            <li> [3.1 Start a hyperparameter sweep](#3.1-Start-a-hyperparameter-sweep)</li>
            <li> [3.2 Monitor HyperDrive runs](#3.2-Monitor-HyperDrive-runs)</li>
            <li> [3.3 Find and register the best model](#3.3-Find-and-register-the-best-model)</li>
        </ul>
    </li><br>
    <li>[References](#References)</li>
</ol>

## 0. Prerequisites: Global Settings
You will need to do the following to be successful with the rest of the notebook:
- Have an existing Azure subscription. You can get started for free [here](https://azure.microsoft.com/free/)
- Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning service (AzureML)

- Make sure the [AzureML Python SDK](https://pypi.org/project/azureml-sdk/) is installed with the `notebooks` and `contrib` extras.
```
conda create -n azureml -y python=3.6
source activate azureml
pip install --upgrade azureml-sdk[notebooks,contrib] 
conda install ipywidgets
jupyter nbextension install --py --user azureml.widgets
jupyter nbextension enable azureml.widgets --user --py
```
- Import the required packages
- Set Environment Variables
- Connect to an [Azure Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-workspace#create-a-workspace)

Run the following cell to make sure you have installed all the packages.

In [None]:
import sys
sys.path.append("../../")
import os
import urllib.request  # import the submodule explicitly; urlretrieve is used below
from utils_nlp.azureml import azureml_utils
import math
import json 
import pandas as pd
import papermill as pm
#package for flattening json in pandas df
from pandas.io.json import json_normalize
import shutil
import scrapbook as sb
# Check core SDK version number
import azureml.core
from azureml.core import Datastore
from azureml.core import Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import MpiConfiguration
from azureml.telemetry import set_diagnostics_collection
from azureml.train.dnn import PyTorch
from azureml.train.hyperdrive import *
from azureml.widgets import RunDetails

print("SDK version:", azureml.core.VERSION)

The following parameters are set as variables to be used throughout the notebook.

In [2]:
# Model configuration
DATA_FOLDER = './squad'
PROJECT_FOLDER = './transformers'
EXPERIMENT_NAME = 'NLP-QA-BERT-deepdive'
BERT_MODEL = 'bert-large-uncased'
TARGET_GRADIENT_STEPS = 16
INIT_GRADIENT_STEPS = 2
MAX_SEQ_LENGTH = 384
NUM_TRAIN_EPOCHS = 2.0
NODE_COUNT = 2
TRAIN_SCRIPT_PATH = 'bert_run_squad_azureml.py'
MAX_TOTAL_RUNS = 8
MAX_CONCURRENT_RUNS = 4
BERT_UTIL_PATH = '../../utils_nlp/azureml/azureml_bert_util.py'
EVALUATE_SQAD_PATH = '../../utils_nlp/eval/evaluate_squad.py'

# Azure resources
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "YOUR_RESOURCE_GROUP_NAME"  
workspace_name = "YOUR_WORKSPACE_NAME"  
workspace_region = "YOUR_WORKSPACE_REGION" #Possible values eastus, eastus2 and so on.
AZUREML_CONFIG_PATH = "./.azureml"
AZUREML_VERBOSE = False

**Initialize workspace**

The following cell sets up the connection to your [Azure Machine Learning service Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace). You can choose to connect to an existing workspace or create a new one. 

**To access an existing workspace:**
1. If you have a `config.json` file, you do not need to provide the workspace information; you only need to point the `AZUREML_CONFIG_PATH` variable defined above at the location that contains the file.
2. Otherwise, you will need to supply the following:
    * The name of your workspace
    * Your subscription id
    * The resource group name
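
For reference, a workspace `config.json` holds the same three values listed in option 2. The sketch below writes a placeholder file to a temporary directory and reads it back; `load_workspace_config` is a hypothetical helper for illustration, not part of the AzureML SDK or of `azureml_utils`, and the values are the same placeholders used in the settings cell.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(config_dir: str) -> dict:
    """Read config.json from config_dir and check that the three workspace fields are set."""
    with open(os.path.join(config_dir, "config.json")) as config_file:
        config = json.load(config_file)
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("config.json is missing: " + ", ".join(missing))
    return config


# Write a placeholder config to a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as config_dir:
    placeholder = {
        "subscription_id": "YOUR_SUBSCRIPTION_ID",
        "resource_group": "YOUR_RESOURCE_GROUP_NAME",
        "workspace_name": "YOUR_WORKSPACE_NAME",
    }
    with open(os.path.join(config_dir, "config.json"), "w") as config_file:
        json.dump(placeholder, config_file, indent=2)
    config = load_workspace_config(config_dir)

print(sorted(config))
```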

**To create a new workspace:**

Set the following information:
* A name for your workspace
* Your subscription id
* The resource group name
* [Azure region](https://azure.microsoft.com/en-us/global-infrastructure/regions/) to create the workspace in, such as `eastus2`. 

This will automatically create a new resource group for you in the region provided if a resource group with the name given does not already exist. 

In [4]:
if os.path.exists(AZUREML_CONFIG_PATH):
    ws = azureml_utils.get_or_create_workspace(config_path=AZUREML_CONFIG_PATH)
else:
    ws = azureml_utils.get_or_create_workspace(
        subscription_id=subscription_id,
        resource_group=resource_group,
        workspace_name=workspace_name,
        workspace_region=workspace_region,
    )

if AZUREML_VERBOSE:
    print('Workspace name: ' + ws.name, 
          'Azure region: ' + ws.location, 
          'Subscription id: ' + ws.subscription_id, 
          'Resource group: ' + ws.resource_group, sep='\n')

**Diagnostics**

Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [4]:
set_diagnostics_collection(send_diagnostics=True)

Turning diagnostics collection on. 


## 1. Data Loading

In this section, we will
1. Connect to the default AzureML datastore
2. Download and load the dataset
3. Upload the training set to the default blob storage of the workspace

In [5]:
data_folder = DATA_FOLDER

### 1.1 Default AzureML datastore

To make data accessible for remote training, AzureML provides a convenient way to do so via a [Datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data). The datastore provides a mechanism for you to upload/download data to Azure Storage, and interact with it from your remote compute targets.

Each workspace is associated with a default Azure Blob datastore named `'workspaceblobstore'`. We use this default datastore to store the SQuAD training data.

In [6]:
ds = ws.get_default_datastore()
if AZUREML_VERBOSE:
    print(ds.datastore_type, ds.account_name, ds.container_name, ds.as_mount())

AzureBlob maidaipberteas6144514557 azureml-blobstore-cf97de17-8d21-437f-8b4c-298560f34ecd $AZUREML_DATAREFERENCE_workspaceblobstore


### 1.2 Download training dataset - SQuAD

The SQuAD dataset can be downloaded from the following links and should be saved in blob storage.
- [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
- [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)

In [7]:
# Create the local data folder and download both SQuAD v1.1 files into it
os.makedirs(data_folder, exist_ok=True)
squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/'
urllib.request.urlretrieve(squad_url + 'train-v1.1.json', filename=os.path.join(data_folder, 'train-v1.1.json'))
urllib.request.urlretrieve(squad_url + 'dev-v1.1.json', filename=os.path.join(data_folder, 'dev-v1.1.json'))

('./squad\\dev-v1.1.json', <http.client.HTTPMessage at 0x1569b645f28>)

The SQuAD dataset contains question-answer pairs on 500+ articles. Each observation in the training set has a **context**, a **question**, and an answer **text**. An example is shown below ([source](https://towardsdatascience.com/building-a-question-answering-system-part-1-9388aadff507)):
<img src="https://nlpbp.blob.core.windows.net/images/squad.png">
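
The nesting of the SQuAD v1.1 JSON (articles, then paragraphs, then question-answer pairs) can be flattened with a few loops. The sketch below uses a tiny hand-made record in the same layout rather than the real dataset; `iter_examples` is a hypothetical helper, and it also checks that `answer_start` really points at the answer text inside the context.

```python
from typing import Dict, Iterator

# A tiny hand-made record in the SQuAD v1.1 layout (not taken from the dataset).
toy_squad = {
    "version": "1.1",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "BERT was released by Google in 2018.",
            "qas": [{
                "id": "toy-0001",
                "question": "Who released BERT?",
                "answers": [{"answer_start": 21, "text": "Google"}],
            }],
        }],
    }],
}


def iter_examples(squad: Dict) -> Iterator[Dict]:
    """Yield one flat record per (question, answer) pair."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    text = answer["text"]
                    yield {
                        "id": qa["id"],
                        "question": qa["question"],
                        "text": text,
                        "answer_start": start,
                        # answer_start is a character offset into the context
                        "answer_in_context": context[start:start + len(text)] == text,
                    }


rows = list(iter_examples(toy_squad))
for row in rows:
    print(row)
```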

In [8]:
# Load the training set and take the first paragraph of the first article
with open(os.path.join(data_folder, 'train-v1.1.json')) as train_file:
    train = json.load(train_file)
paragraph = train['data'][0]['paragraphs'][0]
answer_question = paragraph['qas']
context = paragraph['context']
print("The structure of an example in the training data.")
json_normalize(paragraph).head()

The structure of an example in the training data.


|   | context | qas |
|--:|:--------|:----|
| 0 | Architecturally, the school has a Catholic cha... | [{'answers': [{'answer_start': 515, 'text': 'S... |


In [9]:
print("The structure of question answer pairs in the above example.")
json_normalize(answer_question).head()

The structure of question answer pairs in the above example.


|   | answers | id | question |
|--:|:--------|:---|:---------|
| 0 | [{'answer_start': 515, 'text': 'Saint Bernadet... | 5733be284776f41900661182 | To whom did the Virgin Mary allegedly appear i... |
| 1 | [{'answer_start': 188, 'text': 'a copper statu... | 5733be284776f4190066117f | What is in front of the Notre Dame Main Building? |
| 2 | [{'answer_start': 279, 'text': 'the Main Build... | 5733be284776f41900661180 | The Basilica of the Sacred heart at Notre Dame... |
| 3 | [{'answer_start': 381, 'text': 'a Marian place... | 5733be284776f41900661181 | What is the Grotto at Notre Dame? |
| 4 | [{'answer_start': 92, 'text': 'a golden statue... | 5733be284776f4190066117e | What sits on top of the Main Building at Notre... |


### 1.3 Upload to Azure blob storage

The following code uploads the SQuAD dataset to the path `./squad` on the default datastore.

In [10]:
ds.upload(src_dir='./squad', target_path='./squad', show_progress=AZUREML_VERBOSE)

Uploading an estimated of 2 files
Target already exists. Skipping upload for squad\dev-v1.1.json
Target already exists. Skipping upload for squad\train-v1.1.json
Uploaded 0 files


$AZUREML_DATAREFERENCE_972d18f476b34d26a1ffd6a11b473114

## 2. Fine-tuning BERT with Distributed Training by Horovod
We can now reference the datastore to access the SQuAD dataset and fine-tune the model with distributed training on AzureML GPU clusters.

Once you've created your workspace and set up your development environment, training a model in Azure Machine Learning involves the following steps:
1. Create a GPU remote compute target
2. Create a project directory
3. Prepare your training script
4. Create an Estimator object
5. Submit the estimator to an experiment object under the workspace

### 2.1 Create a GPU remote compute target

We need to create a GPU [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) to perform the fine-tuning. In this example, we create an AmlCompute cluster as our training compute resource. The table below lists the relevant Azure VM sizes.

    
|       VM Size      | vCPUs |    GPU   | Storage (SSD) | GPU memory | InfiniBand |
|:------------------:|:-----:|:--------:|:-------------:|:----------:|:----------:|
|    Standard_NC6    |   6   |  1 x K80 |    340 GiB    |   12 GiB   |     No     |
|    Standard_NC12   |   12  |  2 x K80 |    680 GiB    |   24 GiB   |     No     |
|    Standard_NC24   |   24  |  4 x K80 |    1440 GiB   |   48 GiB   |     No     |
|   Standard_NC24r   |   24  |  4 x K80 |    1440 GiB   |   48 GiB   |     Yes    |
|  Standard_NC6s_v3  |   6   | 1 x V100 |    736 GiB    |   16 GiB   |     No     |
|  Standard_NC12s_v3 |   12  | 2 x V100 |    1474 GiB   |   32 GiB   |     No     |
|  Standard_NC24s_v3 |   24  | 4 x V100 |    2948 GiB   |   64 GiB   |     No     |
| Standard_NC24rs_v3 |   24  | 4 x V100 |    2948 GiB   |   64 GiB   |     Yes    |

This code creates a cluster for you if it does not already exist in your workspace.

***We strongly recommend using the NCv3-series (NVIDIA Tesla V100) to fine-tune on the SQuAD dataset. You will need to request NCv3-series quota for your Azure subscription.***

In [11]:
# choose a name for your cluster
cluster_name = "bertncrs24"

try:
    gpu_compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC24rs_v3',
                                                           max_nodes=4)

    # create the cluster
    gpu_compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    gpu_compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current AmlCompute. 
print(gpu_compute_target.get_status().serialize())

Found existing compute target.
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-07-22T22:38:04.496000+00:00', 'errors': None, 'creationTime': '2019-07-12T19:59:45.933132+00:00', 'modifiedTime': '2019-07-12T20:00:01.793458+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC24RS_V3'}


### 2.2 Create a project directory
Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

In [6]:
project_folder = PROJECT_FOLDER

Make a local clone of the original [PyTorch reimplementation](https://github.com/huggingface/pytorch-pretrained-BERT) repository. The clone is created in `./transformers`, which matches `PROJECT_FOLDER`.

In [None]:
!git clone -b v0.4.0 https://github.com/huggingface/transformers.git

### 2.3 Prepare your training script
Let us prepare the training script to run the fine-tuning script `run_squad.py` from [the Hugging Face repository](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_squad.py). Please refer to the [repo](https://github.com/huggingface/pytorch-pretrained-BERT#fine-tuning-with-bert-running-the-examples) for more details about the script. 

The original `run_squad.py` script uses the PyTorch distributed launch utility to launch multiple processes across nodes and GPUs. Here we use a [modified version](https://github.com/microsoft/AzureML-BERT/blob/master/finetune/run_squad_azureml.py) of this file, provided by the [AzureML-BERT repo](https://github.com/microsoft/AzureML-BERT) from Microsoft, which launches multiple processes across nodes and GPUs with AzureML's built-in MPI backend.

Let's copy the training script [bert_run_squad_azureml.py](./bert_run_squad_azureml.py), the evaluation script for SQuAD v1.1 [evaluate-v1.1.py](../../utils_nlp/eval/evaluate_squad.py), and the Horovod helper utility script [azureml_bert_util.py](../../utils_nlp/azureml/azureml_bert_util.py) into our project directory.

In [1]:
shutil.copy(EVALUATE_SQAD_PATH, project_folder)
shutil.copy(BERT_UTIL_PATH, project_folder)
shutil.copy(TRAIN_SCRIPT_PATH, project_folder)

### 2.4 Create a PyTorch estimator for fine-tuning
Let us create a new PyTorch estimator to run the fine-tuning script `run_squad_azureml.py`. To use AzureML's tracking and metrics capabilities, we need to add a small amount of AzureML code inside the training script.

In `run_squad_azureml.py`, we will log some metrics to our AzureML run. To do so, we will access the AzureML run object within the script:
```Python
from azureml.core.run import Run
run = Run.get_context()
```
Further within `run_squad_azureml.py`, we log the learning rate, the training loss, and the prediction scores the model achieves:
```Python
run.log('lr', np.float(args.learning_rate))
...

for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")): 
    ...
    run.log('train_loss', np.float(loss))

...
```
These run metrics will become particularly important when we begin hyperparameter tuning in the [Fine-Tuning BERT with Hyperparameter Tuning](#3-Fine-Tuning-BERT-with-Hyperparameter-Tuning) section below.

The AzureML PyTorch estimator can then be defined as shown below. We use `azuremlsamples/bert:torch-1.0.0-apex-cuda9` as the base Docker image (see the [dockerfile](./dockerfile)). This example uses `STANDARD_NC24rs_v3`, which has 4 GPUs per node, so we set `process_count_per_node=4`.

In [8]:
mpiConfig=MpiConfiguration()
mpiConfig.process_count_per_node=4

estimator = PyTorch(source_directory=project_folder,
                    compute_target=gpu_compute_target,
                    script_params = {
                          '--bert_model':BERT_MODEL,
                          '--do_train' : '',
                          '--do_predict': '',
                          '--train_file': ds.path('squad/train-v1.1.json').as_mount(),
                          '--predict_file': ds.path('squad/dev-v1.1.json').as_mount(),
                          '--max_seq_length': MAX_SEQ_LENGTH,
                          '--train_batch_size': 8,
                          '--learning_rate': 6.8e-5,
                          '--num_train_epochs': NUM_TRAIN_EPOCHS,
                          '--doc_stride': 128,
                          '--seed': 32,
                          '--init_gradient_accumulation_steps':INIT_GRADIENT_STEPS,
                          '--target_gradient_accumulation_steps':TARGET_GRADIENT_STEPS,
                          '--accumulation_warmup_proportion':0.25,
                          '--output_dir': './outputs',
                          '--loss_scale':256,
                    },
                    custom_docker_image='azuremlsamples/bert:torch-1.0.0-apex-cuda9',
                    entry_script='bert_run_squad_azureml.py',
                    node_count=NODE_COUNT,
                    distributed_training=mpiConfig,
                    framework_version='1.1',
                    use_gpu=True)
estimator._estimator_config.environment.python.user_managed_dependencies=True

**Note: You can set `'--bert_model': 'bert-base-uncased'` to train the smaller BERT base model, which runs faster.**
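
With `node_count=2` and `process_count_per_node=4`, the job runs one worker process per GPU, eight in total. The short sketch below works through that arithmetic. `worker_layout` is a hypothetical helper, and the node-by-node rank assignment is an illustrative assumption, not something guaranteed by the MPI backend.

```python
from typing import List, Tuple

NODE_COUNT = 2               # same value as in the configuration cell
PROCESS_COUNT_PER_NODE = 4   # one process per GPU on STANDARD_NC24rs_v3


def worker_layout(node_count: int, process_count_per_node: int) -> List[Tuple[int, int, int]]:
    """Return (global_rank, node_index, local_rank) for every worker process.

    Assumes ranks are assigned node by node, which is an illustrative convention.
    """
    world_size = node_count * process_count_per_node
    return [(rank, rank // process_count_per_node, rank % process_count_per_node)
            for rank in range(world_size)]


layout = worker_layout(NODE_COUNT, PROCESS_COUNT_PER_NODE)
print("world size:", len(layout))
for global_rank, node_index, local_rank in layout:
    print(f"rank {global_rank}: node {node_index}, GPU {local_rank}")
```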

### 2.5 Create an experiment
Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this distributed PyTorch tutorial. 

In [9]:
experiment_name = EXPERIMENT_NAME
experiment = Experiment(ws, name=experiment_name)

### 2.6 Submit and Monitor your run
AzureML can automatically create figures of logged metrics such as loss over time, which makes it easier to track performance. The following figure shows the training loss versus the number of iterations:
![train_loss_bert](https://nlpbp.blob.core.windows.net/images/train_loss_bert.PNG)

The Jupyter widget looks like this:
![train_loss_bert](https://nlpbp.blob.core.windows.net/images/bert_widget.PNG)

In [74]:
run = experiment.submit(estimator)
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO', 's…

In [None]:
_ = run.wait_for_completion(show_output=AZUREML_VERBOSE) # Block until complete

**Cancel the job**

You can cancel the job manually to avoid wasting resources.
```python
from azureml.core.run import get_run

# Retrieve the run by its id.
job_id = "BERT-SQuAD_1562612876_bab5b3af"
run = get_run(experiment, job_id)

# Cancel the run.
run.cancel()
```

Fine-tuning the `BERT large` model on the `SQuAD v1.1` dataset for **2** epochs achieves an F1 score over **90.5** and an Exact Match (EM) score over **83.5**. The tables below show the elapsed time for different Azure GPU VMs and configurations. 

The default configuration in this notebook uses 2 `STANDARD_NC24rs_v3` nodes (8 x V100 in total) with `fp16` enabled. The training phase should take about **22 mins** to complete 2 epochs. 

| GPU count      |   1 GPU  |  2 GPUs  |  4 GPUs |  8 GPUs |
|:---------------|:--------:|:--------:|:-------:|:-------:|
| NCv3-series    | 340 mins | 180 mins | 80 mins | 48 mins |
| NCv3 with fp16 | 140 mins |  79 mins | 38 mins | 22 mins |
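
The scaling behaviour in the table above can be summarised as speedup (single-GPU time divided by multi-GPU time) and parallel efficiency (speedup divided by GPU count). The sketch below computes both directly from the table's numbers; the helper names are made up for illustration.

```python
# Training time in minutes, copied from the table above.
TRAINING_MINUTES = {
    "NCv3-series": {1: 340, 2: 180, 4: 80, 8: 48},
    "NCv3 with fp16": {1: 140, 2: 79, 4: 38, 8: 22},
}


def speedup(config: str, gpus: int) -> float:
    """Single-GPU time divided by the time on `gpus` GPUs."""
    return TRAINING_MINUTES[config][1] / TRAINING_MINUTES[config][gpus]


def efficiency(config: str, gpus: int) -> float:
    """Speedup per GPU; 1.0 would be perfect linear scaling."""
    return speedup(config, gpus) / gpus


for config, timings in TRAINING_MINUTES.items():
    for gpus in timings:
        print(f"{config:>15} | {gpus} GPU(s) | speedup {speedup(config, gpus):.2f}x "
              f"| efficiency {efficiency(config, gpus):.0%}")
```

On these numbers, 8 GPUs give roughly a 7.1x speedup without `fp16` and roughly 6.4x with it, and enabling `fp16` on 8 GPUs cuts the training time from 48 to 22 minutes.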

The performance with different VMs with `fp16` enabled (Duration = training time + preparing time):

| VM Size  | GPU count | Node count | Duration |   F1  |   EM  | Pretrained BERT |
|:---------|:---------:|:----------:|:--------:|:-----:|:-----:|:---------------:|
| NC6sv3   |     4     |      4     |  31 mins | 88.24 | 80.59 |       Base      |
| NC6sv3   |     4     |      4     |  80 mins | 90.78 | 83.96 |      Large      |
| NC24rsv3 |     4     |      1     |  19 mins | 86.18 | 77.90 |       Base      |
| NC24rsv3 |     4     |      1     |  46 mins | 90.53 | 83.56 |      Large      |
| NC24rsv3 |     8     |      2     |  19 mins | 87.47 | 79.52 |       Base      |
| NC24rsv3 |     8     |      2     |  32 mins | 90.57 | 83.58 |      Large      |
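
The F1 and EM columns in the table above are the two standard SQuAD v1.1 metrics. As a rough illustration of how they are defined for a single prediction, the sketch below re-implements the usual normalization (lowercasing, removing punctuation and English articles) and the token-overlap F1. It is a simplified stand-in, not the `evaluate_squad.py` script copied into the project folder, and the example strings are made up.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "the playwright William Shakespeare"
reference = "William Shakespeare"
print("EM:", exact_match(prediction, reference))
print("F1:", f1_score(prediction, reference))
```

Dataset-level scores are the average over all questions, reported on a 0 to 100 scale; when a question has several reference answers, the best score over the references is used.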

## 3 Fine-Tuning BERT with Hyperparameter Tuning

We would also like to optimize a hyperparameter, the learning rate, using Azure Machine Learning's hyperparameter tuning capabilities.

### 3.1 Start a hyperparameter sweep
First, we define the hyperparameter space to sweep over. In this example we use random sampling to try different hyperparameter configurations and maximize our primary metric, the F1 score (`f1`). For simplicity, we tune the BERT base model with `'--bert_model': 'bert-base-uncased'` and `node_count=1`.

We can also try `BayesianParameterSampling`, with a suggested `max_total_runs=20`.
```Python
param_sampling = BayesianParameterSampling( {
         'learning_rate': uniform(5e-5, 9e-5),
    }
)
```

In [10]:
param_sampling = RandomParameterSampling( {
         'learning_rate': uniform(5e-5, 9e-5),
    }
)
hyperdrive_config = HyperDriveConfig(estimator=estimator,
                                         hyperparameter_sampling=param_sampling, 
                                         primary_metric_name='f1',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=MAX_TOTAL_RUNS,
                                         max_concurrent_runs=MAX_CONCURRENT_RUNS)
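
To get a feel for what random sampling over `uniform(5e-5, 9e-5)` produces, the sketch below draws candidate learning rates locally with Python's standard `random` module. It only imitates the idea: `sample_learning_rates` is a hypothetical helper, the seed is arbitrary, and grouping the candidates into fixed waves is a simplification of how runs are actually scheduled.

```python
import random
from typing import List

MAX_TOTAL_RUNS = 8        # same values as in the configuration cell
MAX_CONCURRENT_RUNS = 4


def sample_learning_rates(low: float, high: float, n: int, seed: int = 0) -> List[float]:
    """Draw n learning rates uniformly at random from [low, high]."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]


candidates = sample_learning_rates(5e-5, 9e-5, MAX_TOTAL_RUNS)

# Simplified view: at most MAX_CONCURRENT_RUNS candidates are trained at the same time.
waves = [candidates[i:i + MAX_CONCURRENT_RUNS]
         for i in range(0, len(candidates), MAX_CONCURRENT_RUNS)]

for wave_index, wave in enumerate(waves):
    print(f"wave {wave_index}:", ", ".join(f"{lr:.2e}" for lr in wave))
```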

Finally, launch the hyperparameter tuning job.

In [None]:
# start the HyperDrive run
hyperdrive_run = experiment.submit(hyperdrive_config)
RunDetails(hyperdrive_run).show()

### 3.2 Monitor HyperDrive runs
We can monitor the progress of the runs with the following Jupyter widget. 
![](https://nlpbp.blob.core.windows.net/images/bert_tune.PNG)
![](https://nlpbp.blob.core.windows.net/images/bert_tune2.PNG)

In [None]:
_ = hyperdrive_run.wait_for_completion(show_output=AZUREML_VERBOSE) # Block until complete

You can follow the experiment's progress from this notebook with `RunDetails(hyperdrive_run).show()`, or in the Azure portal using the URL returned by `hyperdrive_run.get_portal_url()`.
To load an existing HyperDrive run, use `hyperdrive_run = HyperDriveRun(experiment, <user-run-id>, hyperdrive_run_config=hyperdrive_config)`. You can also cancel a run with `hyperdrive_run.cancel()`.

**Cancel the HyperDrive run to save resources**
```python
# Cancel the HyperDrive run
hyperdrive_run.cancel()
```

### 3.3 Find and register the best model
Once all the runs complete, we can find the run that produced the model with the highest F1 score. The F1 score with the default learning rate is **86.18** (see [Submit and Monitor your run](#2.6-Submit-and-Monitor-your-run)). With random sampling, the best F1 score after tuning is **87.01**, at `learning_rate=0.000090`. With Bayesian sampling, the best F1 score is **86.87**, at `learning_rate=0.0000896`.

In [20]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
print('Best Run is:\n  F1 score: %.2f \n  Learning rate: %f' % (best_run_metrics['f1'], best_run_metrics['lr']))

Run(Experiment: BERT-SQuAD,
Id: BERT-SQuAD_1562966635446_2,
Type: azureml.scriptrun,
Status: Completed)
Best Run is:
  F1 score: 87.01 
  Learning rate: 0.000090


In [None]:
# Persist properties of the run so we can access the logged metrics later
sb.glue("f1", best_run_metrics['f1'])
sb.glue("learning_rate", best_run_metrics['lr'])

## References

1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, [*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](https://arxiv.org/abs/1810.04805), NAACL-HLT, 2019.
2. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang, [*SQuAD: 100,000+ Questions for Machine Comprehension of Text*](https://arxiv.org/abs/1606.05250), EMNLP, 2016. Dataset available at [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/).