Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# BiDAF Model Deep Dive on AzureML


This notebook demonstrates a deep dive into a popular question-answering (QA) model, Bi-Directional Attention Flow (BiDAF). We use [AllenNLP](https://allennlp.org/), an open-source NLP research library built on top of PyTorch, to train the BiDAF model from scratch on the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset, using Azure Machine Learning ([AzureML](https://azure.microsoft.com/en-us/services/machine-learning-service/)). 

The following capabilities are highlighted in this notebook:  
- AmlCompute
- Datastore
- Logging
- AllenNLP library

## Table of Contents

1. [Introduction](#1.-Introduction)  
    * 1.1 [SQuAD Dataset](#1.1-SQuAD-Dataset)  
    * 1.2 [BiDAF Model](#1.2-BiDAF-Model)  
    * 1.3 [AllenNLP](#1.3-AllenNLP)  
2. [AzureML Setup](#2.-AzureML-Setup)  
    * 2.1 [Link to or create a `Workspace`](#2.1-Link-to-or-create-a-Workspace)  
    * 2.2 [Set up an `Experiment` and Logging](#2.2-Set-up-an-Experiment-and-Logging)  
    * 2.3 [Link `AmlCompute` compute target](#2.3-Link-AmlCompute-Compute-Target)  
    * 2.4 [Upload Files to `Datastore`](#2.4-Upload-Files-to-Datastore)  
3. [Prepare Training Script](#3.-Prepare-Training-Script) 
4. [Create a PyTorch Estimator](#4.-Create-a-PyTorch-Estimator)
5. [Submit a Job](#5.-Submit-a-Job)  
6. [Inspect Results of Run](#6.-Inspect-Results-of-Run)  
    * 6.1 [Evaluation on SQuAD](#6.1-Evaluation-on-SQuAD)
    * 6.2 [Try the Best Model](#6.2-Try-the-Best-Model)

## 1. Introduction

### 1.1 SQuAD Dataset

The [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset was released in 2016 and has become a benchmarking dataset for machine comprehension tasks. It contains a set of more than 100,000 question-context tuples along with their answers, extracted from Wikipedia articles. 90,000 of the question-context tuples make up the training set and the remaining 10,000 compose the development set. The answers are spans in the context (given passage) and are evaluated against human-labeled answers. Two metrics are used for evaluation: Exact Match (EM) and F1 score.
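To make the two metrics concrete, the following is a minimal sketch of how EM and token-level F1 can be computed for a single prediction against a single reference. It follows the general recipe of the SQuAD v1.1 evaluation script (lowercase, strip punctuation and articles, collapse whitespace), but the function names are our own, and the official script additionally takes the maximum score over several reference answers per question.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("the Stanford QA", "Stanford QA")
f1 = f1_score("Stanford QA dataset", "Stanford QA")
print(em, round(f1, 2))
```

EM rewards only a perfect match after normalization, while F1 gives partial credit when the predicted span overlaps the reference.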

### 1.2 BiDAF Model

The [BiDAF](https://www.semanticscholar.org/paper/Bidirectional-Attention-Flow-for-Machine-Seo-Kembhavi/007ab5528b3bd310a80d553cccad4b78dc496b02) model achieved state-of-the-art performance on the SQuAD dataset in 2017 and is a well-respected, performant baseline for QA. The BiDAF network is a "hierarchical multi-stage architecture for modeling representations of the context at different levels of granularity. BiDAF includes character-level, word-level, and phrase-level embeddings, and uses bi-directional attention flow to allow for query-aware context representations". 

The network contains six different layers, as described by [Seo et al, 2017](https://www.semanticscholar.org/paper/Bidirectional-Attention-Flow-for-Machine-Seo-Kembhavi/007ab5528b3bd310a80d553cccad4b78dc496b02):

1. **Character Embedding Layer**: character-level CNNs to embed each word
2. **Word Embedding Layer**: word embeddings using pre-trained GloVe word vectors
3. **Phrase Embedding Layer**: LSTM on top of the previous layers to model the temporal interactions between words
4. **Attention Flow Layer**: fuses information from the context and query words. Unlike previous models, "the attention flow layer is not used to summarize the query and context into single feature vectors. Instead, the attention vectors at each time step, along with embeddings from previous layers, are allowed to flow through to the subsequent modeling layers", reducing information loss.
5. **Modeling Layer**: produces a matrix of contextual information about the word with respect to the entire context paragraph and query
6. **Output Layer**: predicts the start and end indices of the phrase in the paragraph

The following figure displays the architecture of the BiDAF network.

![](https://nlpbp.blob.core.windows.net/images/BiDAF_model.png)
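The attention flow layer is the least familiar of these layers, so here is a toy, pure-Python sketch of its two directions. It is an illustration under simplifying assumptions, not the AllenNLP or original implementation: the similarity between a context vector and a query vector is a plain dot product here, whereas the paper uses a trainable similarity function, and the embeddings are made-up 2-dimensional vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product, used here as a stand-in for the trainable similarity."""
    return sum(a * b for a, b in zip(u, v))


def context_to_query(context, query):
    """For each context vector, return a similarity-weighted sum of the query vectors."""
    attended = []
    for h in context:
        weights = softmax([dot(h, u) for u in query])
        attended.append(
            [sum(w * u[d] for w, u in zip(weights, query)) for d in range(len(query[0]))]
        )
    return attended


def query_to_context(context, query):
    """Weight each context vector by its best similarity to any query word."""
    best = [max(dot(h, u) for u in query) for h in context]
    weights = softmax(best)
    return [sum(w * h[d] for w, h in zip(weights, context)) for d in range(len(context[0]))]


# Toy embeddings (assumed values): three context words and two query words.
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [[1.0, 0.0], [0.0, 2.0]]

c2q = context_to_query(context, query)  # one attended query vector per context word
q2c = query_to_context(context, query)  # a single attended context vector
print(len(c2q), len(q2c))
```

Note that `context_to_query` keeps one vector per context position rather than collapsing the passage into a single summary, which is the property the quoted passage emphasizes.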

### 1.3 AllenNLP

The notebook demonstrates how to use the BiDAF implementation provided by [AllenNLP](https://www.semanticscholar.org/paper/A-Deep-Semantic-Natural-Language-Processing-Gardner-Grus/a5502187140cdd98d76ae711973dbcdaf1fef46d), an open-source NLP research library built on top of PyTorch. AllenNLP is a product of the Allen Institute for Artificial Intelligence and is used widely across different universities and top companies (including Facebook Research and Amazon Alexa). They maintain a robust and active [GitHub repository](https://github.com/allenai/allennlp) as well as a [website](https://allennlp.org/) with documentation and demos. Their model is a reimplementation of the original BiDAF model, and they report a higher EM score and faster training times than the original BiDAF system (68.3 EM score versus 67.7 and 10x speedup, taking ~4 hours on a p2.xlarge). The AllenNLP library is mainly designed for use through the command line (and most tutorials use this method), but can also be used programmatically. 

The AllenNLP library focuses on the creation of NLP pipelines with easily interchangeable building blocks. The general pipeline steps are as follows:  
- DatasetReader: defines how to extract information from your data and convert it into Instance objects that will be used by the model  
- Iterator: takes the instances produced by the DatasetReader and batches them for training
- Model
- Trainer: trains the model and records metrics  
- Predictor: takes raw strings and produces predictions

Each step is loosely coupled, making it easy to swap different options for each step. While it is possible to construct your own AllenNLP objects (see this [tutorial](https://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/) for a great deep dive into constructing your own AllenNLP pipeline), the easiest way is to use the JSON-like parameter constructor methods provided by most AllenNLP objects. For example, rather than

```
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
```
we can use  

```
lstm_params = Params({
    "type": "lstm",
    "input_size": EMBEDDING_DIM,
    "hidden_size": HIDDEN_DIM
})

lstm = Seq2SeqEncoder.from_params(lstm_params)
```
This provides two advantages:  
1. Experiments can be declaratively specified in a separate [configuration file](https://github.com/allenai/allennlp/blob/master/tutorials/tagger/README.md#using-config-files)  
2. Experiments can be changed without writing any code, simply by editing the corresponding entry in the config file
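
To illustrate why a `"type"` key is enough to select a component, here is a toy sketch of the general registry pattern behind parameter-based construction. It is not AllenNLP's actual implementation; the class names `ToyEncoder`, `ToyLstmEncoder`, and `ToyGruEncoder` are hypothetical and no real neural network is built.

```python
class ToyEncoder:
    """Toy base class that maps a "type" string to a registered subclass."""

    _registry = {}

    @classmethod
    def register(cls, name):
        """Class decorator that records a subclass under `name`."""
        def decorator(subclass):
            cls._registry[name] = subclass
            return subclass
        return decorator

    @classmethod
    def from_params(cls, params):
        """Look up the class named by params["type"] and build it from the other keys."""
        params = dict(params)  # copy so the caller's dict is left untouched
        subclass = cls._registry[params.pop("type")]
        return subclass(**params)


@ToyEncoder.register("lstm")
class ToyLstmEncoder(ToyEncoder):
    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size


@ToyEncoder.register("gru")
class ToyGruEncoder(ToyEncoder):
    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size


# Swapping the encoder only requires changing the "type" entry.
encoder_params = {"type": "lstm", "input_size": 100, "hidden_size": 200}
encoder = ToyEncoder.from_params(encoder_params)
print(type(encoder).__name__, encoder.hidden_size)
```

Because the choice of class lives in data rather than in code, the same dictionary can be stored in a configuration file and edited between runs.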

**AllenNLP Resources:**

The following resources are recommended for understanding how the AllenNLP library works and for implementing your own models and pipelines:

- Information about the provided AllenNLP models: https://allennlp.org/models
- Using configuration files: https://github.com/allenai/allennlp/blob/master/tutorials/tagger/README.md#using-config-files   
- In-depth discussion of each AllenNLP object used and how to construct your own specialized ones: https://mlexplained.com/2019/01/30/an-in-depth-tutorial-to-allennlp-from-basics-to-elmo-and-bert/  
- AllenNLP's Part-of-Speech-Tagging tutorial showcasing how to use their methods programmatically: https://allennlp.org/tutorials   
- Short AllenNLP programmatic tutorial: https://github.com/titipata/allennlp-tutorial/blob/master/allennlp_tutorial.ipynb  

In [2]:
# Imports
import sys
import os
import shutil
sys.path.append("../../")
import json
from urllib.request import urlretrieve
import scrapbook as sb

#import utils
from utils_nlp.common.timer import Timer
from utils_nlp.azureml import azureml_utils

import azureml as aml
from azureml.core import Datastore, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.exceptions import ComputeTargetException
from azureml.train.dnn import PyTorch
from azureml.widgets import RunDetails
from azureml.core.conda_dependencies import CondaDependencies
from allennlp.predictors import Predictor

print("System version: {}".format(sys.version))
print("Azure ML SDK Version:", aml.core.VERSION)

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
Azure ML SDK Version: 1.0.48


In [2]:
PROJECT_FOLDER = "./bidaf-question-answering"
SQUAD_FOLDER = "./squad"
BIDAF_CONFIG_PATH = "."
LOGS_FOLDER = '.'
NUM_EPOCHS = 25
PIP_PACKAGES = [
        "allennlp==0.8.4",
        "azureml-sdk==1.0.48",
        "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz",
    ]
CONDA_PACKAGES = ["jsonnet", "cmake", "regex", "pytorch", "torchvision"]
config_path = (
    "./.azureml"
)  # Path to the directory containing config.json with azureml credentials

# Azure resources
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "YOUR_RESOURCE_GROUP_NAME"  
workspace_name = "YOUR_WORKSPACE_NAME"  
workspace_region = "YOUR_WORKSPACE_REGION" #Possible values eastus, eastus2 and so on.

## 2. AzureML Setup

Now, we set up the necessary components for running this as an AzureML experiment:
1. Create or link to an existing `Workspace`
2. Set up an `Experiment` with `logging`
3. Create or attach existing `AmlCompute`
4. Upload our data to a `Datastore`

### 2.1 Link to or create a Workspace

The following cell sets up the connection to your [Azure Machine Learning service Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace). You can choose to connect to an existing workspace or create a new one. 

**To access an existing workspace:**
1. If you have a `config.json` file, you do not need to provide the workspace information; you only need to set the `config_path` variable defined above to the directory that contains the file.
2. Otherwise, you will need to supply the following:
    * The name of your workspace
    * Your subscription id
    * The resource group name

**To create a new workspace:**

Set the following information:
* A name for your workspace
* Your subscription id
* The resource group name
* [Azure region](https://azure.microsoft.com/en-us/global-infrastructure/regions/) to create the workspace in, such as `eastus2`. 

This will automatically create a new resource group for you in the region provided if a resource group with the name given does not already exist. 
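
For the `config.json` route, the sketch below writes such a file with the three fields that identify a workspace. The helper name `write_workspace_config` is hypothetical, the values are the same placeholders used in the parameters cell, and the example writes into a temporary directory so that it does not touch a real configuration; in practice you would pass the directory stored in `config_path`.

```python
import json
import os
import tempfile


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json containing the three fields that identify a workspace."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "config.json")
    with open(path, "w") as f:
        json.dump(
            {
                "subscription_id": subscription_id,
                "resource_group": resource_group,
                "workspace_name": workspace_name,
            },
            f,
            indent=4,
        )
    return path


# Demonstration in a throwaway directory; replace it with "./.azureml" for real use.
with tempfile.TemporaryDirectory() as tmp:
    config_file = write_workspace_config(
        os.path.join(tmp, ".azureml"),
        "YOUR_SUBSCRIPTION_ID",
        "YOUR_RESOURCE_GROUP_NAME",
        "YOUR_WORKSPACE_NAME",
    )
    with open(config_file) as f:
        loaded = json.load(f)

print(sorted(loaded))
```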

In [3]:
ws = azureml_utils.get_or_create_workspace(
    config_path=config_path,
    subscription_id=subscription_id,
    resource_group=resource_group,
    workspace_name=workspace_name,
    workspace_region=workspace_region,
)

Performing interactive authentication. Please follow the instructions on the terminal.




Interactive authentication successfully completed.


In [4]:
print(
    "Workspace name: " + ws.name,
    "Azure region: " + ws.location,
    "Subscription id: " + ws.subscription_id,
    "Resource group: " + ws.resource_group,
    sep="\n",
)

### 2.2 Set up an Experiment and Logging

Next, we set up an `Experiment` named `NLP-QA-BiDAF-deepdive`, add logging capabilities, and create a local folder that will be the source directory for the AzureML run.

In [5]:
# Make a folder for the project
os.makedirs(PROJECT_FOLDER, exist_ok=True)

# Set up an experiment
experiment_name = "NLP-QA-BiDAF-deepdive"
experiment = Experiment(ws, experiment_name)

# Add logging to our experiment
run = experiment.start_logging(snapshot_directory=PROJECT_FOLDER)

### 2.3 Link AmlCompute Compute Target


We need to link a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training our model (see [compute options](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#supported-compute-targets) for an explanation of the different options). In this example, we use an [AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) target: we link to an existing target if one named `cluster_name` exists, and otherwise create a STANDARD_NC6 GPU cluster that autoscales from 0 to 4 nodes. Creating a new AmlCompute cluster takes approximately 5 minutes. 

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [6]:
# choose your cluster
cluster_name = "gpu-test"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target.")
except ComputeTargetException:
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6", max_nodes=4
    )

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current AmlCompute.
print(compute_target.get_status().serialize())

Found existing compute target.
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-07-23T16:18:34.392000+00:00', 'errors': None, 'creationTime': '2019-07-09T16:20:30.625908+00:00', 'modifiedTime': '2019-07-09T16:20:46.601973+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


### 2.4 Upload Files to Datastore

This step uploads our local files to a `Datastore` so that the data is accessible from the remote compute target. A `Datastore` is backed by either Azure File Storage (the default option) or Azure Blob Storage ([how to decide between these options](https://docs.microsoft.com/en-us/azure/storage/common/storage-decide-blobs-files-disks)), and data is made accessible by mounting or copying it to the compute target. 

First, we download the SQuAD data files and save to a folder called squad.

In [7]:
os.makedirs(SQUAD_FOLDER, exist_ok=True)  # make squad folder locally

urlretrieve(
    "https://allennlp.s3.amazonaws.com/datasets/squad/squad-train-v1.1.json",
    filename=SQUAD_FOLDER+"/squad_train.json",
)

urlretrieve(
    "https://allennlp.s3.amazonaws.com/datasets/squad/squad-dev-v1.1.json",
    filename=SQUAD_FOLDER+"/squad_dev.json",
)

('./squad/squad_dev.json', <http.client.HTTPMessage at 0x2646892de10>)

We also copy our AllenNLP configuration file (bidaf_config.json) into this squad folder so that it can be uploaded to the `Datastore` and accessed during training. As described in [Section 1.3](#1.3-AllenNLP), this configuration file allows us to easily specify the parameters for instantiating AllenNLP objects. The file contains a dictionary of dictionaries. The top level contains 4 main keys: dataset_reader, model, iterator, and trainer (plus keys for train_data_path, validation_data_path, and evaluate_on_test). As you may recall from [Section 1.3](#1.3-AllenNLP), these correspond to the AllenNLP building blocks. Each of these keys maps to a dictionary of parameters. For instance, the trainer dictionary contains keys to specify the number of epochs, learning rate scheduler, optimizer, etc. The parameter settings provided here are the ones suggested by AllenNLP for the BiDAF model; however, below we demonstrate how to override these parameters without having to change this configuration file directly.
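
As a rough picture of that layout, the sketch below builds a skeleton with the same top-level keys and checks that the four building blocks are present. The nested values are illustrative placeholders and assumptions, not the actual contents of bidaf_config.json, and `missing_blocks` is a hypothetical helper.

```python
# Skeleton only: nested values are placeholders, not the real bidaf_config.json settings.
config_skeleton = {
    "dataset_reader": {"type": "squad"},
    "train_data_path": "PATH_TO_TRAIN_JSON",
    "validation_data_path": "PATH_TO_DEV_JSON",
    "model": {"type": "bidaf"},
    "iterator": {"type": "bucket"},
    "trainer": {"num_epochs": 20},
    "evaluate_on_test": False,
}

# The four keys that correspond to AllenNLP building blocks.
BUILDING_BLOCKS = ("dataset_reader", "model", "iterator", "trainer")


def missing_blocks(config):
    """Return the building-block keys that are absent or not dictionaries."""
    return [key for key in BUILDING_BLOCKS if not isinstance(config.get(key), dict)]


print(missing_blocks(config_skeleton))
```

A check like this can catch a misspelled top-level key locally, before the file is uploaded and a remote run is started.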

In [8]:
shutil.copy(BIDAF_CONFIG_PATH+'/bidaf_config.json', SQUAD_FOLDER)

'./squad\\bidaf_config.json'

Now we upload both the SQuAD data files and the configuration file to the datastore. `ws.datastores` lists all available datastores, and `ds.account_name` returns the name of the underlying storage account, which can be used to find it in the Azure portal. Once we have selected the appropriate datastore, we use the `upload()` method to upload all files from the local squad folder to a folder on the datastore called squad_data.

In [9]:
# Select a specific datastore or you can call ws.get_default_datastore()
datastore_name = "workspacefilestore"
ds = ws.datastores[datastore_name]

# Upload files in squad data folder to the datastore
ds.upload(
    src_dir=SQUAD_FOLDER, target_path="squad_data", overwrite=True, show_progress=True
)

Uploading an estimated of 3 files
Uploading ./squad\bidaf_config.json
Uploading ./squad\squad_dev.json
Uploading ./squad\squad_train.json
Uploaded ./squad\bidaf_config.json, 1 files out of an estimated total of 3
Uploaded ./squad\squad_dev.json, 2 files out of an estimated total of 3
Uploaded ./squad\squad_train.json, 3 files out of an estimated total of 3
Uploaded 3 files


$AZUREML_DATAREFERENCE_09a567b57ea546b697d8d7ce1bcf2d86

## 3. Prepare Training Script

Here, we create a simple training script that uses AllenNLP's `train_model_from_file()` function, which takes the following parameters:  
- parameter_filename (str) : A json parameter file specifying an AllenNLP experiment
- serialization_dir (str): The directory in which to save results and logs
- overrides (str): A JSON string that we will use to override values in the input parameter file
- file_friendly_logging (bool, optional): If True, we make our output more friendly to saved model files
- recover (bool, optional): If True, we will try to recover a training run from an existing serialization

Our training script parameters are: the location of the data folder, name of the configuration file, and JSON string with any overrides for the configuration file. See the [documentation](https://github.com/allenai/allennlp/blob/9a13ab570025a0c1659986009d2abddb2e652020/allennlp/commands/train.py) on AllenNLP `train_model_from_file()` function for more details.

In [10]:
%%writefile $PROJECT_FOLDER/train.py
import torch
import argparse
import os
import shutil
from allennlp.common import Params
from allennlp.commands.train import train_model_from_file

def main():
    # get command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_folder', type=str, 
                        help='Folder where data is stored')
    parser.add_argument('--config_name', type=str, 
                        help='Name of json configuration file')
    parser.add_argument('--overrides', type=str, 
                        help='Override parameters on config file')
    args = parser.parse_args()
    squad_folder = os.path.join(args.data_folder, "squad_data")
    serialization_folder = "./logs" #save to the run logs folder
    
    # delete the log folder if it already exists
    if os.path.isdir(serialization_folder):
        shutil.rmtree(serialization_folder)
        
    train_model_from_file(parameter_filename = os.path.join(squad_folder, args.config_name),
           overrides = args.overrides,
           serialization_dir = serialization_folder,
           file_friendly_logging = True,
           recover = False)

if __name__ == "__main__":
    main()

Overwriting ./bidaf-question-answering/train.py


## 4. Create a PyTorch Estimator

AllenNLP is built on PyTorch, so we will use the AzureML SDK's PyTorch estimator to easily submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, see [How to Train PyTorch Models on AzureML](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch). First, we set up a .yml file with the necessary dependencies.

In [11]:
myenv = CondaDependencies.create(
    conda_packages= CONDA_PACKAGES,
    pip_packages= PIP_PACKAGES,
    python_version="3.6.8",
)
myenv.add_channel("conda-forge")
myenv.add_channel("pytorch")

conda_env_file_name = "bidafenv.yml"
myenv.save_to_file(PROJECT_FOLDER, conda_env_file_name)

'bidafenv.yml'

We next define any parameters in the configuration file that we want to override for this specific training run. We demonstrate overriding the num_epochs parameter to perform 25 epochs (rather than 20 epochs as set in bidaf_config.json). The AllenNLP training function expects that overrides are a JSON string, so we convert our dictionary into a JSON string before passing it in as an argument to our training script.

In [12]:
overrides = {"trainer":{'num_epochs': NUM_EPOCHS}}
overrides = json.dumps(overrides)
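
To show what such an override does to a nested configuration, here is a minimal sketch of a recursive merge in which override values take precedence. It only illustrates the idea and is not AllenNLP's own merging code; `merge_overrides` is a hypothetical helper and the `optimizer` entry in `base_config` is an assumed stand-in for the other trainer settings.

```python
import json


def merge_overrides(base, overrides):
    """Recursively overlay `overrides` on `base` and return a new dictionary."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested dictionaries so sibling keys are preserved.
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Assumed fragment of a config; only num_epochs = 20 is taken from this notebook.
base_config = {"trainer": {"num_epochs": 20, "optimizer": {"type": "adam"}}}
overrides_json = json.dumps({"trainer": {"num_epochs": 25}})

effective = merge_overrides(base_config, json.loads(overrides_json))
print(effective["trainer"])  # {'num_epochs': 25, 'optimizer': {'type': 'adam'}}
```

Only the overridden leaf changes; the remaining trainer settings are carried over from the base configuration.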

Define the parameters to pass to the training script, the project folder, compute target, conda dependencies file, and the name of the training script. Notice that we set `use_gpu` equal to True. 

In [13]:
script_params = {
    "--data_folder": ds.as_mount(),
    "--config_name": "bidaf_config.json",
    "--overrides": overrides,
}

estimator = PyTorch(
    source_directory=PROJECT_FOLDER,
    script_params=script_params,
    compute_target=compute_target,
    entry_script="train.py",
    use_gpu=True,
    conda_dependencies_file="bidafenv.yml",
)

This may lead to unexpected package installation errors. Take a look at `estimator.conda_dependencies` to understand what packages are installed by Azure ML.


## 5. Submit a Job

Submit the estimator object to run your experiment. Results can be monitored using a Jupyter widget. The widget and run are asynchronous and update every 10-15 seconds until job completion.

In [14]:
run = experiment.submit(estimator)
print(run)

Run(Experiment: bidaf-question-answering,
Id: bidaf-question-answering_1563899344_bce3c688,
Type: azureml.scriptrun,
Status: Starting)


In [15]:
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO', 's…

In [16]:
#wait for the run to complete before continuing in the notebook
run.wait_for_completion() 

**Cancel the Job**

Interrupting/restarting the Jupyter kernel will not properly cancel the run, which can lead to wasted compute resources. To avoid this, we recommend explicitly canceling a run with the following code:

In [17]:
#run.cancel()

## 6. Inspect Results of Run 

AllenNLP's training saves all intermediate and final results to the serialization_dir (defined in train.py). In order to inspect the results as well as use the trained model, we will download the files from the run logs using the `download_files()` command.

In [18]:
run.download_files(prefix="./logs", output_directory=LOGS_FOLDER)

### 6.1 Evaluation on SQuAD

The metrics.json file contains the final metrics. We can load this file and extract the final SQuAD dev set EM score (key is 'best_validation_em'). AllenNLP reports an EM score of 68.3, so depending on the parameters specified in your config file, expect a score in that range.

In [19]:
with open(LOGS_FOLDER+"/logs/metrics.json") as f:
    metrics = json.load(f)

sb.glue("validation_EM", metrics["best_validation_em"])
metrics["best_validation_em"]

0.6152317880794702
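
Note that the value above is a fraction, while the 68.3 quoted from AllenNLP is a percentage. A small sketch for putting the two on the same scale, using the value shown above (`to_percent` is a hypothetical helper):

```python
def to_percent(fraction, digits=1):
    """Convert a metric in the range 0-1 into a percentage rounded to `digits` places."""
    return round(100.0 * fraction, digits)


# EM value from the run output above, and the EM score quoted from AllenNLP.
metrics_example = {"best_validation_em": 0.6152317880794702}
REPORTED_EM_PERCENT = 68.3

em_percent = to_percent(metrics_example["best_validation_em"])
gap = round(REPORTED_EM_PERCENT - em_percent, 1)
print(em_percent, gap)
```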

### 6.2 Try the Best Model

In order to use our model, we need to create an AllenNLP [Predictor](https://github.com/allenai/allennlp/blob/master/allennlp/predictors/predictor.py) object. We instantiate this object from an archive path. An archive comprises a Model and its experimental configuration file. After training a model, the archive is saved to the serialization_dir (whose path is set in train.py).

In [20]:
model = Predictor.from_path(LOGS_FOLDER+"/logs")



The Predictor object allows us to directly pass in a question and passage (behind the scenes it converts these to Instance objects using the DatasetReader). We define an example passage/question, call the model's `predict()` function, and finally extract the `best_span_str` entry of the returned dictionary, which contains the answer to our query.

In [21]:
passage = "Machine Comprehension (MC), answering questions about a given context, \
requires modeling complex interactions between the context and the query. Recently, \
attention mechanisms have been successfully extended to MC. Typically these mechanisms \
use attention to summarize the query and context into a single vector, couple \
attentions temporally, and often form a uni-directional attention. In this paper \
we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage \
hierarchical process that represents the context at different levels of granularity \
and uses a bi-directional attention flow mechanism to achieve a query-aware context \
representation without early summarization. Our experimental evaluations show that \
our model achieves the state-of-the-art results in Stanford QA (SQuAD) and \
CNN/DailyMail Cloze Test datasets."

question = "What dataset does BIDAF achieve state-of-the-art results on?"

In [22]:
result = model.predict(question, passage)["best_span_str"]

In [23]:
result

'Stanford QA'
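
For intuition about where an answer string like this comes from, the sketch below shows one simple way to turn per-token start and end probabilities into a single span: pick the pair with the highest product such that the start does not come after the end. This is a naive illustration, not AllenNLP's implementation, and the tokens and probabilities are invented for the example.

```python
def best_span(start_probs, end_probs, max_span_length=None):
    """Return (start, end, score) maximizing start_probs[i] * end_probs[j] with i <= j."""
    best = (0, 0, -1.0)
    for i, p_start in enumerate(start_probs):
        # Optionally limit how many tokens a span may cover.
        last = len(end_probs)
        if max_span_length is not None:
            last = min(last, i + max_span_length)
        for j in range(i, last):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Invented tokens and probabilities, for illustration only.
tokens = ["results", "in", "Stanford", "QA", "(", "SQuAD", ")"]
start_probs = [0.05, 0.05, 0.60, 0.10, 0.05, 0.10, 0.05]
end_probs = [0.05, 0.05, 0.10, 0.55, 0.05, 0.15, 0.05]

start, end, score = best_span(start_probs, end_probs)
print(" ".join(tokens[start:end + 1]))
```

The double loop is quadratic in the passage length; it is written for readability rather than speed.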