# Natural Language Inference on XNLI Dataset using BERT with Azure Machine Learning

## 1. Summary
In this notebook, we demonstrate how to fine-tune BERT using distributed training (Horovod) on Azure Machine Learning service to perform natural language inference in English. We use the [XNLI](https://github.com/facebookresearch/XNLI) dataset to classify sentence pairs into three classes: contradiction, entailment, and neutral.

The figure below shows how [BERT](https://arxiv.org/abs/1810.04805) classifies sentence pairs. It concatenates the tokens of the two sentences in each pair and separates the sentences with the [SEP] token. A [CLS] token is prepended to the token list and used as the aggregate sequence representation for the classification task.
<img src="https://nlpbp.blob.core.windows.net/images/bert_two_sentence.PNG">
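
To make the input layout concrete, here is a minimal sketch of how a sentence pair could be packed into one token sequence. It assumes whitespace splitting in place of BERT's WordPiece tokenizer, and the helper name `pack_sentence_pair` is hypothetical. The trailing [SEP] and the segment ids (0 for the first sentence, 1 for the second) follow the convention described in the BERT paper rather than anything specific to this notebook's utilities.

```python
def pack_sentence_pair(premise, hypothesis):
    """Illustrative only: whitespace tokens stand in for WordPiece tokens."""
    tokens_a = premise.lower().split()
    tokens_b = hypothesis.lower().split()

    # [CLS] sentence A [SEP] sentence B [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_sentence_pair("A man is eating", "Nobody eats")
print(tokens)
print(segment_ids)
```

The classifier then reads the final hidden state at the [CLS] position to predict one of the three classes.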

**Note: To learn how to do pre-training on your own, please refer to the [AzureML-BERT repo](https://github.com/microsoft/AzureML-BERT) created by Microsoft.**


In [1]:
# Imports

import sys

sys.path.append("../..")

import os
import shutil
import torch
import json
import pandas as pd

import azureml.core
from azureml.train.dnn import PyTorch
from azureml.core.runconfig import MpiConfiguration
from azureml.core import Experiment
from azureml.widgets import RunDetails
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.exceptions import ComputeTargetException
from utils_nlp.azureml.azureml_utils import get_or_create_workspace, get_output_files

In [2]:
# Parameters

DEBUG = True
NODE_COUNT = 4
NUM_PROCESS = 1
DATA_PERCENT_USED = 1.0

# Path to the directory containing config.json with AzureML credentials
config_path = "./.azureml"

# Azure resources
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "YOUR_RESOURCE_GROUP_NAME"
workspace_name = "YOUR_WORKSPACE_NAME"
workspace_region = "YOUR_WORKSPACE_REGION"  # e.g. eastus, eastus2
cluster_name = "gpu-entail"

## 2. AzureML Setup

### 2.1 Initialize a Workspace

The following cell sets up the connection to your [Azure Machine Learning service Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace). You can choose to connect to an existing workspace or create a new one.

**To access an existing workspace:**
1. If you have a `config.json` file, you do not need to provide the workspace information; you only need to set the `config_path` variable defined above to the directory that contains the file.
2. Otherwise, you will need to supply the following:
    * The name of your workspace
    * Your subscription id
    * The resource group name
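
As a rough illustration of the first option, the sketch below writes and reads back a `config.json` holding the three workspace fields. The key names are an assumption based on the format the AzureML SDK typically writes, so compare them against a file downloaded from the Azure portal. The values are placeholders, and the temporary directory only stands in for the `config_path` directory.

```python
import json
import os
import tempfile

# Placeholder values; replace with your own workspace details.
workspace_info = {
    "subscription_id": "YOUR_SUBSCRIPTION_ID",
    "resource_group": "YOUR_RESOURCE_GROUP_NAME",
    "workspace_name": "YOUR_WORKSPACE_NAME",
}

# A temporary directory stands in for the config_path directory (e.g. ./.azureml).
demo_config_dir = tempfile.mkdtemp()
config_file = os.path.join(demo_config_dir, "config.json")

with open(config_file, "w") as fp:
    json.dump(workspace_info, fp, indent=4)

# Read the file back to confirm that it holds the expected fields.
with open(config_file) as fp:
    loaded = json.load(fp)

print(sorted(loaded))
```
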

**To create a new workspace:**

Set the following information:
* A name for your workspace
* Your subscription id
* The resource group name
* [Azure region](https://azure.microsoft.com/en-us/global-infrastructure/regions/) to create the workspace in, such as `eastus2`. 

This will automatically create a new resource group for you in the region provided if a resource group with the name given does not already exist. 

In [3]:
ws = get_or_create_workspace(
    config_path=config_path,
    subscription_id=subscription_id,
    resource_group=resource_group,
    workspace_name=workspace_name,
    workspace_region=workspace_region,
)

In [None]:
print(
    "Workspace name: " + ws.name,
    "Azure region: " + ws.location,
    "Subscription id: " + ws.subscription_id,
    "Resource group: " + ws.resource_group,
    sep="\n",
)

### 2.2 Link AmlCompute Compute Target

We need to link a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training our model (see [compute options](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#supported-compute-targets) for an explanation of the different options). We will use an [AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) target. The cell below links to an existing target if `cluster_name` exists; otherwise, it creates a STANDARD_NC6 GPU cluster that autoscales from 0 to 4 nodes. Creating a new AmlCompute cluster takes approximately 5 minutes.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [4]:
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found compute target: {}".format(cluster_name))
except ComputeTargetException:
    print("Creating new compute target: {}".format(cluster_name))
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6", max_nodes=NODE_COUNT
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)


print(compute_target.get_status().serialize())

Found compute target: gpu-entail
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-08-03T13:43:20.068000+00:00', 'errors': None, 'creationTime': '2019-07-27T02:14:46.127092+00:00', 'modifiedTime': '2019-07-27T02:15:07.181277+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6S_V2'}


In [5]:
project_dir = "./entail_utils"
if DEBUG and os.path.exists(project_dir):
    shutil.rmtree(project_dir)
shutil.copytree("../../utils_nlp", os.path.join(project_dir, "utils_nlp"))

'./entail_utils\\utils_nlp'

## 3. Prepare Training Script

In [6]:
%%writefile $project_dir/train.py
import horovod.torch as hvd
import torch
import numpy as np
import time
import argparse
import json  # needed to write results.json
import os  # needed to build the results file path
from utils_nlp.common.timer import Timer
from utils_nlp.dataset.xnli_torch_dataset import XnliDataset
from utils_nlp.models.bert.common import Language
from utils_nlp.models.bert.sequence_classification_distributed import (
    BERTSequenceClassifier,
)
from sklearn.metrics import classification_report

print("Torch version:", torch.__version__)

hvd.init()

LANGUAGE_ENGLISH = "en"
TRAIN_FILE_SPLIT = "train"
TEST_FILE_SPLIT = "test"
TO_LOWERCASE = True
PRETRAINED_BERT_LNG = Language.ENGLISH
LEARNING_RATE = 5e-5
WARMUP_PROPORTION = 0.1
BATCH_SIZE = 32
NUM_GPUS = 1
OUTPUT_DIR = "./outputs/"
LABELS = ["contradiction", "entailment", "neutral"]

# Each worker (Horovod rank) gets its own copy of the data
CACHE_DIR = "./xnli-%d" % hvd.rank()

parser = argparse.ArgumentParser()
# Training settings
parser.add_argument(
    "--seed", type=int, default=42, metavar="S", help="random seed (default: 42)"
)
parser.add_argument(
    "--epochs", type=int, default=2, metavar="N", help="number of training epochs (default: 2)"
)
parser.add_argument(
    "--no-cuda", action="store_true", default=False, help="disables CUDA training"
)
parser.add_argument(
    "--data_percent_used",
    type=float,
    default=1.0,
    metavar="S",
    help="data percent used (default: 1.0)",
)

args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()

"""
Note: For example, with 4 nodes and 4 GPUs per node, 16 workers are spawned.
Every worker has a rank in [0, 15] and a local_rank in [0, 3].
"""
if args.cuda:
    torch.cuda.set_device(hvd.local_rank())
    torch.cuda.manual_seed(args.seed)

# num_workers - this is equal to number of gpus per machine
kwargs = {"num_workers": NUM_GPUS, "pin_memory": True} if args.cuda else {}

train_dataset = XnliDataset(
    file_split=TRAIN_FILE_SPLIT,
    cache_dir=CACHE_DIR,
    language=LANGUAGE_ENGLISH,
    to_lowercase=TO_LOWERCASE,
    tok_language=PRETRAINED_BERT_LNG,
    data_percent_used=args.data_percent_used,
)


# set the label_encoder for evaluation
label_encoder = train_dataset.label_encoder
num_labels = len(np.unique(train_dataset.labels))

# Train
classifier = BERTSequenceClassifier(
    language=Language.ENGLISH,
    num_labels=num_labels,
    cache_dir=CACHE_DIR,
    use_distributed=True,
)


train_loader = classifier.create_data_loader(
    train_dataset, BATCH_SIZE, mode="train", **kwargs
)


num_samples = len(train_loader.dataset)
num_batches = int(num_samples / BATCH_SIZE)
num_train_optimization_steps = num_batches * args.epochs
optimizer = classifier.create_optimizer(
    num_train_optimization_steps, lr=LEARNING_RATE, warmup_proportion=WARMUP_PROPORTION
)

with Timer() as t:
    for epoch in range(1, args.epochs + 1):

        # to allow data shuffling for DistributedSampler
        train_loader.sampler.set_epoch(epoch)

        # epoch and num_epochs are passed to fit() so it can print the loss at regular batch intervals
        classifier.fit(
            train_loader,
            epoch=epoch,
            num_epochs=args.epochs,
            bert_optimizer=optimizer,
            num_gpus=NUM_GPUS,
        )

# Run predictions only on the rank 0 worker, using 1 GPU, since test_dataset is small.
if hvd.rank() == 0:
    NUM_GPUS = 1
    
    test_dataset = XnliDataset(
        file_split=TEST_FILE_SPLIT,
        cache_dir=CACHE_DIR,
        language=LANGUAGE_ENGLISH,
        to_lowercase=TO_LOWERCASE,
        tok_language=PRETRAINED_BERT_LNG,
    )

    test_loader = classifier.create_data_loader(test_dataset, mode="test")

    # predict
    predictions, pred_labels = classifier.predict(test_loader, NUM_GPUS)

    predictions = label_encoder.inverse_transform(predictions)

    # Evaluate
    results = classification_report(
        pred_labels, predictions, target_names=LABELS, output_dict=True
    )

    result_file = os.path.join(OUTPUT_DIR, "results.json")
    with open(result_file, "w+") as fp:
        json.dump(results, fp)

    # save model
    classifier.save_model()

Writing ./entail_utils/train.py
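
The rank note in the script can be illustrated without a cluster. The sketch below enumerates the global rank and local rank that workers would receive for a given node count and number of processes per node. It assumes ranks are assigned node by node, which is a common MPI layout but not something this notebook guarantees, and the function name `enumerate_workers` is hypothetical. In the real script these values come from `hvd.rank()` and `hvd.local_rank()`.

```python
def enumerate_workers(node_count, processes_per_node):
    """Return (rank, node_index, local_rank) tuples, assuming node-major rank order."""
    workers = []
    for rank in range(node_count * processes_per_node):
        # The quotient identifies the node; the remainder is the rank within that node.
        node_index, local_rank = divmod(rank, processes_per_node)
        workers.append((rank, node_index, local_rank))
    return workers


# The example from the script's note: 4 nodes with 4 GPUs each.
workers = enumerate_workers(node_count=4, processes_per_node=4)
for rank, node_index, local_rank in workers:
    print("rank={:2d} node={} local_rank={}".format(rank, node_index, local_rank))
```

With this notebook's settings (`NODE_COUNT = 4`, `NUM_PROCESS = 1`), the same function yields four workers, each with local rank 0, which is why `torch.cuda.set_device(hvd.local_rank())` selects the single GPU on every node.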


## 4. Create a PyTorch Estimator

The BERT implementation used here is built on PyTorch, so we use the AzureML SDK's PyTorch estimator to submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, see [How to Train PyTorch Models on AzureML](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch).

In [7]:
mpiConfig = MpiConfiguration()
mpiConfig.process_count_per_node = NUM_PROCESS

script_params = {
    '--data_percent_used': DATA_PERCENT_USED
}

est = PyTorch(
    source_directory=project_dir,
    compute_target=compute_target,
    entry_script="train.py",
    script_params = script_params,
    node_count=NODE_COUNT,
    distributed_training=mpiConfig,
    use_gpu=True,
    framework_version="1.0",
    conda_packages=["scikit-learn=0.20.3", "numpy", "spacy", "nltk"],
    pip_packages=["pandas", "seqeval[gpu]", "pytorch-pretrained-bert"],
)

## 5. Create Experiment and Submit a Job
Submit the estimator object to run your experiment. Results can be monitored using a Jupyter widget. The widget and run are asynchronous and update every 10-15 seconds until job completion.

**Note**: The experiment takes ~4 hours with 2 NC24 nodes and ~7 hours with 4 NC6 nodes. The overhead is due to the communication time between nodes.

In [8]:
experiment = Experiment(ws, name="NLP-Entailment-BERT")
run = experiment.submit(est)

In [9]:
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

The cell above is asynchronous, so the cell below makes a blocking call that prevents the cells after it from executing until the run completes.

In [None]:
run.wait_for_completion()

## 6. Analyze Results

The training script writes `results.json` to the run's `outputs` folder. Download the file from the run and open it to view the results.

In [10]:
file_names = ["outputs/results.json"]
get_output_files(run, "./outputs", file_names=file_names)

Downloading file outputs/results.json to ./outputs\results.json...


In [11]:
with open("outputs/results.json", "r") as handle:
    parsed = json.load(handle)
    print(pd.DataFrame.from_dict(parsed).transpose())

               f1-score  precision    recall  support
contradiction  0.838749   0.859296  0.819162   1670.0
entailment     0.817280   0.877663  0.764671   1670.0
neutral        0.777870   0.719817  0.846108   1670.0
micro avg      0.809980   0.809980  0.809980   5010.0
macro avg      0.811300   0.818925  0.809980   5010.0
weighted avg   0.811300   0.818925  0.809980   5010.0
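
As a sanity check on the table, the aggregate rows can be recomputed from the per-class rows. The sketch below copies the per-class values from the output above. Because all three classes have the same support, the macro and weighted averages coincide, and the micro average equals overall accuracy (correct predictions divided by total examples).

```python
# Per-class metrics copied from the table above: (f1, precision, recall, support).
per_class = {
    "contradiction": (0.838749, 0.859296, 0.819162, 1670),
    "entailment": (0.817280, 0.877663, 0.764671, 1670),
    "neutral": (0.777870, 0.719817, 0.846108, 1670),
}

n_classes = len(per_class)

# Macro average: unweighted mean of the per-class scores.
macro_f1 = sum(v[0] for v in per_class.values()) / n_classes
macro_precision = sum(v[1] for v in per_class.values()) / n_classes
macro_recall = sum(v[2] for v in per_class.values()) / n_classes

# recall * support recovers the number of correctly classified examples per class.
total_support = sum(v[3] for v in per_class.values())
correct = sum(round(v[2] * v[3]) for v in per_class.values())
micro_avg = correct / total_support

print("macro f1:        {:.6f}".format(macro_f1))
print("macro precision: {:.6f}".format(macro_precision))
print("macro recall:    {:.6f}".format(macro_recall))
print("micro avg:       {:.6f}".format(micro_avg))
```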
