*Copyright (c) Microsoft Corporation. All rights reserved.*

*Licensed under the MIT License.*

# Text Classification of MultiNLI Sentences using Multiple Transformer Models

In [1]:
import json
import os
import sys
from tempfile import TemporaryDirectory

import numpy as np
import pandas as pd
import scrapbook as sb
import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
from utils_nlp.common.timer import Timer
from utils_nlp.common.pytorch_utils import dataloader_from_dataset
from utils_nlp.dataset.multinli import load_pandas_df
from utils_nlp.models.transformers.sequence_classification import (
    Processor, SequenceClassifier)

## Introduction
In this notebook, we fine-tune and evaluate a number of pretrained models on a subset of the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) dataset.

We use a [sequence classifier](../../utils_nlp/models/transformers/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/transformers) of different transformers, like [BERT](https://github.com/google-research/bert), [XLNet](https://github.com/zihangdai/xlnet), and [RoBERTa](https://github.com/pytorch/fairseq).

In [2]:
# notebook parameters
DATA_FOLDER = TemporaryDirectory().name
CACHE_DIR = TemporaryDirectory().name
NUM_EPOCHS = 1
BATCH_SIZE = 16
NUM_GPUS = 2
MAX_LEN = 100
TRAIN_DATA_FRACTION = 0.05
TEST_DATA_FRACTION = 0.05
TRAIN_SIZE = 0.75
LABEL_COL = "genre"
TEXT_COL = "sentence1"
MODEL_NAMES = ["distilbert-base-uncased", "roberta-base", "xlnet-base-cased"]

## Read Dataset
We start by loading a subset of the data. The following function also downloads and extracts the files, if they don't exist in the data folder.

The MultiNLI dataset is mainly used for natural language inference (NLI) tasks, where the inputs are sentence pairs and the labels are entailment indicators. The sentence pairs are also classified into *genres* that allow for more coverage and better evaluation of NLI models.

For our classification task, we use only the first sentence of each pair as the text input and its genre as the label. Because the same first sentence appears in several pairs (only the pairs themselves are unique), we keep the examples for a single entailment label (*neutral* in this case) to avoid duplicate rows.
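
To illustrate why filtering on one entailment label removes repeated sentences, here is a minimal sketch on invented toy rows. The column names follow the ones used in this notebook, but the data and the variable names (`toy`, `neutral_only`) are assumptions made for illustration only.

```python
import pandas as pd

# Toy rows mimicking the MultiNLI layout: each first sentence is paired with
# three second sentences, one per entailment label (invented data).
toy = pd.DataFrame(
    {
        "sentence1": ["A man is cooking."] * 3 + ["The bus left early."] * 3,
        "sentence2": ["h1", "h2", "h3", "h4", "h5", "h6"],
        "gold_label": ["entailment", "neutral", "contradiction"] * 2,
        "genre": ["fiction"] * 3 + ["travel"] * 3,
    }
)

# Before filtering, every first sentence shows up once per label.
n_duplicated = int(toy["sentence1"].duplicated().sum())

# Keeping a single label leaves one row per first sentence in this toy data.
neutral_only = toy[toy["gold_label"] == "neutral"]

print(n_duplicated, neutral_only["sentence1"].is_unique)
```

In the real dataset a first sentence may still occur more than once after filtering, so this is best read as a way to remove most duplicates, not a guarantee of uniqueness.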

In [3]:
df = load_pandas_df(DATA_FOLDER, "train")
df = df[df["gold_label"]=="neutral"]  # get unique sentences

100%|██████████| 222k/222k [01:20<00:00, 2.74kKB/s] 


In [4]:
df[[LABEL_COL, TEXT_COL]].head()

| | genre | sentence1 |
|---|---|---|
| 0 | government | Conceptually cream skimming has two basic dime... |
| 4 | telephone | yeah i tell you what though if you go price so... |
| 6 | travel | But a few Christian mosaics survive above the ... |
| 12 | slate | It's not that the questions they asked weren't... |
| 13 | travel | Thebes held onto power until the 12th Dynasty,... |


We split the data for training and testing, sample a fraction for faster execution, and encode the class labels:

In [5]:
# split
df_train, df_test = train_test_split(df, train_size=TRAIN_SIZE, random_state=0)



In [6]:
# sample
df_train = df_train.sample(frac=TRAIN_DATA_FRACTION).reset_index(drop=True)
df_test = df_test.sample(frac=TEST_DATA_FRACTION).reset_index(drop=True)
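
The split-then-sample pattern can be tried on a small frame. The sketch below uses invented data and, as an assumption beyond what the cells above do, also passes `random_state` to `DataFrame.sample` so that the sampled subset is the same on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy data: 400 rows with two alternating genres.
toy = pd.DataFrame(
    {"sentence1": [f"sentence {i}" for i in range(400)], "genre": ["a", "b"] * 200}
)

# 75/25 split, then keep 5% of each side for faster experiments.
toy_train, toy_test = train_test_split(toy, train_size=0.75, random_state=0)
toy_train_small = toy_train.sample(frac=0.05, random_state=0).reset_index(drop=True)
toy_test_small = toy_test.sample(frac=0.05, random_state=0).reset_index(drop=True)

# Sampling again with the same seed returns the same rows.
repeat = toy_train.sample(frac=0.05, random_state=0).reset_index(drop=True)

print(len(toy_train_small), len(toy_test_small), repeat.equals(toy_train_small))
```

Without a fixed `random_state`, each run draws a different subset, so class counts and downstream metrics can vary slightly between runs.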

The examples in the dataset are grouped into 5 genres:

In [7]:
df_train[LABEL_COL].value_counts()

telephone     1043
slate          989
fiction        968
travel         964
government     945
Name: genre, dtype: int64

In [8]:
# encode labels
label_encoder = LabelEncoder()
df_train[LABEL_COL] = label_encoder.fit_transform(df_train[LABEL_COL])
df_test[LABEL_COL] = label_encoder.transform(df_test[LABEL_COL])

num_labels = len(np.unique(df_train[LABEL_COL]))

In [9]:
print("Number of unique labels: {}".format(num_labels))
print("Number of training examples: {}".format(df_train.shape[0]))
print("Number of testing examples: {}".format(df_test.shape[0]))

Number of unique labels: 5
Number of training examples: 4909
Number of testing examples: 1636
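
As a small illustration of what the label encoding does, the sketch below fits a separate `LabelEncoder` on the five genre names and maps integer ids back to names with `inverse_transform`. The variable names (`genres`, `encoder`, `decoded`) are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

genres = ["telephone", "slate", "fiction", "travel", "government"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(genres)

# LabelEncoder sorts the class names, so ids follow alphabetical order.
print(list(encoder.classes_))

# Integer predictions can be mapped back to genre names when needed.
decoded = encoder.inverse_transform(np.array([0, 4, 1]))
print(list(decoded))
```

The same `inverse_transform` call could be applied to model predictions to report genres by name instead of by integer id.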


## Select Pretrained Models

Several pretrained models have been made available by [Hugging Face](https://github.com/huggingface/transformers). For text classification, the following pretrained models are supported.

In [10]:
pd.DataFrame({"model_name": SequenceClassifier.list_supported_models()})

| | model_name |
|---|---|
| 0 | bert-base-uncased |
| 1 | bert-large-uncased |
| 2 | bert-base-cased |
| 3 | bert-large-cased |
| 4 | bert-base-multilingual-uncased |
| 5 | bert-base-multilingual-cased |
| 6 | bert-base-chinese |
| 7 | bert-base-german-cased |
| 8 | bert-large-uncased-whole-word-masking |
| 9 | bert-large-cased-whole-word-masking |


## Fine-tune

Our wrappers make it easy to fine-tune different models in a unified way, hiding the preprocessing details that are needed before training. In this example, we're going to select the following models and use the same piece of code to fine-tune them on our genre classification task. Note that some models were pretrained on multilingual datasets and can be used with non-English datasets.

In [11]:
print(MODEL_NAMES)

['distilbert-base-uncased', 'roberta-base', 'xlnet-base-cased']


For each pretrained model, we preprocess the data, fine-tune the classifier, score the test set, and store the evaluation results.

In [12]:
results = {}

for model_name in tqdm(MODEL_NAMES, disable=True):

    # preprocess
    processor = Processor(
        model_name=model_name,
        to_lower=model_name.endswith("uncased"),
        cache_dir=CACHE_DIR,
    )
    train_dataset = processor.dataset_from_dataframe(
        df_train, TEXT_COL, LABEL_COL, max_len=MAX_LEN
    )
    train_dataloader = dataloader_from_dataset(
        train_dataset, batch_size=BATCH_SIZE, num_gpus=NUM_GPUS, shuffle=True
    )
    test_dataset = processor.dataset_from_dataframe(
        df_test, TEXT_COL, LABEL_COL, max_len=MAX_LEN
    )
    test_dataloader = dataloader_from_dataset(
        test_dataset, batch_size=BATCH_SIZE, num_gpus=NUM_GPUS, shuffle=False
    )

    # fine-tune
    classifier = SequenceClassifier(
        model_name=model_name, num_labels=num_labels, cache_dir=CACHE_DIR
    )
    with Timer() as t:
        classifier.fit(
            train_dataloader, num_epochs=NUM_EPOCHS, num_gpus=NUM_GPUS, verbose=False,
        )
    train_time = t.interval / 3600

    # predict
    preds = classifier.predict(test_dataloader, num_gpus=NUM_GPUS, verbose=False)

    # eval
    accuracy = accuracy_score(df_test[LABEL_COL], preds)
    class_report = classification_report(
        df_test[LABEL_COL], preds, target_names=label_encoder.classes_, output_dict=True
    )

    # save results
    results[model_name] = {
        "accuracy": accuracy,
        "f1-score": class_report["macro avg"]["f1-score"],
        "time(hrs)": train_time,
    }
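
The preprocessing step in the loop above is hidden behind `Processor`. As a rough sketch of what such preprocessing typically involves (optional lowercasing, mapping tokens to ids, truncating to a maximum length, padding, and building an attention mask), here is a toy version. It is not the `Processor` implementation: the `encode` function, the whitespace tokenization, the tiny vocabulary, and the `pad_id`/`unk_id` values are all hypothetical, and real transformer tokenizers use subword vocabularies and model-specific special tokens.

```python
def encode(text, vocab, max_len, to_lower=True, pad_id=0, unk_id=1):
    """Hypothetical stand-in for a model-specific tokenizer."""
    if to_lower:
        text = text.lower()
    # Whitespace tokenization, unknown tokens mapped to unk_id, then truncation.
    ids = [vocab.get(token, unk_id) for token in text.split()][:max_len]
    mask = [1] * len(ids)
    # Pad to a fixed length so that examples can be batched together.
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


vocab = {"the": 2, "bus": 3, "left": 4, "early": 5}  # invented toy vocabulary

input_ids, attention_mask = encode("The bus left early today", vocab, max_len=8)
short_ids, short_mask = encode("The bus left early today", vocab, max_len=3)

# The lowercasing flag used in the loop depends only on the model name.
model_names = ["distilbert-base-uncased", "roberta-base", "xlnet-base-cased"]
lower_flags = [name.endswith("uncased") for name in model_names]

print(input_ids, attention_mask, lower_flags)
```

The fixed length plays the role of `MAX_LEN` in the notebook: longer sentences are cut off and shorter ones are padded, with the mask marking which positions hold real tokens.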



## Evaluate

Finally, we report the accuracy and macro-averaged F1-score of each model on the test set, along with its fine-tuning time in hours.

In [13]:
df_results = pd.DataFrame(results)
df_results

| | distilbert-base-uncased | roberta-base | xlnet-base-cased |
|---|---|---|---|
| accuracy | 0.889364 | 0.885697 | 0.886308 |
| f1-score | 0.885225 | 0.880926 | 0.881819 |
| time(hrs) | 0.023326 | 0.044209 | 0.052801 |
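
For reference, the two quality metrics can be reproduced on a handful of invented labels. The sketch below computes accuracy and the macro-averaged F1-score, and checks that the `"macro avg"` entry of `classification_report` (the value stored by the fine-tuning loop) matches `f1_score(..., average="macro")`. The label lists are made up for illustration.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Invented labels for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Accuracy: fraction of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)

# Macro F1: unweighted mean of the per-class F1-scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# The same number is available from the classification report dictionary.
report = classification_report(y_true, y_pred, output_dict=True)
macro_f1_from_report = report["macro avg"]["f1-score"]

print(acc, macro_f1, macro_f1_from_report)
```

Macro averaging weights every class equally, which suits this task because the five genres are roughly balanced in the sampled training set.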


In [14]:
# record the mean accuracy and F1-score across models (used for automated notebook testing)
sb.glue("accuracy", df_results.iloc[0, :].mean())
sb.glue("f1", df_results.iloc[1, :].mean())
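
The cell above selects the metric rows by position. A label-based variant is sketched below on a frame rebuilt from the numbers in the results table. This is only an illustration that assumes the row labels shown in that table, and `mean_accuracy` and `mean_f1` are names introduced here.

```python
import pandas as pd

# Values copied from the results table reported in this notebook.
results = {
    "distilbert-base-uncased": {
        "accuracy": 0.889364, "f1-score": 0.885225, "time(hrs)": 0.023326,
    },
    "roberta-base": {
        "accuracy": 0.885697, "f1-score": 0.880926, "time(hrs)": 0.044209,
    },
    "xlnet-base-cased": {
        "accuracy": 0.886308, "f1-score": 0.881819, "time(hrs)": 0.052801,
    },
}
df_results = pd.DataFrame(results)

# Select rows by label instead of by position, then average across models.
mean_accuracy = df_results.loc["accuracy"].mean()
mean_f1 = df_results.loc["f1-score"].mean()

print(round(mean_accuracy, 6), round(mean_f1, 6))
```

Selecting by label keeps the recorded values correct even if the order of the metric rows changes.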