*Copyright (c) Microsoft Corporation. All rights reserved.*

*Licensed under the MIT License.*

# Text Classification of Multi-Language Datasets Using a Transformer Model

In [1]:
import scrapbook as sb
import pandas as pd
import torch
import numpy as np

from tempfile import TemporaryDirectory
from utils_nlp.common.timer import Timer
from sklearn.metrics import classification_report
from utils_nlp.models.transformers.sequence_classification import SequenceClassifier

from utils_nlp.dataset import multinli
from utils_nlp.dataset import dac
from utils_nlp.dataset import bbc_hindi

## Introduction

In this notebook, we fine-tune and evaluate a pretrained Transformer model based on the BERT architecture on datasets in three different languages:

- [MultiNLI dataset](https://www.nyu.edu/projects/bowman/multinli/): The Multi-Genre NLI corpus, in English
- [DAC dataset](https://data.mendeley.com/datasets/v524p5dhpj/2): DataSet for Arabic Classification corpus, in Arabic
- [BBC Hindi dataset](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1): BBC Hindi News corpus, in Hindi

If you want to run through the notebook quickly, set the **`QUICK_RUN`** flag in the cell below to **`True`** to run the notebook on a small subset of the data with fewer epochs. You can also choose which of the three datasets (**`MultiNLI`**, **`DAC`**, or **`BBC Hindi`**) to experiment with.

### Running Time

The table below provides reference running times for each dataset.

|Dataset|QUICK_RUN|Machine Configurations|Running time|
|:------|:---------|:----------------------|:------------|
|MultiNLI|True|2 NVIDIA Tesla K80 GPUs, 12GB GPU memory| ~ 8 minutes |
|MultiNLI|False|2 NVIDIA Tesla K80 GPUs, 12GB GPU memory| ~ 5.7 hours |
|DAC|True|2 NVIDIA Tesla K80 GPUs, 12GB GPU memory| ~ 13 minutes |
|DAC|False|2 NVIDIA Tesla K80 GPUs, 12GB GPU memory| ~ 5.6 hours |
|BBC Hindi|True|2 NVIDIA Tesla K80 GPUs, 12GB GPU memory| ~ 1 minute |
|BBC Hindi|False|2 NVIDIA Tesla K80 GPUs, 12GB GPU memory| ~ 14 minutes |

If you run into a CUDA out-of-memory error or the Jupyter kernel keeps dying, try reducing `batch_size` and `max_len` in `CONFIG`, but note that model performance may suffer.
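As a rough illustration of why those two settings matter, the number of token positions processed per step is `batch_size * max_len`, and activation memory grows at least in proportion to that product. The sketch below uses a hypothetical helper, `shrink_for_memory`, which is not part of `utils_nlp`; the halving factor and the lower bounds are arbitrary assumptions chosen only for illustration.

```python
def tokens_per_batch(config):
    """Token positions processed per step: batch_size * max_len."""
    return config["batch_size"] * config["max_len"]


def shrink_for_memory(config, factor=2, min_batch_size=1, min_max_len=32):
    """Return a copy of config with batch_size and max_len divided by factor.

    Hypothetical helper for illustration only; the lower bounds are arbitrary.
    """
    smaller = dict(config)  # copy so the original settings stay untouched
    smaller["batch_size"] = max(min_batch_size, config["batch_size"] // factor)
    smaller["max_len"] = max(min_max_len, config["max_len"] // factor)
    return smaller


example = {"batch_size": 32, "max_len": 150}
reduced = shrink_for_memory(example)
print(tokens_per_batch(example), "->", tokens_per_batch(reduced))
```

Halving both values cuts the token positions per step to a quarter, which is usually a reasonable first attempt before reducing either setting further.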

In [2]:
# Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs.
QUICK_RUN = True

# the dataset you want to try, valid values are: "multinli", "dac", "bbc-hindi"
USE_DATASET = "dac"

Several pretrained models have been made available by [Hugging Face](https://github.com/huggingface/transformers). For text classification, the following pretrained models are supported.

In [3]:
pd.DataFrame({"model_name": SequenceClassifier.list_supported_models()})

| |model_name|
|:--|:----------|
|0|bert-base-uncased|
|1|bert-large-uncased|
|2|bert-base-cased|
|3|bert-large-cased|
|4|bert-base-multilingual-uncased|
|5|bert-base-multilingual-cased|
|6|bert-base-chinese|
|7|bert-base-german-cased|
|8|bert-large-uncased-whole-word-masking|
|9|bert-large-cased-whole-word-masking|


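To demonstrate the multilingual capability of Transformer models, this notebook uses a single multilingual cased checkpoint by default: `model_name` in `CONFIG` below is set to **`distilbert-base-multilingual-cased`**, a distilled variant of **`bert-base-multilingual-cased`**.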

## Configuration

In [4]:
CONFIG = {
    'local_path': TemporaryDirectory().name,
    'test_fraction': 0.2,
    'random_seed': 100,
    'train_sample_ratio': 1.0,
    'test_sample_ratio': 1.0,
    'model_name': 'distilbert-base-multilingual-cased',
    'to_lower': False,
    'cache_dir': TemporaryDirectory().name,
    'max_len': 150,
    'num_train_epochs': 5,
    'num_gpus': 2,
    'batch_size': 16,
    'verbose': False,
    'load_dataset_func': None,
    'get_labels_func': None
}

if QUICK_RUN:
    CONFIG['train_sample_ratio'] = 0.2
    CONFIG['test_sample_ratio'] = 0.2
    CONFIG['num_train_epochs'] = 1

torch.manual_seed(CONFIG['random_seed'])

if torch.cuda.is_available():
    CONFIG['batch_size'] = 32
    
if USE_DATASET == "multinli":
    CONFIG['to_lower'] = True
    CONFIG['load_dataset_func'] = multinli.load_tc_dataset
    CONFIG['get_labels_func'] = multinli.get_label_values
    
    if QUICK_RUN:
        CONFIG['train_sample_ratio'] = 0.1
        CONFIG['test_sample_ratio'] = 0.1
elif USE_DATASET == "dac":
    CONFIG['load_dataset_func'] = dac.load_tc_dataset
    CONFIG['get_labels_func'] = dac.get_label_values
elif USE_DATASET == "bbc-hindi":
    CONFIG['load_dataset_func'] = bbc_hindi.load_tc_dataset
    CONFIG['get_labels_func'] = bbc_hindi.get_label_values
else:
    raise ValueError("Supported datasets are: 'multinli', 'dac', and 'bbc-hindi'")
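
To make the combined effect of `QUICK_RUN`, `USE_DATASET`, and GPU availability explicit, the sketch below mirrors the branching in the cell above as a pure function. `resolve_settings` is a hypothetical name introduced only for illustration; it covers just the settings that the branches modify and does not touch the dataset helper functions.

```python
def resolve_settings(use_dataset, quick_run, cuda_available):
    """Return the settings the configuration cell would end up with."""
    settings = {
        "train_sample_ratio": 1.0,
        "test_sample_ratio": 1.0,
        "num_train_epochs": 5,
        "batch_size": 16,
        "to_lower": False,
    }
    if quick_run:
        # Quick runs use a fifth of the data and a single epoch.
        settings.update(train_sample_ratio=0.2, test_sample_ratio=0.2,
                        num_train_epochs=1)
    if cuda_available:
        settings["batch_size"] = 32
    if use_dataset == "multinli":
        settings["to_lower"] = True
        if quick_run:
            # MultiNLI is sampled more aggressively in quick runs.
            settings.update(train_sample_ratio=0.1, test_sample_ratio=0.1)
    elif use_dataset not in ("dac", "bbc-hindi"):
        raise ValueError("Supported datasets are: 'multinli', 'dac', and 'bbc-hindi'")
    return settings


print(resolve_settings("dac", quick_run=True, cuda_available=True))
```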

## Load Dataset

Based on the dataset you chose, the configuration above selects a dataset-specific helper function (`load_tc_dataset`, stored in `CONFIG['load_dataset_func']`). The helper downloads the raw data, splits it into training and testing sets (sub-sampling each if its sampling ratio is smaller than 1.0), and preprocesses them for the Transformer model. Everything is done in one function call, which returns PyTorch dataloaders for training and testing that you can use to fine-tune the model and evaluate its performance.
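The split-then-subsample step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the helper's actual implementation: `split_and_sample` is a hypothetical function, and the real helper may shuffle, stratify, or round differently.

```python
import random


def split_and_sample(rows, test_fraction, train_sample_ratio,
                     test_sample_ratio, seed):
    """Shuffle rows, split off a test set, then subsample both parts."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # Keep only the requested share of each split (1.0 keeps everything).
    train = train[: round(len(train) * train_sample_ratio)]
    test = test[: round(len(test) * test_sample_ratio)]
    return train, test


# Toy data: 1000 (text, label) pairs with 5 label values.
rows = [(f"text {i}", i % 5) for i in range(1000)]
train, test = split_and_sample(rows, test_fraction=0.2,
                               train_sample_ratio=0.2,
                               test_sample_ratio=0.2, seed=100)
print(len(train), len(test))
```

Sub-sampling happens after the split, so the training and testing subsets never overlap regardless of the sampling ratios.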

In [5]:
train_dataloader, test_dataloader, label_encoder, test_labels = CONFIG['load_dataset_func'](
    local_path=CONFIG['local_path'],
    test_fraction=CONFIG['test_fraction'],
    random_seed=CONFIG['random_seed'],
    train_sample_ratio=CONFIG['train_sample_ratio'],
    test_sample_ratio=CONFIG['test_sample_ratio'],
    model_name=CONFIG['model_name'],
    to_lower=CONFIG['to_lower'],
    cache_dir=CONFIG['cache_dir'],
    max_len=CONFIG['max_len'],
    num_gpus=CONFIG['num_gpus']
)

100%|██████████| 80.1k/80.1k [00:02<00:00, 30.8kKB/s]


## Fine Tune

There are two steps to fine-tune a Transformer model for text classification: 1) instantiate a `SequenceClassifier`, which is a wrapper around the Transformer model, and 2) fit the model on the preprocessed training data using the `fit` method of `SequenceClassifier`.

In [6]:
model = SequenceClassifier(
    model_name=CONFIG['model_name'],
    num_labels=len(label_encoder.classes_),
    cache_dir=CONFIG['cache_dir']
)

# Fine tune the model using the training dataset
with Timer() as t:
    model.fit(
        train_dataloader=train_dataloader,
        num_epochs=CONFIG['num_train_epochs'],
        num_gpus=CONFIG['num_gpus'],
        verbose=CONFIG['verbose'],
        seed=CONFIG['random_seed']
    )

print("Training time : {:.3f} hrs".format(t.interval / 3600))



Training time : 0.190 hrs


## Evaluate on Testing Dataset

The `predict` method of `SequenceClassifier` returns a NumPy ndarray of raw predictions. Each predicted value is a label ID; to obtain the corresponding label values, call the `get_label_values` function from the dataset package. The sklearn `LabelEncoder` instance returned when loading the dataset can also be used to map between label IDs and label values.
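The ID-to-value lookup itself is simple. sklearn's `LabelEncoder` stores the sorted unique labels in `classes_`, and a label's ID is its index in that array, so `label_encoder.inverse_transform(preds)` performs the lookup in one call. The plain-Python sketch below reproduces the same idea without sklearn; `fit_label_mapping` and `decode` are hypothetical names, and the label IDs are made up for illustration.

```python
def fit_label_mapping(labels):
    """Return the sorted unique labels; a label's ID is its list index."""
    return sorted(set(labels))


def decode(label_ids, classes):
    """Map integer label IDs back to their label values."""
    return [classes[i] for i in label_ids]


raw_labels = ["sports", "culture", "politics", "sports", "economy", "diverse"]
classes = fit_label_mapping(raw_labels)

example_preds = [4, 0, 3]  # made-up label IDs standing in for model output
print(decode(example_preds, classes))
```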

In [7]:
with Timer() as t:
    preds = model.predict(
        test_dataloader=test_dataloader,
        num_gpus=CONFIG['num_gpus'],
        verbose=CONFIG['verbose']
    )

print("Prediction time : {:.3f} hrs".format(t.interval / 3600))

Prediction time : 0.021 hrs


Finally, we compute the precision, recall, and F1 metrics of the evaluation on the test set.
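For reference, the per-class metrics and their support-weighted average can be computed by hand as in the sketch below. It is a simplified stand-in for what `classification_report` prints, written under the assumption that only labels present in the true labels are reported (which matches passing `labels=np.unique(test_labels)`); the toy labels are made up.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Precision, recall, F1, and support for each label in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed an example of the true label

    metrics = {}
    for label in sorted(set(y_true)):
        predicted = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall,
                          "f1": f1, "support": support}
    return metrics


def weighted_average(metrics, key):
    """Average a metric over classes, weighting each class by its support."""
    total = sum(m["support"] for m in metrics.values())
    return sum(m[key] * m["support"] for m in metrics.values()) / total


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
metrics = per_class_metrics(y_true, y_pred)
print(metrics[0], weighted_average(metrics, "recall"))
```

When every example has exactly one label, the support-weighted recall equals plain accuracy, which is why the weighted row is a convenient single summary.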

In [8]:
report = classification_report(
    test_labels, 
    preds,
    digits=2,
    labels=np.unique(test_labels),
    target_names=label_encoder.classes_
)

print(report)

              precision    recall  f1-score   support

     culture       0.93      0.94      0.93       548
     diverse       0.94      0.94      0.94       640
     economy       0.90      0.88      0.89       570
    politics       0.87      0.88      0.88       809
      sports       0.99      0.98      0.99      1785

   micro avg       0.94      0.94      0.94      4352
   macro avg       0.93      0.93      0.93      4352
weighted avg       0.94      0.94      0.94      4352



In [9]:
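# For automated notebook testing: record the weighted-average metrics
# (the second-to-last line of the report text) with scrapbook.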
report_splits = report.split('\n')[-2].split()

sb.glue("precision", float(report_splits[2]))
sb.glue("recall", float(report_splits[3]))
sb.glue("f1", float(report_splits[4]))
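
The indexing used above depends on the layout of the report text: the string ends with a newline, so the last element after splitting is empty and `[-2]` is the `weighted avg` row. The sketch below walks through that on an abbreviated sample string copied from the output shown earlier; it does not call sklearn.

```python
# Abbreviated report text in the same layout as the output above.
sample_report = (
    "              precision    recall  f1-score   support\n"
    "\n"
    "      sports       0.99      0.98      0.99      1785\n"
    "\n"
    "weighted avg       0.94      0.94      0.94      4352\n"
)

lines = sample_report.split("\n")
# The trailing newline leaves an empty final element, so the
# weighted-average row is the second-to-last element.
fields = lines[-2].split()

# fields[0:2] are the two words "weighted" and "avg";
# the three metrics follow at indices 2, 3, and 4.
precision, recall, f1 = (float(value) for value in fields[2:5])
print(precision, recall, f1)
```

If string parsing feels fragile, `classification_report` also accepts `output_dict=True`, which returns the same numbers as a nested dictionary keyed by row name.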