{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Copyright (c) Microsoft Corporation. All rights reserved.*\n",
"\n",
"*Licensed under the MIT License.*\n",
"\n",
"# Text Classification of MultiNLI Sentences using Multiple Transformer Models"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"import sys\n",
"from tempfile import TemporaryDirectory\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scrapbook as sb\n",
"import torch\n",
"import torch.nn as nn\n",
"from sklearn.metrics import accuracy_score, classification_report\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from tqdm import tqdm\n",
"from utils_nlp.common.timer import Timer\n",
"from utils_nlp.common.pytorch_utils import dataloader_from_dataset\n",
"from utils_nlp.dataset.multinli import load_pandas_df\n",
"from utils_nlp.models.transformers.sequence_classification import (\n",
" Processor, SequenceClassifier)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"In this notebook, we fine-tune and evaluate a number of pretrained models on a subset of the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) dataset.\n",
"\n",
"We use a [sequence classifier](../../utils_nlp/models/transformers/sequence_classification.py) that wraps [Hugging Face's PyTorch implementation](https://github.com/huggingface/transformers) of different transformers, like [BERT](https://github.com/google-research/bert), [XLNet](https://github.com/zihangdai/xlnet), and [RoBERTa](https://github.com/pytorch/fairseq)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"# notebook parameters\n",
"DATA_FOLDER = TemporaryDirectory().name\n",
"CACHE_DIR = TemporaryDirectory().name\n",
"NUM_EPOCHS = 1\n",
"BATCH_SIZE = 16\n",
"NUM_GPUS = 2\n",
"MAX_LEN = 100\n",
"TRAIN_DATA_FRACTION = 0.05\n",
"TEST_DATA_FRACTION = 0.05\n",
"TRAIN_SIZE = 0.75\n",
"LABEL_COL = \"genre\"\n",
"TEXT_COL = \"sentence1\"\n",
"MODEL_NAMES = [\"distilbert-base-uncased\", \"roberta-base\", \"xlnet-base-cased\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Dataset\n",
"We start by loading a subset of the data. The following function also downloads and extracts the files, if they don't exist in the data folder.\n",
"\n",
"The MultiNLI dataset is mainly used for natural language inference (NLI) tasks, where the inputs are sentence pairs and the labels are entailment indicators. The sentence pairs are also classified into *genres* that allow for more coverage and better evaluation of NLI models.\n",
"\n",
"For our classification task, we use the first sentence only as the text input, and the corresponding genre as the label. We select the examples corresponding to one of the entailment labels (*neutral* in this case) to avoid duplicate rows, as the sentences are not unique, whereas the sentence pairs are."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 222k/222k [01:20<00:00, 2.74kKB/s] \n"
]
}
],
"source": [
"df = load_pandas_df(DATA_FOLDER, \"train\")\n",
"df = df[df[\"gold_label\"]==\"neutral\"] # get unique sentences"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" genre | \n",
" sentence1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" government | \n",
" Conceptually cream skimming has two basic dime... | \n",
"
\n",
" \n",
" 4 | \n",
" telephone | \n",
" yeah i tell you what though if you go price so... | \n",
"
\n",
" \n",
" 6 | \n",
" travel | \n",
" But a few Christian mosaics survive above the ... | \n",
"
\n",
" \n",
" 12 | \n",
" slate | \n",
" It's not that the questions they asked weren't... | \n",
"
\n",
" \n",
" 13 | \n",
" travel | \n",
" Thebes held onto power until the 12th Dynasty,... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" genre sentence1\n",
"0 government Conceptually cream skimming has two basic dime...\n",
"4 telephone yeah i tell you what though if you go price so...\n",
"6 travel But a few Christian mosaics survive above the ...\n",
"12 slate It's not that the questions they asked weren't...\n",
"13 travel Thebes held onto power until the 12th Dynasty,..."
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[[LABEL_COL, TEXT_COL]].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We split the data for training and testing, sample a fraction for faster execution, and encode the class labels:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/media/bleik2/backup/.conda/envs/nlp_gpu/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.\n",
" FutureWarning)\n"
]
}
],
"source": [
"# split\n",
"df_train, df_test = train_test_split(df, train_size = TRAIN_SIZE, random_state=0)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# sample\n",
"df_train = df_train.sample(frac=TRAIN_DATA_FRACTION).reset_index(drop=True)\n",
"df_test = df_test.sample(frac=TEST_DATA_FRACTION).reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The examples in the dataset are grouped into 5 genres:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"telephone 1043\n",
"slate 989\n",
"fiction 968\n",
"travel 964\n",
"government 945\n",
"Name: genre, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train[LABEL_COL].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# encode labels\n",
"label_encoder = LabelEncoder()\n",
"df_train[LABEL_COL] = label_encoder.fit_transform(df_train[LABEL_COL])\n",
"df_test[LABEL_COL] = label_encoder.transform(df_test[LABEL_COL])\n",
"\n",
"num_labels = len(np.unique(df_train[LABEL_COL]))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of unique labels: 5\n",
"Number of training examples: 4909\n",
"Number of testing examples: 1636\n"
]
}
],
"source": [
"print(\"Number of unique labels: {}\".format(num_labels))\n",
"print(\"Number of training examples: {}\".format(df_train.shape[0]))\n",
"print(\"Number of testing examples: {}\".format(df_test.shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Select Pretrained Models\n",
"\n",
"Several pretrained models have been made available by [Hugging Face](https://github.com/huggingface/transformers). For text classification, the following pretrained models are supported."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" model_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" bert-base-uncased | \n",
"
\n",
" \n",
" 1 | \n",
" bert-large-uncased | \n",
"
\n",
" \n",
" 2 | \n",
" bert-base-cased | \n",
"
\n",
" \n",
" 3 | \n",
" bert-large-cased | \n",
"
\n",
" \n",
" 4 | \n",
" bert-base-multilingual-uncased | \n",
"
\n",
" \n",
" 5 | \n",
" bert-base-multilingual-cased | \n",
"
\n",
" \n",
" 6 | \n",
" bert-base-chinese | \n",
"
\n",
" \n",
" 7 | \n",
" bert-base-german-cased | \n",
"
\n",
" \n",
" 8 | \n",
" bert-large-uncased-whole-word-masking | \n",
"
\n",
" \n",
" 9 | \n",
" bert-large-cased-whole-word-masking | \n",
"
\n",
" \n",
" 10 | \n",
" bert-large-uncased-whole-word-masking-finetune... | \n",
"
\n",
" \n",
" 11 | \n",
" bert-large-cased-whole-word-masking-finetuned-... | \n",
"
\n",
" \n",
" 12 | \n",
" bert-base-cased-finetuned-mrpc | \n",
"
\n",
" \n",
" 13 | \n",
" bert-base-german-dbmdz-cased | \n",
"
\n",
" \n",
" 14 | \n",
" bert-base-german-dbmdz-uncased | \n",
"
\n",
" \n",
" 15 | \n",
" bert-base-japanese | \n",
"
\n",
" \n",
" 16 | \n",
" bert-base-japanese-whole-word-masking | \n",
"
\n",
" \n",
" 17 | \n",
" bert-base-japanese-char | \n",
"
\n",
" \n",
" 18 | \n",
" bert-base-japanese-char-whole-word-masking | \n",
"
\n",
" \n",
" 19 | \n",
" bert-base-finnish-cased-v1 | \n",
"
\n",
" \n",
" 20 | \n",
" bert-base-finnish-uncased-v1 | \n",
"
\n",
" \n",
" 21 | \n",
" roberta-base | \n",
"
\n",
" \n",
" 22 | \n",
" roberta-large | \n",
"
\n",
" \n",
" 23 | \n",
" roberta-large-mnli | \n",
"
\n",
" \n",
" 24 | \n",
" distilroberta-base | \n",
"
\n",
" \n",
" 25 | \n",
" roberta-base-openai-detector | \n",
"
\n",
" \n",
" 26 | \n",
" roberta-large-openai-detector | \n",
"
\n",
" \n",
" 27 | \n",
" xlnet-base-cased | \n",
"
\n",
" \n",
" 28 | \n",
" xlnet-large-cased | \n",
"
\n",
" \n",
" 29 | \n",
" distilbert-base-uncased | \n",
"
\n",
" \n",
" 30 | \n",
" distilbert-base-uncased-distilled-squad | \n",
"
\n",
" \n",
" 31 | \n",
" distilbert-base-german-cased | \n",
"
\n",
" \n",
" 32 | \n",
" distilbert-base-multilingual-cased | \n",
"
\n",
" \n",
" 33 | \n",
" albert-base-v1 | \n",
"
\n",
" \n",
" 34 | \n",
" albert-large-v1 | \n",
"
\n",
" \n",
" 35 | \n",
" albert-xlarge-v1 | \n",
"
\n",
" \n",
" 36 | \n",
" albert-xxlarge-v1 | \n",
"
\n",
" \n",
" 37 | \n",
" albert-base-v2 | \n",
"
\n",
" \n",
" 38 | \n",
" albert-large-v2 | \n",
"
\n",
" \n",
" 39 | \n",
" albert-xlarge-v2 | \n",
"
\n",
" \n",
" 40 | \n",
" albert-xxlarge-v2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" model_name\n",
"0 bert-base-uncased\n",
"1 bert-large-uncased\n",
"2 bert-base-cased\n",
"3 bert-large-cased\n",
"4 bert-base-multilingual-uncased\n",
"5 bert-base-multilingual-cased\n",
"6 bert-base-chinese\n",
"7 bert-base-german-cased\n",
"8 bert-large-uncased-whole-word-masking\n",
"9 bert-large-cased-whole-word-masking\n",
"10 bert-large-uncased-whole-word-masking-finetune...\n",
"11 bert-large-cased-whole-word-masking-finetuned-...\n",
"12 bert-base-cased-finetuned-mrpc\n",
"13 bert-base-german-dbmdz-cased\n",
"14 bert-base-german-dbmdz-uncased\n",
"15 bert-base-japanese\n",
"16 bert-base-japanese-whole-word-masking\n",
"17 bert-base-japanese-char\n",
"18 bert-base-japanese-char-whole-word-masking\n",
"19 bert-base-finnish-cased-v1\n",
"20 bert-base-finnish-uncased-v1\n",
"21 roberta-base\n",
"22 roberta-large\n",
"23 roberta-large-mnli\n",
"24 distilroberta-base\n",
"25 roberta-base-openai-detector\n",
"26 roberta-large-openai-detector\n",
"27 xlnet-base-cased\n",
"28 xlnet-large-cased\n",
"29 distilbert-base-uncased\n",
"30 distilbert-base-uncased-distilled-squad\n",
"31 distilbert-base-german-cased\n",
"32 distilbert-base-multilingual-cased\n",
"33 albert-base-v1\n",
"34 albert-large-v1\n",
"35 albert-xlarge-v1\n",
"36 albert-xxlarge-v1\n",
"37 albert-base-v2\n",
"38 albert-large-v2\n",
"39 albert-xlarge-v2\n",
"40 albert-xxlarge-v2"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame({\"model_name\": SequenceClassifier.list_supported_models()})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fine-tune\n",
"\n",
"Our wrappers make it easy to fine-tune different models in a unified way, hiding the preprocessing details that are needed before training. In this example, we're going to select the following models and use the same piece of code to fine-tune them on our genre classification task. Note that some models were pretrained on multilingual datasets and can be used with non-English datasets."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['distilbert-base-uncased', 'roberta-base', 'xlnet-base-cased']\n"
]
}
],
"source": [
"print(MODEL_NAMES)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each pretrained model, we preprocess the data, fine-tune the classifier, score the test set, and store the evaluation results."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/media/bleik2/backup/.conda/envs/nlp_gpu/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
" warnings.warn('Was asked to gather along dimension 0, but all '\n"
]
}
],
"source": [
"results = {}\n",
"\n",
"for model_name in tqdm(MODEL_NAMES, disable=True):\n",
"\n",
" # preprocess\n",
" processor = Processor(\n",
" model_name=model_name,\n",
" to_lower=model_name.endswith(\"uncased\"),\n",
" cache_dir=CACHE_DIR,\n",
" )\n",
" train_dataset = processor.dataset_from_dataframe(\n",
" df_train, TEXT_COL, LABEL_COL, max_len=MAX_LEN\n",
" )\n",
" train_dataloader = dataloader_from_dataset(\n",
" train_dataset, batch_size=BATCH_SIZE, num_gpus=NUM_GPUS, shuffle=True\n",
" )\n",
" test_dataset = processor.dataset_from_dataframe(\n",
" df_test, TEXT_COL, LABEL_COL, max_len=MAX_LEN\n",
" )\n",
" test_dataloader = dataloader_from_dataset(\n",
" test_dataset, batch_size=BATCH_SIZE, num_gpus=NUM_GPUS, shuffle=False\n",
" )\n",
"\n",
" # fine-tune\n",
" classifier = SequenceClassifier(\n",
" model_name=model_name, num_labels=num_labels, cache_dir=CACHE_DIR\n",
" )\n",
" with Timer() as t:\n",
" classifier.fit(\n",
" train_dataloader, num_epochs=NUM_EPOCHS, num_gpus=NUM_GPUS, verbose=False,\n",
" )\n",
" train_time = t.interval / 3600\n",
"\n",
" # predict\n",
" preds = classifier.predict(test_dataloader, num_gpus=NUM_GPUS, verbose=False)\n",
"\n",
" # eval\n",
" accuracy = accuracy_score(df_test[LABEL_COL], preds)\n",
" class_report = classification_report(\n",
" df_test[LABEL_COL], preds, target_names=label_encoder.classes_, output_dict=True\n",
" )\n",
"\n",
" # save results\n",
" results[model_name] = {\n",
" \"accuracy\": accuracy,\n",
" \"f1-score\": class_report[\"macro avg\"][\"f1-score\"],\n",
" \"time(hrs)\": train_time,\n",
" }"
]
},
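{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once a model has been fine-tuned, the same wrappers can score new text. The snippet below is a minimal sketch (not executed here): it reuses the `processor` and `classifier` variables left over from the last loop iteration, i.e. the last model in `MODEL_NAMES`, and assumes that `dataset_from_dataframe` also accepts a dataframe without a label column.\n",
"\n",
"```python\n",
"# minimal sketch: classify a few new sentences with the last fine-tuned model\n",
"df_new = pd.DataFrame({TEXT_COL: [\n",
"    \"The committee will review the budget proposal next week.\",\n",
"    \"The ruins sit on a hill overlooking the old harbor.\",\n",
"]})\n",
"\n",
"# preprocess (no label column), batch, and predict as in the training loop\n",
"new_dataset = processor.dataset_from_dataframe(df_new, TEXT_COL, max_len=MAX_LEN)\n",
"new_dataloader = dataloader_from_dataset(\n",
"    new_dataset, batch_size=BATCH_SIZE, num_gpus=NUM_GPUS, shuffle=False\n",
")\n",
"new_preds = classifier.predict(new_dataloader, num_gpus=NUM_GPUS, verbose=False)\n",
"\n",
"# map encoded predictions back to genre names\n",
"print(label_encoder.inverse_transform(new_preds))\n",
"```"
]
},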
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate\n",
"\n",
"Finally, we report the accuracy and F1-score metrics for each model, as well as the fine-tuning time in hours."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" distilbert-base-uncased | \n",
" roberta-base | \n",
" xlnet-base-cased | \n",
"
\n",
" \n",
" \n",
" \n",
" accuracy | \n",
" 0.889364 | \n",
" 0.885697 | \n",
" 0.886308 | \n",
"
\n",
" \n",
" f1-score | \n",
" 0.885225 | \n",
" 0.880926 | \n",
" 0.881819 | \n",
"
\n",
" \n",
" time(hrs) | \n",
" 0.023326 | \n",
" 0.044209 | \n",
" 0.052801 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" distilbert-base-uncased roberta-base xlnet-base-cased\n",
"accuracy 0.889364 0.885697 0.886308\n",
"f1-score 0.885225 0.880926 0.881819\n",
"time(hrs) 0.023326 0.044209 0.052801"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_results = pd.DataFrame(results)\n",
"df_results"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.887123064384678,
"encoder": "json",
"name": "accuracy",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "accuracy"
}
},
"output_type": "display_data"
},
{
"data": {
"application/scrapbook.scrap.json+json": {
"data": 0.8826569624491233,
"encoder": "json",
"name": "f1",
"version": 1
}
},
"metadata": {
"scrapbook": {
"data": true,
"display": false,
"name": "f1"
}
},
"output_type": "display_data"
}
],
"source": [
"# for testing\n",
"sb.glue(\"accuracy\", df_results.iloc[0, :].mean())\n",
"sb.glue(\"f1\", df_results.iloc[1, :].mean())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6.8 64-bit ('nlp_gpu': conda)",
"language": "python",
"name": "python36864bitnlpgpucondaa579511bcea84c65877ff3dca4205921"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}