*Copyright (c) Microsoft Corporation. All rights reserved.*  
*Licensed under the MIT License.*

# Named Entity Recognition Using Transformer Model

## Summary

This notebook demonstrates how to fine-tune a [pretrained Transformer model](https://github.com/huggingface/transformers) for a named entity recognition (NER) task. Utility functions and classes in the NLP Best Practices repo are used to facilitate data preprocessing, model training, model scoring, and model evaluation.

A pretrained transformer of the [BERT (Bidirectional Encoder Representations from Transformers)](https://arxiv.org/pdf/1810.04805.pdf) family is used in this notebook; the default `MODEL_NAME` in the configuration below is `distilbert-base-cased`, a distilled BERT variant. [BERT](https://arxiv.org/pdf/1810.04805.pdf) is a powerful pretrained language model that can be used for multiple NLP tasks, including text classification, question answering, and named entity recognition. It can achieve state-of-the-art performance with only a few epochs of fine-tuning on task-specific datasets.

The figure below illustrates how BERT can be fine tuned for NER tasks. The input data is a list of tokens representing a sentence. In the training data, each token has an entity label. After fine tuning, the model predicts an entity label for each token in a given testing sentence. 

<img src="https://nlpbp.blob.core.windows.net/images/bert_architecture.png">
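
To make the input format concrete, here is a minimal sketch of one labeled example, using the first ten tokens of a training sentence shown later in this notebook. The variable names are illustrative only.

```python
# One training example: parallel lists of tokens and entity labels.
# Tokens and labels are copied from the example sentence shown later in this
# notebook; the variable names are illustrative.
tokens = ["In", "1999", ",", "the", "Caloi", "family", "sold", "the", "majority", "of"]
labels = ["O", "O", "O", "O", "I-PER", "O", "O", "O", "O", "O"]

# Every token must have exactly one label.
assert len(tokens) == len(labels)

# Collect the tokens that are tagged as part of an entity.
entity_tokens = [(tok, lab) for tok, lab in zip(tokens, labels) if lab != "O"]
print(entity_tokens)  # [('Caloi', 'I-PER')]
```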

In [1]:
import os
import random
import string
import sys
from tempfile import TemporaryDirectory

import pandas as pd
import scrapbook as sb
import torch
from seqeval.metrics import classification_report
from sklearn.model_selection import train_test_split
from utils_nlp.common.pytorch_utils import dataloader_from_dataset
from utils_nlp.common.timer import Timer
from utils_nlp.dataset import wikigold
from utils_nlp.dataset.ner_utils import read_conll_file
from utils_nlp.dataset.url_utils import maybe_download
from utils_nlp.models.transformers.named_entity_recognition import (
    TokenClassificationProcessor, TokenClassifier)
from utils_nlp.models.transformers.named_entity_recognition import supported_models as SUPPORTED_MODELS

## Configuration

The running time shown in this notebook is on a Standard_NC12 Azure Virtual Machine with 2 NVIDIA Tesla K80 GPUs. 
> **Tip**: If you want to run through the notebook quickly, you can set the **`QUICK_RUN`** flag in the cell below to **`True`** to run the notebook on a small subset of the data and a smaller number of epochs. 

The table below provides reference running times on different machine configurations.

|QUICK_RUN|Machine Configurations|Running time|
|:---------|:----------------------|:------------|
|True|4 CPUs, 14GB memory|~2 minutes|
|False|4 CPUs, 14GB memory|~1.5 hours|
|True|1 NVIDIA Tesla K80 GPU, 12GB GPU memory|~1 minute|
|False|1 NVIDIA Tesla K80 GPU, 12GB GPU memory|~7 minutes|

If you run into a CUDA out-of-memory error or the Jupyter kernel dies repeatedly, try reducing `BATCH_SIZE` and `MAX_SEQ_LENGTH`, but note that model performance may be compromised.

In [2]:
# Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs.
QUICK_RUN = False

In [3]:
# Wikigold dataset
DATA_URL = (
    "https://raw.githubusercontent.com/juand-r/entity-recognition-datasets"
    "/master/data/wikigold/CONLL-format/data/wikigold.conll.txt"
)

# fraction of the dataset used for testing
TEST_DATA_FRACTION = 0.3

# sub-sampling ratio
SAMPLE_RATIO = 1

# the data path used to save the downloaded data file
DATA_PATH = TemporaryDirectory().name

# the cache data path used during fine-tuning
CACHE_DIR = TemporaryDirectory().name

# set random seeds
RANDOM_SEED = 100
torch.manual_seed(RANDOM_SEED)

# model configurations
NUM_TRAIN_EPOCHS = 5
MODEL_NAME = "distilbert-base-cased"
DO_LOWER_CASE = False
MAX_SEQ_LENGTH = 200
TRAILING_PIECE_TAG = "X"
NUM_GPUS = None  # uses all if available
BATCH_SIZE = 16

# update variables for quick run option
if QUICK_RUN:
    SAMPLE_RATIO = 0.1
    NUM_TRAIN_EPOCHS = 1

### Models that can be used for the token classification task

In [4]:
pd.DataFrame({"supported models": SUPPORTED_MODELS})

Unnamed: 0,supported models
0,albert-base-v1
1,albert-base-v2
2,albert-large-v1
3,albert-large-v2
4,albert-xlarge-v1
...,...
65,xlm-roberta-large-finetuned-conll02-spanish
66,xlm-roberta-large-finetuned-conll03-english
67,xlm-roberta-large-finetuned-conll03-german
68,xlnet-base-cased


## Get Training & Testing Dataset

The dataset used in this notebook is the [wikigold dataset](https://www.aclweb.org/anthology/W09-3302). The wikigold dataset consists of 145 manually labelled Wikipedia articles, including 1841 sentences and 40k tokens in total. The dataset can be downloaded directly from [here](https://github.com/juand-r/entity-recognition-datasets/tree/master/data/wikigold).

In the following cell, we download the data file, parse the tokens and labels, sample a given number of sentences, and split the dataset for training and testing.

In [5]:
# download data
file_name = DATA_URL.split("/")[-1]  # a name for the downloaded file
maybe_download(DATA_URL, file_name, DATA_PATH)
data_file = os.path.join(DATA_PATH, file_name)

# parse CoNLL file
sentence_list, labels_list = read_conll_file(data_file, sep=" ")

# sub-sample (optional)
random.seed(RANDOM_SEED)
sample_size = int(SAMPLE_RATIO * len(sentence_list))
sentence_list, labels_list = list(
    zip(*random.sample(list(zip(sentence_list, labels_list)), k=sample_size))
)

# train-test split
train_sentence_list, test_sentence_list, train_labels_list, test_labels_list = train_test_split(
    sentence_list, labels_list, test_size=TEST_DATA_FRACTION, random_state=RANDOM_SEED
)

100%|██████████| 96.0/96.0 [00:00<00:00, 4.02kKB/s]

Maximum sequence length is: 144
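
For readers unfamiliar with the CoNLL layout, the sketch below shows roughly what parsing such a file involves: one `token label` pair per line, with blank lines separating sentences. `parse_conll_text` is a hypothetical helper written for illustration; it is not the implementation of `read_conll_file`, and the sample text is made up for the example.

```python
def parse_conll_text(text, sep=" "):
    """Parse CoNLL-style text into parallel lists of sentences and label lists.

    Hypothetical helper for illustration; not the utils_nlp implementation.
    """
    sentences, labels = [], []
    cur_tokens, cur_labels = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if cur_tokens:
                sentences.append(cur_tokens)
                labels.append(cur_labels)
                cur_tokens, cur_labels = [], []
            continue
        # The label is the last field on the line.
        token, label = line.rsplit(sep, 1)
        cur_tokens.append(token)
        cur_labels.append(label)
    # Flush the last sentence if the text does not end with a blank line.
    if cur_tokens:
        sentences.append(cur_tokens)
        labels.append(cur_labels)
    return sentences, labels


# Made-up sample in the assumed "token label" layout.
sample = "Joe I-PER\nlives O\nin O\nCopenhagen I-LOC\n\nShe O\nfled O\nto O\nAbra I-LOC\n"
sentences, labels = parse_conll_text(sample)

# Longest sentence, counted in whitespace-level tokens.
max_len = max(len(s) for s in sentences)
```

A word-level length like `max_len` is a useful lower bound when choosing `MAX_SEQ_LENGTH`; sub-word tokenization can only make sequences longer.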





The following are example input sentences from the dataset, each paired with its token-level labels.

In [7]:
# Show example sentences from input
pd.DataFrame({"sentence": sentence_list, "labels": labels_list}).head(10)

Unnamed: 0,sentence,labels
0,"[The, origin, of, Agotes, (, or, Cagots, ), is...","[O, O, O, I-MISC, O, O, I-MISC, O, O, O, O]"
1,[-DOCSTART-],[O]
2,"[It, provides, full, -, and, part-time, polyte...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,"[Since, she, was, the, daughter, of, the, grea...","[O, O, O, O, O, O, O, O, I-MISC, O, O, O, I-MI..."
4,"[The, goals, were, two, posts, ,, with, no, cr...","[O, O, O, O, O, O, O, O, O, O]"
5,"[At, one, point, ,, so, many, orders, had, bee...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
6,"[Left, camp, in, July, 1972, ,, and, was, deal...","[O, O, O, O, O, O, O, O, O, O, O, I-ORG, I-ORG..."
7,"[She, fled, again, to, Abra, ,, where, she, wa...","[O, O, O, O, I-LOC, O, O, O, O, O, O]"
8,"[As, the, younger, sibling, ,, Ben, was, const...","[O, O, O, O, O, I-PER, O, O, O, O, O, O, O, O,..."
9,"[Milepost, 1, :, granite, masonry, arch, over,...","[O, O, O, O, O, O, O, I-LOC, I-LOC, O]"


In [8]:
# Show example tokens from input
pd.DataFrame({"token": train_sentence_list[0], "label": train_labels_list[0]}).head(11)

Unnamed: 0,token,label
0,In,O
1,1999,O
2,",",O
3,the,O
4,Caloi,I-PER
5,family,O
6,sold,O
7,the,O
8,majority,O
9,of,O


> If your data is unlabeled, try using an annotation tool to simplify the labeling process. The example [here](../annotation/Doccano.md) introduces [Doccano](https://github.com/chakki-works/doccano) and shows how it can be used for NER annotation.

## Create PyTorch Datasets and Dataloaders
Given the tokenized input and corresponding labels, we use a custom processor to convert our input lists into a PyTorch dataset that can be used with our token classifier. Next, we create PyTorch dataloaders for training and testing.

In [9]:
processor = TokenClassificationProcessor(model_name=MODEL_NAME, to_lower=DO_LOWER_CASE, cache_dir=CACHE_DIR)

label_map = TokenClassificationProcessor.create_label_map(
    label_lists=labels_list, trailing_piece_tag=TRAILING_PIECE_TAG
)

train_dataset = processor.preprocess(
    text=train_sentence_list,
    max_len=MAX_SEQ_LENGTH,
    labels=train_labels_list,
    label_map=label_map,
    trailing_piece_tag=TRAILING_PIECE_TAG,
)
train_dataloader = dataloader_from_dataset(
    train_dataset, batch_size=BATCH_SIZE, num_gpus=NUM_GPUS, shuffle=True, distributed=False
)

test_dataset = processor.preprocess(
    text=test_sentence_list,
    max_len=MAX_SEQ_LENGTH,
    labels=test_labels_list,
    label_map=label_map,
    trailing_piece_tag=TRAILING_PIECE_TAG,
)
test_dataloader = dataloader_from_dataset(
    test_dataset, batch_size=BATCH_SIZE, num_gpus=NUM_GPUS, shuffle=False, distributed=False
)
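
The `trailing_piece_tag` argument matters because transformer tokenizers may split one word into several sub-word pieces, while the dataset has one label per word. A common convention, sketched below, keeps the original label on the first piece and assigns the trailing tag to the rest. `toy_word_pieces` is a hypothetical stand-in for a real tokenizer and `align_labels` is an illustrative helper; neither is the `TokenClassificationProcessor` implementation.

```python
def toy_word_pieces(word, max_piece_len=4):
    """Hypothetical stand-in for a sub-word tokenizer: fixed-size chunks."""
    pieces = [word[i:i + max_piece_len] for i in range(0, len(word), max_piece_len)]
    # Mark continuation pieces with "##", as WordPiece vocabularies do.
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(tokens, labels, trailing_piece_tag="X"):
    """Expand word-level labels to piece-level labels."""
    out_pieces, out_labels = [], []
    for token, label in zip(tokens, labels):
        pieces = toy_word_pieces(token)
        out_pieces.extend(pieces)
        # The first piece keeps the word's label; the rest get the trailing tag.
        out_labels.extend([label] + [trailing_piece_tag] * (len(pieces) - 1))
    return out_pieces, out_labels


pieces, piece_labels = align_labels(
    ["Joe", "lives", "in", "Copenhagen"],
    ["I-PER", "O", "O", "I-LOC"],
)
print(list(zip(pieces, piece_labels)))
```

Because the trailing tag becomes an extra class, it has to be part of the label map, which is presumably why `create_label_map` also takes `trailing_piece_tag`.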



## Train Model

Training an NER model with a pretrained transformer takes two steps: 1) instantiate a `TokenClassifier`, which wraps a transformer-based network, and 2) fit the model using the preprocessed training dataloader. The `fit` method of the `TokenClassifier` class fine-tunes the model.

In [10]:
# Instantiate a TokenClassifier class for NER using pretrained transformer model
model = TokenClassifier(
    model_name=MODEL_NAME,
    num_labels=len(label_map),
    cache_dir=CACHE_DIR
)

# Fine tune the model using the training dataset
with Timer() as t:
    model.fit(
        train_dataloader=train_dataloader,
        num_epochs=NUM_TRAIN_EPOCHS,
        num_gpus=NUM_GPUS,
        local_rank=-1,
        weight_decay=0.0,
        learning_rate=5e-5,
        adam_epsilon=1e-8,
        warmup_steps=0,
        verbose=False,
        seed=RANDOM_SEED
    )

print("Training time : {:.3f} hrs".format(t.interval / 3600))


Training time : 0.060 hrs


## Evaluate on Testing Dataset

The `predict` method of the `TokenClassifier` returns a NumPy ndarray of raw predictions with shape \[`number_of_examples`, `sequence_length`, `number_of_labels`\]. The values are not normalized, so post-processing is needed to obtain a probability for each class label. The `get_predicted_token_labels` method processes the raw predictions and outputs the predicted label for each token.

In [11]:
with Timer() as t:
    preds = model.predict(
        test_dataloader=test_dataloader,
        num_gpus=None,
        verbose=True
    )

print("Prediction time : {:.3f} hrs".format(t.interval / 3600))

Scoring: 100%|██████████| 35/35 [00:06<00:00,  6.14it/s]

Prediction time : 0.002 hrs
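
As noted above, the raw predictions are unnormalized scores. A minimal sketch of the post-processing for a single token, assuming the scores are logits over the label set: apply a softmax to get probabilities, then take the arg max. The logits and the three-label map below are made-up illustrative values, not output from this model.

```python
import math


def softmax(logits):
    """Convert unnormalized scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative label map and logits for one token (made-up values).
label_map = {"O": 0, "I-PER": 1, "I-LOC": 2}
id_to_label = {i: label for label, i in label_map.items()}
token_logits = [0.5, 3.0, -1.0]

probs = softmax(token_logits)

# The predicted label is the one with the highest probability.
best_id = max(range(len(probs)), key=probs.__getitem__)
predicted_label = id_to_label[best_id]
print(predicted_label, round(probs[best_id], 3))
```

Since softmax preserves the ordering of the scores, taking the arg max of the raw scores gives the same label; the softmax is only needed when the probabilities themselves are of interest.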





Get the true token labels of the testing dataset:

In [12]:
true_labels = model.get_true_test_labels(label_map=label_map, dataset=test_dataset)

Get the predicted labels for each token by calling member method `get_predicted_token_labels`, and generate the classification report.

In [13]:
predicted_labels = model.get_predicted_token_labels(
    predictions=preds,
    label_map=label_map,
    dataset=test_dataset
)

report = classification_report(true_labels, predicted_labels, digits=2)

print(report)

           precision    recall  f1-score   support

      ORG       0.72      0.76      0.74       274
     MISC       0.67      0.73      0.70       221
      LOC       0.79      0.84      0.81       317
      PER       0.90      0.93      0.92       257

micro avg       0.76      0.82      0.79      1069
macro avg       0.77      0.82      0.79      1069
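
`seqeval` scores predictions at the entity level: a predicted entity counts as correct only if both its span and its type match the gold annotation. The sketch below illustrates the idea for a simplified tagging scheme in which a run of consecutive tokens with the same `I-` type forms one entity. `extract_entities` and `micro_prf` are hypothetical helpers, not `seqeval` functions, and the tag sequences are made up.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from I-/O tags for one sentence."""
    entities, start, cur_type = set(), None, None
    # A sentinel "O" at the end closes any entity that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        tag_type = tag[2:] if tag != "O" else None
        if tag_type != cur_type:
            if cur_type is not None:
                entities.add((cur_type, start, i - 1))
            start, cur_type = i, tag_type
    return entities


def micro_prf(true_tags, pred_tags):
    """Micro-averaged entity-level precision, recall and F1 over sentences."""
    n_true = n_pred = n_correct = 0
    for t, p in zip(true_tags, pred_tags):
        gold, pred = extract_entities(t), extract_entities(p)
        n_true += len(gold)
        n_pred += len(pred)
        # Only exact (type, start, end) matches count as correct.
        n_correct += len(gold & pred)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example: the two-token PER entity is only partially predicted.
true_tags = [["O", "I-PER", "I-PER", "O", "I-ORG"]]
pred_tags = [["O", "I-PER", "O", "O", "I-ORG"]]
precision, recall, f1 = micro_prf(true_tags, pred_tags)
print(precision, recall, f1)
```

In this example the truncated `PER` span hurts both precision and recall, even though one of its two tokens was tagged correctly. This is why entity-level scores are typically lower than token-level accuracy.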



## Score Example Sentences
Finally, we test the model on a couple of example sentences.

In [14]:
# Example sentences to score; tokens are produced by simple whitespace splitting
sample_text = [
    "Is it true that Jane works at Microsoft?",
    "Joe now lives in Copenhagen."
]
sample_tokens = [x.split() for x in sample_text]

sample_dataset = processor.preprocess(
    text=sample_tokens,
    max_len=MAX_SEQ_LENGTH,
    labels=None,
    label_map=label_map,
    trailing_piece_tag=TRAILING_PIECE_TAG,
)
sample_dataloader = dataloader_from_dataset(
    sample_dataset, batch_size=BATCH_SIZE, num_gpus=None, shuffle=False, distributed=False
)
preds = model.predict(
    test_dataloader=sample_dataloader,
    num_gpus=None,
    verbose=True
)
predicted_labels = model.get_predicted_token_labels(
    predictions=preds,
    label_map=label_map,
    dataset=sample_dataset
)

for i in range(len(sample_text)):
    print("\n", sample_text[i])
    print(pd.DataFrame({"tokens": sample_tokens[i], "labels": predicted_labels[i]}))

Scoring: 100%|██████████| 1/1 [00:00<00:00, 25.31it/s]


 Is it true that Jane works at Microsoft?
       tokens labels
0          Is      O
1          it      O
2        true      O
3        that      O
4        Jane  I-PER
5       works      O
6          at      O
7  Microsoft?  I-ORG

 Joe now lives in Copenhagen.
        tokens labels
0          Joe  I-PER
1          now      O
2        lives      O
3           in      O
4  Copenhagen.  I-LOC
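
Note that whitespace splitting leaves punctuation attached to words (for example `Microsoft?`), whereas the training data treats punctuation marks as separate tokens. One possible alternative, offered as a sketch rather than as this notebook's method, is a small regular-expression tokenizer. `simple_tokenize` is a hypothetical helper.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Is it true that Jane works at Microsoft?")
print(tokens)
```

This pattern also splits hyphenated words such as `part-time` into three tokens, which differs from how the dataset tokenizes them, so a production tokenizer would need more care.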





## For Testing

In [15]:
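# The second-to-last line of the report is the "macro avg" row;
# after splitting on whitespace, fields 2-4 are precision, recall, and f1.
report_splits = report.split('\n')[-2].split()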

sb.glue("precision", float(report_splits[2]))
sb.glue("recall", float(report_splits[3]))
sb.glue("f1", float(report_splits[4]))
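
The indexing above relies on the layout of the text report: the second-to-last line is the `macro avg` row, and after splitting on whitespace the first two fields are the words `macro` and `avg`. The sketch below replays that logic on the summary rows printed earlier in this notebook, assuming the report string ends with a trailing newline.

```python
# Summary rows copied from the classification report printed earlier;
# the trailing newline is an assumption about the report string.
report_text = (
    "micro avg       0.76      0.82      0.79      1069\n"
    "macro avg       0.77      0.82      0.79      1069\n"
)

# Splitting on "\n" leaves an empty last element, so [-2] is the macro avg row.
fields = report_text.split("\n")[-2].split()

# fields[0:2] are the words "macro" and "avg"; the metrics follow.
precision, recall, f1 = (float(x) for x in fields[2:5])
print(precision, recall, f1)
```

Parsing by position is brittle: if the report layout changes, for example if another summary row is appended, the indices would need to be updated.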