Propagation of NER labels

In Named Entity Recognition (NER), a typical dataset contains text tokens and one NER label for each token. For example:

NER Labels: B-P I-P  O  O
      Text: New York is big

Now imagine we have obtained a noisy version of the ground truth text, for example through OCR. The problem becomes: how can we label the noisy tokens?

    NER Labels:  B-P I-P  O  O
       GT Text:  New York is big
    Noisy Text:  New Yo rkis big
    NER Labels:   ?  ?   ?    ?

We can utilize text alignment and propagate the NER labels onto the noisy tokens. We will demonstrate how in the rest of this document.
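To make "text alignment" concrete before using the library, here is a minimal sketch with Python's standard `difflib` module. This is a stand-in chosen for illustration, not necessarily the aligner genalog uses; it only shows which character spans differ between the ground truth and the noisy text.

```python
from difflib import SequenceMatcher

gt_txt = "New York is big"
ocr_txt = "New Yo rkis big"

# autojunk=False disables difflib's heuristic that ignores very frequent characters
matcher = SequenceMatcher(None, gt_txt, ocr_txt, autojunk=False)

# Keep only the spans where the two strings disagree
edits = [
    (tag, gt_txt[i1:i2], ocr_txt[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

for tag, gt_part, ocr_part in edits:
    print(f"{tag:8} GT={gt_part!r} OCR={ocr_part!r}")
```

Each reported edit marks a place where token boundaries or characters shifted, which is the information label propagation needs in order to map ground truth tokens onto noisy ones.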

Tokenization

To ensure the text alignment results are interpreted consistently, we first need to tokenize the ground truth and the OCR’ed (noisy) text.

from genalog.text import ner_label
from genalog.text import preprocess

gt_txt = "New York is big"
ocr_txt = "New Yo rkis big"

# Input to the method
gt_labels = ["B-P", "I-P", "O", "O"]
gt_tokens = preprocess.tokenize(gt_txt) # tokenize into list of tokens
ocr_tokens = preprocess.tokenize(ocr_txt)
# Print the inputs to the method
print(gt_labels)
print(gt_tokens)
print(ocr_tokens)
['B-P', 'I-P', 'O', 'O']
['New', 'York', 'is', 'big']
['New', 'Yo', 'rkis', 'big']
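Since each ground truth token is expected to carry exactly one label, it can help to verify this before propagating. The sketch below is an assumption-laden illustration: `check_token_label_pairs` is a hypothetical helper (not part of genalog), and plain whitespace splitting stands in for `preprocess.tokenize`.

```python
def check_token_label_pairs(tokens, labels):
    """Return (token, label) pairs, or raise if the two lists do not line up."""
    if len(tokens) != len(labels):
        raise ValueError(
            f"{len(tokens)} tokens but {len(labels)} labels; expected one label per token"
        )
    return list(zip(tokens, labels))

# Assumption: whitespace splitting stands in for preprocess.tokenize here
gt_tokens = "New York is big".split()
gt_labels = ["B-P", "I-P", "O", "O"]

pairs = check_token_label_pairs(gt_tokens, gt_labels)
print(pairs)
# [('New', 'B-P'), ('York', 'I-P'), ('is', 'O'), ('big', 'O')]
```

No such check applies to the OCR tokens, because they have no labels yet and their count may differ from the ground truth.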

Label Propagation

We can then run label propagation to obtain the NER labels for the OCR’ed (noisy) tokens.

# Method returns a tuple of 4 elements (ocr_labels, aligned_gt, aligned_ocr, gap_char)
ocr_labels, aligned_gt, aligned_ocr, gap_char = ner_label.propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens)
# Outputs
print(f"OCR labels:           {ocr_labels}")
print(f"Aligned ground truth: {aligned_gt}")
print(f"Aligned OCR text:     {aligned_ocr}")
OCR labels:           ['B-P', 'I-P', 'I-P', 'O']
Aligned ground truth: New Yo@rk is big
Aligned OCR text:     New Yo rk@is big
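To build intuition for how the aligned strings lead to these labels, here is a simplified sketch. It is not genalog's implementation: `propagate_from_alignment` is a hypothetical function that assumes both aligned strings have the same length and that each OCR token takes the label of the ground truth token under its first character. An OCR token that continues a ground truth token already labeled `B-` receives the matching `I-` label.

```python
import re

def propagate_from_alignment(gt_labels, aligned_gt, aligned_ocr, gap_char="@"):
    """Simplified sketch: label OCR tokens from two equal-length aligned strings."""
    if len(aligned_gt) != len(aligned_ocr):
        raise ValueError("aligned strings must have the same length")

    # Map every character position to the index of the GT token it belongs to.
    # A space maps to the token that follows it.
    gt_index_at, gt_index, inside_token = [], 0, False
    for char in aligned_gt:
        if char == " ":
            if inside_token:  # first space after a token: move to the next token
                gt_index += 1
            inside_token = False
        else:
            inside_token = True
        gt_index_at.append(min(gt_index, len(gt_labels) - 1))

    ocr_labels, previous_gt_index = [], None
    for match in re.finditer(r"\S+", aligned_ocr):
        token = match.group()
        if not token.strip(gap_char):
            continue  # only gap characters: no OCR token at this position
        # Position of the first real (non-gap) character of the OCR token
        start = match.start() + len(token) - len(token.lstrip(gap_char))
        current = gt_index_at[start]
        label = gt_labels[current]
        # A second OCR token inside the same GT token continues the entity
        if label.startswith("B-") and current == previous_gt_index:
            label = "I-" + label[2:]
        ocr_labels.append(label)
        previous_gt_index = current
    return ocr_labels

print(propagate_from_alignment(
    ["B-P", "I-P", "O", "O"], "New Yo@rk is big", "New Yo rk@is big"
))
# ['B-P', 'I-P', 'I-P', 'O']
```

On this example the sketch reproduces the labels shown above. Note that the merged token `rkis` begins inside `York`, so it inherits `I-P` even though part of it comes from `is`.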

Display Result After Propagation

print(ner_label.format_label_propagation(gt_tokens, gt_labels, ocr_tokens, ocr_labels, aligned_gt, aligned_ocr))
B-P I-P  O  O   
New York is big 
New Yo@rk is big
||||||.||.||||||
New Yo rk@is big
New Yo  rkis big 
B-P I-P I-P  O   

Final Results

Finally, we format the OCR tokens together with their propagated NER labels.

# Format tokens and labels
print(ner_label.format_labels(ocr_tokens, ocr_labels))
B-P I-P I-P  O   
New Yo  rkis big
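To see what the propagated labels imply for downstream use, the following sketch groups labeled tokens into entity spans. It assumes the `B-`/`I-`/`O` scheme used in this example; `extract_entities` is a hypothetical helper and not part of genalog.

```python
def extract_entities(tokens, labels):
    """Group tokens into (entity_type, text) spans based on B-/I-/O labels."""
    entities = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        # "B" always opens a span; an "I" of a different type is treated as a new span
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if prefix == "O" or starts_new:
            if current_tokens:  # close the span that was open
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):
            current_type = entity_type
            current_tokens.append(token)
    if current_tokens:  # close a span that runs to the end of the input
        entities.append((current_type, " ".join(current_tokens)))
    return entities

ocr_tokens = ["New", "Yo", "rkis", "big"]
ocr_labels = ["B-P", "I-P", "I-P", "O"]
print(extract_entities(ocr_tokens, ocr_labels))
# [('P', 'New Yo rkis')]
```

The resulting span includes the merged token `rkis`, which shows how OCR noise that fuses tokens can stretch an entity beyond its ground truth boundary.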