Text alignment

genalog provides text alignment capabilities. This is most useful in the following situations after you have ran Opitcal Character Recognition (OCR) on the synthetic documents:

  • Text alignment between noisy (OCR result) and grouth truth text

  • NER label propagation using text alignment results (we will cover this in the next page)

genalog provides two methods of alignment:

  1. genalog.text.anchor.align_w_anchor()

  2. genalog.text.alignment.align()

align_w_anchor() implements the Recursive Text Alignment Scheme (RETAS) from the paper A Fast Alignment Scheme for Automatic OCR Evaluation of Books and works best on longer text strings, while align() implement the Needleman-Wunsch algorithm and works best on shorter strings.

We recommend using the align_w_anchor() method on inputs longer than 200 characters. Both methods share the same function contract and are interchangeable.

gt_txt = "New York is big"
noise_txt = "New Yo rkis"

RETAS Method

This is our implementation of The Recursive Text Alignment Scheme (RETAS) from the paper A Fast Alignment Scheme for Automatic OCR Evaluation of Books, as the original paper did not release the algorithm written in Python.

from genalog.text import anchor

# Extra whitespaces are removed
aligned_gt, aligned_noise = anchor.align_w_anchor(gt_txt, noise_txt)
print(f"Aligned ground truth: {aligned_gt}")
print(f"Aligned noise:        {aligned_noise}")
Aligned ground truth: New Yo@rk is big
Aligned noise:        New Yo rk@is@@@@

Hint

@ is the default gap character inserted by the alignment algorithm, you can change the gap character by providing the keyword-argument anchor.align_w_anchor(gt_txt, noise_txt, gap_char=<NEW_CHAR>)

Needleman-Wunsch Algorithm

We use Biopython’s implementation of the Needleman-Wunsch algorithm for text alignment. This algorithm is an exhaustive search for all possible candidates with dynamic programming. It produces weighted score for each candidate and returns those having the highest score. (NOTE that multiple candidates can share the same score)

# Needleman-Wunsch alignment ONLY
from genalog.text import alignment

aligned_gt, aligned_noise = alignment.align(gt_txt, noise_txt)
print(f"Aligned ground truth: {aligned_gt}")
print(f"Aligned noise:        {aligned_noise}")
Aligned ground truth: New Yo@rk is big
Aligned noise:        New Yo rk@is@@@@

Advanced Algorithm Configurations

The Needleman-Wunsch Algorithm algorithm has 4 hyperparameters for tuning candidate scores:

  1. Match Reward - how much the algorithm rewards matching characters

  2. Mismatch Penalty - how much the algorithm penalizes mismatching characters

  3. Gap Penalty - how much the algorithm penalizes for creating a gap with a GAP_CHAR (defaults to ‘@’)

  4. Gap Extension Penalty - how much the algorithm penalizes for extending a gap (ex “@@@@”)

You can find the default values for these four parameters as a constant in the package:

  1. genalog.text.alignment.MATCH_REWARD

  2. genalog.text.alignment.MISMATCH_PENALTY

  3. genalog.text.alignment.GAP_PENALTY

  4. genalog.text.alignment.GAP_EXT_PENALTY

Interpret the Alignment Results

genalog provide additional functionality to interpret the alignment results and produce a relational mapping between the tokens in the noisy and grouth truth text.

from genalog.text import alignment

# Process the aligned strings to find out how the tokens are related
gt_to_noise_mapping, noise_to_gt_mapping = alignment.parse_alignment(aligned_gt, aligned_noise, gap_char="@")
print(f"gt_to_noise: {gt_to_noise_mapping}")
print(f"noise_to_gt: {noise_to_gt_mapping}")
gt_to_noise: [[0], [1, 2], [2], []]
noise_to_gt: [[0], [1], [1, 2], []]

Recall that the ground truth is New York is big while the noisy text is New Yo rkis.

gt_to_noise: [[0], [1, 2], [2], []] can be interpreted as: “the 0th gt token (New) maps to the 0th noisy token (New), the 1st gt token (York) maps to the 1st and 2nd nosity tokens (Yo and rkis), the 2nd token (is) maps to the 2nd noisy token (rkis), and finally, the last gt token (big) cannot be mapped to any noisy token.”

And the vice versa for noise_to_gt: [[0], [1], [1, 2], []]

Formatting Alignment Results

You can use genalog.alignment._format_alignment() for better visual understanding of the alignment results

# Format aligned string for better display
print(alignment._format_alignment(aligned_gt, aligned_noise))
New Yo@rk is @ big@
||||||.||.||||||||.
New Yo rk@is @ big