genalog.text¶
genalog.text.alignment module¶
- genalog.text.alignment.align(gt, noise, gap_char='@')[source]¶
Align two text segments via a sequence alignment algorithm
NOTE: this algorithm is O(N^2) and is NOT efficient for longer text. Please refer to genalog.text.anchor for faster alignment on longer strings.
- Parameters
gt (str) – ground truth text (should not contain GAP_CHAR)
noise (str) – str with ocr noise (should not contain GAP_CHAR)
gap_char (char, optional) – gap char used in alignment algorithm (default: GAP_CHAR)
- Returns
a tuple of aligned ground truth and noise
- Return type
tuple(str, str)
- Invariants:
The returned aligned strings will satisfy the following invariants:
len(aligned_gt) == len(aligned_noise)
number of tokens in gt == number of tokens in aligned_gt
Example:
gt: "New York is big" (num_tokens = 4) aligned_gt: "N@ew @@York @is big@@" (num_tokens = 4)
- genalog.text.alignment.parse_alignment(aligned_gt, aligned_noise, gap_char='@')[source]¶
Parse alignment to pair ground truth tokens with noise tokens
Case 1 (one-to-many):    gt "New York"  aligned_gt "New Yo@rk"  aligned_noise "New Yo rk"  noise "New Yo rk"
Case 2 (many-to-one):    gt "New York"  aligned_gt "New York"   aligned_noise "New@York"   noise "NewYork"
Case 3 (many-to-many):   gt "New York"  aligned_gt "N@ew York"  aligned_noise "N ew@York"  noise "N ewYork"
Case 4 (missing tokens): gt "New York"  aligned_gt "New York"   aligned_noise "New @@@@"   noise "New"
Case 5 (one-to-one):     gt "New York"  aligned_gt "New York"   aligned_noise "New York"   noise "New York"
- Parameters
aligned_gt (str) – ground truth string aligned with the noise string
aligned_noise (str) – noise string aligned with the ground truth
gap_char (char, optional) – gap char used in alignment algorithm. Defaults to GAP_CHAR.
- Returns
a tuple (gt_to_noise_mapping, noise_to_gt_mapping) of two 2D int arrays, where each array defines the mapping from aligned gt tokens to noise tokens and vice versa
- Return type
tuple
Example
Given input
aligned_gt: "N@ew York @is big" /\\ | | | aligned_noise: "N ew@York kis big."
The returned output will be:
([[0,1],[1],[2],[3]], [[0],[0,1],[2],[3]])
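Continuing the example above, the returned mappings pair each aligned gt token with the noise token(s) it overlaps, and vice versa:
>>> from genalog.text import alignment
>>> alignment.parse_alignment("N@ew York @is big", "N ew@York kis big.", gap_char="@")
([[0, 1], [1], [2], [3]], [[0], [0, 1], [2], [3]])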
genalog.text.anchor module¶
The baseline alignment algorithm is slow on long documents. The idea is to break the longer text down into smaller fragments for quicker alignment on individual pieces. We refer to these points of breakage as “anchor words”.
The bulk of this algorithm is to identify these “anchor words”.
This is a re-implementation of the algorithm in the paper “A Fast Alignment Scheme for Automatic OCR Evaluation of Books” (https://ieeexplore.ieee.org/document/6065412)
We rely on genalog.text.alignment to align the subsequences.
- genalog.text.anchor.align_w_anchor(gt, ocr, gap_char='@', max_seg_length=100)[source]¶
A faster alignment scheme for two text segments. This method first breaks the strings into smaller segments at anchor words. Then these smaller segments are aligned.
NOTE: this function shares the same contract as genalog.text.alignment.align() These two methods are interchangeable and their alignment results should be similar.
For example:
Ground Truth: "The planet Mars, I scarcely need remind the reader,"
Noisy Text:   "The plamet Maris, I scacely neee remind te reader,"
Here the unique anchor words are "I", "remind" and "reader". Thus, the algorithm will split the text into the following segment pairs:
"The planet Mars, "    "The plamet Maris, "
"I scarcely need "     "I scacely neee "
"remind the reader,"   "remind te reader,"
And run sequence alignment on each pair.
- Parameters
gt (str) – ground truth text
ocr (str) – text with OCR noise
gap_char (str, optional) – gap char used in alignment algorithm. Defaults to GAP_CHAR.
max_seg_length (int, optional) – maximum segment length. Segments longer than this threshold will continue to be split recursively into smaller segments. Defaults to MAX_ALIGN_SEGMENT_LENGTH.
- Returns
(aligned_gt, aligned_noise)
- Return type
a tuple (str, str) of aligned ground truth and noise
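A short sketch using the example strings above. Since align_w_anchor() shares the contract of genalog.text.alignment.align(), the returned strings have equal length and use the gap char for insertions:
>>> from genalog.text import anchor
>>> gt = "The planet Mars, I scarcely need remind the reader,"
>>> ocr = "The plamet Maris, I scacely neee remind te reader,"
>>> aligned_gt, aligned_noise = anchor.align_w_anchor(gt, ocr)
>>> len(aligned_gt) == len(aligned_noise)
True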
- genalog.text.anchor.find_anchor_recur(gt_tokens, ocr_tokens, start_pos_gt=0, start_pos_ocr=0, max_seg_length=100)[source]¶
Recursively find anchor positions in the gt and ocr text
- Parameters
gt_tokens (list) – a list of ground truth tokens
ocr_tokens (list) – a list of tokens from OCR’ed document
start_pos_gt (int, optional) – a constant added to all the resulting gt token indices. Defaults to 0.
start_pos_ocr (int, optional) – a constant added to all the resulting ocr token indices. Defaults to 0.
max_seg_length (int, optional) – trigger recursion if any text segment is larger than this. Defaults to MAX_ALIGN_SEGMENT_LENGTH.
- Raises
ValueError – when there is a different number of anchor points in gt and ocr.
- Returns
two lists of token indices, where each list gives the positions of the anchor words in the input gt_tokens and ocr_tokens, respectively
- Return type
tuple
- genalog.text.anchor.get_anchor_map(gt_tokens, ocr_tokens, min_anchor_len=2)[source]¶
Find the location of anchor words in both the gt and ocr text. Anchor words are locations where we can split both the source gt and ocr text into smaller text fragments for faster alignment.
- Parameters
gt_tokens (list) – a list of ground truth tokens
ocr_tokens (list) – a list of tokens from OCR’ed document
min_anchor_len (int, optional) – minimum length of an anchor word. Defaults to 2.
- Returns
a 2-element tuple (anchor_map_gt, anchor_map_ocr), where:
anchor_map_gt is a word_map that locates all the anchor words in the gt tokens
anchor_map_ocr is a word_map that locates all the anchor words in the ocr tokens
and len(anchor_map_gt) == len(anchor_map_ocr)
- Return type
tuple
For example:
Input:
gt_tokens:  ["b", "a", "c"]
ocr_tokens: ["c", "b", "a"]
Output:
([("b", 0), ("a", 1)], [("b", 1), ("a", 2)])
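The example above written as a doctest-style call (the output line is copied from the example; both returned maps have equal length):
>>> from genalog.text import anchor
>>> anchor.get_anchor_map(["b", "a", "c"], ["c", "b", "a"])
([('b', 0), ('a', 1)], [('b', 1), ('a', 2)])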
- genalog.text.anchor.get_unique_words(tokens, case_sensitive=False)[source]¶
Get a set of unique words from a list of tokens, based on a Counter of word occurrences
- Parameters
tokens (list) – a list of word tokens from which to collect unique words
case_sensitive (bool, optional) – whether unique words are case sensitive. Defaults to False.
- Returns
a set of unique words (original alphabetical case of the word is preserved)
- Return type
set
- genalog.text.anchor.get_word_map(unique_words, src_tokens)[source]¶
Arrange the set of unique words by the order they originally appear in the text
- Parameters
unique_words (set) – a set of unique words
src_tokens (list) – a list of tokens
- Returns
a word_map: a list of word coordinate tuples (word, word_index), defined as follows:
word is a typical word token
word_index is the index of the word in the source token array
- Return type
list
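A hedged sketch of the word_map structure; the inputs here are hypothetical, and the tuples are ordered by where each word first appears in src_tokens:
>>> from genalog.text import anchor
>>> anchor.get_word_map({"York", "big"}, ["New", "York", "is", "big"])
[('York', 1), ('big', 3)]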
genalog.text.conll_format module¶
This is a utility tool to create CoNLL-formatted token+label files for OCR’ed text by extracting text from GROK OCR output JSON files and propagating labels from clean text to OCR text.
Usage:
conll_format.py [-h] [--train_subset] [--test_subset]
[--gt_folder GT_FOLDER]
base_folder degraded_folder
Positional Argument:
base_folder base directory containing the collection of datasets
degraded_folder directory name containing the train and test subsets for a given degradation
Optional Arguments:
--train_subset include if only train directory should be processed
--test_subset include if only test directory should be processed
--gt_folder GT_FOLDER directory name containing the ground truth (defaults to `shared`)
Seek Help:
-h, --help show this help message and exit
Example Usage:
# to run for specified degradation of the dataset on both train and test
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all'
# to run for specified degradation of the dataset and ground truth
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all' --gt_folder='shared'
# to run for specified degradation of the dataset on only test subset
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all' --test_subset
# to run for specified degradation of the dataset on only train subset
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all' --train_subset
- genalog.text.conll_format.check_n_sentences(clean_labels_dir, output_labels_dir, clean_label_ext)[source]¶
check_n_sentences prints the file name if the number of sentences differs between the clean and OCR files
- Parameters
clean_labels_dir (str) – path of directory with clean labels (CoNLL formatted, containing tokens and corresponding labels)
output_labels_dir (str) – path of directory with ocr labels (CoNLL formatted, containing tokens and corresponding labels)
- genalog.text.conll_format.extract_ocr_text(input_file, output_file)[source]¶
extract_ocr_text extracts OCR text from a GROK JSON file
- Parameters
input_file (str) – file path of input file
output_file (str) – file path of output file
- genalog.text.conll_format.for_all_files(input_dir, output_dir, func)[source]¶
for_all_files will apply a function to every file in a directory
- Parameters
input_dir (str) – directory with input files
output_dir (str) – directory for output files
func (function) – function to be applied to all files in input_dir
- genalog.text.conll_format.propagate_labels_sentences(clean_tokens, clean_labels, clean_sentences, ocr_tokens)[source]¶
propagate_labels_sentences propagates clean labels for clean tokens to ocr tokens and splits ocr tokens into sentences
- Parameters
clean_tokens (list) – list of tokens in clean text
clean_labels (list) – list of labels corresponding to clean tokens
clean_sentences (list) – list of sentences (each sentence is a list of tokens)
ocr_tokens (list) – list of tokens in ocr text
- Returns
a list of ocr sentences (each sentence is a list of tokens), and a list of labels for the ocr sentences
- Return type
list, list
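A hedged single-sentence sketch; the tokens and labels below are hypothetical (real callers read them from CoNLL-formatted files and from the OCR text extracted by extract_ocr_text):
>>> from genalog.text import conll_format
>>> clean_tokens = ["New", "York", "is", "big"]
>>> clean_labels = ["B-LOC", "I-LOC", "O", "O"]
>>> clean_sentences = [["New", "York", "is", "big"]]
>>> ocr_tokens = ["New", "Yo", "rk", "is", "big"]
>>> ocr_sentences, ocr_labels = conll_format.propagate_labels_sentences(
...     clean_tokens, clean_labels, clean_sentences, ocr_tokens)
>>> # ocr_sentences: one token list per sentence; ocr_labels: one label list per sentence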
- genalog.text.conll_format.propagate_labels_sentences_multiprocess(clean_labels_dir, output_text_dir, output_labels_dir, clean_label_ext)[source]¶
propagate_labels_sentences_multiprocess propagates labels and sentences for all files in the dataset
- Parameters
clean_labels_dir (str) – path of directory with clean labels (CoNLL formatted, containing tokens and corresponding labels)
output_text_dir (str) – path of directory with ocr text
output_labels_dir (str) – path of directory with ocr labels (CoNLL formatted, containing tokens and corresponding labels)
clean_label_ext (str) – file extension of the clean labels
genalog.text.lcs module¶
genalog.text.ner_label module¶
- genalog.text.ner_label._propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, gap_char='@', use_anchor=True)[source]¶
Propagate NER labels from ground truth tokens to ocr tokens. Low-level implementation.
NOTE: gt_tokens and ocr_tokens MUST NOT contain invalid tokens. Invalid tokens are:
1. non-atomic tokens, or space-separated strings (“New York”)
2. multiple occurrences of the GAP_CHAR (‘@@@’)
3. empty strings (“”)
4. strings containing only spaces (” “)
Case Analysis:
MULTI-TOKEN LABELS
Case 1 (one-to-many):             gt "New York" (B-p I-p)          ->  ocr "N ew Yo rk" (B-p I-p I-p I-p)
Case 2 (many-to-one):             gt "New York" (B-p I-p)          ->  ocr "NewYork" (B-p)
Case 3 (many-to-many, Case 1&2):  gt "New York" (B-p I-p)          ->  ocr "N ewYork" (B-p I-p)
Case 4 (missing tokens, I-label): gt "New York" (B-p I-p)          ->  ocr "New" (B-p)
Case 5 (missing tokens, B-label): gt "New York City" (B-p I-p I-p) ->  ocr "York City" (B-p I-p)
SINGLE-TOKEN LABELS
Case 1 (one-to-many):             gt "something" (O)               ->  ocr "so me thing" (o o o)
Case 2 (many-to-one):             gt "is big" (V O)                ->  ocr "isbig" (V)
Case 3 (many-to-many, Case 1&2):  gt "this is huge" (O V W)        ->  ocr "th isi shuge" (O O V)
Case 4 (missing tokens):          gt "is big" (O O)                ->  ocr "is" (O)
- Parameters
gt_labels (list) – a list of NER labels for the ground truth tokens
gt_tokens (list) – a list of ground truth string tokens
ocr_tokens (list) – a list of OCR’ed text tokens
gap_char (char, optional) – gap char used in alignment algorithm. Defaults to alignment.GAP_CHAR.
use_anchor (bool, optional) – use faster alignment method with anchors if set to True. Defaults to True.
- Raises
ValueError – when
1. there is an unequal number of gt_tokens and gt_labels
2. there is a non-atomic token in gt_tokens or ocr_tokens
3. there is an empty string in gt_tokens or ocr_tokens
4. there is a token consisting of space characters only in gt_tokens or ocr_tokens
5. gt_to_ocr_mapping has more tokens than gt_tokens
GapCharError – when
1. there is a token consisting of GAP_CHAR only
- Returns
(ocr_labels, aligned_gt, aligned_ocr, gap_char), where:
ocr_labels is a list of NER labels for the corresponding ocr tokens
aligned_gt is the ground truth string aligned with the ocr text
aligned_ocr is the ocr text aligned with the ground truth
gap_char is the char used by the alignment for inserting gaps
- Return type
a tuple of 4 elements
For example, given input:
>>> _propagate_label_to_ocr(
...     ["B-place", "I-place", "o", "o"],
...     ["New", "York", "is", "big"],
...     ["N", "ewYork", "big"]
... )
(["B-place", "I-place", "o"], "N@ew York is big", "N ew@York@@@ big", '@')
- genalog.text.ner_label.correct_ner_labels(labels)[source]¶
Correct the given list of labels for the following case:
Missing B-Label (i.e. I-PLACE I-PLACE -> B-PLACE I-PLACE)
- Parameters
labels (list) – list of NER labels
- Returns
a list of NER labels
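A small sketch of the correction described above:
>>> from genalog.text import ner_label
>>> ner_label.correct_ner_labels(["I-PLACE", "I-PLACE"])
['B-PLACE', 'I-PLACE']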
- genalog.text.ner_label.format_label_propagation(gt_tokens, gt_labels, ocr_tokens, ocr_labels, aligned_gt, aligned_ocr, show_alignment=True)[source]¶
Format label propagation for display
- Parameters
gt_tokens (list) – list of ground truth tokens
gt_labels (list) – list of NER labels for ground truth tokens
ocr_tokens (list) – list of OCR’ed text tokens
ocr_labels (list) – list of NER labels for the OCR’ed tokens
aligned_gt (str) – ground truth string aligned with the OCR’ed text
aligned_ocr (str) – OCR’ed text aligned with ground truth
show_alignment (bool, optional) – if True, show the alignment result. Defaults to True.
- Returns
a string formatted for display as follows:
- Return type
str
if show_alignment:
    "B-PLACE I-PLACE V O"  # [gt_labels]
    "New York is big"      # [gt_txt]
    "New York is big"      # [aligned_gt]
    "||||....|||||||"
    "New @@@@ is big"      # [aligned_ocr]
    "New is big "          # [ocr_txt]
    "B-PLACE V O "         # [ocr_labels]
else:
    "B-PLACE I-PLACE V O"  # [gt_labels]
    "New York is big"      # [gt_txt]
    "New is big"           # [ocr_txt]
    "B-PLACE V O"          # [ocr_labels]
- genalog.text.ner_label.format_labels(tokens, labels, label_top=True)[source]¶
Format tokens and their NER label for display
- Parameters
tokens (list) – a list of word tokens
labels (list) – a list of NER labels
label_top (bool, optional) – True if the label is placed on top of the token. Defaults to True.
- Returns
a str with each NER label aligned to the token it is labeling
Given inputs:
tokens: ["New", "York", "is", "big"]
labels: ["B-place", "I-place", "o", "o"]
label_top: True
Outputs:
"B-place I-place o o "
"New York is big"
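The same example as a doctest-style call; the exact column padding in the returned string depends on token and label widths, so the printed output is only described in the comment:
>>> from genalog.text import ner_label
>>> formatted = ner_label.format_labels(
...     ["New", "York", "is", "big"],
...     ["B-place", "I-place", "o", "o"],
...     label_top=True)
>>> print(formatted)  # labels on one line, aligned above their tokens on the next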
- genalog.text.ner_label.propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, use_anchor=True)[source]¶
Propagate NER labels from ground truth tokens to ocr tokens.
- NOTE that gt_tokens and ocr_tokens MUST NOT contain invalid tokens.
- Invalid tokens are:
1. non-atomic tokens, or space-separated strings (“New York”)
3. empty strings (“”)
4. strings containing only spaces (” “)
- Parameters
gt_labels (list) – a list of NER labels for the ground truth tokens
gt_tokens (list) – a list of ground truth string tokens
ocr_tokens (list) – a list of OCR’ed text tokens
gap_char (char, optional) – gap char used in alignment algorithm. Defaults to alignment.GAP_CHAR.
use_anchor (bool, optional) – use faster alignment method with anchors if set to True. Defaults to True.
- Raises
GapCharError – when the set of input characters is EQUAL to the set of all possible gap characters (GAP_CHAR_SET)
- Returns
a tuple of 4 elements (ocr_labels, aligned_gt, aligned_ocr, gap_char), where:
1. ocr_labels is a list of NER labels for the corresponding ocr tokens
2. aligned_gt is the ground truth string aligned with the ocr text
3. aligned_ocr is the ocr text aligned with the ground truth
4. gap_char is the char used by the alignment for inserting gaps
- Return type
tuple
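A usage sketch mirroring the low-level example earlier in this module; the public function appears to choose the gap char itself (hence the GapCharError above) and returns the same four elements:
>>> from genalog.text import ner_label
>>> ocr_labels, aligned_gt, aligned_ocr, gap_char = ner_label.propagate_label_to_ocr(
...     ["B-place", "I-place", "o", "o"],
...     ["New", "York", "is", "big"],
...     ["N", "ewYork", "big"])
>>> ocr_labels
['B-place', 'I-place', 'o']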
genalog.text.preprocess module¶
- genalog.text.preprocess.is_sentence_separator(token)[source]¶
Returns true if the token is a sentence splitter
- genalog.text.preprocess.join_tokens(tokens)[source]¶
Join a list of tokens into a string
- Parameters
tokens (list) – a list of tokens
- Returns
a string with space-separated tokens
- genalog.text.preprocess.remove_non_ascii(token, replacement='_')[source]¶
Remove non-ASCII characters in a token
- Parameters
token (str) – a word token
replacement (str, optional) – a replacement character for non-ASCII characters. Defaults to NON_ASCII_REPLACEMENT.
- Returns
str – a word token with non-ASCII characters removed
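A small sketch, assuming each non-ASCII character is replaced with the replacement char (default '_'); the exact handling of multi-byte characters is not documented here, so treat the outputs as illustrative:
>>> from genalog.text import preprocess
>>> preprocess.remove_non_ascii("café")
'caf_'
>>> preprocess.remove_non_ascii("café", replacement="-")
'caf-'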