genalog.text

genalog.text.alignment module

genalog.text.alignment.align(gt, noise, gap_char='@')[source]

Align two text segments via sequence alignment algorithm

NOTE: this algorithm is O(N^2) and is NOT efficient for longer text. Please refer to genalog.text.anchor for faster alignment on longer strings.

Parameters
  • gt (str) – ground truth text (should not contain GAP_CHAR)

  • noise (str) – str with ocr noise (should not contain GAP_CHAR)

  • gap_char (char, optional) – gap char used in alignment algorithm (default: GAP_CHAR)

Returns

a tuple of aligned ground truth and noise

Return type

tuple(str, str)

Invariants:

The returned aligned strings will satisfy the following invariants:

  1. len(aligned_gt) == len(aligned_noise)

  2. number of tokens in gt == number of tokens in aligned_gt

Example:

        gt: "New York is big" (num_tokens = 4)
aligned_gt: "N@ew @@York @is big@@" (num_tokens = 4)
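
The quadratic cost noted above comes from filling a score table with one cell per character pair. The following is a minimal Needleman-Wunsch sketch, not genalog's implementation: the name `global_align` and the scoring values are assumptions chosen for illustration. It satisfies the first invariant (equal lengths) but makes no attempt to enforce the second (token count preservation).

```python
def global_align(gt, noise, gap_char="@", match=1, mismatch=-1, gap=-1):
    """Hypothetical helper: plain global alignment of two strings."""
    n, m = len(gt), len(noise)
    # score[i][j] = best score for aligning gt[:i] with noise[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if gt[i - 1] == noise[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align two characters
                score[i - 1][j] + gap,      # gap in noise
                score[i][j - 1] + gap,      # gap in gt
            )

    # Trace back from the bottom-right cell to recover the alignment.
    out_gt, out_noise = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and gt[i - 1] == noise[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_gt.append(gt[i - 1])
            out_noise.append(noise[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_gt.append(gt[i - 1])
            out_noise.append(gap_char)
            i -= 1
        else:
            out_gt.append(gap_char)
            out_noise.append(noise[j - 1])
            j -= 1
    return "".join(reversed(out_gt)), "".join(reversed(out_noise))


aligned_gt, aligned_noise = global_align("New York is big", "N ewYork big")
print(aligned_gt)
print(aligned_noise)
```

Both the table fill and its memory footprint grow with len(gt) * len(noise), which is why splitting long inputs into short segments pays off.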
genalog.text.alignment.parse_alignment(aligned_gt, aligned_noise, gap_char='@')[source]

Parse alignment to pair ground truth tokens with noise tokens

               Case 1:         Case 2:         Case 3:         Case 4:         Case 5:
               one-to-many     many-to-one     many-to-many    missing tokens  one-to-one
         gt    "New York"      "New York"      "New York"      "New York"      "New York"
                 |   |           |   |           |   |           |   |           |   |
  aligned_gt   "New Yo@rk"     "New York"      "N@ew York"     "New York"      "New York"
                 |   /\           \/             /\/             |   |           |   |
aligned_noise  "New Yo rk"     "New@York"      "N ew@York"     "New @@@@"      "New York"
                 |   | |           |            |    |           |               |   |
       noise   "New Yo rk"     "NewYork"       "N ewYork"      "New"           "New York"
Parameters
  • aligned_gt (str) – ground truth string aligned with the noise string

  • aligned_noise (str) – noise string aligned with the ground truth

  • gap_char (char, optional) – gap char used in alignment algorithm. Defaults to GAP_CHAR.

Returns

a tuple (gt_to_noise_mapping, noise_to_gt_mapping) of two 2D int arrays, where each array defines the mapping from aligned gt tokens to noise tokens and vice versa.

Return type

tuple

Example

Given input

    aligned_gt: "N@ew York @is big@"
                /\    |    |   |
aligned_noise: "N ew@York kis big."

The returned output will be:

([[0,1],[1],[2],[3]], [[0],[0,1],[2],[3]])
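
One way to derive such mappings is to compare the character spans of the tokens in both aligned strings. The sketch below is a simplified illustration, not genalog's implementation: `token_spans` and `parse_alignment_sketch` are hypothetical names, and the rule "two tokens map to each other when their spans share a position where neither string holds a gap" is an assumption that happens to reproduce the example above.

```python
import re


def token_spans(aligned, gap_char="@"):
    """Return (start, end) spans of tokens, skipping gap-only tokens."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"\S+", aligned)
        if set(m.group()) != {gap_char}
    ]


def parse_alignment_sketch(aligned_gt, aligned_noise, gap_char="@"):
    if len(aligned_gt) != len(aligned_noise):
        raise ValueError("aligned strings must have equal length")
    gt_spans = token_spans(aligned_gt, gap_char)
    noise_spans = token_spans(aligned_noise, gap_char)
    gt_to_noise = [[] for _ in gt_spans]
    noise_to_gt = [[] for _ in noise_spans]
    for i, (gs, ge) in enumerate(gt_spans):
        for j, (ns, ne) in enumerate(noise_spans):
            lo, hi = max(gs, ns), min(ge, ne)
            # Tokens are paired if they overlap on a real (non-gap) character.
            if any(
                aligned_gt[k] != gap_char and aligned_noise[k] != gap_char
                for k in range(lo, hi)
            ):
                gt_to_noise[i].append(j)
                noise_to_gt[j].append(i)
    return gt_to_noise, noise_to_gt


# The gt string carries a trailing gap so both inputs have equal length.
print(parse_alignment_sketch("N@ew York @is big@", "N ew@York kis big."))
# ([[0, 1], [1], [2], [3]], [[0], [0, 1], [2], [3]])
```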

genalog.text.anchor module

The baseline alignment algorithm is slow on long documents. The idea is to break the longer text into smaller fragments so that each piece can be aligned quickly. We refer to these break points as “anchor words”.

The bulk of this algorithm is to identify these “anchor words”.

This is a re-implementation of the algorithm in the paper “A Fast Alignment Scheme for Automatic OCR Evaluation of Books” (https://ieeexplore.ieee.org/document/6065412)

We rely on genalog.text.alignment to align the subsequences.

genalog.text.anchor.align_w_anchor(gt, ocr, gap_char='@', max_seg_length=100)[source]

A faster alignment scheme of two text segments. This method first breaks the strings into smaller segments with anchor words. Then these smaller segments are aligned.

NOTE: this function shares the same contract as genalog.text.alignment.align(). The two methods are interchangeable and their alignment results should be similar.

For example:

    Ground Truth: "The planet Mars, I scarcely need remind the reader,"
    Noisy Text:   "The plamet Maris, I scacely neee remind te reader,"

    Here the unique anchor words are "I", "remind" and "reader".

    Thus, the algorithm will split the text into the following segment pairs:

        "The planet Mars, "
        "The plamet Maris, "

        "I scarcely need "
        "I scacely neee "

        "remind the reader,"
        "remind te reader,"

    And run sequence alignment on each pair.
Parameters
  • gt (str) – ground truth text

  • ocr (str) – text with ocr noise

  • gap_char (str, optional) – gap char used in alignment algorithm. Defaults to GAP_CHAR.

  • max_seg_length (int, optional) – maximum segment length. Segments longer than this threshold will continue to be split recursively into smaller segments. Defaults to MAX_ALIGN_SEGMENT_LENGTH.

Returns

(aligned_gt, aligned_noise)

Return type

a tuple (str, str) of aligned ground truth and noise
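
The splitting step in the example above can be sketched as follows. This is an illustration under stated assumptions, not genalog's code: `split_at_anchors` is a hypothetical helper, the anchor positions are looked up by hand rather than discovered, and only "I" and "remind" are used as split points so that the result matches the three segment pairs listed in the example.

```python
def split_at_anchors(tokens, anchor_indices):
    """Cut a token list into segments that each start at an anchor index."""
    bounds = [0] + list(anchor_indices) + [len(tokens)]
    return [
        " ".join(tokens[start:end])
        for start, end in zip(bounds, bounds[1:])
        if start < end  # skip empty segments, e.g. an anchor at index 0
    ]


gt = "The planet Mars, I scarcely need remind the reader,".split()
ocr = "The plamet Maris, I scacely neee remind te reader,".split()

# Hand-picked anchors for this illustration.
gt_anchors = [gt.index("I"), gt.index("remind")]
ocr_anchors = [ocr.index("I"), ocr.index("remind")]

pairs = list(zip(split_at_anchors(gt, gt_anchors), split_at_anchors(ocr, ocr_anchors)))
for gt_segment, ocr_segment in pairs:
    print(repr(gt_segment), "<->", repr(ocr_segment))
```

Each pair is short enough to hand to a quadratic aligner, and the aligned pieces can then be joined back together with a space.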

genalog.text.anchor.find_anchor_recur(gt_tokens, ocr_tokens, start_pos_gt=0, start_pos_ocr=0, max_seg_length=100)[source]

Recursively find anchor positions in the gt and ocr text

Parameters
  • gt_tokens (list) – a list of ground truth tokens

  • ocr_tokens (list) – a list of tokens from OCR’ed document

  • start_pos_gt, start_pos_ocr (int, optional) – constants added to all the resulting gt and ocr indices, respectively. Default to 0.

  • max_seg_length (int, optional) – trigger recursion if any text segment is larger than this. Defaults to MAX_ALIGN_SEGMENT_LENGTH.

Raises

ValueError – when there are different numbers of anchor points in gt and ocr.

Returns

two lists of token indices where each list is the position of the anchor in the input gt_tokens and ocr_tokens

Return type

tuple

genalog.text.anchor.get_anchor_map(gt_tokens, ocr_tokens, min_anchor_len=2)[source]

Find the locations of anchor words in both the gt and ocr text. Anchor words are locations where we can split both the source gt and ocr text into smaller text fragments for faster alignment.

Parameters
  • gt_tokens (list) – a list of ground truth tokens

  • ocr_tokens (list) – a list of tokens from OCR’ed document

  • min_anchor_len (int, optional) – minimum len of the anchor word. Defaults to 2.

Returns

a 2-element (anchor_map_gt, anchor_map_ocr) tuple, where

  1. anchor_map_gt is a word_map that locates all the anchor words in the gt tokens

  2. anchor_map_ocr is a word_map that locates all the anchor words in the ocr tokens

and len(anchor_map_gt) == len(anchor_map_ocr)

Return type

tuple

For example:
    Input:
        gt_tokens:  ["b", "a", "c"]
        ocr_tokens: ["c", "b", "a"]
    Output:
        ([("b", 0), ("a", 1)], [("b", 1), ("a", 2)])
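
In the example above, "c" is dropped because keeping it would make the anchors appear in a different order in the two texts. A minimal sketch of that idea follows. It is not genalog's implementation: all function names are hypothetical, the matching is case sensitive for brevity, and using a longest common subsequence to keep only order-consistent anchors is an assumption.

```python
from collections import Counter


def unique_words(tokens):
    """Words that occur exactly once in the token list."""
    counts = Counter(tokens)
    return {tok for tok in tokens if counts[tok] == 1}


def word_map(words, tokens):
    """(word, index) pairs for the given words, in order of appearance."""
    return [(tok, i) for i, tok in enumerate(tokens) if tok in words]


def lcs(a, b):
    """Longest common subsequence of two lists via dynamic programming."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def anchor_map_sketch(gt_tokens, ocr_tokens, min_anchor_len=2):
    common = unique_words(gt_tokens) & unique_words(ocr_tokens)
    common = {w for w in common if len(w) >= min_anchor_len}
    gt_map = word_map(common, gt_tokens)
    ocr_map = word_map(common, ocr_tokens)
    # Keep only anchors that appear in the same relative order in both texts.
    keep = set(lcs([w for w, _ in gt_map], [w for w, _ in ocr_map]))
    return (
        [pair for pair in gt_map if pair[0] in keep],
        [pair for pair in ocr_map if pair[0] in keep],
    )


# Single-letter tokens need min_anchor_len=1 to qualify in this sketch.
print(anchor_map_sketch(["b", "a", "c"], ["c", "b", "a"], min_anchor_len=1))
# ([('b', 0), ('a', 1)], [('b', 1), ('a', 2)])
```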
genalog.text.anchor.get_unique_words(tokens, case_sensitive=False)[source]

Get a set of unique words from the word occurrence counts of a list of tokens

Parameters
  • tokens (list) – a list of word tokens

  • case_sensitive (bool, optional) – whether unique words are case sensitive. Defaults to False.

Returns

a set of unique words (original alphabetical case of the word is preserved)

Return type

set
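
A possible reading of this contract, sketched with `collections.Counter`. The name `unique_words_sketch` is hypothetical, and treating "unique" as "occurs exactly once" is an assumption; with case_sensitive=False the counting ignores case while the returned words keep their original spelling.

```python
from collections import Counter


def unique_words_sketch(tokens, case_sensitive=False):
    # Count occurrences on a normalized key, but return the original tokens.
    normalize = (lambda tok: tok) if case_sensitive else str.lower
    counts = Counter(normalize(tok) for tok in tokens)
    return {tok for tok in tokens if counts[normalize(tok)] == 1}


tokens = ["The", "cat", "the", "Mat"]
print(unique_words_sketch(tokens))                       # "The"/"the" collide
print(unique_words_sketch(tokens, case_sensitive=True))  # all four are unique
```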

genalog.text.anchor.get_word_map(unique_words, src_tokens)[source]

Arrange the set of unique words in the order they originally appear in the text

Parameters
  • unique_words (set) – a set of unique words

  • src_tokens (list) – a list of tokens

Returns

a word_map: a list of word coordinate tuples (word, word_index) defined as follows:

  1. word is a typical word token

  2. word_index is the index of the word in the source token array

Return type

list

genalog.text.anchor.segment_len(tokens)[source]

Get length of the segment

Parameters

tokens (list) – a list of tokens

Returns

the length of the segment

Return type

int

genalog.text.conll_format module

This is a utility tool to create CoNLL-formatted token+label files for OCR’ed text by extracting text from grok OCR output JSON files and propagating labels from clean text to OCR text.

Usage:

conll_format.py [-h] [--train_subset] [--test_subset]
                [--gt_folder GT_FOLDER]
                base_folder degraded_folder

Positional Argument:
    base_folder            base directory containing the collection of dataset
    degraded_folder        directory name containing train and test subset for degradation

Optional Arguments:
    --train_subset            include if only train directory should be processed
    --test_subset             include if only test directory should be processed
    --gt_folder GT_FOLDER     directory name containing ground truth (default to `shared`)

Seek Help:
    -h, --help                show this help message and exit

Example Usage:

# to run for specified degradation of the dataset on both train and test
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all'

# to run for specified degradation of the dataset and ground truth
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all' --gt_folder='shared'

# to run for specified degradation of the dataset on only test subset
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all' --test_subset

# to run for specified degradation of the dataset on only train subset
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all' --train_subset
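
For readers who want to script around this command line, the usage text above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the usage message only; `build_parser` is a hypothetical name and the module's real parser may differ.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="conll_format.py")
    parser.add_argument(
        "base_folder", help="base directory containing the collection of dataset"
    )
    parser.add_argument(
        "degraded_folder",
        help="directory name containing train and test subset for degradation",
    )
    parser.add_argument(
        "--train_subset",
        action="store_true",
        help="include if only train directory should be processed",
    )
    parser.add_argument(
        "--test_subset",
        action="store_true",
        help="include if only test directory should be processed",
    )
    parser.add_argument(
        "--gt_folder",
        default="shared",
        help="directory name containing ground truth (default to `shared`)",
    )
    return parser


args = build_parser().parse_args(
    ["/data/enki/datasets/synthetic_dataset/", "hyphens_all", "--test_subset"]
)
print(args.base_folder, args.degraded_folder, args.test_subset, args.gt_folder)
```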
genalog.text.conll_format.check_n_sentences(clean_labels_dir, output_labels_dir, clean_label_ext)[source]

check_n_sentences prints file name if number of sentences is different in clean and OCR files

Parameters
  • clean_labels_dir (str) – path of directory with clean labels - CoNLL formatted so contains tokens and corresponding labels

  • output_labels_dir (str) – path of directory with ocr labels - CoNLL formatted so contains tokens and corresponding labels

genalog.text.conll_format.extract_ocr_text(input_file, output_file)[source]

extract_ocr_text from GROK json

Parameters
  • input_file (str) – file path of input file

  • output_file (str) – file path of output file

genalog.text.conll_format.for_all_files(input_dir, output_dir, func)[source]

for_all_files will apply function to every file in a directory

Parameters
  • input_dir (str) – directory with input files

  • output_dir (str) – directory for output files

  • func (function) – function to be applied to all files in input_dir
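
The pattern described here can be sketched as below. Both function names are hypothetical, and the assumption that func is called as func(input_file, output_file), with the output file keeping the input file's name, is inferred from the signatures of the other helpers in this module rather than from its source.

```python
import os


def for_all_files_sketch(input_dir, output_dir, func):
    """Call func(input_path, output_path) for every regular file in input_dir."""
    os.makedirs(output_dir, exist_ok=True)
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        if os.path.isfile(src):
            func(src, os.path.join(output_dir, name))


def drop_first_line(input_file, output_file):
    """Example per-file function: copy a file without its first line."""
    with open(input_file, encoding="utf-8") as src:
        lines = src.readlines()
    with open(output_file, "w", encoding="utf-8") as dst:
        dst.writelines(lines[1:])
```

A call such as for_all_files_sketch("clean", "clean_fixed", drop_first_line) would then write a trimmed copy of every file in "clean" into "clean_fixed"; both directory names are placeholders.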

genalog.text.conll_format.propagate_labels_sentences(clean_tokens, clean_labels, clean_sentences, ocr_tokens)[source]

propagate_labels_sentences propagates clean labels for clean tokens to ocr tokens and splits ocr tokens into sentences

Parameters
  • clean_tokens (list) – list of tokens in clean text

  • clean_labels (list) – list of labels corresponding to clean tokens

  • clean_sentences (list) – list of sentences (each sentence is a list of tokens)

  • ocr_tokens (list) – list of tokens in ocr text

Returns

a list of ocr sentences (each sentence is a list of tokens) and a list of labels for the ocr sentences

Return type

list, list

genalog.text.conll_format.propagate_labels_sentences_multiprocess(clean_labels_dir, output_text_dir, output_labels_dir, clean_label_ext)[source]

propagate_labels_sentences_multiprocess propagates labels and sentences for all files in dataset

Parameters
  • clean_labels_dir (str) – path of directory with clean labels - CoNLL formatted so contains tokens and corresponding labels

  • output_text_dir (str) – path of directory with ocr text

  • output_labels_dir (str) – path of directory with ocr labels - CoNLL formatted so contains tokens and corresponding labels

  • clean_label_ext (str) – file extension of the clean_labels

genalog.text.conll_format.remove_first_line(input_file, output_file)[source]

remove_first_line from files (some clean CoNLL files have an empty first line)

Parameters
  • input_file (str) – input file path

  • output_file (str) – output file path

genalog.text.conll_format.remove_last_line(input_file, output_file)[source]

remove_last_line from files (some clean CoNLL files have an empty last line)

Parameters
  • input_file (str) – input file path

  • output_file (str) – output file path

genalog.text.lcs module

class genalog.text.lcs.LCS(str_m, str_n)[source]

Compute the Longest Common Subsequence (LCS) of two given strings.

genalog.text.ner_label module

exception genalog.text.ner_label.GapCharError[source]
genalog.text.ner_label._propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, gap_char='@', use_anchor=True)[source]

Propagate NER labels for ground truth tokens to ocr tokens. Low-level implementation.

NOTE: gt_tokens and ocr_tokens MUST NOT contain invalid tokens. Invalid tokens are:

  1. non-atomic tokens, or space-separated strings (“New York”)

  2. multiple occurrences of the GAP_CHAR (‘@@@’)

  3. empty strings (“”)

  4. strings consisting of spaces only (“ ”)

 Case Analysis:
 ******************************** MULTI-TOKEN-LABELS ********************************

             Case 1:         Case 2:         Case 3:         Case 4:         Case 5:
             one-to-many     many-to-one     many-to-many    missing tokens  missing tokens
                                            (Case 1&2 comb)  (I-label)       (B-label)
 gt label     B-p    I-p      B-p I-p        B-p   I-p       B-p  I-p        B-p  I-p  I-p
               |      |        |   |          |     |         |    |          |   |     |
 gt_token     New    York     New York       New  York       New York        New York City
              / \    / \        \ /           /\   /          |                   |     |
ocr_token    N   ew Yo  rk    NewYork        N ew@York       New                 York City
             |   |   |   |       |           |    |           |                   |     |
ocr label   B-p I-p I-p I-p     B-p          B-p I-p         B-p                 B-p   I-p

 ******************************** SINGLE-TOKEN-LABELS ********************************

             Case 1:         Case 2:         Case 3:         Case 4:
             one-to-many     many-to-one     many-to-many    missing tokens
                                            (Case 1&2 comb)
 gt label         O           V    O          O   V   W       O   O
                  |           |    |          |   |   |       |   |
 gt_token     something       is  big       this is huge      is big
              / \    \          \ /          /\  /\ /         |
ocr_token    so  me  thing     isbig       th isi shuge       is
             |   |     |         |          |  |    |         |
ocr label    o   o     o         V          O  O    V         O
Parameters
  • gt_labels (list) – a list of NER label for ground truth token

  • gt_tokens (list) – a list of ground truth string tokens

  • ocr_tokens (list) – a list of OCR’ed text tokens

  • gap_char (char, optional) – gap char used in alignment algorithm. Defaults to alignment.GAP_CHAR.

  • use_anchor (bool, optional) – use faster alignment method with anchors if set to True. Defaults to True.

Raises
  • ValueError – when any of the following holds:

      1. there is an unequal number of gt_tokens and gt_labels

      2. there is a non-atomic token in gt_tokens or ocr_tokens

      3. there is an empty string in gt_tokens or ocr_tokens

      4. there is a token consisting of space characters only in gt_tokens or ocr_tokens

      5. gt_to_ocr_mapping has more tokens than gt_tokens

  • GapCharError – when there is a token consisting of GAP_CHAR only

Returns

a tuple of 4 elements (ocr_labels, aligned_gt, aligned_ocr, gap_char) where

  1. ocr_labels is a list of NER labels for the corresponding ocr tokens

  2. aligned_gt is the ground truth string aligned with the ocr text

  3. aligned_ocr is the ocr text aligned with the ground truth

  4. gap_char is the char used by the alignment for inserting gaps

Return type

tuple

For example, given input:

>>> _propagate_label_to_ocr(
    ["B-place", "I-place", "o", "o"],
    ["New", "York", "is", "big"],
    ["N", "ewYork", "big"]
)
(["B-place", "I-place", "o"], "N@ew York is big", "N ew@York@@@ big", '@')
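
The label assignment in this example can be reproduced with a simple rule, sketched below. This is an illustration and not the library's algorithm: `propagate_labels_sketch` is a hypothetical name, it takes a precomputed OCR-to-gt token mapping instead of running an alignment, and the rule (take the label of the first mapped gt token, turning a B- label into I- once that gt token has already labeled an earlier OCR token) is an assumption read off the multi-token case table above. The default label for unmapped OCR tokens is also an assumption.

```python
def propagate_labels_sketch(gt_labels, ocr_to_gt, default="O"):
    """Assign each OCR token a label from the gt tokens it maps to."""
    ocr_labels = []
    used = set()  # gt token indices that already labeled an OCR token
    for gt_indices in ocr_to_gt:
        if not gt_indices:
            ocr_labels.append(default)  # assumption: unmapped tokens get "O"
            continue
        first = gt_indices[0]
        label = gt_labels[first]
        if label.startswith("B-") and first in used:
            # The entity was already opened by an earlier OCR fragment.
            label = "I-" + label[2:]
        used.add(first)
        ocr_labels.append(label)
    return ocr_labels


gt_labels = ["B-place", "I-place", "o", "o"]
# "N" -> "New"; "ewYork" -> "New", "York"; "big" -> "big"
ocr_to_gt = [[0], [0, 1], [3]]
print(propagate_labels_sketch(gt_labels, ocr_to_gt))
# ['B-place', 'I-place', 'o']
```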
genalog.text.ner_label.correct_ner_labels(labels)[source]

Correct the given list of labels for the following case:

  1. Missing B-Label (i.e. I-PLACE I-PLACE -> B-PLACE I-PLACE)

Parameters

labels (list) – list of NER labels

Returns

a list of NER labels
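
The correction described above can be sketched in a few lines. The name `correct_ner_labels_sketch` is hypothetical, and treating an I- label that follows a different entity type as a missing B- label is an assumption beyond the single case documented here.

```python
def correct_ner_labels_sketch(labels):
    """Turn an I- label that does not continue an entity into a B- label."""
    fixed = []
    previous_entity = None  # entity type of the preceding label, if any
    for label in labels:
        if label.startswith("I-"):
            entity = label[2:]
            if previous_entity != entity:
                label = "B-" + entity
            previous_entity = entity
        elif label.startswith("B-"):
            previous_entity = label[2:]
        else:
            previous_entity = None
        fixed.append(label)
    return fixed


print(correct_ner_labels_sketch(["I-PLACE", "I-PLACE"]))
# ['B-PLACE', 'I-PLACE']
```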

genalog.text.ner_label.format_label_propagation(gt_tokens, gt_labels, ocr_tokens, ocr_labels, aligned_gt, aligned_ocr, show_alignment=True)[source]

Format label propagation for display

Parameters
  • gt_tokens (list) – list of ground truth tokens

  • gt_labels (list) – list of NER labels for ground truth tokens

  • ocr_tokens (list) – list of OCR’ed text tokens

  • ocr_labels (list) – list of NER labels for the OCR’ed tokens

  • aligned_gt (str) – ground truth string aligned with the OCR’ed text

  • aligned_ocr (str) – OCR’ed text aligned with ground truth

  • show_alignment (bool, optional) – if true, show alignment result. Defaults to True.

Returns

a string formatted for display as follows:

Return type

str

if show_alignment:

    "B-PLACE I-PLACE V  O"      # [gt_labels]
    "New     York    is big"    # [gt_txt]
    "New York is big"           # [aligned_gt]
    "||||....|||||||"
    "New @@@@ is big"           # [aligned_ocr]
    "New     is big "           # [ocr_txt]
    "B-PLACE V  O   "           # [ocr_labels]

else:

    "B-PLACE I-PLACE V  O"     # [gt_labels]
    "New     York    is big"   # [gt_txt]
    "New     is big"           # [ocr_txt]
    "B-PLACE V  O"             # [ocr_labels]
genalog.text.ner_label.format_labels(tokens, labels, label_top=True)[source]

Format tokens and their NER label for display

Parameters
  • tokens (list) – a list of word tokens

  • labels (list) – a list of NER labels

  • label_top (bool, optional) – True if the label is placed on top of the token. Defaults to True.

Returns

a str with each NER label aligned with the token it is labeling

Given inputs:
    tokens: ["New", "York", "is", "big"]
    labels: ["B-place", "I-place", "o", "o"]
    label_top: True

Outputs:
    "B-place I-place o  o "
    "New     York    is big"
genalog.text.ner_label.propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, use_anchor=True)[source]

Propagate NER labels for ground truth tokens to ocr tokens.

NOTE: gt_tokens and ocr_tokens MUST NOT contain invalid tokens.
Invalid tokens are:

  1. non-atomic tokens, or space-separated strings (“New York”)

  2. empty strings (“”)

  3. strings consisting of spaces only (“ ”)

Parameters
  • gt_labels (list) – a list of NER label for ground truth token

  • gt_tokens (list) – a list of ground truth string tokens

  • ocr_tokens (list) – a list of OCR’ed text tokens

  • gap_char (char, optional) – gap char used in alignment algorithm. Defaults to alignment.GAP_CHAR.

  • use_anchor (bool, optional) – use faster alignment method with anchors if set to True. Defaults to True.

Raises

GapCharError – when the set of input characters is equal to the set of all possible gap characters (GAP_CHAR_SET)

Returns

a tuple of 4 elements (ocr_labels, aligned_gt, aligned_ocr, gap_char) where

  1. ocr_labels is a list of NER labels for the corresponding ocr tokens

  2. aligned_gt is the ground truth string aligned with the ocr text

  3. aligned_ocr is the ocr text aligned with the ground truth

  4. gap_char is the char used by the alignment for inserting gaps

Return type

tuple

genalog.text.preprocess module

genalog.text.preprocess.is_sentence_separator(token)[source]

Returns true if the token is a sentence splitter

genalog.text.preprocess.join_tokens(tokens)[source]

Join a list of tokens into a string

Parameters

tokens (list) – a list of tokens

Returns

a string with space-separated tokens

genalog.text.preprocess.remove_non_ascii(token, replacement='_')[source]

Remove non ascii characters in a token

Parameters
  • token (str) – a word token

  • replacement (str, optional) – a replace character for non-ASCII characters. Defaults to NON_ASCII_REPLACEMENT.

Returns

str – a word token with non-ASCII characters removed
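
As a rough illustration of this behavior, each non-ASCII character can be substituted with the replacement character using a regular expression. The name `remove_non_ascii_sketch` is hypothetical and the one-for-one substitution is an assumption based on the `replacement` parameter described above.

```python
import re


def remove_non_ascii_sketch(token, replacement="_"):
    # Substitute every character outside the 7-bit ASCII range.
    return re.sub(r"[^\x00-\x7F]", replacement, token)


print(remove_non_ascii_sketch("caf\u00e9"))       # caf_
print(remove_non_ascii_sketch("na\u00efve", ""))  # nave
```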

genalog.text.preprocess.split_sentences(text, delimiter='\n')[source]

Split a text into sentences with a delimiter

genalog.text.preprocess.tokenize(s)[source]

Tokenize string

Parameters

s (str) – aligned string

Returns

a list of tokens

genalog.text.splitter module