genalog.text¶
genalog.text.alignment module¶
- genalog.text.alignment.align(gt, noise, gap_char='@')[source]¶
Align two text segments via a sequence alignment algorithm
NOTE: this algorithm is O(N^2) and is NOT efficient for longer text. Please refer to genalog.text.anchor for faster alignment on longer strings.
- Parameters
gt (str) – ground truth text (should not contain GAP_CHAR)
noise (str) – str with ocr noise (should not contain GAP_CHAR)
gap_char (char, optional) – gap char used in alignment algorithm (default: GAP_CHAR)
- Returns
a tuple of aligned ground truth and noise
- Return type
tuple(str, str)
- Invariants:
The returned aligned strings will satisfy the following invariants:
len(aligned_gt) == len(aligned_noise)
number of tokens in gt == number of tokens in aligned_gt
Example:
gt: "New York is big" (num_tokens = 4) aligned_gt: "N@ew @@York @is big@@" (num_tokens = 4)
- genalog.text.alignment.parse_alignment(aligned_gt, aligned_noise, gap_char='@')[source]¶
Parse alignment to pair ground truth tokens with noise tokens
Case 1 (one-to-many):    gt "New York"  aligned_gt "New Yo@rk"  aligned_noise "New Yo rk"  noise "New Yo rk"
Case 2 (many-to-one):    gt "New York"  aligned_gt "New York"   aligned_noise "New@York"   noise "NewYork"
Case 3 (many-to-many):   gt "New York"  aligned_gt "N@ew York"  aligned_noise "N ew@York"  noise "N ewYork"
Case 4 (missing tokens): gt "New York"  aligned_gt "New York"   aligned_noise "New @@@@"   noise "New"
Case 5 (one-to-one):     gt "New York"  aligned_gt "New York"   aligned_noise "New York"   noise "New York"
- Parameters
aligned_gt (str) – ground truth string aligned with the noise string
aligned_noise (str) – noise string aligned with the ground truth
gap_char (char, optional) – gap char used in alignment algorithm. Defaults to GAP_CHAR.
- Returns
a tuple (gt_to_noise_mapping, noise_to_gt_mapping) of two 2D int arrays, where each array defines the mapping from aligned gt tokens to noise tokens and vice versa
- Return type
tuple
Example
Given input
aligned_gt: "N@ew York @is big" /\\ | | | aligned_noise: "N ew@York kis big."
The returned output will be:
([[0,1],[1],[2],[3]], [[0],[0,1],[2],[3]])
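Continuing the example above, the returned mappings pair each aligned gt token with the noise token(s) it overlaps, and vice versa:
>>> from genalog.text import alignment
>>> alignment.parse_alignment("N@ew York @is big", "N ew@York kis big.", gap_char="@")
([[0, 1], [1], [2], [3]], [[0], [0, 1], [2], [3]])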
genalog.text.anchor module¶
The baseline alignment algorithm is slow on long documents. The idea is to break the longer text down into smaller fragments for quicker alignment on individual pieces. We refer to these points of breakage as “anchor words”.
The bulk of this algorithm is to identify these “anchor words”.
This is a re-implementation of the algorithm in the paper “A Fast Alignment Scheme for Automatic OCR Evaluation of Books” (https://ieeexplore.ieee.org/document/6065412)
We rely on genalog.text.alignment to align the subsequences.
- genalog.text.anchor.align_w_anchor(gt, ocr, gap_char='@', max_seg_length=100)[source]¶
A faster alignment scheme for two text segments. This method first breaks the strings into smaller segments at anchor words. Then these smaller segments are aligned.
NOTE: this function shares the same contract as genalog.text.alignment.align() These two methods are interchangeable and their alignment results should be similar.
For example:
Ground Truth: "The planet Mars, I scarcely need remind the reader,"
Noisy Text:   "The plamet Maris, I scacely neee remind te reader,"
Here the unique anchor words are "I", "remind" and "reader". Thus, the algorithm will split the text into the following segment pairs:
"The planet Mars, "    "The plamet Maris, "
"I scarcely need "     "I scacely neee "
"remind the reader,"   "remind te reader,"
And run sequence alignment on each pair.
- Parameters
gt (str) – ground truth text
ocr (str) – text with OCR noise
gap_char (str, optional) – gap char used in alignment algorithm. Defaults to GAP_CHAR.
max_seg_length (int, optional) – maximum segment length. Segments longer than this threshold will continue to be split recursively into smaller segments. Defaults to MAX_ALIGN_SEGMENT_LENGTH.
- Returns
(aligned_gt, aligned_noise)
- Return type
a tuple (str, str) of aligned ground truth and noise
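A short sketch using the example strings above. Since align_w_anchor() shares the contract of genalog.text.alignment.align(), the returned strings have equal length and use the gap char for insertions:
>>> from genalog.text import anchor
>>> gt = "The planet Mars, I scarcely need remind the reader,"
>>> ocr = "The plamet Maris, I scacely neee remind te reader,"
>>> aligned_gt, aligned_noise = anchor.align_w_anchor(gt, ocr)
>>> len(aligned_gt) == len(aligned_noise)
True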
- genalog.text.anchor.find_anchor_recur(gt_tokens, ocr_tokens, start_pos_gt=0, start_pos_ocr=0, max_seg_length=100)[source]¶
Recursively find anchor positions in the gt and ocr text
- Parameters
gt_tokens (list) – a list of ground truth tokens
ocr_tokens (list) – a list of tokens from OCR’ed document
start_pos_gt (int, optional) – a constant added to all the resulting gt token indices. Defaults to 0.
start_pos_ocr (int, optional) – a constant added to all the resulting ocr token indices. Defaults to 0.
max_seg_length (int, optional) – trigger recursion if any text segment is larger than this. Defaults to MAX_ALIGN_SEGMENT_LENGTH.
- Raises
ValueError – when there is a different number of anchor points in gt and ocr.
- Returns
two lists of token indices, where each list gives the positions of the anchor words in the input gt_tokens and ocr_tokens, respectively
- Return type
tuple
- genalog.text.anchor.get_anchor_map(gt_tokens, ocr_tokens, min_anchor_len=2)[source]¶
Find the location of anchor words in both the gt and ocr text. Anchor words are locations where we can split both the source gt and ocr text into smaller text fragments for faster alignment.
- Parameters
gt_tokens (list) – a list of ground truth tokens
ocr_tokens (list) – a list of tokens from OCR’ed document
min_anchor_len (int, optional) – minimum length of an anchor word. Defaults to 2.
- Returns
a 2-element tuple (anchor_map_gt, anchor_map_ocr), where:
anchor_map_gt is a word_map that locates all the anchor words in the gt tokens
anchor_map_ocr is a word_map that locates all the anchor words in the ocr tokens
and len(anchor_map_gt) == len(anchor_map_ocr)
- Return type
tuple
For example:
Input:
gt_tokens:  ["b", "a", "c"]
ocr_tokens: ["c", "b", "a"]
Output:
([("b", 0), ("a", 1)], [("b", 1), ("a", 2)])
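The example above written as a doctest-style call (the output line is copied from the example; both returned maps have equal length):
>>> from genalog.text import anchor
>>> anchor.get_anchor_map(["b", "a", "c"], ["c", "b", "a"])
([('b', 0), ('a', 1)], [('b', 1), ('a', 2)])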
- genalog.text.anchor.get_unique_words(tokens, case_sensitive=False)[source]¶
Get a set of unique words from a list of tokens, based on a Counter of word occurrences
- Parameters
tokens (list) – a list of word tokens from which to collect unique words
case_sensitive (bool, optional) – whether unique words are case sensitive. Defaults to False.
- Returns
a set of unique words (original alphabetical case of the word is preserved)
- Return type
set
- genalog.text.anchor.get_word_map(unique_words, src_tokens)[source]¶
Arrange the set of unique words by the order they originally appear in the text
- Parameters
unique_words (set) – a set of unique words
src_tokens (list) – a list of tokens
- Returns
a word_map: a list of word coordinate tuples (word, word_index), defined as follows:
word is a typical word token
word_index is the index of the word in the source token array
- Return type
list
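A hedged sketch of the word_map structure; the inputs here are hypothetical, and the tuples are ordered by where each word first appears in src_tokens:
>>> from genalog.text import anchor
>>> anchor.get_word_map({"York", "big"}, ["New", "York", "is", "big"])
[('York', 1), ('big', 3)]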
genalog.text.conll_format module¶
This is a utility tool to create CoNLL-formatted token+label files for OCR’ed text by extracting text from GROK OCR output JSON files and propagating labels from clean text to OCR text.
Usage:
conll_format.py [-h] [--train_subset] [--test_subset]
[--gt_folder GT_FOLDER]
base_folder degraded_folder
Positional Argument:
base_folder base directory containing the collection of datasets
degraded_folder directory name containing the train and test subsets for a given degradation
Optional Arguments:
--train_subset include if only train directory should be processed
--test_subset include if only test directory should be processed
--gt_folder GT_FOLDER directory name containing the ground truth (defaults to `shared`)
Seek Help:
-h, --help show this help message and exit
Example Usage:
# to run for specified degradation of the dataset on both train and test
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all'
# to run for specified degradation of the dataset and ground truth
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all' --gt_folder='shared'
# to run for specified degradation of the dataset on only test subset
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all' --test_subset
# to run for specified degradation of the dataset on only train subset
python -m genalog.text.conll_format '/data/enki/datasets/synthetic_dataset/' 'hyphens_all' --train_subset
- genalog.text.conll_format.check_n_sentences(clean_labels_dir, output_labels_dir, clean_label_ext)[source]¶
check_n_sentences prints the file name if the number of sentences differs between the clean and OCR files
- Parameters
clean_labels_dir (str) – path of directory with clean labels (CoNLL formatted, containing tokens and corresponding labels)
output_labels_dir (str) – path of directory with ocr labels (CoNLL formatted, containing tokens and corresponding labels)
- genalog.text.conll_format.extract_ocr_text(input_file, output_file)[source]¶
extract_ocr_text extracts OCR text from a GROK JSON file
- Parameters
input_file (str) – file path of input file
output_file (str) – file path of output file
- genalog.text.conll_format.for_all_files(input_dir, output_dir, func)[source]¶
for_all_files will apply a function to every file in a directory
- Parameters
input_dir (str) – directory with input files
output_dir (str) – directory for output files
func (function) – function to be applied to all files in input_dir
- genalog.text.conll_format.propagate_labels_sentences(clean_tokens, clean_labels, clean_sentences, ocr_tokens)[source]¶
propagate_labels_sentences propagates clean labels for clean tokens to ocr tokens and splits ocr tokens into sentences
- Parameters
clean_tokens (list) – list of tokens in clean text
clean_labels (list) – list of labels corresponding to clean tokens
clean_sentences (list) – list of sentences (each sentence is a list of tokens)
ocr_tokens (list) – list of tokens in ocr text
- Returns
a list of ocr sentences (each sentence is a list of tokens), and a list of labels for the ocr sentences
- Return type
list, list
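A hedged single-sentence sketch; the tokens and labels below are hypothetical (real callers read them from CoNLL-formatted files and from the OCR text extracted by extract_ocr_text):
>>> from genalog.text import conll_format
>>> clean_tokens = ["New", "York", "is", "big"]
>>> clean_labels = ["B-LOC", "I-LOC", "O", "O"]
>>> clean_sentences = [["New", "York", "is", "big"]]
>>> ocr_tokens = ["New", "Yo", "rk", "is", "big"]
>>> ocr_sentences, ocr_labels = conll_format.propagate_labels_sentences(
...     clean_tokens, clean_labels, clean_sentences, ocr_tokens)
>>> # ocr_sentences: one token list per sentence; ocr_labels: one label list per sentence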
- genalog.text.conll_format.propagate_labels_sentences_multiprocess(clean_labels_dir, output_text_dir, output_labels_dir, clean_label_ext)[source]¶
propagate_labels_sentences_multiprocess propagates labels and sentences for all files in the dataset
- Parameters
clean_labels_dir (str) – path of directory with clean labels (CoNLL formatted, containing tokens and corresponding labels)
output_text_dir (str) – path of directory with ocr text
output_labels_dir (str) – path of directory with ocr labels (CoNLL formatted, containing tokens and corresponding labels)
clean_label_ext (str) – file extension of the clean labels
genalog.text.lcs module¶
genalog.text.ner_label module¶
- genalog.text.ner_label._propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, gap_char='@', use_anchor=True)[source]¶
Propagate NER labels from ground truth tokens to ocr tokens. Low-level implementation.
NOTE: gt_tokens and ocr_tokens MUST NOT contain invalid tokens. Invalid tokens are:
1. non-atomic tokens, or space-separated strings (“New York”)
2. multiple occurrences of the GAP_CHAR (‘@@@’)
3. empty strings (“”)
4. strings containing only spaces (” “)
Case Analysis:
MULTI-TOKEN LABELS
Case 1 (one-to-many):             gt "New York" (B-p I-p)          ->  ocr "N ew Yo rk" (B-p I-p I-p I-p)
Case 2 (many-to-one):             gt "New York" (B-p I-p)          ->  ocr "NewYork" (B-p)
Case 3 (many-to-many, Case 1&2):  gt "New York" (B-p I-p)          ->  ocr "N ewYork" (B-p I-p)
Case 4 (missing tokens, I-label): gt "New York" (B-p I-p)          ->  ocr "New" (B-p)
Case 5 (missing tokens, B-label): gt "New York City" (B-p I-p I-p) ->  ocr "York City" (B-p I-p)
SINGLE-TOKEN LABELS
Case 1 (one-to-many):             gt "something" (O)               ->  ocr "so me thing" (o o o)
Case 2 (many-to-one):             gt "is big" (V O)                ->  ocr "isbig" (V)
Case 3 (many-to-many, Case 1&2):  gt "this is huge" (O V W)        ->  ocr "th isi shuge" (O O V)
Case 4 (missing tokens):          gt "is big" (O O)                ->  ocr "is" (O)
- Parameters
gt_labels (list) – a list of NER labels for the ground truth tokens
gt_tokens (list) – a list of ground truth string tokens
ocr_tokens (list) – a list of OCR’ed text tokens
gap_char (char, optional) – gap char used in alignment algorithm. Defaults to alignment.GAP_CHAR.
use_anchor (bool, optional) – use faster alignment method with anchors if set to True. Defaults to True.
- Raises
ValueError – when
1. there is an unequal number of gt_tokens and gt_labels
2. there is a non-atomic token in gt_tokens or ocr_tokens
3. there is an empty string in gt_tokens or ocr_tokens
4. there is a token consisting of space characters only in gt_tokens or ocr_tokens
5. gt_to_ocr_mapping has more tokens than gt_tokens
GapCharError – when
1. there is a token consisting of GAP_CHAR only
- Returns
(ocr_labels, aligned_gt, aligned_ocr, gap_char), where:
ocr_labels is a list of NER labels for the corresponding ocr tokens
aligned_gt is the ground truth string aligned with the ocr text
aligned_ocr is the ocr text aligned with the ground truth
gap_char is the char used by the alignment for inserting gaps
- Return type
a tuple of 4 elements
For example, given input:
>>> _propagate_label_to_ocr(
...     ["B-place", "I-place", "o", "o"],
...     ["New", "York", "is", "big"],
...     ["N", "ewYork", "big"]
... )
(["B-place", "I-place", "o"], "N@ew York is big", "N ew@York@@@ big", '@')
- genalog.text.ner_label.correct_ner_labels(labels)[source]¶
Correct the given list of labels for the following case:
Missing B-Label (i.e. I-PLACE I-PLACE -> B-PLACE I-PLACE)
- Parameters
labels (list) – list of NER labels
- Returns
a list of NER labels
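A small sketch of the correction described above:
>>> from genalog.text import ner_label
>>> ner_label.correct_ner_labels(["I-PLACE", "I-PLACE"])
['B-PLACE', 'I-PLACE']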
- genalog.text.ner_label.format_label_propagation(gt_tokens, gt_labels, ocr_tokens, ocr_labels, aligned_gt, aligned_ocr, show_alignment=True)[source]¶
Format label propagation for display
- Parameters
gt_tokens (list) – list of ground truth tokens
gt_labels (list) – list of NER labels for ground truth tokens
ocr_tokens (list) – list of OCR’ed text tokens
ocr_labels (list) – list of NER labels for the OCR’ed tokens
aligned_gt (str) – ground truth string aligned with the OCR’ed text
aligned_ocr (str) – OCR’ed text aligned with ground truth
show_alignment (bool, optional) – if True, show the alignment result. Defaults to True.
- Returns
a string formatted for display as follows:
- Return type
str
if show_alignment:
    "B-PLACE I-PLACE V O"  # [gt_labels]
    "New York is big"      # [gt_txt]
    "New York is big"      # [aligned_gt]
    "||||....|||||||"
    "New @@@@ is big"      # [aligned_ocr]
    "New is big "          # [ocr_txt]
    "B-PLACE V O "         # [ocr_labels]
else:
    "B-PLACE I-PLACE V O"  # [gt_labels]
    "New York is big"      # [gt_txt]
    "New is big"           # [ocr_txt]
    "B-PLACE V O"          # [ocr_labels]
- genalog.text.ner_label.format_labels(tokens, labels, label_top=True)[source]¶
Format tokens and their NER label for display
- Parameters
tokens (list) – a list of word tokens
labels (list) – a list of NER labels
label_top (bool, optional) – True if the label is placed on top of the token. Defaults to True.
- Returns
a str with each NER label aligned to the token it is labeling
Given inputs:
tokens: ["New", "York", "is", "big"]
labels: ["B-place", "I-place", "o", "o"]
label_top: True
Outputs:
"B-place I-place o o "
"New York is big"
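The same example as a doctest-style call; the exact column padding in the returned string depends on token and label widths, so the printed output is only described in the comment:
>>> from genalog.text import ner_label
>>> formatted = ner_label.format_labels(
...     ["New", "York", "is", "big"],
...     ["B-place", "I-place", "o", "o"],
...     label_top=True)
>>> print(formatted)  # labels on one line, aligned above their tokens on the next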
- genalog.text.ner_label.propagate_label_to_ocr(gt_labels, gt_tokens, ocr_tokens, use_anchor=True)[source]¶
Propagate NER labels from ground truth tokens to ocr tokens.
- NOTE that gt_tokens and ocr_tokens MUST NOT contain invalid tokens.
- Invalid tokens are:
1. non-atomic tokens, or space-separated strings (“New York”)
3. empty strings (“”)
4. strings containing only spaces (” “)
- Parameters
gt_labels (list) – a list of NER labels for the ground truth tokens
gt_tokens (list) – a list of ground truth string tokens
ocr_tokens (list) – a list of OCR’ed text tokens
gap_char (char, optional) – gap char used in alignment algorithm. Defaults to alignment.GAP_CHAR.
use_anchor (bool, optional) – use faster alignment method with anchors if set to True. Defaults to True.
- Raises
GapCharError – when the set of input characters is EQUAL to the set of all possible gap characters (GAP_CHAR_SET)
- Returns
a tuple of 4 elements (ocr_labels, aligned_gt, aligned_ocr, gap_char), where:
1. ocr_labels is a list of NER labels for the corresponding ocr tokens
2. aligned_gt is the ground truth string aligned with the ocr text
3. aligned_ocr is the ocr text aligned with the ground truth
4. gap_char is the char used by the alignment for inserting gaps
- Return type
tuple
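A usage sketch mirroring the low-level example earlier in this module; the public function appears to choose the gap char itself (hence the GapCharError above) and returns the same four elements:
>>> from genalog.text import ner_label
>>> ocr_labels, aligned_gt, aligned_ocr, gap_char = ner_label.propagate_label_to_ocr(
...     ["B-place", "I-place", "o", "o"],
...     ["New", "York", "is", "big"],
...     ["N", "ewYork", "big"])
>>> ocr_labels
['B-place', 'I-place', 'o']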
genalog.text.preprocess module¶
- genalog.text.preprocess.is_sentence_separator(token)[source]¶
Returns true if the token is a sentence splitter
- genalog.text.preprocess.join_tokens(tokens)[source]¶
Join a list of tokens into a string
- Parameters
tokens (list) – a list of tokens
- Returns
a string with space-separated tokens
- genalog.text.preprocess.remove_non_ascii(token, replacement='_')[source]¶
Remove non-ASCII characters in a token
- Parameters
token (str) – a word token
replacement (str, optional) – a replacement character for non-ASCII characters. Defaults to NON_ASCII_REPLACEMENT.
- Returns
str – a word token with non-ASCII characters removed
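A small sketch, assuming each non-ASCII character is replaced with the replacement char (default '_'); the exact handling of multi-byte characters is not documented here, so treat the outputs as illustrative:
>>> from genalog.text import preprocess
>>> preprocess.remove_non_ascii("café")
'caf_'
>>> preprocess.remove_non_ascii("café", replacement="-")
'caf-'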