genalog.ocr

This module will be deprecated in favor of the official Azure Computer Vision SDK.

genalog.ocr.common module

genalog.ocr.grok module

genalog.ocr.metrics module

Utility functions to support getting OCR metrics

OCR Metrics

1. Word/character accuracy, as in this paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6065412. Accuracy = Correct Words / Total Words (in target strings).

2. Count of edit distance operations (insertions, deletions, substitutions), as in the paper “Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing”. This is based on Levenshtein edit distance.

3. By looking at the gaps in the alignment we also generate substitution dicts. For example, if the text is “a worn coat” and the OCR output is “a wom coat”, then “rn” -> “m” is captured as a substitution, since the rest of the segments align. The assumption here is that we do not expect very long gaps in the alignment, so collecting and counting these substitutions remains manageable.
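As a rough illustration of metrics 1 and 3, the sketch below computes word accuracy and collects substitution pairs. It uses the standard library's difflib.SequenceMatcher as a stand-in for this module's own alignment, and the function names word_accuracy and collect_substitutions are hypothetical, so results may differ from what genalog reports.

```python
from collections import Counter
from difflib import SequenceMatcher


def word_accuracy(target: str, ocr: str) -> float:
    """Metric 1: correct words / total words in the target string."""
    target_words = target.split()
    ocr_words = ocr.split()
    if not target_words:
        raise ValueError("target string has no words")
    # Words that fall inside matching blocks are counted as correct.
    matcher = SequenceMatcher(None, target_words, ocr_words, autojunk=False)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return correct / len(target_words)


def collect_substitutions(target: str, ocr: str) -> Counter:
    """Metric 3: count (target_substring, ocr_substring) pairs between aligned segments."""
    # Assumption: difflib opcodes approximate the gaps found by the real aligner.
    matcher = SequenceMatcher(None, target, ocr, autojunk=False)
    substitutions = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            substitutions[(target[i1:i2], ocr[j1:j2])] += 1
    return substitutions
```

With the example above, collect_substitutions("a worn coat", "a wom coat") yields a single entry, ("rn", "m") with count 1, and word_accuracy gives 2/3 because only “a” and “coat” match.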

genalog.ocr.metrics.get_align_stats(alignment, src_string, target, gap_char)[source]

Get alignment stats

Parameters
  • alignment (tuple(str,str)) – the result of calling the align function

  • src_string (str) – the original source string

  • target (str) – the original target string

  • gap_char (str) – the gap character used in alignment

Raises

ValueError – if any of the strings are empty

Returns

dict of the align stats and dict of the substitution mappings

Return type

tuple(dict, dict)

genalog.ocr.metrics.get_editops_stats(alignment, gap_char)[source]

Get stats for the character-level edit operations needed to transform the source string into the target string. Inputs must not be empty and must be the result of running the align function.

Parameters
  • alignment (tuple(str, str)) – the result of the Biopython string alignment function

  • gap_char (str) – gap character used in alignment

Raises

ValueError – if any of the strings in the alignment are empty

Returns

[description]

Return type

[type]
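A minimal sketch of counting character-level edit operations from a pair of gap-aligned strings. The function name count_edit_ops, the “@” gap character in the usage note, and the convention that a gap in the source counts as an insertion are assumptions for illustration. The key names follow the edit stats described under get_stats, and the real function may return more than this.

```python
def count_edit_ops(alignment, gap_char):
    """Count the edits needed to turn the aligned source into the aligned target."""
    aligned_src, aligned_target = alignment
    if not aligned_src or not aligned_target:
        raise ValueError("aligned strings must not be empty")
    if len(aligned_src) != len(aligned_target):
        raise ValueError("aligned strings must have the same length")

    stats = {"edit_insert": 0, "edit_delete": 0, "edit_replace": 0}
    for src_char, target_char in zip(aligned_src, aligned_target):
        if src_char == target_char:
            continue  # characters agree, no edit needed
        if src_char == gap_char:
            stats["edit_insert"] += 1  # assumed: gap in source means insert
        elif target_char == gap_char:
            stats["edit_delete"] += 1  # assumed: gap in target means delete
        else:
            stats["edit_replace"] += 1
    return stats
```

For the aligned pair ("a worn coat", "a wom@ coat") with gap character "@", this counts one replacement (“r” to “m”) and one deletion (“n”).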

genalog.ocr.metrics.get_metrics(src_text_path, ocr_json_path, folder_hash=None, use_multiprocessing=True)[source]

Given a path to the folder containing the source text files and a path to the folder containing the OCR output JSON files, this generates the metrics for all files in the source folder. It assumes that the files in the JSON folder have the same names as the text files, except that they are prefixed by the folder_hash parameter followed by an underscore and suffixed by .png.json.

Parameters
  • src_text_path (str) – path to source txt files

  • ocr_json_path (str) – path to OCR json files

  • folder_hash (str) – prefix for OCR json files

  • use_multiprocessing (bool) – use multiprocessing

Returns

A pandas DataFrame of the metrics, with one file per row, and

a dict containing the substitution mappings for each file. The keys of the dict are the filenames and the values are dicts of the substitution mappings for that file.

Return type

tuple(pandas.DataFrame, dict)
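The naming convention described for get_metrics can be sketched as a small helper. The name expected_ocr_json_name is hypothetical, and two details are assumptions not stated in the text: that the .txt extension is dropped before .png.json is appended, and that no prefix is added when folder_hash is None.

```python
import os


def expected_ocr_json_name(text_filename, folder_hash=None):
    """Build the OCR JSON filename expected for a given source text file."""
    # Assumption: the text file's extension is removed before adding the suffix.
    stem, _ext = os.path.splitext(os.path.basename(text_filename))
    # Assumption: without a folder_hash, no prefix (and no underscore) is used.
    prefix = f"{folder_hash}_" if folder_hash else ""
    return f"{prefix}{stem}.png.json"
```

Under these assumptions, a source file named page_01.txt with folder_hash "abc123" would pair with abc123_page_01.png.json.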

genalog.ocr.metrics.get_stats(target, src_string)[source]

Get align stats, edit stats, and substitution mappings for transforming the source string into the target string. Edit stats refer to the character-level edit operations required to transform the source into the target. Align stats refer to the substring-level operations required to transform the source into the target. Align stats have the keys insert, replace and delete, plus the special key spacing, which counts spacing differences between the two strings. Edit stats have the keys edit_insert, edit_replace and edit_delete, which count the character-level edits.

Parameters
  • src_string (str) – the source string

  • target (str) – the target string

Returns

One dict containing the edit and align stats, another dict containing the substitutions

Return type

tuple(dict, dict)

genalog.ocr.metrics.substitution_dict_to_json(substitution_dict)[source]

Converts substitution dict to list of tuples of (source_substring, target_substring, count)

Parameters

substitution_dict ([type]) – [description]
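The conversion can be pictured with the sketch below. The input shape is an assumption, since the parameter is undocumented here: a flat dict keyed by (source_substring, target_substring) tuples with counts as values. The helper name substitutions_to_tuples is hypothetical. One plausible motivation is that tuple keys cannot be serialized by the json module, while a list of tuples can.

```python
import json


def substitutions_to_tuples(substitution_dict):
    """Flatten {(source, target): count} into [(source, target, count), ...]."""
    # Assumption: keys are 2-tuples of substrings and values are integer counts.
    return [(src, tgt, count) for (src, tgt), count in substitution_dict.items()]


def substitutions_to_json(substitution_dict):
    """Serialize the flattened substitutions; JSON renders each tuple as a list."""
    return json.dumps(substitutions_to_tuples(substitution_dict))
```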

genalog.ocr.rest_client module

Uses the REST API to perform operations on the search service. See: https://docs.microsoft.com/en-us/rest/api/searchservice/

class genalog.ocr.rest_client.GrokRestClient(cognitive_service_key, search_service_key, search_service_name, skillset_name, index_name, indexer_name, datasource_name, datasource_container_name, blob_account_name, blob_key, projections_container_name='ocrprojections')[source]

This is a REST client that wraps the REST API for the Azure Search Service. See: https://docs.microsoft.com/en-us/rest/api/searchservice/

This class can be used to create an indexing pipeline and to run and monitor ongoing indexers. The indexing pipeline lets you run batch OCR enrichment of documents.

Creates the REST client

Parameters
  • cognitive_service_key (str) – key to cognitive services account

  • search_service_key (str) – key to the search service account

  • search_service_name (str) – name of the search service account

  • skillset_name (str) – name of the skillset

  • index_name (str) – name of the index

  • indexer_name (str) – name of the indexer

  • datasource_name (str) – the name to give the attached blob storage source

  • datasource_container_name (str) – the container in the blob storage that hosts the files

  • blob_account_name (str) – blob storage account name that will host the documents to push through the pipeline

  • blob_key (str) – key to blob storage account

create_datasource()[source]

Attaches the blob data store to the search service as a source for image documents

create_index()[source]

Create an index with the layoutText column to store OCR output from the enrichment

create_indexer(extension_to_exclude='.txt, .json')[source]

Creates an indexer that runs the enrichment skillset on documents from the datasource. The enriched results are pushed to the index.

create_indexing_pipeline()[source]

Using REST calls, creates an index, indexer and skillset on the Cognitive service. The JSON templates are in the templates folder.

create_skillset()[source]

Adds a skillset that performs OCR on images

delete_indexer_pipeline()[source]

Deletes all indexers, indexes, skillsets and datasources that were previously created
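To give a feel for the REST calls behind create_indexing_pipeline, the sketch below only builds the request descriptors and sends nothing. The helper name build_pipeline_requests, the API version, and the creation order are assumptions. The URL layout and the api-key header follow the Azure Search REST documentation linked above, and the request bodies (the JSON templates) are left out.

```python
API_VERSION = "2020-06-30"  # assumed version; check the REST docs for the one you need


def build_pipeline_requests(search_service_name, search_service_key,
                            datasource_name, skillset_name, index_name, indexer_name):
    """Return (method, url, headers) tuples for creating each pipeline resource."""
    base_url = f"https://{search_service_name}.search.windows.net"
    headers = {"Content-Type": "application/json", "api-key": search_service_key}
    # Assumed order: the indexer comes last because it refers to the other three.
    resources = [
        ("datasources", datasource_name),
        ("skillsets", skillset_name),
        ("indexes", index_name),
        ("indexers", indexer_name),
    ]
    return [
        ("PUT", f"{base_url}/{collection}/{name}?api-version={API_VERSION}", headers)
        for collection, name in resources
    ]
```

Each tuple could then be passed to an HTTP library together with the matching JSON body. Deleting the pipeline would use the same URLs with the DELETE method.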

genalog.ocr.blob_client module

Uses the Python SDK to perform operations on Azure Blob storage. See: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python

class genalog.ocr.blob_client.GrokBlobClient(datasource_container_name, blob_account_name, blob_key, projections_container_name='ocrprojections')[source]

This class is a client used to upload files to and delete files from Azure Blob storage. See: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python

Creates the blob storage client given the key and storage account name

Parameters
  • datasource_container_name (str) – container name. This container does not need to exist beforehand

  • projections_container_name (str) – projections container used to store OCR projections. This container does not need to exist beforehand

  • blob_account_name (str) – storage account name

  • blob_key (str) – storage account key

static create_from_env_var()[source]

Creates the blob client using values from environment variables

Returns

the new blob client

Return type

GrokBlobClient

delete_blobs_folder(folder_name)[source]

Deletes all blobs in a folder

Parameters

folder_name (str) – folder to delete

get_folder_hash(folder_name)[source]

Creates an MD5 hash for all files in a folder. This hash can be used for blob folders.

Parameters

folder_name (str) – path to folder

Returns

md5 hash of all filenames in the folder

Return type

str
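One way such a folder hash could be computed is sketched below. The function name folder_name_hash is hypothetical, and hashing the sorted filenames (not the file contents) is an assumption based on the return description above. The actual method may combine the names differently.

```python
import hashlib
import os


def folder_name_hash(folder_name):
    """Return an MD5 hex digest computed over the sorted filenames in a folder."""
    md5 = hashlib.md5()
    # Sorting makes the digest independent of the order os.listdir returns.
    for filename in sorted(os.listdir(folder_name)):
        md5.update(filename.encode("utf-8"))
    return md5.hexdigest()
```

Two folders holding files with the same names produce the same digest under this scheme, which is what would let a hash-named blob folder act as a cache key.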

upload_images_to_blob(src_folder_path, dest_folder_name=None, check_existing_cache=True, use_async=True)[source]

Uploads images from src_folder_path to blob storage at the destination folder. The destination folder is created if it doesn’t exist. If no destination folder is given, a folder named after the MD5 hash of the files is created.

Parameters
  • src_folder_path (str) – path to the local folder that contains the images

  • dest_folder_name (str, optional) – destination folder name. Defaults to None.

Returns

the destination folder name

Return type

str