genalog.ocr
This module will be deprecated in favor of the official Azure Computer Vision SDK.
genalog.ocr.common module
genalog.ocr.grok module
genalog.ocr.metrics module
Utility functions to support getting OCR metrics
OCR metrics:
1. Word/character accuracy, as in this paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6065412. Accuracy = Correct Words / Total Words (in target strings).
2. Count of edit distance operations (insertions, deletions, substitutions), as in the paper “Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing”. This is based on Levenshtein edit distance.
3. Substitution dicts, generated by looking at the gaps in the alignment. For example, if the text is “a worn coat” and the OCR output is “a wom coat”, then “rn” -> “m” is captured as a substitution, since the rest of the segments align. The assumption is that alignment gaps are not expected to be very long, so collecting and counting these substitutions remains manageable.
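The first and third metrics can be illustrated with a minimal sketch. It uses the standard library difflib.SequenceMatcher as a stand-in for the module's own alignment step, so it is not the genalog implementation; the helper names word_accuracy and substitutions are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def word_accuracy(target: str, ocr: str) -> float:
    """Correct Words / Total Words, counting target words matched in order."""
    target_words, ocr_words = target.split(), ocr.split()
    if not target_words:
        raise ValueError("target must contain at least one word")
    matcher = SequenceMatcher(None, target_words, ocr_words, autojunk=False)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return correct / len(target_words)


def substitutions(target: str, ocr: str) -> Counter:
    """Count (target_segment, ocr_segment) pairs where the texts disagree."""
    matcher = SequenceMatcher(None, target, ocr, autojunk=False)
    counts = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            counts[(target[i1:i2], ocr[j1:j2])] += 1
    return counts


target, ocr = "a worn coat", "a wom coat"
accuracy = word_accuracy(target, ocr)  # 2 of 3 target words are correct
subs = substitutions(target, ocr)      # the "rn" -> "m" mapping, seen once
print(round(accuracy, 3), dict(subs))
```

Because only the mismatched segment between two aligned stretches is recorded, the substitution table stays small as long as the gaps are short.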
- genalog.ocr.metrics.get_align_stats(alignment, src_string, target, gap_char)
Get alignment stats
- Parameters
alignment (tuple(str,str)) – the result of calling the align function
src_string (str) – the original source string
target (str) – the original target string
gap_char (str) – the gap character used in alignment
- Raises
ValueError – if any of the strings are empty
- Returns
dict of the align stats and dict of the substitution mappings
- Return type
tuple(dict, dict)
- genalog.ocr.metrics.get_editops_stats(alignment, gap_char)
Get stats for the character-level edit operations needed to transform the source string to the target string. Inputs must not be empty and must be the result of calling the align function.
- Parameters
alignment (tuple(str, str)) – the results from the string alignment biopy function
gap_char (str) – gap character used in alignment
- Raises
ValueError – if any of the strings in the alignment are empty
- Returns
[description]
- Return type
[type]
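A minimal sketch of how character-level edit operations can be counted from an aligned pair. It assumes the tuple is ordered (aligned source, aligned target) and that both strings have equal length after alignment; the function name count_edit_ops is hypothetical, and the key names follow those described for get_stats. This is an illustration, not the library's implementation.

```python
def count_edit_ops(alignment, gap_char):
    """Count inserts, deletes and replacements between two aligned strings."""
    aligned_src, aligned_target = alignment  # assumed order: (source, target)
    if not aligned_src or not aligned_target:
        raise ValueError("aligned strings must not be empty")
    if len(aligned_src) != len(aligned_target):
        raise ValueError("aligned strings must have equal length")

    stats = {"edit_insert": 0, "edit_delete": 0, "edit_replace": 0}
    for src_char, target_char in zip(aligned_src, aligned_target):
        if src_char == gap_char:
            stats["edit_insert"] += 1   # target has a character the source lacks
        elif target_char == gap_char:
            stats["edit_delete"] += 1   # source has a character the target lacks
        elif src_char != target_char:
            stats["edit_replace"] += 1  # characters differ at this position
    return stats


# Source "a wom coat" aligned against target "a worn coat" with "@" as the gap.
example = count_edit_ops(("a wom@ coat", "a worn coat"), "@")
print(example)
```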
- genalog.ocr.metrics.get_metrics(src_text_path, ocr_json_path, folder_hash=None, use_multiprocessing=True)
Given a path to the folder containing the source text files and a path to the folder containing the OCR output JSON files, generates the metrics for all files in the source folder. This assumes that the JSON files have the same names as the text files, except that they are prefixed by the folder_hash parameter followed by an underscore and suffixed by .png.json.
- Parameters
src_text_path (str) – path to source txt files
ocr_json_path (str) – path to OCR json files
folder_hash (str) – prefix for OCR json files
use_multiprocessing (bool) – use multiprocessing
- Returns
- A pandas DataFrame of the metrics, with one file per row,
a dict containing the substitution mappings for each file. The keys of the dict are the filenames and the values are dicts of the substitution mappings for that file.
- Return type
tuple(pandas.DataFrame, dict)
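The file naming convention can be sketched as below. Two points are assumptions not stated here: that the .txt extension is dropped before .png.json is appended, and that no prefix is added when folder_hash is None. The helper expected_ocr_json_name is hypothetical.

```python
from pathlib import Path


def expected_ocr_json_name(text_file_name, folder_hash=None):
    """Map a source text file name to the OCR JSON file name it pairs with."""
    stem = Path(text_file_name).stem  # assumption: ".txt" is dropped
    prefix = f"{folder_hash}_" if folder_hash else ""  # assumption for None
    return f"{prefix}{stem}.png.json"


name = expected_ocr_json_name("page_0.txt", folder_hash="abc123")
print(name)
```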
- genalog.ocr.metrics.get_stats(target, src_string)
Get align stats, edit stats, and substitution mappings for transforming the source string to the target string. Edit stats refer to the character-level edit operations required to transform the source to the target. Align stats refer to the substring-level operations required to transform the source to the target. Align stats have the keys insert, replace, and delete, plus the special key spacing, which counts spacing differences between the two strings. Edit stats have the keys edit_insert, edit_replace, and edit_delete, which count the character-level edits.
- Parameters
src_string (str) – the source string
target (str) – the target string
- Returns
One dict containing the edit and align stats, another dict containing the substitutions
- Return type
tuple(dict, dict)
genalog.ocr.rest_client module
Uses the REST API to perform operations on the search service. See: https://docs.microsoft.com/en-us/rest/api/searchservice/
- class genalog.ocr.rest_client.GrokRestClient(cognitive_service_key, search_service_key, search_service_name, skillset_name, index_name, indexer_name, datasource_name, datasource_container_name, blob_account_name, blob_key, projections_container_name='ocrprojections')
A REST client that wraps the REST API for the Azure Search Service. See: https://docs.microsoft.com/en-us/rest/api/searchservice/
This class can be used to create an indexing pipeline and can be used to run and monitor ongoing indexers. The indexing pipeline can allow you to run batch OCR enrichment of documents.
Creates the REST client
- Parameters
cognitive_service_key (str) – key to cognitive services account
search_service_key (str) – key to the search service account
search_service_name (str) – name of the search service account
skillset_name (str) – name of the skillset
index_name (str) – name of the index
indexer_name (str) – the name of indexer
datasource_name (str) – the name to give the attached blob storage source
datasource_container_name (str) – the container in the blob storage that hosts the files
blob_account_name (str) – blob storage account name that will host the documents to push through the pipeline
blob_key (str) – key to blob storage account
- create_datasource()
Attaches the blob data store to the search service as a source for image documents
- create_index()
Create an index with the layoutText column to store OCR output from the enrichment
- create_indexer(extension_to_exclude='.txt, .json')
Creates an indexer that runs the enrichment skillset on documents from the datasource. The enriched results are pushed to the index.
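For orientation, the sketch below builds (but does not send) the kind of request the Azure Search REST API accepts for creating a blob indexer. It is not the GrokRestClient implementation: the helper build_indexer_request, the API version, and the mapping of extension_to_exclude onto the excludedFileNameExtensions setting are assumptions made for illustration.

```python
import json

API_VERSION = "2020-06-30"  # assumption: any supported API version would do


def build_indexer_request(search_service_name, search_service_key,
                          indexer_name, datasource_name, index_name,
                          skillset_name, extension_to_exclude=".txt, .json"):
    """Return (url, headers, body) for a PUT request that creates an indexer."""
    url = (
        f"https://{search_service_name}.search.windows.net"
        f"/indexers/{indexer_name}?api-version={API_VERSION}"
    )
    headers = {
        "Content-Type": "application/json",
        "api-key": search_service_key,
    }
    body = {
        "name": indexer_name,
        "dataSourceName": datasource_name,   # where documents are read from
        "targetIndexName": index_name,       # where enriched results are pushed
        "skillsetName": skillset_name,       # the enrichment (OCR) skillset
        "parameters": {
            "configuration": {
                # assumption: this is how the excluded extensions are passed
                "excludedFileNameExtensions": extension_to_exclude,
            }
        },
    }
    return url, headers, json.dumps(body)


# All argument values here are placeholders.
url, headers, body = build_indexer_request(
    "my-search", "<search-key>", "ocr-indexer", "ocr-source", "ocr-index", "ocr-skillset"
)
print(url)
```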
genalog.ocr.blob_client module
Uses the Python SDK to perform operations on Azure Blob storage. See: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python
- class genalog.ocr.blob_client.GrokBlobClient(datasource_container_name, blob_account_name, blob_key, projections_container_name='ocrprojections')
A client used to upload files to and delete files from Azure Blob storage. See: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python
Creates the blob storage client given the key and storage account name
- Parameters
datasource_container_name (str) – container name. This container does not need to exist beforehand
projections_container_name (str) – projections container to store OCR projections. This container does not need to exist beforehand
blob_account_name (str) – storage account name
blob_key (str) – storage account key
- static create_from_env_var()
Creates the blob client using values from environment variables
- Returns
the new blob client
- Return type
GrokBlobClient
- delete_blobs_folder(folder_name)
Deletes all blobs in a folder
- Parameters
folder_name (str) – folder to delete
- get_folder_hash(folder_name)
Create an MD5 hash for all files in a folder. This hash can be used for blob folders.
- Parameters
folder_name (str) – path to folder
- Returns
md5 hash of all filenames in the folder
- Return type
str
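A minimal sketch of hashing the filenames in a folder with hashlib. Sorting the names and encoding them as UTF-8 are assumptions made here so that the result is deterministic; the actual method may combine the names differently. The function name folder_name_hash is hypothetical.

```python
import hashlib
import os
import tempfile


def folder_name_hash(folder_path):
    """Return the MD5 hex digest of the sorted filenames in a folder."""
    md5 = hashlib.md5()
    for name in sorted(os.listdir(folder_path)):  # assumption: sorted order
        md5.update(name.encode("utf-8"))
    return md5.hexdigest()


# Demonstration on a temporary folder holding two empty image files.
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("b.png", "a.png"):
        open(os.path.join(tmp, file_name), "w").close()
    digest = folder_name_hash(tmp)

print(digest)
```

Since only the names are hashed in this sketch, two folders with identical filenames but different contents would receive the same hash.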
- upload_images_to_blob(src_folder_path, dest_folder_name=None, check_existing_cache=True, use_async=True)
Uploads images from src_folder_path to blob storage at the destination folder. The destination folder is created if it doesn’t exist. If no destination folder is given, a folder named after the MD5 hash of the files is created.
- Parameters
src_folder_path (str) – path to local folder that has images
dest_folder_name (str, optional) – destination folder name. Defaults to None.
- Returns
the destination folder name
- Return type
str