Synthetic Document Generator
genalog is an open-source, cross-platform Python package for generating document images with synthetic noise that mimics scanned analog documents (hence the name genalog). You can also add various text degradations to these images. The tool provides a fast and efficient way to generate synthetic documents from text data, using layouts from templates that you create in simple HTML.
genalog provides several document templates as a starting point. You can alter the document layout using standard CSS properties such as font-family, font-size, and text-align.
Here are some example generated documents:
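To make the idea of CSS-driven layout concrete, here is a minimal sketch that fills an HTML template with text and style values. It uses only the Python standard library and is not genalog's own API: the template string, the `render_page` function, and its parameter names are hypothetical and chosen for illustration.

```python
from html import escape
from string import Template

# Hypothetical template (not shipped with genalog): a minimal HTML page
# whose CSS values are placeholders to be filled in at render time.
PAGE_TEMPLATE = Template("""<html>
<head>
<style>
  body {
    font-family: $font_family;
    font-size: $font_size;
    text-align: $text_align;
  }
</style>
</head>
<body>
<p>$text</p>
</body>
</html>""")


def render_page(text, font_family="Times", font_size="12px", text_align="left"):
    """Return an HTML page with the given text and CSS style values."""
    return PAGE_TEMPLATE.substitute(
        text=escape(text),  # escape <, >, & so the text cannot break the markup
        font_family=font_family,
        font_size=font_size,
        text_align=text_align,
    )


html = render_page("Hello, scanned world.", font_size="18px", text_align="justify")
print(html)
```

Changing a style argument changes only the corresponding CSS declaration, so the same text can be rendered in many layouts before being rasterized to an image.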
Once a document is generated, you can combine various image degradation effects and apply them to the synthetic documents.
Here are some of the degradation effects:
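Combining degradation effects can be pictured as running an image through an ordered list of functions. The sketch below is an illustration under stated assumptions, not genalog's implementation: the image is a plain list of grayscale rows, and the names `salt`, `box_blur`, and `degrade` are hypothetical.

```python
import random


def salt(image, amount=0.05, seed=0):
    """Set a random fraction of pixels to white (255). Returns a new image."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    return [[255 if rng.random() < amount else px for px in row] for row in image]


def box_blur(image):
    """Replace each pixel with the integer mean of its 3x3 neighborhood."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighborhood is clipped at the image borders.
            window = [
                image[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(sum(window) // len(window))
        blurred.append(row)
    return blurred


def degrade(image, steps):
    """Apply each (effect, kwargs) pair in order and return the final image."""
    for effect, kwargs in steps:
        image = effect(image, **kwargs)
    return image


page = [[0] * 8 for _ in range(8)]  # an all-black 8x8 grayscale image
steps = [(salt, {"amount": 0.2, "seed": 1}), (box_blur, {})]
noisy = degrade(page, steps)
```

Because each effect takes an image and returns a new one, effects can be reordered or repeated simply by editing the `steps` list.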
In addition to document generation and degradation, genalog also provides an efficient implementation of text alignment between the source text and the noisy text.
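As a rough illustration of what aligning source and noisy text means, the sketch below pads both strings with a gap character so that they have equal length and matching characters line up. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in technique; it is not genalog's alignment algorithm, and the `align` function and the choice of `@` as the gap character are assumptions made for this example.

```python
from difflib import SequenceMatcher

GAP = "@"  # assumed gap character; it must not occur in the input texts


def align(source, noisy, gap=GAP):
    """Pad both strings with gap characters so they have equal length."""
    aligned_source, aligned_noisy = [], []
    matcher = SequenceMatcher(None, source, noisy, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        source_chunk, noisy_chunk = source[i1:i2], noisy[j1:j2]
        # Pad the shorter chunk so both sides advance by the same width.
        width = max(len(source_chunk), len(noisy_chunk))
        aligned_source.append(source_chunk.ljust(width, gap))
        aligned_noisy.append(noisy_chunk.ljust(width, gap))
    return "".join(aligned_source), "".join(aligned_noisy)


src, ocr = align("New York is big", "New Yrk is big")
print(src)
print(ocr)
```

Once the two strings have the same length, a label attached to a character position in the source text can be carried over to the same position in the noisy text.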