Synthetic Document Generator


genalog is an open-source, cross-platform Python package for generating document images with synthetic noise that mimics scanned analog documents (hence the name genalog). You can also add various text degradations to these images. The tool provides a fast and efficient way to generate synthetic documents from text data, using layouts from templates that you create in simple HTML.


Fig. 1 Generate documents and apply degradations

genalog provides several document templates as a starting point. You can alter the document layout using standard CSS properties such as font-family, font-size, and text-align. Here are some example generated documents:
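
To make the template idea concrete, here is a minimal sketch of filling an HTML layout with text and CSS settings. It uses only the Python standard library and is not genalog's own API: the `PAGE` template and the `render_page` helper are hypothetical names introduced for illustration.

```python
from html import escape
from string import Template

# Hypothetical layout template: CSS values and body text are placeholders.
PAGE = Template("""<html>
<head>
<style>
  body { font-family: $font_family; font-size: $font_size; text-align: $text_align; }
</style>
</head>
<body>
<p>$text</p>
</body>
</html>""")


def render_page(text, font_family="Times", font_size="12px", text_align="justify"):
    """Fill the layout template with HTML-escaped text and CSS settings."""
    return PAGE.substitute(
        text=escape(text),
        font_family=font_family,
        font_size=font_size,
        text_align=text_align,
    )


# Changing the layout only means passing different CSS values.
html = render_page("Total < 5 & rising", font_size="14px", text_align="left")
```

The resulting HTML string could then be handed to any HTML-to-image renderer to produce the document image.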


Fig. 2 Document template with 2 columns


Fig. 3 Letter-like document template


Fig. 4 Simple text block template

Once a document is generated, you can combine various image degradation effects and apply them to the synthetic document. Here are some of the degradation effects:


Fig. 5 Mimics a document printed on two sides


Fig. 6 Lowers image quality


Fig. 7 Mimics ink degradation


Fig. 8 Degrades printing quality


Fig. 9 Ink overflows


Fig. 10 Combining various degradation effects: blur, salt, open, and bleed-through
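
The effects named above (blur, salt, open, and bleed-through) can be sketched as a chain of small image filters. The example below is a textbook, pure-Python illustration on a tiny grayscale grid, assuming 0 is black ink and 255 is white paper. It is not genalog's implementation, and every function and parameter name here is hypothetical.

```python
import random


def _neighborhood(img, y, x, radius):
    """Collect pixel values in a square window, clipped at the image borders."""
    h, w = len(img), len(img[0])
    return [
        img[j][i]
        for j in range(max(0, y - radius), min(h, y + radius + 1))
        for i in range(max(0, x - radius), min(w, x + radius + 1))
    ]


def _filter(img, radius, reduce_fn):
    """Replace each pixel with reduce_fn applied to its window."""
    return [
        [reduce_fn(_neighborhood(img, y, x, radius)) for x in range(len(img[0]))]
        for y in range(len(img))
    ]


def blur(img, radius=1):
    """Box blur: each pixel becomes the mean of its window."""
    return _filter(img, radius, lambda vals: round(sum(vals) / len(vals)))


def morph_open(img, radius=1):
    """Morphological opening: erosion (min filter), then dilation (max filter)."""
    return _filter(_filter(img, radius, min), radius, max)


def salt(img, amount=0.1, seed=0):
    """Set a random fraction of pixels to white."""
    rng = random.Random(seed)
    return [[255 if rng.random() < amount else p for p in row] for row in img]


def bleed_through(img, alpha=0.8):
    """Blend the page with its mirror image, as if the reverse side showed through."""
    return [
        [round(alpha * p + (1 - alpha) * q) for p, q in zip(row, reversed(row))]
        for row in img
    ]


def apply_all(img, steps):
    """Apply (function, kwargs) steps in order and return the final image."""
    for fn, kwargs in steps:
        img = fn(img, **kwargs)
    return img


# A 5x5 white page with a vertical ink stroke in column 1.
page = [[0 if x == 1 else 255 for x in range(5)] for _ in range(5)]
degraded = apply_all(page, [
    (blur, {"radius": 1}),
    (salt, {"amount": 0.2, "seed": 42}),
    (morph_open, {"radius": 1}),
    (bleed_through, {"alpha": 0.8}),
])
```

Describing the pipeline as an ordered list of steps keeps each effect independent, so effects can be reordered or re-parameterized without touching the filters themselves.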

In addition to document generation and degradation, genalog also provides an efficient implementation of text alignment between the source text and the noisy text.
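
As an illustration of what aligning source and noisy text involves, the sketch below implements a plain global alignment (Needleman-Wunsch) in pure Python. It is a simplified stand-in, not genalog's implementation: the `align` function, the scoring values, and the `@` gap character are assumptions made for this example.

```python
GAP = "@"  # assumed gap placeholder for this sketch


def align(source, noisy, match=1, mismatch=-1, gap=-1):
    """Globally align two strings (Needleman-Wunsch), padding gaps with GAP."""
    n, m = len(source), len(noisy)

    def pair_score(i, j):
        # Score for pairing source[i-1] with noisy[j-1].
        return match if source[i - 1] == noisy[j - 1] else mismatch

    # score[i][j] is the best score for aligning source[:i] with noisy[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # pair the two characters
                score[i - 1][j] + gap,                   # gap in the noisy text
                score[i][j - 1] + gap,                   # gap in the source text
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_source, aligned_noisy = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            aligned_source.append(source[i - 1])
            aligned_noisy.append(noisy[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_source.append(source[i - 1])
            aligned_noisy.append(GAP)
            i -= 1
        else:
            aligned_source.append(GAP)
            aligned_noisy.append(noisy[j - 1])
            j -= 1
    return "".join(reversed(aligned_source)), "".join(reversed(aligned_noisy))


# The noisy text dropped two characters; the alignment marks them with GAP.
src, noisy = align("New York", "Nw Yrk")
```

Both returned strings have the same length, so characters at the same index can be compared directly, for example to carry labels from the source text over to the noisy text.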