Synthetic Document Generator

Python Versions arxiv link MIT license

genalog is an open-source, cross-platform Python package for generating document images with synthetic noise that mimics scanned analog documents (hence the name genalog). You can also add various text degradations to these images. The tool provides a fast and efficient way to generate synthetic documents from text data by reusing the layout of templates that you create in simple HTML.

_images/genalog_demo.gif

Fig. 1 Generate documents and apply degradations

genalog provides several document templates as a starting point. You can alter the document layout using standard CSS properties such as font-family, font-size, and text-align. Here are some example generated documents:

_images/columns_Times_11px.png

Fig. 2 Document template with 2 columns

_images/letter_Times_11px.png

Fig. 3 Letter-like document template

_images/text_block_Times_11px.png

Fig. 4 Simple text block template
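To make the template idea concrete, here is a minimal sketch of filling an HTML layout with text and CSS properties. It uses only the Python standard library and does not call genalog; the template string, the `render_document` helper, and the default style values are all hypothetical stand-ins chosen for illustration.

```python
from html import escape
from string import Template

# Hypothetical HTML template: the layout is controlled by ordinary CSS properties.
TEMPLATE = Template("""<html>
<head>
<style>
  body { font-family: $font_family; font-size: $font_size; text-align: $text_align; }
  .content { column-count: $columns; }
</style>
</head>
<body><div class="content">$text</div></body>
</html>""")


def render_document(text, **style):
    """Fill the template with escaped text and CSS values (illustrative helper)."""
    css = {"font_family": "Times", "font_size": "11px",
           "text_align": "left", "columns": 1}
    css.update(style)  # caller-supplied properties override the defaults
    return TEMPLATE.substitute(text=escape(text), **css)


# Same text, two-column layout: only the style arguments change.
html_doc = render_document("Synthetic documents start as plain text.", columns=2)
print(html_doc)
```

Changing a single style argument yields a different layout from the same text, which is the property that makes HTML templates convenient for generating many document variants.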

Once a document is generated, you can combine various image degradation effects and apply them to the synthetic document. Here are some of the degradation effects:

_images/bleed_through.png

Fig. 5 Mimics a document printed on two sides

_images/blur.png

Fig. 6 Lowers image quality

_images/salt_pepper.png

Fig. 7 Mimics ink degradation

_images/close_dilate.png

Fig. 8 Degrades printing quality

_images/open_erode.png

Fig. 9 Ink overflows

_images/degrader.png

Fig. 10 Combining various degradation effects: blur, salt, open, and bleed-through
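The idea of chaining degradation effects can be sketched in plain Python. This is not genalog's implementation or API: the `salt_pepper`, `box_blur`, and `apply_effects` functions below are hypothetical, operate on a tiny grayscale image stored as nested lists, and only illustrate how a list of (effect, parameters) pairs can be applied in sequence.

```python
import random


def salt_pepper(image, amount=0.05, seed=0):
    """Set roughly a fraction `amount` of pixels to black (0) or white (255)."""
    rng = random.Random(seed)
    out = [row[:] for row in image]  # copy so the input stays untouched
    for row in out:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))
    return out


def box_blur(image, radius=1):
    """Replace each pixel with the mean of its (2*radius+1)-wide square window."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = sum(window) // len(window)
    return out


def apply_effects(image, effects):
    """Apply each (function, keyword-arguments) pair in order."""
    for effect, kwargs in effects:
        image = effect(image, **kwargs)
    return image


# An 8x8 white page with a short horizontal black "ink" stroke.
page = [[255] * 8 for _ in range(8)]
for x in range(2, 6):
    page[4][x] = 0

degraded = apply_effects(page, [
    (salt_pepper, {"amount": 0.1, "seed": 42}),
    (box_blur, {"radius": 1}),
])
for row in degraded:
    print(" ".join(f"{p:3d}" for p in row))
```

Keeping each effect as an independent function with its own parameters means the order and strength of the degradations can be changed by editing the list alone.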

In addition to document generation and degradation, genalog also provides an efficient implementation of text alignment between the source text and the noisy text.
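As a rough illustration of what aligning source text with noisy text means, the sketch below pads both strings with a gap character so that they end up the same length and corresponding characters line up. It relies on `difflib.SequenceMatcher` from the standard library as a stand-in technique; it is not genalog's alignment algorithm, and the `align` function name and the `@` gap character are assumptions made for this example.

```python
from difflib import SequenceMatcher

GAP = "@"  # assumed gap character; it must not occur in the input strings


def align(source, noisy, gap=GAP):
    """Pad both strings with `gap` so matching regions share the same offsets."""
    aligned_source, aligned_noisy = [], []
    matcher = SequenceMatcher(None, source, noisy, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        src_part, noisy_part = source[i1:i2], noisy[j1:j2]
        width = max(len(src_part), len(noisy_part))
        # The shorter side of each edit is filled with gap characters.
        aligned_source.append(src_part.ljust(width, gap))
        aligned_noisy.append(noisy_part.ljust(width, gap))
    return "".join(aligned_source), "".join(aligned_noisy)


source_text = "New York is big"
noisy_text = "N ew Yerk is"  # e.g. text read back from a degraded image

aligned_source, aligned_noisy = align(source_text, noisy_text)
print(aligned_source)
print(aligned_noisy)
```

Once the two strings have equal length, position-based annotations on the source text can be carried over to the noisy text by index.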