Synthetic Document Generator
genalog is an open source, cross-platform Python package for generating document images with synthetic noise that mimics scanned analog documents (hence the name genalog). You can also add various text degradations to these images. The tool provides a fast, efficient way to generate synthetic documents from text data, using layouts from templates that you write in plain HTML.

Fig. 1 Generate documents and apply degradations
genalog provides several document templates as a starting point. You can alter the document layout using standard CSS properties like font-family, font-size, text-align, etc. Here are some example generated documents:

Fig. 2 Document template with 2 columns

Fig. 3 Letter-like document template

Fig. 4 Simple text block template
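
To make the template idea concrete, here is a minimal sketch of how text can be poured into an HTML layout whose look is controlled by CSS properties such as font-family, font-size, and text-align. It uses only the Python standard library and does not call genalog itself: the PAGE template and the render_document helper are hypothetical names made up for this illustration, and genalog's bundled templates are not reproduced here.

```python
import html
from string import Template

# Hypothetical single-column template, written for this sketch only.
PAGE = Template("""<!DOCTYPE html>
<html>
<head>
<style>
  body {
    font-family: $font_family;
    font-size: $font_size;
    text-align: $text_align;
  }
</style>
</head>
<body>
$paragraphs
</body>
</html>
""")


def render_document(text, font_family="Times", font_size="11px", text_align="justify"):
    """Return an HTML page with one <p> element per blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Escape the text so characters like "<" cannot break the markup.
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return PAGE.substitute(
        font_family=font_family,
        font_size=font_size,
        text_align=text_align,
        paragraphs=body,
    )


if __name__ == "__main__":
    print(render_document("First paragraph.\n\nSecond paragraph.", font_size="14px"))
```

Changing only the keyword arguments yields a differently styled page from the same text, which is the property that makes template-driven generation convenient for producing many layout variants.
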
Once a document is generated, you can combine various image degradation effects and apply them to the synthetic document. Here are some of the degradation effects:

Fig. 5 Mimics a document printed on two sides

Fig. 6 Lowers image quality

Fig. 7 Mimics ink degradation

Fig. 8 Degrades printing quality

Fig. 9 Ink overflows

Fig. 10 Combining various degradation effects: blur, salt, open, and bleed-through
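
The way several effects can be chained is easier to see in code. The sketch below is not genalog's implementation: it is a toy, standard-library-only version that treats a page as a 2D list of grayscale values (0 is black ink, 255 is white paper) and applies a list of (effect name, parameters) pairs in order. The functions salt, box_blur, bleed_through, and apply_effects, along with all of their parameters, are assumptions made for illustration. The morphological "open" effect is left out to keep the example short.

```python
import random


def salt(image, amount=0.05, seed=0):
    """Turn roughly a fraction `amount` of the pixels white (255)."""
    rng = random.Random(seed)
    return [[255 if rng.random() < amount else px for px in row] for row in image]


def box_blur(image, radius=1):
    """Replace each pixel with the mean of its neighborhood, clipped at the borders."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            row.append(sum(window) // len(window))
        out.append(row)
    return out


def bleed_through(image, strength=0.3):
    """Darken the page with a faint mirrored copy of itself, as if ink showed through."""
    out = []
    for row in image:
        mirrored = row[::-1]  # the reverse side appears flipped horizontally
        out.append(
            [min(px, 255 - int(strength * (255 - back))) for px, back in zip(row, mirrored)]
        )
    return out


EFFECTS = {"salt": salt, "blur": box_blur, "bleed_through": bleed_through}


def apply_effects(image, pipeline):
    """Apply each (name, kwargs) pair in order; the input image is not modified."""
    for name, kwargs in pipeline:
        image = EFFECTS[name](image, **kwargs)
    return image


if __name__ == "__main__":
    page = [[255] * 8 for _ in range(8)]
    page[2][1] = 0  # a single dot of ink
    degraded = apply_effects(
        page,
        [
            ("bleed_through", {"strength": 0.5}),
            ("blur", {"radius": 1}),
            ("salt", {"amount": 0.1, "seed": 1}),
        ],
    )
    for row in degraded:
        print(" ".join(f"{px:3d}" for px in row))
```

Describing the pipeline as data rather than as nested function calls makes it easy to reorder effects or sweep over parameter values when generating many degraded variants of the same page.
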
In addition to document generation and degradation, genalog also provides an efficient implementation of text alignment between the source text and the noisy text.
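
As an illustration of what text alignment means here, the following sketch aligns a clean source string with a noisy version of it, character by character, using a textbook Needleman-Wunsch global alignment. This is a swapped-in teaching example, not genalog's own implementation or API: the align function, the scoring values, and the "@" gap placeholder are all assumptions chosen for this sketch.

```python
GAP = "@"  # placeholder marking an inserted gap; an arbitrary choice for this sketch


def align(source, noisy, match=1, mismatch=-1, gap=-1):
    """Globally align two strings; return both padded with GAP to equal length."""
    n, m = len(source), len(noisy)

    def sub(i, j):
        # Score for pairing source[i-1] with noisy[j-1].
        return match if source[i - 1] == noisy[j - 1] else mismatch

    # score[i][j] is the best score for aligning source[:i] with noisy[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two characters
                score[i - 1][j] + gap,            # gap in the noisy text
                score[i][j - 1] + gap,            # gap in the source text
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a.append(source[i - 1])
            b.append(noisy[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(source[i - 1])
            b.append(GAP)
            i -= 1
        else:
            a.append(GAP)
            b.append(noisy[j - 1])
            j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


if __name__ == "__main__":
    aligned_source, aligned_noisy = align("New York is big", "New Yerk isbig")
    print(aligned_source)
    print(aligned_noisy)
```

Both returned strings have the same length, so position k in one corresponds to position k in the other. That correspondence is what lets labels attached to the source text be carried over to the noisy text. This quadratic-time version is only meant to show the idea; long documents call for a more efficient approach.
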