Synthetic Document Generator
genalog is an open-source, cross-platform Python package for generating document images with synthetic noise that mimics scanned analog documents (hence the name genalog). You can also add various text degradations to these images. The tool provides a fast and efficient way to generate synthetic documents from text data, using layouts from templates that you can create in simple HTML.
genalog provides several document templates as a starting point. You can alter the document layout using standard CSS properties such as text-align. Here are some example generated documents:
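To make the template idea concrete, the following is a minimal sketch using only the Python standard library. It is not genalog's own API: the template string, the placeholder names (`$text_align`, `$body`) and the helper `render_page` are hypothetical and only illustrate how text can be poured into an HTML layout whose appearance is controlled by a CSS property.

```python
from html import escape
from string import Template

# Hypothetical HTML template. The placeholder names are illustrative
# assumptions, not genalog's actual template variables.
PAGE = Template(
    "<html><head><style>"
    "p { text-align: $text_align; }"
    "</style></head>"
    "<body><p>$body</p></body></html>"
)


def render_page(body: str, text_align: str = "left") -> str:
    """Fill the template with escaped text and one CSS layout property."""
    return PAGE.substitute(body=escape(body), text_align=text_align)


page_html = render_page("Lorem ipsum dolor sit amet.", text_align="justify")
print(page_html)
```

Changing the `text_align` argument changes the layout without touching the text, which is the separation between content and template described above. Rendering the resulting HTML to an image is a separate step that this sketch does not cover.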
Once a document is generated, you can combine various image degradation effects and apply them to the synthetic document. Here are some of the degradation effects:
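Combining effects can be pictured as running an image through an ordered list of operations. The sketch below shows that idea in plain Python on a tiny grayscale image stored as nested lists. The functions `salt`, `box_blur` and `apply_effects` are hypothetical stand-ins written for this illustration; they are not genalog's degradation functions, and a real pipeline would operate on image arrays.

```python
import random
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]  # grayscale pixels in 0..255, one inner list per row


def salt(img: Image, amount: float = 0.1, seed: int = 0) -> Image:
    """Set a random fraction of pixels to white (255)."""
    rng = random.Random(seed)
    return [[255 if rng.random() < amount else px for px in row] for row in img]


def box_blur(img: Image, radius: int = 1) -> Image:
    """Replace each pixel with the mean of its neighbourhood, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[j][i] for j in ys for i in xs]
            row.append(sum(vals) // len(vals))
        out.append(row)
    return out


def apply_effects(img: Image, effects: List[Tuple[Callable, Dict]]) -> Image:
    """Apply (function, kwargs) pairs in order, feeding each result to the next."""
    for fn, kwargs in effects:
        img = fn(img, **kwargs)
    return img


page = [[0] * 8 for _ in range(8)]  # an all-black 8x8 "page"
degraded = apply_effects(
    page,
    [(salt, {"amount": 0.2, "seed": 42}), (box_blur, {"radius": 1})],
)
```

Keeping each effect as a small function with keyword arguments makes the order and strength of the degradations easy to vary from one generated document to the next.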
In addition to document generation and degradation, genalog also provides an efficient implementation of text alignment between the source text and the noisy text.
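As an illustration of what aligning source and noisy text means, here is a minimal sketch of global sequence alignment (the Needleman-Wunsch algorithm) at character level. It is a textbook technique chosen for this example, not necessarily how genalog implements alignment; the function name `align`, the scoring values and the `@` gap character are assumptions made for the sketch.

```python
from typing import Tuple


def align(source: str, noisy: str, gap: str = "@", match: int = 1,
          mismatch: int = -1, gap_penalty: int = -1) -> Tuple[str, str]:
    """Globally align two strings and pad both with `gap` to equal length."""
    n, m = len(source), len(noisy)
    # score[i][j] is the best score for aligning source[:i] with noisy[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        score[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if source[i - 1] == noisy[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,     # align the two characters
                score[i - 1][j] + gap_penalty,  # gap in the noisy text
                score[i][j - 1] + gap_penalty,  # gap in the source text
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if source[i - 1] == noisy[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                a.append(source[i - 1])
                b.append(noisy[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap_penalty:
            a.append(source[i - 1])
            b.append(gap)
            i -= 1
        else:
            a.append(gap)
            b.append(noisy[j - 1])
            j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


aligned_source, aligned_noisy = align("New York", "Nw Yrk")
print(aligned_source)  # New York
print(aligned_noisy)   # N@w Y@rk
```

After alignment both strings have the same length, so a label attached to a character position in the source can be carried over to the same position in the noisy text. This sketch assumes the gap character does not occur in either input and uses quadratic time and memory, so it is only suitable for short strings.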