The def function will automatically parse PDF files and extract text from them. This is useful for generating prompts from PDF files.
def("DOCS", env.files) // contains some pdfs
def("PDFS", env.files, { endsWith: ".pdf" }) // only pdfs
Parsers
The parsers.PDF function reads a PDF file and attempts to cleanly convert it into a text format that is friendly to the LLM.
const { file, pages } = await parsers.PDF(env.files[0])
Once parsed, you can use the file and pages to generate prompts. If the parsing fails, file will be undefined.
const { file, pages } = await parsers.PDF(env.files[0])
// inline the entire file
def("FILE", file)
// or analyze page per page, filter pages
pages.slice(0, 2).forEach((page, i) => {
    def(`PAGE_${i}`, page)
})
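When a PDF is long, inlining every page can produce a very large prompt. The sketch below shows one way to keep only the leading pages that fit a character budget. The helper takePagesWithinBudget is hypothetical and not part of GenAIScript; it assumes each entry of pages is a plain string, and it tolerates an undefined pages value, as may happen when parsing fails.

```javascript
// Hypothetical helper (not part of GenAIScript): keep leading pages until a
// character budget would be exceeded. Assumes each page is a plain string.
function takePagesWithinBudget(pages, maxChars) {
    const selected = []
    let used = 0
    // `pages` may be undefined if parsing failed, so fall back to an empty list
    for (const page of pages ?? []) {
        if (used + page.length > maxChars) break
        selected.push(page)
        used += page.length
    }
    return selected
}

// Stand-in for the `pages` array returned by parsers.PDF
const samplePages = ["intro text", "a much longer second page", "appendix"]
const kept = takePagesWithinBudget(samplePages, 40)
console.log(kept.length)
```

In a script, the selected pages could then be passed to def one by one, in the same way as the slice example above.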
Images and figures
GenAIScript automatically extracts bitmap images from PDFs and stores them in the data array. You can use these images to generate prompts. The images are encoded as PNG and may be large.
const { data } = await parsers.PDF(env.files[0])
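Because the extracted images may be large, it can help to drop oversized ones before adding them to a prompt. The following is a minimal sketch under stated assumptions: filterImagesBySize is a hypothetical helper, and it assumes each entry of data is a Buffer or Uint8Array exposing byteLength. The actual element type may differ.

```javascript
// Hypothetical helper (not part of GenAIScript): keep only images whose
// encoded size is at most `maxBytes`. Assumes Buffer/Uint8Array entries.
function filterImagesBySize(images, maxBytes) {
    return (images ?? []).filter((img) => img.byteLength <= maxBytes)
}

// Stand-ins for PNG buffers from the `data` array: 1 KiB and 5 MiB
const sampleImages = [new Uint8Array(1024), new Uint8Array(5 * 1024 * 1024)]
const small = filterImagesBySize(sampleImages, 1024 * 1024)
console.log(small.length)
```

The 1 MiB limit is arbitrary; a suitable threshold depends on the model and on how many images are sent.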
Rendering pages to images
Add the renderAsImage option to also render each page to a PNG image (as a buffer). This buffer can be used with a vision model to perform an OCR operation.
const { images } = await parsers.PDF(env.files[0], { renderAsImage: true })
You can control the quality of the rendered image using the scale parameter (default is 3).
PDFs are messy
The PDF format was never really meant to allow for clean text extraction. The parsers.PDF function uses the pdf-parse package to extract text from PDFs. This package is not perfect and may fail to extract text from some PDFs. If you have access to the original document, it is recommended to use a more text-friendly format such as Markdown or plain text.