
Parsers
The `parsers` object provides various parsers for common data formats.
The `parsers.JSON5` function parses the JSON5 format.
JSON5 is an extension to the popular JSON file format that aims to be easier to write and maintain by hand (e.g. for config files).
In general, parsing a JSON file as JSON5 does no harm, but the parser is more forgiving of syntax errors. In addition to JSON5, JSON repair is applied if the initial parse fails.
- JSON5 example
{
    // comments
    unquoted: "and you can quote me on that",
    singleQuotes: 'I can use "double quotes" here',
    lineBreaks: "Look, Mom! \
No \\n's!",
    hexadecimal: 0xdecaf,
    leadingDecimalPoint: 0.8675309,
    andTrailing: 8675309,
    positiveSign: +1,
    trailingComma: "in objects",
    andIn: ["arrays"],
    backwardsCompatible: "with JSON",
}
To parse, use `parsers.JSON5`. It accepts either text content or a file as input.
const res = parsers.JSON5("...")
The `parsers.YAML` function parses the YAML format.
YAML is more friendly to the LLM tokenizer than JSON and is commonly used in configuration files.
fields:
    number: 1
    boolean: true
    string: foo
    array:
        - 1
        - 2
To parse, use `parsers.YAML`. It accepts either text content or a file as input.
const res = parsers.YAML("...")
The `parsers.TOML` function parses the TOML format.
TOML is more friendly to the LLM tokenizer than JSON and is commonly used in configuration files.
# This is a TOML document
title = "TOML Example"
[object]
string = "foo"
number = 1
To parse, use `parsers.TOML`. It accepts either text content or a file as input.
const res = parsers.TOML("...")
JSONL is a format that stores JSON objects in a line-by-line format. Each line is a valid JSON(5) object (we use the JSON5 parser to be more error resilient).
{"name": "Alice"}
{"name": "Bob"}
You can use `parsers.JSONL` to parse JSONL files into an array of objects (`any[]`).
const res = parsers.JSONL(file)
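The line-by-line idea can be illustrated with a minimal sketch in plain JavaScript. This is not the GenAIScript implementation: `parseJsonlSketch` is a hypothetical helper, and it uses the strict `JSON.parse` rather than a JSON5 parser so that it stays self-contained.

```js
// Illustrative sketch only, not the GenAIScript implementation.
// Splits JSONL text into lines and parses each non-empty line on its own.
function parseJsonlSketch(text) {
    return text
        .split(/\r?\n/) // one JSON value per line
        .map((line) => line.trim())
        .filter((line) => line.length > 0) // ignore blank lines
        .map((line) => JSON.parse(line)) // strict JSON here (assumption: no JSON5 leniency)
}

const rows = parseJsonlSketch('{"name": "Alice"}\n{"name": "Bob"}\n')
console.log(rows.map((row) => row.name))
```

Because each line is parsed independently, a malformed line can be reported or skipped without discarding the rest of the file.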
The `parsers.XML` function parses the XML format.
const res = parsers.XML('<xml attr="1"><child /></xml>')
Attribute names are prefixed with `@_`.
{
    "xml": {
        "@_attr": "1",
        "child": {}
    }
}
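Since `@_attr` is not a valid identifier, attribute values have to be read with bracket notation. The sketch below works on a literal copy of the result shown above rather than calling the parser; `splitAttributesSketch` is a hypothetical helper, not part of GenAIScript.

```js
// Literal copy of the parsed result from the example above.
const parsed = { xml: { "@_attr": "1", child: {} } }

// Bracket notation is required because of the "@_" prefix.
const attr = parsed.xml["@_attr"]

// Hypothetical helper: separates attributes (keys starting with "@_")
// from child elements and strips the prefix from attribute names.
function splitAttributesSketch(node) {
    const attributes = {}
    const children = {}
    for (const [key, value] of Object.entries(node)) {
        if (key.startsWith("@_")) attributes[key.slice(2)] = value
        else children[key] = value
    }
    return { attributes, children }
}

const { attributes, children } = splitAttributesSketch(parsed.xml)
console.log(attr, Object.keys(attributes), Object.keys(children))
```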
front matter
Front matter is a metadata section at the head of a file, typically formatted as YAML.
---
title: "Hello, World!"
---
...
You can use `parsers.frontmatter` or `MD` to parse the metadata into an object.
const meta = parsers.frontmatter(file)
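To show what "a metadata section at the head of a file" means in practice, here is a minimal sketch that separates the front matter block from the body. `splitFrontMatterSketch` is a hypothetical helper; it returns the raw metadata text and does not parse the YAML, which the real parser does.

```js
// Illustrative sketch only: separates a leading "---" block from the body.
// The front matter is returned as raw text (no YAML parsing here).
function splitFrontMatterSketch(text) {
    const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(text)
    if (!match) return { frontMatter: undefined, body: text }
    return { frontMatter: match[1], body: match[2] }
}

const source = '---\ntitle: "Hello, World!"\n---\n...'
const { frontMatter, body } = splitFrontMatterSketch(source)
console.log(frontMatter)
console.log(body)
```

A file without a leading `---` line is returned unchanged as the body.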
The `parsers.CSV` function parses the CSV format. If successful, the function returns an array of objects where each object represents a row in the CSV file.
const res = parsers.CSV("...")
The parser auto-detects the header names if present; otherwise, pass an array of header names in the options.
const res = parsers.CSV("...", { delimiter: "\t", headers: ["name", "age"] })
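The "one object per row" shape and the role of the `headers` option can be sketched in plain JavaScript. `parseCsvSketch` is a hypothetical, deliberately naive helper: it does not handle quoted fields or type conversion, and it is not how GenAIScript parses CSV.

```js
// Illustrative sketch only: naive CSV to array-of-objects conversion.
// Assumption: no quoted fields, no escaped delimiters, values stay strings.
function parseCsvSketch(text, { delimiter = ",", headers } = {}) {
    const lines = text.split(/\r?\n/).filter((line) => line.length > 0)
    const cells = lines.map((line) => line.split(delimiter))
    // Use the explicit headers if given, otherwise consume the first row.
    const names = headers ?? cells.shift() ?? []
    return cells.map((row) =>
        Object.fromEntries(names.map((name, i) => [name, row[i]]))
    )
}

const withHeaderRow = parseCsvSketch("name,age\nAlice,30\nBob,25")
const withExplicitHeaders = parseCsvSketch("Alice\t30\nBob\t25", {
    delimiter: "\t",
    headers: ["name", "age"],
})
console.log(withHeaderRow, withExplicitHeaders)
```

When `headers` is passed, the first line is treated as data, which is why the second call yields two rows from two lines.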
The `parsers.PDF` function reads a PDF file and attempts to cleanly convert it into a text format. See /genaiscript/reference/scripts/pdf for more information.
The `parsers.DOCX` function reads a .docx file as raw text.
The `parsers.INI` function parses .ini files, typically used for configuration. This format is similar to the key=value format.
KEY=VALUE
The `parsers.XLSX` function reads a .xlsx file and returns an array of objects where each object represents a row in the spreadsheet.
The first row is used as headers.
The function uses the xlsx library.
const sheets = await parsers.XLSX("...filename.xlsx")
const { rows } = sheets[0]
By default, it reads the first sheet and the first row as headers. You can pass a worksheet name and/or a range to process as options.
const res = await parsers.XLSX("...filename.xlsx", {
    sheet: "Sheet2",
    range: "A1:C10",
})
VTT, SRT
The `parsers.transcription` function parses VTT or SRT transcription files into a sequence of segments.
const segments = await parsers.transcription("WEBVTT...")
Unpacks the contents of a zip file and returns an array of files.
const files = await parsers.unzip(env.files[0])
HTML to Text
The `parsers.HTMLToText` function converts HTML to plain text using html-to-text.
const text = parsers.HTMLToText(html)
Prompty
Prompty is a markdown-based prompt template format. GenAIScript provides a parser for prompty templates, with a few additional frontmatter fields to define tests and samples.
---
name: Basic Prompt
description: A basic prompt that uses the chat API to answer questions
---
system:
You are an AI assistant who helps people find information. Answer all questions to the best of your ability.
As the assistant, you answer questions briefly, succinctly.
user:
{{question}}
To parse this file, use the `parsers.prompty` function.
const doc = await parsers.prompty(file)
Code (JavaScript, Python, C, C++, Java, …)
The `parsers.code` function parses source code using the Tree Sitter library. It returns an AST (Abstract Syntax Tree) that can be used to analyze the code.
// the whole tree
const { captures } = await parsers.code(file)
// with a query
const { captures } = await parsers.code(file, "(interface_declaration) @i")
The `tags` query is a built-in alias for the tree-sitter `tags` query that is made available in most tree-sitter libraries.
```js
const { captures } = await parsers.code(file, 'tags')
```
## Math expression
The `parsers.math` function uses [mathjs](https://mathjs.org/) to parse a math expression.
```js
const res = await parsers.math("1 + 1")
```
The `parsers.dotEnv` function parses .env files, typically used for configuration. This format is similar to the key=value format.
KEY=VALUE
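The key=value shape maps naturally to a plain object. The sketch below is a simplified illustration of that mapping, not the GenAIScript parser: `parseKeyValueSketch` is a hypothetical helper, and it assumes `#` comment lines and ignores quoting and multi-line values.

```js
// Illustrative sketch only: turns KEY=VALUE lines into an object.
// Assumptions: "#" starts a comment line; no quoting or multi-line values.
function parseKeyValueSketch(text) {
    const result = {}
    for (const rawLine of text.split(/\r?\n/)) {
        const line = rawLine.trim()
        if (!line || line.startsWith("#")) continue // skip blanks and comments
        const eq = line.indexOf("=") // split on the first "=" only
        if (eq < 0) continue
        result[line.slice(0, eq).trim()] = line.slice(eq + 1).trim()
    }
    return result
}

const settings = parseKeyValueSketch("# comment\nKEY=VALUE\nURL=http://x?a=b")
console.log(settings)
```

Splitting on the first `=` only keeps values that themselves contain `=` intact.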
fences
Parses LLM output that is formatted like the output of the GenAIScript def() function. The text is expected to look something like this:
Foo bar:
```js
var x = 1
...
```
Baz qux:
Also supported. …
Returns a list of parsed code sections.
const fences = parsers.fences("...")
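As a rough illustration of what extracting code sections involves, the sketch below collects fenced blocks with a regular expression. `extractFencesSketch` and the `language`/`content` field names are hypothetical; the objects returned by `parsers.fences` may be shaped differently.

```js
// Illustrative sketch only: collects fenced code sections from text.
// The field names (language, content) are assumptions for this example.
function extractFencesSketch(text) {
    const pattern = /```(\w*)\r?\n([\s\S]*?)```/g
    const sections = []
    let match
    while ((match = pattern.exec(text)) !== null) {
        sections.push({ language: match[1], content: match[2] })
    }
    return sections
}

const sample = "Foo bar:\n```js\nvar x = 1\n```\n"
const found = extractFencesSketch(sample)
console.log(found)
```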
annotations
Parses error and warning annotations in various formats into a list of objects.
const annotations = parsers.annotations("...")
tokens
The `parsers.tokens` function estimates the number of tokens in a string for the current model. This is useful for estimating the number of prompts that can be generated from a string.
const count = parsers.tokens("...")
validateJSON
The `parsers.validateJSON` function validates a JSON string against a schema.
const validation = parsers.validateJSON(schema, json)
mustache
Runs the mustache template engine on the string with the given arguments.
const rendered = parsers.mustache("Today is {{date}}.", { date: new Date() })
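The core of this kind of templating is replacing `{{name}}` placeholders with values from the arguments. The sketch below shows only that step; `renderSketch` is a hypothetical helper and omits what a real mustache engine adds (sections, partials, HTML escaping).

```js
// Illustrative sketch only: replaces {{name}} placeholders with argument values.
// Not a mustache implementation; unknown names render as an empty string.
function renderSketch(template, args) {
    return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_, key) =>
        String(args[key] ?? "")
    )
}

const greeting = renderSketch("Today is {{date}}.", { date: "Monday" })
console.log(greeting)
```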
Runs the jinja template (using @huggingface/jinja).
const rendered = parsers.jinja("Today is {{date}}.", { date: new Date() })
tidyData
A set of data manipulation options that is internally used with `defData`.
const d = parsers.tidyData(rows, { sliceSample: 100, sort: "name" })
Apply a GROQ query to a JSON object.
const d = parsers.GROQ(
    `*[completed == true && userId == 2]{
  title
}`,
    data
)
Utility to hash an object or array into a string that is appropriate for hashing purposes.
const h = parsers.hash({ obj, other }, { length: 12 })
By default, it uses `sha-1`, but `sha-256` can also be used. The hash packing logic may change between versions of GenAIScript.
unthink
Some models return their internal reasoning inside `<think>` tags.
<think>This is my reasoning...</think>Yes
The `unthink` function removes the `<think>` tags.
const text = parsers.unthink(res.text)
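A minimal sketch of the idea, assuming the reasoning text is removed together with its tags. `unthinkSketch` is a hypothetical helper, not the GenAIScript implementation, and it does not handle an unclosed `<think>` tag.

```js
// Illustrative sketch only: strips <think>...</think> blocks, content included.
// Assumption: tags are well-formed and not nested.
function unthinkSketch(text) {
    return text.replace(/<think>[\s\S]*?<\/think>/g, "")
}

const cleaned = unthinkSketch("<think>This is my reasoning...</think>Yes")
console.log(cleaned)
```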
Command line
Use the `parse` command from the CLI to try out various parsers.
# convert any known data format to JSON
genaiscript parse data mydata.csv