Parsers

The parsers object provides various parsers for common data formats.

JSON5

The parsers.JSON5 function parses the JSON5 format. JSON5 is an extension of the popular JSON file format that aims to be easier to write and maintain by hand (e.g. for config files).

In general, parsing a JSON file as JSON5 does no harm, and the parser is more forgiving of syntax errors. In addition to JSON5, JSON repair is applied when the initial parse fails.

JSON5 example
{
  // comments
  unquoted: "and you can quote me on that",
  singleQuotes: 'I can use "double quotes" here',
  lineBreaks: "Look, Mom! \
No \\n's!",
  hexadecimal: 0xdecaf,
  leadingDecimalPoint: .8675309,
  andTrailing: 8675309.,
  positiveSign: +1,
  trailingComma: "in objects",
  andIn: ["arrays"],
  backwardsCompatible: "with JSON",
}

To parse, use parsers.JSON5. It accepts either text content or a file as input.

const res = parsers.JSON5("...")
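
To illustrate why a JSON5 parse is more forgiving, here is a minimal sketch that uses only the built-in JSON.parse (not parsers.JSON5). The helper name tryStrictJSON is hypothetical and exists only for this example; it shows that strict JSON rejects input that uses JSON5-only features.

```javascript
// Hypothetical helper: returns the parsed value, or undefined if the text
// is not strict JSON. It uses the built-in JSON.parse, not parsers.JSON5.
function tryStrictJSON(text) {
    try {
        return JSON.parse(text)
    } catch {
        return undefined
    }
}

// Strict JSON parses fine.
const strict = tryStrictJSON('{"name": "Alice"}')

// Comments, unquoted keys, and trailing commas are JSON5-only features,
// so strict JSON parsing rejects this input.
const relaxed = tryStrictJSON(`{
    // comments
    unquoted: "and you can quote me on that",
}`)

console.log(strict) // { name: 'Alice' }
console.log(relaxed) // undefined
```

Input like the second string is the kind of hand-written content where a JSON5 parse is the safer choice.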

YAML

The parsers.YAML function parses the YAML format. YAML is friendlier to the LLM tokenizer than JSON and is commonly used in configuration files.

fields:
  number: 1
  boolean: true
  string: foo
  array:
    - 1
    - 2

To parse, use parsers.YAML. It accepts either text content or a file as input.

const res = parsers.YAML("...")

TOML

The parsers.TOML function parses the TOML format. TOML is commonly used in configuration files.

# This is a TOML document
title = "TOML Example"
[object]
string = "foo"
number = 1

To parse, use parsers.TOML. It accepts either text content or a file as input.

const res = parsers.TOML("...")

JSONL

JSONL is a format that stores one JSON object per line. Each line is a valid JSON(5) object (the JSON5 parser is used to be more error resilient).

data.jsonl
{"name": "Alice"}
{"name": "Bob"}

You can use parsers.JSONL to parse JSONL files into an array of objects (any[]).

const res = parsers.JSONL(file)
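
As a rough illustration of the line-by-line idea, the sketch below parses JSONL text with plain JavaScript. The helper parseJSONL is hypothetical and uses strict JSON.parse for brevity, whereas the parser described above relies on JSON5; it is not the library's implementation.

```javascript
// Hypothetical helper: parse JSONL text into an array of objects.
// Uses strict JSON.parse for brevity; the parser described above uses JSON5.
function parseJSONL(text) {
    return text
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line.length > 0) // skip blank lines
        .map((line) => JSON.parse(line))
}

const jsonlRows = parseJSONL('{"name": "Alice"}\n{"name": "Bob"}\n')
console.log(jsonlRows.map((row) => row.name)) // [ 'Alice', 'Bob' ]
```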

XML

The parsers.XML function parses the XML format.

const res = parsers.XML("<xml></xml>")

front matter

Front matter is a metadata section at the head of a file, typically formatted as YAML.

---
title: "Hello, World!"
---
...

You can use parsers.frontmatter to parse the metadata into an object.

const meta = parsers.frontmatter(file)
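
To make the layout concrete, here is a minimal sketch that separates a leading front matter block from the rest of a document. The helper splitFrontMatter is hypothetical; it only isolates the raw metadata text and does not parse the YAML, so it is not a substitute for parsers.frontmatter.

```javascript
// Hypothetical helper: separate a leading front matter block from the body.
// It returns the raw metadata text; it does not parse the YAML itself.
function splitFrontMatter(text) {
    const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/.exec(text)
    if (!match) return { frontmatter: undefined, content: text }
    return {
        frontmatter: match[1],
        content: text.slice(match[0].length),
    }
}

const doc = '---\ntitle: "Hello, World!"\n---\nBody text\n'
const { frontmatter, content } = splitFrontMatter(doc)
console.log(frontmatter) // title: "Hello, World!"
console.log(content.trim()) // Body text
```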

CSV

The parsers.CSV function parses the CSV format. If successful, the function returns an array of objects, where each object represents a row in the CSV file.

const res = parsers.CSV("...")

The parser auto-detects the header names if present; otherwise, pass an array of header names in the options.

const res = parsers.CSV("...", { delimiter: "\t", headers: ["name", "age"] })
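
The effect of the header row and the headers option can be sketched in plain JavaScript. The helper parseSimpleCSV below is hypothetical and deliberately naive (no quoted fields or escaped delimiters); it only illustrates how rows turn into objects, not how parsers.CSV is implemented.

```javascript
// Hypothetical helper: turn delimited text into an array of row objects.
// Naive on purpose: it does not handle quoted fields or escaped delimiters.
function parseSimpleCSV(text, options = {}) {
    const { delimiter = ",", headers } = options
    const lines = text.split(/\r?\n/).filter((line) => line.length > 0)
    if (lines.length === 0) return []
    // Without explicit headers, treat the first line as the header row.
    const names = headers ?? lines.shift().split(delimiter)
    return lines.map((line) => {
        const cells = line.split(delimiter)
        return Object.fromEntries(names.map((name, i) => [name, cells[i]]))
    })
}

const withHeaderRow = parseSimpleCSV("name,age\nAlice,30\nBob,25")
const withOptions = parseSimpleCSV("Alice\t30\nBob\t25", {
    delimiter: "\t",
    headers: ["name", "age"],
})
console.log(withHeaderRow[0]) // { name: 'Alice', age: '30' }
console.log(withOptions[1]) // { name: 'Bob', age: '25' }
```

Note that this sketch leaves every cell as a string; it makes no attempt to convert numbers.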

PDF

The parsers.PDF function reads a PDF file and attempts to cleanly convert it into a text format. See /genaiscript/reference/scripts/pdf for more information.

DOCX

The parsers.DOCX function reads a .docx file as raw text.

INI

The parsers.INI function parses .ini files, typically used for configuration. This format is similar to the key=value format.

KEY=VALUE

XLSX

The parsers.XLSX function reads a .xlsx file and returns an array of sheets; each sheet exposes its rows as an array of objects, where each object represents a row in the spreadsheet. The first row is used as headers. The function uses the xlsx library.

const sheets = await parsers.XLSX("...filename.xlsx")
const { rows } = sheets[0]

By default, it reads the first sheet and treats the first row as headers. You can pass a worksheet name and/or a range to process as options.

const res = await parsers.XLSX("...filename.xlsx", {
    sheet: "Sheet2",
    range: "A1:C10",
})

Unzip

Unpacks the contents of a zip file and returns an array of files.

const files = await parsers.unzip(env.files[0])

HTML to Text

The parsers.HTMLToText function converts HTML to plain text using the html-to-text library.

const text = parsers.HTMLToText(html)

Code (JavaScript, Python, C, C++, Java, …)

The parsers.code function parses source code using the Tree-sitter library. It returns an AST (abstract syntax tree) that can be used to analyze the code.

// the whole tree
const [tree] = await parsers.code(file)
// with a query
const captures = await parsers.code(file, "(interface_declaration) @i")

Math expression

The parsers.math function uses mathjs to parse a math expression.

const res = parsers.math("1 + 1")

.env

The parsers.dotEnv function parses .env files, typically used for configuration. This format is similar to the key=value format.

KEY=VALUE
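
A minimal sketch of what reading KEY=VALUE lines involves is shown below. The helper parseKeyValues is hypothetical and simplified (no quoting, escaping, or variable expansion), so treat it as an illustration of the format and not as the behavior of parsers.dotEnv.

```javascript
// Hypothetical helper: parse KEY=VALUE lines into an object.
// Simplified: no quoting, escaping, or variable expansion.
function parseKeyValues(text) {
    const result = {}
    for (const rawLine of text.split(/\r?\n/)) {
        const line = rawLine.trim()
        // Skip blank lines and # comments.
        if (!line || line.startsWith("#")) continue
        const eq = line.indexOf("=")
        if (eq < 0) continue // not a KEY=VALUE line
        // Split on the first "=" only, so values may contain "=".
        result[line.slice(0, eq).trim()] = line.slice(eq + 1).trim()
    }
    return result
}

const settings = parseKeyValues("# sample\nKEY=VALUE\nEXPR=a=b\n")
console.log(settings.KEY) // VALUE
console.log(settings.EXPR) // a=b
```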

fences

Parses LLM output that resembles the output of the GenAIScript def() function. The text is expected to look something like this:

Foo bar:
```js
var x = 1
...
```
Baz qux:

Also supported. …

Returns a list of parsed code sections.

const fences = parsers.fences("...")
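
The idea of extracting fenced sections can be sketched with a regular expression. The helper extractFences and the shape of the objects it returns are assumptions made for illustration; the actual return type of parsers.fences is not specified here.

```javascript
// Hypothetical helper: extract fenced code sections from LLM output.
// The fence marker is built with repeat() so this listing contains no literal fence.
const TICKS = "`".repeat(3)
const FENCE_RE = new RegExp(TICKS + "(\\w*)\\r?\\n([\\s\\S]*?)\\r?\\n" + TICKS, "g")

function extractFences(text) {
    // matchAll yields one match per fenced section: [whole, language, content]
    return Array.from(text.matchAll(FENCE_RE), (m) => ({
        language: m[1],
        content: m[2],
    }))
}

const llmOutput = ["Foo bar:", TICKS + "js", "var x = 1", TICKS, "Done."].join("\n")
console.log(extractFences(llmOutput)) // [ { language: 'js', content: 'var x = 1' } ]
```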

annotations

Parses error and warning annotations in various formats into a list of objects.

const annotations = parsers.annotations("...")

tokens

The parsers.tokens function estimates the number of tokens in a string for the current model. This is useful for estimating how many prompts can be generated from a string.

const count = parsers.tokens("...")
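
One way a token estimate might be used is to split a long text into chunks that each fit a token budget. The sketch below is an assumption-laden illustration: chunkByTokens is a hypothetical helper, and roughEstimate is a crude stand-in (about 4 characters per token) for a real estimator such as parsers.tokens.

```javascript
// Hypothetical helper: group lines into chunks of at most maxTokens,
// measured by a caller-supplied estimator function.
// A single line that exceeds the budget still becomes its own chunk.
function chunkByTokens(text, maxTokens, estimate) {
    const chunks = []
    let current = ""
    for (const line of text.split("\n")) {
        const candidate = current ? current + "\n" + line : line
        if (current && estimate(candidate) > maxTokens) {
            chunks.push(current) // budget exceeded: close the current chunk
            current = line
        } else {
            current = candidate
        }
    }
    if (current) chunks.push(current)
    return chunks
}

// Assumption: a crude stand-in estimator of about 4 characters per token.
const roughEstimate = (s) => Math.ceil(s.length / 4)
const chunks = chunkByTokens("aaaa bbbb\ncccc dddd\neeee", 4, roughEstimate)
console.log(chunks) // [ 'aaaa bbbb', 'cccc dddd\neeee' ]
```

Passing the estimator as a parameter keeps the sketch independent of any particular tokenizer; the number of chunks gives a rough count of the prompts needed.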

validateJSON

The parsers.validateJSON function validates a JSON string against a schema.

const validation = parsers.validateJSON(schema, json)