Skip to content

tree-sitter

1 post with the tag “tree-sitter”

AST Grep and Transform

This page describes a strategy to build GenAI scripts that use Abstract Syntax Trees (AST) to parse and modify source code. When applicable, it provides an extremely flexible and stable way to apply large scale changes to source code. Interested? Let’s dive in!

The strategy of AST-based code transformation

One of the challenge when creating GenAI scripts that update source code is to correctly locate and update source code. The slightest mistake in the location of the code to update can lead to a broken code. This is especially true when the code to update is not a simple string, but a complex structure like an object or a function call.

In some cases, you know “precisely” which part of the code you want to update. For example, you want to refresh the documentation of a function after a change. You know that the documentation is located just before the function definition at least in the sense of the programming language but the number of empty lines or space may vary.

math.ts
/** sums a and b */
function fn(a: number, b: number): number {
return a - b // oops outdated
}

In such scenario, you can use the Abstract Syntax Tree (AST) to locate the code to update. The AST is a tree representation of code.

So instead of fighting spaces and new lines, you can just locate the function_declaration node that follows a comment node.

docs.genai.mts
const node = sg.search("functions without comments")

Once you’ve located the node to update, you could do any transformation you want, e.g. replace it with another text. In terms of GenAI script, this means that you can build a prompt that includes as much context as you need, generate a response.

docs.genai.mts
$`Update the documentation of the function 'fn' to reflect the new behavior of the function.`
fence(node.text())
/* subs a and b */

Once the LLM responds with the new comment, you could insert it as the new content of the node in the AST.

docs.genai.mts
edits.replace(node.comment(), response)

Voila! You’ve only touched the part of the file you wanted to update!

math.ts
/** subs a and b */
function fn(a: number, b: number): number {
return a - b
}

To recap, this strategy is based on the following steps:

  1. search Use the AST to locate the node to update.
  2. transform and replace Use the LLM to generate the new content of the node.
  3. commit Update the node in the AST with the new content.

ast-grep

AST-Grep Logo

ast-grep(sg) is a fast and polyglot tool for code structural search, lint, rewriting at large scale. sg provides us the AST-search-and-replace capabilities we need to implement the strategy above.

GenAIScript benefits from the excellent Node.JS integration, which is available through the host.astGrep() method.

Searching

The sg.search method allows you to search for nodes in the AST. It takes the language, file glob and pattern syntax and returns a list of matches.

docs.genai.mts
// search
const { matches, replace } = await sg.search("ts", "src/*fib*.ts", {
rule: {
kind: "function_declaration",
not: {
precedes: {
kind: "comment",
stopBy: "neighbor",
},
},
},
})

Editing

The sg.changeset method creates a changeset that can be used to apply a set of edits to a set of files.

// transform
const edits = sg.changeset()
for (const match of matches) {
const { text } = await prompt`Generate new docs for ${match.text()}`
// replace
edits.replace(match.comment(), text) // it's somewhat more involved
}
// commit all edits to file
await workspace.writeFiles(edits.commit())

Sample: Doc generator / updater

You will find a full write down of the making of the documentation generator/updater script below in the documentation. I encourage you to read it to dig deeper.

The docs scripts is a documentation generator/updater.

  • uses ast-grep to find and generate missing documentation for exported TypeScript function. A second LLM-as-Judge request is used to check that the generated documentation is correct.
  • if the diff option is selected, it will filter out functions that do not intersect with the diff (this is rather naive but a good start…).
  • it can also be used to update the documentation of a function that has changed.
  • it works regardless of the file size or the number of files as most transformations are hyper-localized.
generate and refresh my docs plz
genaiscript run docs -- --diff

Here are some example of applications of the scripts (one-shot, no human edit, multi-edit per file):

Here it goes:

docs.genai.mts
import { classify } from "genaiscript/runtime"
import { docify } from "./src/docs.mts"
import { prettier } from "./src/prettier.mts"
script({
title: "Generate TypeScript function documentation using AST insertion",
group: "dev",
description: `
## Docs!
This script generates and updates TypeScript function using an AST/LLM hybrid approach.
It uses ast-grep to look for undocumented and documented functions,
then uses a combination of LLM, and LLM-as-a-judge to generate and validate the documentation.
It also uses prettier to format the code before and after the generation.
By default,
- no edits are applied on disk. It is recommended to
run this script with \`--vars 'applyEdits=true'\` to apply the edits.
- if a diff is available, it will only process the files with changes.
`,
accept: ".ts",
files: "src/cowsay.ts",
parameters: {
diff: {
type: "boolean",
default: false,
description:
"If true, the script will only process files with changes with respect to main.",
},
pretty: {
type: "boolean",
default: false,
description:
"If true, the script will prettify the files before analysis.",
},
applyEdits: {
type: "boolean",
default: false,
description: "If true, the script will not modify the files.",
},
missing: {
type: "boolean",
default: true,
description: "Generate missing docs.",
},
update: {
type: "boolean",
default: true,
description: "Update existing docs.",
},
maxFiles: {
type: "integer",
description: "Maximum number of files to process.",
},
},
})
const { output, dbg, vars } = env
let { files } = env
const { applyEdits, diff, pretty, missing, update, maxFiles } = vars
dbg({ applyEdits, diff, pretty, missing, update, maxFiles })
if (!missing && !update) cancel(`not generating or updating docs, exiting...`)
if (!applyEdits)
output.warn(
`edit not applied, use --vars 'applyEdits=true' to apply the edits`
)
// filter by diff
const gitDiff = diff ? await git.diff({ base: "dev" }) : undefined
console.debug(gitDiff)
const diffFiles = gitDiff ? DIFF.parse(gitDiff) : undefined
if (diffFiles?.length) {
dbg(`diff files: ${diffFiles.map((f) => f.to)}`)
files = files.filter(({ filename }) =>
diffFiles.some((f) => path.resolve(f.to) === path.resolve(filename))
)
dbg(`diff filtered files: ${files.length}`)
}
if (maxFiles && files.length > maxFiles) {
dbg(`random slicing files to ${maxFiles}`)
files = parsers.tidyData(files, {
sliceSample: maxFiles,
}) as WorkspaceFile[]
}
const sg = await host.astGrep()
const stats = []
for (const file of files) {
console.debug(file.filename)
// normalize spacing
if (pretty) await prettier(file)
// generate missing docs
if (missing) {
stats.push({
filename: file.filename,
kind: "new",
gen: 0,
genCost: 0,
judge: 0,
judgeCost: 0,
edits: 0,
updated: 0,
})
await generateDocs(file, stats.at(-1))
}
// generate updated docs
if (update) {
stats.push({
filename: file.filename,
kind: "update",
gen: 0,
genCost: 0,
judge: 0,
judgeCost: 0,
edits: 0,
updated: 0,
})
await updateDocs(file, stats.at(-1))
}
}
if (stats.length)
output.table(
stats.filter((row) =>
Object.values(row).some((d) => typeof d === "number" && d > 0)
)
)
async function generateDocs(file: WorkspaceFile, fileStats: any) {
const { matches: missingDocs } = await sg.search(
"ts",
file.filename,
{
rule: {
kind: "export_statement",
not: {
follows: {
kind: "comment",
stopBy: "neighbor",
},
},
has: {
kind: "function_declaration",
},
},
},
{ diff: gitDiff, applyGitIgnore: false }
)
dbg(`found ${missingDocs.length} missing docs`)
const edits = sg.changeset()
// for each match, generate a docstring for functions not documented
for (const missingDoc of missingDocs) {
const res = await runPrompt(
(_) => {
_.def("FILE", missingDoc.getRoot().root().text())
_.def("FUNCTION", missingDoc.text())
// this needs more eval-ing
_.$`Generate a function documentation for <FUNCTION>.
- Make sure parameters are documented.
- Be concise. Use technical tone.
- do NOT include types, this is for TypeScript.
- Use docstring syntax. do not wrap in markdown code section.
The full source of the file is in <FILE> for reference.`
},
{
model: "large",
responseType: "text",
label: missingDoc.text()?.slice(0, 20) + "...",
}
)
// if generation is successful, insert the docs
fileStats.gen += res.usage?.total || 0
fileStats.genCost += res.usage?.cost || 0
if (res.error) {
output.warn(res.error.message)
continue
}
const docs = docify(res.text.trim())
// sanity check
const judge = await classify(
(_) => {
_.def("FUNCTION", missingDoc.text())
_.def("DOCS", docs)
},
{
ok: "The content in <DOCS> is an accurate documentation for the code in <FUNCTION>.",
err: "The content in <DOCS> does not match with the code in <FUNCTION>.",
},
{
model: "small",
responseType: "text",
temperature: 0.2,
systemSafety: false,
system: ["system.technical", "system.typescript"],
}
)
fileStats.judge += judge.usage?.total || 0
fileStats.judgeCost += judge.usage?.cost || 0
if (judge.label !== "ok") {
output.warn(judge.label)
output.fence(judge.answer)
continue
}
const updated = `${docs}\n${missingDoc.text()}`
edits.replace(missingDoc, updated)
fileStats.edits++
}
// apply all edits and write to the file
const modifiedFiles = edits.commit()
if (!modifiedFiles?.length) {
dbg("no edits to apply")
return
}
fileStats.updated = 1
if (applyEdits) {
await workspace.writeFiles(modifiedFiles)
await prettier(file)
} else {
output.diff(file, modifiedFiles[0])
}
}
async function updateDocs(file: WorkspaceFile, fileStats: any) {
const { matches } = await sg.search(
"ts",
file.filename,
YAML`
rule:
kind: "export_statement"
follows:
kind: "comment"
stopBy: neighbor
has:
kind: "function_declaration"
`,
{ diff: gitDiff, applyGitIgnore: false }
)
dbg(`found ${matches.length} docs to update`)
const edits = sg.changeset()
// for each match, generate a docstring for functions not documented
for (const match of matches) {
const comment = match.prev()
const res = await runPrompt(
(_) => {
_.def("FILE", match.getRoot().root().text(), { flex: 1 })
_.def("DOCSTRING", comment.text(), { flex: 10 })
_.def("FUNCTION", match.text(), { flex: 10 })
// this needs more eval-ing
_.$`Update the docstring <DOCSTRING> to match the code in function <FUNCTION>.
- If the docstring is up to date, return /NOP/.
- do not rephrase an existing sentence if it is correct.
- Make sure parameters are documented.
- do NOT include types, this is for TypeScript.
- Use docstring syntax. do not wrap in markdown code section.
- Minimize updates to the existing docstring.
The full source of the file is in <FILE> for reference.
The source of the function is in <FUNCTION>.
The current docstring is <DOCSTRING>.
`
},
{
model: "large",
responseType: "text",
flexTokens: 12000,
label: match.text()?.slice(0, 20) + "...",
temperature: 0.2,
systemSafety: false,
system: ["system.technical", "system.typescript"],
}
)
fileStats.gen += res.usage?.total || 0
fileStats.genCost += res.usage?.cost || 0
// if generation is successful, insert the docs
if (res.error) {
output.warn(res.error.message)
continue
}
if (res.text.includes("/NOP/")) continue
const docs = docify(res.text.trim())
// ask LLM if change is worth it
const judge = await classify(
(_) => {
_.def("FUNCTION", match.text())
_.def("ORIGINAL_DOCS", comment.text())
_.def("NEW_DOCS", docs)
_.$`An LLM generated an updated docstring <NEW_DOCS> for function <FUNCTION>. The original docstring is <ORIGINAL_DOCS>.`
},
{
APPLY: "The <NEW_DOCS> is a significant improvement to <ORIGINAL_DOCS>.",
NIT: "The <NEW_DOCS> contains nits (minor adjustments) to <ORIGINAL_DOCS>.",
},
{
model: "large",
responseType: "text",
temperature: 0.2,
systemSafety: false,
system: ["system.technical", "system.typescript"],
}
)
fileStats.judge += judge.usage?.total || 0
fileStats.judgeCost += judge.usage?.cost || 0
if (judge.label === "NIT") {
output.warn("LLM suggests minor adjustments, skipping")
continue
}
edits.replace(comment, docs)
fileStats.edits++
}
// apply all edits and write to the file
const modifiedFiles = edits.commit()
if (!modifiedFiles?.length) {
dbg("no edits to apply")
return
}
fileStats.updated = 1
if (applyEdits) {
await workspace.writeFiles(modifiedFiles)
await prettier(file)
} else {
output.diff(file, modifiedFiles[0])
}
}