AST Grep and Transform
This page describes a strategy to build GenAI scripts that use Abstract Syntax Trees (AST) to parse and modify source code. When applicable, it provides an extremely flexible and stable way to apply large scale changes to source code. Interested? Let’s dive in!
The strategy of AST-based code transformation
One of the challenge when creating GenAI scripts that update source code is to correctly locate and update source code. The slightest mistake in the location of the code to update can lead to a broken code. This is especially true when the code to update is not a simple string, but a complex structure like an object or a function call.
In some cases, you know “precisely” which part of the code you want to update. For example, you want to refresh the documentation of a function after a change. You know that the documentation is located just before the function definition at least in the sense of the programming language but the number of empty lines or space may vary.
/** sums a and b */function fn(a: number, b: number): number { return a - b // oops outdated}
In such scenario, you can use the Abstract Syntax Tree (AST) to locate the code to update. The AST is a tree representation of code.
So instead of fighting spaces and new lines, you can just locate the function_declaration
node that follows
a comment
node.
const node = sg.search("functions without comments")
Once you’ve located the node to update, you could do any transformation you want, e.g. replace it with another text. In terms of GenAI script, this means that you can build a prompt that includes as much context as you need, generate a response.
$`Update the documentation of the function 'fn' to reflect the new behavior of the function.`fence(node.text())
/* subs a and b */
Once the LLM responds with the new comment, you could insert it as the new content of the node in the AST.
edits.replace(node.comment(), response)
Voila! You’ve only touched the part of the file you wanted to update!
/** subs a and b */function fn(a: number, b: number): number { return a - b}
To recap, this strategy is based on the following steps:
- search Use the AST to locate the node to update.
- transform and replace Use the LLM to generate the new content of the node.
- commit Update the node in the AST with the new content.
ast-grep
ast-grep(sg) is a fast and polyglot tool for code structural search, lint, rewriting at large scale. sg provides us the AST-search-and-replace capabilities we need to implement the strategy above.
GenAIScript benefits from the excellent Node.JS integration,
which is available through the host.astGrep()
method.
Searching
The sg.search
method allows you to search for nodes in the AST. It takes the language, file glob
and pattern syntax and returns a list of matches.
// searchconst { matches, replace } = await sg.search("ts", "src/*fib*.ts", { rule: { kind: "function_declaration", not: { precedes: { kind: "comment", stopBy: "neighbor", }, }, },})
Editing
The sg.changeset
method creates a changeset that can be used to apply a set of edits to a set of files.
// transformconst edits = sg.changeset()for (const match of matches) { const { text } = await prompt`Generate new docs for ${match.text()}` // replace edits.replace(match.comment(), text) // it's somewhat more involved}// commit all edits to fileawait workspace.writeFiles(edits.commit())
Sample: Doc generator / updater
You will find a full write down of the making of the documentation generator/updater script below in the documentation. I encourage you to read it to dig deeper.
The docs
scripts is a documentation generator/updater.
- uses ast-grep to find and generate missing documentation for exported TypeScript function. A second LLM-as-Judge request is used to check that the generated documentation is correct.
- if the
diff
option is selected, it will filter out functions that do not intersect with the diff (this is rather naive but a good start…). - it can also be used to update the documentation of a function that has changed.
- it works regardless of the file size or the number of files as most transformations are hyper-localized.
genaiscript run docs -- --diff
Here are some example of applications of the scripts (one-shot, no human edit, multi-edit per file):
- TypeScript codebase, I stopped the script after a while, it was humming.
- GenAIScript codebase, this change also contains the update to GenAIScript as I was building the feature.
Here it goes:
import { classify } from "genaiscript/runtime"import { docify } from "./src/docs.mts"import { prettier } from "./src/prettier.mts"
script({ title: "Generate TypeScript function documentation using AST insertion", group: "dev", description: `## Docs!
This script generates and updates TypeScript function using an AST/LLM hybrid approach.It uses ast-grep to look for undocumented and documented functions,then uses a combination of LLM, and LLM-as-a-judge to generate and validate the documentation.It also uses prettier to format the code before and after the generation.
By default,
- no edits are applied on disk. It is recommended torun this script with \`--vars 'applyEdits=true'\` to apply the edits.- if a diff is available, it will only process the files with changes.
`, accept: ".ts", files: "src/cowsay.ts", parameters: { diff: { type: "boolean", default: false, description: "If true, the script will only process files with changes with respect to main.", }, pretty: { type: "boolean", default: false, description: "If true, the script will prettify the files before analysis.", }, applyEdits: { type: "boolean", default: false, description: "If true, the script will not modify the files.", }, missing: { type: "boolean", default: true, description: "Generate missing docs.", }, update: { type: "boolean", default: true, description: "Update existing docs.", }, maxFiles: { type: "integer", description: "Maximum number of files to process.", }, },})const { output, dbg, vars } = envlet { files } = envconst { applyEdits, diff, pretty, missing, update, maxFiles } = vars
dbg({ applyEdits, diff, pretty, missing, update, maxFiles })
if (!missing && !update) cancel(`not generating or updating docs, exiting...`)
if (!applyEdits) output.warn( `edit not applied, use --vars 'applyEdits=true' to apply the edits` )
// filter by diffconst gitDiff = diff ? await git.diff({ base: "dev" }) : undefinedconsole.debug(gitDiff)const diffFiles = gitDiff ? DIFF.parse(gitDiff) : undefinedif (diffFiles?.length) { dbg(`diff files: ${diffFiles.map((f) => f.to)}`) files = files.filter(({ filename }) => diffFiles.some((f) => path.resolve(f.to) === path.resolve(filename)) ) dbg(`diff filtered files: ${files.length}`)}
if (maxFiles && files.length > maxFiles) { dbg(`random slicing files to ${maxFiles}`) files = parsers.tidyData(files, { sliceSample: maxFiles, }) as WorkspaceFile[]}
const sg = await host.astGrep()const stats = []for (const file of files) { console.debug(file.filename) // normalize spacing if (pretty) await prettier(file)
// generate missing docs if (missing) { stats.push({ filename: file.filename, kind: "new", gen: 0, genCost: 0, judge: 0, judgeCost: 0, edits: 0, updated: 0, }) await generateDocs(file, stats.at(-1)) }
// generate updated docs if (update) { stats.push({ filename: file.filename, kind: "update", gen: 0, genCost: 0, judge: 0, judgeCost: 0, edits: 0, updated: 0, }) await updateDocs(file, stats.at(-1)) }}
if (stats.length) output.table( stats.filter((row) => Object.values(row).some((d) => typeof d === "number" && d > 0) ) )
async function generateDocs(file: WorkspaceFile, fileStats: any) { const { matches: missingDocs } = await sg.search( "ts", file.filename, { rule: { kind: "export_statement", not: { follows: { kind: "comment", stopBy: "neighbor", }, }, has: { kind: "function_declaration", }, }, }, { diff: gitDiff, applyGitIgnore: false } ) dbg(`found ${missingDocs.length} missing docs`) const edits = sg.changeset() // for each match, generate a docstring for functions not documented for (const missingDoc of missingDocs) { const res = await runPrompt( (_) => { _.def("FILE", missingDoc.getRoot().root().text()) _.def("FUNCTION", missingDoc.text()) // this needs more eval-ing _.$`Generate a function documentation for <FUNCTION>. - Make sure parameters are documented. - Be concise. Use technical tone. - do NOT include types, this is for TypeScript. - Use docstring syntax. do not wrap in markdown code section.
The full source of the file is in <FILE> for reference.` }, { model: "large", responseType: "text", label: missingDoc.text()?.slice(0, 20) + "...", } ) // if generation is successful, insert the docs fileStats.gen += res.usage?.total || 0 fileStats.genCost += res.usage?.cost || 0 if (res.error) { output.warn(res.error.message) continue } const docs = docify(res.text.trim())
// sanity check const judge = await classify( (_) => { _.def("FUNCTION", missingDoc.text()) _.def("DOCS", docs) }, { ok: "The content in <DOCS> is an accurate documentation for the code in <FUNCTION>.", err: "The content in <DOCS> does not match with the code in <FUNCTION>.", }, { model: "small", responseType: "text", temperature: 0.2, systemSafety: false, system: ["system.technical", "system.typescript"], } ) fileStats.judge += judge.usage?.total || 0 fileStats.judgeCost += judge.usage?.cost || 0 if (judge.label !== "ok") { output.warn(judge.label) output.fence(judge.answer) continue } const updated = `${docs}\n${missingDoc.text()}` edits.replace(missingDoc, updated) fileStats.edits++ }
// apply all edits and write to the file const modifiedFiles = edits.commit() if (!modifiedFiles?.length) { dbg("no edits to apply") return } fileStats.updated = 1 if (applyEdits) { await workspace.writeFiles(modifiedFiles) await prettier(file) } else { output.diff(file, modifiedFiles[0]) }}
async function updateDocs(file: WorkspaceFile, fileStats: any) { const { matches } = await sg.search( "ts", file.filename, YAML`rule: kind: "export_statement" follows: kind: "comment" stopBy: neighbor has: kind: "function_declaration"`, { diff: gitDiff, applyGitIgnore: false } ) dbg(`found ${matches.length} docs to update`) const edits = sg.changeset() // for each match, generate a docstring for functions not documented for (const match of matches) { const comment = match.prev()
const res = await runPrompt( (_) => { _.def("FILE", match.getRoot().root().text(), { flex: 1 }) _.def("DOCSTRING", comment.text(), { flex: 10 }) _.def("FUNCTION", match.text(), { flex: 10 }) // this needs more eval-ing _.$`Update the docstring <DOCSTRING> to match the code in function <FUNCTION>. - If the docstring is up to date, return /NOP/. - do not rephrase an existing sentence if it is correct. - Make sure parameters are documented. - do NOT include types, this is for TypeScript. - Use docstring syntax. do not wrap in markdown code section. - Minimize updates to the existing docstring.
The full source of the file is in <FILE> for reference. The source of the function is in <FUNCTION>. The current docstring is <DOCSTRING>. ` }, { model: "large", responseType: "text", flexTokens: 12000, label: match.text()?.slice(0, 20) + "...", temperature: 0.2, systemSafety: false, system: ["system.technical", "system.typescript"], } ) fileStats.gen += res.usage?.total || 0 fileStats.genCost += res.usage?.cost || 0 // if generation is successful, insert the docs if (res.error) { output.warn(res.error.message) continue }
if (res.text.includes("/NOP/")) continue
const docs = docify(res.text.trim())
// ask LLM if change is worth it const judge = await classify( (_) => { _.def("FUNCTION", match.text()) _.def("ORIGINAL_DOCS", comment.text()) _.def("NEW_DOCS", docs) _.$`An LLM generated an updated docstring <NEW_DOCS> for function <FUNCTION>. The original docstring is <ORIGINAL_DOCS>.` }, { APPLY: "The <NEW_DOCS> is a significant improvement to <ORIGINAL_DOCS>.", NIT: "The <NEW_DOCS> contains nits (minor adjustments) to <ORIGINAL_DOCS>.", }, { model: "large", responseType: "text", temperature: 0.2, systemSafety: false, system: ["system.technical", "system.typescript"], } )
fileStats.judge += judge.usage?.total || 0 fileStats.judgeCost += judge.usage?.cost || 0 if (judge.label === "NIT") { output.warn("LLM suggests minor adjustments, skipping") continue } edits.replace(comment, docs) fileStats.edits++ }
// apply all edits and write to the file const modifiedFiles = edits.commit() if (!modifiedFiles?.length) { dbg("no edits to apply") return } fileStats.updated = 1 if (applyEdits) { await workspace.writeFiles(modifiedFiles) await prettier(file) } else { output.diff(file, modifiedFiles[0]) }}