Tests / Evals
It is possible to define tests/evals for your LLM scripts, to evaluate the quality of the LLM output over time and across model types.
The tests are executed by promptfoo, a tool for evaluating LLM output quality.
You can also find AI vulnerabilities, such as bias, toxicity, and factuality issues, using the redteam feature.
Defining tests
The tests are declared in the script function of your test script.
You may define one or many tests (array).
```js
script({
    ...,
    tests: [
        {
            files: "src/rag/testcode.ts",
            rubrics: "is a report with a list of issues",
            facts: `The report says that the input string should be validated before use.`,
        },
        { ... },
    ],
})
```
Test models
You can specify a list of models (or model aliases) to test against.
```js
script({
    ...,
    testModels: ["ollama:phi3", "ollama:gpt-4o"],
})
```
The eval engine (PromptFoo) will run each test against each model in the list.
This setting can be overridden by the command line --models option.
External test files
You can also specify the filenames of external test files in JSON, YAML, or CSV format; .mjs and .mts JavaScript files are executed to generate the tests.
script({ ..., tests: ["tests.json", "more-tests.csv", "tests.mjs"],})The JSON and YAML files assume that files to be a list of PromptTest objects and you can validate these files
using the JSON schema at https://microsoft.github.io/genaiscript/schemas/tests.json.
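For instance, a minimal tests.json holding a single PromptTest object might look like this (the field values are illustrative, reusing the properties shown above):
```json
[
    {
        "files": "src/rag/testcode.ts",
        "rubrics": "is a report with a list of issues",
        "facts": "The report says that the input string should be validated before use."
    }
]
```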
The CSV files assume that the first row is the header and the columns correspond to properties of the PromptTest object.
The file column should contain a filename; the fileContent column holds the content of a virtual file.
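A sketch of a CSV row using the file column (the path is illustrative):
```csv
file,rubrics
src/rag/testcode.ts,is a report with a list of issues
```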
content,rubrics,facts"const x = 1;",is a report with a list of issues,The report says that the input string should be validated before use.The JavaScript files should export a list of PromptTest objects or a function that generates the list of PromptTest objects.
```js
export default [
    {
        content: "const x = 1;",
        rubrics: "is a report with a list of issues",
        facts: "The report says that the input string should be validated before use.",
    },
]
```
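A sketch of the function form, assuming a plain synchronous function is accepted (the snippets are illustrative):
```js
// Generate one PromptTest object per code snippet
export default function () {
    const snippets = ["const x = 1;", "let y = 2;"]
    return snippets.map((content) => ({
        content,
        rubrics: "is a report with a list of issues",
    }))
}
```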
files
files takes a list of file paths (relative to the workspace) and populates the env.files
variable while running the test. You can provide multiple files by passing an array of strings.
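For example, a sketch spanning several files (the second path is illustrative):
```js
script({
    tests: {
        files: ["src/rag/testcode.ts", "src/rag/loader.ts"],
        rubrics: "is a report with a list of issues",
    },
})
```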
```js
script({
    tests: {
        files: "src/rag/testcode.ts",
        ...
    },
})
```
rubrics
rubrics checks if the LLM output matches given requirements,
using a language model to grade the output based on the rubric (see llm-rubric).
You can specify multiple rubrics by passing an array of strings.
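For instance, a sketch with two rubrics (the second rubric is illustrative):
```js
script({
    tests: {
        rubrics: [
            "is a report with a list of issues",
            "recommends a fix for each issue",
        ],
    },
})
```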
```js
script({
    tests: {
        rubrics: "is a report with a list of issues",
        ...,
    },
})
```
facts
facts checks factual consistency (see factuality).
You can specify multiple facts by passing an array of strings.
Given a completion A and reference answer B, it evaluates whether A is a subset of B, A is a superset of B, A and B are equivalent, A and B disagree, or A and B differ but the differences don't matter from the perspective of factuality.
```js
script({
    tests: {
        facts: `The report says that the input string should be validated before use.`,
        ...,
    },
})
```
asserts
Other assertions are documented in the promptfoo assertions and metrics reference.
- icontains (not-icontains): output contains substring, case insensitive
- equals (not-equals): output equals string
- starts-with (not-starts-with): output starts with string
```js
script({
    tests: {
        facts: `The report says that the input string should be validated before use.`,
        asserts: [
            {
                type: "icontains",
                value: "issue",
            },
        ],
    },
})
```
- contains-all (not-contains-all): output contains all substrings
- contains-any (not-contains-any): output contains any substring
- icontains-all (not-icontains-all): output contains all substrings, case insensitive
```js
script({
    tests: {
        ...,
        asserts: [
            {
                type: "icontains-all",
                value: ["issue", "fix"],
            },
        ],
    },
})
```
transform
By default, GenAIScript extracts the text field from the output before sending it to PromptFoo.
You can disable this mode by setting format: "json"; the asserts are then executed on the raw LLM output.
You can use a JavaScript expression to select the part of the output to test.
script({ tests: { files: "src/will-trigger.cancel.txt", format: "json", asserts: { type: "equals", value: "cancelled", transform: "output.status", }, },})Running tests
You can run tests from Visual Studio Code or from the command line. In both cases, genaiscript generates a promptfoo configuration file and executes promptfoo on it.
Visual Studio Code
- Open the script to test
- Right-click in the editor and select Run GenAIScript Tests from the context menu
- The promptfoo web view will automatically open and refresh with the test results.
Command line
Run the test command with the script file as argument.
```sh
npx genaiscript test <scriptid>
```
You can specify additional models to test against by passing the --models option.
```sh
npx genaiscript test <scriptid> --models "ollama:phi3"
```