Authoring Eval Suites
A suite is a named group of evals defined in .vally.yaml. Suites let you run curated subsets of your eval library — fast smoke tests in your inner loop, safety checks in CI, or expensive capability evals on a nightly schedule.
Suites use two independent scoping mechanisms:
- Tag filters — select stimuli by metadata tags
- File paths — select eval files by glob pattern
You can use either one or both together.
Quick start — your first suite
Add a suites section to your .vally.yaml:

    suites:
      fast:
        filter:
          priority: p0

Then run it:

    vally eval --suite fast

This discovers all eval files in your project, then runs only the stimuli tagged priority: p0.
Scoping by tags
Tagging stimuli
Tags are key-value pairs on stimuli (or at the eval level). Add them to any stimulus in your eval spec:

    stimuli:
      - name: login-basic
        prompt: "Log in with valid credentials"
        tags:
          priority: p0
          area: auth
        graders:
          - type: output-contains
            config: { substring: "success" }

      - name: login-mfa
        prompt: "Log in with MFA enabled"
        tags:
          priority: p1
          area: auth
          cost: high
        graders:
          - type: output-contains
            config: { substring: "authenticated" }

Eval-level tags
Tags on the eval itself are inherited by all stimuli in that file. Stimulus-level tags override eval-level tags on the same key:

    tags:
      area: auth
      priority: p1

    stimuli:
      - name: login-basic
        tags:
          priority: p0 # overrides eval-level p1
        prompt: "Log in with valid credentials"
        graders: [{ type: output-contains, config: { substring: "success" } }]

      - name: login-edge
        # inherits area=auth and priority=p1 from eval level
        prompt: "Log in with expired token"
        graders: [{ type: output-contains, config: { substring: "expired" } }]

Filter semantics
Suite filters follow two rules:
- AND across keys — every key in the filter must match
- OR within values — for each key, at least one value must match
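The two rules can be read as a small predicate. The Python below is only an illustrative sketch, not vally's implementation: the function name `matches_filter` is hypothetical, and it assumes that a scalar filter value behaves like a one-element list and that an empty filter matches every stimulus.

```python
def matches_filter(tags, suite_filter):
    """Return True when a stimulus's tags satisfy a suite filter (sketch)."""
    for key, wanted in suite_filter.items():
        # Assumption: a scalar filter value is treated as a one-element list.
        allowed = wanted if isinstance(wanted, list) else [wanted]
        # OR within values: any listed value may match.
        # AND across keys: the first key that fails rejects the stimulus.
        # A key missing from the tags yields None, which never matches.
        if tags.get(key) not in allowed:
            return False
    return True


# Tags taken from the login-basic and login-mfa stimuli shown earlier.
login_basic = {"priority": "p0", "area": "auth"}
login_mfa = {"priority": "p1", "area": "auth", "cost": "high"}

print(matches_filter(login_basic, {"priority": "p0", "area": "auth"}))        # True
print(matches_filter(login_mfa, {"priority": "p0", "area": "auth"}))          # False
print(matches_filter(login_mfa, {"priority": ["p0", "p1"], "area": "auth"}))  # True
```

In this sketch a stimulus that lacks a filter key never matches, which mirrors the behavior described under "No stimuli matched?".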
    suites:
      # AND: match stimuli tagged priority=p0 AND area=auth
      auth-smoke:
        filter:
          priority: p0
          area: auth

      # OR: match priority=p0 OR priority=p1
      ci-gate:
        filter:
          priority: [p0, p1]

      # Combined: (priority=p0 OR p1) AND (area=auth)
      auth-ci:
        filter:
          priority: [p0, p1]
          area: auth

Scoping by file path
Use the evals field to select eval files by path or glob pattern. Paths are resolved relative to your project root (where .vally.yaml lives).

    suites:
      safety:
        evals:
          - "evals/safety/**/*.eval.yaml"
          - "evals/shared/baseline.eval.yaml"

You can also list specific files:

    suites:
      regression:
        evals:
          - "evals/auth/login.eval.yaml"
          - "evals/auth/permissions.eval.yaml"

When to use file paths vs tags
| Use file paths when… | Use tags when… |
|---|---|
| You organize evals by directory (e.g., evals/safety/) | You want cross-cutting selections across directories |
| You need an explicit, auditable list of files | You want suites to grow automatically as new evals are added |
| You want to include files outside normal discovery | You need per-stimulus granularity |
Combining file paths and tags
When a suite defines both evals and filter, they act as an intersection: file paths narrow which eval files to load, and tags narrow which stimuli within those files to run.

    suites:
      safety-p0:
        evals: ["evals/safety/**/*.eval.yaml"]
        filter: { priority: p0 }

This suite:
- Discovers only eval files matching evals/safety/**/*.eval.yaml
- Within those files, runs only stimuli tagged priority: p0
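The intersection can be sketched in Python as two nested narrowing steps. This is an illustration under stated assumptions, not vally's implementation. The helpers `glob_to_regex` and `select` are hypothetical, and so are the safety file and stimulus names. The glob translation assumes that `**/` spans zero or more directories and that `*` stays within one path segment. The candidate files are passed in as a plain dictionary instead of being discovered on disk.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob into a regex (assumed semantics, see lead-in)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")   # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")         # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")      # anything within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def select(candidates, evals=None, tag_filter=None):
    """candidates maps a file path to {stimulus name: tags}."""
    regexes = [glob_to_regex(p) for p in (evals or [])]
    selected = []
    for path, stimuli in candidates.items():
        # Step 1: file scoping. With no evals globs, every file passes.
        if regexes and not any(r.match(path) for r in regexes):
            continue
        for name, tags in stimuli.items():
            # Step 2: tag filtering (AND across keys, OR within values).
            if all(
                tags.get(key) in (value if isinstance(value, list) else [value])
                for key, value in (tag_filter or {}).items()
            ):
                selected.append((path, name))
    return selected


candidates = {
    "evals/safety/jailbreak.eval.yaml": {
        "refuses-exploit": {"priority": "p0"},
        "refuses-edge-case": {"priority": "p2"},
    },
    "evals/safety/pii/leak.eval.yaml": {
        "redacts-email": {"priority": "p0"},
    },
    "evals/auth/login.eval.yaml": {
        "login-basic": {"priority": "p0", "area": "auth"},
    },
}

result = select(
    candidates,
    evals=["evals/safety/**/*.eval.yaml"],
    tag_filter={"priority": "p0"},
)
for path, name in result:
    print(path, name)
```

Here login-basic is tagged p0 but is excluded because its file falls outside the glob, and refuses-edge-case sits in a matching file but is excluded by the tag filter.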
Think of it as a three-tier cascade:
    Project discovery (paths.evals + paths.evalFilenames)
      └─ Suite file scoping (evals globs narrow to specific files)
           └─ Tag filtering (filter narrows to specific stimuli)

Tagging strategies
Common tag dimensions
Pick a small, consistent set of tag keys across your eval library:
| Key | Values | Purpose |
|---|---|---|
| priority | p0, p1, p2 | Gate CI vs nightly vs exploratory |
| area | auth, search, perf | Functional area under test |
| type | capability, regression | What the eval is checking |
| cost | free, low, high | Token cost / run time budget |
Eval-level vs stimulus-level
- Eval-level tags are defaults: use them for tags that apply to most stimuli in a file (like area)
- Stimulus-level tags are overrides: use them for tags that vary within a file (like priority or cost)
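The default-and-override relationship behaves like a dictionary merge in which stimulus-level keys win. The Python below is a minimal sketch of that idea. The function name `effective_tags` is hypothetical, and whether vally merges tags in exactly this way is an assumption based on the inheritance rule described earlier.

```python
def effective_tags(eval_tags, stimulus_tags):
    """Merge eval-level defaults with stimulus-level overrides (sketch)."""
    # Later entries win, so stimulus-level values replace eval-level ones
    # that share a key; all other eval-level tags are inherited unchanged.
    return {**eval_tags, **stimulus_tags}


eval_tags = {"area": "auth", "cost": "low"}

print(effective_tags(eval_tags, {"priority": "p0"}))
# {'area': 'auth', 'cost': 'low', 'priority': 'p0'}

print(effective_tags(eval_tags, {"priority": "p1", "cost": "high"}))
# {'area': 'auth', 'cost': 'high', 'priority': 'p1'}
```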
    tags:
      area: auth # all stimuli in this file test auth
      cost: low # most are cheap

    stimuli:
      - name: basic-login
        tags: { priority: p0 }
        prompt: "..."
        graders: [...]

      - name: mfa-flow
        tags: { priority: p1, cost: high } # overrides cost=low
        prompt: "..."
        graders: [...]

Keep tags flat and consistent
- Use the same key names across all eval files (priority, not sometimes prio and sometimes pri)
- Keep values lowercase and concise
- Avoid deeply nested or overly granular tag schemes — start simple and add keys only when you need a new suite dimension
Using suites in CI
Wire suites into your CI pipeline to run the right evals at the right time:

    jobs:
      eval:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4

          - name: Run safety suite
            run: vally eval --suite safety

          - name: Run fast suite
            run: vally eval --suite fast

Suggested CI layout
| Pipeline stage | Suite | What it runs |
|---|---|---|
| PR check | fast | priority: p0 — must pass before merge |
| Merge gate | ci-gate | priority: [p0, p1] — broader coverage |
| Nightly | full | All evals (empty filter or no suite) |
| Safety review | safety | evals: ["evals/safety/**"] — dedicated safety evals |
Named environments in suites
Suites can be combined with named environments defined in .vally.yaml. Named environments configure shared skills, files, and setup commands that eval specs can reference. When you run a suite, the environment configuration from your eval specs is applied as usual.
See the eval spec guide for details on how environments work.
Debugging suites
No stimuli matched?
If vally eval --suite <name> runs zero stimuli, check these common causes:
- Typo in tag key or value: filter: { priortiy: p0 } won’t match priority: p0
- Missing tags on stimuli: a filter key that doesn’t exist on any stimulus matches nothing
- File path glob doesn’t match: double-check that paths are relative to the project root
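The first two causes can be checked mechanically once you have the tags of your stimuli in hand. The Python below is a hypothetical helper, not a vally feature: `diagnose_filter` and its messages are invented for illustration, and it assumes the effective tags are available as a list of dictionaries.

```python
def diagnose_filter(suite_filter, all_stimulus_tags):
    """Return reasons why individual filter keys can never match (sketch)."""
    problems = []
    for key, wanted in suite_filter.items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        # Collect every value this key takes across all stimuli.
        seen = {tags[key] for tags in all_stimulus_tags if key in tags}
        if not seen:
            problems.append(f"no stimulus has tag key '{key}'")
        elif not seen & set(allowed):
            problems.append(
                f"key '{key}': none of {allowed} found; values in use: {sorted(seen)}"
            )
    return problems


stimuli = [
    {"priority": "p0", "area": "auth"},
    {"priority": "p1", "area": "auth", "cost": "high"},
]

print(diagnose_filter({"priortiy": "p0"}, stimuli))
# ["no stimulus has tag key 'priortiy'"]

print(diagnose_filter({"priority": "p2"}, stimuli))
# ["key 'priority': none of ['p2'] found; values in use: ['p0', 'p1']"]
```

The check looks at each key on its own, so an empty result does not guarantee a match: a filter can still select nothing when no single stimulus satisfies all keys at once.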
Verify your filter logic
Remember the AND/OR semantics: adding more keys to a filter makes it stricter (AND), while adding more values to a single key makes it broader (OR).

    # Strict: must be p0 AND auth AND free (very narrow)
    strict:
      filter:
        priority: p0
        area: auth
        cost: free

    # Broad: p0 OR p1 OR p2 (very wide)
    broad:
      filter:
        priority: [p0, p1, p2]

Quick checklist
- ✅ Suite is defined under suites: in .vally.yaml
- ✅ Tag keys in filter match tag keys in your eval specs exactly
- ✅ Tag values are strings (not numbers: use "1", not 1)
- ✅ evals glob patterns use forward slashes and are relative to project root
- ✅ At least one of filter or evals is present in the suite definition
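Several checklist items can be turned into a small lint over an already-parsed suite definition. The Python below is a sketch under assumptions: `check_suite` is a hypothetical helper, the suite is assumed to be a plain dictionary with optional filter and evals keys as in the examples above, and the tag key tier is invented for the demonstration.

```python
def check_suite(name, suite):
    """Return a list of checklist violations for one suite (sketch)."""
    problems = []
    if "filter" not in suite and "evals" not in suite:
        problems.append(f"{name}: define at least one of 'filter' or 'evals'")
    for key, value in suite.get("filter", {}).items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            # An unquoted 1 in YAML parses as a number, not the string "1".
            if not isinstance(item, str):
                problems.append(
                    f"{name}: filter value {item!r} for '{key}' is not a string"
                )
    for pattern in suite.get("evals", []):
        if "\\" in pattern:
            problems.append(f"{name}: use forward slashes in '{pattern}'")
        if pattern.startswith("/"):
            problems.append(f"{name}: '{pattern}' should be relative to the project root")
    return problems


print(check_suite("fast", {"filter": {"priority": "p0"}}))
# []

print(check_suite("empty", {}))
# ["empty: define at least one of 'filter' or 'evals'"]

for problem in check_suite("bad", {"filter": {"tier": 1}, "evals": ["/evals/safety/**"]}):
    print(problem)
```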
Next steps
- Writing eval specs: how to write the eval files that suites select from
- eval.yaml schema reference: complete field specification
- Debugging evals: when evals fail unexpectedly