Authoring Eval Suites

A suite is a named group of evals defined in .vally.yaml. Suites let you run curated subsets of your eval library — fast smoke tests in your inner loop, safety checks in CI, or expensive capability evals on a nightly schedule.

Suites use two independent scoping mechanisms:

  1. Tag filters — select stimuli by metadata tags
  2. File paths — select eval files by glob pattern

You can use either one or both together.

Add a suites section to your .vally.yaml:

.vally.yaml
suites:
  fast:
    filter:
      priority: p0

Then run it:

vally eval --suite fast

This discovers all eval files in your project, then runs only the stimuli tagged priority: p0.

Tags are key-value pairs on stimuli (or at the eval level). Add them to any stimulus in your eval spec:

eval.yaml
stimuli:
  - name: login-basic
    prompt: "Log in with valid credentials"
    tags:
      priority: p0
      area: auth
    graders:
      - type: output-contains
        config: { substring: "success" }
  - name: login-mfa
    prompt: "Log in with MFA enabled"
    tags:
      priority: p1
      area: auth
      cost: high
    graders:
      - type: output-contains
        config: { substring: "authenticated" }

Tags on the eval itself are inherited by all stimuli in that file. Stimulus-level tags override eval-level tags on the same key:

eval.yaml
tags:
  area: auth
  priority: p1
stimuli:
  - name: login-basic
    tags:
      priority: p0 # overrides eval-level p1
    prompt: "Log in with valid credentials"
    graders: [{ type: output-contains, config: { substring: "success" } }]
  - name: login-edge
    # inherits area=auth and priority=p1 from eval level
    prompt: "Log in with expired token"
    graders: [{ type: output-contains, config: { substring: "expired" } }]
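
The inheritance rule above amounts to a dictionary merge in which stimulus-level keys win. The following is a minimal Python sketch of that rule, not vally's actual implementation; the function name effective_tags is hypothetical.

```python
def effective_tags(eval_tags, stimulus_tags):
    """Merge eval-level defaults with stimulus-level overrides (illustrative only)."""
    merged = dict(eval_tags)      # start from the eval-level defaults
    merged.update(stimulus_tags)  # stimulus-level keys win on conflict
    return merged


# Tags taken from the eval.yaml example above.
eval_tags = {"area": "auth", "priority": "p1"}

login_basic = effective_tags(eval_tags, {"priority": "p0"})
login_edge = effective_tags(eval_tags, {})

print(login_basic)  # {'area': 'auth', 'priority': 'p0'}
print(login_edge)   # {'area': 'auth', 'priority': 'p1'}
```

Under this model, a suite filter is evaluated against the merged tags, so login-edge would be selected by priority: p1 even though it declares no tags of its own.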

Suite filters follow two rules:

  • AND across keys: every key in the filter must match
  • OR within values: for each key, at least one value must match

.vally.yaml
suites:
  # AND: match stimuli tagged priority=p0 AND area=auth
  auth-smoke:
    filter:
      priority: p0
      area: auth
  # OR: match priority=p0 OR priority=p1
  ci-gate:
    filter:
      priority: [p0, p1]
  # Combined: (priority=p0 OR p1) AND (area=auth)
  auth-ci:
    filter:
      priority: [p0, p1]
      area: auth
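
To make the two rules concrete, here is a small Python sketch of how such a filter could be evaluated against a stimulus's tags. It is an illustration of the semantics described above, not vally's source code; the function name matches and the in-memory data layout are assumptions.

```python
def matches(filter_spec, tags):
    """AND across keys, OR within values (illustrative model of a suite filter)."""
    for key, wanted in filter_spec.items():
        # A scalar value is treated as a one-element list.
        allowed = wanted if isinstance(wanted, list) else [wanted]
        # A missing tag key can never match, so the whole filter fails.
        if tags.get(key) not in allowed:
            return False
    return True


# The two stimuli from the earlier eval.yaml example.
stimuli = [
    {"name": "login-basic", "tags": {"priority": "p0", "area": "auth"}},
    {"name": "login-mfa", "tags": {"priority": "p1", "area": "auth", "cost": "high"}},
]

suites = {
    "auth-smoke": {"priority": "p0", "area": "auth"},
    "ci-gate": {"priority": ["p0", "p1"]},
    "auth-ci": {"priority": ["p0", "p1"], "area": "auth"},
}

selected = {
    suite: [s["name"] for s in stimuli if matches(spec, s["tags"])]
    for suite, spec in suites.items()
}
print(selected["auth-smoke"])  # ['login-basic']
print(selected["ci-gate"])     # ['login-basic', 'login-mfa']
print(selected["auth-ci"])     # ['login-basic', 'login-mfa']
```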

Use the evals field to select eval files by path or glob pattern. Paths are resolved relative to your project root (where .vally.yaml lives).

.vally.yaml
suites:
  safety:
    evals:
      - "evals/safety/**/*.eval.yaml"
      - "evals/shared/baseline.eval.yaml"

You can also list specific files:

.vally.yaml
suites:
  regression:
    evals:
      - "evals/auth/login.eval.yaml"
      - "evals/auth/permissions.eval.yaml"

| Use file paths when… | Use tags when… |
| --- | --- |
| You organize evals by directory (e.g., evals/safety/) | You want cross-cutting selections across directories |
| You need an explicit, auditable list of files | You want suites to grow automatically as new evals are added |
| You want to include files outside normal discovery | You need per-stimulus granularity |
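
As a rough illustration of how evals patterns resolve relative to the project root, the sketch below expands the safety suite's patterns with Python's pathlib. This is only a model: vally's own glob engine may differ in details, the function name resolve_evals is hypothetical, and the files jailbreak.eval.yaml and pii/leak.eval.yaml are made up for the example.

```python
import tempfile
from pathlib import Path


def resolve_evals(project_root, patterns):
    """Expand each pattern relative to project_root; return sorted relative paths."""
    root = Path(project_root)
    found = set()
    for pattern in patterns:
        for path in root.glob(pattern):  # "**" also matches zero directories
            if path.is_file():
                found.add(path.relative_to(root).as_posix())
    return sorted(found)


# Build a throwaway project tree so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in [
        "evals/safety/jailbreak.eval.yaml",   # hypothetical file
        "evals/safety/pii/leak.eval.yaml",    # hypothetical file
        "evals/shared/baseline.eval.yaml",
        "evals/auth/login.eval.yaml",
    ]:
        file = root / rel
        file.parent.mkdir(parents=True, exist_ok=True)
        file.touch()

    selected = resolve_evals(root, [
        "evals/safety/**/*.eval.yaml",
        "evals/shared/baseline.eval.yaml",
    ])

for rel in selected:
    print(rel)
```

In this model the auth file is left out because no pattern names it, while the nested pii file is picked up by the recursive ** segment.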

When a suite defines both evals and filter, they act as an intersection — file paths narrow which eval files to load, and tags narrow which stimuli within those files to run.

.vally.yaml
suites:
  safety-p0:
    evals: ["evals/safety/**/*.eval.yaml"]
    filter: { priority: p0 }

This suite:

  1. Discovers only eval files matching evals/safety/**/*.eval.yaml
  2. Within those files, runs only stimuli tagged priority: p0

Think of it as a three-tier cascade:

Project discovery (paths.evals + paths.evalFilenames)
└─ Suite file scoping (evals globs narrow to specific files)
   └─ Tag filtering (filter narrows to specific stimuli)
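
The cascade can be sketched as two nested narrowing steps over the discovered files. The Python below is a simplified model rather than vally's implementation: it assumes globs have already been expanded into a set of file paths, and the names select, discovered, refusal.eval.yaml, refuse-harmful, and refuse-edge are hypothetical.

```python
def matches(filter_spec, tags):
    """AND across keys, OR within values."""
    for key, wanted in filter_spec.items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if tags.get(key) not in allowed:
            return False
    return True


def select(discovered, suite_files=None, filter_spec=None):
    """Tier 1: discovered files. Tier 2: suite file scoping. Tier 3: tag filtering."""
    selected = []
    for path, stimuli in discovered.items():
        # Tier 2: skip files the suite's evals patterns did not select.
        if suite_files is not None and path not in suite_files:
            continue
        # Tier 3: keep only stimuli whose tags satisfy the filter.
        for stimulus in stimuli:
            if matches(filter_spec or {}, stimulus["tags"]):
                selected.append((path, stimulus["name"]))
    return selected


# Tier 1 result: every eval file found by project discovery (hypothetical data).
discovered = {
    "evals/safety/refusal.eval.yaml": [
        {"name": "refuse-harmful", "tags": {"priority": "p0"}},
        {"name": "refuse-edge", "tags": {"priority": "p2"}},
    ],
    "evals/auth/login.eval.yaml": [
        {"name": "login-basic", "tags": {"priority": "p0"}},
    ],
}

# Assumed expansion of evals/safety/**/*.eval.yaml.
safety_files = {"evals/safety/refusal.eval.yaml"}

result = select(discovered, safety_files, {"priority": "p0"})
print(result)  # [('evals/safety/refusal.eval.yaml', 'refuse-harmful')]
```

Note that login-basic is tagged p0 but is still excluded, because its file falls outside the suite's file scope.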

Pick a small, consistent set of tag keys across your eval library:

| Key | Values | Purpose |
| --- | --- | --- |
| priority | p0, p1, p2 | Gate CI vs nightly vs exploratory |
| area | auth, search, perf | Functional area under test |
| type | capability, regression | What the eval is checking |
| cost | free, low, high | Token cost / run time budget |

  • Eval-level tags are defaults: use them for tags that apply to most stimuli in a file (like area)
  • Stimulus-level tags are overrides: use them for tags that vary within a file (like priority or cost)

eval.yaml
tags:
  area: auth # all stimuli in this file test auth
  cost: low # most are cheap
stimuli:
  - name: basic-login
    tags: { priority: p0 }
    prompt: "..."
    graders: [...]
  - name: mfa-flow
    tags: { priority: p1, cost: high } # overrides cost=low
    prompt: "..."
    graders: [...]

  • Use the same key names across all eval files (priority, not sometimes prio and sometimes pri)
  • Keep values lowercase and concise
  • Avoid deeply nested or overly granular tag schemes; start simple and add keys only when you need a new suite dimension

Wire suites into your CI pipeline to run the right evals at the right time:

.github/workflows/eval.yml
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run safety suite
        run: vally eval --suite safety
      - name: Run fast suite
        run: vally eval --suite fast

| Pipeline stage | Suite | What it runs |
| --- | --- | --- |
| PR check | fast | priority: p0 (must pass before merge) |
| Merge gate | ci-gate | priority: [p0, p1] (broader coverage) |
| Nightly | full | All evals (empty filter or no suite) |
| Safety review | safety | evals: ["evals/safety/**"] (dedicated safety evals) |

Suites can be combined with named environments defined in .vally.yaml. Named environments configure shared skills, files, and setup commands that eval specs can reference. When you run a suite, the environment configuration from your eval specs is applied as usual.

See the eval spec guide for details on how environments work.

If vally eval --suite <name> runs zero stimuli, check these common causes:

  1. Typo in tag key or value. For example, filter: { priortiy: p0 } won’t match priority: p0
  2. Missing tags on stimuli — a filter key that doesn’t exist on any stimulus matches nothing
  3. File path glob doesn’t match — double-check paths are relative to the project root
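
The first two causes can be checked mechanically by comparing the filter against the tags that actually appear on your stimuli. The sketch below is a hypothetical helper, not a vally feature; it assumes you have already collected each stimulus's effective tags into a list of dictionaries, and it uses difflib from the Python standard library to suggest likely typos.

```python
import difflib


def diagnose(filter_spec, all_stimulus_tags):
    """Return hints explaining why a filter might match zero stimuli."""
    hints = []
    known_keys = sorted({key for tags in all_stimulus_tags for key in tags})
    for key, wanted in filter_spec.items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if key not in known_keys:
            # Unknown key: look for a near miss such as a transposed letter.
            close = difflib.get_close_matches(key, known_keys, n=1)
            suggestion = f" (did you mean '{close[0]}'?)" if close else ""
            hints.append(f"no stimulus has tag key '{key}'{suggestion}")
            continue
        seen = {tags[key] for tags in all_stimulus_tags if key in tags}
        if not seen.intersection(allowed):
            found = sorted(str(value) for value in seen)
            hints.append(f"no stimulus has {key} in {allowed}; found {found}")
    return hints


# Effective tags of the two stimuli from the earlier eval.yaml example.
all_tags = [
    {"priority": "p0", "area": "auth"},
    {"priority": "p1", "area": "auth", "cost": "high"},
]

typo_hints = diagnose({"priortiy": "p0"}, all_tags)
value_hints = diagnose({"priority": "p3"}, all_tags)
print(typo_hints[0])   # no stimulus has tag key 'priortiy' (did you mean 'priority'?)
print(value_hints[0])  # no stimulus has priority in ['p3']; found ['p0', 'p1']
```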

Remember the AND/OR semantics: adding more keys to a filter makes it stricter (AND), while adding more values to a single key makes it broader (OR).

# Strict: must be p0 AND auth AND free (very narrow)
strict:
  filter:
    priority: p0
    area: auth
    cost: free
# Broad: p0 OR p1 OR p2 (very wide)
broad:
  filter:
    priority: [p0, p1, p2]

  • ✅ Suite is defined under suites: in .vally.yaml
  • ✅ Tag keys in filter match tag keys in your eval specs exactly
  • ✅ Tag values are strings (not numbers — use "1" not 1)
  • ✅ evals glob patterns use forward slashes and are relative to project root
  • ✅ At least one of filter or evals is present in the suite definition
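
Several checklist items can be verified automatically once the suites section has been parsed into plain dictionaries. The following Python sketch shows one way to do that; check_suite is a hypothetical helper rather than part of vally, and it covers only the string-value, glob-style, and filter-or-evals items. The string check matters because a YAML parser reads an unquoted 1 as an integer, not as the string "1".

```python
def check_suite(name, suite):
    """Return a list of checklist violations for one parsed suite definition."""
    problems = []
    if "filter" not in suite and "evals" not in suite:
        problems.append(f"{name}: needs at least one of 'filter' or 'evals'")
    for key, wanted in suite.get("filter", {}).items():
        values = wanted if isinstance(wanted, list) else [wanted]
        for value in values:
            if not isinstance(value, str):
                problems.append(
                    f"{name}: filter value {value!r} for '{key}' is not a string"
                )
    for pattern in suite.get("evals", []):
        if "\\" in pattern:
            problems.append(f"{name}: glob {pattern!r} uses backslashes")
        if pattern.startswith("/"):
            problems.append(f"{name}: glob {pattern!r} is not relative to project root")
    return problems


# Parsed suite definitions (the last three are deliberately broken examples).
suites = {
    "fast": {"filter": {"priority": "p0"}},
    "levels": {"filter": {"level": 1}},
    "empty": {},
    "windows": {"evals": ["evals\\safety\\*.eval.yaml"]},
}

report = {name: check_suite(name, suite) for name, suite in suites.items()}
for name, problems in report.items():
    print(name, "OK" if not problems else problems)
```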