Authoring Eval Suites

A suite is a named group of evals defined in .vally.yaml. Suites let you run curated subsets of your eval library — fast smoke tests in your inner loop, safety checks in CI, or expensive capability evals on a nightly schedule.

Suites use two independent scoping mechanisms:

Tag filters — select stimuli by metadata tags
File paths — select eval files by glob pattern

You can use either one or both together.

Quick start — your first suite

Add a suites section to your .vally.yaml:

suites:
  fast:
    filter:
      priority: p0

Then run it:

vally eval --suite fast

This discovers all eval files in your project, then runs only the stimuli tagged priority: p0.

Scoping by tags

Tagging stimuli

Tags are key-value pairs on stimuli (or at the eval level). Add them to any stimulus in your eval spec:

stimuli:
  - name: login-basic
    prompt: "Log in with valid credentials"
    tags:
      priority: p0
      area: auth
    graders:
      - type: output-contains
        config: { substring: "success" }

  - name: login-mfa
    prompt: "Log in with MFA enabled"
    tags:
      priority: p1
      area: auth
      cost: high
    graders:
      - type: output-contains
        config: { substring: "authenticated" }

Eval-level tags

Tags on the eval itself are inherited by all stimuli in that file. Stimulus-level tags override eval-level tags on the same key:

tags:
  area: auth
  priority: p1

stimuli:
  - name: login-basic
    tags:
      priority: p0 # overrides eval-level p1
    prompt: "Log in with valid credentials"
    graders: [{ type: output-contains, config: { substring: "success" } }]

  - name: login-edge
    # inherits area=auth and priority=p1 from eval level
    prompt: "Log in with expired token"
    graders: [{ type: output-contains, config: { substring: "expired" } }]
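
The inheritance rule can be modelled as a dictionary merge in which stimulus-level keys win. The following is a minimal Python sketch of the semantics described above, not vally's implementation; `effective_tags` is a hypothetical helper name.

```python
def effective_tags(eval_tags: dict, stimulus_tags: dict) -> dict:
    """Merge eval-level defaults with stimulus-level overrides."""
    # In a dict merge the right-hand side wins on duplicate keys,
    # so stimulus-level tags override eval-level tags.
    return {**eval_tags, **stimulus_tags}


eval_level = {"area": "auth", "priority": "p1"}

login_basic = effective_tags(eval_level, {"priority": "p0"})
login_edge = effective_tags(eval_level, {})

print(login_basic)  # {'area': 'auth', 'priority': 'p0'}
print(login_edge)   # {'area': 'auth', 'priority': 'p1'}
```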

Filter semantics

Suite filters follow two rules:

AND across keys — every key in the filter must match
OR within values — for each key, at least one value must match

suites:
  # AND: match stimuli tagged priority=p0 AND area=auth
  auth-smoke:
    filter:
      priority: p0
      area: auth

  # OR: match priority=p0 OR priority=p1
  ci-gate:
    filter:
      priority: [p0, p1]

  # Combined: (priority=p0 OR p1) AND (area=auth)
  auth-ci:
    filter:
      priority: [p0, p1]
      area: auth
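
The two rules can be expressed as a small predicate. This is an illustrative Python model of the semantics stated above, not vally's actual code; the function name `matches` is hypothetical, and treating an empty filter as "match everything" is an assumption.

```python
def matches(tag_filter: dict, tags: dict) -> bool:
    """Return True if a stimulus's tags satisfy a suite filter."""
    for key, wanted in tag_filter.items():
        # A scalar filter value is treated as a one-element list.
        allowed = wanted if isinstance(wanted, list) else [wanted]
        # OR within values: the tag must equal one of the allowed values.
        # AND across keys: the first key that fails rejects the stimulus.
        # A missing tag yields None, which matches nothing.
        if tags.get(key) not in allowed:
            return False
    return True


login_basic = {"priority": "p0", "area": "auth"}
login_mfa = {"priority": "p1", "area": "auth", "cost": "high"}

auth_smoke = {"priority": "p0", "area": "auth"}
ci_gate = {"priority": ["p0", "p1"]}

print(matches(auth_smoke, login_basic))  # True
print(matches(auth_smoke, login_mfa))    # False
print(matches(ci_gate, login_mfa))       # True
```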

Scoping by file path

Use the evals field to select eval files by path or glob pattern. Paths are resolved relative to your project root (where .vally.yaml lives).

suites:
  safety:
    evals:
      - "evals/safety/**/*.eval.yaml"
      - "evals/shared/baseline.eval.yaml"

You can also list specific files:

suites:
  regression:
    evals:
      - "evals/auth/login.eval.yaml"
      - "evals/auth/permissions.eval.yaml"
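
To see which files a set of patterns would select, you can approximate the resolution with Python's `pathlib`. This sketch assumes vally's glob semantics resemble `Path.glob` (where `**` matches zero or more directories), which the source does not confirm; `resolve_evals` and the file names under `evals/safety/` are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_evals(project_root: Path, patterns: list) -> list:
    """Expand suite `evals` patterns relative to the project root."""
    found = set()
    for pattern in patterns:
        # Path.glob accepts both literal relative paths and `**` patterns.
        for path in project_root.glob(pattern):
            if path.is_file():
                found.add(path.relative_to(project_root).as_posix())
    return sorted(found)


# Build a throwaway project tree so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in [
        "evals/safety/jailbreak.eval.yaml",
        "evals/safety/pii/leak.eval.yaml",
        "evals/shared/baseline.eval.yaml",
        "evals/auth/login.eval.yaml",
    ]:
        (root / rel).parent.mkdir(parents=True, exist_ok=True)
        (root / rel).touch()

    selected = resolve_evals(
        root,
        ["evals/safety/**/*.eval.yaml", "evals/shared/baseline.eval.yaml"],
    )

print(selected)
```

Here `evals/auth/login.eval.yaml` is left out because no pattern covers it.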

When to use file paths vs tags

Use file paths when…	Use tags when…
You organize evals by directory (e.g., `evals/safety/`)	You want cross-cutting selections across directories
You need an explicit, auditable list of files	You want suites to grow automatically as new evals are added
You want to include files outside normal discovery	You need per-stimulus granularity

Combining file paths and tags

When a suite defines both evals and filter, they act as an intersection — file paths narrow which eval files to load, and tags narrow which stimuli within those files to run.

suites:
  safety-p0:
    evals: ["evals/safety/**/*.eval.yaml"]
    filter: { priority: p0 }

This suite:

Discovers only eval files matching evals/safety/**/*.eval.yaml
Within those files, runs only stimuli tagged priority: p0

Think of it as a three-tier cascade:

Project discovery (paths.evals + paths.evalFilenames)
  └─ Suite file scoping (evals globs narrow to specific files)
      └─ Tag filtering (filter narrows to specific stimuli)
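
The cascade can be sketched in a few lines of Python. This is a simplified model that follows the diagram as drawn, not vally's implementation: `plan` and `matches` are hypothetical names, the discovered files are given as an in-memory dict, and the suite's `evals` globs are assumed to be already resolved to a set of paths.

```python
def matches(tag_filter: dict, tags: dict) -> bool:
    """AND across keys, OR within the values of each key."""
    for key, wanted in tag_filter.items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if tags.get(key) not in allowed:
            return False
    return True


def plan(discovered: dict, suite_files=None, tag_filter=None) -> list:
    """Return (file, stimulus) pairs a suite would run.

    discovered maps each discovered eval file to {stimulus name: tags}.
    """
    # Tier 2: keep only the files the suite names (None keeps all of them).
    files = [f for f in discovered if suite_files is None or f in suite_files]
    # Tier 3: within those files, keep only stimuli that pass the tag filter.
    return [
        (f, name)
        for f in files
        for name, tags in discovered[f].items()
        if matches(tag_filter or {}, tags)
    ]


# Tier 1: what project discovery found (hypothetical files and stimuli).
discovered = {
    "evals/safety/refusal.eval.yaml": {
        "refuse-harm": {"priority": "p0"},
        "refuse-edge": {"priority": "p2"},
    },
    "evals/auth/login.eval.yaml": {
        "login-basic": {"priority": "p0", "area": "auth"},
    },
}

safety_p0 = plan(discovered, {"evals/safety/refusal.eval.yaml"}, {"priority": "p0"})
print(safety_p0)  # [('evals/safety/refusal.eval.yaml', 'refuse-harm')]
```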

Tagging strategies

Common tag dimensions

Pick a small, consistent set of tag keys across your eval library:

Key	Values	Purpose
`priority`	`p0`, `p1`, `p2`	Gate CI vs nightly vs exploratory
`area`	`auth`, `search`, `perf`	Functional area under test
`type`	`capability`, `regression`	What the eval is checking
`cost`	`free`, `low`, `high`	Token cost / run time budget

Eval-level vs stimulus-level

Eval-level tags are defaults — use for tags that apply to most stimuli in a file (like area)
Stimulus-level tags are overrides — use for tags that vary within a file (like priority or cost)

tags:
  area: auth # all stimuli in this file test auth
  cost: low # most are cheap

stimuli:
  - name: basic-login
    tags: { priority: p0 }
    prompt: "..."
    graders: [...]

  - name: mfa-flow
    tags: { priority: p1, cost: high } # overrides cost=low
    prompt: "..."
    graders: [...]

Keep tags flat and consistent

Use the same key names across all eval files (priority, not sometimes prio and sometimes pri)
Keep values lowercase and concise
Avoid deeply nested or overly granular tag schemes — start simple and add keys only when you need a new suite dimension
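
One way to enforce these conventions is a small lint step over your tags. The sketch below is an assumption about how you might check them yourself, not a vally feature; `lint_tags` is a hypothetical helper, and `ALLOWED_KEYS` simply mirrors the tag dimensions listed earlier.

```python
ALLOWED_KEYS = {"priority", "area", "type", "cost"}


def lint_tags(tags: dict) -> list:
    """Return a list of convention violations for one tag mapping."""
    issues = []
    for key, value in tags.items():
        if key not in ALLOWED_KEYS:
            issues.append(f"unknown tag key: {key!r}")
        if not isinstance(value, str):
            issues.append(f"{key}: value {value!r} is not a string")
        elif value != value.lower():
            issues.append(f"{key}: value {value!r} is not lowercase")
    return issues


print(lint_tags({"priority": "p0", "area": "auth"}))  # []
print(lint_tags({"prio": "P0"}))
# ["unknown tag key: 'prio'", "prio: value 'P0' is not lowercase"]
```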

Using suites in CI

Wire suites into your CI pipeline to run the right evals at the right time. For example, in a GitHub Actions workflow:

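name: evals
# Example trigger; adjust to match your pipeline stages.
on:
  pull_request:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Assumes the vally CLI is already installed on the runner.
      - name: Run safety suite
        run: vally eval --suite safety

      - name: Run fast suite
        run: vally eval --suite fast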

Suggested CI layout

Pipeline stage	Suite	What it runs
PR check	`fast`	`priority: p0` — must pass before merge
Merge gate	`ci-gate`	`priority: [p0, p1]` — broader coverage
Nightly	`full`	All evals (empty filter or no suite)
Safety review	`safety`	`evals: ["evals/safety/**"]` — dedicated safety evals

Named environments in suites

Suites work alongside named environments defined in .vally.yaml. A named environment configures shared skills, files, and setup commands that eval specs can reference. Running a suite does not change this: each selected eval spec still applies the environment configuration it references.

See the eval spec guide for details on how environments work.

Debugging suites

No stimuli matched?

If vally eval --suite <name> runs zero stimuli, check these common causes:

Typo in tag key or value — filter: { priortiy: p0 } won’t match priority: p0
Missing tags on stimuli — a filter key that doesn’t exist on any stimulus matches nothing
File path glob doesn’t match — double-check paths are relative to the project root
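
The first two causes can be checked mechanically if you have the effective tags of your stimuli at hand. The helper below is a hypothetical diagnostic sketch, not part of vally; it only inspects each filter key on its own, so it will not catch a filter whose keys each match some stimulus but never the same one.

```python
def diagnose(tag_filter: dict, stimuli_tags: list) -> list:
    """Explain, key by key, why a filter may match no stimulus."""
    problems = []
    seen_keys = {key for tags in stimuli_tags for key in tags}
    for key, wanted in tag_filter.items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if key not in seen_keys:
            # Typical of a typo in the key, or of stimuli that were never tagged.
            problems.append(f"filter key {key!r} is not set on any stimulus")
            continue
        seen_values = {tags[key] for tags in stimuli_tags if key in tags}
        if not seen_values & set(allowed):
            found = sorted(seen_values, key=str)
            problems.append(f"no stimulus has {key} in {allowed!r}; found {found!r}")
    return problems


stimuli = [{"priority": "p0", "area": "auth"}, {"priority": "p1", "area": "auth"}]

print(diagnose({"priortiy": "p0"}, stimuli))
# ["filter key 'priortiy' is not set on any stimulus"]
print(diagnose({"priority": "p3"}, stimuli))
# ["no stimulus has priority in ['p3']; found ['p0', 'p1']"]
```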

Verify your filter logic

Remember the AND/OR semantics: adding more keys to a filter makes it stricter (AND), while adding more values to a single key makes it broader (OR).

suites:
  # Strict: must be p0 AND auth AND free (very narrow)
  strict:
    filter:
      priority: p0
      area: auth
      cost: free

  # Broad: p0 OR p1 OR p2 (very wide)
  broad:
    filter:
      priority: [p0, p1, p2]

Quick checklist

✅ Suite is defined under suites: in .vally.yaml
✅ Tag keys in filter match tag keys in your eval specs exactly
✅ Tag values are strings (not numbers — use "1" not 1)
✅ evals glob patterns use forward slashes and are relative to project root
✅ At least one of filter or evals is present in the suite definition

Next steps

Writing eval specs — how to write the eval files that suites select from
eval.yaml schema reference — complete field specification
Debugging evals — when evals fail unexpectedly