CLI: grade

Usage

vally eval --output jsonl | vally grade --eval-spec eval.yaml [options]
vally grade --eval-spec eval.yaml [options] < outcomes.jsonl

Description

Grades trajectories read from stdin. Accepted inputs are trial-result JSONL from vally eval --output jsonl, legacy EvalOutcome JSONL, a single EvalOutcome JSON object from eval, an ATIF document, or a session run reference from an agent such as Copilot CLI. The input format is auto-detected from the JSON shape. Because grading works on saved trajectories, you can iterate on graders without re-running the (expensive) agent execution.

For each trajectory, the command matches its stimulus to the graders defined in the eval spec.

Options

Flag	Type	Required	Description
`--eval-spec, -e <path>`	string	Yes	Path to eval spec file
`--stimulus <name>`	string	No	Stimulus name to grade against (ATIF or session run input only; required when the prompt matches multiple stimuli)
`--workspace <path>`	string	No	Path to the agent workspace directory for file-based graders (ATIF or session run input only; defaults to the directory in the session run, or process.cwd() for ATIF)
`--run-dir <path>`	string	No	Directory of a `vally eval --output-dir` run. Resolves `trajectory.diffPath` references in JSONL input so diff graders can re-grade from sidecar files
`--output jsonl`	string	No	Emit graded results as JSONL (one outcome per line)
`--judge-model <model>`	string	No	Model for LLM judge graders
`--judge-reasoning-effort <level>`	string	No	Reasoning effort for the judge model: `low`, `medium`, `high`, or `xhigh`
`--grader-plugin <specifier>`	string	No	Grader plugin to load (npm package name or local path)
`--param <key=value>`	string	No	Set a param value (repeatable, e.g. `--param MODEL=gpt-4o`). Overrides eval param files and `.vally.yaml`.
`--verbose`	boolean	No	Show detailed grader evidence
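
Because --param is repeatable, a wrapper can turn a list of key=value pairs into repeated flags. The following is a minimal sketch, assuming a POSIX shell; grade_with_params is a hypothetical helper, not part of vally, and REGION is a made-up param name used only for illustration.

```shell
# grade_with_params is a hypothetical helper, not part of vally.
# Usage: grade_with_params <eval-spec> [KEY=VALUE ...] < outcomes.jsonl
grade_with_params() {
  spec=$1
  shift
  n=$#
  # Rotate through the positional parameters once, re-appending each
  # KEY=VALUE argument prefixed with its own --param flag.
  while [ "$n" -gt 0 ]; do
    kv=$1
    shift
    set -- "$@" --param "$kv"
    n=$((n - 1))
  done
  vally grade --eval-spec "$spec" "$@"
}

# Example (not executed here; REGION is a made-up param name):
#   grade_with_params eval.yaml MODEL=gpt-4o REGION=eu < outcomes.jsonl
```

Each pair gets its own --param flag, which matches the repeatable form described in the table above.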

Offline re-grading

grade is the home for offline re-grading — running graders against a saved trajectory without re-running the (expensive) agent. Pipe a saved EvalOutcome/trial-result stream back in and tweak your graders freely:

vally eval --eval-spec eval.yaml --output jsonl > run.jsonl
# ...edit your graders in eval.yaml...
vally grade --eval-spec eval.yaml < run.jsonl

Output- and trajectory-based graders (output-contains, output-matches, tool-call, skill-invocation, prompt, …) read everything they need from the saved trajectory, so this “just works” with no workspace.

File-based graders (file-exists, file-contains, run-command, …) read the agent’s workspace on disk. Because the agent’s file changes aren’t stored in the trajectory, re-grading them offline requires the workspace to still exist:

For ATIF and session-run input, pass --workspace <path> to point the graders at a preserved workspace.
To verify graders against a known-good reference solution instead (harness verification / CI gate), use vally oracle with a golden patch.
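
For the --workspace case above, a small wrapper can check that the preserved workspace still exists before grading starts. This is a sketch under assumptions: grade_with_workspace is a hypothetical helper, not part of vally, and the error message and return code are its own.

```shell
# grade_with_workspace is a hypothetical wrapper, not part of vally.
# Usage: grade_with_workspace <workspace-dir> <eval-spec> < trajectory.json
grade_with_workspace() {
  workspace=$1
  spec=$2
  # Fail fast if the preserved workspace is gone, since file-based
  # graders read the agent's files from disk.
  if [ ! -d "$workspace" ]; then
    echo "workspace not found: $workspace" >&2
    return 1
  fi
  vally grade --eval-spec "$spec" --workspace "$workspace"
}

# Example (not executed here):
#   grade_with_workspace /path/to/workspace eval.yaml < trajectory.json
```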

Reference-based grading with a golden patch

If a stimulus declares a golden_patch (a reference-solution diff), grade resolves it and hands it to the graders. The prompt LLM judge renders it as a Reference Solution section, so the judge compares the agent’s real output against the known-good answer — improving judging consistency on open-ended tasks.

The golden patch is advisory here: it adds reference context but isn’t the thing under test. If its path can’t be read, grade warns and grades without it rather than failing. (Under vally oracle the same patch is the artifact under test, so a missing patch there is a hard error.)

Exit codes

Code	Meaning
`0`	All graders passed
`1`	One or more graders failed, or an error occurred
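
Since a non-zero exit covers both grader failures and errors, a CI step can gate on the exit status alone. A minimal sketch, assuming a POSIX shell; grade_gate and its messages are hypothetical and not part of vally.

```shell
# grade_gate is a hypothetical helper, not part of vally.
# It runs the given command and maps its exit status to a message:
# 0 means all graders passed; anything else means a grader failed
# or an error occurred.
grade_gate() {
  if "$@"; then
    echo "graders passed"
  else
    rc=$?
    echo "graders failed or errored (exit $rc)"
    return "$rc"
  fi
}

# Example (not executed here):
#   grade_gate vally grade --eval-spec eval.yaml < results/outcomes.jsonl
```

The wrapper returns the original status, so a calling CI job still fails when grading does.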

Examples

# Pipe directly from eval
vally eval --eval-spec eval.yaml --skip-grade --output jsonl \
  | vally grade --eval-spec eval.yaml

# Grade a saved JSONL file
vally grade --eval-spec eval.yaml < results/outcomes.jsonl

# Grade with verbose output
vally grade --eval-spec eval.yaml --verbose < results/outcomes.jsonl

# Grade and re-emit as JSONL (for downstream processing)
vally grade --eval-spec eval.yaml --output jsonl < results/outcomes.jsonl > graded.jsonl

# Grade an ATIF trajectory (auto-detected from schema_version)
vally grade --eval-spec eval.yaml < trajectory.json

# Grade ATIF with explicit stimulus and workspace
vally grade --eval-spec eval.yaml --stimulus basic --workspace /path/to/workspace < trajectory.json

# Grade a Copilot CLI session run by ID (auto-resolves from ~/.copilot/session-state/)
echo '{"sessionKind":"copilot-cli","sessionId":"abc123"}' | vally grade --eval-spec eval.yaml

# Grade a Copilot CLI session run by explicit path
echo '{"sessionKind":"copilot-cli","sessionPath":"/path/to/events.jsonl"}' | vally grade --eval-spec eval.yaml

# Grade a Copilot CLI session run with explicit stimulus and workspace
echo '{"sessionKind":"copilot-cli","sessionId":"abc123"}' | \
  vally grade --eval-spec eval.yaml --stimulus task-name --workspace /path/to/workspace

Input formats

The grade command auto-detects the input shape from the JSON content and accepts:

trial-result JSONL: current vally eval --output jsonl output, including trial-result records and typed metadata such as run-summary
Legacy EvalOutcome JSONL: one JSON object per line, as emitted by older eval --output jsonl flows
Single EvalOutcome JSON: a whole-file JSON object
ATIF JSON: an Agent Trajectory Interchange Format document (identified by schema_version: "ATIF-...")
Session Run JSON: an explicit session run reference, with sessionKind: "copilot-cli" and exactly one of sessionId (the session ID) or sessionPath (an explicit path to a Copilot session log)

Session Run format

A session run provides a reference to Copilot CLI event logs:

{
  "sessionKind": "copilot-cli",
  "sessionId": "abc123"
}

Or with an explicit path:

{
  "sessionKind": "copilot-cli",
  "sessionPath": "/path/to/events.jsonl"
}

When sessionId is provided, the events file is auto-resolved from ~/.copilot/session-state/<sessionId>/events.jsonl. Specify sessionPath to provide an explicit location. Do not set both sessionId and sessionPath.
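
The resolution rule above can be sketched as a small shell function. This illustrates the documented behavior and is not vally's implementation; resolve_events_path is a hypothetical name, and treating "both set" or "neither set" as an error is an assumption.

```shell
# resolve_events_path is a hypothetical helper illustrating the rule above.
# Usage: resolve_events_path <sessionId> <sessionPath>
# Pass an empty string for the field that is not set.
resolve_events_path() {
  session_id=$1
  session_path=$2
  if [ -n "$session_id" ] && [ -n "$session_path" ]; then
    # Assumption: setting both fields is rejected.
    echo "error: set sessionId or sessionPath, not both" >&2
    return 1
  elif [ -n "$session_path" ]; then
    # An explicit path is used as given.
    printf '%s\n' "$session_path"
  elif [ -n "$session_id" ]; then
    # A session ID resolves under the Copilot CLI session-state directory.
    printf '%s\n' "$HOME/.copilot/session-state/$session_id/events.jsonl"
  else
    echo "error: one of sessionId or sessionPath is required" >&2
    return 1
  fi
}
```

Checking that the resolved file exists before piping a session run into grade is one possible use.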

Mixing formats

ATIF documents and EvalOutcome records must not be mixed in the same stream; grade will exit with an error if both shapes appear together. Session runs cannot be mixed with EvalOutcomes or ATIF.

Output format

━━━ basic-test-generation ━━━
✅ basic-test-generation (2/2 graders passed)
  ✓ [file-exists] Files matching 'add.test.js' found: add.test.js
  ✓ [output-contains] 'test' found in output

Score: 100.0% | PASSED

On failure:

━━━ basic-test-generation ━━━
❌ basic-test-generation (1/2 graders passed, 1 failed)
  ✓ [output-contains] 'test' found in output
  ✗ [file-exists] No files matching 'add.test.js' found

Score: 33.3% | FAILED

Workflow

The typical workflow with grade:

1. Run vally eval --eval-spec eval.yaml --skip-grade --output jsonl > outcomes.jsonl to capture trajectories
2. Iterate on the graders in your eval.yaml
3. Re-grade: vally grade --eval-spec eval.yaml < outcomes.jsonl
4. Repeat until the graders behave correctly, then run the full eval
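
The capture-once, re-grade-many loop above can be wrapped in a single function. A minimal sketch, assuming a POSIX shell; regrade is a hypothetical helper, not part of vally, and it uses only the flags shown elsewhere on this page.

```shell
# regrade is a hypothetical helper, not part of vally.
# Usage: regrade [eval-spec] [outcomes-file]
regrade() {
  spec=${1:-eval.yaml}
  outcomes=${2:-outcomes.jsonl}
  # Run the (expensive) agent only when no saved trajectories exist yet.
  if [ ! -s "$outcomes" ]; then
    vally eval --eval-spec "$spec" --skip-grade --output jsonl > "$outcomes"
  fi
  # Grade the saved trajectories against the current graders.
  vally grade --eval-spec "$spec" < "$outcomes"
}
```

Deleting the outcomes file forces a fresh agent run on the next call; otherwise only the graders are re-run.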