
CLI: grade

vally eval --output jsonl | vally grade --eval-spec eval.yaml [options]
vally grade --eval-spec eval.yaml [options] < outcomes.jsonl

Grade trajectories read from stdin. Accepts trial-result JSONL from vally eval --output jsonl, legacy EvalOutcome JSONL, a single EvalOutcome JSON object from eval, ATIF, or a session run from an agent like Copilot CLI. The input format is auto-detected from the JSON shape. This lets you iterate on graders without re-running the (expensive) agent execution.

For each trajectory, the command matches its stimulus to the graders defined in the eval spec.

| Flag | Type | Required | Description |
| --- | --- | --- | --- |
| --eval-spec <path> | string | Yes | Path to eval spec file |
| --stimulus <name> | string | No | Stimulus name to grade against (ATIF or session run input only; required when prompt matches multiple stimuli) |
| --workspace <path> | string | No | Path to the agent workspace directory for file-based graders (ATIF or session run input only; defaults to the directory in the session run, or process.cwd() for ATIF) |
| --output jsonl | string | No | Emit graded results as JSONL (one outcome per line) |
| --judge-model <model> | string | No | Model for LLM judge graders |
| --grader-plugin <specifier> | string | No | Grader plugin to load (npm package name or local path) |
| --verbose | boolean | No | Show detailed grader evidence |

| Exit code | Meaning |
| --- | --- |
| 0 | All graders passed |
| 1 | One or more graders failed, or an error occurred |

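Because the exit status separates "all graders passed" from everything else, a script or CI step can branch on it. The following is a minimal sketch, not part of vally itself: `describe_grade_status` is a hypothetical helper name, and the messages only restate the two exit codes documented above.

```shell
# Hypothetical helper: turn a vally grade exit status into a short message.
# Only codes 0 and 1 are documented; anything else is reported as unexpected.
describe_grade_status() {
  case "$1" in
    0) echo "all graders passed" ;;
    1) echo "a grader failed or an error occurred" ;;
    *) echo "unexpected exit status: $1" ;;
  esac
}

# Usage (requires vally on PATH):
#   vally grade --eval-spec eval.yaml < outcomes.jsonl
#   describe_grade_status "$?"
```

Note that exit code 1 covers both grader failures and errors, so a script that needs to tell them apart would have to inspect the command's output as well.
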
# Pipe directly from eval
vally eval --eval-spec eval.yaml --skip-grade --output jsonl \
| vally grade --eval-spec eval.yaml
# Grade a saved JSONL file
vally grade --eval-spec eval.yaml < results/outcomes.jsonl
# Grade with verbose output
vally grade --eval-spec eval.yaml --verbose < results/outcomes.jsonl
# Grade and re-emit as JSONL (for downstream processing)
vally grade --eval-spec eval.yaml --output jsonl < results/outcomes.jsonl > graded.jsonl
# Grade an ATIF trajectory (auto-detected from schema_version)
vally grade --eval-spec eval.yaml < trajectory.json
# Grade ATIF with explicit stimulus and workspace
vally grade --eval-spec eval.yaml --stimulus basic --workspace /path/to/workspace < trajectory.json
# Grade a Copilot CLI session run by ID (auto-resolves from ~/.copilot/session-state/)
echo '{"sessionKind":"copilot-cli","sessionId":"abc123"}' | vally grade --eval-spec eval.yaml
# Grade a Copilot CLI session run by explicit path
echo '{"sessionKind":"copilot-cli","sessionPath":"/path/to/events.jsonl"}' | vally grade --eval-spec eval.yaml
# Grade a Copilot CLI session run with explicit stimulus and workspace
echo '{"sessionKind":"copilot-cli","sessionId":"abc123"}' | \
vally grade --eval-spec eval.yaml --stimulus task-name --workspace /path/to/workspace

The grade command auto-detects the input shape from the JSON content and accepts:

  1. Trial-result JSONL: the current vally eval --output jsonl output, including trial-result records and typed metadata such as run-summary
  2. Legacy EvalOutcome JSONL: one JSON object per line, as emitted by older eval --output jsonl flows
  3. Single EvalOutcome JSON: a whole-file JSON object
  4. ATIF JSON: an Agent Trajectory Interchange Format document (identified by schema_version: "ATIF-...")
  5. Session Run JSON: an explicit session run reference (sessionKind: "copilot-cli" plus exactly one of sessionId, the session ID, or sessionPath, an explicit path to a Copilot CLI session log)
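When a saved file is graded long after it was produced, it can help to check which shape it has before piping it in. The sketch below is a rough pre-flight heuristic based only on the two markers named in the list above (sessionKind and schema_version). It is an assumption-laden illustration, not vally's actual detection logic, and `guess_input_shape` is a hypothetical name.

```shell
# Hypothetical pre-flight check: guess the input shape of a file.
# Heuristic only; vally's real auto-detection is not described on this page.
guess_input_shape() {
  if grep -Eq '"sessionKind"[[:space:]]*:[[:space:]]*"copilot-cli"' "$1"; then
    echo "session-run"
  elif grep -Eq '"schema_version"[[:space:]]*:[[:space:]]*"ATIF-' "$1"; then
    echo "atif"
  else
    # Neither marker found: assume trial-result or EvalOutcome records.
    echo "trial-result-or-evaloutcome"
  fi
}

# Usage:
#   guess_input_shape trajectory.json
```

The check cannot distinguish trial-result records from EvalOutcome records, because this page does not list the fields that identify them.
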

A session run provides a reference to Copilot CLI event logs:

{
  "sessionKind": "copilot-cli",
  "sessionId": "abc123"
}

Or with an explicit path:

{
  "sessionKind": "copilot-cli",
  "sessionPath": "/path/to/events.jsonl"
}

When sessionId is provided, the events file is auto-resolved from ~/.copilot/session-state/<sessionId>/events.jsonl. Specify sessionPath to provide an explicit location. Do not set both sessionId and sessionPath.
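The resolution rule above can be checked by hand before grading. The following sketch builds the documented events path for a session ID and emits a session run reference only if that file exists. The helper names are hypothetical, and the JSON is built without escaping, so the sketch assumes a plain alphanumeric session ID.

```shell
# Print the events file that a sessionId is documented to resolve to.
session_events_path() {
  printf '%s/.copilot/session-state/%s/events.jsonl\n' "$HOME" "$1"
}

# Print a session run reference for the given session ID.
# No JSON escaping is done: assumes a plain alphanumeric ID.
session_run_json() {
  printf '{"sessionKind":"copilot-cli","sessionId":"%s"}\n' "$1"
}

# Emit the reference on stdout only if the events file exists.
session_run_checked() {
  path=$(session_events_path "$1")
  if [ ! -f "$path" ]; then
    echo "no events file at $path" >&2
    return 1
  fi
  session_run_json "$1"
}

# Usage (requires vally on PATH):
#   session_run_checked abc123 > run.json && vally grade --eval-spec eval.yaml < run.json
```

Writing the reference to a file first means vally grade is never started with empty stdin when the session cannot be found.
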

ATIF documents and EvalOutcome records must not be mixed in the same stream; grade will exit with an error if both shapes appear together. Session runs cannot be mixed with EvalOutcomes or ATIF.

On success:

━━━ basic-test-generation ━━━
✅ basic-test-generation (2/2 graders passed)
✓ [file-exists] Files matching 'add.test.js' found: add.test.js
✓ [output-contains] 'test' found in output
Score: 100.0% | PASSED

On failure:

━━━ basic-test-generation ━━━
❌ basic-test-generation (1/2 graders passed, 1 failed)
✓ [output-contains] 'test' found in output
✗ [file-exists] No files matching 'add.test.js' found
Score: 33.3% | FAILED

The typical workflow with grade:

  1. Run vally eval --eval-spec eval.yaml --skip-grade --output jsonl > outcomes.jsonl to capture trajectories
  2. Iterate on your eval.yaml graders
  3. Re-grade: vally grade --eval-spec eval.yaml < outcomes.jsonl
  4. Repeat until graders work correctly, then run the full eval
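The loop above can be wrapped in a small shell function so the expensive capture runs only once. This is a sketch under stated assumptions: `regrade` is a hypothetical name, the file names are defaults chosen for illustration, and it assumes vally eval exits with status 0 when the capture succeeds.

```shell
# Hypothetical wrapper: capture trajectories once, then grade them.
# Later calls reuse the saved file, so only the grading step repeats.
regrade() {
  spec=${1:-eval.yaml}
  outcomes=${2:-outcomes.jsonl}
  if [ ! -s "$outcomes" ]; then
    # Write to a temporary file first so a failed run leaves no partial capture.
    vally eval --eval-spec "$spec" --skip-grade --output jsonl > "$outcomes.tmp" \
      && mv "$outcomes.tmp" "$outcomes" || return 1
  fi
  vally grade --eval-spec "$spec" < "$outcomes"
}

# Usage (requires vally on PATH): edit eval.yaml, then run
#   regrade eval.yaml outcomes.jsonl
```

Delete the saved outcomes file to force a fresh capture, for example after changing the stimuli rather than the graders.
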