Debugging Evals

Evals fail. That’s their job — to tell you something isn’t right. This guide helps you figure out what isn’t right: the skill, the eval, or the grader.

Step 0: Validate your eval spec

Before debugging grader logic, make sure the eval spec itself is valid:

vally lint --eval-spec eval.yaml

This catches typos (e.g., output-contain → “Did you mean output-contains?”), invalid config keys, missing required fields, and scoring mismatches — all without running any agents.
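To build intuition for the “Did you mean …?” hint, here is a minimal sketch of how a typo suggestion can be computed with Python’s standard difflib. This is not vally’s implementation; the `suggest` helper and the `KNOWN_GRADERS` list (just the two grader names this guide mentions) are assumptions for illustration only.

```python
import difflib

# Hypothetical list: only the grader names that appear in this guide.
KNOWN_GRADERS = ["output-contains", "file-exists"]


def suggest(name, known=KNOWN_GRADERS):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=0.6)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(suggest("output-contain"))
```

A name with no near match returns None, which is the case where a linter can only report “unknown grader” without a suggestion.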

Step 1: Read the evidence

Every grader result includes an evidence field — a human-readable explanation of why it passed or failed:

✘ file-exists     No files matching 'add.test.js' found
✔ output-contains 'test' found in output

Start here. The evidence often tells you exactly what went wrong.

Step 2: Use --verbose

The --verbose flag shows the full agent output alongside grader results:

vally eval --eval-spec eval.yaml --verbose

This prints the agent’s actual text response, which helps you understand:

Did the agent understand the prompt?
Did it try the right approach?
Did it produce output in the expected format?

Step 3: Inspect trajectories and executor logs

vally eval saves trajectories and executor logs to the run’s output directory, ./vally-results by default (pass --output-dir to change it). Inside the run’s timestamped directory, results.jsonl holds the trajectories, and the per-trial session logs live under <eval>/<stimulus>/<model>/<trial>/.

Each session directory always contains a metadata.json file with a logSource field indicating what was captured:

"native" — the executor wrote a full events.jsonl session log (best quality).
"raw-event-fallback" — events.jsonl was reconstructed from raw SDK events (partial fidelity).
"none" — no session events were captured (the executor didn’t emit events).

The events.jsonl file is best-effort — not all executors emit native session logs, and some sessions may have no events at all. If a stimulus configures artifacts, the copied files are saved in an artifacts/ subdirectory of the trial’s session directory.
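When a run has many trials, it can help to see at a glance how many sessions were captured at each quality level. The sketch below walks a run directory and tallies the logSource value from every metadata.json it finds. It assumes metadata.json is a JSON object with a top-level logSource key, as described above; the `summarize_log_sources` helper and the "missing" and "unreadable" buckets are hypothetical names, not part of vally.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_log_sources(run_dir):
    """Count logSource values across every metadata.json under run_dir."""
    counts = Counter()
    for meta_path in Path(run_dir).rglob("metadata.json"):
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            counts["unreadable"] += 1  # hypothetical bucket for broken files
            continue
        if not isinstance(meta, dict):
            counts["unreadable"] += 1
            continue
        # "missing" is a hypothetical bucket for files without the field.
        counts[str(meta.get("logSource", "missing"))] += 1
    return dict(counts)
```

If most sessions report "none", the events.jsonl files will not help much, and the --verbose output or a preserved workspace is likely a better source of evidence.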

Common things to look for in results.jsonl:

Symptom	What to check in trajectory
Grader says file doesn’t exist	Check `tool_call` events — did the agent actually write the file?
Unexpected output	Check `assistant_message` events — what did the agent actually say?
Timeout	Check `wallTimeMs` in metrics — is the timeout too short?
No tool calls	Check `toolCallCount` — maybe the agent solved it conversationally
Errors	Filter for `error` events — network issues, permission problems, etc.
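The checks in the table can be scripted. The following sketch pulls out every event of a given kind from results.jsonl, regardless of how deeply it is nested. The record layout is an assumption: it presumes events are JSON objects with a string "type" field whose value is one of the names in the table (tool_call, assistant_message, error). Consult the trajectory format reference for the real schema before relying on it; `find_events` and `events_in_file` are hypothetical helper names.

```python
import json


def find_events(node, wanted):
    """Yield every dict under node whose "type" value is in wanted."""
    if isinstance(node, dict):
        kind = node.get("type")
        if isinstance(kind, str) and kind in wanted:
            yield node
        for value in node.values():
            yield from find_events(value, wanted)
    elif isinstance(node, list):
        for item in node:
            yield from find_events(item, wanted)


def events_in_file(path, wanted):
    """Yield matching events from a JSONL file, one JSON value per line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield from find_events(json.loads(line), wanted)
```

For example, `list(events_in_file("results.jsonl", {"error"}))` would collect the error events under these assumptions, and passing `{"tool_call"}` shows whether the agent ever attempted to write a file.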

Step 4: Re-grade without re-running

The grade command lets you test grader changes against saved trajectories:

# Re-grade saved outcomes
vally grade --eval-spec eval.yaml < ./debug-results/outcomes.jsonl

This is much faster than re-running the agent. Use it to:

Adjust grader config (change the substring, loosen the pattern)
Add or remove graders
Tweak scoring weights and thresholds

Step 5: Preserve the workspace

Use --workspace to keep the agent’s working directory after the run:

vally eval \
  --eval-spec eval.yaml \
  --workspace ./debug-workspace

After the run, you can inspect exactly what files the agent created:

ls -la ./debug-workspace/basic-test-generation/
cat ./debug-workspace/basic-test-generation/add.test.js

Common failure patterns

“File not found” but agent seems to have created it

The agent may have written the file to a different path than your grader expects. Check:

The trajectory’s tool_call events for the actual path used
Your file-exists config — does the glob pattern match?
The workspace directory — ls it to see what’s actually there
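A quick way to confirm a path mismatch is to search the preserved workspace for the file name at any depth. This sketch uses only the standard library; `find_by_name` is a hypothetical helper, and it does not reproduce how the file-exists grader matches its pattern.

```python
from pathlib import Path


def find_by_name(workspace, filename):
    """Return workspace-relative paths of every file named filename."""
    root = Path(workspace)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob(filename)
        if path.is_file()
    )
```

If `find_by_name("./debug-workspace", "add.test.js")` returns a path in a subdirectory your pattern does not cover, the agent did its job and the grader config is what needs adjusting.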

Output grader fails but output looks correct

Usually a case sensitivity issue. By default, output-contains is case-insensitive. If you set case_sensitive: true, check that the casing matches exactly.
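The difference between the two modes can be illustrated with a small substring check. This is only a sketch of the described behavior, not vally’s grader code; the `output_contains` function and its parameters are assumptions that mirror the case_sensitive option.

```python
def output_contains(output, needle, case_sensitive=False):
    """Substring check that ignores case unless case_sensitive is True."""
    if not case_sensitive:
        output, needle = output.casefold(), needle.casefold()
    return needle in output


if __name__ == "__main__":
    text = "All Tests passed"
    print(output_contains(text, "tests"))                       # True
    print(output_contains(text, "tests", case_sensitive=True))  # False
```

If a grader fails with case_sensitive: true while the output looks right, compare the expected string and the actual output character by character, including capitalization at the start of sentences.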

Inconsistent pass/fail across trials

This is expected with LLM-based agents. Increase runs and look at the multi-trial metrics:

pass@k close to 1.0 → the agent CAN do it, just not reliably
pass@k close to 0 → the agent fundamentally can’t do this
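To make these numbers concrete, here is one common way to compute pass@k from n trials with c passes: the unbiased estimator 1 - C(n-c, k) / C(n, k), the probability that at least one of k randomly chosen trials passed. This guide does not state which formula vally uses, so treat this as an assumption for building intuition rather than a description of the tool.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k trials drawn from n (c passing) passes."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k trials failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 passes out of 10 trials, pass@1 is about 0.3 while pass@5 is about 0.92: the agent can do the task but succeeds unreliably on any single attempt.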

All graders pass but score below threshold

Check your scoring.weights. If a high-weight grader scores 0.5 (partial credit), it can drag the aggregate below the threshold even if everything “passes.”
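A small worked example shows the effect. It assumes the aggregate is a weighted mean of per-grader scores, which this guide does not confirm; the grader names, weights, and threshold below are made up for illustration.

```python
def weighted_score(scores, weights):
    """Weighted mean of per-grader scores (assumed aggregation rule)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


if __name__ == "__main__":
    # Hypothetical values: one heavy grader with partial credit.
    scores = {"judge": 0.5, "file-exists": 1.0, "output-contains": 1.0}
    weights = {"judge": 3, "file-exists": 1, "output-contains": 1}
    threshold = 0.8
    aggregate = weighted_score(scores, weights)
    print(round(aggregate, 3), aggregate >= threshold)
```

Here the aggregate is (0.5 × 3 + 1 + 1) / 5 = 0.7, below a threshold of 0.8, even though no individual grader reported a failure.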

Debugging LLM judge graders

Prompt grader returns low scores

LLM judges can be sensitive to prompt wording. If scores seem too low:

Run with --verbose to see per-criterion breakdowns and the judge’s reasoning
Check that your prompt is specific enough — vague prompts produce vague (low) scores
Try a different scoring scale — binary forces a clear yes/no instead of a muddy 3/5
Try a more capable --judge-model — smaller models may misjudge complex outputs

Comparison disagrees across position swap

The prompt judge in comparison mode runs each comparison twice with the two responses swapped. If the two directions disagree on the winner, the result is forced to a tie (the judge couldn’t distinguish the trajectories independent of order). Use --verbose to see per-trial reasoning.
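The tie rule can be sketched in a few lines. The labels "A", "B", and "tie" and the `resolve_swapped` function are hypothetical; the sketch assumes the verdict from the swapped run has already been mapped back to the original labels.

```python
def resolve_swapped(first_pass, second_pass):
    """Combine two verdicts ("A", "B", or "tie") from a position-swapped comparison.

    first_pass: winner with the responses in their original order.
    second_pass: winner with the responses swapped, mapped back to original labels.
    """
    valid = {"A", "B", "tie"}
    if first_pass not in valid or second_pass not in valid:
        raise ValueError("verdicts must be 'A', 'B', or 'tie'")
    # Agreement in both directions stands; any disagreement becomes a tie.
    return first_pass if first_pass == second_pass else "tie"
```

A high rate of forced ties therefore suggests position bias in the judge or two responses that are genuinely close, and the per-trial reasoning from --verbose helps tell those apart.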

LLM judge retries exhausted

If you see “Trial execution error: retries exhausted” in evidence, the LLM API is rate-limiting or timing out. Solutions:

Reduce --workers to lower concurrent API calls
Use --judge-model to switch to a model with higher rate limits
Separate grading from execution: eval --skip-grade, then grade at a lower concurrency

Hermeticity: local Copilot settings

By default, eval runs are hermetic with respect to your machine’s Copilot configuration. Each run points the spawned Copilot runtime at a fresh, empty COPILOT_HOME, so your personal ~/.copilot/settings.json never affects results. In particular, sticky /model and /subagents picker preferences (model, reasoning effort, context tier) and other persisted settings (permissions, MCP config, disabled skills, hooks, extensions) cannot silently leak into an eval. Such leaks are a common cause of “it passes for me but not in CI” and of inconsistent token/latency numbers across machines. Everything an eval actually needs (model, skills, MCP servers) is configured explicitly by the eval, not read from your local config.

Next steps

Writing eval specs — improve your stimuli
Writing custom graders — build better checks
Trajectory format reference — full event schema