Debugging Evals
Evals fail. That’s their job — to tell you something isn’t right. This guide helps you figure out what isn’t right: the skill, the eval, or the grader.
Step 0: Validate your eval spec
Before debugging grader logic, make sure the eval spec itself is valid:

    vally lint --eval eval.yaml

This catches typos (e.g., output-contain → “Did you mean output-contains?”), invalid config keys, missing required fields, and scoring mismatches, all without running any agents.
Step 1: Read the evidence
Every grader result includes an evidence field, a human-readable explanation of why it passed or failed:

    ✘ file-exists No files matching 'add.test.js' found
    ✔ output-contains 'test' found in output

Start here. The evidence often tells you exactly what went wrong.
Step 2: Use --verbose
The --verbose flag shows the full agent output alongside grader results:

    vally eval --eval-spec eval.yaml --verbose

This prints the agent’s actual text response, which helps you understand:
- Did the agent understand the prompt?
- Did it try the right approach?
- Did it produce output in the expected format?
Step 3: Inspect trajectories and executor logs
vally eval saves trajectories and executor logs to the run’s output directory, ./vally-results by default (specify --output-dir to change this path). In the run’s timestamped directory, look for the trajectory in results.jsonl and the session logs under executor-session-logs/.
Each session directory always contains a metadata.json file with a logSource field indicating what was captured:
- "native": the executor wrote a full events.jsonl session log (best quality).
- "raw-event-fallback": events.jsonl was reconstructed from raw SDK events (partial fidelity).
- "none": no session events were captured (the executor didn’t emit events).
The events.jsonl file is best-effort — not all executors emit native session logs, and some sessions may have no events at all.
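Before reading individual session logs, it can help to know how many sessions captured usable events at all. The following is a minimal sketch that tallies the logSource values found in metadata.json files. It assumes each session has its own subdirectory directly under executor-session-logs/; the function name and the "missing" fallback label are illustrative, not part of vally.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_log_sources(session_logs_dir):
    """Count sessions by the logSource value in each metadata.json.

    ASSUMPTION: one subdirectory per session, each holding metadata.json.
    """
    counts = Counter()
    for meta_path in sorted(Path(session_logs_dir).glob("*/metadata.json")):
        with meta_path.open(encoding="utf-8") as fh:
            meta = json.load(fh)
        # Use a placeholder label if the field is absent instead of crashing.
        counts[meta.get("logSource", "missing")] += 1
    return dict(counts)
```

Call it with the path to a run's executor-session-logs/ directory. A high count of "none" suggests the trajectory in results.jsonl is the better place to look for that run.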
Common things to look for in results.jsonl:
| Symptom | What to check in trajectory |
|---|---|
| Grader says file doesn’t exist | Check tool_call events — did the agent actually write the file? |
| Unexpected output | Check assistant_message events — what did the agent actually say? |
| Timeout | Check wallTimeMs in metrics — is the timeout too short? |
| No tool calls | Check toolCallCount — maybe the agent solved it conversationally |
| Errors | Filter for error events — network issues, permission problems, etc. |
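The checks in the table can be scripted. The sketch below reads a JSONL file and collects every JSON object tagged with a given event type, wherever it is nested. The event names (tool_call, assistant_message, error) come from the table; the assumption that events carry their name in a "type" field is mine, so check it against the trajectory format reference and adjust type_key if needed.

```python
import json


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def find_events(node, event_type, type_key="type"):
    """Recursively collect dicts whose `type_key` equals `event_type`.

    ASSUMPTION: events are JSON objects tagged by a "type" field.
    The recursive walk avoids assuming where events are nested.
    """
    found = []
    if isinstance(node, dict):
        if node.get(type_key) == event_type:
            found.append(node)
        for value in node.values():
            found.extend(find_events(value, event_type, type_key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_events(item, event_type, type_key))
    return found
```

For example, looping over iter_records("results.jsonl") and printing find_events(record, "tool_call") shows which tool calls were recorded for each trajectory.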
Step 4: Re-grade without re-running
The grade command lets you test grader changes against saved trajectories:

    # Re-grade saved outcomes
    vally grade --eval-spec eval.yaml < ./debug-results/outcomes.jsonl

This is much faster than re-running the agent. Use it to:
- Adjust grader config (change the substring, loosen the pattern)
- Add or remove graders
- Tweak scoring weights and thresholds
Step 5: Preserve the workspace
Use --workspace to keep the agent’s working directory after the run:

    vally eval \
      --eval-spec eval.yaml \
      --workspace ./debug-workspace

After the run, you can inspect exactly what files the agent created:

    ls -la ./debug-workspace/basic-test-generation/
    cat ./debug-workspace/basic-test-generation/add.test.js

Common failure patterns
“File not found” but agent seems to have created it
The agent may have written the file to a different path than your grader expects. Check:
- The trajectory’s tool_call events for the actual path used
- Your file-exists config: does the glob pattern match?
- The workspace directory: ls it to see what’s actually there
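To compare the grader's pattern with what is on disk, you can test the pattern against a preserved workspace and then search the whole tree for the file name. This is a sketch using Python's pathlib globbing, which may not match vally's glob semantics exactly; both function names are hypothetical helpers.

```python
from pathlib import Path


def matching_files(workspace, pattern):
    """Return workspace-relative paths of files matching a glob pattern."""
    root = Path(workspace)
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


def find_by_name(workspace, filename):
    """Search the whole tree for a file name, wherever the agent put it."""
    return matching_files(workspace, f"**/{filename}")
```

If matching_files returns nothing for your pattern but find_by_name locates the file in a subdirectory, the agent wrote it somewhere the pattern does not cover.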
Output grader fails but output looks correct
Usually a case sensitivity issue. By default, output-contains is case-insensitive. If you set case_sensitive: true, check that the casing matches exactly.
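The difference between the two modes is easy to reproduce outside the grader. The function below is an illustrative sketch of a substring check with a case_sensitive switch; it is not vally's implementation, and only the default (case-insensitive) behavior is taken from the text above.

```python
def output_contains(output, needle, case_sensitive=False):
    """Substring check that is case-insensitive unless told otherwise."""
    if case_sensitive:
        return needle in output
    # casefold() is a more aggressive lower() intended for caseless matching.
    return needle.casefold() in output.casefold()


agent_output = "All Tests passed"
print(output_contains(agent_output, "tests"))                       # True
print(output_contains(agent_output, "tests", case_sensitive=True))  # False
```

If the second form fails on real agent output, compare the exact casing in the --verbose output with the substring in your config.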
Inconsistent pass/fail across trials
This is expected with LLM-based agents. Increase runs and look at the multi-trial metrics:
- pass@k close to 1.0 → the agent CAN do it, just not reliably
- pass@k close to 0 → the agent fundamentally can’t do this
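For intuition about how pass@k responds to the number of passing trials, here is the commonly used unbiased estimator: given n trials with c passes, it is the probability that a random subset of k trials contains at least one pass. Whether vally computes its pass@k this way is an assumption; treat this as a sketch for sanity-checking reported numbers.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n trials of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that k trials drawn
    # without replacement are all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 passes out of 10 trials, pass_at_k(10, 3, 1) is 0.3 while pass_at_k(10, 3, 5) is roughly 0.92, which matches the "can do it, just not reliably" case.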
All graders pass but score below threshold
Check your scoring.weights. If a high-weight grader scores 0.5 (partial credit), it can drag the aggregate below the threshold even if everything “passes.”
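A quick calculation shows how this happens. The sketch below assumes the aggregate is a weighted mean of grader scores, which may not be exactly how vally combines them; the grader name "prompt", the weights, and the 0.8 threshold are made-up example values.

```python
def weighted_score(scores, weights):
    """Weighted mean of grader scores; unlisted graders get weight 1.0."""
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    weighted_sum = sum(s * weights.get(name, 1.0) for name, s in scores.items())
    return weighted_sum / total_weight


# Hypothetical run: every grader "passes", one with partial credit.
scores = {"file-exists": 1.0, "output-contains": 1.0, "prompt": 0.5}
weights = {"file-exists": 1.0, "output-contains": 1.0, "prompt": 3.0}
aggregate = weighted_score(scores, weights)  # (1 + 1 + 1.5) / 5
print(aggregate >= 0.8)  # False: the heavy partial score pulls it under
```

Lowering the weight on the partial-credit grader, or lowering the threshold, are the two levers; vally grade lets you try either without re-running the agent.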
Debugging LLM judge graders
Prompt grader returns low scores
LLM judges can be sensitive to prompt wording. If scores seem too low:
- Run with --verbose to see per-criterion breakdowns and the judge’s reasoning
- Check that your prompt is specific enough; vague prompts produce vague (low) scores
- Try a different scoring scale; binary forces a clear yes/no instead of a muddy 3/5
- Try a more capable --judge-model; smaller models may misjudge complex outputs
Pairwise comparison disagrees with position swap
The pairwise grader runs the comparison twice with A/B swapped. If the results are inconsistent (for example, whichever trajectory is shown in position A wins both times), the judge is probably reacting to position rather than content, or it cannot distinguish the trajectories. Use --verbose to see both positions’ reasoning.
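The logic of a swap check can be written down in a few lines. This sketch assumes each run reports the winning position ("A" or "B"); the labels X and Y for the two trajectories and the return strings are illustrative and are not vally's output format.

```python
def interpret_swap(verdict_original, verdict_swapped):
    """Interpret two pairwise verdicts, each naming the winning position.

    verdict_original: winner when trajectory X is in position A, Y in B.
    verdict_swapped:  winner when Y is in position A, X in B.
    """
    valid = {"A", "B"}
    if verdict_original not in valid or verdict_swapped not in valid:
        raise ValueError("verdicts must be 'A' or 'B'")
    if verdict_original == verdict_swapped:
        # The same slot won both runs, so a different trajectory won each time.
        return f"inconsistent: position {verdict_original} won both runs"
    # Different slots won, so the same trajectory won both runs.
    return "consistent: X wins" if verdict_original == "A" else "consistent: Y wins"
```

A consistent result means the same trajectory won regardless of where it was shown; an inconsistent one means the verdict followed the position.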
LLM judge retries exhausted
If you see “Trial execution error: retries exhausted” in evidence, the LLM API is rate-limiting or timing out. Solutions:
- Reduce --workers to lower concurrent API calls
- Use --judge-model to switch to a model with higher rate limits
- Separate grading from execution: eval --skip-grade, then grade at a lower concurrency
Next steps
- Writing eval specs: improve your stimuli
- Writing custom graders: build better checks
- Trajectory format reference: full event schema