
Debugging Evals

Evals fail. That’s their job — to tell you something isn’t right. This guide helps you figure out what isn’t right: the skill, the eval, or the grader.

Before debugging grader logic, make sure the eval spec itself is valid:

vally lint --eval eval.yaml

This catches typos (e.g., output-contain → “Did you mean output-contains?”), invalid config keys, missing required fields, and scoring mismatches — all without running any agents.
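
A "Did you mean" hint like the one above can be produced by fuzzy-matching an unknown name against the known ones. The sketch below is only an illustration of that idea using Python's standard difflib module; it is not vally's implementation, and the KNOWN_GRADERS list and suggest helper are hypothetical names that cover only the two graders mentioned in this guide.

```python
import difflib

# Hypothetical list: only the grader names that appear in this guide.
KNOWN_GRADERS = ["output-contains", "file-exists"]


def suggest(name, known=KNOWN_GRADERS):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=0.8)
    return matches[0] if matches else None


if __name__ == "__main__":
    typo = "output-contain"
    hint = suggest(typo)
    if hint is not None:
        print(f"{typo}: did you mean {hint}?")
```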

Every grader result includes an evidence field — a human-readable explanation of why it passed or failed:

✘ file-exists No files matching 'add.test.js' found
✔ output-contains 'test' found in output

Start here. The evidence often tells you exactly what went wrong.

The --verbose flag shows the full agent output alongside grader results:

vally eval --eval-spec eval.yaml --verbose

This prints the agent’s actual text response, which helps you understand:

  • Did the agent understand the prompt?
  • Did it try the right approach?
  • Did it produce output in the expected format?

Step 3: Inspect trajectories and executor logs


vally eval saves trajectories and executor logs to the run’s output directory, ./vally-results by default (specify --output-dir to change this path). In the run’s timestamped directory, look for the trajectory in results.jsonl and the session logs under executor-session-logs/.

Each session directory always contains a metadata.json file with a logSource field indicating what was captured:

  • "native": the executor wrote a full events.jsonl session log (best quality).
  • "raw-event-fallback": events.jsonl was reconstructed from raw SDK events (partial fidelity).
  • "none": no session events were captured (the executor didn’t emit events).

The events.jsonl file is best-effort — not all executors emit native session logs, and some sessions may have no events at all.
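
To see at a glance which sessions have usable logs, the logSource values can be collected with a short script. This is a minimal sketch, assuming each session directory is a direct child of executor-session-logs/; the summarize_log_sources helper is a hypothetical name, and only the logSource field described above is read.

```python
import json
from pathlib import Path


def summarize_log_sources(session_logs_dir):
    """Map each session directory name to its logSource value."""
    summary = {}
    # Assumption: session directories sit directly under the given directory.
    for meta_path in sorted(Path(session_logs_dir).glob("*/metadata.json")):
        with meta_path.open(encoding="utf-8") as f:
            metadata = json.load(f)
        # Only logSource is documented in this guide; other keys are ignored.
        summary[meta_path.parent.name] = metadata.get("logSource", "missing")
    return summary
```

Point it at the executor-session-logs/ directory inside the run's timestamped directory. Sessions that report "none" have no events to inspect, so fall back to results.jsonl for those.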

Common things to look for in results.jsonl:

| Symptom | What to check in trajectory |
| --- | --- |
| Grader says file doesn’t exist | Check tool_call events: did the agent actually write the file? |
| Unexpected output | Check assistant_message events: what did the agent actually say? |
| Timeout | Check wallTimeMs in metrics: is the timeout too short? |
| No tool calls | Check toolCallCount: maybe the agent solved it conversationally |
| Errors | Filter for error events: network issues, permission problems, etc. |
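
A quick tally of event kinds is often enough to tell which of the symptoms above applies. The sketch below assumes that each event is a JSON object carrying its kind under a type key (for example tool_call or error); this guide does not spell out the results.jsonl schema, so the key name is an assumption and is exposed as a parameter. The helper names are hypothetical.

```python
import json
from collections import Counter


def iter_objects(node):
    """Yield every dict nested anywhere inside a parsed JSON value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_objects(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_objects(item)


def count_event_types(jsonl_path, type_key="type"):
    """Count string values found under type_key across all lines of a JSONL file."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            for obj in iter_objects(json.loads(line)):
                kind = obj.get(type_key)
                if isinstance(kind, str):
                    counts[kind] += 1
    return counts
```

Because the walk is recursive, it does not depend on where events are nested inside each record. A count of zero for tool_call alongside a failing file-exists grader suggests the agent never tried to write the file.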

The grade command lets you test grader changes against saved trajectories:

# Re-grade saved outcomes
vally grade --eval-spec eval.yaml < ./debug-results/outcomes.jsonl

This is much faster than re-running the agent. Use it to:

  • Adjust grader config (change the substring, loosen the pattern)
  • Add or remove graders
  • Tweak scoring weights and thresholds

Use --workspace to keep the agent’s working directory after the run:

vally eval \
--eval-spec eval.yaml \
--workspace ./debug-workspace

After the run, you can inspect exactly what files the agent created:

ls -la ./debug-workspace/basic-test-generation/
cat ./debug-workspace/basic-test-generation/add.test.js

“File not found” but agent seems to have created it


The agent may have written the file to a different path than your grader expects. Check:

  1. The trajectory’s tool_call events for the actual path used
  2. Your file-exists config — does the glob pattern match?
  3. The workspace directory — ls it to see what’s actually there
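
The last two checks can be combined by trying a pattern against a preserved workspace directly. This is a rough sketch using Python's pathlib; vally's own glob semantics may differ, so treat the result as a hint rather than a verdict. The find_matches helper is a hypothetical name.

```python
from pathlib import Path


def find_matches(workspace_dir, pattern):
    """Return files under workspace_dir whose name matches pattern, at any depth."""
    root = Path(workspace_dir)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob(pattern)
        if p.is_file()
    )
```

For example, calling find_matches("./debug-workspace", "add.test.js") lists every add.test.js under the workspace, which shows whether the agent wrote the file to a subdirectory your grader does not look in.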

Output grader fails but output looks correct


Usually a case sensitivity issue. By default, output-contains is case-insensitive. If you set case_sensitive: true, check that the casing matches exactly.
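
The difference between the two modes can be illustrated with a plain substring check. This is a sketch of the behavior described above, not vally's code; the output_contains function is a hypothetical stand-in for the grader.

```python
def output_contains(output, needle, case_sensitive=False):
    """Substring check that ignores case unless case_sensitive is True."""
    if case_sensitive:
        return needle in output
    return needle.lower() in output.lower()


if __name__ == "__main__":
    text = "All Tests passed"
    print(output_contains(text, "tests"))
    print(output_contains(text, "tests", case_sensitive=True))
```

With the default setting, "tests" matches "All Tests passed". With case_sensitive enabled it does not, because the agent capitalized the word.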

Inconsistent results across runs are expected with LLM-based agents. Increase runs and look at the multi-trial metrics:

  • pass@k close to 1.0 → the agent CAN do it, just not reliably
  • pass@k close to 0 → the agent fundamentally can’t do this
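
For intuition about how pass@k relates to the raw pass count, a commonly used estimator is 1 - C(n-c, k) / C(n, k), where n is the number of trials and c the number that passed: the probability that at least one of k trials drawn from the n passes. Whether vally computes the metric this way is an assumption; the sketch only shows the arithmetic.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k trials, sampled from n with c passes, passes."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k trials include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 passes out of 10 runs, pass@1 is 0.3 while pass@5 is roughly 0.92, which is the "can do it, just not reliably" pattern.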

All graders pass but score below threshold


Check your scoring.weights. If a high-weight grader scores 0.5 (partial credit), it can drag the aggregate below the threshold even if everything “passes.”
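
A small worked example shows the effect. It assumes the aggregate is a weighted mean of grader scores, which this guide does not confirm; the grader name "judge", the weights, and the 0.8 threshold are all hypothetical.

```python
def weighted_score(scores, weights):
    """Weighted mean of grader scores; both dicts are keyed by grader name."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


if __name__ == "__main__":
    # Hypothetical graders: two pass fully, the heavily weighted one gets partial credit.
    scores = {"file-exists": 1.0, "output-contains": 1.0, "judge": 0.5}
    weights = {"file-exists": 1, "output-contains": 1, "judge": 3}
    threshold = 0.8  # hypothetical
    aggregate = weighted_score(scores, weights)
    print(f"aggregate={aggregate:.2f} passes={aggregate >= threshold}")
```

Here the aggregate is (1 + 1 + 1.5) / 5 = 0.7, which falls short of the 0.8 threshold even though no grader reported a failure.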

LLM judges can be sensitive to prompt wording. If scores seem too low:

  • Run with --verbose to see per-criterion breakdowns and the judge’s reasoning
  • Check that your prompt is specific enough — vague prompts produce vague (low) scores
  • Try a different scoring scale — binary forces a clear yes/no instead of a muddy 3/5
  • Try a more capable --judge-model — smaller models may misjudge complex outputs

Pairwise comparison disagrees with position swap


The pairwise grader runs the comparison twice with A/B swapped. If the results are inconsistent (for example, whichever trajectory sits in position A wins both times), the judge model may not be able to distinguish the trajectories. Use --verbose to see both positions’ reasoning.
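
The position-swap logic can be sketched as follows. This illustrates the general technique rather than vally's implementation: the labels X and Y for the two trajectories and the reconcile_pairwise helper are hypothetical, and the judge is assumed to report the winning position as "A" or "B".

```python
def reconcile_pairwise(first_run_winner, swapped_run_winner):
    """Combine two judge verdicts taken with the candidates in swapped positions.

    In the first run trajectory X is in position A and Y in position B.
    In the swapped run Y is in position A and X in position B.
    Returns "X", "Y", or "inconsistent".
    """
    for verdict in (first_run_winner, swapped_run_winner):
        if verdict not in ("A", "B"):
            raise ValueError("verdict must be 'A' or 'B'")
    first = "X" if first_run_winner == "A" else "Y"
    second = "Y" if swapped_run_winner == "A" else "X"
    # The same position winning both runs means a different trajectory won each time.
    return first if first == second else "inconsistent"
```

A verdict of "A" in both runs means X won first and Y won second, so the judge followed the position rather than the content.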

If you see “Trial execution error: retries exhausted” in evidence, the LLM API is rate-limiting or timing out. Solutions:

  • Reduce --workers to lower concurrent API calls
  • Use --judge-model to switch to a model with higher rate limits
  • Separate grading from execution: eval --skip-grade, then grade at a lower concurrency