Run Your First Eval
This quickstart walks through a full evaluation: defining stimuli, executing an agent, capturing a trajectory, and grading the results.
What you’ll do
- Write an `eval.yaml` with stimuli and graders
- Run the eval and inspect the trajectory
- Understand the scores
1. Set up a skill directory

If you don’t have one already, create a minimal skill:

    mkdir -p my-skill && cd my-skill

Then create `SKILL.md` with the following content:

    ---
    name: test-writer
    description: Helps users write unit tests for their code.
    ---

    ## Usage

    When a user asks for unit tests, analyze their code and produce
    comprehensive test cases with good coverage.
2. Write an eval spec

Create `eval.yaml` in your skill directory:

    name: test-writer-eval
    description: Evaluate the test-writer skill
    type: capability
    config:
      runs: 1
      timeout: 120s
      model: gpt-5.5
    stimuli:
      - name: basic-test-generation
        prompt: |
          Write unit tests for this function:
          function add(a, b) { return a + b; }
          Save the tests to a file called add.test.js.
        graders:
          - type: file-exists
            config:
              path: "add.test.js"
          - type: output-contains
            config:
              substring: "test"
    scoring:
      weights:
        file-exists: 1.0
        output-contains: 0.5
      threshold: 0.7
3. Run the eval

From your skill directory, run:

    vally eval \
      --eval-spec eval.yaml \
      --skill-dir . \
      --output-dir ./results \
      --verbose

You’ll see output like:

    Found 1 skill(s): test-writer

    ━━━ basic-test-generation ━━━
    Write unit tests for this function: function add(a, b) { return a + b; }...

    Metrics
    ─────────────────────────────────────────
    Tokens        2,847
    Turns         3
    Tool calls    2
    Wall time     8.3s
    Errors        0
    Skills used   1
    Model         gpt-5.5

    Graders (2/2)
    ─────────────────────────────────────────
    ✔ file-exists        Files matching 'add.test.js' found: add.test.js
    ✔ output-contains    'test' found in output

    All graders passed.

    Saved artifacts
    JSONL    → ./results/2025-01-15T10-30-00/results.jsonl
    Markdown → ./results/2025-01-15T10-30-00/eval-results.md
4. Re-grade saved trajectories

Each run’s `results.jsonl` contains one `trial-result` record per trial, with the full trajectory embedded inline. You can pipe it back through `vally grade` to re-score with different graders without re-running the (expensive) agent execution:

    cat ./results/2025-01-15T10-30-00/results.jsonl | vally grade --eval-spec eval.yaml
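If you want to inspect the saved records yourself before re-grading, the JSONL layout (one JSON object per line) is easy to parse. The following is a minimal sketch, not part of `vally`: `parseJsonl` is a hypothetical helper, and the `type` and `stimulus` fields in the sample record are assumptions made for illustration; check a real `results.jsonl` for the actual field names.

```javascript
// Hypothetical helper: parse JSONL text (one JSON object per line) into an array.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines, e.g. a trailing newline
    .map((line) => JSON.parse(line));
}

// Made-up sample record; real trial-result records carry more data,
// including the embedded trajectory. Field names here are assumptions.
const sample = [
  '{"type":"trial-result","stimulus":"basic-test-generation"}',
  "",
].join("\n");

const records = parseJsonl(sample);
console.log(records.length); // 1
console.log(records[0].type); // trial-result
```

In practice you would read the file contents from disk and pass the resulting string to `parseJsonl`.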
Understanding the output
Metrics
Every eval run captures a trajectory, a record of everything the agent did:
| Metric | What it means |
|---|---|
| Tokens | Total input + output tokens across all LLM calls |
| Turns | Number of agent conversation turns |
| Tool calls | How many tools the agent invoked |
| Wall time | Real clock time for the run |
| Skills used | How many skills were activated by the agent |
Grader results
Each grader produces a pass/fail result with evidence explaining why:
- `✔ file-exists Files matching 'add.test.js' found: add.test.js`: the grader checked, the file exists, so it passed.
- `✘ output-contains 'jest' NOT found in output`: the grader checked, the substring is missing, so it failed.
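To make the pass/fail-plus-evidence idea concrete, here is a rough sketch of what an `output-contains` style check might do. This is an illustrative re-implementation, not the built-in grader's code: the function name `outputContains` and the `{ passed, evidence }` return shape are assumptions, and only the evidence wording is taken from the examples above.

```javascript
// Hypothetical sketch of a substring grader; the real grader may behave differently.
function outputContains(output, substring) {
  const passed = output.includes(substring);
  // Evidence strings mirror the wording shown in the sample grader output.
  const evidence = passed
    ? `'${substring}' found in output`
    : `'${substring}' NOT found in output`;
  return { passed, evidence };
}

// Made-up agent output used only for this illustration.
const agentOutput = "Saved the tests to add.test.js";

const hit = outputContains(agentOutput, "test");
const miss = outputContains(agentOutput, "jest");
console.log(hit.passed, hit.evidence);
console.log(miss.passed, miss.evidence);
```

Returning the evidence alongside the boolean is what lets a report explain a failure without re-running the agent.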
Scores
The final score is a weighted combination of grader results, which is then compared against the `threshold` set in your eval spec. See Scoring for the math.
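As a rough illustration of how weights and a threshold can interact, the sketch below assumes the score is the sum of the weights of passing graders divided by the sum of all weights, and that a run passes when the score is at least the threshold. Both are assumptions for this example, and `weightedScore` is a hypothetical function; the Scoring page is the authority on the actual formula.

```javascript
// Assumed formula (not confirmed by the docs):
// score = (sum of weights of passing graders) / (sum of all weights)
function weightedScore(results, weights) {
  let earned = 0;
  let total = 0;
  for (const [grader, weight] of Object.entries(weights)) {
    total += weight;
    if (results[grader] === true) earned += weight;
  }
  return total === 0 ? 0 : earned / total;
}

// Weights and threshold taken from the eval.yaml in this quickstart.
const weights = { "file-exists": 1.0, "output-contains": 0.5 };
const threshold = 0.7;

const bothPass = weightedScore({ "file-exists": true, "output-contains": true }, weights);
const onlyFile = weightedScore({ "file-exists": true, "output-contains": false }, weights);

console.log(bothPass, bothPass >= threshold); // 1 true
console.log(onlyFile.toFixed(3), onlyFile >= threshold); // 0.667 false
```

Under this assumed formula, a threshold of 0.7 means that passing `file-exists` alone (about 0.667) would not be enough, so both graders would need to pass.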
Next steps
- Writing eval specs: advanced stimulus patterns
- Add to CI: automate this in GitHub Actions
- Grader catalog: all built-in graders
- Debugging evals: when things go wrong