Run Your First Eval

This quickstart walks through a full evaluation: you define stimuli, execute an agent, capture a trajectory, and grade the results.

What you’ll do

Write an eval.yaml with stimuli and graders
Run the eval and inspect the trajectory
Understand the scores

Set up a skill directory

If you don’t have one already, create a minimal skill:

```
mkdir -p my-skill && cd my-skill
```

```
---
name: test-writer
description: Helps users write unit tests for their code.
---

## Usage

When a user asks for unit tests, analyze their code and produce
comprehensive test cases with good coverage.
```

Write an eval spec

Create eval.yaml in your skill directory:

```
name: test-writer-eval
description: Evaluate the test-writer skill
type: capability

defaults:
  runs: 1
  timeout: 120s
  model: gpt-5.5

stimuli:
  - name: basic-test-generation
    prompt: |
      Write unit tests for this function:

      function add(a, b) { return a + b; }

      Save the tests to a file called add.test.js.
    graders:
      - type: file-exists
        config:
          path: "add.test.js"
      - type: output-contains
        config:
          substring: "test"

scoring:
  weights:
    file-exists: 0.7
    output-contains: 0.3
  threshold: 0.7
```

Run the eval

```
vally eval \
  --eval-spec eval.yaml \
  --skill-dir . \
  --output-dir ./results \
  --verbose
```

You’ll see output like:

```
Found 1 skill(s): test-writer

━━━ basic-test-generation ━━━
  Write unit tests for this function: function add(a, b) { return a + b; }...

  Metrics
  ─────────────────────────────────────────
  Tokens        2,847
  Turns         3
  Tool calls    2
  Wall time     8.3s
  Errors        0
  Skills used   1
  Model         gpt-5.5

  Graders (2/2)
  ─────────────────────────────────────────
  ✔ file-exists     Files matching 'add.test.js' found: add.test.js
  ✔ output-contains 'test' found in output

  All graders passed.

Saved artifacts
JSONL → ./results/2025-01-15T10-30-00/results.jsonl
Markdown → ./results/2025-01-15T10-30-00/eval-results.md
```

Re-grade saved trajectories

Each run’s results.jsonl contains one trial-result record per trial, with the full trajectory embedded inline. You can pipe it back through vally grade to re-score with different graders without re-running the (expensive) agent execution:
```
cat ./results/2025-01-15T10-30-00/results.jsonl | vally grade --eval-spec eval.yaml
```
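Before re-grading, it can help to look at what a results file actually contains. The sketch below relies only on the general JSONL convention (one JSON object per line); it assumes nothing about the field names inside each record, which this page does not specify. The helper names `iter_records` and `summarize` are hypothetical and are not part of vally.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON value per non-blank line (the JSONL convention)."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc


def summarize(path: str) -> None:
    """Print the top-level keys of each record; no particular schema is assumed."""
    with open(path, encoding="utf-8") as handle:
        for index, record in enumerate(iter_records(handle), start=1):
            # Assumes each record is a JSON object, so sorted() lists its keys.
            print(f"record {index}: {sorted(record)}")
```

Calling `summarize("./results/2025-01-15T10-30-00/results.jsonl")` lists the keys of each trial record, which is a quick way to confirm that the trajectory is embedded before piping the file into `vally grade`.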

Understanding the output

Metrics

Every eval run captures a trajectory, a record of everything the agent did. The metrics summarize it:

| Metric | What it means |
| --- | --- |
| Tokens | Total input + output tokens across all LLM calls |
| Turns | Number of agent conversation turns |
| Tool calls | How many tools the agent invoked |
| Wall time | Real clock time for the run |
| Skills used | How many skills were activated by the agent |

Grader results

Each grader reports a pass or a fail, along with evidence explaining why:

✔ file-exists Files matching 'add.test.js' found: add.test.js (the grader looked for the file, found it, and passed).
✘ output-contains 'jest' NOT found in output (an example failure: the grader looked for the substring 'jest', did not find it, and failed).
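To make the pass/fail-plus-evidence shape concrete, here is a rough sketch of what the two checks used in this quickstart amount to. It is not vally's implementation: the `GraderResult` type, both function signatures, and the failure message for the file check are assumptions made for illustration, and the glob-style matching is inferred only from the "Files matching" wording in the sample output.

```python
import glob
import os
from dataclasses import dataclass


@dataclass
class GraderResult:
    passed: bool
    evidence: str  # human-readable explanation of the verdict


def file_exists(workdir: str, pattern: str) -> GraderResult:
    """Pass if at least one file under workdir matches the glob pattern."""
    matches = sorted(
        os.path.relpath(path, workdir)
        for path in glob.glob(os.path.join(glob.escape(workdir), pattern))
    )
    if matches:
        found = ", ".join(matches)
        return GraderResult(True, f"Files matching '{pattern}' found: {found}")
    # Hypothetical wording; the real failure message is not shown on this page.
    return GraderResult(False, f"No files matching '{pattern}' found")


def output_contains(output: str, substring: str) -> GraderResult:
    """Pass if the agent's output contains the substring (case-sensitive)."""
    if substring in output:
        return GraderResult(True, f"'{substring}' found in output")
    return GraderResult(False, f"'{substring}' NOT found in output")
```

Both checks are deterministic and inexpensive, which is part of why re-grading a saved trajectory costs far less than re-running the agent.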

Scores

The final score is a weighted combination of grader results (in this spec, file-exists at 0.7 and output-contains at 0.3), compared against the threshold (0.7 here). See Scoring for the math.
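As a worked illustration, the sketch below computes a score under two assumptions that this page does not confirm: that the score is the weight-normalized sum of passing graders, and that a score equal to the threshold counts as passing. The Scoring page is the authority on the actual formula; `weighted_score` and `meets_threshold` are hypothetical names.

```python
import math


def weighted_score(results: dict[str, bool], weights: dict[str, float]) -> float:
    """Weight-normalized share of graders that passed (assumed formula)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # A grader missing from results is treated as a failure.
    earned = sum(w for name, w in weights.items() if results.get(name, False))
    return earned / total


def meets_threshold(score: float, threshold: float) -> bool:
    """Inclusive comparison (assumed), tolerant of floating-point rounding."""
    return score > threshold or math.isclose(score, threshold)


# Weights and threshold taken from the eval.yaml in this quickstart.
weights = {"file-exists": 0.7, "output-contains": 0.3}
threshold = 0.7

score = weighted_score({"file-exists": True, "output-contains": False}, weights)
print(meets_threshold(score, threshold))
```

Under these assumptions, passing only file-exists yields 0.7, which just meets the threshold, while passing only output-contains yields 0.3 and falls short. That matches the intent of weighting the file check more heavily.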

Next steps

Writing eval specs — advanced stimulus patterns
Add to CI — automate this in GitHub Actions
Grader catalog — all built-in graders
Debugging evals — when things go wrong