
Run Your First Eval

This quickstart walks through a full evaluation: defining stimuli, executing an agent, capturing a trajectory, and grading the results.

In this quickstart, you will:

  1. Write an eval.yaml with stimuli and graders
  2. Run the eval and inspect the trajectory
  3. Understand the scores

  1. Set up a skill directory

    If you don’t have one already, create a minimal skill:

    mkdir -p my-skill && cd my-skill

    Then add a SKILL.md file inside it:

    ---
    name: test-writer
    description: Helps users write unit tests for their code.
    ---
    ## Usage
    When a user asks for unit tests, analyze their code and produce
    comprehensive test cases with good coverage.
  2. Write an eval spec

    Create eval.yaml in your skill directory:

    eval.yaml
    name: test-writer-eval
    description: Evaluate the test-writer skill
    type: capability
    config:
      runs: 1
      timeout: 120s
      model: gpt-5.5
    stimuli:
      - name: basic-test-generation
        prompt: |
          Write unit tests for this function:
          function add(a, b) { return a + b; }
          Save the tests to a file called add.test.js.
        graders:
          - type: file-exists
            config:
              path: "add.test.js"
          - type: output-contains
            config:
              substring: "test"
    scoring:
      weights:
        file-exists: 1.0
        output-contains: 0.5
      threshold: 0.7
  3. Run the eval

    vally eval \
    --eval-spec eval.yaml \
    --skill-dir . \
    --output-dir ./results \
    --verbose

    You’ll see output like:

    Found 1 skill(s): test-writer
    ━━━ basic-test-generation ━━━
    Write unit tests for this function: function add(a, b) { return a + b; }...
    Metrics
    ─────────────────────────────────────────
    Tokens 2,847
    Turns 3
    Tool calls 2
    Wall time 8.3s
    Errors 0
    Skills used 1
    Model gpt-5.5
    Graders (2/2)
    ─────────────────────────────────────────
    ✔ file-exists Files matching 'add.test.js' found: add.test.js
    ✔ output-contains 'test' found in output
    All graders passed.
    Saved artifacts
    JSONL → ./results/2025-01-15T10-30-00/results.jsonl
    Markdown → ./results/2025-01-15T10-30-00/eval-results.md
  4. Re-grade saved trajectories

    Each run’s results.jsonl contains one trial-result record per trial, with the full trajectory embedded inline. You can pipe it back through vally grade to re-score with different graders without re-running the (expensive) agent execution:

    cat ./results/2025-01-15T10-30-00/results.jsonl | vally grade --eval-spec eval.yaml
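
Before re-grading, it can help to look at what a saved results.jsonl contains. The sketch below is not part of vally; it only assumes what the text above states, namely that the file is JSON Lines with one record per trial. The helper names load_records and summarize are hypothetical, and no field names are assumed: the script only lists whatever top-level keys each record has.

```python
import json


def load_records(path):
    """Parse a JSONL file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records


def summarize(records):
    """Return the sorted top-level keys of each record, without assuming a schema."""
    return [sorted(record.keys()) for record in records]
```

For example, summarize(load_records("./results/2025-01-15T10-30-00/results.jsonl")) returns one list of keys per trial, which shows where the trajectory is embedded in each record.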

Every eval run captures a trajectory, a record of everything the agent did. The Metrics block in the output summarizes it:

Metric       What it means
Tokens       Total input + output tokens across all LLM calls
Turns        Number of agent conversation turns
Tool calls   How many tools the agent invoked
Wall time    Real clock time for the run
Skills used  How many skills were activated by the agent

Each grader produces a pass/fail with evidence explaining why:

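  • ✔ file-exists Files matching 'add.test.js' found: add.test.js (the grader looked for the file, found it, and passed).
  • ✘ output-contains 'jest' NOT found in output (a failing example: had the grader been configured with the substring 'jest', it would not have found it in the output and would have failed).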
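
To make the pass/fail-plus-evidence idea concrete, here is a minimal sketch of what the two graders check. It is an illustration only, not vally's implementation: the function names and return shape are hypothetical, and the file check is simplified to an exact path, whereas the real output ("Files matching ...") suggests some form of pattern matching.

```python
from pathlib import Path


def file_exists(workdir, path):
    """Hypothetical file-exists check: pass if `path` is a file under `workdir`."""
    found = (Path(workdir) / path).is_file()
    evidence = f"'{path}' found" if found else f"'{path}' NOT found"
    return found, evidence


def output_contains(output, substring):
    """Hypothetical output-contains check: pass if `substring` occurs in `output`."""
    found = substring in output
    if found:
        evidence = f"'{substring}' found in output"
    else:
        evidence = f"'{substring}' NOT found in output"
    return found, evidence
```

Each function returns a (passed, evidence) pair, mirroring the ✔/✘ marker and the explanation shown next to it in the run output.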

The final score is a weighted combination of grader results, compared against the threshold set under scoring in eval.yaml. See Scoring for the math.
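
As a rough illustration of how weights and a threshold can interact, the sketch below assumes the score is a normalized weighted average of pass/fail results. That formula is an assumption made for this example; the Scoring page is the authority on what vally actually computes. The function name weighted_score is hypothetical, and the weights and threshold are taken from the eval.yaml above.

```python
def weighted_score(results, weights):
    """Assumed formula: sum of weights of passing graders / sum of all weights.

    results maps grader name -> bool (passed), weights maps grader name -> float.
    A grader missing from results is treated as failed.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    earned = sum(w for name, w in weights.items() if results.get(name, False))
    return earned / total


# Values from the eval.yaml in this quickstart.
WEIGHTS = {"file-exists": 1.0, "output-contains": 0.5}
THRESHOLD = 0.7

both_pass = weighted_score({"file-exists": True, "output-contains": True}, WEIGHTS)
only_file = weighted_score({"file-exists": True, "output-contains": False}, WEIGHTS)
```

Under this assumption, both graders passing gives 1.0, which clears the 0.7 threshold. Passing only file-exists gives 1.0 / 1.5 ≈ 0.67, which falls just short of it.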