Migrating from Other Systems

This guide helps teams migrate from custom eval scripts or other frameworks to Vally. The core idea: your existing concepts map cleanly to Vally’s model.

Concept mapping

Your concept	Vally equivalent	Notes
Test case / example	Stimulus	A prompt + grader config in `eval.yaml`
Assertion / check	Grader	Implements the `Grader` interface
Test run / execution	Trajectory	Flat event log captured during agent execution
Test result	GraderResult	`{ passed, score, evidence }`
Test suite / harness	eval.yaml	Collection of stimuli with scoring config
Test config / profile	eval.yaml `defaults`	Execution settings (runs, timeout, model, executor, judge_model)
CI check	`vally lint` or `vally eval`	CLI commands with exit codes
Gold reference / expected output	`reference-based` grader input	Comparison runs via `vally compare` (baseline vs. treatment)
Test report	LintReporter / EvalReporter	Built-in `LintConsoleReporter` and `EvalConsoleReporter`; extensible

Migration steps

Inventory your existing checks

List every assertion your current system makes. For each one, categorize it:

Check	Type	Vally grader
“Output contains X”	String match	`output-contains` (built-in)
“Output does NOT contain X”	Negated string match	`output-not-contains` (built-in)
“File was created”	File check	`file-exists` (built-in)
“File was NOT created”	Negated file check	`file-not-exists` (built-in)
“File contains pattern”	Content check	`file-contains` (built-in)
“File does NOT contain text”	Negated content check	`file-not-contains` (built-in)
“Output matches regex”	Pattern match	`output-matches` (built-in)
“Output does NOT match regex”	Negated pattern match	`output-not-matches` (built-in)
“Command exits 0”	Script check	`run-command` (built-in)
“Agent produced output”	Session check	`exit-success` (built-in)
“Session completed cleanly”	Session health	`completed` (built-in)
“LLM says output is good”	LLM judge	`prompt` (built-in)
“Compare A vs B”	A/B comparison	`prompt` comparison mode (`vally compare`)
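Keeping the inventory as data makes the split between built-in and custom graders explicit. The sketch below is a hypothetical helper, not part of Vally: the `LegacyCheck` shape and its `kind` labels are assumptions made for illustration, and only the grader names are taken from the table above.

```typescript
// Hypothetical inventory helper; not part of Vally.
// `LegacyCheck` and its `kind` labels are assumed shapes for illustration.
interface LegacyCheck {
  description: string;
  kind: string;
}

// Grader names copied from the inventory table above.
const BUILT_IN_GRADERS = new Map<string, string>([
  ["string-match", "output-contains"],
  ["negated-string-match", "output-not-contains"],
  ["file-check", "file-exists"],
  ["negated-file-check", "file-not-exists"],
  ["content-check", "file-contains"],
  ["negated-content-check", "file-not-contains"],
  ["pattern-match", "output-matches"],
  ["negated-pattern-match", "output-not-matches"],
  ["script-check", "run-command"],
  ["llm-judge", "prompt"],
]);

interface Inventory {
  builtIn: { check: LegacyCheck; grader: string }[];
  custom: LegacyCheck[]; // these need a hand-written grader
}

// Split legacy checks into those covered by a built-in grader and the rest.
function categorize(checks: LegacyCheck[]): Inventory {
  const inventory: Inventory = { builtIn: [], custom: [] };
  for (const check of checks) {
    const grader = BUILT_IN_GRADERS.get(check.kind);
    if (grader !== undefined) {
      inventory.builtIn.push({ check, grader });
    } else {
      inventory.custom.push(check);
    }
  }
  return inventory;
}
```

Anything that lands in `custom` is a candidate for a hand-written grader later in the migration.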

Convert test cases to stimuli

Each test case becomes a stimulus in `eval.yaml`:

# Before (custom script): run_test("Write tests for add()", expect_file="add.test.js")

stimuli:
  - name: test-generation
    prompt: "Write tests for the add() function"
    graders:
      - type: file-exists
        config:
          path: "add.test.js"
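For a suite with many test cases, the conversion can be scripted. The following is a minimal sketch under stated assumptions: `LegacyTestCase` is a hypothetical shape for your existing data, and `StimulusSketch` mirrors only the fields shown in the stimulus above (`name`, `prompt`, and a `file-exists` grader with `config.path`). It is not Vally's own schema type.

```typescript
// Hypothetical legacy shape; adjust to whatever your scripts actually store.
interface LegacyTestCase {
  name: string;
  prompt: string;
  expectFile: string;
}

// Mirrors only the stimulus fields shown in the YAML above.
interface StimulusSketch {
  name: string;
  prompt: string;
  graders: { type: string; config: { path: string } }[];
}

// Convert one legacy "expect file" test case into a stimulus object.
function toStimulus(testCase: LegacyTestCase): StimulusSketch {
  return {
    name: testCase.name,
    prompt: testCase.prompt,
    graders: [{ type: "file-exists", config: { path: testCase.expectFile } }],
  };
}

const legacy: LegacyTestCase[] = [
  {
    name: "test-generation",
    prompt: "Write tests for the add() function",
    expectFile: "add.test.js",
  },
];

// Serialize the stimuli list; the JSON can then be rewritten as block-style YAML.
const stimuliJson = JSON.stringify({ stimuli: legacy.map(toStimulus) }, null, 2);
console.log(stimuliJson);
```

The script prints JSON, which you can convert to block-style YAML by hand or with a YAML library before pasting it into the spec. Whether Vally's loader accepts the JSON form directly is not verified here.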

Convert custom assertions to graders

For checks that don’t map to built-in graders, implement the `Grader` interface:

import type { Grader, GraderInput, GraderResult, GraderMetadata } from "@microsoft/vally";

export class MyCustomCheck implements Grader {
  metadata: GraderMetadata = {
    name: "my-custom-check",
    description: "Checks something specific to my domain",
    behavior: { execution: "single" },
    determinism: "complex-static",
    costProfile: "low",
    reference: "reference-free",
    temporalScope: "trajectory-level",
  };

  async grade(input: GraderInput): Promise<GraderResult> {
    // Port your existing assertion logic here.
    // Placeholder so the example compiles: replace `false` with your check.
    const passed: boolean = false;
    return {
      name: this.metadata.name,
      kind: "code",
      passed,
      score: passed ? 1 : 0,
      evidence: passed ? "Check passed" : "Check failed because...",
    };
  }
}
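Many custom scripts signal failure by throwing (for example, assertion helpers). One way to port them is a small wrapper that turns a throwing assertion into the result fields used above. This is a sketch, not Vally API: `ResultSketch` is a local stand-in that copies only the fields shown in the example, and `runLegacyAssertion` is a hypothetical helper name.

```typescript
// Local stand-in for the result fields used in the grader above.
// It is NOT imported from @microsoft/vally; the real GraderResult may differ.
interface ResultSketch {
  name: string;
  kind: "code";
  passed: boolean;
  score: number;
  evidence: string;
}

// Runs a legacy assertion that signals failure by throwing, and
// converts the outcome into a pass/fail result with evidence text.
function runLegacyAssertion(name: string, assertion: () => void): ResultSketch {
  try {
    assertion();
    return { name, kind: "code", passed: true, score: 1, evidence: "Check passed" };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { name, kind: "code", passed: false, score: 0, evidence: `Check failed: ${reason}` };
  }
}
```

Inside `grade`, the body could then reduce to a single call such as `runLegacyAssertion(this.metadata.name, () => myOldAssert(...))`, where `myOldAssert` stands for your existing check.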

Set up your eval.yaml

Combine your stimuli and configure scoring:

name: migrated-eval-suite
type: capability
defaults:
  runs: 3
  timeout: 120s
  model: gpt-5.5
stimuli:
  # All your converted test cases
scoring:
  threshold: 0.7

Update CI

Replace your existing eval CI step with Vally:

- run: npm install -g @microsoft/vally-cli
- run: vally lint .
- run: vally eval --eval-spec eval.yaml --output-dir ./results

What you gain

After migration, you get several things that are hard to build with custom scripts:

Feature	Custom scripts	Vally
Shared grader taxonomy	❌ Ad-hoc	✅ Declared per grader
Multi-trial metrics (pass@k)	❌ Build yourself	✅ Built-in
Trajectory capture + replay	❌ Build yourself	✅ Built-in
Re-grade without re-running agent	❌ Not possible	✅ `vally grade`
Shared graders across teams	❌ Copy-paste	✅ Plugin registry
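To give a sense of what "build yourself" means for multi-trial metrics, here is a sketch of the commonly used unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), for n trials of which c passed. This guide does not specify how Vally computes its metric, so treat the formula choice and the `passAtK` name as assumptions rather than a description of Vally's implementation.

```typescript
// pass@k = 1 - C(n - c, k) / C(n, k), for n trials with c passes.
// Computed as a running product to avoid large binomial coefficients.
function passAtK(n: number, c: number, k: number): number {
  if (![n, c, k].every(Number.isInteger)) {
    throw new RangeError("n, c and k must be integers");
  }
  if (n < 1 || c < 0 || c > n || k < 1 || k > n) {
    throw new RangeError("require n >= 1, 0 <= c <= n and 1 <= k <= n");
  }
  // Fewer than k failures: every sample of k trials contains a pass.
  if (n - c < k) return 1;

  // Probability that all k sampled trials are failures.
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}
```

With `runs: 3` and one passing trial, `passAtK(3, 1, 1)` is about 0.33 and `passAtK(3, 1, 2)` is about 0.67.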

Next steps

Writing eval specs — advanced patterns
Writing custom graders — full guide
Add to CI — GitHub Actions setup