Migrating from Other Systems

This guide helps teams migrate from custom eval scripts or other frameworks to Vally. The core idea: your existing concepts map cleanly to Vally’s model.

| Your concept | Vally equivalent | Notes |
| --- | --- | --- |
| Test case / example | Stimulus | A prompt + grader config in eval.yaml |
| Assertion / check | Grader | Implements the Grader interface |
| Test run / execution | Trajectory | Flat event log captured during agent execution |
| Test result | GraderResult | { passed, score, evidence } |
| Test suite / harness | eval.yaml | Collection of stimuli with scoring config |
| Test config / profile | eval.yaml config | Execution settings (runs, timeout, model, executor, judge_model) |
| CI check | vally lint or eval | CLI commands with exit codes |
| Gold reference / expected output | reference-based grader input | Passed as trajectoryB in pairwise graders |
| Test report | LintReporter / EvalReporter | LintConsoleReporter, EvalConsoleReporter built-in, extensible |

  1. Inventory your existing checks

    List every assertion your current system makes. For each one, categorize it:

    | Check | Type | Vally grader |
    | --- | --- | --- |
    | “Output contains X” | String match | output-contains (built-in) |
    | “Output does NOT contain X” | Negated string match | output-not-contains (built-in) |
    | “File was created” | File check | file-exists (built-in) |
    | “File was NOT created” | Negated file check | file-not-exists (built-in) |
    | “File contains pattern” | Content check | file-contains (built-in) |
    | “File does NOT contain text” | Negated content check | file-not-contains (built-in) |
    | “Output matches regex” | Pattern match | output-matches (built-in) |
    | “Output does NOT match regex” | Negated pattern match | output-not-matches (built-in) |
    | “Command exits 0” | Script check | run-command (built-in) |
    | “Agent produced output” | Session check | exit-success (built-in) |
    | “Session completed cleanly” | Session health | completed (built-in) |
    | “LLM says output is good” | LLM judge | prompt (built-in) |
    | “Compare A vs B” | A/B comparison | pairwise (built-in) |
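
The inventory can also be kept as data, so that checks without a built-in equivalent stand out. The sketch below is one possible way to do that. The `CheckType` labels, `LegacyCheck`, and `buildInventory` are hypothetical names invented for illustration; only the grader names on the right-hand side come from the table above.

```typescript
// Hypothetical labels for the "Type" column of the inventory table.
type CheckType =
  | "string-match"
  | "negated-string-match"
  | "file-check"
  | "negated-file-check"
  | "content-check"
  | "negated-content-check"
  | "pattern-match"
  | "negated-pattern-match"
  | "script-check"
  | "session-check"
  | "session-health"
  | "llm-judge"
  | "ab-comparison";

// Built-in grader names, copied from the inventory table.
const BUILT_IN_GRADERS: Record<CheckType, string> = {
  "string-match": "output-contains",
  "negated-string-match": "output-not-contains",
  "file-check": "file-exists",
  "negated-file-check": "file-not-exists",
  "content-check": "file-contains",
  "negated-content-check": "file-not-contains",
  "pattern-match": "output-matches",
  "negated-pattern-match": "output-not-matches",
  "script-check": "run-command",
  "session-check": "exit-success",
  "session-health": "completed",
  "llm-judge": "prompt",
  "ab-comparison": "pairwise",
};

// Hypothetical record for one assertion in the old system.
interface LegacyCheck {
  description: string;
  type: CheckType | "custom";
}

interface InventoryRow {
  description: string;
  grader: string | null; // null means a custom grader is needed (step 3)
}

function buildInventory(checks: LegacyCheck[]): InventoryRow[] {
  return checks.map((check) => ({
    description: check.description,
    grader: check.type === "custom" ? null : BUILT_IN_GRADERS[check.type],
  }));
}

// Example data, made up for illustration.
const inventory = buildInventory([
  { description: "Output contains 'PASS'", type: "string-match" },
  { description: "add.test.js was created", type: "file-check" },
  { description: "Coverage delta is positive", type: "custom" },
]);

const needsCustomGrader = inventory.filter((row) => row.grader === null);
console.log(inventory);
console.log(`${needsCustomGrader.length} check(s) need a custom grader`);
```

Rows whose `grader` is `null` are the ones to carry into step 3.
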
  2. Convert test cases to stimuli

    Each test case becomes a stimulus in eval.yaml:

    Before: custom script
    # run_test("Write tests for add()", expect_file="add.test.js")
    After: eval.yaml
    stimuli:
      - name: test-generation
        prompt: "Write tests for the add() function"
        graders:
          - type: file-exists
            config:
              path: "add.test.js"
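
With many test cases, a small script can do the conversion. The following is a minimal sketch, not a Vally tool. `LegacyTestCase`, `toStimulusYaml`, and `toStimuliYaml` are hypothetical names, and the YAML nesting is inferred from the keys in the snippet above. It only handles the file-exists case shown in this step.

```typescript
// Hypothetical shape of a legacy test case, modeled on the run_test() call above.
interface LegacyTestCase {
  name: string; // assumed to be a simple slug such as "test-generation"
  prompt: string;
  expectFile: string;
}

// Render one stimulus using the same keys as the eval.yaml snippet above.
// JSON.stringify emits double-quoted strings, which are also valid YAML scalars.
function toStimulusYaml(test: LegacyTestCase): string {
  if (!/^[A-Za-z0-9_-]+$/.test(test.name)) {
    throw new Error(`Stimulus name is not a simple slug: ${test.name}`);
  }
  return [
    `  - name: ${test.name}`,
    `    prompt: ${JSON.stringify(test.prompt)}`,
    `    graders:`,
    `      - type: file-exists`,
    `        config:`,
    `          path: ${JSON.stringify(test.expectFile)}`,
  ].join("\n");
}

function toStimuliYaml(tests: LegacyTestCase[]): string {
  return ["stimuli:", ...tests.map(toStimulusYaml)].join("\n");
}

// Example input, taken from the "before" script in this step.
const legacyTests: LegacyTestCase[] = [
  {
    name: "test-generation",
    prompt: "Write tests for the add() function",
    expectFile: "add.test.js",
  },
];

const stimuliYaml = toStimuliYaml(legacyTests);
console.log(stimuliYaml);
```

Running `vally lint` on the generated file is a reasonable way to confirm that the output matches what Vally expects.
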
  3. Convert custom assertions to graders

    For checks that don’t map to built-in graders, implement the Grader interface:

    import type { Grader, GraderInput, GraderResult, GraderMetadata } from "@microsoft/vally";

    export class MyCustomCheck implements Grader {
      // Taxonomy metadata declared by every grader
      metadata: GraderMetadata = {
        name: "my-custom-check",
        description: "Checks something specific to my domain",
        determinism: "complex-static",
        costProfile: "low",
        portability: "t2-domain",
        reference: "reference-free",
        temporalScope: "trajectory-level",
      };

      async grade(input: GraderInput): Promise<GraderResult> {
        // Port your existing assertion logic here
        const passed: boolean = false; // TODO: replace with your check against `input`
        return {
          name: this.metadata.name,
          kind: "code",
          passed,
          score: passed ? 1 : 0,
          evidence: passed ? "Check passed" : "Check failed because...",
        };
      }
    }
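
One way to port an assertion is to keep the check as a pure function and wrap it in a thin grader, so the unit tests you already have for the assertion keep working. The sketch below illustrates that pattern under stated assumptions. The interfaces are local stand-ins so the example runs without the package installed. The `output` field on `GraderInput` is hypothetical, because this guide does not show the real fields. The `metadata` block from the example above is omitted for brevity, and `findMissingSections` and `RequiredSectionsCheck` are invented names.

```typescript
// Local stand-ins for the types that a real grader would import from "@microsoft/vally".
interface GraderInput {
  output: string; // ASSUMPTION: hypothetical field; check the real GraderInput type
}

interface GraderResult {
  name: string;
  kind: string;
  passed: boolean;
  score: number;
  evidence: string;
}

interface Grader {
  grade(input: GraderInput): Promise<GraderResult>;
}

// The legacy assertion, kept as a pure function so it stays easy to unit test.
function findMissingSections(text: string, required: string[]): string[] {
  return required.filter((section) => !text.includes(section));
}

// Thin wrapper that reports the assertion outcome in the GraderResult shape.
class RequiredSectionsCheck implements Grader {
  readonly name = "required-sections";
  private readonly required: string[];

  constructor(required: string[]) {
    this.required = required;
  }

  async grade(input: GraderInput): Promise<GraderResult> {
    const missing = findMissingSections(input.output, this.required);
    const passed = missing.length === 0;
    return {
      name: this.name,
      kind: "code",
      passed,
      score: passed ? 1 : 0,
      evidence: passed
        ? "All required sections present"
        : `Missing sections: ${missing.join(", ")}`,
    };
  }
}

// Example usage with made-up input.
const sectionsCheck = new RequiredSectionsCheck(["## Summary", "## Tests"]);
sectionsCheck
  .grade({ output: "## Summary\nLooks fine." })
  .then((result) => console.log(result.passed, result.evidence));
```

Putting the specific failure reason in `evidence` replaces the generic "Check failed because..." placeholder with something a reviewer can act on.
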
  4. Set up your eval.yaml

    Combine your stimuli and configure scoring:

    eval.yaml
    name: migrated-eval-suite
    type: capability
    config:
      runs: 3
      timeout: 120s
      model: gpt-5.5
    stimuli:
      # All your converted test cases
    scoring:
      threshold: 0.7
  5. Update CI

    Replace your existing eval CI step with Vally:

    .github/workflows/eval.yml
    - run: npm install -g @microsoft/vally-cli
    - run: vally lint .
    - run: vally eval --eval-spec eval.yaml --output-dir ./results

After migration, you get several things that are hard to build with custom scripts:

| Feature | Custom scripts | Vally |
| --- | --- | --- |
| Shared grader taxonomy | ❌ Ad-hoc | ✅ Declared per grader |
| Multi-trial metrics (pass@k) | ❌ Build yourself | ✅ Built-in |
| Trajectory capture + replay | ❌ Build yourself | ✅ Built-in |
| Re-grade without re-running agent | ❌ Not possible | ✅ vally grade |
| Shared graders across teams | ❌ Copy-paste | ✅ Plugin registry |
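
This guide lists pass@k as built in but does not give the formula Vally uses. If you want to cross-check numbers against your old scripts, one commonly used definition is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of runs and c the number that passed. The sketch below implements that estimator. `passAtK` is a hypothetical helper, not part of Vally, and Vally's own calculation may differ.

```typescript
// Unbiased pass@k estimator: the probability that at least one of k samples,
// drawn without replacement from n runs with c passes, is a pass.
function passAtK(n: number, c: number, k: number): number {
  if (!Number.isInteger(n) || !Number.isInteger(c) || !Number.isInteger(k)) {
    throw new RangeError("n, c and k must be integers");
  }
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new RangeError("require 1 <= k <= n and 0 <= c <= n");
  }
  // Fewer than k failures exist, so every sample of size k contains a pass.
  if (n - c < k) {
    return 1;
  }
  // C(n-c, k) / C(n, k), computed as a running product to avoid large factorials.
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// Example: 3 runs per stimulus (as in `runs: 3`), 1 of which passed.
const passAt1 = passAtK(3, 1, 1);
const passAt2 = passAtK(3, 1, 2);
console.log(passAt1.toFixed(3), passAt2.toFixed(3));
```

For k = 1 the estimator reduces to the plain pass rate c / n, which makes a quick sanity check when comparing against existing reports.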