Migrating from Other Systems
This guide helps teams migrate from custom eval scripts or other frameworks to Vally. The core idea: your existing concepts map cleanly to Vally’s model.
Concept mapping
| Your concept | Vally equivalent | Notes |
|---|---|---|
| Test case / example | Stimulus | A prompt + grader config in eval.yaml |
| Assertion / check | Grader | Implements the Grader interface |
| Test run / execution | Trajectory | Flat event log captured during agent execution |
| Test result | GraderResult | { passed, score, evidence } |
| Test suite / harness | eval.yaml | Collection of stimuli with scoring config |
| Test config / profile | eval.yaml config | Execution settings (runs, timeout, model, executor, judge_model) |
| CI check | vally lint or vally eval | CLI commands with exit codes |
| Gold reference / expected output | reference-based grader input | Passed as trajectoryB in pairwise graders |
| Test report | LintReporter / EvalReporter | LintConsoleReporter and EvalConsoleReporter are built in; extensible |
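To make the "Test result" row concrete, here is a minimal sketch of adapting a legacy pass/fail result to the `{ passed, score, evidence }` shape. The `GraderResult` interface below is a local stand-in built from the fields named in the table, not the definition exported by `@microsoft/vally`, and `LegacyResult` and `toGraderResult` are hypothetical names used only for illustration.

```typescript
// Local stand-in for the result shape named in the table above.
// Assumption: the real type in @microsoft/vally may carry more fields.
interface GraderResult {
  passed: boolean;
  score: number;
  evidence: string;
}

// Hypothetical shape of a result produced by an existing eval script.
interface LegacyResult {
  ok: boolean;
  message?: string;
}

// Map a legacy pass/fail result onto { passed, score, evidence }.
function toGraderResult(legacy: LegacyResult): GraderResult {
  return {
    passed: legacy.ok,
    // A binary check scores 1 on pass and 0 on fail.
    score: legacy.ok ? 1 : 0,
    // Fall back to a generic message when the legacy check gave none.
    evidence: legacy.message ?? (legacy.ok ? "Check passed" : "Check failed"),
  };
}

const migrated = toGraderResult({ ok: false, message: "add.test.js was not created" });
console.log(migrated);
```

Carrying the legacy failure message over as `evidence` keeps the diagnostic text your team already relies on.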
Migration steps
1. Inventory your existing checks

List every assertion your current system makes. For each one, categorize it:

| Check | Type | Vally grader |
|---|---|---|
| “Output contains X” | String match | output-contains (built-in) |
| “Output does NOT contain X” | Negated string match | output-not-contains (built-in) |
| “File was created” | File check | file-exists (built-in) |
| “File was NOT created” | Negated file check | file-not-exists (built-in) |
| “File contains pattern” | Content check | file-contains (built-in) |
| “File does NOT contain text” | Negated content check | file-not-contains (built-in) |
| “Output matches regex” | Pattern match | output-matches (built-in) |
| “Output does NOT match regex” | Negated pattern match | output-not-matches (built-in) |
| “Command exits 0” | Script check | run-command (built-in) |
| “Agent produced output” | Session check | exit-success (built-in) |
| “Session completed cleanly” | Session health | completed (built-in) |
| “LLM says output is good” | LLM judge | prompt (built-in) |
| “Compare A vs B” | A/B comparison | pairwise (built-in) |
2. Convert test cases to stimuli

Each test case becomes a stimulus in eval.yaml:

Before: custom script

```python
# run_test("Write tests for add()", expect_file="add.test.js")
```

After: eval.yaml

```yaml
stimuli:
  - name: test-generation
    prompt: "Write tests for the add() function"
    graders:
      - type: file-exists
        config:
          path: "add.test.js"
```
3. Convert custom assertions to graders

For checks that don’t map to built-in graders, implement the Grader interface:

```typescript
import type { Grader, GraderInput, GraderResult, GraderMetadata } from "@microsoft/vally";

export class MyCustomCheck implements Grader {
  metadata: GraderMetadata = {
    name: "my-custom-check",
    description: "Checks something specific to my domain",
    determinism: "complex-static",
    costProfile: "low",
    portability: "t2-domain",
    reference: "reference-free",
    temporalScope: "trajectory-level",
  };

  async grade(input: GraderInput): Promise<GraderResult> {
    // Port your existing assertion logic here.
    // Placeholder value so the template compiles; replace it with your check.
    const passed: boolean = false;
    return {
      name: this.metadata.name,
      kind: "code",
      passed,
      score: passed ? 1 : 0,
      evidence: passed ? "Check passed" : "Check failed because...",
    };
  }
}
```
4. Set up your eval.yaml

Combine your stimuli and configure scoring:

eval.yaml

```yaml
name: migrated-eval-suite
type: capability
config:
  runs: 3
  timeout: 120s
  model: gpt-5.5
stimuli:
  # All your converted test cases
scoring:
  threshold: 0.7
```
5. Update CI

Replace your existing eval CI step with Vally:

.github/workflows/eval.yml

```yaml
- run: npm install -g @microsoft/vally-cli
- run: vally lint .
- run: vally eval --eval-spec eval.yaml --output-dir ./results
```
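If you have many legacy test cases, the conversion in step 2 can be scripted. The following is a minimal sketch, assuming each legacy case carries a name, a prompt, and an optional expected file, mirroring the `run_test(...)` example. `LegacyTestCase`, `Stimulus`, `GraderEntry`, and `toStimulus` are hypothetical names for illustration and not part of Vally; only the `file-exists` grader and its `path` key are taken from this guide.

```typescript
// Hypothetical legacy test case, modeled on run_test(prompt, expect_file=...).
interface LegacyTestCase {
  name: string;
  prompt: string;
  expectFile?: string;
}

// One entry under `graders:` as shown in the step 2 example.
interface GraderEntry {
  type: string;
  config: Record<string, string>;
}

// One entry under `stimuli:` as shown in the step 2 example.
interface Stimulus {
  name: string;
  prompt: string;
  graders: GraderEntry[];
}

// Convert one legacy test case into a stimulus object.
function toStimulus(tc: LegacyTestCase): Stimulus {
  const graders: GraderEntry[] = [];
  if (tc.expectFile !== undefined) {
    // "File was created" maps to the built-in file-exists grader.
    graders.push({ type: "file-exists", config: { path: tc.expectFile } });
  }
  return { name: tc.name, prompt: tc.prompt, graders };
}

const legacyCases: LegacyTestCase[] = [
  {
    name: "test-generation",
    prompt: "Write tests for the add() function",
    expectFile: "add.test.js",
  },
];

// Print the converted stimuli as JSON for inspection before writing eval.yaml.
console.log(JSON.stringify({ stimuli: legacyCases.map(toStimulus) }, null, 2));
```

Other check types from the step 1 table could be handled the same way, once you have confirmed each grader's config keys in its documentation.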
What you gain
After migration, you get several things that are hard to build with custom scripts:
| Feature | Custom scripts | Vally |
|---|---|---|
| Shared grader taxonomy | ❌ Ad-hoc | ✅ Declared per grader |
| Multi-trial metrics (pass@k) | ❌ Build yourself | ✅ Built-in |
| Trajectory capture + replay | ❌ Build yourself | ✅ Built-in |
| Re-grade without re-running agent | ❌ Not possible | ✅ vally grade |
| Shared graders across teams | ❌ Copy-paste | ✅ Plugin registry |
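For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator commonly used in code-generation evaluation: with n runs of which c passed, pass@k = 1 - C(n - c, k) / C(n, k). This guide does not state how Vally computes the metric, so treat this as an illustration of the concept rather than Vally's implementation; `passAtK` is a hypothetical helper name.

```typescript
// Estimate the probability that at least one of k sampled runs passes,
// given n total runs of which c passed: 1 - C(n - c, k) / C(n, k).
function passAtK(n: number, c: number, k: number): number {
  if (!Number.isInteger(n) || !Number.isInteger(c) || !Number.isInteger(k)) {
    throw new RangeError("n, c, and k must be integers");
  }
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new RangeError("require 1 <= k <= n and 0 <= c <= n");
  }
  // Fewer than k failures: every k-subset contains at least one pass.
  if (n - c < k) return 1;
  // Product form of C(n - c, k) / C(n, k), which avoids large binomials.
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// With runs: 3 and one passing run, compare k = 1 and k = 2.
console.log(passAtK(3, 1, 1), passAtK(3, 1, 2));
```

With three runs and one pass, the estimate is about one third for k = 1 and about two thirds for k = 2, which is why multi-trial metrics need more than a single run per stimulus.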
Next steps
- Writing eval specs: advanced patterns
- Writing custom graders: full guide
- Add to CI: GitHub Actions setup