Why Vally
The problem
You’ve built an agent. It calls tools, reads files, writes code, talks to APIs. Sometimes it does exactly what you want. Sometimes it doesn’t. How do you know which?
Today, most teams answer that question with:
- Ad-hoc scripts — one-off checks that verify a file was created or output looks right. No shared vocabulary, no reuse, break silently when the agent changes.
- Generic eval frameworks — tools like Braintrust or Evalite work well for LLM output evaluation but weren’t designed for agent behavior. They don’t understand tool calls, multi-turn conversations, or the distinction between “the agent produced good output” and “the agent took a good path to get there.”
- Manual review — a human reads the output and decides “looks good.” Doesn’t scale, isn’t consistent, can’t gate a CI pipeline.
The result: every team reinvents assertion logic, and there’s no standard way to mark a check as “cheap and deterministic” or as “requires an LLM call,” so it is hard to pick the right checks for local dev, CI, and nightly runs.
What Vally does differently
1. Everything is a Grader
Every check — from “does the output contain this string?” to “did an LLM judge rate this response 4/5?” — implements one interface:
```typescript
interface Grader {
  metadata: GraderMetadata;
  grade(input: GraderInput): Promise<GraderResult>;
}
```

Static lints, runtime assertions, command execution checks, LLM judges — all composable, all interchangeable. You don’t need different frameworks for different kinds of checks.
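As a rough illustration of what implementing the interface might look like, the sketch below defines a substring grader. This page does not specify the shapes of `GraderMetadata`, `GraderInput`, or `GraderResult`, so the fields used here (`name`, `output`, `pass`, `score`, `evidence`) and the helper names `checkContains` and `outputContains` are assumptions for illustration, not Vally's actual types or API.

```typescript
// Hypothetical minimal shapes; Vally's real types may differ.
interface GraderMetadata {
  name: string;
}

interface GraderInput {
  output: string; // assumed: the agent's final text output
}

interface GraderResult {
  pass: boolean;
  score: number; // assumed range: 0 to 1
  evidence: string;
}

interface Grader {
  metadata: GraderMetadata;
  grade(input: GraderInput): Promise<GraderResult>;
}

// Pure helper, kept separate so the check itself is easy to unit-test.
function checkContains(output: string, needle: string): GraderResult {
  const pass = output.includes(needle);
  return {
    pass,
    score: pass ? 1 : 0,
    evidence: pass
      ? `output contains "${needle}"`
      : `output does not contain "${needle}"`,
  };
}

// Factory that wraps the helper in the Grader interface.
function outputContains(needle: string): Grader {
  return {
    metadata: { name: "output-contains" },
    grade: async (input) => checkContains(input.output, needle),
  };
}

const grader = outputContains("done");
grader.grade({ output: "all done" }).then((result) => {
  console.log(grader.metadata.name, result.pass, result.evidence);
});
```

Because an LLM judge would expose the same `grade` method, a caller could run it next to this static check without special-casing either one.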
2. Taxonomy for loop-appropriate checks
Every grader declares metadata about what it is:
```typescript
metadata: GraderMetadata = {
  name: "output-contains",
  determinism: "static",       // static | complex-static | slm | llm
  costProfile: "low",          // free | low | medium | high
  portability: "t1-universal", // t1-universal | t2-domain | t3a-scenario
  // ...
};
```

You use the taxonomy to pick the right graders for each context — fast/cheap checks for inner-loop and CI runs, heavier LLM judges for nightly runs:
| Context | What runs |
|---|---|
| Local development | Static graders only |
| CI / PR gate | Static + complex-static graders |
| Nightly / outer loop | Everything, including LLM judges and A/B |
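One way to picture this selection is a filter over the `determinism` field. The sketch below is illustrative only: the context names (`local`, `ci`, `nightly`), the `selectGraders` helper, and the sample grader names other than `output-contains` are hypothetical, and the mapping simply mirrors the table above rather than any documented Vally configuration.

```typescript
type Determinism = "static" | "complex-static" | "slm" | "llm";
type LoopContext = "local" | "ci" | "nightly"; // hypothetical context names

interface TaggedGrader {
  name: string;
  determinism: Determinism;
}

// Determinism levels allowed in each context, following the table above.
const allowed: Record<LoopContext, Determinism[]> = {
  local: ["static"],
  ci: ["static", "complex-static"],
  nightly: ["static", "complex-static", "slm", "llm"],
};

// Keep only the graders whose determinism level is allowed in this context.
function selectGraders<T extends TaggedGrader>(graders: T[], context: LoopContext): T[] {
  return graders.filter((g) => allowed[context].includes(g.determinism));
}

const registry: TaggedGrader[] = [
  { name: "output-contains", determinism: "static" },
  { name: "build-succeeds", determinism: "complex-static" }, // hypothetical grader
  { name: "helpfulness-judge", determinism: "llm" }, // hypothetical grader
];

console.log(selectGraders(registry, "ci").map((g) => g.name));
// → [ 'output-contains', 'build-succeeds' ]
```

The same idea would extend to `costProfile` or `portability` if a team wanted, for example, to cap spend on a PR gate.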
3. Trajectory as first-class data
When an agent runs, Vally captures a Trajectory — a flat event log of everything that happened: tool calls and results, token usage, turn boundaries, skill activations, errors.
Graders inspect trajectories, not just final output. This lets you write behavioral assertions that output-matching can’t express:
- “The agent called the `write_file` tool” (not just “a file exists”)
- “It completed in under 5 turns”
- “It activated the right skill”
- “It didn’t call any disallowed tools”
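The assertions above can be sketched as small functions over a flat event list. The event shapes below are assumptions made for this example (this page does not define the Trajectory schema), and the helper names `calledTool`, `turnCount`, `activatedSkill`, and `disallowedCalls` are hypothetical.

```typescript
// Hypothetical event shapes; the real Trajectory schema may differ.
type TrajectoryEvent =
  | { type: "turn_start"; turn: number }
  | { type: "tool_call"; tool: string }
  | { type: "skill_activated"; skill: string }
  | { type: "error"; message: string };

type Trajectory = TrajectoryEvent[];

// "The agent called the write_file tool"
function calledTool(t: Trajectory, tool: string): boolean {
  return t.some((e) => e.type === "tool_call" && e.tool === tool);
}

// "It completed in under 5 turns" (compare the returned count against a limit)
function turnCount(t: Trajectory): number {
  return t.filter((e) => e.type === "turn_start").length;
}

// "It activated the right skill"
function activatedSkill(t: Trajectory, skill: string): boolean {
  return t.some((e) => e.type === "skill_activated" && e.skill === skill);
}

// "It didn't call any disallowed tools" (an empty result means no violations)
function disallowedCalls(t: Trajectory, disallowed: string[]): string[] {
  const violations: string[] = [];
  for (const e of t) {
    if (e.type === "tool_call" && disallowed.includes(e.tool)) {
      violations.push(e.tool);
    }
  }
  return violations;
}

const trajectory: Trajectory = [
  { type: "turn_start", turn: 1 },
  { type: "tool_call", tool: "read_file" },
  { type: "turn_start", turn: 2 },
  { type: "skill_activated", skill: "refactor" },
  { type: "tool_call", tool: "write_file" },
];

console.log(calledTool(trajectory, "write_file")); // → true
console.log(turnCount(trajectory) < 5); // → true
console.log(activatedSkill(trajectory, "refactor")); // → true
console.log(disallowedCalls(trajectory, ["shell"])); // → []
```

None of these checks look at the final output, which is the point: two runs with identical output can still differ in which tools they called and how many turns they took.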
4. Designed for the full development loop
Vally covers the workflow from first prototype to production monitoring:
```
Write code → Lint locally → Run evals → Gate in CI → Track regressions
             (instant)      (thorough)  (automated)  (nightly)
```

Other frameworks cover one piece — typically just “run eval, get score.” Vally has opinions about how each phase should work, and the shared grader interface plus taxonomy metadata lets you reuse the same checks across phases by picking the appropriate ones for each loop.
When to choose Vally
| Situation | Recommendation |
|---|---|
| Evaluating agent behavior (tool calls, multi-turn, trajectories) | Use Vally — built for this |
| Need different checks at different stages (dev / CI / nightly) | Use Vally — taxonomy lets you pick graders per stage |
| Want to share graders across teams | Use Vally — plugin architecture |
| Building Copilot skills and need quality gates | Use Vally — has skill-specific graders built in |
| Running generic LLM benchmarks (MMLU, HumanEval, etc.) | Other frameworks may be simpler |
Migration mapping
If you’re coming from another system, here’s how your concepts map:
| Your concept | Vally equivalent |
|---|---|
| Test case / example | Stimulus — a prompt + grader config |
| Assertion / check | Grader — implements the Grader interface |
| Test run / execution | Trajectory — captured event log |
| Test result | GraderResult — pass/fail + score + evidence |
| Test suite | eval.yaml — collection of stimuli |
| CI check | vally lint or eval command |
Next steps
- How it works — the pipeline model in detail
- Get started — hands-on quickstarts
- Migration guide — move your existing eval infrastructure