Why Vally

The problem

You’ve built an agent. It calls tools, reads files, writes code, talks to APIs. Sometimes it does exactly what you want. Sometimes it doesn’t. How do you know which?

Today, most teams answer that question with:

Ad-hoc scripts: one-off checks that verify a file was created or that output looks right. They share no vocabulary, can't be reused, and break silently when the agent changes.
Generic eval frameworks: tools like Braintrust or Evalite work well for LLM output evaluation but weren’t designed for agent behavior. They don’t understand tool calls, multi-turn conversations, or the distinction between “the agent produced good output” and “the agent took a good path to get there.”
Manual review: a human reads the output and decides “looks good.” It doesn’t scale, isn’t consistent, and can’t gate a CI pipeline.

The result: every team reinvents assertion logic. There is also no standard way to mark a check as “cheap and deterministic” or as “requires an LLM call”, so you can't easily pick the right checks for local development, CI, and nightly runs.

What Vally does differently

1. Everything is a Grader

Every check — from “does the output contain this string?” to “did an LLM judge rate this response 4/5?” — implements one interface:

```typescript
interface Grader {
  metadata: GraderMetadata;
  grade(input: GraderInput): Promise<GraderResult>;
}
```

Static lints, runtime assertions, command execution checks, LLM judges — all composable, all interchangeable. You don’t need different frameworks for different kinds of checks.
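
As a rough illustration of what implementing that interface could look like, here is a minimal sketch of a substring check. The field layouts of `GraderMetadata`, `GraderInput`, and `GraderResult` are assumptions made for this example (only the `Grader` interface itself is taken from this page), and the `OutputContains` class is hypothetical rather than a grader shipped with Vally.

```typescript
// Assumed shapes for illustration only; Vally's real types may differ.
type Determinism = "static" | "complex-static" | "slm" | "llm";
type CostProfile = "free" | "low" | "medium" | "high";

interface GraderMetadata {
  name: string;
  determinism: Determinism;
  costProfile: CostProfile;
}

interface GraderInput {
  output: string; // assumed: the agent's final text output
}

interface GraderResult {
  pass: boolean;
  score: number; // assumed range: 0 to 1
  evidence: string;
}

interface Grader {
  metadata: GraderMetadata;
  grade(input: GraderInput): Promise<GraderResult>;
}

// Hypothetical grader: passes when the output contains a fixed substring.
class OutputContains implements Grader {
  metadata: GraderMetadata = {
    name: "output-contains",
    determinism: "static",
    costProfile: "free",
  };

  readonly needle: string;

  constructor(needle: string) {
    this.needle = needle;
  }

  async grade(input: GraderInput): Promise<GraderResult> {
    const pass = input.output.includes(this.needle);
    return {
      pass,
      score: pass ? 1 : 0,
      evidence: pass
        ? `output contains "${this.needle}"`
        : `output does not contain "${this.needle}"`,
    };
  }
}
```

Because the check is a pure string operation, it still returns a `Promise` only to match the shared interface; an LLM judge would use the same signature and simply take longer to resolve.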

2. Taxonomy for loop-appropriate checks

Every grader declares metadata about what it is:

```typescript
metadata: GraderMetadata = {
  name: "output-contains",
  determinism: "static", // static | complex-static | slm | llm
  costProfile: "free", // free | low | medium | high
  // ...
};
```

You use the taxonomy to pick the right graders for each context — fast/cheap checks for inner-loop and CI runs, heavier LLM judges for nightly runs:

Context	What runs
Local development	Static graders only
CI / PR gate	Static + complex-static graders
Nightly / outer loop	Everything, including LLM judges and A/B
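
The table above can be read as a filter over grader metadata. The sketch below shows one way such a filter might look; `selectGraders`, `RunContext`, the `ALLOWED` map, and the grader names other than `output-contains` are hypothetical names introduced for this example, not part of Vally's API. Placing `slm` under nightly runs is also an assumption, based on the "Everything" row.

```typescript
type Determinism = "static" | "complex-static" | "slm" | "llm";
type RunContext = "local" | "ci" | "nightly";

// Reduced metadata shape: only the fields this example needs.
interface GraderMetadata {
  name: string;
  determinism: Determinism;
}

// Determinism levels allowed in each context, mirroring the table above.
const ALLOWED: Record<RunContext, readonly Determinism[]> = {
  local: ["static"],
  ci: ["static", "complex-static"],
  nightly: ["static", "complex-static", "slm", "llm"],
};

// Keep only the graders whose declared determinism is allowed in this context.
function selectGraders<T extends { metadata: GraderMetadata }>(
  graders: readonly T[],
  context: RunContext,
): T[] {
  const allowed = ALLOWED[context];
  return graders.filter((g) => allowed.includes(g.metadata.determinism));
}

// Hypothetical registry of graders, reduced to their metadata.
const registry: { metadata: GraderMetadata }[] = [
  { metadata: { name: "output-contains", determinism: "static" } },
  { metadata: { name: "command-exit-code", determinism: "complex-static" } },
  { metadata: { name: "rubric-judge", determinism: "llm" } },
];

const localNames = selectGraders(registry, "local").map((g) => g.metadata.name);
const ciNames = selectGraders(registry, "ci").map((g) => g.metadata.name);
console.log(localNames); // only "output-contains"
console.log(ciNames); // "output-contains" and "command-exit-code"
```

A similar filter could use `costProfile` instead of, or alongside, `determinism`.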

3. Trajectory as first-class data

When an agent runs, Vally captures a Trajectory — a flat event log of everything that happened: tool calls and results, token usage, turn boundaries, skill activations, errors.

Graders inspect trajectories, not just final output. This lets you write behavioral assertions that output-matching can’t express:

“The agent called the write_file tool” (not just “a file exists”)
“It completed in under 5 turns”
“It activated the right skill”
“It didn’t call any disallowed tools”
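
To make these assertions concrete, here is a small sketch of how they might be written against a flat event log. The event shapes (`tool_call`, `turn_end`, `skill_activated`), the helper functions, and the tool and skill names other than `write_file` are all assumptions for illustration; this page does not specify Vally's actual Trajectory schema.

```typescript
// Assumed event shapes; the real Trajectory format may differ.
type TrajectoryEvent =
  | { type: "tool_call"; tool: string }
  | { type: "turn_end" }
  | { type: "skill_activated"; skill: string };

type Trajectory = readonly TrajectoryEvent[];

// "The agent called the write_file tool"
function calledTool(t: Trajectory, tool: string): boolean {
  return t.some((e) => e.type === "tool_call" && e.tool === tool);
}

// "It completed in under 5 turns": count the turn boundaries
function turnCount(t: Trajectory): number {
  return t.filter((e) => e.type === "turn_end").length;
}

// "It activated the right skill"
function activatedSkill(t: Trajectory, skill: string): boolean {
  return t.some((e) => e.type === "skill_activated" && e.skill === skill);
}

// "It didn't call any disallowed tools": return the offending calls, if any
function disallowedCalls(t: Trajectory, disallowed: readonly string[]): string[] {
  const hits: string[] = [];
  for (const e of t) {
    if (e.type === "tool_call" && disallowed.includes(e.tool)) {
      hits.push(e.tool);
    }
  }
  return hits;
}

// Hypothetical two-turn run used to exercise the helpers.
const sample: Trajectory = [
  { type: "skill_activated", skill: "file-editing" },
  { type: "tool_call", tool: "read_file" },
  { type: "turn_end" },
  { type: "tool_call", tool: "write_file" },
  { type: "turn_end" },
];

const behavedWell =
  calledTool(sample, "write_file") &&
  turnCount(sample) < 5 &&
  activatedSkill(sample, "file-editing") &&
  disallowedCalls(sample, ["delete_file"]).length === 0;
```

Each helper could be wrapped in a grader so that its boolean result becomes the pass/fail value and the matching events become the evidence.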

4. Designed for the full development loop

Vally covers the workflow from first prototype to production monitoring:

Write code  →  Lint locally  →  Run evals  →  Gate in CI  →  Track regressions
               (instant)       (thorough)    (automated)     (nightly)

Other frameworks cover one piece — typically just “run eval, get score.” Vally has opinions about how each phase should work, and the shared grader interface plus taxonomy metadata lets you reuse the same checks across phases by picking the appropriate ones for each loop.

When to choose Vally

Situation	Recommendation
Evaluating agent behavior (tool calls, multi-turn, trajectories)	Use Vally — built for this
Need different checks at different stages (dev / CI / nightly)	Use Vally — taxonomy lets you pick graders per stage
Want to share graders across teams	Use Vally — plugin architecture
Building Copilot skills and need quality gates	Use Vally — has skill-specific graders built in
Running generic LLM benchmarks (MMLU, HumanEval, etc.)	Other frameworks may be simpler

Migration mapping

If you’re coming from another system, here’s how your concepts map:

Your concept	Vally equivalent
Test case / example	Stimulus — a prompt + grader config
Assertion / check	Grader — implements the `Grader` interface
Test run / execution	Trajectory — captured event log
Test result	GraderResult — pass/fail + score + evidence
Test suite	eval.yaml — collection of stimuli
CI check	`vally lint` or `eval` command

Next steps

How it works — the pipeline model in detail
Get started — hands-on quickstarts
Migration guide — move your existing eval infrastructure