Why Vally

You’ve built an agent. It calls tools, reads files, writes code, talks to APIs. Sometimes it does exactly what you want. Sometimes it doesn’t. How do you know which?

Today, most teams answer that question with:

  • Ad-hoc scripts — one-off checks that verify a file was created or output looks right. No shared vocabulary, no reuse, break silently when the agent changes.
  • Generic eval frameworks — tools like Braintrust or Evalite work well for LLM output evaluation but weren’t designed for agent behavior. They don’t understand tool calls, multi-turn conversations, or the distinction between “the agent produced good output” and “the agent took a good path to get there.”
  • Manual review — a human reads the output and decides “looks good.” Doesn’t scale, isn’t consistent, can’t gate a CI pipeline.

The result: every team reinvents assertion logic. There is also no standard way to mark a check as “cheap and deterministic” versus “requires an LLM call”, which makes it hard to pick the right checks for local dev, CI, or nightly runs.

In Vally, every check, from “does the output contain this string?” to “did an LLM judge rate this response 4/5?”, implements one interface:

interface Grader {
  metadata: GraderMetadata;
  grade(input: GraderInput): Promise<GraderResult>;
}

Static lints, runtime assertions, command execution checks, LLM judges — all composable, all interchangeable. You don’t need different frameworks for different kinds of checks.
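
To make the interface concrete, here is a minimal sketch of a string-matching grader. The field shapes of `GraderMetadata`, `GraderInput`, and `GraderResult` are assumptions made for illustration (this page names the types but does not show their fields), and `outputContains` and `runAll` are hypothetical helpers, so the real Vally definitions may differ.

```typescript
// Assumed, simplified shapes: the real Vally types may have different fields.
interface GraderMetadata {
  name: string;
  determinism: "static" | "complex-static" | "slm" | "llm";
  costProfile: "free" | "low" | "medium" | "high";
}

interface GraderInput {
  output: string; // hypothetical: the agent's final text output
}

interface GraderResult {
  pass: boolean;
  score: number; // hypothetical scale: 0 (fail) to 1 (pass)
  evidence: string;
}

interface Grader {
  metadata: GraderMetadata;
  grade(input: GraderInput): Promise<GraderResult>;
}

// Hypothetical factory: builds a grader that passes when the output contains `needle`.
function outputContains(needle: string): Grader {
  return {
    metadata: { name: "output-contains", determinism: "static", costProfile: "low" },
    async grade(input: GraderInput): Promise<GraderResult> {
      const pass = input.output.includes(needle);
      return {
        pass,
        score: pass ? 1 : 0,
        evidence: pass ? `found "${needle}"` : `"${needle}" not found in output`,
      };
    },
  };
}

// Because every grader shares one interface, a runner can treat them uniformly.
async function runAll(graders: Grader[], input: GraderInput): Promise<GraderResult[]> {
  return Promise.all(graders.map((g) => g.grade(input)));
}
```

An LLM judge would plug into `runAll` the same way: only its `grade` body and its metadata would change.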

Every grader declares metadata about what it is:

metadata: GraderMetadata = {
  name: "output-contains",
  determinism: "static",        // static | complex-static | slm | llm
  costProfile: "low",           // free | low | medium | high
  portability: "t1-universal",  // t1-universal | t2-domain | t3a-scenario
  // ...
};

You use the taxonomy to pick the right graders for each context — fast/cheap checks for inner-loop and CI runs, heavier LLM judges for nightly runs:

| Context | What runs |
| --- | --- |
| Local development | Static graders only |
| CI / PR gate | Static + complex-static graders |
| Nightly / outer loop | Everything, including LLM judges and A/B |
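
The selection rule in the table above can be sketched as a filter over grader metadata. This is an illustration only: the context names (`local`, `ci`, `nightly`), the `ALLOWED` map, the `selectGraders` helper, and the `llm-judge` grader name are all hypothetical, and Vally's own selection mechanism is not shown on this page.

```typescript
type Determinism = "static" | "complex-static" | "slm" | "llm";
type RunContext = "local" | "ci" | "nightly"; // hypothetical context names

// Which determinism classes each context allows, following the table above.
const ALLOWED: Record<RunContext, Set<Determinism>> = {
  local: new Set<Determinism>(["static"]),
  ci: new Set<Determinism>(["static", "complex-static"]),
  nightly: new Set<Determinism>(["static", "complex-static", "slm", "llm"]),
};

// Minimal view of a grader: only the metadata needed for selection.
interface TaggedGrader {
  metadata: { name: string; determinism: Determinism };
}

// Keep only the graders whose determinism class is allowed in this context.
function selectGraders<G extends TaggedGrader>(context: RunContext, graders: G[]): G[] {
  const allowed = ALLOWED[context];
  return graders.filter((g) => allowed.has(g.metadata.determinism));
}

// Usage: an LLM judge is skipped in CI but included in the nightly run.
const all: TaggedGrader[] = [
  { metadata: { name: "output-contains", determinism: "static" } },
  { metadata: { name: "llm-judge", determinism: "llm" } },
];
const ciNames = selectGraders("ci", all).map((g) => g.metadata.name);
// ciNames is ["output-contains"]
```

Keying the choice on declared metadata rather than on grader names means a newly added grader lands in the right loop without touching the runner.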

When an agent runs, Vally captures a Trajectory — a flat event log of everything that happened: tool calls and results, token usage, turn boundaries, skill activations, errors.

Graders inspect trajectories, not just final output. This lets you write behavioral assertions that output-matching can’t express:

  • “The agent called the write_file tool” (not just “a file exists”)
  • “It completed in under 5 turns”
  • “It activated the right skill”
  • “It didn’t call any disallowed tools”
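
The four assertions above can be sketched as a single pass over a trajectory. The event schema here is an assumption (this page describes a flat event log but not its fields), and `checkBehavior`, `BehaviorExpectation`, and the `file-editing` and `delete_file` names are hypothetical.

```typescript
// Assumed event shapes: the real Trajectory schema may differ.
type TrajectoryEvent =
  | { type: "tool_call"; tool: string }
  | { type: "skill_activated"; skill: string }
  | { type: "turn_end" };

interface BehaviorExpectation {
  requiredTool: string;
  turnLimit: number; // the run must finish in fewer turns than this
  expectedSkill: string;
  disallowedTools: string[];
}

// Returns one message per violated expectation; an empty array means all passed.
function checkBehavior(events: TrajectoryEvent[], expect: BehaviorExpectation): string[] {
  const toolsCalled = new Set<string>();
  const skills = new Set<string>();
  let turns = 0;
  for (const e of events) {
    if (e.type === "tool_call") toolsCalled.add(e.tool);
    else if (e.type === "skill_activated") skills.add(e.skill);
    else turns += 1; // turn_end marks a completed turn
  }

  const failures: string[] = [];
  if (!toolsCalled.has(expect.requiredTool)) {
    failures.push(`never called ${expect.requiredTool}`);
  }
  if (turns >= expect.turnLimit) {
    failures.push(`took ${turns} turns, expected fewer than ${expect.turnLimit}`);
  }
  if (!skills.has(expect.expectedSkill)) {
    failures.push(`skill ${expect.expectedSkill} was not activated`);
  }
  for (const tool of expect.disallowedTools) {
    if (toolsCalled.has(tool)) failures.push(`called disallowed tool ${tool}`);
  }
  return failures;
}

// Usage: a one-turn run that activates a skill and writes a file.
const trajectory: TrajectoryEvent[] = [
  { type: "skill_activated", skill: "file-editing" },
  { type: "tool_call", tool: "write_file" },
  { type: "turn_end" },
];
const failures = checkBehavior(trajectory, {
  requiredTool: "write_file",
  turnLimit: 5,
  expectedSkill: "file-editing",
  disallowedTools: ["delete_file"],
});
// failures is empty: every expectation holds for this trajectory.
```

A check on the final output alone could not tell this run apart from one that produced the same file through a disallowed tool.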

Vally covers the workflow from first prototype to production monitoring:

Write code → Lint locally → Run evals → Gate in CI → Track regressions
             (instant)      (thorough)  (automated)  (nightly)

Other frameworks cover one piece — typically just “run eval, get score.” Vally has opinions about how each phase should work, and the shared grader interface plus taxonomy metadata lets you reuse the same checks across phases by picking the appropriate ones for each loop.

| Situation | Recommendation |
| --- | --- |
| Evaluating agent behavior (tool calls, multi-turn, trajectories) | Use Vally: built for this |
| Need different checks at different stages (dev / CI / nightly) | Use Vally: taxonomy lets you pick graders per stage |
| Want to share graders across teams | Use Vally: plugin architecture |
| Building Copilot skills and need quality gates | Use Vally: has skill-specific graders built in |
| Running generic LLM benchmarks (MMLU, HumanEval, etc.) | Other frameworks may be simpler |

If you’re coming from another system, here’s how your concepts map:

| Your concept | Vally equivalent |
| --- | --- |
| Test case / example | Stimulus: a prompt + grader config |
| Assertion / check | Grader: implements the Grader interface |
| Test run / execution | Trajectory: captured event log |
| Test result | GraderResult: pass/fail + score + evidence |
| Test suite | eval.yaml: collection of stimuli |
| CI check | vally lint or eval command |