Vally

Stop guessing.
Start grading.

Vally is the modular platform for agent evaluation.
Define stimuli, capture trajectories, grade behavior — and catch regressions before they reach production.

Get started

How it works

See it in action

One CLI. Every platform

Terminal — vally eval

Dashboard & API

Explore results. Spot regressions.

Spin up the built-in dashboard to explore scores, catch regressions, and track trends over time. Or go further — the Results API puts every eval metric within reach of your pipelines, your workflows, and your agents.

Vally Serve

Results API

Vally dashboard showing pass rate stats and duration chart

Vally dashboard score matrix and all outcomes table

Why Vally?

Fast by default

Static graders finish in under 2 seconds. LLM judges are opt-in. You get instant feedback during development and full confidence in CI.

Everything is a Grader

File checks, string matching, LLM judges, custom scripts — one interface. Write a grader once, compose it into any eval.

One pipeline, every loop

Same model from local dev to CI to production. Define stimuli, capture trajectories, grade with any combination.

Extend, don’t fork

Custom graders, executors, and reporters plug in cleanly. Bring your own scoring logic without changing the framework.

Ready to grade your agent?

Install Vally

Run your first eval