Graders: metric thresholds

Five graders check trajectory metrics against a configurable max budget. All of them read values that the pipeline already computes, so they add no analysis cost.

Checks that total tokens (input + output) do not exceed max.

graders:
  - type: token-budget
    config:
      max: 50000

Evidence: "1500 tokens (within budget of 50000)" or "75000 tokens exceeds max of 50000".


Checks that the number of tool calls does not exceed max.

graders:
  - type: tool-call-count
    config:
      max: 20

Checks that the number of agent turns does not exceed max.

graders:
  - type: turn-count
    config:
      max: 10

Checks that the number of error events does not exceed max. Use max: 0 to require zero errors.

graders:
  - type: error-count
    config:
      max: 0

Checks that wall-clock execution time does not exceed max. Accepts duration strings ("30s", "2m", "1h").

graders:
  - type: wall-time
    config:
      max: "2m"
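As a rough illustration of how a duration string such as "30s", "2m", or "1h" could be turned into seconds before comparison, here is a minimal Python sketch. The function name `parse_duration` and the unit table are hypothetical, and the sketch assumes only the three suffixes shown above; the actual grader may accept other forms.

```python
import re

# Assumption: only the suffixes that appear in the text ("s", "m", "h").
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_DURATION_RE = re.compile(r"(\d+)([smh])")


def parse_duration(text: str) -> int:
    """Convert a duration string such as "30s", "2m", or "1h" to seconds."""
    match = _DURATION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"invalid duration string: {text!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```

Under these assumptions, `parse_duration("2m")` yields 120 seconds, which is the value the measured wall-clock time would be compared against.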

All five require a max field:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| max | integer or duration | yes | Upper bound. wall-time requires a duration string (e.g. "30s", "2m"); the rest require non-negative integers. |

max must be a finite non-negative value. Omitting it is a validation error.
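The validation rule for the four integer-based graders might look roughly like the following Python sketch. The function name `validate_max` and the error messages are hypothetical, and accepting integer-valued floats such as 20.0 is an assumption of this sketch; wall-time, which takes a duration string, is not covered here.

```python
import math


def validate_max(config: dict) -> int:
    """Return config["max"] if it is a finite non-negative integer, else raise."""
    if "max" not in config:
        raise ValueError("max is required")
    value = config["max"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("max must be a number")
    # Rejects NaN and positive or negative infinity.
    if isinstance(value, float) and not math.isfinite(value):
        raise ValueError("max must be finite")
    if value < 0:
        raise ValueError("max must be non-negative")
    # Assumption: fractional budgets such as 2.5 are not meaningful for counts.
    if value != int(value):
        raise ValueError("max must be an integer")
    return int(value)
```

Note that `max: 0` passes validation, which matches the error-count example that uses it to require zero errors.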

When within budget (value ≤ max): score = 1.

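When over budget, the score degrades linearly: score = max(0, 1 − (value − max) / max(max, 1)). In this formula the two-argument max(…) is the maximum function, while a bare max is the configured budget. The outer max(0, …) floors the score at 0, and the denominator max(max, 1) avoids division by zero when the budget is 0. This means: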

  • At exactly the threshold → score 1 (pass)
  • At 1.5× the threshold → score 0.5 (for thresholds ≥ 1)
  • At 2× the threshold or above → score 0 (for thresholds ≥ 1)
  • For max: 0, any value above 0 immediately scores 0
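The scoring rule above can be sketched in a few lines of Python. The function name `threshold_score` and the parameter name `limit` (standing in for the configured max, to avoid clashing with Python's built-in `max`) are illustrative choices, not names taken from the graders themselves.

```python
def threshold_score(value: float, limit: float) -> float:
    """Score a metric against its budget: 1 within budget, linear decay above it."""
    if value <= limit:
        return 1.0
    overage = value - limit
    # max(limit, 1) keeps the denominator non-zero when the budget is 0.
    return max(0.0, 1.0 - overage / max(limit, 1))
```

With a budget of 20, a value of 30 gives 0.5 and a value of 40 or more gives 0. With a budget of 0, any positive value gives 0.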

All five share identical taxonomy metadata:

| Property | Value |
| --- | --- |
| Determinism | static |
| Cost | free |
| Portability | t1-universal |
| Reference | reference-free |
| Temporal scope | trajectory-level |
| Score kind | code |

Every result includes structured metadata:

{
  "value": 1500,
  "max": 50000
}