Graders: metric thresholds

Five graders that check trajectory metrics against a configurable max budget. All read values that the pipeline already computes — no additional analysis cost.

`token-budget`

Checks that total tokens (input + output) do not exceed max.

graders:
  - type: token-budget
    config:
      max: 50000

Evidence: 1500 tokens (within budget of 50000) or 75000 tokens exceeds max of 50000.

`tool-call-count`

Checks that the number of tool calls does not exceed max.

graders:
  - type: tool-call-count
    config:
      max: 20

`turn-count`

Checks that the number of agent turns does not exceed max.

graders:
  - type: turn-count
    config:
      max: 10

`error-count`

Checks that the number of error events does not exceed max. Use max: 0 to require zero errors.

graders:
  - type: error-count
    config:
      max: 0

`wall-time`

Checks that wall-clock execution time does not exceed max. Accepts duration strings ("30s", "2m", "1h").

graders:
  - type: wall-time
    config:
      max: "2m"

Shared behavior

Config

All five require a max field:

Field	Type	Required	Description
`max`	`integer` or `duration`	yes	Upper bound. `wall-time` requires a duration string (e.g. `"30s"`, `"2m"`); the rest require non-negative integers.

max must be a finite non-negative value. Omitting it is a validation error.

Scoring

When within budget (value ≤ max): score = 1.

When over budget: score degrades linearly — score = max(0, 1 − (value − max) / max(max, 1)), floored at 0. This means:

At exactly the threshold → score 1 (pass)
At 1.5× the threshold → score 0.5 (for thresholds ≥ 1)
At 2× the threshold or above → score 0 (for thresholds ≥ 1)
For max: 0, any value above 0 immediately scores 0

Taxonomy

All five share identical taxonomy metadata:

Property	Value
Determinism	`static`
Cost	`free`
Reference	`reference-free`
Temporal scope	`trajectory-level`
Score kind	`code`

Metadata

Every result includes structured metadata:

{
  "value": 1500,
  "max": 50000
}