Grader: tool-calls

Taxonomy

Property	Value
Determinism	`static`
Cost	`low`
Reference	`reference-free`
Temporal scope	`trajectory-level`
Score kind	`code`

Config

graders:
  - type: tool-calls
    config:
      required:
        - create
        - name: bash
          command: "npm test"
      disallowed:
        - name: view
          path: "secret\\.env$"
      sequence:
        - name: validate
        - name: create

Field	Type	Required	Default	Description
`required`	(ToolMatch \| string)[]	No	—	Tools that must be called. Fails if any are missing
`disallowed`	(ToolMatch \| string)[]	No	—	Tools that must NOT be called. Fails if any are found
`sequence`	(ToolMatch \| string)[]	No	—	Tools that must be called in this relative order (see below)
`parallel`	(ToolMatch \| string)[]	No	—	Tools that must be dispatched together in one model response (see below)

At least one of required, disallowed, sequence, or parallel must be provided.

ToolMatch

Each entry can be a plain string (shorthand for { name: ... } — also a regex matched against the tool name) or an object with additional argument filters:

Field	Type	Required	Description
`name`	string	Yes	Regex matched against the tool name. Unanchored — use `^name$` for an exact match.
`command`	string	No	Regex matched against the tool’s `command` argument
`path`	string	No	Regex matched against the tool’s `path` argument
`args`	`Record<string, string>`	No	Regex patterns matched against arbitrary call arguments, keyed by argument name (a generalization of `command`/`path`).
`pattern`	string	No	Escape-hatch regex matched against the full `JSON.stringify(arguments)` of the call — tool-agnostic (see below). Prefer `args` for structured checks.
`result`	string	No	Regex matched against the tool’s result/observation (stringified). Not supported on `sequence` or `parallel` entries.
`min_count`	number	No	Minimum number of matching calls required (default `1`). `required` entries only.
`final`	boolean	No	When `true`, the matching call must be the final tool call in the trajectory. `required` entries only.
`at_step`	number	No	Positional constraint: the call must occur in exactly this agent step (0-based). `required` entries only.
`before_step`	number	No	Positional constraint: the call must occur before this agent step (i.e. step `0 .. before_step - 1`); must be `>= 1`. `required` entries only.

Matching arguments

command and path are convenient shortcuts for the most common arguments. args generalizes this to any argument: args: { query: "vally" } matches calls whose query argument matches the regex vally. Every listed argument must be present and match.

The two forms differ when a referenced argument is absent from a call: command/path raise a configuration error (they are expected to exist on the tools they describe), while args simply treats the call as a non-match, so the same matcher can be applied across heterogeneous tools.

Only string-valued arguments are matchable. Numeric, boolean, and object/array argument values are treated as absent, so a matcher against them never matches (and an args matcher against them is a non-match rather than an error).

The `pattern` escape hatch

command/path/args all match a named argument. When the assertion can’t be pinned to a single argument key — most commonly a content check across a heterogeneous edit set whose tools name their content argument differently (edit, create, apply_patch, bash, Edit, Write, …) — use pattern. It is a regex matched against the full JSON.stringify(arguments) of each candidate call, so it is tool-agnostic and can also reach non-string / nested argument values that args cannot:

# A LICENSE header must be written by *some* edit-like tool, whichever one the agent picks
- type: tool-calls
  config:
    required:
      - name: "^(edit|create|apply_patch|bash|Edit|Write)$"
        pattern: "Copyright \\(c\\) Microsoft"

pattern is an invocation-time matcher (like name/command/path/args), so it is honored on all four lists (required, disallowed, sequence, parallel) and composes with min_count, final, at_step, and before_step.

Keep these caveats in mind:

It matches the JSON-escaped, whitespace-free serialization (e.g. {"path":"a.ts","content":"..."}), not the raw argument values. Escape regex metacharacters accordingly, and don’t rely on the order of keys — it follows the source adapter’s property order and is not a stable contract. Prefer args whenever a structured, per-argument check is possible.
Serialization of an argument-less call depends on how the adapter represents it: a call whose arguments field is absent/null serializes to the empty string "" (so ^$ matches it), while a call with an empty object {} serializes to "{}". Match accordingly (e.g. ^(?:|\{\})$ to accept either).
If a call’s arguments can’t be JSON-serialized (e.g. a bigint value or a circular structure), the grader falls back to a best-effort String(arguments) (typically "[object Object]") instead of throwing, so an odd call can’t crash the whole grade. In that rare case a pattern written against the JSON shape simply won’t match — prefer args for such tools.
Combining pattern with command/path across a broad name can still raise a configuration error, because command/path throw when the referenced argument is absent on a matched tool. Use pattern on its own, or alongside args, when spanning heterogeneous tools.

Counting and result matching

min_count requires at least N matching (completed) calls instead of just one — useful for “the agent retried at least twice” style checks.
result matches against the tool’s output/observation. Non-string results are JSON-stringified before matching. It is evaluated on required and disallowed entries; it is not supported on sequence or parallel entries, which match at invocation time before the result is known.
final asserts that the matching call is the last tool call in the trajectory — useful for “the agent finished by reporting its result” checks.

Steps

A step is an agent turn, counted by turn_start events in the trajectory (the first turn is step 0). The granularity depends on the executor: the ATIF adapter emits one turn per model response, while the Copilot adapter emits one turn per agent turn. Trajectories without any turn boundaries are treated as a single step 0.

at_step and before_step are only honored on required entries; using them on disallowed or sequence entries raises a configuration error. When both are set on the same entry, at_step must be strictly less than before_step (otherwise no step could satisfy both, which raises a configuration error).

Parallel dispatch

parallel asserts that a set of tools were dispatched together in the same model-response batch — i.e. the agent parallelized independent operations rather than issuing them one turn at a time. It is a top-level list (like sequence) of ToolMatch/string entries:

- type: tool-calls
  config:
    parallel:
      - name: "^read_file$"
      - name: "^web_search$"

The check passes when a single agent step (model-response batch) contains a distinct matching call for every entry, and fails otherwise. “Same batch” is determined by the step: all tool calls emitted within one turn_start boundary belong to the same batch (see Steps). Repeated entries require distinct calls — parallel: [edit, edit] needs two edit calls in the same batch — and overlapping matchers are assigned to distinct calls, so a generic matcher never double-counts a call already consumed by a more specific one.

Only invocation-time fields (name, command, path, args, pattern) are honored on parallel entries. min_count, final, result, at_step, and before_step are rejected as configuration errors: parallel already concerns co-occurrence within a single (unspecified) batch, and result is not known at invocation time.

parallel asserts a single co-occurrence group. To assert that two independent groups of tools each ran in parallel, use two separate grader entries — one per group.

Behavior

Scans trajectory events for tool_call and tool_result pairs. For each completed tool invocation, checks the tool name and arguments against the required and disallowed matchers.

name, command, path, args values, pattern, and result are matched as unanchored regex patterns. To require an exact match, anchor the pattern: ^value$.
A required matcher is satisfied when at least one tool call matches it (or min_count calls when set). When a required entry also has at_step / before_step, the matching call must additionally occur in the specified step; with final: true, the matching call must be the last tool call in the trajectory.
A required entry with at_step: N is satisfied only if a matching call occurs in step N; with before_step: N, only if a matching call occurs in one of steps 0 .. N-1. The step is the turn in which the tool was invoked (the tool_call), not where its result arrived.
A disallowed matcher is violated when any tool call matches it (including its args / result filters).
sequence matchers must be satisfied as an ordered subsequence: for [A, B, C], the trajectory must contain a tool call matching A, then a later one matching B, then a later one matching C. Other tool calls may be interleaved between them, and repeated entries (e.g. [poll, poll]) require that many distinct calls in order.
parallel matchers must all be satisfied by distinct calls within a single model-response batch (same step). Unlike sequence — which allows the calls to be spread across steps with interleaving — parallel requires co-occurrence in one batch.

Passes when all required matchers are satisfied, no disallowed matchers are violated, the sequence (if any) is satisfied in order, and the parallel group (if any) co-occurs in one batch. Fails otherwise.

Use cases

# Agent must use the create tool at some point
- type: tool-calls
  config:
    required:
      - create

# Agent must run npm test via bash
- type: tool-calls
  config:
    required:
      - name: bash
        command: "npm test"

# Agent must use either bash or powershell (regex name with alternation)
- type: tool-calls
  config:
    required:
      - name: "^(bash|powershell)$"

# Agent must not view sensitive files
- type: tool-calls
  config:
    disallowed:
      - name: view
        path: "\\.env$"

# Combine required and disallowed
- type: tool-calls
  config:
    required:
      - name: edit
        path: "src/.*\\.ts$"
    disallowed:
      - name: bash
        command: "rm -rf"

# Agent must validate before it mutates (ordered subsequence)
- type: tool-calls
  config:
    sequence:
      - name: "^validate_.*"
      - name: "^(create|update)_.*"

# Agent must load the skill early, in the first three steps
- type: tool-calls
  config:
    required:
      - name: "^load_skill$"
        before_step: 3

# Agent must validate in the very first step
- type: tool-calls
  config:
    required:
      - name: "^validate$"
        at_step: 0

# Agent must try the upload at least twice
- type: tool-calls
  config:
    required:
      - name: "^upload$"
        min_count: 2

# Agent must search for a specific query (generic argument matching)
- type: tool-calls
  config:
    required:
      - name: "^web_search$"
        args:
          query: "release notes"

# Some edit-like tool must write a license header, whichever tool the agent chose
# (tool-agnostic content check via the `pattern` escape hatch)
- type: tool-calls
  config:
    required:
      - name: "^(edit|create|apply_patch|bash|Edit|Write)$"
        pattern: "Copyright \\(c\\) Microsoft"

# Agent's build must succeed (match on tool output)
- type: tool-calls
  config:
    required:
      - name: "^bash$"
        command: "npm run build"
        result: "BUILD SUCCEEDED"

# Agent must finish by reporting its result (final tool call)
- type: tool-calls
  config:
    required:
      - name: "^report_result$"
        final: true

# Agent must fan out its reads in parallel (same model response)
- type: tool-calls
  config:
    parallel:
      - name: "^read_file$"
      - name: "^web_search$"

# Parent must delegate the edit, not perform it itself
- type: tool-calls
  scope: parent
  config:
    disallowed:
      - name: "^edit$"

# The delegated subagent must actually perform the edit
- type: tool-calls
  scope: subagent
  config:
    required:
      - name: "^edit$"

# Assert behavior of a subagent matched by its agent identity
- type: tool-calls
  scope:
    agent: explore-1
  config:
    required:
      - name: "^grep$"

Scoping to an agent

tool-calls matches across the whole trajectory by default. Use the grader-level scope field to constrain matching to the parent (root) agent or a subagent — for example to assert the parent delegated work (scope: parent with a disallowed entry) while the subagent did it (scope: subagent with a required entry). scope composes with turn.

Subagent scoping relies on per-call agent identity from the trajectory adapter, whose meaning differs by source. On Copilot, scope: parent and scope: subagent work reliably, but each subagent’s agentId is the opaque launching tool-call id (e.g. toolu_…), not a friendly name — so { agent } isn’t practically authorable there; prefer parent / subagent. ATIF supports scoping when the document inlines its subagent trajectories (subagent_trajectories), stamping events with the subagent’s agent name, so { agent: "<name>" } is authorable; subagents referenced only as external files carry no events in the document, so scope: subagent / { agent } matches nothing for those.