Grader: run-command

Taxonomy

Property	Value
Determinism	`complex-static`
Cost	`low`
Reference	`reference-free`
Temporal scope	`trajectory-level`
Score kind	`code`

Config

graders:
  - type: run-command
    config:
      command: "node"
      args: ["validate.js", "--strict"]
      expected_exit_code: 0
      stdout_contains: "OK"
      stdout_matches: "(?i)success"
      sub_path: "src/app"
      timeout: 30s

Field	Type	Required	Default	Description
`command`	string	Yes	—	Binary to run (when `args` provided) or shell command string (when `args` omitted)
`args`	string[]	No	—	Arguments passed directly to the binary. When provided, the command runs without a shell for reliable exit codes and cross-platform behavior
`expected_exit_code`	number	No	`0`	Expected exit code for a passing result
`stdout_contains`	string | string[]	No	—	Substring(s) that must appear in stdout. Array form requires all entries to appear.
`stderr_contains`	string | string[]	No	—	Substring(s) that must appear in stderr. Array form requires all entries to appear.
`stdout_matches`	string | string[]	No	—	Regex pattern(s) that must match stdout. Array form requires all patterns to match. Supports inline flags: `(?i)` for case-insensitive, `(?m)` for multiline, `(?s)` for dotAll.
`stdout_not_matches`	string | string[]	No	—	Regex pattern(s) that must not match stdout. Array form requires none of the patterns to match. Same inline flag support as `stdout_matches`.
`sub_path`	string	No	—	Subdirectory under the workspace root to use as the working directory. Must stay within the workspace root.
`timeout`	duration	No	`30s`	Timeout (e.g. `30s`, `2m`). Must be positive (`0` is rejected).
`env`	`Record<string, string>`	No	—	Additional environment variables passed to the subprocess. Merged with the process environment.
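
The `sub_path` constraint ("must stay within the workspace root") can be illustrated with a short Python sketch. This is not the grader's implementation: `resolve_sub_path` and its error message are hypothetical, and resolving `..` segments and symlinks before comparing paths is an assumption about how such a check might be enforced.

```python
from pathlib import Path


def resolve_sub_path(workspace_root: str, sub_path: str) -> Path:
    """Return the working directory for sub_path, or raise if it escapes the root.

    Hypothetical helper; the real grader's check may differ in detail.
    """
    root = Path(workspace_root).resolve()
    # Joining with an absolute sub_path discards the root, so the containment
    # test below also catches values such as "/etc".
    candidate = (root / sub_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"sub_path {sub_path!r} escapes the workspace root")
    return candidate


for value in ("src/app", "../outside"):
    try:
        print(value, "->", resolve_sub_path("/workspace", value))
    except ValueError as err:
        print(value, "-> rejected:", err)
```

Under these assumptions `src/app` resolves to a directory beneath the root, while `../outside` is rejected before any command would run.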

Behavior

Executes the command as a child process in the trajectory’s `workDir`, or in the configured `sub_path` subdirectory when one is set. Checks the exit code and, when configured, validates stdout/stderr content.

When `args` is omitted, `command` is passed to the system shell (`/bin/sh` on Unix, `cmd.exe` on Windows) for parsing, so shell built-ins and metacharacters work in this mode. When `args` is provided, `command` is the binary path and is executed directly without a shell, giving reliable exit codes and cross-platform argument handling.
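
The difference between the two modes can be sketched in Python, whose `subprocess` module follows the same shell convention (`/bin/sh` on Unix, `cmd.exe` on Windows) when `shell=True`. The `run` helper below is hypothetical and is not the grader's code. In particular, letting configured `env` entries override inherited variables is an assumption, since the source only says the two are merged.

```python
import os
import subprocess
import sys


def run(command, args=None, env=None, cwd=None, timeout=30.0):
    """Run `command` in shell mode (args omitted) or direct mode (args given).

    Hypothetical helper for illustration; not the grader's code.
    """
    # Assumption: configured variables override inherited ones on conflict.
    merged_env = {**os.environ, **(env or {})}
    common = dict(cwd=cwd, env=merged_env, timeout=timeout,
                  capture_output=True, text=True)
    if args is None:
        # Shell mode: the system shell parses the string, so && and | work.
        return subprocess.run(command, shell=True, **common)
    # Direct mode: no shell; every argument reaches the binary verbatim.
    return subprocess.run([command, *args], **common)


# Direct mode: "&&" is an ordinary argument, not a shell operator.
direct = run(sys.executable, ["-c", "import sys; print(sys.argv[1:])", "a", "&&", "b"])
print(direct.stdout.strip())  # ['a', '&&', 'b']

# Shell mode: the shell interprets "&&" and runs both commands.
shell = run("echo first && echo second")
print(shell.stdout.split())  # ['first', 'second']
```

The direct-mode call shows why explicit `args` are the safer choice for arguments containing spaces or metacharacters: nothing is re-parsed between the config and the binary.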

Passes when every configured check passes: the exit code matches `expected_exit_code` and all configured stdout/stderr assertions hold. Fails otherwise.
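
As a rough illustration of the all-checks-must-pass rule, the following Python sketch combines the exit-code check with the substring and regex assertions. The `passes` and `as_list` helpers are hypothetical, and Python's `re` module stands in for the grader's regex engine, whose dialect may differ. `re` accepts the same `(?i)`, `(?m)`, and `(?s)` inline flags at the start of a pattern.

```python
import re


def as_list(value):
    """Normalize a scalar-or-array field to a list; None means 'not configured'."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def passes(config, exit_code, stdout, stderr=""):
    """Return True only if every configured check succeeds (hypothetical sketch)."""
    checks = [exit_code == config.get("expected_exit_code", 0)]
    checks += [s in stdout for s in as_list(config.get("stdout_contains"))]
    checks += [s in stderr for s in as_list(config.get("stderr_contains"))]
    # re.search honors inline flags such as (?i) at the start of the pattern.
    checks += [re.search(p, stdout) is not None
               for p in as_list(config.get("stdout_matches"))]
    checks += [re.search(p, stdout) is None
               for p in as_list(config.get("stdout_not_matches"))]
    return all(checks)


config = {"stdout_contains": "OK", "stdout_matches": "(?i)success"}
print(passes(config, 0, "OK: build SUCCESS"))  # True
print(passes(config, 1, "OK: build SUCCESS"))  # False: exit code mismatch
print(passes(config, 0, "OK: build done"))     # False: regex does not match
```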

When any of the array-accepting fields (`stdout_contains`, `stderr_contains`, `stdout_matches`, `stdout_not_matches`) is given an array, each entry produces its own grader-result detail. The detail name is the dash-cased form of the field with the entry’s index appended (e.g. `stdout-contains[0]`, `stdout-contains[1]`, `stdout-not-matches[2]`), so failure evidence still names the specific check that failed. Note that the detail name is dash-cased, unlike the underscore-cased config key.

Empty values are rejected at validation time. That includes empty scalars (`""`), empty arrays, and empty-string entries inside arrays, so a config edit can’t silently turn a check into a no-op: an empty substring is contained in any output, and an empty regex matches any output.
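
The naming and validation rules above can be sketched in Python as follows. Both `validate` and `detail_names` are hypothetical helpers, not the grader's API, and the error message is invented for illustration. The source does not say how a scalar-form check is named, so the sketch only covers the array form.

```python
ARRAY_FIELDS = ("stdout_contains", "stderr_contains",
                "stdout_matches", "stdout_not_matches")


def validate(config):
    """Raise ValueError for empty scalars, empty arrays, or empty array entries."""
    for field in ARRAY_FIELDS:
        if field not in config:
            continue
        value = config[field]
        entries = [value] if isinstance(value, str) else list(value)
        if not entries or any(entry == "" for entry in entries):
            raise ValueError(f"{field} must not be empty")


def detail_names(config):
    """List per-entry detail names for fields configured in array form."""
    names = []
    for field in ARRAY_FIELDS:
        value = config.get(field)
        if isinstance(value, (list, tuple)):
            dashed = field.replace("_", "-")  # config key -> detail name
            names += [f"{dashed}[{i}]" for i in range(len(value))]
    return names


config = {"stdout_contains": ["area=15", "perimeter=16"],
          "stdout_not_matches": ["ERROR"]}
validate(config)
print(detail_names(config))
# ['stdout-contains[0]', 'stdout-contains[1]', 'stdout-not-matches[0]']
```

Rejecting empty values up front matters because `"" in stdout` is always true in a substring check, so an accidentally blanked entry would otherwise pass silently.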

This is a `complex-static` grader: deterministic, but with side effects and I/O. Because of the I/O cost, it’s typically better suited to CI and outer-loop evals than to fast inner-loop checks.

Use cases

# Run tests to verify agent's code works
- type: run-command
  config:
    command: "npm test"
    expected_exit_code: 0

# Check a file is valid JSON (shell command string mode)
- type: run-command
  config:
    command: 'node -e ''JSON.parse(require("fs").readFileSync("output.json"))'''
    expected_exit_code: 0

# Compile TypeScript
- type: run-command
  config:
    command: "npx tsc --noEmit"
    expected_exit_code: 0

# Direct exec with explicit args (cross-platform, reliable exit codes)
- type: run-command
  config:
    command: "node"
    args: ["validate.js", "--strict"]
    expected_exit_code: 0
    stdout_contains: "OK"

# Regex match on stdout (case-insensitive)
- type: run-command
  config:
    command: "dotnet"
    args: ["run"]
    stdout_matches: "(?i)hello world"

# Assert stdout does NOT contain error patterns
- type: run-command
  config:
    command: "node"
    args: ["app.js"]
    stdout_not_matches: "ERROR|FATAL|WARN"

# Array form: every listed substring must appear in stdout
- type: run-command
  config:
    command: "python3"
    args: ["main.py", "5", "3"]
    stdout_contains:
      - "area=15"
      - "perimeter=16"

# Array form for the regex variant: every pattern must match
- type: run-command
  config:
    command: "dotnet"
    args: ["test"]
    stdout_matches:
      - "Passed!"
      - "Total tests: \\d+"
    stdout_not_matches:
      - "ERROR"
      - "FATAL"

# Run tests from a subdirectory within the workspace
- type: run-command
  config:
    command: "npm test"
    sub_path: "packages/core"

# Pass custom environment variables to the command
- type: run-command
  config:
    command: pytest
    args: [tests/test_output.py, -x]
    env:
      TEST_MODE: "eval"
      PYTHONPATH: "/app/src"

Evidence examples

✔ Command 'npm test' exited with code 0 (expected 0)
✘ Command 'npm test' exited with code 1 (expected 0)