Grader: run-command
Taxonomy
Section titled “Taxonomy”| Property | Value |
|---|---|
| Determinism | complex-static |
| Cost | low |
| Portability | t2-domain |
| Reference | reference-free |
| Temporal scope | trajectory-level |
| Score kind | code |
Config
Section titled “Config”graders: - type: run-command config: command: "node" args: ["validate.js", "--strict"] expected_exit_code: 0 stdout_contains: "OK" stdout_matches: "(?i)success" sub_path: "src/app" timeout: 30s| Field | Type | Required | Default | Description |
|---|---|---|---|---|
command | string | Yes | — | Binary to run (when args provided) or shell command string (when args omitted) |
args | string[] | No | — | Arguments passed directly to the binary. When provided, the command runs without a shell for reliable exit codes and cross-platform behavior |
expected_exit_code | number | No | 0 | Expected exit code for a passing result |
stdout_contains | string | No | — | Substring that must appear in stdout |
stderr_contains | string | No | — | Substring that must appear in stderr |
stdout_matches | string | No | — | Regex pattern that must match stdout. Supports inline flags: (?i) for case-insensitive, (?m) for multiline, (?s) for dotAll |
stdout_not_matches | string | No | — | Regex pattern that must not match stdout. Same inline flag support as stdout_matches |
sub_path | string | No | — | Subdirectory under the workspace root to use as the working directory. Must stay within the workspace root. |
timeout | duration | No | 30s | Timeout (e.g. 30s, 2m) |
env | Record<string, string> | No | — | Additional environment variables passed to the subprocess. Merged with the process environment. |
Behavior
Section titled “Behavior”Executes the command in the trajectory’s workDir as a child process. Checks exit code and optionally validates stdout/stderr content.
When args is omitted, command is passed to the system shell (/bin/sh on Unix, cmd.exe on Windows) for parsing — shell built-ins and metacharacters work in this mode. When args is provided, command is the binary path and is executed directly without a shell, giving reliable exit codes and cross-platform arg handling.
Passes when all configured checks pass (exit code match + any stdout/stderr assertions). Fails otherwise.
This is a complex-static grader — deterministic but with side effects and I/O. Because of the I/O cost, it’s typically appropriate for CI and outer-loop evals rather than fast inner-loop checks.
Use cases
Section titled “Use cases”# Run tests to verify agent's code works- type: run-command config: command: "npm test" expected_exit_code: 0
# Check a file is valid JSON (shell command string mode)- type: run-command config: command: 'node -e ''JSON.parse(require("fs").readFileSync("output.json"))''' expected_exit_code: 0
# Compile TypeScript- type: run-command config: command: "npx tsc --noEmit" expected_exit_code: 0
# Direct exec with explicit args (cross-platform, reliable exit codes)- type: run-command config: command: "node" args: ["validate.js", "--strict"] expected_exit_code: 0 stdout_contains: "OK"
# Regex match on stdout (case-insensitive)- type: run-command config: command: "dotnet" args: ["run"] stdout_matches: "(?i)hello world"
# Assert stdout does NOT contain error patterns- type: run-command config: command: "node" args: ["app.js"] stdout_not_matches: "ERROR|FATAL|WARN"
# Run tests from a subdirectory within the workspace- type: run-command config: command: "npm test" sub_path: "packages/core"
# Pass custom environment variables to the command- type: run-command config: command: pytest args: [tests/test_output.py, -x] env: TEST_MODE: "eval" PYTHONPATH: "/app/src"Evidence examples
Section titled “Evidence examples”✔ Command 'npm test' exited with code 0 (expected 0)✘ Command 'npm test' exited with code 1 (expected 0)