Grader: program
Taxonomy
| Property | Value |
|---|---|
| Determinism | static |
| Cost | low |
| Portability | t3a-scenario |
| Reference | reference-free |
| Temporal scope | trajectory-level |
| Score kind | code |
Config
```yaml
graders:
  - type: program
    config:
      program: "python"
      args: ["graders/check_output.py"]
      timeout: 60s
      sub_path: "src/app"
```

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| program | string | Yes | — | The program to run (e.g. python, node, bash) |
| args | string[] | No | — | Command-line arguments passed to the program |
| shell | boolean | No | false | Run the program using a shell instead of exec |
| sub_path | string | No | — | Subdirectory under the workspace root to use as the working directory. Must stay within the workspace root. |
| timeout | duration | No | 60s | Maximum time the process can run before being terminated (e.g. 60s, 2m) |
| env | Record<string, string> | No | — | Additional environment variables passed to the subprocess. Merged with the process environment. EVALUATE_WORKSPACE and EVALUATE_GRADER_INPUT are always set by the grader and cannot be specified in config.env. |
Behavior
Executes the program in a child process and sets two environment variables so your grader can load the full evaluation context:
| Environment variable | Value |
|---|---|
| EVALUATE_WORKSPACE | Path to the workspace root (always the top-level workspace, even when sub_path changes the working directory) |
| EVALUATE_GRADER_INPUT | Path to a temporary JSON file containing the serialized GraderInput |
When sub_path is set, the child process’s working directory (cwd) is the resolved subdirectory, but EVALUATE_WORKSPACE still points to the workspace root.
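As a minimal sketch of how a grader script might read these two variables, the Python below loads the serialized GraderInput and resolves paths against the workspace root instead of the current directory. The helper names `load_context` and `workspace_path` are hypothetical, and because the GraderInput schema is not described in this section, the payload is treated as opaque JSON.

```python
import json
import os
import sys
from pathlib import Path


def load_context(environ=os.environ):
    """Return (workspace_root, grader_input) read from the two reserved variables."""
    workspace = Path(environ["EVALUATE_WORKSPACE"])
    # The GraderInput schema is not shown in this section, so the payload is
    # treated as opaque parsed JSON here.
    with open(environ["EVALUATE_GRADER_INPUT"], encoding="utf-8") as fh:
        grader_input = json.load(fh)
    return workspace, grader_input


def workspace_path(workspace, relative):
    """Resolve a path against the workspace root rather than the current directory."""
    return Path(workspace) / relative


# Run only when invoked by the grader, which always sets both variables.
if __name__ == "__main__" and "EVALUATE_GRADER_INPUT" in os.environ:
    root, payload = load_context()
    # Diagnostics go to stderr so stdout stays free for the result.
    print(f"cwd={Path.cwd()} workspace={root}", file=sys.stderr)
    print(f"input type: {type(payload).__name__}", file=sys.stderr)
```

Building paths from EVALUATE_WORKSPACE keeps the script independent of whatever sub_path the config sets as the working directory.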
The reserved variables EVALUATE_WORKSPACE and EVALUATE_GRADER_INPUT are rejected if they appear in env (case-insensitive), because the grader owns their values.
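To illustrate what the case-insensitive rule means in practice, here is a small pre-flight check you could run on your own env mapping before using it in a config. It is only a sketch of the rule as stated above, not the grader's actual validation code, and the function name `reserved_keys` is hypothetical.

```python
# The two names the grader owns, as listed in the table above.
RESERVED = {"EVALUATE_WORKSPACE", "EVALUATE_GRADER_INPUT"}


def reserved_keys(env):
    """Return the keys in env that collide with a reserved name, ignoring case."""
    return sorted(key for key in env if key.upper() in RESERVED)


if __name__ == "__main__":
    candidate = {"API_ENDPOINT": "https://test.example.com", "evaluate_workspace": "/tmp"}
    # The lowercase spelling still collides with the reserved name.
    print(reserved_keys(candidate))
```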
Your program communicates its result back in one of two ways:
Exit-code mode
Print nothing to stdout (stderr is fine). Exit 0 to pass, non-zero to fail. The score is 1 on pass, 0 on fail.
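A minimal exit-code grader might look like the following sketch. The file name in `REQUIRED_FILE` and the `check` helper are assumptions made up for illustration; the only behavior taken from this page is that stdout stays empty and the exit code decides pass or fail.

```python
import os
import sys
from pathlib import Path

# Hypothetical file the evaluated run is expected to produce.
REQUIRED_FILE = "output/report.json"


def check(workspace):
    """Return 0 (pass) if the required file exists and is non-empty, else 1 (fail)."""
    target = Path(workspace) / REQUIRED_FILE
    if target.is_file() and target.stat().st_size > 0:
        return 0
    # Diagnostics go to stderr so stdout stays empty.
    print(f"missing or empty: {target}", file=sys.stderr)
    return 1


# Run only when invoked by the grader, which always sets EVALUATE_WORKSPACE.
if __name__ == "__main__" and "EVALUATE_WORKSPACE" in os.environ:
    sys.exit(check(os.environ["EVALUATE_WORKSPACE"]))
```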
GraderResult JSON mode
Print a JSON object conforming to the GraderResult schema to stdout. This lets you return a custom score (between 0 and 1 inclusive), evidence text, and metadata. Do not print anything else to stdout; use stderr for diagnostics.
```json
{
  "name": "my-custom-check",
  "passed": true,
  "score": 0.85,
  "evidence": "14 of 16 assertions passed",
  "kind": "code"
}
```

If stdout contains text that is not valid JSON, or the JSON does not match the GraderResult schema, the grader fails with a 0 score.
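The sketch below shows one way a Python grader could build and emit such an object. It uses only the five fields that appear in the example above; the full GraderResult schema is not reproduced here, and the `build_result` and `emit` helpers, the pass threshold of 0.8, and the score formula (fraction of assertions passed) are all assumptions for illustration.

```python
import json
import sys


def build_result(passed_count, total, threshold=0.8):
    """Build a result dict with the fields shown in the example above."""
    # Assumed scoring rule: fraction of assertions that passed, within [0, 1].
    score = passed_count / total if total else 0.0
    return {
        "name": "my-custom-check",
        "passed": score >= threshold,  # threshold is an illustrative choice
        "score": round(score, 4),
        "evidence": f"{passed_count} of {total} assertions passed",
        "kind": "code",
    }


def emit(result, stream=None):
    """Write exactly one JSON object to stdout (or the given stream)."""
    stream = sys.stdout if stream is None else stream
    json.dump(result, stream)
    stream.write("\n")
```

A grader script would typically end with a single call such as `emit(build_result(passed_count, total))` and send every other message to stderr.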
Use cases
```yaml
# Python script that inspects the workspace
- type: program
  config:
    program: "python"
    args: ["graders/validate_schema.py"]

# Bash script using exit-code mode
- type: program
  config:
    program: "bash"
    args: ["graders/check.sh"]
    shell: true

# Node.js script returning a GraderResult JSON
- type: program
  config:
    program: "node"
    args: ["graders/score.js"]
    timeout: 120s

# Pass custom environment variables to the grader
- type: program
  config:
    program: python
    args: [-m, my_checker]
    env:
      API_ENDPOINT: "https://test.example.com"
      EXPECTED_STATUS: "active"
```

Evidence examples
```
✔ Grader exited successfully
✘ Grader exited with exit code 1
✘ Grader timed out
✘ Failed to start grader program: ENOENT (spawn)
✘ Grader returned unparseable JSON output on stdout: SyntaxError: …
✘ Grader output did not match schema
```