Grader: program

Taxonomy

Property	Value
Determinism	`static`
Cost	`low`
Reference	`reference-free`
Temporal scope	`trajectory-level`
Score kind	`code`

Config

graders:
  - type: program
    config:
      program: "python"
      args: ["graders/check_output.py"]
      timeout: 60s
      sub_path: "src/app"

Field	Type	Required	Default	Description
`program`	string	Yes	—	The program to run (e.g. `python`, `node`, `bash`)
`args`	string[]	No	—	Command-line arguments passed to the program
`shell`	boolean	No	`false`	Run the program using the system shell (`/bin/sh` on Unix, `cmd.exe` on Windows) instead of direct exec
`sub_path`	string	No	—	Subdirectory under the workspace root to use as the working directory. Must stay within the workspace root.
`timeout`	duration	No	`60s`	Maximum time the process can run before being terminated (e.g. `60s`, `2m`). Must be positive (`0` is rejected).
`env`	`Record<string, string>`	No	—	Additional environment variables passed to the subprocess, merged with the process environment. `EVALUATE_WORKSPACE` and `EVALUATE_GRADER_INPUT` are always set by the grader and cannot be specified in `config.env`.

Behavior

The grader runs the program in a child process and sets two environment variables so that your program can load the full evaluation context:

Environment variable	Value
`EVALUATE_WORKSPACE`	Path to the workspace root (always the top-level workspace, even when `sub_path` changes the working directory)
`EVALUATE_GRADER_INPUT`	Path to a temporary JSON file containing the serialized `GraderInput`

When `sub_path` is set, the child process’s working directory (cwd) is the resolved subdirectory, but `EVALUATE_WORKSPACE` still points to the workspace root.

The reserved variables `EVALUATE_WORKSPACE` and `EVALUATE_GRADER_INPUT` are rejected if they appear in `env` (the name check is case-insensitive), because the grader owns their values.
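As a minimal sketch, a Python grader might read the two variables as shown here. The helper name `load_context` is hypothetical, and because the fields of `GraderInput` are not described in this section, the JSON is treated as opaque data.

```python
import json
import os
from pathlib import Path


def load_context(environ=os.environ):
    """Return (workspace_root, grader_input) from the grader-provided variables.

    `load_context` is a hypothetical helper name, not part of any documented API.
    """
    # Always the top-level workspace, even when sub_path changes the cwd.
    workspace_root = Path(environ["EVALUATE_WORKSPACE"])
    # Temporary JSON file holding the serialized GraderInput.
    input_path = Path(environ["EVALUATE_GRADER_INPUT"])
    # No assumptions are made about the fields inside GraderInput.
    grader_input = json.loads(input_path.read_text(encoding="utf-8"))
    return workspace_root, grader_input
```

Inside a grader script, `Path.cwd()` gives the `sub_path` directory (when one is configured), while `workspace_root` stays at the top of the workspace.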

Your program communicates its result back in one of two ways:

Exit-code mode

Print nothing to stdout (stderr is fine). Exit 0 to pass, non-zero to fail. The score is 1 on pass, 0 on fail.
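An exit-code grader could look roughly like the sketch below. The `README.md` check and the function names are hypothetical placeholders for whatever your grader actually verifies; the point is that diagnostics go to stderr and stdout stays empty.

```python
import os
import sys
from pathlib import Path


def check_workspace(workspace_root: Path) -> bool:
    """Hypothetical check: pass when the workspace contains a README.md."""
    return (workspace_root / "README.md").is_file()


def main(environ=os.environ) -> int:
    """Return the exit code: 0 to pass, 1 to fail."""
    ok = check_workspace(Path(environ["EVALUATE_WORKSPACE"]))
    if not ok:
        # Diagnostics go to stderr so that stdout stays empty.
        print("README.md not found in workspace", file=sys.stderr)
    return 0 if ok else 1
```

Ending the script with `sys.exit(main())` turns the return value into the process exit code.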

GraderResult JSON mode

Print a JSON object conforming to the `GraderResult` schema to stdout. This lets you return a custom score (between 0 and 1 inclusive), evidence text, and metadata. Do not print anything else to stdout; use stderr for diagnostics.

{
  "name": "my-custom-check",
  "passed": true,
  "score": 0.875,
  "evidence": "14 of 16 assertions passed",
  "kind": "code"
}

If stdout contains text that is not valid JSON, or the JSON does not match the `GraderResult` schema, the grader fails with a score of 0.
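The following sketch shows one way a Python grader might build and emit such an object. It uses only the fields that appear in the example above; the helper names and the `pass_threshold` parameter are assumptions for illustration, and the full `GraderResult` schema may require or allow other fields.

```python
import json
import sys


def build_result(passed_count: int, total: int, pass_threshold: float = 0.8) -> dict:
    """Build a result dict with the fields shown in the example above.

    `pass_threshold` is a hypothetical knob; how `passed` should relate to
    `score` is up to your grader.
    """
    score = passed_count / total if total else 0.0
    score = min(max(score, 0.0), 1.0)  # keep the score within [0, 1]
    return {
        "name": "my-custom-check",
        "passed": score >= pass_threshold,
        "score": score,
        "evidence": f"{passed_count} of {total} assertions passed",
        "kind": "code",
    }


def emit(result: dict, stdout=sys.stdout) -> None:
    """Write the JSON object, and nothing else, to stdout."""
    stdout.write(json.dumps(result))
```

Any progress or debug output should be written with `print(..., file=sys.stderr)` so that stdout contains only the JSON object.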

Use cases

# Python script that inspects the workspace
- type: program
  config:
    program: "python"
    args: ["graders/validate_schema.py"]

# Bash script using exit-code mode
- type: program
  config:
    program: "bash"
    args: ["graders/check.sh"]
    shell: true

# Node.js script returning a GraderResult JSON
- type: program
  config:
    program: "node"
    args: ["graders/score.js"]
    timeout: 120s

# Pass custom environment variables to the grader
- type: program
  config:
    program: python
    args: [-m, my_checker]
    env:
      API_ENDPOINT: "https://test.example.com"
      EXPECTED_STATUS: "active"
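
On the program side, the custom variables from the last config arrive as ordinary environment variables. A sketch of how the hypothetical `my_checker` module might read them, failing early when one is missing:

```python
import os


def read_settings(environ=os.environ):
    """Return (api_endpoint, expected_status) from the configured env block.

    The variable names match the config above; `read_settings` itself is a
    hypothetical helper.
    """
    names = ("API_ENDPOINT", "EXPECTED_STATUS")
    # Environment values are always strings; empty counts as missing here.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return environ["API_ENDPOINT"], environ["EXPECTED_STATUS"]
```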

Evidence examples

✔ Grader exited successfully
✘ Grader exited with exit code 1
✘ Grader timed out
✘ Failed to start grader program: ENOENT (spawn)
✘ Grader returned unparseable JSON output on stdout: SyntaxError: …
✘ Grader output did not match schema