Skip to main content

Evals in CI

This guide describes how the vally eval pipeline authenticates in CI, how forked pull requests are handled, and how to add a new eval spec for an AI artifact (agent, prompt, instructions file, or skill).

Required Secret

The eval-execute job in .github/workflows/pr-validation.yml runs the vally eval command for each changed AI artifact. The vally CLI delegates to the @github/copilot CLI, which requires a GitHub credential exported as COPILOT_GITHUB_TOKEN.

Configure the secret at the repository (or organization) level:

  • Settings -> Secrets and variables -> Actions -> New repository secret
  • Name: COPILOT_GITHUB_TOKEN
  • Value: a token from one of the accepted token types listed below.

Token-Type Guidance

The @github/copilot CLI accepts the following token prefixes. Classic personal access tokens (ghp_) are rejected at runtime.

PrefixToken TypeUse in CI
ghs_GitHub App installation tokenPreferred. Short-lived, scoped, auditable
github_pat_Fine-grained personal access tokenAcceptable when a GitHub App is not feasible
gho_, ghu_OAuth / user-to-server tokenAvoid. Tied to a user identity
ghp_Classic personal access tokenRejected at runtime. The probe fails fast
GITHUB_TOKENActions-issued tokenScope-limited. Not sufficient for vally eval

For hve-core, the recommended pattern is a GitHub App with Copilot SDK scopes that mints an installation token in CI and exports it as COPILOT_GITHUB_TOKEN.

Probe Behavior

scripts/evals/Test-CopilotToken.ps1 runs before any vally eval invocation and exits non-zero with a ::error:: annotation when:

  • COPILOT_GITHUB_TOKEN is missing or empty and gh auth token cannot supply a token (fallback for local runs with the GitHub CLI logged in)
  • the token begins with ghp_ (classic PAT)
  • the optional -SmokeTest switch invokes vally --version and the CLI exits non-zero

The pass-path Reason includes (source: COPILOT_GITHUB_TOKEN) or (source: gh auth token) so contributors can confirm which credential path was used. The smoke test reports a clean skip when vally is not installed locally, so contributors can run the probe outside CI without installing the CLI.

Per-Job COPILOT_HOME Isolation

The @github/copilot CLI persists state (logged-in users, caches) under the directory named by COPILOT_HOME, defaulting to ~/.copilot. CI jobs share runner home directories across steps and can pollute each other when this state leaks.

Export COPILOT_HOME to a job-local path in every workflow job that invokes vally:

env:
COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
COPILOT_HOME: ${{ runner.temp }}/copilot-home

This pattern keeps each eval job hermetic, prevents credential bleed-through between matrix legs, and avoids the deprecated --config-dir CLI flag.

Fork PR Policy

GitHub Actions does not expose repository secrets to workflows triggered by pull requests from forks. Without COPILOT_GITHUB_TOKEN, the eval-execute job cannot succeed.

The pipeline clean-skips eval execution for fork PRs rather than failing the check:

jobs:
eval-execute:
if: github.event.pull_request.head.repo.fork == false

The eval-presence and eval-lint jobs do run on fork PRs because they require no secrets. Structural problems with eval specs (missing coverage, schema violations, profanity in stimulus text) surface immediately. Eval execution itself runs only after a maintainer merges the fork branch into a trusted topic branch on the upstream repository.

Adding a New Eval Spec

When you add or modify an AI artifact under .github/agents/, .github/prompts/, .github/instructions/, or .github/skills/, the eval-presence job fails the PR until a matching eval spec exists.

Steps to add coverage:

  1. Create an eval spec under evals/ that follows the structure documented in evals/README.md.

  2. Set the spec's stimulus.backlink field to the absolute repository path of the artifact under test (for example, .github/agents/coding-standards/researcher-subagent.agent.md).

  3. Ensure the spec declares an executor compatible with the vally CLI (typically the CopilotSdkExecutor with a model: hint).

  4. Run the presence check locally to confirm the artifact is covered:

    pwsh scripts/evals/Get-ChangedAIArtifact.ps1 -BaseRef origin/main -HeadRef HEAD -OutFile logs/changed-ai-artifacts.json
    pwsh scripts/evals/Test-StimulusPresence.ps1 -ManifestPath logs/changed-ai-artifacts.json
  5. Run the eval locally (requires COPILOT_GITHUB_TOKEN in your shell environment):

    pwsh scripts/evals/Test-CopilotToken.ps1 -SmokeTest
    pwsh scripts/evals/Invoke-VallyEvals.ps1 -ManifestPath logs/changed-ai-artifacts.json

Commit the new spec alongside the artifact change. The PR comment summary in eval-execute reports per-artifact pass/fail with links to the captured logs/eval-results-<artifact-id>.json payloads.

Stimulus presence linter

scripts/evals/Test-StimulusPresence.ps1 is the gate that fails an eval-presence job when a changed AI artifact lacks an eval spec backlink. It reads the manifest produced by scripts/evals/Get-ChangedAIArtifact.ps1 and builds a coverage index from every evals/**/*.yaml spec.

Each changed artifact is matched against the stimuli[].tags.<kind> = <slug> backlinks in that index. Deleted artifacts (manifest status D) are skipped because coverage cannot be required for removed files.

The script writes a structured report to logs/stimulus-presence.json (covered, missing, errors, skipped) and emits a single ::error file=...:: annotation per missing artifact, so a PR comment names the file that needs coverage.

Exit codes:

ExitMeaning
0Every changed artifact is covered, or the manifest is empty or contains only deletions.
1At least one changed artifact is missing an eval-spec backlink.
2Invalid input: missing manifest, missing evals/ root, or unrecoverable YAML parse fail.

The -FailOnSpecError switch promotes recoverable YAML parse failures to a hard exit 2 so a malformed spec cannot mask a missing-backlink failure during local hardening sweeps.

Run the linter locally before pushing artifact changes:

pwsh scripts/evals/Get-ChangedAIArtifact.ps1 -BaseRef origin/main -HeadRef HEAD -OutFile logs/changed-ai-artifacts.json
pwsh scripts/evals/Test-StimulusPresence.ps1 -ManifestPath logs/changed-ai-artifacts.json -FailOnSpecError

To add coverage for a missing artifact, create or extend an eval spec under evals/ and set stimuli[].tags.<kind> to the artifact slug (the basename minus the .agent.md, .prompt.md, .instructions.md, or SKILL.md suffix); the next run reports it covered.

Per-Spec Moderation Threshold

The moderation.threshold schema field on an eval spec sets the per-spec Detoxify cutoff (any label score exceeding the value hard-fails the spec):

moderation:
threshold: 0.7

The validator accepts numeric values in [0.0, 1.0]; out-of-range or non-numeric values emit ModerationThresholdOutOfRange / ModerationThresholdType diagnostics during eval:lint:schema. The default is 0.5 when the field is omitted.

Invoke-VallyEvals.ps1 -ModerationThreshold <value> overrides every spec's threshold for a run. CLI override wins over the per-spec value, which wins over the default.

Content moderation coverage

Content moderation runs in two complementary CI lanes, each scoped to a different surface.

LaneJob in pr-validation.ymlScriptToolchainSurface
Markdown corpuseval-lintscripts/evals/Test-EvalSpecText.ps1Node (alex.js, retext-profanities).github/{agents,prompts,instructions,skills}/**/*.md and docs/**/*.md
Eval-spec stimulicontent-moderationscripts/evals/Invoke-CorpusModeration.ps1Python + Detoxify (unitary/toxic-bert)Stimulus text and expected-output fixtures inside evals/**/*.yaml

The two lanes target different surfaces and do not overlap: the markdown-corpus lane keeps the AI artifacts that ship to contributors free of insensitive or foul language; the eval-spec stimuli lane scores adversarial test inputs against a Detoxify cutoff so a spec that probes a model with toxic content cannot itself ship unredacted.

The content-moderation job is the only path that exercises the real Detoxify model in CI. The job installs the Python dependencies (scripts/evals/moderation/requirements.txt) via uv pip install, caches the Detoxify weights between runs, then invokes Invoke-CorpusModeration.ps1 per spec.

Invoke-CorpusModeration.ps1 shells out to scripts/evals/Invoke-ContentModeration.ps1 for each stimulus. The default Detoxify threshold is 0.5; per-spec overrides come from the moderation.threshold field documented above.

Local opt-in for the Detoxify lane:

uv pip install -r scripts/evals/moderation/requirements.txt
pwsh scripts/evals/Invoke-CorpusModeration.ps1 -SpecGlob 'evals/**/*.yaml'

Without the Python dependencies installed, Invoke-ContentModeration.ps1 exits 2 with a setup error rather than silently passing. The markdown-corpus lane (Test-EvalSpecText.ps1) requires only Node and runs in lint:all without any opt-in.

Eval Lint Scripts

Four eval-lint commands run in lint:all:

ScriptToolPurpose
eval:lint:vallyvally lint --eval-spec evals/Spec validation via the upstream CLI
eval:lint:schemaTest-EvalSpec.ps1hve-core schema lint (graders, executor, moderation.threshold)
eval:lint:textTest-EvalSpecText.ps1retext-profanities + alex.js gate on the AI-artifact corpus
eval:lint:safetyTest-VallyTestSafety.ps1Safety gate on vally test stimuli (logs/vally-test-safety.json)

eval:lint:text scans .github/{agents,prompts,instructions,skills}/**/*.md and docs/**/*.md. By default retext-profanities findings flip the exit code (errors) and alex findings emit ::warning annotations only. Pass -FailOnAlex to promote alex findings to errors for local hardening:

pwsh scripts/evals/Test-EvalSpecText.ps1 -FailOnAlex

False-positive lexical matches (e.g., penetration test, attack surface, token abuse) are filtered by the phrase-aware allowlist in scripts/evals/Modules/retext-runner.mjs (PHRASE_ALLOWLIST keyed by retext rule id; ±60-character context window).

Test-EvalSpecText.ps1 exit codes:

ExitMeaning
0No error-level findings (alex.js findings may still be reported as warnings).
1At least one retext-profanities finding, or any alex.js finding when -FailOnAlex is set.
2Setup failure (corpus expansion failed, Node shim missing, or node not on PATH).

Baseline-equivalence specs

eval:lint:vally runs vally lint --eval-spec evals/, which validates the eval YAML files immediately under evals/ but does not recurse into nested subdirectories. The baseline-equivalence suite under evals/baseline-equivalence/ ships nested specs (baseline/eval.yaml, customized/eval.yaml, and compare.eval.yml) that need explicit per-file lint invocations:

vally lint --eval-spec evals/baseline-equivalence/baseline/eval.yaml
vally lint --eval-spec evals/baseline-equivalence/customized/eval.yaml
vally lint --eval-spec evals/baseline-equivalence/compare.eval.yml

scripts/evals/Invoke-BaselineEquivalence.ps1 runs all three implicitly during npm run eval:run:equivalence. See evals/baseline-equivalence/README.md for the suite operator guide and driver-output contract.

Running Pester Tests Locally

npm run test:ps wraps scripts/tests/Invoke-PesterTests.ps1. The default invocation applies ExcludeTag=@('Integration','Slow'):

npm run test:ps # default green-bar (excludes Integration + Slow)
npm run test:ps -- -ExcludeTag Slow # include Integration, exclude Slow
npm run test:ps -- -Tag Integration # run only Integration-tagged tests
npm run test:ps -- -TestPath scripts/tests/evals/ # scope to one directory

-Tag (with -IncludeTag as an alias) and -ExcludeTag flow through to the inner Pester configuration only when explicitly bound, so omitting them preserves the default exclusion. CI matches the default invocation; opt-in tag overrides are intended for targeted local runs.

Results land in logs/pester-summary.json (overall counts) and logs/pester-failures.json (per-failure detail).

Testing PowerShell Wrappers Around Python Subprocesses

Invoke-ContentModeration.ps1 invokes python through Start-Process in a child pwsh -NoProfile -File boundary. The parent test scope's Mock / function: injections do not cross that boundary, so the test suite at scripts/tests/evals/Invoke-ContentModeration.Tests.ps1 uses a PATH-shimmed stub:

  1. Create a temp directory and write python.cmd containing a CMD wrapper that re-launches pwsh against a canned python.ps1.
  2. Prepend the temp directory to $env:PATH for the duration of the test.
  3. The child process resolves python to the shim, executes python.ps1, and observes real argv (--input, --output, --threshold).

This is the only viable mock boundary for cross-process invocation. Apply the same pattern when adding tests for any PowerShell script that shells out to a Python subprocess.

Test authoring patterns

When authoring new Pester suites for the evals scripts, three patterns recur often enough to call out:

  • Define helper functions inside BeforeAll { function ... } so Pester promotes them to the containing Describe scope for all It blocks. Functions defined directly inside Describe (outside BeforeAll) do not survive the fresh runspaces Pester uses for each It.
  • When the command under test is invoked through pwsh -File or Start-Process (so the parent runspace cannot install a Mock), declare a bare function at file scope in the test (or in a fixture script the child loads). The PATH-shim pattern above is one instance of this; the scripts/tests/evals/fixtures/stub-vally.ps1 fixture is another.
  • When a stub or script under test needs to signal a non-zero exit while $ErrorActionPreference = 'Stop' is in effect, write the diagnostic with [Console]::Error.WriteLine(...) and then call exit <code> explicitly. throw short-circuits the runspace before the intended exit code is set, which causes the parent process to observe exit 1 instead of the contract code.

The stub-vally fixture demonstrates the third pattern in practice. scripts/tests/evals/Invoke-VallyEvals.Tests.ps1 drives scripts/evals/Invoke-VallyEvals.ps1 against the fixture by passing -VallyCommand $script:StubPath and setting $env:STUB_VALLY_MODE to pass, fail, or crash per scenario. Per-spec overrides flow through $env:STUB_VALLY_MODES_JSON.

This lets the stub-mode aggregation tests exercise the real driver code paths (the manifest loop, threshold override, and summary writer) without invoking the vally CLI or paying Copilot SDK costs.

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.