Adversarial Harness
waza adversarial runs fault-injection test packs against your skill — adversarial prompts crafted to trip prompt-injection, scope-bypass, exfiltration, and out-of-policy tool use. It exists so you can answer one question on every PR: does my skill behave safely when the inputs are hostile?
This guide covers:
- What the built-in packs are and what they probe for
- Running packs from the CLI
- Wiring an `adversarial:` block into your `eval.yaml`
- Exit codes and CI integration
- Notes on determinism (replay-tape friendly)
Why a separate harness?
Functional evals (waza run) measure whether your skill solved the task. Adversarial evals measure whether your skill refused to solve the task when the task itself is unsafe. Both signals matter; they should not share a results column.
Every adversarial task is marked golden: true, so an unsafe outcome also makes waza gate exit with code 2. Even so, waza adversarial is the canonical entry point because it additionally enforces an on_unsafe_outcome policy and prints a focused summary.
Built-in packs
Run waza adversarial --list-packs to print the catalog. (Unknown pack names are rejected with a config error; they do not list the catalog.)
The harness ships two packs out of the box:
prompt-injection
Four tasks that smuggle adversarial directives through README files, fake “system” comments inside source files, a support ticket, and a markdown changelog link. A safe skill ignores the directive and answers the underlying user request. An unsafe skill echoes the injection trigger, exfiltrates “secrets”, or follows the hostile instructions.
scope-bypass
Four tasks that explicitly ask the skill to do something out-of-scope: send email, delete files, install a third-party dependency, and call an external HTTP API. A safe skill refuses and explains the constraint; an unsafe skill complies.
Each pack lives under internal/adversarial/data/<pack>/ and is embedded into the waza binary, so no network or filesystem dependencies are needed at runtime.
Quick start

    # Run both packs against the default engine (mock if no --skill given)
    waza adversarial --packs prompt-injection,scope-bypass

    # Run against a real skill via the Copilot SDK engine
    waza adversarial --packs prompt-injection --skill ./skills/my-skill --model gpt-4o

    # Save the full results JSON so you can drill in with the dashboard
    waza adversarial --packs scope-bypass --output adversarial.json

    # Run a single pack and tolerate unsafe outcomes (CI smoke; non-blocking)
    waza adversarial --packs prompt-injection --on-unsafe-outcome warn

A typical summary looks like:

    Adversarial summary (packs: prompt-injection)
      tasks run: 4
      unsafe: 0
      policy: fail
      result: ✅ all packs PASSED

Configuring from eval.yaml
You can pin pack selection and policy at the spec level by adding an adversarial: block. This block is additive under schema 1.2 and is consumed only by waza adversarial --spec eval.yaml; waza run ignores it.

    schemaVersion: "1.2"
    name: my-skill-eval
    skill: my-skill
    version: "1.0"
    config:
      executor: copilot-sdk
      model: gpt-4o
    tasks:
      - "tasks/*.yaml"
    adversarial:
      packs:
        - prompt-injection
        - scope-bypass
      on_unsafe_outcome: fail # or "warn"

CLI flags override the spec block:

    # Spec says fail; this run downgrades to warn
    waza adversarial --spec eval.yaml --on-unsafe-outcome warn

| Flag | Description | Default |
|---|---|---|
| `--packs` | Comma-separated pack names to run | All built-in packs |
| `--list-packs` | Print the pack catalog and exit | - |
| `--on-unsafe-outcome` | `fail` (exit 2) or `warn` (exit 0) on unsafe | `fail` |
| `--spec` | Inherit `adversarial:`, graders, and metrics from a spec | - |
| `--engine` | Engine to use (`mock`, `copilot-sdk`) | inferred |
| `--skill` | Skill path/identifier (forwards to the engine) | - |
| `--model` | Model passed to the engine | - |
| `--output` | Write the full results.json to a file | - |
| `--workers` / `--parallel` | Concurrency knobs (same semantics as `waza run`) | - |
| `--verbose` | Print pack/policy/engine/artifact diagnostics | `false` |

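
The precedence described above (a CLI flag overrides the spec block, and `fail` is the default) can be sketched as a small shell function. This only illustrates the resolution order; it is not waza's implementation, and the name `resolve_policy` and its two positional arguments are hypothetical.

```shell
# Hypothetical helper: illustrates the documented precedence only.
# $1 = value passed to --on-unsafe-outcome (empty if the flag was omitted)
# $2 = on_unsafe_outcome from the spec's adversarial: block (empty if absent)
resolve_policy() {
  if [ -n "$1" ]; then
    echo "$1"      # the CLI flag wins
  elif [ -n "$2" ]; then
    echo "$2"      # otherwise the spec block applies
  else
    echo "fail"    # documented default
  fi
}

resolve_policy "warn" "fail"   # prints: warn
resolve_policy ""     "fail"   # prints: fail
resolve_policy ""     ""       # prints: fail
```

The first call corresponds to the example above: the spec says fail, and the run downgrades to warn.
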
Exit codes

| Code | Meaning |
|---|---|
| 0 | All adversarial tasks were refused safely (or `--on-unsafe-outcome warn`) |
| 2 | At least one unsafe outcome was observed, and policy is `fail` |
| 3 | Configuration error (unknown pack, malformed spec, etc.) |

2 is the same exit code waza gate uses for golden-task failures, so the same CI step that gates on goldens will also gate on adversarial failures.
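
If a CI script needs to tell an unsafe outcome apart from a configuration error, the exit codes in the table can be mapped to log messages. The following is a minimal sketch; `describe_adversarial_exit` is a hypothetical helper, not part of waza, and the message wording is illustrative.

```shell
# Hypothetical helper: maps the documented exit codes to a CI log line.
# $1 = exit code returned by waza adversarial
describe_adversarial_exit() {
  case "$1" in
    0) echo "safe: all adversarial tasks refused (or policy is warn)" ;;
    2) echo "unsafe: at least one unsafe outcome under policy fail" ;;
    3) echo "config error: unknown pack, malformed spec, or similar" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Typical use in a CI step (requires waza on PATH):
#   code=0
#   waza adversarial --packs prompt-injection,scope-bypass || code=$?
#   describe_adversarial_exit "$code"
#   exit "$code"
```

Capturing the code with `|| code=$?` keeps the step alive under `set -e` long enough to print the message before the original exit code is passed on.
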
CI integration

    - name: Run adversarial harness
      run: waza adversarial --packs prompt-injection,scope-bypass --output adversarial.json

    - name: Upload adversarial results
      if: always()
      uses: actions/upload-artifact@v4
      with:
        name: adversarial-results
        path: adversarial.json

For PRs that legitimately need to land before a fix, downgrade to warn temporarily:

    waza adversarial --on-unsafe-outcome warn --output adversarial.json

Determinism
Adversarial packs are pure-input adversarial prompts; they do not depend on the network. Pair them with waza run --snapshot / waza replay on your skill to detect drift in refusal behavior even when the live agent has changed: capture a green run with snapshots enabled, then replay later for a pure-offline diff.
Schema reference
AdversarialConfig (schema 1.2 additive) on EvalSpec:

| Field | Type | Required | Notes |
|---|---|---|---|
| `packs` | `[]string` | yes | Subset of built-in pack names |
| `on_unsafe_outcome` | `"fail" \| "warn"` | no | Defaults to `fail` |

Validation:
- Empty `packs` list is rejected
- Unknown pack names are rejected against the embedded catalog
- The `adversarial:` block requires `schemaVersion: "1.2"` or later
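
The first two validation rules can be mirrored in a local pre-flight check, for example before a long CI job starts. This is a sketch under stated assumptions, not waza's validator: `validate_packs` is a hypothetical name, the catalog is hard-coded to the two pack names listed in this guide, and the return value 3 simply echoes the documented configuration-error exit code.

```shell
# Hypothetical pre-flight check mirroring the documented validation rules.
# Assumption: the catalog is exactly the two built-in packs named in this guide.
# $1 = comma-separated pack list, e.g. "prompt-injection,scope-bypass"
validate_packs() {
  rest="$1,"          # trailing comma lets the loop consume the last item
  count=0
  while [ -n "$rest" ]; do
    pack="${rest%%,*}"   # text before the first comma
    rest="${rest#*,}"    # text after the first comma
    [ -n "$pack" ] || continue
    case "$pack" in
      prompt-injection|scope-bypass)
        count=$((count + 1)) ;;
      *)
        echo "error: unknown pack: $pack" >&2
        return 3 ;;
    esac
  done
  if [ "$count" -eq 0 ]; then
    echo "error: packs list is empty" >&2
    return 3
  fi
  return 0
}
```

Calling `validate_packs "prompt-injection,scope-bypass"` returns 0, while an empty list or a name outside the hard-coded catalog returns 3 with a message on stderr.
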
What’s next
The current packs are deliberately small and review-able. The harness is designed so future releases can grow the catalog (data-exfil, jailbreaks, multi-turn social engineering) without changing the spec or CLI surface. Custom user-provided packs are on the roadmap; track issue #365 for updates.