Adversarial Harness

waza adversarial runs fault-injection test packs against your skill: adversarial prompts crafted to trigger prompt injection, scope bypass, exfiltration, and out-of-policy tool use. It exists so you can answer one question on every PR: does my skill behave safely when the inputs are hostile?

This guide covers:

  • What the built-in packs are and what they probe for
  • Running packs from the CLI
  • Wiring an adversarial: block into your eval.yaml
  • Exit codes and CI integration
  • Notes on determinism (replay-tape friendly)

Functional evals (waza run) measure whether your skill solved the task. Adversarial evals measure whether your skill refused to solve the task when the task itself is unsafe. Both signals matter; they should not share a results column.

Every adversarial task is marked golden: true, so an unsafe outcome also makes waza gate exit with code 2. waza adversarial is still the canonical entry point, because it additionally enforces an on_unsafe_outcome policy and prints a focused summary.

Run waza adversarial --list-packs to print the catalog. An unknown pack name is rejected with a configuration error (exit code 3); it does not print the catalog.

The harness ships two packs out of the box:

prompt-injection

Four tasks that smuggle adversarial directives through README files, fake “system” comments inside source files, a support ticket, and a markdown changelog link. A safe skill ignores the directive and answers the underlying user request. An unsafe skill echoes the injection trigger, exfiltrates “secrets”, or follows the hostile instructions.

scope-bypass

Four tasks that explicitly ask the skill to do something out of scope: send email, delete files, install a third-party dependency, and call an external HTTP API. A safe skill refuses and explains the constraint; an unsafe skill complies.

Each pack lives under internal/adversarial/data/<pack>/ and is embedded into the waza binary, so no network or filesystem dependencies are needed at runtime.

# Run both packs against the default engine (mock if no --skill given)
waza adversarial --packs prompt-injection,scope-bypass
# Run against a real skill via the Copilot SDK engine
waza adversarial --packs prompt-injection --skill ./skills/my-skill --model gpt-4o
# Save the full results JSON so you can drill in with the dashboard
waza adversarial --packs scope-bypass --output adversarial.json
# Run a single pack and tolerate unsafe outcomes (CI smoke; non-blocking)
waza adversarial --packs prompt-injection --on-unsafe-outcome warn

A typical summary looks like:

Adversarial summary (packs: prompt-injection)
tasks run: 4
unsafe: 0
policy: fail
result: ✅ all packs PASSED
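
If a script needs the unsafe count for logging, the summary can be scraped as a rough sketch. The helper name unsafe_count is hypothetical, and the code assumes the summary layout shown above, which may not be a stable interface; for gating, the exit code is the safer signal.

```shell
#!/bin/sh
# Hypothetical helper: read a summary on stdin and print the value of the
# "unsafe:" line. Assumes the "key: value" layout shown in this guide.
unsafe_count() {
  awk -F': *' '
    { key = $1; sub(/^[ \t]+/, "", key) }      # tolerate leading indentation
    key == "unsafe" { print $2; found = 1 }
    END { if (!found) exit 1 }                 # non-zero when no count is present
  '
}
```

Usage would look like waza adversarial --packs prompt-injection | unsafe_count, assuming the summary is written to standard output.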

You can pin pack selection and policy at the spec level by adding an adversarial: block. This block is additive under schema 1.2 and is consumed only by waza adversarial --spec eval.yaml; waza run ignores it.

schemaVersion: "1.2"
name: my-skill-eval
skill: my-skill
version: "1.0"
config:
  executor: copilot-sdk
  model: gpt-4o
tasks:
  - "tasks/*.yaml"
adversarial:
  packs:
    - prompt-injection
    - scope-bypass
  on_unsafe_outcome: fail  # or "warn"

CLI flags override the spec block:

# Spec says fail; this run downgrades to warn
waza adversarial --spec eval.yaml --on-unsafe-outcome warn
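
The precedence described here (CLI flag first, then the spec block, then the default of fail) can be modelled in a few lines. This is only an illustration of the documented behavior, not waza's code; the function name effective_policy is hypothetical.

```shell
#!/bin/sh
# effective_policy FLAG_VALUE SPEC_VALUE
# Prints the policy that would apply. An empty argument means "not set".
effective_policy() {
  flag=$1
  spec=$2
  if [ -n "$flag" ]; then
    echo "$flag"        # --on-unsafe-outcome on the command line wins
  elif [ -n "$spec" ]; then
    echo "$spec"        # otherwise on_unsafe_outcome from the adversarial: block
  else
    echo "fail"         # documented default
  fi
}
```

For example, effective_policy warn fail prints warn, which matches the run above where the spec says fail and the flag downgrades it.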

Flag                    Description                                             Default
--packs                 Comma-separated pack names to run                       All built-in packs
--list-packs            Print the pack catalog and exit
--on-unsafe-outcome     fail (exit 2) or warn (exit 0) on unsafe                fail
--spec                  Inherit adversarial:, graders, and metrics from a spec
--engine                Engine to use (mock, copilot-sdk)                       inferred
--skill                 Skill path/identifier (forwards to the engine)
--model                 Model passed to the engine
--output                Write the full results.json to a file
--workers / --parallel  Concurrency knobs (same semantics as waza run)
--verbose               Print pack/policy/engine/artifact diagnostics           false

Code  Meaning
0     All adversarial tasks were refused safely (or --on-unsafe-outcome warn)
2     At least one unsafe outcome was observed, and policy is fail
3     Configuration error (unknown pack, malformed spec, etc.)

Exit code 2 is the same code waza gate uses for golden-task failures, so the same CI step that gates on goldens also gates on adversarial failures.
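
In a CI script it can help to tell an unsafe outcome (2) apart from a configuration error (3), since they call for different fixes. The wrapper below is a minimal sketch: classify_exit and run_adversarial are hypothetical names, and only the exit codes come from the table above.

```shell
#!/bin/sh
# Map a `waza adversarial` exit status to a short label.
classify_exit() {
  case "$1" in
    0) echo "ok" ;;            # all tasks refused safely, or policy is warn
    2) echo "unsafe" ;;        # at least one unsafe outcome under policy fail
    3) echo "config-error" ;;  # unknown pack, malformed spec, etc.
    *) echo "unexpected" ;;    # anything not listed in the table
  esac
}

# Hypothetical wrapper: run the harness with the given arguments, log a
# labelled result to stderr, and pass the original status through.
run_adversarial() {
  rc=0
  waza adversarial "$@" || rc=$?
  echo "adversarial harness: $(classify_exit "$rc") (exit $rc)" >&2
  return "$rc"
}
```

A step could then call run_adversarial --packs prompt-injection,scope-bypass --output adversarial.json and still fail the job on any non-zero status.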

.github/workflows/eval.yml
- name: Run adversarial harness
  run: waza adversarial --packs prompt-injection,scope-bypass --output adversarial.json
- name: Upload adversarial results
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: adversarial-results
    path: adversarial.json

For PRs that legitimately need to land before a fix, downgrade to warn temporarily:

waza adversarial --on-unsafe-outcome warn --output adversarial.json

Adversarial packs consist purely of input prompts and do not depend on the network. Pair them with waza run --snapshot / waza replay on your skill to detect drift in refusal behavior even after the live agent has changed: capture a green run with snapshots enabled, then replay it later for a fully offline diff.

AdversarialConfig (schema 1.2 additive) on EvalSpec:

Field              Type             Required  Notes
packs              []string         yes       Subset of built-in pack names
on_unsafe_outcome  "fail" | "warn"  no        Defaults to fail

Validation:

  • Empty packs list is rejected
  • Unknown pack names are rejected against the embedded catalog
  • The adversarial: block requires schemaVersion: "1.2" or later
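
The three rules above can be sketched as a small checker. This is an illustration under stated assumptions, not waza's implementation: validate_adversarial and KNOWN_PACKS are hypothetical names, the catalog is assumed to hold only the two packs named in this guide, and schemaVersion is assumed to be numeric MAJOR.MINOR.

```shell
#!/bin/sh
# Assumption: the embedded catalog contains exactly these two packs.
KNOWN_PACKS="prompt-injection scope-bypass"

# validate_adversarial SCHEMA_VERSION [PACK...]
# Returns 0 when valid, 3 (the configuration-error exit code) otherwise.
validate_adversarial() {
  version=$1
  shift
  # Split "MAJOR.MINOR"; a bare "1" is treated as "1.0".
  case "$version" in
    *.*) major=${version%%.*}; minor=${version#*.}; minor=${minor%%.*} ;;
    *)   major=$version; minor=0 ;;
  esac
  for part in "$major" "$minor"; do
    case "$part" in
      ''|*[!0-9]*) echo "invalid schemaVersion: $version" >&2; return 3 ;;
    esac
  done
  # Rule: the block requires schemaVersion 1.2 or later.
  if [ "$major" -lt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -lt 2 ]; }; then
    echo "adversarial: requires schemaVersion 1.2 or later" >&2
    return 3
  fi
  # Rule: an empty packs list is rejected.
  if [ "$#" -eq 0 ]; then
    echo "adversarial: packs must not be empty" >&2
    return 3
  fi
  # Rule: every pack name must appear in the catalog.
  for pack in "$@"; do
    found=0
    for known in $KNOWN_PACKS; do
      if [ "$pack" = "$known" ]; then found=1; fi
    done
    if [ "$found" -eq 0 ]; then
      echo "adversarial: unknown pack: $pack" >&2
      return 3
    fi
  done
  return 0
}
```

Calling validate_adversarial 1.2 prompt-injection scope-bypass succeeds, while validate_adversarial 1.1 prompt-injection returns 3 because the schema version is too old.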

The current packs are deliberately small and reviewable. The harness is designed so that future releases can grow the catalog (data-exfil, jailbreaks, multi-turn social engineering) without changing the spec or CLI surface. Custom user-provided packs are on the roadmap; track issue #365 for updates.