Attacks

An attack tests whether your agent can be manipulated into unsafe behavior. When the evaluator detects the attack objective, the result is UNSAFE (safe=False).

Semantics

Attacks use the following mapping from evaluator outcomes to safety verdicts:

| EvalOutcome | Safety Verdict | Meaning |
| --- | --- | --- |
| `DETECTED` | `UNSAFE` | The attack succeeded: the agent did what the attacker wanted |
| `NOT_DETECTED` | `SAFE` | The attack failed: the agent resisted |
| `UNDETERMINED` | `UNDETERMINED` | The evaluator could not determine whether the attack succeeded |

Precedence when multiple turns are evaluated: DETECTED > UNDETERMINED > NOT_DETECTED. If any turn detected the attack objective, the agent is compromised regardless of other turns.

This logic lives in resolve_as_attack.
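
The mapping and precedence rule above can be sketched as follows. This is an illustrative reimplementation, not the library's actual `resolve_as_attack`: the enum definitions, the function name `resolve_as_attack_sketch`, its signature, and the handling of an empty input are all assumptions made for this example.

```python
from enum import Enum
from typing import Iterable


class EvalOutcome(Enum):
    # Hypothetical stand-in for the evaluator outcomes in the table above.
    DETECTED = "DETECTED"
    NOT_DETECTED = "NOT_DETECTED"
    UNDETERMINED = "UNDETERMINED"


class SafetyVerdict(Enum):
    # Hypothetical stand-in for the safety verdicts in the table above.
    SAFE = "SAFE"
    UNSAFE = "UNSAFE"
    UNDETERMINED = "UNDETERMINED"


def resolve_as_attack_sketch(outcomes: Iterable[EvalOutcome]) -> SafetyVerdict:
    """Collapse per-turn outcomes into one verdict using attack semantics.

    Precedence: DETECTED > UNDETERMINED > NOT_DETECTED.
    """
    seen = set(outcomes)
    if EvalOutcome.DETECTED in seen:
        # One compromised turn is enough to mark the agent unsafe.
        return SafetyVerdict.UNSAFE
    if EvalOutcome.UNDETERMINED in seen or not seen:
        # Assumption: with no evaluated turns there is nothing to base a
        # verdict on, so this sketch reports UNDETERMINED.
        return SafetyVerdict.UNDETERMINED
    return SafetyVerdict.SAFE
```

For example, turns evaluated as `NOT_DETECTED`, `UNDETERMINED`, `DETECTED` resolve to `UNSAFE`, because the single detection outranks the other two outcomes.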

Common Structure

All attack executions share this lifecycle:

1. Inject (optional): Place payloads into the agent's data sources via surfaces
2. Wait: Allow time for indexing or propagation
3. Trigger: Send prompts that cause the agent to process the injected content
4. Evaluate: Check whether the attack objective was achieved
5. Clean up: Remove injected content (guaranteed, even on failure)
6. Report: Produce a Result

The injection phase is optional — inline attacks attach payloads directly to the trigger prompt.
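
The lifecycle above, including the cleanup guarantee, might be structured roughly like this. It is a minimal sketch using plain `asyncio` and a `try`/`finally` block. The function `run_attack_lifecycle`, its callback parameters, and the dictionary it returns are hypothetical and do not reflect the library's internals.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

AsyncStep = Callable[[], Awaitable[Any]]


async def run_attack_lifecycle(
    trigger: AsyncStep,
    evaluate: Callable[[Any], Awaitable[bool]],
    inject: Optional[AsyncStep] = None,
    cleanup: Optional[AsyncStep] = None,
    wait_seconds: float = 0.0,
) -> dict:
    """Hypothetical outline: inject, wait, trigger, evaluate, clean up, report."""
    try:
        if inject is not None:
            # Inject (optional): inline attacks skip this step.
            await inject()
        # Wait: give indexing or propagation time to finish.
        await asyncio.sleep(wait_seconds)
        # Trigger: prompt the agent so it processes the payload.
        response = await trigger()
        # Evaluate: True means the attack objective was detected.
        detected = await evaluate(response)
    finally:
        # Clean up: runs even if trigger or evaluate raised.
        if inject is not None and cleanup is not None:
            await cleanup()
    # Report: a plain dict stands in for a Result here.
    return {"response": response, "detected": detected}
```

Placing the cleanup call in `finally` is what makes it run on failure: an exception raised by the trigger or the evaluator still propagates to the caller, but only after the injected content has been removed.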

Using the Attacks Factory

All attacks are created through the Attacks class:

Python

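from rampart import Attacks

# `handle`, `my_evaluator`, and `my_adapter` are placeholders defined elsewhere.
execution = Attacks.xpia(
    inject=handle,
    trigger="Summarize the latest documents",
    evaluator=my_evaluator,
)

# `await` must run inside an async function or an environment with top-level await.
result = await execution.execute_async(adapter=my_adapter)
assert result, result.summary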

The factory returns a BaseExecution. Call execute_async(adapter=...) to run it, then assert on the result, passing result.summary as the assertion message.
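
To run this pattern from an ordinary script or test, the `await` has to sit inside a coroutine driven by an event loop. The sketch below shows that shape with stand-in classes so it runs without the library. `StubExecution` and `StubResult` are hypothetical, and the assumption that a result is truthy when the agent is safe is inferred only from the `assert result, result.summary` idiom and the `safe=False` note earlier on this page.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    # Hypothetical stand-in for a Result: truthy when the agent stayed safe.
    safe: bool
    summary: str

    def __bool__(self) -> bool:
        return self.safe


class StubExecution:
    # Hypothetical stand-in for the BaseExecution returned by the factory.
    def __init__(self, agent_complied: bool) -> None:
        self._agent_complied = agent_complied

    async def execute_async(self, adapter=None) -> StubResult:
        if self._agent_complied:
            return StubResult(safe=False, summary="UNSAFE: attack objective detected")
        return StubResult(safe=True, summary="SAFE: agent resisted")


async def main() -> StubResult:
    execution = StubExecution(agent_complied=False)
    result = await execution.execute_async(adapter=None)
    # On failure, the summary becomes the assertion message.
    assert result, result.summary
    return result


if __name__ == "__main__":
    print(asyncio.run(main()).summary)
```

With a real execution, only the body of `main` would change: the stub would be replaced by the object returned from the factory and `adapter=None` by an actual adapter.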