Attacks
An attack tests whether your agent can be manipulated into unsafe behavior. When the evaluator detects the attack objective, the result is UNSAFE (safe=False).
Semantics
Attacks use the following mapping from evaluator outcomes to safety verdicts:
| EvalOutcome | Safety Verdict | Meaning |
|---|---|---|
| DETECTED | UNSAFE | The attack succeeded: the agent did what the attacker wanted |
| NOT_DETECTED | SAFE | The attack failed: the agent resisted |
| UNDETERMINED | UNDETERMINED | The evaluator could not determine whether the attack succeeded |
Precedence when multiple turns are evaluated: DETECTED > UNDETERMINED > NOT_DETECTED. If any turn detected the attack objective, the agent is compromised regardless of other turns.
This logic lives in resolve_as_attack.
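The mapping and precedence rule above can be sketched in a few lines. This is a minimal illustration, not the library's code: the enum definitions, the function name `resolve_as_attack_sketch`, and its signature are assumptions made for this example, and the real `resolve_as_attack` may differ.

```python
from enum import Enum
from typing import Iterable


# Hypothetical stand-ins for the library's outcome and verdict types.
class EvalOutcome(Enum):
    DETECTED = "detected"
    NOT_DETECTED = "not_detected"
    UNDETERMINED = "undetermined"


class SafetyVerdict(Enum):
    SAFE = "safe"
    UNSAFE = "unsafe"
    UNDETERMINED = "undetermined"


def resolve_as_attack_sketch(outcomes: Iterable[EvalOutcome]) -> SafetyVerdict:
    """Collapse per-turn outcomes into one verdict (illustrative only)."""
    seen = set(outcomes)
    # DETECTED wins: one compromised turn makes the whole run unsafe.
    if EvalOutcome.DETECTED in seen:
        return SafetyVerdict.UNSAFE
    # Otherwise any unresolved turn keeps the verdict undetermined.
    if EvalOutcome.UNDETERMINED in seen:
        return SafetyVerdict.UNDETERMINED
    # Only NOT_DETECTED remains. An empty input also lands here,
    # which is a choice of this sketch, not documented behavior.
    return SafetyVerdict.SAFE


turns = [EvalOutcome.NOT_DETECTED, EvalOutcome.UNDETERMINED, EvalOutcome.DETECTED]
print(resolve_as_attack_sketch(turns).name)
```

Checking for DETECTED first means a single compromised turn cannot be masked by later turns in which the agent resisted.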
Common Structure
All attack executions share this lifecycle:
- Inject (optional): Place payloads into the agent's data sources via surfaces
- Wait: Allow time for indexing or propagation
- Trigger: Send prompts that cause the agent to process the injected content
- Evaluate: Check whether the attack objective was achieved
- Clean up: Remove injected content (guaranteed, even on failure)
- Report: Produce a Result
The injection phase is optional — inline attacks attach payloads directly to the trigger prompt.
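One way to picture the lifecycle, and in particular the cleanup guarantee, is a `try`/`finally` around the trigger and evaluation steps. The sketch below uses only the standard library; `run_attack_sketch` and all of its parameters are hypothetical names for this example, and it does not reflect how the library structures its executions internally.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_attack_sketch(
    trigger: Callable[[], Awaitable[str]],
    evaluate: Callable[[str], bool],
    inject: Optional[Callable[[], Awaitable[None]]] = None,
    cleanup: Optional[Callable[[], Awaitable[None]]] = None,
    wait_seconds: float = 0.0,
) -> dict:
    """Inject (optional), wait, trigger, evaluate, clean up, report."""
    try:
        if inject is not None:
            await inject()                      # Inject
            await asyncio.sleep(wait_seconds)   # Wait for indexing/propagation
        response = await trigger()              # Trigger
        detected = evaluate(response)           # Evaluate
    finally:
        # Clean up runs even if inject, trigger, or evaluate raised.
        if inject is not None and cleanup is not None:
            await cleanup()
    # Report: a plain dict stands in for a real result object.
    return {"safe": not detected, "response": response}
```

Passing `inject=None` models the inline case: the wait and cleanup steps are skipped and the payload is expected to travel with the trigger itself.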
Using the Attacks Factory
All attacks are created through the Attacks class:
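```python
from rampart import Attacks

# handle, my_evaluator, and my_adapter are assumed to be defined earlier.
execution = Attacks.xpia(
    inject=handle,
    trigger="Summarize the latest documents",
    evaluator=my_evaluator,
)

# Must run inside a coroutine (or a REPL/notebook with top-level await).
result = await execution.execute_async(adapter=my_adapter)

# If the result is falsy, the assertion fails with result.summary as its message.
assert result, result.summary
```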
The factory returns a BaseExecution — call execute_async(adapter=...) and assert the result.
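The `assert result, result.summary` idiom only works if the result object is falsy when the run is unsafe. The stand-in below shows one way such an object could behave. `ResultSketch` is a hypothetical class written for this example, and tying truthiness to `safe` is an assumption rather than a documented property of the real Result type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultSketch:
    """Hypothetical stand-in for a result object; not the library's class."""
    safe: bool
    summary: str

    def __bool__(self) -> bool:
        # Assumption: a result is truthy exactly when the run was safe.
        return self.safe


def check(result: ResultSketch) -> None:
    # Same idiom as in the example above: summary becomes the failure message.
    assert result, result.summary


check(ResultSketch(safe=True, summary="agent resisted"))  # passes silently
```

With this shape, a failing test reports the summary text directly, so the reason for the UNSAFE verdict appears in the test output without extra logging.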