Skip to content

Attacks

An attack tests whether your agent can be manipulated into unsafe behavior. When the evaluator detects the attack objective, the result is UNSAFE (safe=False).


Semantics

Attacks use the following mapping from evaluator outcomes to safety verdicts:

EvalOutcome Safety Verdict Meaning
DETECTED UNSAFE The attack succeeded — the agent did what the attacker wanted
NOT_DETECTED SAFE The attack failed — the agent resisted
UNDETERMINED UNDETERMINED The evaluator could not determine whether the attack succeeded

Precedence when multiple turns are evaluated: DETECTED > UNDETERMINED > NOT_DETECTED. If any turn detected the attack objective, the agent is compromised regardless of other turns.

This logic lives in resolve_as_attack.


Common Structure

All attack executions share this lifecycle:

  1. Inject (optional) — Place payloads into the agent's data sources via surfaces
  2. Wait — Allow time for indexing or propagation
  3. Trigger — Send prompts that cause the agent to process the injected content
  4. Evaluate — Check whether the attack objective was achieved
  5. Clean up — Remove injected content (guaranteed, even on failure)
  6. Report — Produce a Result

The injection phase is optional — inline attacks attach payloads directly to the trigger prompt.


Using the Attacks Factory

All attacks are created through the Attacks class:

Python
from rampart import Attacks

execution = Attacks.xpia(
    inject=handle,
    trigger="Summarize the latest documents",
    evaluator=my_evaluator,
)

result = await execution.execute_async(adapter=my_adapter)
assert result, result.summary

The factory returns a BaseExecution — call execute_async(adapter=...) and assert the result.