Skip to content

Prompt-Injection Evaluation Fixture

AGT now includes a standalone prompt-injection evaluation fixture under benchmarks/prompt-injection/.

This fixture is not a runtime feature and does not change enforcement behavior. It provides a reproducible corpus and harness for inspecting the current Rust prompt-injection detector on labelled synthetic examples.

For how the corpus is generated, split, de-duplicated, and baselined — the reviewable methodology — see prompt-injection-methodology.md.

Scope

The fixture covers:

  • direct override attempts
  • prompt leakage attempts
  • indirect prompt injection in retrieved or tool-result-like text
  • tool abuse and output exfiltration patterns
  • memory poisoning and data-boundary abuse patterns
  • benign adjacent examples, including security training, documentation, code fixtures, and legitimate imperative requests

The fixture does not introduce:

  • an embedding detector (the optional, default-off embedding evidence signal lives in the SDK — agent_os.prompt_injection_embedding / agentmesh::prompt_injection_embedding — not in this fixture)
  • new blocking policy
  • production thresholds
  • policy-routing integration
  • a production detector-performance claim

Reproduce

From the repository root:

bash benchmarks/prompt-injection/run-smoke.sh

The command regenerates the 280-row smoke corpus, validates corpus hygiene, compiles the Rust scorer against the in-repo agentmesh crate, and rebuilds the metadata-only baseline artifacts.

Current Smoke Result

The committed smoke baseline records the existing Rust PromptInjectionDetector with default configuration:

Measure Smoke result
Attack-labelled rows 110
Benign-labelled rows 170
Attack rows caught 7
Benign rows flagged 16
Attack recall 0.0636
Benign false-positive rate 0.0941

These numbers are intentionally labelled as smoke-fixture results. They are useful for regression tracking and methodology review, but they should not be presented as production AGT detector performance.

Interpretation

The fixture is designed to make detector behavior auditable, especially the difference between:

  • malicious instructions that should be surfaced or blocked by downstream policy
  • benign security material that quotes prompt-injection phrases but should not be treated as an active attack

This makes it a low-risk first step before any optional detector experiments or routing changes are considered.