ADR 0031: Optional embedding evidence backend for prompt-injection detection
- Status: proposed
- Date: 2026-06-13
Context
The rules-based PromptInjectionDetector (Python agent_os and Rust agentmesh) is deliberately high-precision / low-recall: it catches obvious patterns but misses disguised or semantically novel injection. An optional embedding/kNN signal already exists in both SDKs (prompt_injection_embedding) — a local, default-off nearest-neighbour margin against a labelled exemplar bank — but it was not connected to the detection pipeline. Connecting that signal was the remaining gap discussed in #2918.
Two constraints shaped the design:
- The deterministic detector's behaviour must not change by default, and the embedding signal must never block on its own — governance/policy decides any action (consistent with the project's "controls are deterministic, models are evidence" posture).
- Content normalization (RFC #2957, PR #2991) sits upstream of any detector or backend and is unchanged here.
Decision
Introduce a pluggable, default-off evidence backend following the pluggable-backend pattern of ADR-0015:
- A small backend interface, DetectionEvidenceBackend (a Python Protocol and a Rust trait), with a stable name and an evaluate(text) that returns an advisory EvidenceSignal or nothing.
- PromptInjectionDetector consults registered backends only after the deterministic verdict is computed, and appends their EvidenceSignals to a new additive DetectionResult.evidence field. Evidence never influences is_injection / threat_level / injection_type / confidence / matched_patterns, and EvidenceSignal.blocks is always false.
- The evidence-only invariants are enforced, not conventional: blocks=true and a non-finite (NaN/inf) score are rejected at the boundary (Python raises in __post_init__; Rust forces blocks=false and drops a non-finite score to a non_finite_score error code). A backend that raises (or, in Rust, panics, e.g. the embedding signal's cosine() asserting on a dimension mismatch) is caught and recorded as a static backend_error code, so a misbehaving backend can never alter the verdict or break detection.
- EmbeddingSignalBackend adapts the existing prompt_injection_embedding kNN signal to this interface. It is inert unless explicitly enabled, so the embedding model/runtime remains an optional dependency.
- Backends are registered explicitly (Python evidence_backends=, Rust with_evidence_backends(...)). With none registered, detect() output is byte-identical to the rules-only path.
- In Rust, DetectionResult is marked #[non_exhaustive] so the additive evidence field is a non-breaking change.
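The Python side of this decision could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the SDK's actual code: the EvidenceSignal field names (backend, score, error), the single-pattern _rules_verdict stand-in, and the two demo backends are hypothetical, and DetectionResult is reduced to is_injection and evidence.

```python
import math
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass(frozen=True)
class EvidenceSignal:
    """Advisory evidence. Field names here are assumptions for illustration."""
    backend: str
    score: Optional[float] = None
    error: Optional[str] = None
    blocks: bool = False

    def __post_init__(self) -> None:
        # Evidence-only invariants are rejected at the boundary.
        if self.blocks:
            raise ValueError("evidence signals must never block")
        if self.score is not None and not math.isfinite(self.score):
            raise ValueError("evidence score must be finite")


class DetectionEvidenceBackend(Protocol):
    name: str

    def evaluate(self, text: str) -> Optional[EvidenceSignal]: ...


@dataclass
class DetectionResult:
    is_injection: bool
    evidence: list[EvidenceSignal] = field(default_factory=list)


class PromptInjectionDetector:
    def __init__(self, evidence_backends=()):
        self._backends = list(evidence_backends)

    def _rules_verdict(self, text: str) -> bool:
        # Hypothetical stand-in for the deterministic rules.
        return "ignore previous instructions" in text.lower()

    def detect(self, text: str) -> DetectionResult:
        # The verdict is fixed before any backend is consulted.
        result = DetectionResult(is_injection=self._rules_verdict(text))
        for backend in self._backends:
            try:
                signal = backend.evaluate(text)
            except Exception:
                # A failing backend becomes a static error code only.
                signal = EvidenceSignal(backend=backend.name, error="backend_error")
            if signal is not None:
                result.evidence.append(signal)
        return result


class FixedScoreBackend:
    """Hypothetical backend that always reports the same score."""
    name = "fixed_score"

    def evaluate(self, text: str) -> Optional[EvidenceSignal]:
        return EvidenceSignal(backend=self.name, score=0.42)


class BrokenBackend:
    """Hypothetical backend that always raises."""
    name = "broken"

    def evaluate(self, text: str) -> Optional[EvidenceSignal]:
        raise RuntimeError("dimension mismatch")


detector = PromptInjectionDetector(
    evidence_backends=[FixedScoreBackend(), BrokenBackend()]
)
result = detector.detect("What is the weather today?")
```

In this sketch the verdict is computed before the loop and never reassigned, so evidence can only be appended. A backend that tries to return blocks=True fails inside its own evaluate() call and is recorded as backend_error like any other exception.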
EvidenceSignal carries only a static backend identifier, a numeric score, and a static error code — never raw input or input-derived text — so the audit surface stays hash/ID-only. The raw numeric score is additionally stripped from the durable audit copy: a continuous per-request score is an evasion oracle (anyone with audit-log access could watch the margin move and tune a payload toward a lower score), so only backend identity and error codes are persisted. The live DetectionResult returned to the caller keeps raw scores for in-process telemetry and aggregation.
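The split between the live result and the durable audit copy can be sketched as follows. The to_audit_record helper, the dictionary shape, and the EvidenceSignal field names are assumptions made for illustration; the ADR specifies only that the score is stripped and that backend identity and error codes are persisted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceSignal:
    """Reduced signal type; field names are assumptions for illustration."""
    backend: str
    score: Optional[float] = None
    error: Optional[str] = None


def to_audit_record(signal: EvidenceSignal) -> dict:
    """Build the durable audit copy: backend identity and error code only."""
    # The continuous score is deliberately omitted so the audit log
    # cannot be used to tune a payload toward a lower margin.
    return {"backend": signal.backend, "error": signal.error}


live = EvidenceSignal(backend="prompt_injection_embedding", score=0.37)
audit = to_audit_record(live)
```

The live object still carries the score for in-process telemetry, while the audit record has no score key at all.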
Consequences
- The pipeline gains an optional, auditable recall signal for review/routing without any change to default behaviour, false-positive profile, or blocking.
- Adding another evidence backend (e.g. a classifier) means implementing the two-method interface and registering it, with no detector changes.
- Cross-SDK parity is preserved: the Python Protocol and Rust trait are symmetric, with matching invariants and tests.
- Trade-off: evidence is advisory only; turning it into an enforced control is a separate, explicit policy/governance decision. A model-backed backend adds an optional dependency only when enabled.
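To illustrate how little a new backend would need, the sketch below defines a toy backend that satisfies the interface purely by shape. ToyClassifierBackend, its marker word list, and its scoring rule are hypothetical stand-ins for a real classifier, and the EvidenceSignal field names are assumed, not taken from the SDK.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class EvidenceSignal:
    """Reduced signal type; field names are assumptions for illustration."""
    backend: str
    score: Optional[float] = None
    error: Optional[str] = None


@runtime_checkable
class DetectionEvidenceBackend(Protocol):
    name: str

    def evaluate(self, text: str) -> Optional[EvidenceSignal]: ...


class ToyClassifierBackend:
    """Hypothetical stand-in for a model-backed classifier."""
    name = "toy_classifier"
    _MARKERS = frozenset({"ignore", "override", "disregard"})

    def evaluate(self, text: str) -> Optional[EvidenceSignal]:
        tokens = text.lower().split()
        if not tokens:
            return None  # nothing to score, so no evidence is emitted
        hits = sum(token in self._MARKERS for token in tokens)
        # Score is the fraction of tokens that match a marker word.
        return EvidenceSignal(backend=self.name, score=hits / len(tokens))


backend = ToyClassifierBackend()
signal = backend.evaluate("ignore this")
```

Because the interface is a Protocol, the class needs no base class or import from the detector; it only has to expose name and evaluate(text).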
References
- ADR-0015 (pluggable external policy backends): the pattern this follows.
- #2918: the embedding-signal proposal and the "connect it to the pipeline" gap.
- RFC #2957 / PR #2991: content normalization, upstream of any backend.
- docs/benchmarks/prompt-injection-methodology.md: the evidence-only, default-off methodology and the kNN-margin definition.