OWASP Top 10 for LLM Applications: Coverage Mapping
Disclaimer: This document is an internal self-assessment mapping, NOT a validated certification or third-party audit. It documents how the toolkit's capabilities align with the referenced standard. Organizations must perform their own compliance assessments with qualified auditors.
Mapping Version: 1.0
OWASP Reference: OWASP Top 10 for LLM Applications (2025)
Toolkit Version: v1.1.0
Last Updated: April 2026
Edition note: This mapping uses the risk categories from the OWASP Top 10 for LLM Applications v1.1 (2023) as specified in Issue #697. The 2025 revision renumbers several risks (e.g., Sensitive Information Disclosure moved from LLM06 to LLM02), renames others (Model Theft → Unbounded Consumption, Overreliance → Misinformation), and introduces two new categories: System Prompt Leakage (LLM07) and Vector and Embedding Weaknesses (LLM08). Coverage notes for the new 2025 categories are included at the end of this document.
Executive Summary
The toolkit contains detection mechanisms for 9 of 10 LLM risks. However, 6 of 10 share a structural gap where detection modules exist as standalone utilities but are not wired into the BaseIntegration enforcement lifecycle. A single integration effort, adding optional auto-wiring controlled by GovernancePolicy flags, would close gaps across multiple risks simultaneously. The strongest enforcement is in plugin/tool security (MCPGateway) and execution privilege control (rings, kill switch). The widest gaps are in output sanitization and sensitive data protection.
Coverage Summary
| # | OWASP Risk | Coverage | Key Mechanism | Key Gap |
|---|---|---|---|---|
| LLM01 | Prompt Injection | Partial | 6 regex pattern groups + base64 decoding + MCP tool scanning | Regex-only; no semantic detection; opt-in, not default-wired |
| LLM02 | Insecure Output Handling | Partial | AST-based Python code validation + drift detection | post_execute() never blocks; Python-only; no text output sanitization |
| LLM03 | Training Data Poisoning | Partial | MemoryGuard for runtime memory stores | Training pipeline out of scope; MemoryGuard not wired into adapters |
| LLM04 | Model Denial of Service | Partial | Token/call/timeout limits + concurrency semaphore + circuit breakers | TokenBudgetTracker advisory-only; RateLimiter not wired; no payload size limits |
| LLM05 | Supply Chain Vulnerabilities | Partial | SBOM + Ed25519 signing + MCP fingerprinting + ContentHashInterceptor | SupplyChainGuard reporting-only; signing opt-in |
| LLM06 | Sensitive Information Disclosure | Partial | PII patterns in MCP gateway + secret detection in codegen + egress policy | Only 2 PII patterns; no output text filtering; audit-log PII minimization remains incomplete |
| LLM07 | Insecure Plugin Design | Partial | MCPGateway 5-stage pipeline + rug-pull detection + schema abuse scanning | JSON Schema composition ($ref/oneOf) unexamined; gateway and scanner disconnected |
| LLM08 | Excessive Agency | Partial | Execution rings + kill switch + rogue detection + scope guard | Kill switch manual-only; detection modules advisory, not auto-wired to enforcement |
| LLM09 | Overreliance | Partial | Drift detection + confidence threshold + adversarial evaluator | No fact-checking; confidence attribute never provided by frameworks |
| LLM10 | Model Theft | Gap | N/A | Out of scope: toolkit wraps LLM APIs, does not host models |
Result: 0 fully mitigated, 9 partially covered, 1 out-of-scope gap.
Methodology
This mapping was produced using a structured multi-perspective analysis with independent redundancy at each stage:
- Discovery: Two independent code scans of all 4 packages, producing file:line citations for mitigations against each risk. Disagreements between scans flagged for deeper investigation.
- Adversarial validation: Every claimed mitigation subjected to bypass testing. Scope boundaries assessed for each gap.
- Compliance audit: All findings cross-validated for citation accuracy, evidence completeness, and missed sub-risks.
- Strategic review: Defense-in-depth assessment and enterprise readiness evaluation.
Evidence standard: line-level code citations + adversarial bypass testing.
Cross-Cutting Finding: Detection Without Enforcement
Six of ten risks share a structural pattern: detection mechanisms exist as standalone utilities but are not wired into the BaseIntegration lifecycle. Specifically:
| Module | What it does | What it doesn't do |
|---|---|---|
| PromptInjectionDetector | Scans text for injection patterns | Not called by any adapter's pre_execute() |
| TokenBudgetTracker | Tracks token usage, fires warning callbacks | Never blocks execution; is_exceeded flag unchecked |
| RateLimiter | Token-bucket rate limiting with allow() | Not wired into any adapter or interceptor |
| BoundedSemaphore | Concurrency limiter with backpressure | Not integrated into BaseIntegration.pre_execute() |
| ScopeGuard | Evaluates file/line count scope | Returns advisory strings; nothing checks the decision |
| SupplyChainGuard | Scans for supply chain risks | Returns findings; no blocking pipeline |
| MCPSecurityScanner | Detects tool poisoning, rug pulls, schema abuse | Results not consumed by MCPGateway decisions |
| post_execute() | Computes drift scores, emits events | Always returns (True, None); never blocks |
Note: These assessments apply to the BaseIntegration path used by most adapters. The MAF (Microsoft Agent Framework) adapter provides enforcement wiring for rogue detection, governance policy, and ring enforcement via its FunctionMiddleware pipeline, but this covers only the Semantic Kernel integration path.
Recommendation: A unified integration effort (adding optional auto-wiring in BaseIntegration.__init__() controlled by GovernancePolicy flags) would close gaps across LLM01, LLM02, LLM04, LLM06, and LLM07 simultaneously. Recommended interceptor ordering: rate limiting first, then scope guard, then content inspection (to avoid expensive regex on requests that would be rate-limited anyway).
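The recommended ordering can be sketched as a short-circuiting chain of checks. This is a minimal illustration under stated assumptions, not the toolkit's API: PolicyFlags, build_pipeline, run_pre_execute, and the check callables are hypothetical names introduced here, and the real GovernancePolicy fields may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A check returns (allowed, reason); reason is None when the request is allowed.
Check = Callable[[str], Tuple[bool, Optional[str]]]


@dataclass
class PolicyFlags:
    """Hypothetical opt-in flags; not the real GovernancePolicy fields."""
    rate_limiting: bool = True
    scope_guard: bool = True
    content_inspection: bool = True


def build_pipeline(flags: PolicyFlags, rate_limit: Check,
                   scope: Check, inspect: Check) -> List[Check]:
    """Assemble the enabled checks, cheapest first."""
    pipeline: List[Check] = []
    if flags.rate_limiting:
        pipeline.append(rate_limit)
    if flags.scope_guard:
        pipeline.append(scope)
    if flags.content_inspection:
        pipeline.append(inspect)
    return pipeline


def run_pre_execute(pipeline: List[Check], request: str) -> Tuple[bool, Optional[str]]:
    """Run checks in order and stop at the first denial."""
    for check in pipeline:
        allowed, reason = check(request)
        if not allowed:
            return False, reason
    return True, None
```

Because the loop returns on the first denial, a rate-limited request never reaches the regex-heavy content inspection step.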
Detailed Mapping
LLM01: Prompt Injection
Attackers manipulate LLM behavior through crafted inputs that override system instructions, either directly or via poisoned external content.
Coverage: Partial
Mitigations:
- PromptInjectionDetector (prompt_injection.py:147-197): 6 compiled regex pattern groups covering direct overrides, delimiters, role-play/jailbreak, context manipulation, multi-turn escalation, and encoding attacks. Configurable sensitivity (strict/balanced/permissive).
- Base64 payload decoding (prompt_injection.py:548-563): Decodes base64 candidates and checks for suspicious keywords (ignore, override, system, password, exec, eval, import os).
- Canary token detection (prompt_injection.py:595-612): Detects system prompt leakage via planted canary strings. CRITICAL threat level, confidence=1.0.
- MCP tool description scanning (mcp_security.py:557-603): Reuses PromptInjectionDetector on MCP tool descriptions to catch injection embedded in tool metadata.
- ConversationGuardian (conversation_guardian.py:83-107): Evasion resistance via normalize_text() with homoglyph/leetspeak/zero-width stripping.
- LlamaFirewall integration (llamafirewall.py:63-98): Chains Meta's LlamaFirewall with Agent OS detection in 4 modes (CHAIN_BOTH, VOTE_MAJORITY, etc.). Graceful fallback when LlamaFirewall is not installed.
- Blocked patterns (base.py:153): Supports substring, regex, and glob pattern blocking on tool arguments.
- Fail-closed (prompt_injection.py:358-373): Detection errors result in CRITICAL threat level with is_injection=True.
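The base64 decoding step described above might look roughly like the following. This is a sketch, not the code in prompt_injection.py: the keyword list is taken from the mitigation text, while the candidate regex, its 16-character minimum, and the function name find_encoded_keywords are assumptions made for illustration.

```python
import base64
import binascii
import re

# Keywords listed in the mitigation above.
SUSPICIOUS_KEYWORDS = ("ignore", "override", "system", "password",
                       "exec", "eval", "import os")

# Runs of base64 alphabet characters. The 16-character minimum is an
# assumption chosen to skip short ordinary words.
BASE64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def find_encoded_keywords(text: str) -> list[str]:
    """Decode base64-looking substrings and report suspicious keywords found."""
    hits = []
    for candidate in BASE64_CANDIDATE.findall(text):
        try:
            decoded = base64.b64decode(candidate, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text: ignore this candidate
        lowered = decoded.lower()
        hits.extend(k for k in SUSPICIOUS_KEYWORDS if k in lowered)
    return hits
```

Candidates that fail to decode are skipped here rather than treated as attacks; a fail-closed detector could make the opposite choice.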
Coverage Boundaries:
| Technique Category | Covered | Notes |
|---|---|---|
| Direct English-language overrides | Yes | 6 compiled regex pattern groups match known phrases |
| Base64-encoded payloads | Yes | Decoded and inspected for suspicious keywords |
| Canary token leakage | Yes | Planted canary strings detected at CRITICAL level |
| Semantically equivalent paraphrasing | No | Regex patterns match literal phrases, not semantic intent |
| Non-English injections | No | All patterns are English-only; no multilingual normalization |
| Indirect injection via tool output | No | Detector screens input only; tool results are not scanned |
| URL encoding / ROT13 applied | No | Only base64 and hex/unicode escape sequences covered |
| Allowlist configuration abuse | No | DetectionConfig.allowlist (line 123) lacks input validation (see #744) |
Recommendations:
- Wire PromptInjectionDetector into BaseIntegration.pre_execute() via a GovernancePolicy.prompt_injection_detection flag (default: True for regex).
- Document LlamaFirewall integration as the recommended ML-based upgrade path for semantic detection (covers paraphrasing and multilingual attacks).
- Share ConversationGuardian's normalize_text() homoglyph/evasion logic with PromptInjectionDetector.
- Add allowlist validation (minimum length, format constraints) to prevent overly broad entries from disabling detection.
LLM02: Insecure Output Handling
Insufficient validation or sanitization of LLM outputs before passing them to downstream components, potentially enabling XSS, SSRF, or code execution.
Coverage: Partial
Mitigations:
- CodeSecurityValidator (secure_codegen.py:179-237): AST-based validation of LLM-generated Python code. Detects dangerous imports (17 modules), dangerous calls (22+ functions), shell injection (shell=True), SQL injection (string formatting), path traversal (../), and hardcoded secrets (5 patterns).
- Code sanitization (secure_codegen.py:384-393): Comments out dangerous lines in generated code.
- Secure code templates (secure_codegen.py:401-526): Pre-vetted templates for HTTP clients, file reads, SQL queries, and subprocess calls.
- GuardrailsKernel (guardrails_adapter.py:1-80): Bridge to Guardrails AI validators for input and output validation with BLOCK/WARN/FIX actions.
- Drift detection (base.py:977-1038): Computes semantic drift score between baseline and actual output using SequenceMatcher. Emits DRIFT_DETECTED event when threshold exceeded.
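To illustrate the AST-based approach, the sketch below walks a parsed module and flags a few dangerous constructs. It is not CodeSecurityValidator: the import and call sets are a small illustrative subset chosen here as assumptions, and find_issues is a hypothetical name.

```python
import ast

# Illustrative subsets only; the validator described above covers
# 17 modules and 22+ functions.
DANGEROUS_IMPORTS = {"os", "subprocess", "pickle"}
DANGEROUS_CALLS = {"eval", "exec"}


def find_issues(source: str) -> list[str]:
    """Return human-readable findings for dangerous constructs in Python source."""
    issues = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in DANGEROUS_IMPORTS:
                    issues.append(f"line {node.lineno}: import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in DANGEROUS_IMPORTS:
                issues.append(f"line {node.lineno}: from {node.module} import ...")
        elif isinstance(node, ast.Call):
            # Direct calls such as eval(...) or exec(...)
            if isinstance(node.func, ast.Name) and node.func.id in DANGEROUS_CALLS:
                issues.append(f"line {node.lineno}: call to {node.func.id}()")
            # Any call passing the literal keyword argument shell=True
            for kw in node.keywords:
                if (kw.arg == "shell" and isinstance(kw.value, ast.Constant)
                        and kw.value.value is True):
                    issues.append(f"line {node.lineno}: shell=True")
    return issues
```

Working on the syntax tree rather than on raw text means comments and string literals that merely mention eval or subprocess do not trigger findings.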
Coverage Boundaries:
| Technique Category | Covered | Notes |
|---|---|---|
| Python code validation (AST) | Yes | 17 dangerous imports, 22+ dangerous calls, shell/SQL injection, path traversal, secrets |
| Drift detection on outputs | Partial | post_execute() emits DRIFT_DETECTED events but always returns (True, None); advisory only |
| Non-Python code validation | No | secure_codegen.py:193 raises ValueError for any language other than Python |
| Natural language output filtering | No | No text output sanitization exists for PII, secrets, or sensitive data in prose |
| HTML/XSS output encoding | No | No escaping or encoding for outputs rendered in web UIs |
Recommendations:
- Add GovernancePolicy.block_on_drift: bool = False and honor it in post_execute() (1-line change + policy flag).
- Ship a basic OutputSanitizer that scans tool outputs for the dangerous patterns already defined in mcp_gateway.py.
- Document that multi-language code validation requires CodeShield integration (available via LlamaFirewall's scan_code()).
LLM03: Training Data Poisoning
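A block-capable drift check along the lines of the first recommendation could be sketched as follows. DriftPolicy is a hypothetical stand-in for the proposed policy fields, the 0.5 threshold is an assumed value, and defining drift as one minus the SequenceMatcher ratio is an assumption about how the score is derived.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Tuple


@dataclass
class DriftPolicy:
    """Hypothetical stand-in for the proposed GovernancePolicy fields."""
    drift_threshold: float = 0.5   # assumed value, for illustration only
    block_on_drift: bool = False   # advisory by default, as proposed above


def drift_score(baseline: str, actual: str) -> float:
    """0.0 means identical text, 1.0 means nothing in common."""
    return 1.0 - SequenceMatcher(None, baseline, actual).ratio()


def post_execute(policy: DriftPolicy, baseline: str,
                 actual: str) -> Tuple[bool, Optional[str]]:
    """Return (allowed, reason); blocks only when the policy flag is set."""
    score = drift_score(baseline, actual)
    if score > policy.drift_threshold and policy.block_on_drift:
        return False, f"drift {score:.2f} exceeds threshold {policy.drift_threshold:.2f}"
    return True, None
```

With the flag left at its default, the function keeps today's advisory behavior and always returns (True, None).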
Manipulation of training data to introduce vulnerabilities, biases, or backdoors into the model.
Coverage: Partial
Mitigations:
- MemoryGuard (memory_guard.py:186-242): Guards agent runtime memory stores (RAG, episodic, working memory) against poisoning. Pre-write validation checks for injection patterns (7 regex), code injection (6 regex), excessive special characters (>30% threshold), and Unicode bidi/homoglyph manipulation.
- Hash integrity (memory_guard.py:244-259): SHA-256 hash comparison for tamper detection on stored entries.
- Batch scanning (memory_guard.py:261-295): Integrity + content scanning of existing memory entries.
- Write audit trail (memory_guard.py:220-241): Every write attempt logged with timestamp, source, content hash, and allow/deny decision.
- Fail-closed (memory_guard.py:199-210): Validation errors block the write.
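Two of the checks above, the special-character threshold and SHA-256 tamper detection, can be sketched in a few lines. This is an illustration under assumptions rather than MemoryGuard itself: StoredEntry, validate_write, and is_tampered are hypothetical names, and "special character" is defined here as anything that is neither alphanumeric nor whitespace.

```python
import hashlib
from dataclasses import dataclass


def content_hash(content: str) -> str:
    """SHA-256 hex digest of the UTF-8 encoded content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def special_char_ratio(content: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not content:
        return 0.0
    special = sum(1 for ch in content if not (ch.isalnum() or ch.isspace()))
    return special / len(content)


@dataclass
class StoredEntry:
    content: str
    sha256: str


def validate_write(content: str, max_special_ratio: float = 0.30) -> StoredEntry:
    """Reject the write by raising when the special-character check fails."""
    if special_char_ratio(content) > max_special_ratio:
        raise ValueError("excessive special characters")
    return StoredEntry(content=content, sha256=content_hash(content))


def is_tampered(entry: StoredEntry) -> bool:
    """True when the stored content no longer matches its recorded hash."""
    return content_hash(entry.content) != entry.sha256
```

The hash only detects changes made after the write; content that was already poisoned when it was stored hashes consistently and must be caught by the pre-write checks.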
Adversarial Validation:
Training pipeline data poisoning is architecturally out of scope: the toolkit does not manage model training, fine-tuning, or dataset curation. MemoryGuard addresses the runtime variant: poisoning of RAG stores, episodic memory, and working context that influence agent behavior at inference time.
Recommendations:
- Wire MemoryGuard.validate_write() into adapters that manage agent memory/context.
- Document the scope boundary: "Training pipeline data poisoning is out of scope. Runtime memory and context poisoning (RAG injection, episodic memory tampering) is addressed by MemoryGuard."
LLM04: Model Denial of Service
Resource-intensive queries that consume excessive compute, degrade availability, or increase costs.
Coverage: Partial
Mitigations:
- Token limits (base.py:150): GovernancePolicy.max_tokens (default 4096). Validated as positive integer on construction.
- Tool call limits (base.py:151): GovernancePolicy.max_tool_calls (default 10). Enforced by PolicyInterceptor (lines 705-709).
- Timeout (base.py:155): GovernancePolicy.timeout_seconds (default 300s). Checked in pre_execute (line 944).
- Concurrency limits (base.py:805-859): BoundedSemaphore with backpressure. Rejects requests when capacity exhausted.
- MCPGateway rate limiting (mcp_gateway.py:219-225): Per-agent call budget enforcement. Manual reset methods exist (reset_agent at line 292, reset_all at line 296) but no automatic time-window reset.
- RateLimiter (rate_limiter.py:93-101): Token-bucket algorithm, thread-safe with threading.Lock. Returns False when budget exhausted.
- RingBreachDetector (breach_detector.py:68-99): Sliding-window call-rate analysis with severity thresholds. Per-agent circuit breaker trips on HIGH/CRITICAL.
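For readers unfamiliar with the token-bucket algorithm mentioned above, here is a minimal thread-safe sketch. It is not the code in rate_limiter.py: the TokenBucket class, its constructor parameters, and the injectable clock are assumptions made so the example is self-contained and testable.

```python
import threading
import time


class TokenBucket:
    """Minimal token bucket: refills continuously, allows bursts up to capacity."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._tokens = capacity     # start with a full bucket
        self._clock = clock         # injectable for deterministic tests
        self._last = clock()
        self._lock = threading.Lock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; return False when exhausted."""
        with self._lock:
            now = self._clock()
            elapsed = now - self._last
            self._last = now
            # Refill for the elapsed time, never exceeding capacity.
            self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
            if self._tokens >= cost:
                self._tokens -= cost
                return True
            return False
```

Unlike a fixed call budget that needs a manual reset, the bucket recovers on its own as time passes, which is the behavior the MCPGateway counter currently lacks.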
Coverage Boundaries:
| Technique Category | Covered | Notes |
|---|---|---|
| Tool call count limits | Yes | PolicyInterceptor enforces max_tool_calls per session (line 705-709) |
| MCPGateway call budget | Yes | Per-agent budget enforcement; manual reset methods exist but no automatic time-window reset |
| Token budget tracking | Partial | TokenBudgetTracker tracks usage and fires warnings but never blocks execution |
| Token-bucket rate limiting | No (unwired) | RateLimiter has correct algorithm but is not imported by any adapter or interceptor |
| Concurrency limiting | No (unwired) | BoundedSemaphore exists but is not integrated into BaseIntegration.pre_execute() |
| Payload size validation | No | No input size limits; arbitrarily large parameters are serialized and processed |
Recommendations:
- Wire RateLimiter and TokenBudgetTracker into BaseIntegration with blocking behavior controlled by policy flags (block_on_budget_exceeded, block_on_rate_limit).
- Add GovernancePolicy.max_input_length as a coarse payload size guard.
- Add automatic time-window reset to MCPGateway's call counter.
- Note: prompt-length validation relative to model context windows is model-serving scope, not governance scope.
LLM05: Supply Chain Vulnerabilities
Vulnerabilities in third-party components, pre-trained models, or data pipelines used by LLM applications.
Coverage: Partial
Mitigations:
- SupplyChainGuard (supply_chain.py:72-79): Detects freshly published packages (<7 days), unpinned versions, and typosquatting (SequenceMatcher ratio >0.85).
- SBOM generation (sbom.py:46+): SPDX 2.3 format with SHA-256 hashing and dependency tracking.
- Artifact signing (signing.py:18-33): Ed25519 signing with cryptography library. Fail-closed when library missing (raises ImportError).
- MCP tool fingerprinting (mcp_security.py:367-454): SHA-256 fingerprints of tool definitions with change detection via check_rug_pull().
- MCP typosquatting (mcp_security.py:683-741): Levenshtein distance check (<=2 edits) for tool name impersonation.
- ContentHashInterceptor (base.py:714-782): SHA-256 content hashing of tool callables. Strict mode blocks unregistered tools.
- CI workflows (dependency-review.yml, codeql.yml, scorecard.yml, sbom.yml): Automated dependency audit, CodeQL scanning, OpenSSF Scorecard, SBOM generation.
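The SequenceMatcher-based typosquatting check can be illustrated as below. The 0.85 threshold comes from the mitigation text; the list of known packages and the function name typosquat_candidates are assumptions, and the real guard may compare against a different reference set.

```python
from difflib import SequenceMatcher

# Illustrative reference list of well-known package names (an assumption).
KNOWN_PACKAGES = ("requests", "numpy", "pandas", "cryptography")


def typosquat_candidates(name: str, threshold: float = 0.85) -> list[str]:
    """Return known packages that `name` closely resembles without matching exactly."""
    lowered = name.lower()
    matches = []
    for known in KNOWN_PACKAGES:
        if lowered == known:
            return []  # exact match: the real package, not an impersonation
        if SequenceMatcher(None, lowered, known).ratio() > threshold:
            matches.append(known)
    return matches
```

A name such as "requsts" scores above the threshold against "requests" and is flagged, while unrelated names fall well below it.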
Coverage Boundaries:
| Technique Category | Covered | Notes |
|---|---|---|
| Dependency metadata scanning | Partial | SupplyChainGuard produces findings but has no blocking pipeline; reporting only |
| Artifact signing | Partial | Ed25519 signing is fail-closed when library missing, but signing itself is opt-in |
| Tool integrity verification | Partial | ContentHashInterceptor is fail-closed but requires cooperative adapter to set content_hash metadata |
| CI-level scanning | Yes | dependency-review, CodeQL, OpenSSF Scorecard, SBOM generation workflows active |
| MCP tool fingerprinting | Yes | SHA-256 fingerprints with rug-pull change detection |
Recommendations:
- Connect SupplyChainGuard findings to a blocking pipeline (e.g., raise on CRITICAL severity findings).
- Make ContentHashInterceptor hash computation automatic on tool registration, rather than requiring adapter cooperation.
- Add SLSA provenance attestation generation.
LLM06: Sensitive Information Disclosure
LLM applications revealing confidential data, PII, or proprietary information through their outputs or logs.
Coverage: Partial
Mitigations:
- PII patterns (mcp_gateway.py:34-42): Built-in regex for SSN (\b\d{3}-\d{2}-\d{4}\b) and credit card numbers in tool parameters. Returns (False, reason) on match.
- Blocked patterns (base.py:695-701): PolicyInterceptor.intercept() checks blocked_patterns against tool arguments.
- Secret detection (secure_codegen.py:346-360): 5 regex patterns for API keys, passwords, tokens, AWS keys, private keys in generated code. CRITICAL severity.
- Egress policy (egress_policy.py:113-139): Domain-level egress filtering with first-match-wins and default-deny.
- Canary leak detection (prompt_injection.py:595-612): Detects system prompt canary tokens in user input.
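First-match-wins evaluation with default-deny, as used by the egress policy, can be sketched as follows. EgressRule and is_egress_allowed are hypothetical names, and the use of glob patterns for domains is an assumption; the rule format in egress_policy.py is not described in this document.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class EgressRule:
    pattern: str   # domain glob such as "*.example.com" (assumed format)
    allow: bool


def is_egress_allowed(domain: str, rules: list[EgressRule]) -> bool:
    """The first matching rule decides; no match means deny."""
    for rule in rules:
        if fnmatch(domain.lower(), rule.pattern.lower()):
            return rule.allow
    return False
```

Rule order matters under first-match-wins: a specific deny must be listed before a broader allow that would otherwise match the same domain.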
Coverage Boundaries:
| Technique Category | Covered | Notes |
|---|---|---|
| SSN / credit card in tool parameters | Yes | Regex patterns in mcp_gateway.py:34-42 block matching arguments |
| Other PII (email, phone, address) | No | Only 2 PII patterns implemented |
| Sensitive data in LLM text output | No | Blocked patterns check tool arguments only, not LLM response text |
| Audit log parameter redaction | No | mcp_gateway.py:165 stores raw parameters=params with no redaction (see below) |
Audit Log Disclosure (elevated finding): The audit trail is the single most reliable disclosure vector because it is always active when log_all_calls=True (the default). Every tool call's full parameters, including any PII, credentials, or tokens passed as arguments, are stored verbatim in AuditEntry and exposed via logger.info(). This means the toolkit's own security logging is a data leak pathway. This finding warrants priority remediation.
Recommendations:
- Priority: Add GovernancePolicy.redact_audit_pii: bool = False for pattern-based redaction of AuditEntry.parameters before persistence.
- Expand default PII patterns to cover the OWASP-recommended set (email, phone, IP address, JWT tokens).
- Apply the same pattern scanning to LLM outputs via post_execute() or a new output interceptor.
- Document integration path for external DLP services as an advanced configuration.
LLM07: Insecure Plugin Design
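Pattern-based redaction of audit parameters, the priority recommendation above, might be implemented along these lines. The SSN pattern is the one quoted in the mitigations; the credit card pattern is a simplified assumption, and redact and PII_PATTERNS are hypothetical names rather than toolkit APIs.

```python
import re
from typing import Any

# SSN pattern as quoted above; the credit card pattern is a simplified assumption.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}


def redact(value: Any) -> Any:
    """Recursively replace PII matches in strings nested inside dicts and lists."""
    if isinstance(value, str):
        for label, pattern in PII_PATTERNS.items():
            value = pattern.sub(f"[REDACTED:{label}]", value)
        return value
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value  # numbers, booleans, None: returned unchanged
```

The function builds new containers instead of mutating its input, so the tool call itself still receives the original parameters while only the logged copy is redacted.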
Plugins or tools that accept untrusted input without adequate validation, enabling injection, privilege escalation, or data exfiltration.
Coverage: Partial
Mitigations:
- MCPGateway 5-stage pipeline (mcp_gateway.py:134-251): Deny-list, allow-list, parameter sanitization (policy + built-in patterns), rate limiting, human-in-the-loop approval. Fail-closed on errors.
- Tool definition scanning (mcp_security.py:300-331): Comprehensive scan for hidden instructions, description injection, schema abuse, and cross-server attacks.
- Rug-pull detection (mcp_security.py:413-454): SHA-256 fingerprinting of tool definitions with change detection.
- Schema abuse detection (mcp_security.py:605-681): Flags overly permissive schemas, suspicious field names (system_prompt, webhook, callback_url), instruction-bearing defaults.
- ContentHashInterceptor (base.py:714-782): SHA-256 integrity verification of tool callables. Strict mode blocks unregistered tools.
- RBAC (rbac.py:88-92): 4-role access control with action-level permissions.
- Human approval (mcp_gateway.py:231-249): Sensitive tools require explicit human approval via configurable callback. Fails closed on callback errors.
Coverage Boundaries:
| Technique Category | Covered | Notes |
|---|---|---|
| Tool allow/deny lists | Yes | MCPGateway 5-stage pipeline with fail-closed error handling |
| Parameter pattern matching | Yes | json.dumps() flattens nested structures; regex catches patterns regardless of depth |
| Unicode-normalized parameters | No | mcp_gateway.py:39 regex matches ASCII semicolons only; fullwidth Unicode variants not normalized |
| JSON Schema composition ($ref/oneOf) | No | _check_schema_abuse() inspects top-level properties only; composition keywords unresolved |
| Runtime rug-pull detection | No | check_rug_pull() runs at scan-time; MCPGateway does not invoke scanner at execution time |
| Human-in-the-loop for sensitive tools | Yes | Configurable approval callback with fail-closed behavior |
Recommendations:
- Integrate MCPSecurityScanner results into MCPGateway decisions (connect scan-time detection to execution-time enforcement).
- Add JSON Schema composition keyword resolution ($ref, oneOf, allOf, anyOf) to _check_schema_abuse().
- Apply Unicode normalization (NFKC) to tool parameters before pattern matching.
LLM08: Excessive Agency
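The effect of NFKC normalization on the fullwidth-semicolon bypass can be shown with the standard library. The DANGEROUS pattern below is an illustrative assumption, not the gateway's actual regex, and is_blocked is a hypothetical helper.

```python
import re
import unicodedata

# Illustrative pattern only: a semicolon followed by a shell-style command.
# The gateway's actual patterns are not reproduced here.
DANGEROUS = re.compile(r";\s*(rm|curl|wget)\b")


def normalize(text: str) -> str:
    """Fold compatibility characters (e.g. fullwidth forms) to their ASCII equivalents."""
    return unicodedata.normalize("NFKC", text)


def is_blocked(param: str) -> bool:
    """Match the pattern against the normalized parameter, not the raw one."""
    return DANGEROUS.search(normalize(param)) is not None
```

A parameter containing U+FF1B (fullwidth semicolon) slips past the raw regex but is caught after normalization, because NFKC maps it to the ASCII semicolon.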
LLM agents taking actions beyond their intended scope, including excessive autonomy, permissions, or functionality.
Coverage: Partial
Mitigations:
- Execution rings (enforcer.py:47-88): 4-tier privilege model (Ring 0-3). Ring 0 always denied in Public Preview. Agent ring derived from trust score.
- Ring enforcement (enforcer.py:94-97): should_demote() checks trust score drops.
- Kill switch (kill_switch.py:86-118): Immediate agent termination with 6 kill reasons and saga step compensation.
- RogueDetectionMiddleware (maf_adapter.py:416-501): Raises MiddlewareTermination when quarantine_recommended=True. Only enforcement point that auto-blocks.
- Breach detector (breach_detector.py:104-178): Sliding-window frequency analysis with severity classification and internal circuit breaker.
- Rogue agent detection (rogue_detector.py:276-374): Composite risk scoring from frequency z-score, entropy deviation, and capability profile violations.
- Tool allowlist (base.py:689-693): PolicyInterceptor blocks tools not in allowed_tools.
- Max tool calls (base.py:705-709): Hard limit per session.
- Human approval (base.py:682-685): require_human_approval halts all tool calls pending review.
- Scope guard (scope_guard.py:139-249): File/line count limits with PASS/SOFT_FAIL/HARD_FAIL decisions.
- Escalation quorum (escalation.py:1-80): M-of-N approval for sensitive actions with timeout and default-deny.
Coverage Boundaries:
| Technique Category | Covered | Notes |
|---|---|---|
| Tool allowlist enforcement | Yes | PolicyInterceptor blocks tools not in allowed_tools (line 689-693) |
| Max tool calls per session | Yes | Hard limit enforced at line 705-709 |
| Human approval gates | Yes | require_human_approval halts all tool calls pending review |
| Rogue detection → auto-block (MAF path) | Yes | RogueDetectionMiddleware raises MiddlewareTermination (MAF adapter only) |
| Rogue detection → auto-block (other paths) | No | Standalone RogueAgentDetector sets quarantine_recommended but takes no enforcement action |
| Kill switch auto-trigger | No | Exactly 1 call site: CLI handler at session_commands.py:120; no detection system invokes it |
| ScopeGuard enforcement | No | Returns advisory ScopeEvaluation; no production code checks the decision |
Recommendations:
- Ship an optional RogueEnforcementBridge that connects RogueAgentDetector.assess() to KillSwitch.kill() with configurable risk thresholds.
- Wire ScopeGuard evaluation results into a ToolCallInterceptor that blocks on HARD_FAIL.
- The separation between detection and enforcement is architecturally intentional (operators control enforcement policy), but the toolkit should provide the wiring as opt-in rather than requiring custom glue code.
LLM09: Overreliance
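The proposed bridge between detection and enforcement could take a shape like the one below. Everything here is hypothetical: the Assessment fields, the callable signatures standing in for RogueAgentDetector.assess() and KillSwitch.kill(), and the 0.9 default threshold are assumptions, since the real interfaces are not shown in this document.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Assessment:
    """Hypothetical shape of a detector result."""
    agent_id: str
    risk_score: float              # assumed to lie in [0.0, 1.0]
    quarantine_recommended: bool


@dataclass
class RogueEnforcementBridge:
    assess: Callable[[str], Assessment]   # stands in for the detector
    kill: Callable[[str, str], None]      # stands in for the kill switch: (agent_id, reason)
    kill_threshold: float = 0.9           # assumed default; operators would configure this
    killed: List[str] = field(default_factory=list)

    def check(self, agent_id: str) -> bool:
        """Return True if the agent may continue, False if it was terminated."""
        result = self.assess(agent_id)
        if result.quarantine_recommended or result.risk_score >= self.kill_threshold:
            self.kill(agent_id, f"risk score {result.risk_score:.2f}")
            self.killed.append(agent_id)
            return False
        return True
```

Injecting the detector and kill switch as callables keeps the bridge opt-in: operators who prefer manual enforcement simply do not construct it.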
Uncritical acceptance of LLM outputs without verification, leading to misinformation, security vulnerabilities, or faulty decisions.
Coverage: Partial
Mitigations:
- Drift detection (base.py:977-1038): SequenceMatcher-based drift scoring between baseline and actual output. Emits DRIFT_DETECTED event when threshold exceeded.
- Confidence threshold (base.py:964-973): GovernancePolicy.confidence_threshold (default 0.8) gates actions below minimum confidence.
- Adversarial evaluator (_adversarial_impl.py:120-191): Runs 8 built-in attack vectors against governance interceptor. Produces per-category risk scores. Testing utility, not runtime enforcement.
- Dry-run mode (dry_run.py:63-104): Shadow-mode evaluation that records what would happen without blocking.
- Trust scoring (agentmesh/reward/scoring.py:1-100): 5-dimensional scoring (policy compliance, resource efficiency, output quality, security posture, collaboration health) with exponential moving average.
Coverage Boundaries:
| Technique Category | Covered | Notes |
|---|---|---|
| Drift detection (same agent) | Partial | post_execute() computes scores and emits events but never blocks (returns (True, None) always) |
| Confidence threshold gating | No (dead code) | base.py:966 uses getattr(input_data, 'confidence', None); no framework adapter populates this attribute |
| Cascading hallucination (cross-agent) | No | Drift detection is per-agent; no cross-agent hallucination propagation detection |
| Factual accuracy verification | No | No fact-checking, grounding, or retrieval verification; application-layer concern |
| Adversarial governance testing | Yes | AdversarialEvaluator runs 8 attack vectors against interceptor (testing utility, not runtime) |
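The "dead code" finding for confidence gating is easiest to see in a small reproduction. The sketch assumes that a missing attribute lets the request through, which is what the finding implies; confidence_gate is a hypothetical function, not the code at base.py:966.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def confidence_gate(input_data: Any, threshold: float = 0.8) -> Tuple[bool, Optional[str]]:
    """Gate on input_data.confidence, mirroring the getattr pattern cited above."""
    confidence = getattr(input_data, "confidence", None)
    if confidence is None:
        # No adapter supplied a confidence value, so the gate is skipped.
        return True, None
    if confidence < threshold:
        return False, f"confidence {confidence:.2f} below threshold {threshold:.2f}"
    return True, None
```

Inputs without a confidence attribute, including plain strings, always pass, so the threshold only takes effect once an adapter starts populating the attribute.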
Recommendations:
- Document the scope boundary: "The toolkit detects behavioral anomalies that correlate with overreliance (drift, trust decay) but does not verify factual accuracy. Fact-checking and grounding are application-layer concerns."
- Make drift detection block-capable via GovernancePolicy.block_on_drift flag.
- Explore cross-agent drift correlation for multi-agent deployments.
LLM10: Model Theft
Unauthorized access to, copying, or extraction of proprietary LLM models through API queries, side channels, or direct access.
Coverage: Gap (Out of Scope)
The toolkit wraps LLM API clients. It does not host models, manage model weights, or control model serving infrastructure. Model theft prevention requires controls at the inference/serving layer, which is architecturally outside this toolkit's domain.
Indirect mitigations:
- Rate limiting via RateLimiter: limits query volume that could be used for extraction-via-distillation.
- Audit logging of all tool calls: forensic trail for detecting suspicious query patterns.
- RogueAgentDetector frequency analysis: detects high-volume systematic querying that could indicate extraction attempts.
These are defense-in-depth signals, not primary mitigations.
Recommendations:
- Document as out of scope with indirect mitigations cited.
- Recommend model-serving-layer protections: API key rotation, model endpoint access logging, output watermarking, extraction query detection.
Relationship to OWASP Agentic Top 10
This document covers the OWASP Top 10 for LLM Applications (2025), which addresses risks specific to LLM-powered applications. The toolkit also maps against the OWASP Top 10 for Agentic Applications (2026) in docs/OWASP-COMPLIANCE.md, which covers agent-specific risks (goal hijack, rogue agents, cascading failures, etc.).
Several risks overlap between the two lists:
| LLM Risk | Agentic Risk | Overlap |
|---|---|---|
| LLM01 Prompt Injection | ASI-01 Agent Goal Hijack | Prompt injection is one vector for goal hijacking |
| LLM05 Supply Chain | ASI-04 Supply Chain | Same risk, different framing |
| LLM07 Insecure Plugin | ASI-02 Tool Misuse | Plugin security is a subset of tool governance |
| LLM08 Excessive Agency | ASI-10 Rogue Agents | Excessive agency can manifest as rogue behavior |
The Agentic Top 10 mapping uses a different evidence standard (capability presence) than this document (verified enforcement + adversarial bypass testing). A similar adversarial validation of the Agentic Top 10 would likely surface comparable detection-without-enforcement gaps.
OWASP 2025 Edition: New Risk Categories
The 2025 revision of the OWASP Top 10 for LLM Applications introduces two new categories not present in the 2023 edition. Preliminary coverage notes:
LLM07 (2025): System Prompt Leakage
Unauthorized disclosure of system prompts that reveal internal logic, security controls, or sensitive configuration.
Coverage: Partial
The toolkit's canary token detection (prompt_injection.py:595-612) catches system prompt leakage when canary strings appear in user-visible output. MemoryGuard protects policy-controlled context (vfs://{agent_id}/policy/*) as read-only. However, there is no dedicated system prompt protection mechanism: no prompt encryption, no output scanning for known system prompt fragments, and no monitoring for extraction attempts (repeated probing queries designed to elicit system instructions).
LLM08 (2025): Vector and Embedding Weaknesses
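Canary-based leak detection works by planting a unique marker in the system prompt and searching for it elsewhere. The following is a minimal sketch of that idea, with make_canary and contains_canary as hypothetical helpers; the marker format is an assumption and the toolkit's detector may differ.

```python
import secrets


def make_canary(prefix: str = "CANARY") -> str:
    """Generate a unique marker to embed in a system prompt."""
    return f"{prefix}-{secrets.token_hex(8)}"


def contains_canary(text: str, canaries: set[str]) -> bool:
    """True when any planted canary appears verbatim in the text."""
    return any(canary in text for canary in canaries)
```

Because the match is verbatim, a model that paraphrases or partially reproduces its system prompt would not trip this check, which is consistent with the gap noted above around scanning for prompt fragments.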
Vulnerabilities in RAG pipelines where embeddings, vector stores, or retrieval mechanisms are manipulated to inject malicious content or poison context.
Coverage: Partial
MemoryGuard (memory_guard.py:186-295) validates writes to agent memory stores (including RAG stores) with injection pattern detection, hash integrity, and content scanning. This addresses write-path poisoning. However, there is no read-path validation: poisoned content that was written before MemoryGuard was deployed, or content poisoned at the embedding/indexing layer, would not be detected at retrieval time. No embedding-level integrity verification exists.