Governance Overhead Benchmark

Issue: #720 Date: 2026-04-04 Platform: Windows 11, Intel Core i7-13th Gen, Python 3.12.8

Summary

This benchmark measures the latency overhead that AGT governance layers add to agent actions. It exercises the real implementations from all three packages:

Package            Component tested
agent-os           PolicyEvaluator: declarative YAML/JSON rule engine
agent-mesh         TrustPolicyEvaluator: trust-score-based policy DSL
agent-mesh         AuditLog / MerkleAuditChain: append-only Merkle audit
agent-mesh         CredentialManager: ephemeral credential issuance & validation
agent-hypervisor   RingEnforcer: execution ring computation & checks
agent-hypervisor   DeltaEngine: VFS delta capture with hash chains
agent-hypervisor   Hypervisor: session create/join/activate/terminate

Key finding

Full-stack governance (policy + trust + ring check + Merkle audit) adds ~0.07 ms at p50 and ~0.42 ms at p99 per agent action. For context, a single LLM API call typically takes 200-2000 ms, so the p50 overhead is under 0.04% of end-to-end latency even against a 200 ms call, and the p99 overhead is about 0.2% in that same worst case.
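The percentage above can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in this report; the function name overhead_share is illustrative and not part of any AGT package.

```python
# Governance overhead figures quoted in the summary (milliseconds).
GOVERNANCE_P50_MS = 0.069
GOVERNANCE_P99_MS = 0.416


def overhead_share(governance_ms: float, llm_call_ms: float) -> float:
    """Return governance latency as a percentage of end-to-end latency."""
    return 100.0 * governance_ms / (llm_call_ms + governance_ms)


if __name__ == "__main__":
    # 200 ms and 2000 ms are the bounds of the typical LLM call range.
    for llm_ms in (200.0, 2000.0):
        p50 = overhead_share(GOVERNANCE_P50_MS, llm_ms)
        p99 = overhead_share(GOVERNANCE_P99_MS, llm_ms)
        print(f"LLM call {llm_ms:.0f} ms: p50 share {p50:.4f}%, p99 share {p99:.4f}%")
```

The share shrinks as the LLM call gets slower, so the 200 ms case is the upper bound within the quoted range.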

Charts

[Charts: Latency Comparison, Overhead Breakdown]

End-to-End Comparison

                               p50 (ms)    p99 (ms)    ops/sec
Ungoverned action              0.0005      0.0007      2,000,000
Governed (policy only)         0.0086      0.0321        103,315
Governed (full stack)          0.0690      0.4160         11,227

Governance overhead (p50):     0.069 ms
Governance overhead (p99):     0.416 ms

The full governance stack is dominated by Merkle audit logging (~0.05 ms at p50). Policy evaluation (~0.006 ms) and trust policy evaluation (~0.004 ms) are minor contributors, and ring enforcement costs about 0.002 ms.

Detailed Results

Baseline (No Governance)

Operation                          p50 (ms)    p95 (ms)    p99 (ms)    ops/sec
Bare action (dict construction)    0.0005      0.0007      0.0009      865,501
Simulated tool call + validation   0.0021      0.0023      0.0024      463,542

Policy Evaluation (agent-os)

Rules   p50 (ms)    p95 (ms)    p99 (ms)    ops/sec
1       0.0054      0.0059      0.0062      185,065
10      0.0061      0.0066      0.0092      102,017
50      0.0119      0.0147      0.0642       60,544
100     0.0193      0.0233      0.0629       31,185

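Policy evaluation scales roughly linearly with rule count: p50 rises from 0.0054 ms at 1 rule to 0.0193 ms at 100 rules, about 0.14 µs per additional rule. Even with 100 rules, p50 stays under 0.02 ms.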
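The linear-scaling claim can be checked by fitting a straight line to the p50 column of the table above. The sketch below uses an ordinary least-squares fit written out by hand; the helper name least_squares is illustrative, and the only data used are the four (rules, p50) pairs reported here.

```python
# (rule count, p50 latency in ms) pairs from the policy evaluation table.
RULES = [1, 10, 50, 100]
P50_MS = [0.0054, 0.0061, 0.0119, 0.0193]


def least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares(RULES, P50_MS)

if __name__ == "__main__":
    # Convert ms to microseconds for readability.
    print(f"per-rule cost: {slope * 1000:.3f} us")
    print(f"fixed cost:    {intercept * 1000:.2f} us")
```

With these four points the fit gives roughly 0.14 µs per rule on top of a fixed cost of about 5 µs, which suggests that for small rule sets the fixed per-call cost matters more than the rule count.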

Trust Policy Evaluation (agent-mesh)

Rules   p50 (ms)    p95 (ms)    p99 (ms)    ops/sec
1       0.0030      0.0033      0.0037      279,666
10      0.0044      0.0052      0.0479      141,493
50      0.0101      0.0121      0.0431       23,646

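At p50, trust evaluation is slightly faster than agent-os policy evaluation at equivalent rule counts due to simpler condition logic. The advantage does not hold uniformly in the tail: at 10 rules, trust evaluation has the higher p99 (0.0479 ms vs 0.0092 ms).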

Credential Operations (agent-mesh)

Operation                           p50 (ms)    p95 (ms)    p99 (ms)    ops/sec
Credential issuance                 0.0101      0.0144      0.0896         65,876
Token validation (manager lookup)   0.0450      0.0564      0.2190         17,027
Token hash verify (SHA-256)         0.0008      0.0009      0.0009      1,241,003

Token hash verification is sub-microsecond (0.0008 ms at p50). The CredentialManager.validate() cost comes from the linear scan over stored credentials; production deployments with indexed stores should approach the hash-only path.
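To illustrate the difference between a linear scan and an indexed store, here is a minimal sketch. The classes Credential, LinearStore, and IndexedStore are hypothetical and are not the agent-mesh CredentialManager API; they only show why keying credentials by token hash removes the per-lookup scan.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Credential:
    """Hypothetical credential record: owner plus SHA-256 of the token."""
    agent_id: str
    token_hash: str


def hash_token(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


class LinearStore:
    """Validation scans every stored credential: O(n) per lookup."""

    def __init__(self) -> None:
        self._creds: List[Credential] = []

    def add(self, agent_id: str, token: str) -> None:
        self._creds.append(Credential(agent_id, hash_token(token)))

    def validate(self, token: str) -> Optional[Credential]:
        wanted = hash_token(token)
        for cred in self._creds:
            if cred.token_hash == wanted:
                return cred
        return None


class IndexedStore:
    """Validation is one hash plus one dict lookup: O(1) on average."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, Credential] = {}

    def add(self, agent_id: str, token: str) -> None:
        cred = Credential(agent_id, hash_token(token))
        self._by_hash[cred.token_hash] = cred

    def validate(self, token: str) -> Optional[Credential]:
        return self._by_hash.get(hash_token(token))
```

Both stores return the same result for any token; only the lookup cost differs, and the gap grows with the number of stored credentials.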

Audit Logging (agent-mesh)

Operation                           p50 (ms)    p95 (ms)    p99 (ms)    ops/sec
Audit log write (Merkle chain)      0.0472      0.0887      0.4428      14,121
Merkle chain verify (100 entries)   0.6602      1.9190      3.6661       1,163

Audit writes are the most expensive per-action governance cost due to SHA-256 hash computation and Merkle tree updates. Verification is an offline operation (not on the hot path).
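The split between a per-action write and an offline verification pass can be sketched with a simple hash chain. This is a simplification: it chains each entry to the previous entry's hash rather than maintaining a Merkle tree, and the class HashChainLog is hypothetical, not the agent-mesh AuditLog or MerkleAuditChain implementation.

```python
import hashlib
import json
from typing import List, Tuple

GENESIS_HASH = "0" * 64  # placeholder hash preceding the first entry


class HashChainLog:
    """Append-only log where each entry commits to the previous one."""

    def __init__(self) -> None:
        # Each entry is (serialized event, entry hash).
        self._entries: List[Tuple[str, str]] = []

    @staticmethod
    def _digest(prev_hash: str, payload: str) -> str:
        return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

    def append(self, event: dict) -> str:
        """Hot path: one JSON serialization and one SHA-256 per write."""
        prev_hash = self._entries[-1][1] if self._entries else GENESIS_HASH
        payload = json.dumps(event, sort_keys=True)
        entry_hash = self._digest(prev_hash, payload)
        self._entries.append((payload, entry_hash))
        return entry_hash

    def verify(self) -> bool:
        """Offline path: recompute every hash from the start of the chain."""
        prev_hash = GENESIS_HASH
        for payload, entry_hash in self._entries:
            if self._digest(prev_hash, payload) != entry_hash:
                return False
            prev_hash = entry_hash
        return True
```

A write touches only the newest entry, while verification walks the whole chain, which is consistent with verification being far more expensive than a single write in the table above.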

Ring Enforcement (agent-hypervisor)

Operation                p50 (ms)    p95 (ms)    p99 (ms)    ops/sec
Ring computation         0.0003      0.0004      0.0005      2,762,049
Ring enforcement check   0.0013      0.0015      0.0018        576,814

Ring computation is sub-microsecond (0.0003 ms at p50) and the enforcement check takes about 0.0013 ms (1.3 µs), so privilege checks add negligible overhead.

Delta Audit (agent-hypervisor)

Operation                     p50 (ms)    p95 (ms)    p99 (ms)    ops/sec
Delta capture                 0.0092      0.0111      0.0200      70,621
Hash chain root (10 deltas)   0.0939      0.1850      0.4678       8,638

Session Lifecycle (agent-hypervisor)

Operation                                         p50 (ms)    p95 (ms)    p99 (ms)    ops/sec
Full lifecycle (create+join+activate+terminate)   0.0176      0.0217      0.0685      49,812

Overhead Breakdown

Where does the governed-action overhead come from?

Component                  p50 contribution
-----------------------------------------
Policy eval (10 rules)     0.006 ms  (  9%)
Trust eval  (5 rules)      0.004 ms  (  6%)
Ring compute + check       0.002 ms  (  3%)
Audit log write (Merkle)   0.047 ms  ( 68%)
Other (timestamps, dicts)  0.010 ms  ( 14%)
-----------------------------------------
Total                      0.069 ms  (100%)

Merkle audit is the dominant cost. If cryptographic audit integrity is not required, replacing AuditLog with a plain append-list removes most of the 0.047 ms audit write and drops the governed-action overhead to ~0.02 ms (p50).
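The percentages in the breakdown and the ~0.02 ms figure follow directly from the component values. A minimal sketch of that arithmetic is below; the dictionary keys are illustrative labels, and the calculation assumes a plain append-list costs approximately nothing compared with the Merkle write.

```python
# p50 contributions in milliseconds, copied from the breakdown table.
COMPONENTS_MS = {
    "policy_eval_10_rules": 0.006,
    "trust_eval_5_rules": 0.004,
    "ring_compute_and_check": 0.002,
    "audit_log_write_merkle": 0.047,
    "other": 0.010,
}

total_ms = sum(COMPONENTS_MS.values())

# Share of the total for each component, in percent.
shares = {name: 100.0 * ms / total_ms for name, ms in COMPONENTS_MS.items()}

# Overhead left after removing the Merkle audit write (assumption: the
# replacement append-list has negligible cost of its own).
without_merkle_ms = total_ms - COMPONENTS_MS["audit_log_write_merkle"]

if __name__ == "__main__":
    for name, pct in shares.items():
        print(f"{name:<26} {COMPONENTS_MS[name]:.3f} ms  ({pct:4.1f}%)")
    print(f"total: {total_ms:.3f} ms, without Merkle audit: {without_merkle_ms:.3f} ms")
```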

Methodology

  • Iterations: 1,000 per benchmark (10,000 for sub-microsecond operations)
  • Warmup: 100 iterations discarded before measurement
  • Timer: time.perf_counter() (backed by QueryPerformanceCounter on Windows, typically 100 ns tick resolution)
  • Percentiles: computed with numpy.percentile()
  • Environment: in-process only, no I/O, no network
  • Async: asyncio.run() for hypervisor session benchmarks
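The methodology above can be sketched as a small harness. This is not the actual benchmark script: the names bench and percentile are illustrative, the percentile helper is a stdlib stand-in for numpy.percentile() using the same default linear interpolation, and deriving ops/sec from total measured time is an assumption about how the reported throughput is computed.

```python
import time
from typing import Callable, Dict, List


def percentile(sorted_samples: List[float], q: float) -> float:
    """Linear-interpolation percentile over pre-sorted samples (q in 0..100)."""
    if not sorted_samples:
        raise ValueError("no samples")
    pos = (len(sorted_samples) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_samples) - 1)
    frac = pos - lower
    return sorted_samples[lower] + (sorted_samples[upper] - sorted_samples[lower]) * frac


def bench(fn: Callable[[], object], iterations: int = 1000, warmup: int = 100) -> Dict[str, float]:
    """Time fn() per call, discarding warmup runs, and report latency stats."""
    for _ in range(warmup):
        fn()  # warmup iterations are not measured

    samples_ms: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    total_s = sum(samples_ms) / 1000.0
    return {
        "p50_ms": percentile(samples_ms, 50),
        "p95_ms": percentile(samples_ms, 95),
        "p99_ms": percentile(samples_ms, 99),
        # Assumption: throughput is iterations divided by total measured time.
        "ops_per_sec": iterations / total_s if total_s > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Roughly corresponds to the "bare action (dict construction)" baseline.
    print(bench(lambda: {"action": "noop", "agent": "a1"}, iterations=10_000))
```

For operations that take only a few timer ticks, the individual samples are quantized by the clock, which is one reason to raise the iteration count for the fastest operations.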

Reproducing

py -3.12 agent-governance-python/benchmarks/governance_overhead.py

Raw results are saved to agent-governance-python/benchmarks/results/governance_overhead.json.