Tutorial: Chaos Testing Your AI Agents with Agent SRE

This tutorial walks through chaos testing patterns for AI agents governed by AGT. You will inject failures into a running agent system and observe how governance policies respond.

Prerequisites: Complete Tutorial 01: Policy Engine before starting.

What is Chaos Testing for AI Agents?

Chaos testing deliberately introduces failures into a system to verify that it degrades gracefully. For AI agents, this means testing what happens when:

  • Tools return errors or timeouts
  • Policies conflict or are missing
  • Resource limits are exceeded
  • External dependencies become unavailable

AGT's policy engine and audit logging make it straightforward to measure whether your governance holds up under stress.
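
All four patterns below follow the same shape: wrap a tool call, inject a fault, and check that policy decisions and the audit trail stay correct. As a framework-agnostic preview (the flaky decorator and its parameters are illustrative, not part of AGT), fault injection can be as small as:

# A minimal fault-injection wrapper (illustrative, not part of AGT)
import functools
import random
import time

def flaky(error_rate=0.2, max_delay=0.5):
    """Wrap a tool function so it randomly delays or fails."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            time.sleep(random.uniform(0, max_delay))  # latency injection
            if random.random() < error_rate:          # error injection
                raise TimeoutError("injected fault")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@flaky(error_rate=0.3)
def web_search(query):
    return {"result": f"results for {query!r}"}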

Setup

Install the required packages:

pip install agent-os-kernel[full]

Create a test directory:

mkdir -p chaos-test/policies

Then create the policy file:

# chaos-test/policies/chaos-test-policy.yaml
apiVersion: agent-governance/v1
kind: Policy
metadata:
  name: chaos-test-policy
  version: "1.0"
spec:
  defaults:
    action: deny
    max_tool_calls: 20
  rules:
    - name: allow-web-search
      tool_name: web_search
      action: allow
      priority: 90
    - name: allow-calculator
      tool_name: calculator
      action: allow
      priority: 90
    - name: deny-file-delete
      tool_name: file_delete
      action: deny
      priority: 100
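
Before running the patterns, it is worth confirming the policy file parses. A quick check, using the same loader calls the pattern scripts below use (the file name 00_load_check.py is just a suggestion):

# chaos-test/00_load_check.py
from pathlib import Path
from agent_os.policies.schema import PolicyDocument

policy = PolicyDocument.from_yaml(Path("chaos-test/policies/chaos-test-policy.yaml"))
print("Policy loaded:", policy)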

Pattern 1: Latency Injection

Simulate slow tool responses to test timeout handling.

# chaos-test/01_latency_injection.py
import time
from agent_os.policies import PolicyEvaluator
from agent_os.policies.schema import PolicyDocument
from pathlib import Path

evaluator = PolicyEvaluator()
policy = PolicyDocument.from_yaml(Path("chaos-test/policies/chaos-test-policy.yaml"))
evaluator.policies.append(policy)

# Simulate a tool that takes too long
class SlowTool:
    def __init__(self, delay_seconds=5):
        self.delay = delay_seconds
        self.call_count = 0

    def execute(self, tool_name, **kwargs):
        self.call_count += 1
        time.sleep(self.delay)

        context = {"tool_name": tool_name}
        decision = evaluator.evaluate(context)

        if not decision.allowed:
            return {"error": f"Denied: {decision.reason}"}

        return {"result": f"Completed after {self.delay}s", "call": self.call_count}

# Run 25 calls with a small fixed delay
tool = SlowTool(delay_seconds=0.1)
for i in range(25):  # exceeds max_tool_calls of 20
    result = tool.execute("web_search", query=f"test query {i}")
    print(f"Call {i+1}: {result}")

What to observe:

  • Calls 1-20 should succeed (within the max_tool_calls limit)
  • Calls 21-25 should be denied by the policy
  • The audit log should show the denial reason
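
Note that SlowTool simulates latency, but nothing above actually enforces a deadline. To test timeout handling end to end, you can wrap the call with one yourself; here is a sketch using concurrent.futures (the execute_with_deadline helper and the 2-second deadline are assumptions, not AGT features):

# Hypothetical deadline wrapper around tool.execute
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

pool = ThreadPoolExecutor(max_workers=1)

def execute_with_deadline(tool, tool_name, deadline_s=2.0, **kwargs):
    future = pool.submit(tool.execute, tool_name, **kwargs)
    try:
        return future.result(timeout=deadline_s)
    except FuturesTimeout:
        # The worker thread still runs to completion; we only stop waiting
        # for it. True cancellation requires cooperative tools.
        return {"error": f"timeout after {deadline_s}s"}

print(execute_with_deadline(SlowTool(delay_seconds=5), "web_search", query="test"))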

Pattern 2: Error Injection

Force tools to return errors and verify the governance response.

# chaos-test/02_error_injection.py
import random
from agent_os.policies import PolicyEvaluator
from agent_os.policies.schema import PolicyDocument
from pathlib import Path

evaluator = PolicyEvaluator()
policy = PolicyDocument.from_yaml(Path("chaos-test/policies/chaos-test-policy.yaml"))
evaluator.policies.append(policy)

class FaultyTool:
    def __init__(self, error_rate=0.3):
        self.error_rate = error_rate
        self.results = {"success": 0, "error": 0, "denied": 0}

    def execute(self, tool_name, **kwargs):
        context = {"tool_name": tool_name}
        decision = evaluator.evaluate(context)

        if not decision.allowed:
            self.results["denied"] += 1
            return {"status": "denied", "reason": decision.reason}

        # Inject random error
        if random.random() < self.error_rate:
            self.results["error"] += 1
            return {"status": "error", "error_type": random.choice(["timeout", "connection_reset", "rate_limited"])}

        self.results["success"] += 1
        return {"status": "success"}

# Test with 30% error rate
tool = FaultyTool(error_rate=0.3)
tools = ["web_search"] * 15 + ["calculator"] * 5 + ["file_delete"] * 3

for t in tools:
    result = tool.execute(t)
    print(f"  {t}: {result['status']}")

print(f"\nResults: {tool.results}")

What to observe:

  • file_delete should always be denied (priority 100 deny rule)
  • web_search and calculator should succeed or error (never denied)
  • The error rate should approximate 30% for allowed tools
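
To make this a repeatable check rather than an eyeball test, assert the invariants that hold regardless of the random draw (continuing from the script above):

# Invariants that hold for any random seed
assert tool.results["denied"] == 3, "every file_delete call should be denied"
assert tool.results["success"] + tool.results["error"] == 20, \
    "allowed tools should succeed or error, never be denied"
print("governance invariants held under error injection")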

Pattern 3: Policy Conflict

Create conflicting policies and observe resolution.

# chaos-test/03_policy_conflict.py
from agent_os.policies import PolicyEvaluator
from agent_os.policies.schema import PolicyDocument
from pathlib import Path

evaluator = PolicyEvaluator()

# Load base policy
base = PolicyDocument.from_yaml(Path("chaos-test/policies/chaos-test-policy.yaml"))
evaluator.policies.append(base)

# Create a conflicting override
conflict_yaml = """
apiVersion: agent-governance/v1
kind: Policy
metadata:
  name: conflict-override
  version: "1.0"
spec:
  defaults:
    action: allow
  rules:
    - name: allow-file-delete
      tool_name: file_delete
      action: allow
      priority: 50
"""

import tempfile

# Write the override to a temp file so PolicyDocument.from_yaml can read it
with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
    f.write(conflict_yaml)
    conflict_path = f.name

conflict = PolicyDocument.from_yaml(Path(conflict_path))
evaluator.policies.append(conflict)

# Test the conflict
test_tools = ["web_search", "calculator", "file_delete"]
for t in test_tools:
    decision = evaluator.evaluate({"tool_name": t})
    print(f"  {t}: {'ALLOWED' if decision.allowed else 'DENIED'} (reason: {decision.reason})")

What to observe:

  • file_delete should still be denied: the base policy's deny rule has priority 100, while the conflicting allow rule has only priority 50
  • Higher-priority rules win in AGT's evaluation
  • This demonstrates why priority ordering matters in multi-policy environments
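
The resolution rule itself is easy to model. This toy sketch mirrors the observed behavior, picking the highest-priority matching rule; it is not AGT's actual implementation:

# Toy model of priority-based conflict resolution
rules = [
    {"tool_name": "file_delete", "action": "deny",  "priority": 100},  # base policy
    {"tool_name": "file_delete", "action": "allow", "priority": 50},   # override
]

def resolve(tool_name, default="deny"):
    matching = [r for r in rules if r["tool_name"] == tool_name]
    if not matching:
        return default
    return max(matching, key=lambda r: r["priority"])["action"]

print(resolve("file_delete"))  # deny: priority 100 beats 50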

Pattern 4: Resource Exhaustion

Test behavior when tool call budgets are exhausted.

# chaos-test/04_resource_exhaustion.py
from agent_os.policies import PolicyEvaluator
from agent_os.policies.schema import PolicyDocument
from pathlib import Path

evaluator = PolicyEvaluator()
policy = PolicyDocument.from_yaml(Path("chaos-test/policies/chaos-test-policy.yaml"))
evaluator.policies.append(policy)

# Rapid-fire calls to exhaust budget
results = []
for i in range(30):
    context = {"tool_name": "web_search", "call_index": i}
    decision = evaluator.evaluate(context)
    results.append("ALLOWED" if decision.allowed else "DENIED")

allowed = results.count("ALLOWED")
denied = results.count("DENIED")
print(f"Allowed: {allowed}, Denied: {denied}")
print(f"Budget respected: {allowed <= 20}")

What to observe:

  • The policy enforces a hard cap at max_tool_calls: 20
  • No amount of retrying should bypass the budget
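
A stricter variant retries after exhaustion and confirms the denials are stable (continuing from the script above, and assuming the evaluator tracks call counts as this pattern expects):

# Retries after exhaustion should stay denied
for attempt in range(5):
    decision = evaluator.evaluate({"tool_name": "web_search", "call_index": 30 + attempt})
    assert not decision.allowed, f"retry {attempt} bypassed the exhausted budget"
print("budget held under retries")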

Interpreting Results

After running all four patterns, review the audit log:

grep "denied" chaos-test-audit.log | wc -l
grep "allowed" chaos-test-audit.log | wc -l

A well-governed system should show:

  • Latency injection: graceful degradation, no unhandled exceptions
  • Error injection: errors are logged, governance still enforced
  • Policy conflicts: higher-priority rules consistently win
  • Resource exhaustion: hard limits enforced, no budget bypass

Next Steps

  • Adjust error rates and latency values to match your production environment
  • Add these patterns to your CI pipeline for regression testing (see the sketch after this list)
  • Combine patterns (e.g., latency + errors simultaneously) for more realistic stress tests
  • Review Tutorial 14: Kill Switch and Rate Limiting for additional resource control patterns
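
For the CI route, each pattern can become a pytest test that asserts its invariant instead of printing it. A minimal sketch for the resource-exhaustion pattern (the file name and test layout are suggestions, not AGT conventions):

# chaos-test/test_chaos.py  (run with: pytest chaos-test/)
from pathlib import Path

from agent_os.policies import PolicyEvaluator
from agent_os.policies.schema import PolicyDocument

def test_budget_is_enforced():
    evaluator = PolicyEvaluator()
    policy = PolicyDocument.from_yaml(Path("chaos-test/policies/chaos-test-policy.yaml"))
    evaluator.policies.append(policy)

    decisions = [evaluator.evaluate({"tool_name": "web_search", "call_index": i})
                 for i in range(30)]
    allowed = sum(d.allowed for d in decisions)

    assert allowed <= 20  # the hard cap from the policy defaults must hold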