Chapter 6: Policy Testing

In Chapters 1–5, you checked your policies by running a script and eyeballing the output. That works when you have five rules. But you now have role-based policies, environment-aware rules, conflict resolution, and escalation workflows. A single typo in a YAML file can silently change an escalation into a hard deny, and nobody notices until a real transfer fails in production.

Manual checking does not scale. You need automated tests that verify every tool gets the right decision, for every role, every time.

What you'll learn:

Section                  Topic
-----------------------  ----------------------------------------------------------------------------
The problem              Why eyeballing output is not enough
Validate the structure   Catch structural errors before anything runs
Write test scenarios     Declare expected outcomes, run them automatically
The test matrix          Combine policies from Chapters 2 and 4, test every role × environment × tool
Catch a regression       Find the bug that manual checking misses
Try it yourself          Exercises

The problem

Manual checking breaks down fast. Once you have multiple policies, you need a repeatable way to say "for this context, I expect this decision" and verify the result automatically.


Step 1: Validate the structure

Before testing any decisions, make sure the YAML is well-formed. A misspelled operator or a missing field will cause confusing failures later. Catch structural errors first.

If you are using the checked-in example files from the repo root, use the full paths shown in the commands below. If you created your own copies locally, replace them with your local filenames.

A valid policy (06_test_policy.yaml)

This policy combines concepts from earlier chapters (allow, deny, escalation, and a default) into a single file designed for testing:

version: "1.0"
name: test-policy
description: >
  Combined policy for automated testing.  Covers allow, deny,
  escalation-tagged deny, and default-allow so that test scenarios
  can verify every decision path in one pass.

rules:
  # Tier 1: Always denied - irreversibly destructive
  - name: block-delete-database
    condition:
      field: tool_name
      operator: eq
      value: delete_database
    action: deny
    priority: 100
    message: "Destructive action: deleting databases is never allowed"

  # Tier 2: Escalation - needs human review
  - name: escalate-transfer-funds
    condition:
      field: tool_name
      operator: eq
      value: transfer_funds
    action: deny
    priority: 90
    message: "Sensitive action: transfer_funds requires human approval"

  - name: escalate-send-email
    condition:
      field: tool_name
      operator: eq
      value: send_email
    action: deny
    priority: 85
    message: "Sensitive action: send_email requires human approval"

  # Tier 3: Always allowed - safe, read-only actions
  - name: allow-search-documents
    condition:
      field: tool_name
      operator: eq
      value: search_documents
    action: allow
    priority: 80
    message: "Safe action: searching documents is always allowed"

  # Tier 4: Explicit deny - not needed by this agent
  - name: block-write-file
    condition:
      field: tool_name
      operator: eq
      value: write_file
    action: deny
    priority: 70
    message: "Write access is not permitted for this agent"

defaults:
  action: allow
  max_tool_calls: 10

Five rules, four decision tiers, one default. Enough to test every path.

Loading and validating

from pathlib import Path

from agent_os.policies.schema import PolicyDocument

examples_dir = Path("docs/tutorials/policy-as-code/examples")

policy = PolicyDocument.from_yaml(examples_dir / "06_test_policy.yaml")
print(policy.name)        # "test-policy"
print(len(policy.rules))  # 5

PolicyDocument.from_yaml() does two things: it parses the YAML and validates it against the schema. If the file is valid, you get a PolicyDocument object. If not, you get a ValidationError that tells you exactly what is wrong.

A broken policy

What if someone types equals instead of eq?

from pydantic import ValidationError

broken = {
    "version": "1.0",
    "name": "broken-policy",
    "rules": [{
        "name": "bad-rule",
        "condition": {
            "field": "tool_name",
            "operator": "equals",   # wrong: should be "eq"
            "value": "send_email",
        },
        "action": "deny",
    }],
}

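try:
    PolicyDocument.model_validate(broken)
except ValidationError as exc:
    # Each error carries a location tuple ("loc") and a message ("msg").
    err = exc.errors()[0]
    print("  🚫 Validation failed (as expected):")
    print("     Field:  ", " -> ".join(str(part) for part in err["loc"]))
    print("     Problem:", err["msg"])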

Example output

  🚫 Validation failed (as expected):
     Field:   rules -> 0 -> condition -> operator
     Problem: Input should be 'eq', 'ne', 'gt', 'lt', 'gte', 'lte', 'in', 'matches' or 'contains'

The error message tells you the exact path (rules -> 0 -> condition -> operator) and the valid values. You do not need to guess.
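To make the idea of an error path concrete, here is a minimal stand-in sketch in plain Python. It is not the library's validator: the helper name `find_operator_errors` is hypothetical, and the operator list is copied from the validation message above. It walks a policy dict and reports each unknown operator using the same `rules -> 0 -> condition -> operator` style of path.

```python
# Stand-in sketch: NOT the library's validator. It only illustrates how a
# schema check can report the exact location of a bad value.

# Operator names copied from the validation message shown above.
VALID_OPERATORS = {"eq", "ne", "gt", "lt", "gte", "lte", "in", "matches", "contains"}


def find_operator_errors(policy: dict) -> list[tuple[str, str]]:
    """Return (path, problem) pairs for every rule with an unknown operator."""
    errors = []
    for index, rule in enumerate(policy.get("rules", [])):
        operator = rule.get("condition", {}).get("operator")
        if operator not in VALID_OPERATORS:
            path = " -> ".join(["rules", str(index), "condition", "operator"])
            errors.append((path, f"unknown operator {operator!r}"))
    return errors


broken = {
    "version": "1.0",
    "name": "broken-policy",
    "rules": [{
        "name": "bad-rule",
        "condition": {"field": "tool_name", "operator": "equals", "value": "send_email"},
        "action": "deny",
    }],
}

for path, problem in find_operator_errors(broken):
    print(f"{path}: {problem}")
```

The real schema checks far more than operators. The point is only that a path built from keys and list indexes leads a reader straight to the offending line.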

Using the CLI

The same validation is available as a command:

python -m agent_os.policies.cli validate \
  docs/tutorials/policy-as-code/examples/06_test_policy.yaml
OK

Exit code 0 means the file is valid. Exit code 1 means validation failed (with the error printed to stderr). Exit code 2 means the file could not be found or parsed.
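If you call the validator from a script, the exit codes above can be turned into readable labels. The sketch below is an assumption-laden illustration: `run_and_explain` and `EXIT_MEANINGS` are hypothetical names, and the demo runs a stand-in command that simply exits with status 1 rather than the real CLI.

```python
import subprocess
import sys

# Meanings taken from the exit codes described above.
EXIT_MEANINGS = {
    0: "valid",
    1: "validation failed",
    2: "file not found or not parseable",
}


def run_and_explain(command: list[str]) -> tuple[int, str]:
    """Run a command and translate its exit code into a short label."""
    result = subprocess.run(command, capture_output=True, text=True)
    meaning = EXIT_MEANINGS.get(result.returncode, "unexpected exit code")
    return result.returncode, meaning


# Stand-in command that exits with status 1. In a real pipeline the command
# would be the `validate` invocation shown above.
stand_in = [sys.executable, "-c", "import sys; sys.exit(1)"]
code, meaning = run_and_explain(stand_in)
print(code, meaning)
```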


Step 2: Write test scenarios

Validation tells you the YAML is structured correctly. Test scenarios tell you the policy behaves correctly: each tool gets the right decision.

The scenarios file (06_test_scenarios.yaml)

scenarios:
  # Always allowed
  - name: search-documents-allowed
    context: { tool_name: search_documents }
    expected_action: allow

  # Always denied (destructive)
  - name: delete-database-denied
    context: { tool_name: delete_database }
    expected_action: deny

  # Escalation-tagged (deny with "requires human approval")
  - name: transfer-funds-denied
    context: { tool_name: transfer_funds }
    expected_action: deny

  - name: send-email-denied
    context: { tool_name: send_email }
    expected_action: deny

  # Explicit deny
  - name: write-file-denied
    context: { tool_name: write_file }
    expected_action: deny

  # Default action (tool not in any rule)
  - name: unknown-tool-uses-default
    context: { tool_name: read_logs }
    expected_action: allow

  # Same checks using expected_allowed (boolean)
  - name: search-documents-is-allowed
    context: { tool_name: search_documents }
    expected_allowed: true

  - name: delete-database-is-not-allowed
    context: { tool_name: delete_database }
    expected_allowed: false

Each scenario names a context and an expected result. You can check either the action string (expected_action) or the boolean (expected_allowed).
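A scenario that declares neither key asserts nothing, so a hand-written runner could report it as passing. A small lint step can catch that before the scenarios run. This is a sketch only: `lint_scenarios` is a hypothetical helper, not part of the library, and the scenarios are inlined as dicts instead of being loaded from the YAML file.

```python
# Hypothetical helper, not part of the library: flag scenarios that would
# pass vacuously because they declare no expectation at all.

EXPECTATION_KEYS = ("expected_action", "expected_allowed")


def lint_scenarios(scenarios: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    seen_names = set()
    for index, scenario in enumerate(scenarios):
        name = scenario.get("name", f"<scenario {index}>")
        if name in seen_names:
            problems.append(f"{name}: duplicate scenario name")
        seen_names.add(name)
        if not isinstance(scenario.get("context"), dict):
            problems.append(f"{name}: missing context")
        if not any(key in scenario for key in EXPECTATION_KEYS):
            problems.append(f"{name}: no expected_action or expected_allowed")
    return problems


scenarios = [
    {"name": "search-documents-allowed",
     "context": {"tool_name": "search_documents"},
     "expected_action": "allow"},
    # Typo in the key: this scenario would silently assert nothing.
    {"name": "delete-database-denied",
     "context": {"tool_name": "delete_database"},
     "expected_acton": "deny"},
]

for problem in lint_scenarios(scenarios):
    print(problem)
```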

Running with the CLI

python -m agent_os.policies.cli test \
  docs/tutorials/policy-as-code/examples/06_test_policy.yaml \
  docs/tutorials/policy-as-code/examples/06_test_scenarios.yaml
8/8 scenarios passed

If any scenario fails, the CLI prints which one and what went wrong:

FAIL: transfer-funds-denied: expected deny, got allow
7/8 scenarios passed

Exit code 0 means all passed. Exit code 1 means at least one failed.
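If you write your own runner, for example for a CI job, it can follow the same convention. The sketch below assumes results are available as hypothetical `(scenario name, passed)` pairs; `summarize` is an illustrative name, and the output format only loosely imitates the CLI output above.

```python
def summarize(results: list[tuple[str, bool]]) -> int:
    """Print a pass count and return a process exit code (0 = all passed)."""
    failures = [name for name, ok in results if not ok]
    for name in failures:
        print(f"FAIL: {name}")
    passed = len(results) - len(failures)
    print(f"{passed}/{len(results)} scenarios passed")
    return 1 if failures else 0


# Hypothetical results, shaped as (scenario name, passed?) pairs.
results = [
    ("search-documents-allowed", True),
    ("transfer-funds-denied", False),
]
exit_code = summarize(results)
print("exit code:", exit_code)
# In a script, finish with sys.exit(exit_code) so the CI job fails.
```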

Running in Python

The CLI is convenient, but sometimes you want the results in Python: for custom formatting, integration with a CI pipeline, or testing multiple policies at once.

from pathlib import Path

import yaml
from agent_os.policies import PolicyEvaluator
from agent_os.policies.schema import PolicyDocument

examples_dir = Path("docs/tutorials/policy-as-code/examples")

policy = PolicyDocument.from_yaml(examples_dir / "06_test_policy.yaml")
evaluator = PolicyEvaluator(policies=[policy])

with open(examples_dir / "06_test_scenarios.yaml") as f:
    scenarios = yaml.safe_load(f)["scenarios"]

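for scenario in scenarios:
    decision = evaluator.evaluate(scenario["context"])
    # A scenario may check the action string, the allowed boolean, or both.
    ok = True
    if "expected_action" in scenario:
        ok = ok and decision.action == scenario["expected_action"]
    if "expected_allowed" in scenario:
        ok = ok and decision.allowed == scenario["expected_allowed"]
    status = "✅ pass" if ok else "❌ FAIL"
    print(f"{scenario['name']}: {status}")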

Example output

  Scenario                         Expected   Actual     Result
  --------------------------------------------------------------------
  search-documents-allowed         allow      allow      ✅ pass
  delete-database-denied           deny       deny       ✅ pass
  transfer-funds-denied            deny       deny       ✅ pass
  send-email-denied                deny       deny       ✅ pass
  write-file-denied                deny       deny       ✅ pass
  unknown-tool-uses-default        allow      allow      ✅ pass
  search-documents-is-allowed      true       true       ✅ pass
  delete-database-is-not-allowed   false      false      ✅ pass

  ✅ 8/8 scenarios passed

Step 3: The test matrix

The scenarios in Step 2 test one policy in isolation. But in production, multiple policies apply at the same time: the reader policy from Chapter 2 and the environment policy from Chapter 4. When both are active, their rules merge and interact. A rule from one policy can override a rule from another, and the result might not be what anyone intended.

A test matrix crosses every role, every environment, and every tool. It tests the combined system, not individual pieces.

Building the combined system

Load the role policies from Chapter 2 and the environment policy from Chapter 4. For each role, combine its policy with the shared environment policy:

from pathlib import Path

from agent_os.policies import PolicyEvaluator
from agent_os.policies.schema import PolicyDocument

examples_dir = Path("docs/tutorials/policy-as-code/examples")

reader_policy = PolicyDocument.from_yaml(examples_dir / "02_reader_policy.yaml")
admin_policy = PolicyDocument.from_yaml(examples_dir / "02_admin_policy.yaml")
env_policy = PolicyDocument.from_yaml(examples_dir / "04_env_policy.yaml")

# Each role gets its own policy + the shared environment policy.
# The evaluator merges all rules and sorts by priority.
role_policies = {
    "reader": [reader_policy, env_policy],
    "admin":  [admin_policy, env_policy],
}

tools = ["search_documents", "write_file", "send_email",
         "delete_database", "transfer_funds"]
environments = ["development", "production"]

for tool in tools:
    for role, policies in role_policies.items():
        for env in environments:
            evaluator = PolicyEvaluator(policies=list(policies))
            decision = evaluator.evaluate({"tool_name": tool, "environment": env})
            # check against expected ...

When two policies are loaded into one evaluator, their rules are merged into a single list sorted by priority. The first rule that matches the context wins. This is where surprising interactions happen.
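The merge-and-first-match behaviour described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the library's evaluator: the `Rule` dataclass and `evaluate` function are hypothetical, only equality conditions are supported, and the default action is an assumed parameter. The two rules use the names and priorities quoted in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    field: str
    value: str
    action: str
    priority: int


def evaluate(policies: list[list[Rule]], context: dict, default: str = "allow") -> tuple[str, str]:
    """Merge all rules, sort by priority (highest first), return the first match."""
    merged = [rule for policy in policies for rule in policy]
    for rule in sorted(merged, key=lambda r: r.priority, reverse=True):
        if context.get(rule.field) == rule.value:
            return rule.action, rule.name
    return default, "<default>"


# One rule per policy is enough to show an interaction.
reader = [Rule("block-write-file", "tool_name", "write_file", "deny", 80)]
env = [Rule("allow-development", "environment", "development", "allow", 90)]

context = {"tool_name": "write_file", "environment": "development"}
print(evaluate([reader, env], context))
```

Because both rules match the same context, the outcome depends only on which priority is higher. Neither policy file shows that when read on its own.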

Example output

  Tool                   reader/dev  reader/prod  admin/dev   admin/prod
  -----------------------------------------------------------------------
  search_documents       ✅ allow     🚫 deny      ✅ allow    🚫 deny
  write_file             ✅ allow ⚠️  🚫 deny      ✅ allow    🚫 deny
  send_email             🚫 deny      🚫 deny      ✅ allow    🚫 deny
  delete_database        🚫 deny      🚫 deny      🚫 deny    🚫 deny
  transfer_funds         ✅ allow     🚫 deny      ✅ allow    🚫 deny

  19/20 cells match expectations.  1 surprise:

  ⚠️  reader + development + write_file
     Expected: deny (reader policy blocks write_file at priority 80)
     Actual:   allow (environment policy allows development at priority 90)
     Reason:   Development environment: agents can act freely

What just happened?

The matrix found a real interaction bug. block-write-file is priority 80, but allow-development is priority 90, so the environment rule wins first and the reader is allowed to write files in development. You would not catch that by reading the YAML files one at a time.
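The comparison step that produces a report like this can be sketched as a dictionary diff over every cell of the matrix. The names here are hypothetical: `find_surprises` is an illustrative helper, the expectations are written as a simple rule of thumb covering two tools, and `actual` is a stand-in for evaluator output rather than a real run.

```python
from itertools import product

tools = ["search_documents", "write_file"]
roles = ["reader", "admin"]
environments = ["development", "production"]


def find_surprises(actual: dict, expected: dict) -> list[tuple]:
    """Return (cell, expected, actual) for every cell that disagrees."""
    return [
        (cell, expected[cell], actual.get(cell))
        for cell in expected
        if actual.get(cell) != expected[cell]
    ]


# Assumed expectations: everything is denied in production, and the reader
# may not write files in any environment.
expected = {}
for tool, role, env in product(tools, roles, environments):
    denied = env == "production" or (role == "reader" and tool == "write_file")
    expected[(tool, role, env)] = "deny" if denied else "allow"

# Stand-in for evaluator output: identical to expected except one cell.
actual = dict(expected)
actual[("write_file", "reader", "development")] = "allow"

for cell, want, got in find_surprises(actual, expected):
    print(cell, "expected", want, "got", got)
```

Writing the expectations down separately from the policies is the design choice that matters: the matrix can only flag a surprise if someone has stated what was intended.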


Step 4: Catch a regression

This is the payoff. Here is a bug that would be nearly invisible to a human reviewer, but a test catches it instantly.

The scenario

Someone edits the policy and changes the transfer_funds rule's message from:

"Sensitive action: transfer_funds requires human approval"

to:

"Sensitive action: transfer_funds is blocked"

The rule still says action: deny. Nothing else changed. A YAML diff shows one line modified. A human reviewer might glance at it and approve.

But in the code, the escalation system uses the phrase "requires human approval" in the message to distinguish an escalation from a hard deny (Chapter 5). Removing that phrase silently converts an escalation, where a human could approve the transfer, into an unconditional block.

What the test shows

  Original policy:  transfer_funds → ⏳ escalate (escalate)
  Modified policy:  transfer_funds → 🚫 deny     (deny)

  ❌ Regression detected!
     transfer_funds changed from 'escalate' to 'deny'.
     The edit removed the escalation keyword, so the action
     that used to pause for human review now silently blocks.

The test compared the classification of the decision, not just the raw action string. Both versions return action: deny, but only the original still means "escalate."
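A minimal sketch of that classification check follows. It assumes the escalation test is a plain substring match on the message, which is a simplification of whatever Chapter 5 actually implements; `classify` and `ESCALATION_PHRASE` are hypothetical names.

```python
ESCALATION_PHRASE = "requires human approval"  # phrase quoted in this chapter


def classify(action: str, message: str) -> str:
    """Map a raw decision to 'allow', 'escalate', or 'deny'."""
    # Assumption: a deny whose message contains the phrase is an escalation.
    if action == "deny" and ESCALATION_PHRASE in message:
        return "escalate"
    return action


original = classify("deny", "Sensitive action: transfer_funds requires human approval")
modified = classify("deny", "Sensitive action: transfer_funds is blocked")

if original != modified:
    print(f"Regression: transfer_funds changed from {original!r} to {modified!r}")
```

A scenario that asserts only `expected_action: deny` passes for both messages, which is why the comparison has to run on the classified result.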


Full example

python docs/tutorials/policy-as-code/examples/06_policy_testing.py
============================================================
  Chapter 6: Policy Testing
============================================================

--- Part 1: Validate the structure ---

  ✅ 'test-policy' loaded successfully
     5 rules, default action: allow

  🚫 Validation failed (as expected):
     Field:   rules -> 0 -> condition -> operator
     Problem: Input should be 'eq', 'ne', 'gt', 'lt', 'gte', 'lte', 'in', 'matches' or 'contains'

  PolicyDocument.from_yaml() catches structural errors
  before any rule is evaluated. A typo like 'equals'
  instead of 'eq' is caught immediately.

--- Part 2: Run test scenarios ---

  Scenario                         Expected   Actual     Result
  --------------------------------------------------------------------
  search-documents-allowed         allow      allow      ✅ pass
  delete-database-denied           deny       deny       ✅ pass
  transfer-funds-denied            deny       deny       ✅ pass
  send-email-denied                deny       deny       ✅ pass
  write-file-denied                deny       deny       ✅ pass
  unknown-tool-uses-default        allow      allow      ✅ pass
  search-documents-is-allowed      true       true       ✅ pass
  delete-database-is-not-allowed   false      false      ✅ pass

  ✅ 8/8 scenarios passed

  Each scenario is one line in a YAML file. The test runner
  evaluates the policy and compares the actual result to the
  expected result. No manual checking required.

--- Part 3: The test matrix ---

  Loading policies from chapters 2 and 4...

  Tool                   reader/dev  reader/prod  admin/dev   admin/prod
  -----------------------------------------------------------------------
  search_documents       ✅ allow     🚫 deny      ✅ allow    🚫 deny
  write_file             ✅ allow ⚠️  🚫 deny      ✅ allow    🚫 deny
  send_email             🚫 deny      🚫 deny      ✅ allow    🚫 deny
  delete_database        🚫 deny      🚫 deny      🚫 deny    🚫 deny
  transfer_funds         ✅ allow     🚫 deny      ✅ allow    🚫 deny

  19/20 cells match expectations.  1 surprise(s):

  ⚠️  reader + development + write_file
     Expected: deny
     Actual:   allow (from rule: allow-development)
     Reason:   Development environment: agents can act freely

  The reader policy blocks write_file at priority 80.
  But the environment policy allows development at priority 90.
  Priority 90 beats 80: the environment rule fires first.
  Without the test matrix, this interaction is invisible.

--- Part 4: Catch a regression ---

  Scenario: someone edits the policy and removes the phrase
  "requires human approval" from the transfer_funds rule.
  The tool silently flips from escalate to hard deny.

  Original policy:  transfer_funds → ⏳ escalate (escalate)
  Modified policy:  transfer_funds → 🚫 deny     (deny)

  ❌ Regression detected!
     transfer_funds changed from 'escalate' to 'deny'.
     The edit removed the escalation keyword, so the action
     that used to pause for human review now silently blocks.

  A human scanning the YAML diff might miss this. But a test
  scenario that checks for the escalation keyword catches it
  instantly. That is the value of automated policy testing:
  changes that look harmless cannot silently break behavior.

============================================================
  Policies are code. Test them like code.
  Validate the structure, write expected outcomes,
  run them automatically, and catch regressions
  before they reach production.
============================================================

How does it work?

  Role policy     Environment policy
  (ch2)           (ch4)
      │                │
      └────────┬───────┘
               ▼
  ┌─────────────────────────────────┐
  │  1. Validate each file          │
  │     PolicyDocument.from_yaml()  │
  └──────────┬──────────────────────┘
             ▼
  ┌─────────────────────────────────┐
  │  2. Test each policy alone      │
  │     CLI: policy test            │
  └──────────┬──────────────────────┘
             ▼
  ┌─────────────────────────────────┐
  │  3. Test the combined system    │
  │     Python: multi-policy eval   │
  └──────────┬──────────────────────┘
             │
      ┌──────┴──────┐
      ▼             ▼
  All pass     Surprises found
  ✅ Deploy    ❌ Fix and re-run

Tool                                  What it does
------------------------------------  ----------------------------------------------------
PolicyDocument.from_yaml(path)        Load YAML and validate against the Pydantic schema
PolicyDocument.model_validate(dict)   Validate a Python dict without loading a file
PolicyEvaluator(policies=[...])       Merge rules from multiple policies
evaluator.evaluate(context)           Return a PolicyDecision with allowed, action, reason
policy validate <file>                CLI: validate structure, print OK or FAIL
policy test <policy> <scenarios>      CLI: run scenarios, print pass count

Try it yourself

  1. Fix the surprise. The test matrix found that reader + development + write_file is unexpectedly allowed. Edit 02_reader_policy.yaml and raise block-write-file's priority to 95 (above the environment policy's 90). Re-run the script; the ⚠️ should disappear.

  2. Add a staging environment. The environment policy has rules for development and production, but not staging. Add staging to the environments list in the test matrix. What happens? Does the default deny or allow? Add a scenario to verify.

  3. Extend the matrix. Create a third policy file for an "operator" role that can search documents and send emails but cannot write files or delete databases. Add it to the Python test matrix and verify the results across all environments.


What's missing?

Policies change over time. Legal tells you that write_file must now be blocked in production, not just for readers. The policy needs to be updated from version 1.0 to version 2.0. But how do you make that change without accidentally breaking something that was already working?

You need a way to compare two versions side by side โ€” see exactly what changed, run the test suite against both versions, and find regressions before the new version goes live. That is policy versioning.

Previous: Chapter 5 (Approval Workflows)
Next: Chapter 7 (Policy Versioning): compare v1 vs v2 behavior, catch regressions before deploying.