Chapter 7: Policy Versioning¶
Chapter 6 proved that your policies work right now. But policies change. Legal tells you that send_email should be a hard block, not an escalation. Someone fixes that โ and accidentally breaks transfer_funds in the same edit. You need a way to compare two versions, test both, and catch the regression before the new version goes live.
What you'll learn:
| Section | Topic |
|---|---|
| Two versions side by side | What changed between v1 and v2 |
| Diff with the CLI | See every structural change in one command |
| Test both versions | Run the same contexts against v1 and v2 |
| Catch the regression | Separate expected changes from accidents |
Step 1: Two versions side by side¶
Version 1.0 is the production baseline โ the same combined policy from Chapter 6 with five rules covering all decision tiers.
Version 2.0 has three changes:
| # | Change | Intentional? |
|---|---|---|
| 1 | block-write-file priority raised from 70 to 95 | Yes โ fixes the Chapter 6 surprise where the environment policy overrode the block |
| 2 | escalate-send-email message no longer says "requires human approval" | Yes โ legal decided send_email should be fully blocked |
| 3 | escalate-transfer-funds message no longer says "requires human approval" | No โ accidental edit, breaks the escalation |
Changes 1 and 2 are intentional. Change 3 happened because someone edited both escalation rules instead of just one. The YAML diff looks like a routine cleanup. The damage is invisible without a behavioral test.
Step 2: Diff with the CLI¶
The built-in diff command compares two policy files structurally:
rule escalate-transfer-funds: message: Sensitive action: transfer_funds requires human approval -> Sensitive action: transfer_funds is blocked
rule escalate-send-email: message: Sensitive action: send_email requires human approval -> Communication: send_email is blocked by policy
rule block-write-file: priority: 70 -> 95
version: 1.0 -> 2.0
Every structural change is listed: two messages changed, one priority raised, and the version bumped. But the diff does not tell you which change breaks behavior. For that, you need to run both versions through the same tests.
Step 3: Test both versions¶
Load v1 and v2 into separate evaluators and run the same five tools through both. Use the classify() helper from Chapter 6 to tag each result as allow, escalate, or deny:
from pathlib import Path
from agent_os.policies import PolicyEvaluator
from agent_os.policies.schema import PolicyDocument
examples_dir = Path("docs/tutorials/policy-as-code/examples")
v1 = PolicyDocument.from_yaml(examples_dir / "07_policy_v1.yaml")
v2 = PolicyDocument.from_yaml(examples_dir / "07_policy_v2.yaml")
eval_v1 = PolicyEvaluator(policies=[v1])
eval_v2 = PolicyEvaluator(policies=[v2])
ESCALATION_KEYWORD = "requires human approval"
def classify(decision):
if decision.allowed:
return "allow"
if decision.reason and ESCALATION_KEYWORD in decision.reason.lower():
return "escalate"
return "deny"
tools = ["search_documents", "write_file", "send_email",
"delete_database", "transfer_funds"]
for tool in tools:
ctx = {"tool_name": tool}
t1 = classify(eval_v1.evaluate(ctx))
t2 = classify(eval_v2.evaluate(ctx))
changed = "โ ๏ธ" if t1 != t2 else ""
print(f"{tool:<22s} {t1:<12s} {t2:<12s} {changed}")
Example output¶
Tool v1 v2 Changed?
----------------------------------------------------------
search_documents โ
allow โ
allow
write_file ๐ซ deny ๐ซ deny
send_email โณ escalate ๐ซ deny โ ๏ธ yes
delete_database ๐ซ deny ๐ซ deny
transfer_funds โณ escalate ๐ซ deny โ ๏ธ yes
2 tool(s) changed behavior between versions.
Two tools changed: send_email and transfer_funds. Both went from escalate to deny. The structural diff showed three changes, but the behavioral test shows only two matter. The write_file priority change does not affect single-policy evaluation โ it matters when combined with the environment policy (that is what the Chapter 6 test matrix would catch).
Step 4: Catch the regression¶
The team planned one behavioral change: send_email should become a hard deny. Anything else that changed is a regression.
expected_changes = {"send_email"}
for tool, tier1, tier2, changed in results:
if not changed:
continue
if tool in expected_changes:
print(f"โ
{tool}: {tier1} โ {tier2} (expected)")
else:
print(f"โ {tool}: {tier1} โ {tier2} (REGRESSION)")
โ
send_email: escalate โ deny (expected โ legal decision)
โ transfer_funds: escalate โ deny (REGRESSION)
โ Regression: transfer_funds
Was 'escalate' in v1, now 'deny' in v2.
The v2 edit removed the escalation keyword from the
message, so the action that used to pause for human
review now silently blocks.
Fix the regression in v2, then re-run this comparison.
Do not deploy until all changes are expected.
The regression is the same type Chapter 6 caught in Part 4 โ removing "requires human approval" silently converts an escalation into a hard deny. But this time, the test compares two versions instead of checking one version in isolation. That is what makes it a versioning check: you can see exactly when the behavior changed and which edit caused it.
Full example¶
============================================================
Chapter 7: Policy Versioning
============================================================
--- Part 1: Load both versions ---
v1: 'production-policy' version 1.0 (5 rules)
v2: 'production-policy' version 2.0 (5 rules)
--- Part 2: Diff the two versions ---
version: 1.0 โ 2.0
rule escalate-transfer-funds: message changed
was: "Sensitive action: transfer_funds requires human approval"
now: "Sensitive action: transfer_funds is blocked"
rule escalate-send-email: message changed
was: "Sensitive action: send_email requires human approval"
now: "Communication: send_email is blocked by policy"
rule block-write-file: priority 70 โ 95
The diff lists every structural change. But a diff cannot
tell you whether a change is safe. You need to test both
versions and compare the results.
--- Part 3: Test both versions ---
Tool v1 v2 Changed?
----------------------------------------------------------
search_documents โ
allow โ
allow
write_file ๐ซ deny ๐ซ deny
send_email โณ escalate ๐ซ deny โ ๏ธ yes
delete_database ๐ซ deny ๐ซ deny
transfer_funds โณ escalate ๐ซ deny โ ๏ธ yes
2 tool(s) changed behavior between versions.
--- Part 4: Detect regressions ---
โ
send_email: escalate โ deny (expected โ legal decision)
โ transfer_funds: escalate โ deny (REGRESSION)
โ Regression: transfer_funds
Was 'escalate' in v1, now 'deny' in v2.
The v2 edit removed the escalation keyword from the
message, so the action that used to pause for human
review now silently blocks.
Fix the regression in v2, then re-run this comparison.
Do not deploy until all changes are expected.
============================================================
Policy versioning closes the loop.
Tag a version, diff it, test both, catch regressions.
No policy update ships without passing this check.
============================================================
How does it work?¶
v1.yaml v2.yaml
โ โ
โโโโโโโโโโฌโโโโโโโโ
โผ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ 1. Diff โ
โ CLI: policy diff โ
โ List structural diffs โ
โโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโ
โผ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ 2. Test both โ
โ Same contexts, same โ
โ classify() function โ
โโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโดโโโโโโโ
โผ โผ
No changes Changes found
โ
Safe to โ
deploy โโโโโโโโโโโโโโโโ
โ 3. Classify โ
โ Expected vs โ
โ Regression โ
โโโโโโโโฌโโโโโโโโ
โ
โโโโโโโโดโโโโโโโ
โผ โผ
Expected Regression
โ
Deploy โ Fix first
| Tool | What it does |
|---|---|
policy diff v1.yaml v2.yaml | CLI: structural diff between two policy files |
PolicyDocument.from_yaml(path) | Load and validate a policy file |
PolicyEvaluator(policies=[doc]) | Create an evaluator from a PolicyDocument |
evaluator.evaluate(context) | Return a PolicyDecision with allowed, action, reason |
classify(decision) | Tag a decision as allow, escalate, or deny (from Chapter 6) |
Try it yourself¶
-
Add a new rule in v2. Create a rule
block-execute-codethat deniesexecute_codein v2 only. Re-run the diff โ it should show "rule added." Test both versions to confirm the new rule only affects v2, and add it toexpected_changesso it does not flag as a regression. -
Bridge conversion. Import
governance_to_documentfromagent_os.policies.bridgeand convert aGovernancePolicyobject into aPolicyDocument. Diff the result against v1 to see how the legacy format maps to the declarative format. -
Automate the gate. Write a function
is_safe_to_deploy(v1_path, v2_path, expected)that loads both files, diffs them, tests both, and returnsTrueonly if every behavioral change is in theexpectedset. This is a deploy gate โ run it in CI before any policy update ships.
What you've built¶
Over seven chapters, you built a complete policy governance system:
| Chapter | Layer |
|---|---|
| 1 | Block dangerous tools |
| 2 | Scope permissions by role |
| 3 | Rate-limit actions |
| 4 | Resolve conflicts between policies |
| 5 | Escalate sensitive actions to humans |
| 6 | Test policies automatically |
| 7 | Update policies safely with regression detection |
Each layer added one concept. Together, they form a system that can govern AI agents in production: who can do what, how often, who approves, how you test it, and how you update it without breaking what already works.
Previous: Chapter 6 โ Policy Testing