ADR-0024: RL Training Governance with Violation Penalties
Status
Accepted
Context
Agent Lightning provides reinforcement learning infrastructure for agent optimization. Without governance integration, RL training could:
- Reward policy-violating behavior (if violations lead to higher task completion)
- Produce agents that learn to circumvent governance controls
- Generate training data that includes unsafe action sequences
We needed training-time governance that discourages policy violations through the reward signal itself, rather than relying solely on runtime enforcement.
Decision
GovernedEnvironment wraps the RL environment with violation-aware reward shaping:
- Each policy violation incurs a penalty subtracted from the reward signal
- Severity levels map to penalty magnitudes:
  - critical -- episode termination (agent is stopped immediately)
  - high -- large negative reward
  - medium -- moderate negative reward
  - low -- small negative reward
- Violation records include: policy, description, severity, blocked, step, timestamp
- FlightRecorderEmitter exports training spans with full policy evaluation context for audit
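The severity-to-penalty mapping and the violation record fields above can be sketched as follows. This is a minimal illustration, not the `agent_lightning_gov` implementation: `Severity`, `ViolationRecord`, `DEFAULT_PENALTIES`, and `shape_reward` are hypothetical names, and the penalty magnitudes are placeholders because this ADR leaves them to per-deployment configuration. The sketch also assumes a critical violation subtracts a penalty only when one is configured for it, in addition to ending the episode.

```python
import time
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class ViolationRecord:
    """Fields mirror the violation record listed in this ADR."""
    policy: str
    description: str
    severity: Severity
    blocked: bool
    step: int
    timestamp: float = field(default_factory=time.time)


# Hypothetical placeholder magnitudes; real values come from deployment config.
DEFAULT_PENALTIES: dict[Severity, float] = {
    Severity.HIGH: 10.0,
    Severity.MEDIUM: 3.0,
    Severity.LOW: 0.5,
}


def shape_reward(
    base_reward: float,
    violations: list[ViolationRecord],
    penalties: dict[Severity, float] = DEFAULT_PENALTIES,
) -> tuple[float, bool]:
    """Return (shaped_reward, terminate_episode) for one environment step."""
    shaped = base_reward
    terminate = False
    for v in violations:
        # Every violation subtracts its configured penalty (0.0 if none is set).
        shaped -= penalties.get(v.severity, 0.0)
        # A critical violation ends the episode immediately.
        if v.severity is Severity.CRITICAL:
            terminate = True
    return shaped, terminate
```

With these placeholder penalties, a single medium violation on a step whose task reward is 1.0 yields `(-2.0, False)`, while a critical violation yields `terminate_episode=True`.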
Key design choices:
- Penalties are configurable per deployment (not hardcoded)
- Critical violations terminate the episode, preventing further unsafe exploration
- All violations are recorded, regardless of severity, to form a complete audit trail
- Training audit spans carry agent_os.*-prefixed attributes for policy name, result, and violation status
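The design choices above can be combined into a small wrapper. This sketch assumes a hypothetical inner environment exposed as a `step_fn(action) -> (obs, reward, done)` callable and a hypothetical `policy_check(step, action)` that returns violations. `GovernedEnvSketch`, `Violation`, and the attribute keys `agent_os.policy.name`, `agent_os.policy.result`, and `agent_os.violation` are illustrative names, not the real `GovernedEnvironment` or `FlightRecorderEmitter` API. It also assumes policies are checked before the action reaches the inner environment, so an action with a critical violation is never executed.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Violation:
    policy: str
    severity: str  # one of "critical", "high", "medium", "low"
    blocked: bool
    step: int


class GovernedEnvSketch:
    """Hypothetical governance wrapper; not the agent_lightning_gov API."""

    def __init__(
        self,
        step_fn: Callable[[Any], tuple[Any, float, bool]],
        policy_check: Callable[[int, Any], list[Violation]],
        penalties: dict[str, float],
    ) -> None:
        self._step_fn = step_fn
        self._check = policy_check
        self._penalties = dict(penalties)  # per-deployment config, not hardcoded
        self.step_count = 0
        self.violations: list[Violation] = []  # audit trail across all severities
        self.spans: list[dict[str, Any]] = []  # stand-in for exported audit spans

    def step(self, action: Any) -> tuple[Any, float, bool]:
        found = self._check(self.step_count, action)
        self.violations.extend(found)  # record every violation, whatever its severity
        self.spans.append(self._span_attributes(found))
        self.step_count += 1

        penalty = sum(self._penalties.get(v.severity, 0.0) for v in found)
        if any(v.severity == "critical" for v in found):
            # Terminate before the inner env runs the action: no further exploration.
            return None, 0.0 - penalty, True

        obs, reward, done = self._step_fn(action)
        return obs, reward - penalty, done

    @staticmethod
    def _span_attributes(found: list[Violation]) -> dict[str, Any]:
        # Hypothetical agent_os.* keys for policy name, result, and violation status.
        return {
            "agent_os.policy.name": ",".join(v.policy for v in found),  # empty if none
            "agent_os.policy.result": "violation" if found else "allow",
            "agent_os.violation": bool(found),
        }
```

Because `penalties` is passed in as a plain mapping, the magnitudes stay out of the code and could be read from a per-deployment configuration file instead.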
Consequences
- Agents learn that policy violations are costly, shaping behavior toward compliance
- Critical violations end the episode immediately, so the agent cannot continue exploring after an unsafe action
- Training audit trail enables post-hoc analysis of what the agent learned
- Penalty magnitudes can be tuned without code changes
- No modification to the underlying RL algorithm required -- works with any algorithm that uses scalar rewards
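Because the shaped reward is still an ordinary float, the return computation inside a policy-gradient or advantage-based learner needs no change. A minimal reward-to-go sketch illustrates this; the function name, the reward values, and the γ value are illustrative assumptions. A single penalized step pulls down the returns of the steps before it without the algorithm knowing a penalty was applied:

```python
def discounted_returns(rewards: list[float], gamma: float = 0.99) -> list[float]:
    """Reward-to-go G_t = r_t + gamma * G_{t+1}; unaware of how rewards were shaped."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


raw = [1.0, 1.0, 1.0]      # task reward only
shaped = [1.0, -4.0, 1.0]  # step 1 had a violation with a hypothetical penalty of 5.0
print(discounted_returns(raw, gamma=0.5))     # [1.75, 1.5, 1.0]
print(discounted_returns(shaped, gamma=0.5))  # [-0.75, -3.5, 1.0]
```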
References
- agent-governance-python/agent-lightning/src/agent_lightning_gov/environment.py
- agent-governance-python/agent-lightning/src/agent_lightning_gov/emitter.py
- docs/specs/AGENT-LIGHTNING-FAST-PATH-1.0.md
- docs/specs/AUDIT-COMPLIANCE-1.0.md, Section 20