ADR-0024: RL Training Governance with Violation Penalties

Status

Accepted

Context

Agent Lightning provides reinforcement learning infrastructure for agent optimization. Without governance integration, RL training could:

  • Reward policy-violating behavior (if violations lead to higher task completion)
  • Produce agents that learn to circumvent governance controls
  • Generate training data that includes unsafe action sequences

We needed training-time governance that discourages policy violations through the reward signal itself, rather than relying solely on runtime enforcement.

Decision

GovernedEnvironment wraps the RL environment with violation-aware reward shaping:

  • Each policy violation incurs a penalty subtracted from the reward signal
  • Severity levels map to penalty magnitudes:
      • critical -- episode termination (agent is stopped immediately)
      • high -- large negative reward
      • medium -- moderate negative reward
      • low -- small negative reward
  • Violation records include: policy, description, severity, blocked, step, timestamp
  • FlightRecorderEmitter exports training spans with full policy evaluation context for audit
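The severity-to-penalty mapping above can be sketched roughly as follows. This is a minimal illustration, not the actual GovernedEnvironment code: the names `Severity`, `Violation`, `DEFAULT_PENALTIES`, and `shape_reward` are hypothetical, the numeric magnitudes are placeholders, and it assumes that a critical violation carries no extra penalty unless one is configured.

```python
import time
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class Violation:
    # Field names follow the record fields listed in this ADR.
    policy: str
    description: str
    severity: Severity
    blocked: bool
    step: int
    timestamp: float = field(default_factory=time.time)


# Placeholder magnitudes (assumption); real values are set per deployment.
DEFAULT_PENALTIES = {
    Severity.LOW: 0.1,
    Severity.MEDIUM: 0.5,
    Severity.HIGH: 2.0,
}


def shape_reward(base_reward, violations, penalties=DEFAULT_PENALTIES):
    """Return (shaped_reward, terminated) for one environment step."""
    shaped = base_reward
    terminated = False
    for v in violations:
        # Subtract the configured penalty; severities without an entry cost 0.
        shaped -= penalties.get(v.severity, 0.0)
        if v.severity is Severity.CRITICAL:
            # Critical violations end the episode.
            terminated = True
    return shaped, terminated


reward, done = shape_reward(
    1.0, [Violation("no_shell_exec", "ran a shell command", Severity.MEDIUM, False, 3)]
)
```

Keeping the mapping in a plain dictionary is one way to make the magnitudes tunable from configuration rather than from code.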

Key design choices:

  • Penalties are configurable per deployment (not hardcoded)
  • Critical violations terminate the episode, preventing further unsafe exploration
  • All violations are recorded, regardless of severity, for the audit trail
  • Training audit spans carry agent_os.*-prefixed attributes for policy name, result, and violation status
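To make the agent_os.* attribute convention concrete, here is a small sketch of how a training span's attributes might be assembled. The helper `build_audit_attributes` and the exact key names are assumptions for illustration. The ADR only states that attributes are agent_os.* prefixed and cover policy name, result, and violation status, so the real keys emitted by FlightRecorderEmitter may differ.

```python
def build_audit_attributes(policy_name, result, violated, prefix="agent_os"):
    """Build a flat attribute dict for one policy evaluation.

    Key names below are hypothetical; only the prefix comes from the ADR.
    """
    return {
        f"{prefix}.policy.name": policy_name,
        f"{prefix}.policy.result": result,
        f"{prefix}.policy.violation": bool(violated),
    }


attrs = build_audit_attributes("no_shell_exec", "deny", True)
```

A flat, consistently prefixed key space of this kind is easy to filter in a tracing backend when reviewing what the agent did during training.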

Consequences

  • Agents learn that policy violations are costly, shaping behavior toward compliance
  • Critical violations end the episode, cutting off further unsafe exploration within it
  • Training audit trail enables post-hoc analysis of what the agent learned
  • Penalty magnitudes can be tuned without code changes
  • No modification to the underlying RL algorithm required -- works with any algorithm that uses scalar rewards
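The algorithm-agnostic point can be illustrated with a toy wrapper: the learner only ever sees the usual (observation, reward, done, info) tuple, so nothing in the training loop needs to change. Everything here is hypothetical (`ToyEnv`, `GovernedWrapper`, the `check` callable, and the penalty values), and it assumes that a critical action is not forwarded to the inner environment. It is not the real GovernedEnvironment interface.

```python
class ToyEnv:
    """Stand-in environment: every step yields reward 1.0 and never ends."""

    def reset(self):
        return 0

    def step(self, action):
        return 0, 1.0, False, {}


class GovernedWrapper:
    """Hypothetical governance wrapper around any env with reset()/step()."""

    def __init__(self, env, check, penalties):
        self.env = env
        self.check = check          # callable: action -> severity string or None
        self.penalties = penalties  # severity string -> penalty magnitude
        self.violations = []        # (step, severity) for every violation
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return self.env.reset()

    def step(self, action):
        self.step_count += 1
        severity = self.check(action)
        if severity is not None:
            # Record every violation, whatever its severity.
            self.violations.append((self.step_count, severity))
        if severity == "critical":
            # Assumption: the action is blocked and the episode ends at once.
            return None, 0.0, True, {"violation": severity}
        obs, reward, done, info = self.env.step(action)
        if severity is not None:
            reward -= self.penalties[severity]
            info = {**info, "violation": severity}
        return obs, reward, done, info


check = lambda action: {"bad": "high", "fatal": "critical"}.get(action)
env = GovernedWrapper(ToyEnv(), check, {"low": 0.1, "medium": 0.5, "high": 2.0})
env.reset()
for action in ["ok", "bad", "fatal"]:
    obs, reward, done, info = env.step(action)
    if done:
        break
```

Because the penalty is folded into the scalar reward before the learner sees it, any algorithm that consumes scalar rewards can train against the wrapped environment unchanged.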

References

  • agent-governance-python/agent-lightning/src/agent_lightning_gov/environment.py
  • agent-governance-python/agent-lightning/src/agent_lightning_gov/emitter.py
  • docs/specs/AGENT-LIGHTNING-FAST-PATH-1.0.md
  • docs/specs/AUDIT-COMPLIANCE-1.0.md Section 20