Agent Lightning Fast-Path -- Version 1.0
Status: Draft · Date: 2025-07-28 · Authors: Agent Governance Toolkit team
This specification defines the RL training governance layer for Agent Lightning, including governed runners, policy violation handling, reward shaping, governed environments, flight recorder emission, and failure semantics. All SDK implementations MUST conform to this specification.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 and RFC 8174.
Table of Contents
- Introduction
- Terminology
- Governed Runner
- Policy Violations
- Governed Rollout
- Reward Shaping
- Policy Penalty Function
- Composite Reward
- Governed Environment
- Environment State
- Flight Recorder Emitter
- Span Export
- Runner Lifecycle
- Violation Rate and Stats
- Factory Functions
- Failure Semantics
- Security Considerations
- Conformance Requirements
- Worked Examples
- References
1. Introduction
1.1 Purpose
Agent Lightning is the RL training governance layer for the Agent Governance Toolkit. It wraps RL frameworks to inject policy enforcement into training loops, converting policy violations into learning signals that teach agents to respect governance constraints during reinforcement learning.
By intercepting each training step at the kernel boundary, Agent Lightning ensures that every rollout is subject to the same policy evaluation that governs production execution -- but instead of merely blocking unsafe actions, the layer converts violations into negative reward signals that steer the RL optimiser away from policy-violating behaviour.
1.2 Scope
This specification covers:
- Governed Runner: A generic runner that wraps RL kernels with policy enforcement, capturing violations and signals per rollout.
- Policy Violations: A typed violation model with severity-based penalties and enumerated violation categories.
- Governed Rollout: A rollout record that bundles task I/O with governance metadata (violations, signals, penalty, timing).
- Reward Shaping: Reward functions that integrate policy compliance into RL training objectives via additive or multiplicative penalties.
- Governed Environment: A Gymnasium-style training environment that enforces policies on every step and converts violations to rewards.
- Flight Recorder Emitter: An adapter that exports Flight Recorder audit logs as Lightning spans for unified training and compliance telemetry.
- Failure Semantics: Fail-closed behaviour for critical violations, exception containment, and error propagation rules.
1.3 Relationship to Other Specifications
| Specification | Relationship |
|---|---|
| Agent OS Policy Engine 1.0 | Kernel policies are evaluated on each runner step and environment step |
| AgentMesh Identity and Trust 1.0 | Agent DIDs identify runners; trust scores may inform reward weighting |
| Agent Hypervisor Execution Control 1.0 | Ring enforcement MAY gate runner instantiation; kill switch MAY terminate runners |
1.4 Design Principles
- Policy violations are learning signals. Blocked actions are not just errors -- they become negative rewards that shape RL behaviour.
- Governance is transparent to the trainer. The governed runner exposes the same step/iter interface as unmodified runners; trainers need not know about policy enforcement internals.
- Fail closed on critical violations. When fail_on_violation = True, a blocked action MUST raise PolicyViolationError, halting the rollout.
- Reward penalties are configurable. Operators control the mapping from severity to penalty magnitude via RewardConfig.
- Audit is always-on. Every rollout emits governance spans to the Flight Recorder for compliance traceability.
2. Terminology
| Term | Definition |
|---|---|
| Governed Runner | A generic runner (Generic[T_task]) that wraps an Agent OS kernel and collects policy violations during RL training rollouts. |
| Governed Rollout | A single execution pass through the kernel, bundling task input, task output, success flag, violations, signals, total penalty, and execution time. |
| Policy Violation | A record of a governance rule infraction during execution, categorised by type and severity. |
| PolicyViolationType | An enum classifying how the kernel responded to a policy infraction: BLOCKED, MODIFIED, WARNED, or SIGNAL_SENT. |
| PolicyViolationError | An exception raised when fail_on_violation is enabled and a violation blocks execution. |
| Severity | One of four levels -- critical, high, medium, low -- each mapped to a default penalty value. |
| PolicyReward | A reward function wrapper that subtracts policy violation penalties from a base reward, creating a compliance-aware training signal. |
| RewardConfig | Configuration dataclass controlling penalty magnitudes, clean-execution bonuses, multiplicative mode, and reward clamping bounds. |
| CompositeReward | A weighted combiner that sums multiple reward functions (including PolicyReward) with configurable weights. |
| GovernedEnvironment | A Gymnasium-compatible training environment that wraps an Agent OS kernel, enforcing policies on each step() call. |
| EnvironmentConfig | Configuration for the governed environment including max steps, violation penalties, termination rules, and reward shaping parameters. |
| EnvironmentState | A snapshot of the environment's current episode: step count, total reward, accumulated violations, and termination flags. |
| FlightRecorderEmitter | An adapter that converts Agent OS Flight Recorder entries into LightningSpan objects for ingestion by LightningStore. |
| LightningSpan | A span record compatible with Agent Lightning's telemetry format, carrying span ID, trace ID, name, timestamps, attributes, and events. |
| Flight Recorder | The Agent OS audit log subsystem whose entries are the input to the emitter. |
| LightningStore | The Agent Lightning persistence layer that receives emitted spans. |
| Kernel | An Agent OS KernelSpace instance with loaded policies, the execution boundary through which all governed actions pass. |
| Clean Bonus | An additive reward granted when a rollout completes with zero violations. |
3. Governed Runner
3.1 Overview
The GovernedRunner wraps agent execution in an Agent OS kernel, enforcing policies and collecting violation data that can be used as RL training signals. It is generic over the task type (Generic[T_task]) and exposes step() and iter() as the primary training entry points.
[Pure Specification]
3.2 Constructor Parameters
A GovernedRunner MUST accept the following parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| kernel | KernelSpace | (required) | Agent OS kernel with loaded policies |
| fail_on_violation | bool | False | If True, raise PolicyViolationError when a violation blocks execution |
| log_violations | bool | True | If True, log all violations at WARNING level |
| violation_callback | Callable or None | None | Optional callback invoked for each violation |
[Pure Specification]
3.3 Step Method
The step() method MUST:
- Bind per-step violation and signal lists to the current async context (context variables) so that concurrent step() calls on the same runner do not share violation or signal buffers.
- Start a high-resolution timer.
- Attempt execution through the kernel (execute_async preferred, falling back to execute, then direct agent call).
- Catch PolicyViolationError -- set success = False and result = None.
- Catch all other exceptions -- set success = False and result = None, log the full traceback via logger.exception.
- Reset context variables in a finally block.
- Compute execution time in milliseconds.
- Construct and return a GovernedRollout.
- Emit governance spans (Section 11).
[Pure Specification]
3.4 Step Signature
```python
async def step(
    self,
    input: T_task,
    *,
    resources: Any | None = None,
    mode: str | None = None,
    event: Any | None = None,
) -> GovernedRollout: ...
```
The mode parameter SHOULD accept "train" or "eval" to distinguish training rollouts from evaluation passes. [Pure Specification]
3.5 Iter Method
The iter() method MUST:
- Accept an optional cooperative stop signal (event).
- Loop: fetch the next task from the store, execute via step(), submit the rollout to the store.
- Stop when the event is set or no more tasks are available.
[Pure Specification]
3.6 Context Variable Isolation
Per-step state MUST be stored in contextvars.ContextVar instances:
| Variable | Purpose |
|---|---|
| _active_violations_ctx | List of PolicyViolation for the active step |
| _active_signals_ctx | List of signal strings for the active step |
Each step() call MUST set fresh lists before invoking the kernel and MUST reset the context variables in a finally block. When a violation or signal arrives outside an active step() context, the handler MUST fall back to instance-level lists for backward compatibility. [Pure Specification]
3.7 Kernel Hook Registration
On init(), the runner MUST register hooks on the kernel:
- If the kernel exposes on_policy_violation, register the violation handler.
- If the kernel exposes on_signal, register the signal handler.
[Pure Specification]
4. Policy Violations
4.1 PolicyViolationType Enum
Implementations MUST define the following violation types:
| Value | String | Description |
|---|---|---|
| BLOCKED | "blocked" | Action was blocked entirely by the policy engine |
| MODIFIED | "modified" | Action was modified before execution to satisfy policy constraints |
| WARNED | "warned" | Warning issued but the action was allowed to proceed |
| SIGNAL_SENT | "signal_sent" | A kernel signal was dispatched (e.g., SIGSTOP) |
[Pure Specification]
4.2 PolicyViolation Dataclass
A PolicyViolation record MUST contain:
| Field | Type | Default | Description |
|---|---|---|---|
| violation_type | PolicyViolationType | (required) | Category of the violation |
| policy_name | string | (required) | Name of the policy that was violated |
| description | string | (required) | Human-readable description of the violation |
| severity | string | (required) | One of: "critical", "high", "medium", "low" |
| timestamp | datetime | now(UTC) | When the violation occurred |
| action_blocked | bool | False | Whether the action was prevented from executing |
| penalty | float or None | None | Numeric penalty; derived from severity if not supplied |
[Pure Specification]
4.3 Severity Penalties
The SEVERITY_PENALTIES mapping MUST define the following default penalty values:
| Severity | Penalty |
|---|---|
| "critical" | 100.0 |
| "high" | 50.0 |
| "medium" | 10.0 |
| "low" | 1.0 |
[Default Implementation]
4.4 Penalty Derivation
On construction, if the caller does not supply an explicit penalty, the penalty MUST be derived from the severity field via the SEVERITY_PENALTIES mapping. If the severity is not found in the mapping, the fallback penalty MUST be 10.0 (the medium-severity value).
If the caller supplies an explicit penalty, that value MUST be preserved. Implementations MUST NOT unconditionally overwrite a caller-supplied penalty from the severity table. [Pure Specification]
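A minimal sketch of the derivation rule follows. ViolationSketch is a hypothetical, trimmed-down stand-in for PolicyViolation that carries only the two fields involved; the table values are the defaults from Section 4.3.

```python
from dataclasses import dataclass
from typing import Optional

# Default values from Section 4.3.
SEVERITY_PENALTIES = {"critical": 100.0, "high": 50.0, "medium": 10.0, "low": 1.0}
FALLBACK_PENALTY = 10.0  # the medium-severity value


@dataclass
class ViolationSketch:
    """Hypothetical stand-in for PolicyViolation (severity and penalty only)."""

    severity: str
    penalty: Optional[float] = None

    def __post_init__(self) -> None:
        # Derive only when the caller supplied nothing; an explicit
        # penalty is never overwritten.
        if self.penalty is None:
            self.penalty = SEVERITY_PENALTIES.get(self.severity, FALLBACK_PENALTY)


derived = ViolationSketch(severity="high")
unknown = ViolationSketch(severity="catastrophic")
explicit = ViolationSketch(severity="critical", penalty=2.5)
```

The three instances exercise the three cases: a known severity, an unknown severity that takes the fallback, and a caller-supplied penalty that is preserved.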
4.5 PolicyViolationError
PolicyViolationError MUST:
- Be a subclass of Exception.
- Accept a PolicyViolation instance in its constructor.
- Store the violation as self.violation.
- Format its message as "Policy violation: {violation.description}".
[Pure Specification]
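The four requirements above fit in a few lines. In this sketch, ViolationStub is a hypothetical placeholder that carries only the description field the message format needs.

```python
from dataclasses import dataclass


@dataclass
class ViolationStub:
    """Hypothetical placeholder for PolicyViolation (description only)."""

    description: str


class PolicyViolationError(Exception):
    """Raised when a violation blocks execution and fail-closed mode is on."""

    def __init__(self, violation: ViolationStub) -> None:
        self.violation = violation
        super().__init__(f"Policy violation: {violation.description}")


stub = ViolationStub(description="write to /etc denied")
error = PolicyViolationError(stub)
```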
4.6 Violation Handler Behaviour
When a violation is received by _handle_violation:
- Construct a PolicyViolation from the callback arguments.
- Append to the context-local list (if an active step exists) or the instance-level fallback list.
- Increment the total violation counter.
- If log_violations is True, log at WARNING level with policy name, description, severity, and blocked status.
- If violation_callback is not None, invoke it with the violation.
- If fail_on_violation is True and the action was blocked, raise PolicyViolationError.
[Pure Specification]
5. Governed Rollout
5.1 GovernedRollout Dataclass
A GovernedRollout record MUST contain:
| Field | Type | Default | Description |
|---|---|---|---|
| task_input | Any | (required) | The input provided to the runner |
| task_output | Any | (required) | The result returned by the kernel (or None on failure) |
| success | bool | (required) | Whether the execution completed without fatal error |
| violations | list[PolicyViolation] | [] | Policy violations recorded during this rollout |
| signals_sent | list[str] | [] | Kernel signals dispatched during this rollout |
| total_penalty | float | 0.0 | Sum of all violation penalties |
| execution_time_ms | float | 0.0 | Wall-clock execution time in milliseconds |
[Pure Specification]
5.2 Auto-Computed Total Penalty
On construction (__post_init__), the total_penalty field MUST be recomputed as the sum of v.penalty for all violations in the violations list. Any caller-supplied value for total_penalty is overwritten. [Pure Specification]
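As a sketch of this rule, assuming a reduced record: RolloutSketch and PenaltyStub are hypothetical names, and only the fields needed to show the recomputation are included.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PenaltyStub:
    """Hypothetical placeholder for PolicyViolation (penalty only)."""

    penalty: float


@dataclass
class RolloutSketch:
    """Hypothetical, reduced form of GovernedRollout."""

    task_input: Any
    task_output: Any
    success: bool
    violations: list = field(default_factory=list)
    total_penalty: float = 0.0

    def __post_init__(self) -> None:
        # Always recompute; any caller-supplied value is discarded.
        self.total_penalty = sum(v.penalty for v in self.violations)


rollout = RolloutSketch(
    task_input="q",
    task_output=None,
    success=False,
    violations=[PenaltyStub(50.0), PenaltyStub(1.0)],
    total_penalty=999.0,  # ignored by __post_init__
)
clean = RolloutSketch(task_input="q", task_output="a", success=True)
```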
5.3 Governance Span Emission
After constructing a rollout, the runner MUST attempt to emit governance annotations. If the agentlightning.emitter module is importable, the runner SHOULD emit:
- A violation summary annotation containing violation count, total penalty, violation type values, and the distinct set of violated policy names.
- A signal summary annotation containing the list of signals sent.
If the emitter module is not available, the runner MUST silently continue (no error raised). [Default Implementation]
6. Reward Shaping
6.1 Overview
The PolicyReward class wraps any base reward function and subtracts penalties for policy violations, creating a learning signal that discourages unsafe behaviour during RL training. It supports both additive and multiplicative penalty modes, clean-execution bonuses, and configurable reward clamping.
6.2 RewardConfig
A RewardConfig MUST define the following fields:
| Field | Type | Default | Description |
|---|---|---|---|
| critical_penalty | float | -100.0 | Penalty for each critical-severity violation |
| high_penalty | float | -50.0 | Penalty for each high-severity violation |
| medium_penalty | float | -10.0 | Penalty for each medium-severity violation |
| low_penalty | float | -1.0 | Penalty for each low-severity violation |
| clean_bonus | float | 5.0 | Bonus added when a rollout has zero violations |
| multiplicative | bool | False | Use multiplicative penalty mode instead of additive |
| multiplicative_factor | float | 0.5 | Factor to multiply reward by when violations occur (multiplicative mode) |
| min_reward | float or None | -100.0 | Minimum reward floor; None disables clamping |
| max_reward | float or None | 100.0 | Maximum reward ceiling; None disables clamping |
[Default Implementation]
6.3 PolicyReward Constructor
A PolicyReward MUST accept:
| Parameter | Type | Default | Description |
|---|---|---|---|
| kernel | KernelSpace | (required) | Agent OS kernel for policy checking |
| base_reward_fn | Callable or None | None | Base reward function; defaults to success-based reward |
| config | RewardConfig or None | None | Reward configuration; defaults to RewardConfig() |
[Pure Specification]
6.4 Default Base Reward
When no base_reward_fn is provided, the default MUST return:
- 1.0 if the rollout has success == True.
- 0.0 if the rollout has success == False.
- If neither attribute is available, fall back to checking task_output is not None (1.0) vs None (0.0).
[Default Implementation]
6.5 Reward Computation
When PolicyReward.__call__ is invoked with a rollout:
- Compute the base reward from base_reward_fn(rollout).
- Extract violations from the rollout (via rollout.violations or kernel.get_recent_violations()).
- Calculate the total penalty via the severity mapping.
- If config.multiplicative is True and violations exist, compute: final_reward = base_reward * config.multiplicative_factor.
- Otherwise, compute: final_reward = base_reward + penalty.
- If no violations exist, add config.clean_bonus to final_reward.
- Apply min_reward floor (if not None): final_reward = max(final_reward, min_reward).
- Apply max_reward ceiling (if not None): final_reward = min(final_reward, max_reward).
- Update internal statistics.
- If emit is True, emit the reward to Agent Lightning.
- Return final_reward.
[Pure Specification]
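The arithmetic in the steps above can be sketched as a pure function. RewardConfigSketch and shaped_reward are hypothetical names; the sketch takes a list of severity strings instead of a rollout, omits statistics and emission, and assumes that an unknown severity falls back to medium_penalty (Section 7.3 states this for policy_penalty, not explicitly for PolicyReward).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RewardConfigSketch:
    """Hypothetical mirror of the RewardConfig defaults in Section 6.2."""

    critical_penalty: float = -100.0
    high_penalty: float = -50.0
    medium_penalty: float = -10.0
    low_penalty: float = -1.0
    clean_bonus: float = 5.0
    multiplicative: bool = False
    multiplicative_factor: float = 0.5
    min_reward: Optional[float] = -100.0
    max_reward: Optional[float] = 100.0


def shaped_reward(
    base_reward: float, severities: Sequence[str], config: RewardConfigSketch
) -> float:
    """Apply penalty, clean bonus, and clamping in the order listed above."""
    table = {
        "critical": config.critical_penalty,
        "high": config.high_penalty,
        "medium": config.medium_penalty,
        "low": config.low_penalty,
    }
    # Assumption: unknown severities are charged the medium penalty.
    penalty = sum(table.get(s, config.medium_penalty) for s in severities)

    if config.multiplicative and severities:
        final_reward = base_reward * config.multiplicative_factor
    else:
        final_reward = base_reward + penalty  # penalties are negative

    if not severities:
        final_reward += config.clean_bonus

    if config.min_reward is not None:
        final_reward = max(final_reward, config.min_reward)
    if config.max_reward is not None:
        final_reward = min(final_reward, config.max_reward)
    return final_reward


defaults = RewardConfigSketch()
clean_case = shaped_reward(1.0, [], defaults)
high_case = shaped_reward(1.0, ["high"], defaults)
clamped_case = shaped_reward(1.0, ["critical", "high"], defaults)
scaled_case = shaped_reward(1.0, ["low"], RewardConfigSketch(multiplicative=True))
```

With the default configuration, a clean rollout earns the base reward plus the bonus, while a critical plus a high violation drives the sum below the floor and is clamped to min_reward.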
6.6 Reward Emission
When emitting rewards, the implementation SHOULD produce a multi-dimensional reward object:
| Key | Value |
|---|---|
| "final" | The clamped final reward |
| "base" | The base reward before penalties |
| "policy_penalty" | The total penalty value |
With attributes:
| Attribute | Value |
|---|---|
| agent_os.violation_count | Number of violations |
| agent_os.policy_compliant | True if zero violations |
If the Agent Lightning emitter is not importable, emission MUST be silently skipped. [Default Implementation]
6.7 Reward Statistics
PolicyReward.get_stats() MUST return:
| Key | Type | Description |
|---|---|---|
| total_rewards | int | Number of reward computations |
| total_penalties | float | Cumulative penalty value |
| avg_penalty | float | Mean penalty per computation |
| violation_rate | float | Fraction of computations with violations |
| clean_rate | float | Fraction of computations without violations |
reset_stats() MUST zero all counters. [Pure Specification]
7. Policy Penalty Function
7.1 Standalone Penalty Computation
The policy_penalty function provides a lightweight utility for computing penalties outside the full PolicyReward class. It MUST accept:
| Parameter | Type | Default | Description |
|---|---|---|---|
| violations | list[Any] | (required) | List of PolicyViolation objects |
| critical_penalty | float | -100.0 | Penalty for critical violations |
| high_penalty | float | -50.0 | Penalty for high-severity violations |
| medium_penalty | float | -10.0 | Penalty for medium-severity violations |
| low_penalty | float | -1.0 | Penalty for low-severity violations |
[Pure Specification]
7.2 Severity Mapping
The function MUST build the following mapping and sum penalties:
```python
severity_penalties = {
    "critical": critical_penalty,
    "high": high_penalty,
    "medium": medium_penalty,
    "low": low_penalty,
}
total_penalty = 0.0
for violation in violations:
    # Treat a missing severity attribute as "medium".
    severity = getattr(violation, "severity", "medium")
    total_penalty += severity_penalties.get(severity, medium_penalty)
```
[Pure Specification]
7.3 Unknown Severity Fallback
If a violation's severity is not one of "critical", "high", "medium", or "low", the implementation MUST fall back to the medium_penalty value. [Pure Specification]
7.4 Return Value
The function MUST return the total penalty as a negative float (or zero if no violations are present). [Pure Specification]
8. Composite Reward
8.1 Overview
CompositeReward combines multiple reward functions with weights, enabling operators to blend task-completion rewards, policy-compliance penalties, and efficiency metrics into a single scalar signal.
8.2 Constructor
A CompositeReward MUST accept:
| Parameter | Type | Default | Description |
|---|---|---|---|
| components | list[tuple[Callable, float]] | (required) | List of (reward_fn, weight) tuples |
| normalize | bool | False | If True, normalise weights to sum to 1.0 |
[Pure Specification]
8.3 Weight Normalisation
If normalize is True, the constructor MUST divide each weight by the sum of all weights:
```python
total_weight = sum(w for _, w in components)
components = [(fn, w / total_weight) for fn, w in components]
```
[Pure Specification]
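8.4 Computation
CompositeReward.__call__(rollout) MUST compute total as the weighted sum of its components, that is, the sum of weight * reward_fn(rollout) over every (reward_fn, weight) pair in components, and return total. [Pure Specification]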
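A compact sketch of the combiner, covering both the optional normalisation and the weighted sum. CompositeRewardSketch is a hypothetical name, and the sketch assumes the weights do not sum to zero when normalize is True, a case this section does not address.

```python
from typing import Any, Callable, List, Tuple

RewardFn = Callable[[Any], float]


class CompositeRewardSketch:
    """Hypothetical weighted combiner following Sections 8.2 to 8.4."""

    def __init__(
        self, components: List[Tuple[RewardFn, float]], normalize: bool = False
    ) -> None:
        if normalize:
            # Assumes a non-zero weight sum.
            total_weight = sum(w for _, w in components)
            components = [(fn, w / total_weight) for fn, w in components]
        self.components = list(components)

    def __call__(self, rollout: Any) -> float:
        return sum(weight * fn(rollout) for fn, weight in self.components)


plain = CompositeRewardSketch([(lambda r: 1.0, 1.0), (lambda r: -10.0, 0.5)])
normalised = CompositeRewardSketch(
    [(lambda r: 4.0, 1.0), (lambda r: 8.0, 3.0)], normalize=True
)
plain_score = plain(None)            # 1.0 * 1.0 + 0.5 * -10.0
normalised_score = normalised(None)  # 0.25 * 4.0 + 0.75 * 8.0
```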
8.5 Example
```python
reward = CompositeReward([
    (accuracy_reward, 1.0),
    (policy_reward, 0.5),
    (efficiency_reward, 0.3),
])
score = reward(rollout)  # weighted sum of three signals
```
9. Governed Environment
9.1 Overview
The GovernedEnvironment wraps an Agent OS kernel as a Gymnasium-style training environment. On each step(), the environment executes an action through the kernel, enforces policies, converts violations to negative rewards, and optionally terminates the episode on critical violations.
9.2 Compatibility
The environment MUST be compatible with:
- Agent Lightning trainers
- OpenAI Gym / Gymnasium (reset/step/close interface)
- Stable Baselines3
- Any framework that consumes the five-tuple (next_state, reward, terminated, truncated, info) return
[Pure Specification]
9.3 EnvironmentConfig
An EnvironmentConfig MUST define:
| Field | Type | Default | Description |
|---|---|---|---|
| max_steps | int | 100 | Maximum steps per episode before truncation |
| violation_penalty | float | -10.0 | Base penalty for each policy violation |
| terminate_on_critical | bool | True | Terminate the episode immediately on a critical violation |
| step_penalty | float | -0.1 | Small penalty per step to encourage efficiency |
| success_bonus | float | 10.0 | Reward bonus for a successful, violation-free step |
| reset_kernel_state | bool | True | Whether to call kernel.reset() on episode reset |
[Default Implementation]
9.4 Constructor
A GovernedEnvironment MUST accept:
| Parameter | Type | Default | Description |
|---|---|---|---|
| kernel | KernelSpace | (required) | Agent OS kernel with loaded policies |
| task_generator | Callable or None | None | Function to generate initial states |
| reward_fn | Callable or None | None | Custom reward function; defaults to success-based |
| config | EnvironmentConfig or None | None | Environment configuration |
The constructor MUST be generic over state and action types: GovernedEnvironment(Generic[T_state, T_action]). [Pure Specification]
9.5 Reset Method
reset() MUST:
- Reinitialise EnvironmentState to default values.
- Clear the current violations list.
- Increment the total episode counter.
- If config.reset_kernel_state is True and the kernel exposes a reset() method, call it.
- If a task_generator is provided, generate the initial task.
- Return (initial_state, info) where info contains the episode number and loaded policy names.
[Pure Specification]
9.6 Step Method
step(action) MUST:
- Clear per-step violations.
- Increment step counter and total step counter.
- Execute the action through the kernel (via kernel.execute()).
- If no callback hook was wired, pull violations from the kernel via kernel.get_recent_violations().
- Compute the base reward from the reward function.
- Add the step penalty (config.step_penalty).
- For each violation, add a scaled violation penalty:
  - Critical: violation_penalty * 10
  - High: violation_penalty * 5
  - All others: violation_penalty * 1
- Accumulate the reward into EnvironmentState.total_reward.
- Check termination: if terminate_on_critical and any violation has severity "critical", set terminated = True.
- Check truncation: if step_count >= max_steps, set truncated = True.
- If the step succeeded with zero violations, add success_bonus.
- Return (next_state, reward, terminated, truncated, info).
[Pure Specification]
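The reward and flag logic of the step can be sketched in isolation. EnvConfigSketch and score_step are hypothetical names; the sketch leaves out kernel execution, violation collection, and accumulation into EnvironmentState, and takes the already-known severities and success flag as inputs.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class EnvConfigSketch:
    """Hypothetical mirror of the EnvironmentConfig defaults in Section 9.3."""

    max_steps: int = 100
    violation_penalty: float = -10.0
    terminate_on_critical: bool = True
    step_penalty: float = -0.1
    success_bonus: float = 10.0


# Severity multipliers listed above; anything else scales by 1.
MULTIPLIERS = {"critical": 10, "high": 5}


def score_step(
    base_reward: float,
    severities: Sequence[str],
    success: bool,
    step_count: int,
    config: EnvConfigSketch,
) -> Tuple[float, bool, bool]:
    """Return (reward, terminated, truncated) for a single step."""
    reward = base_reward + config.step_penalty
    for severity in severities:
        reward += config.violation_penalty * MULTIPLIERS.get(severity, 1)

    terminated = config.terminate_on_critical and "critical" in severities
    truncated = step_count >= config.max_steps

    if success and not severities:
        reward += config.success_bonus
    return reward, terminated, truncated


env_config = EnvConfigSketch()
clean_step = score_step(1.0, [], True, 1, env_config)
bad_step = score_step(0.0, ["critical", "low"], False, 100, env_config)
```

The second call shows a step that is both terminated (critical violation) and truncated (step limit reached); the two flags are computed independently.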
9.7 Violation Penalty Scaling
The violation penalty multiplier MUST follow:
| Severity | Multiplier | Effective Penalty (default config) |
|---|---|---|
| "critical" | 10x | -100.0 |
| "high" | 5x | -50.0 |
| "medium" | 1x | -10.0 |
| "low" | 1x | -10.0 |
[Default Implementation]
9.8 Kernel Violation Polling
If the kernel does not expose on_policy_violation (i.e., the push callback could not be wired), the environment MUST poll kernel.get_recent_violations() after each action execution. The poll results MUST be normalised from either dict or object form:
| Source (dict key or attribute) | Target field |
|---|---|
| "policy" or "policy_name" | policy_name |
| "description" | description |
| "severity" (default "low") | severity |
| "blocked" or "action_blocked" | blocked |
[Pure Specification]
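One way to normalise both shapes is a small lookup helper that tries each source name in turn. normalise_violation is a hypothetical name; the "low" severity default comes from the table above, while the defaults for policy_name, description, and blocked are assumptions made for this sketch.

```python
from types import SimpleNamespace
from typing import Any


def normalise_violation(raw: Any) -> dict:
    """Map a polled violation (dict or object) onto the target fields."""

    def pick(*names: str, default: Any = None) -> Any:
        # Try each candidate key or attribute in order.
        for name in names:
            if isinstance(raw, dict):
                if name in raw:
                    return raw[name]
            elif hasattr(raw, name):
                return getattr(raw, name)
        return default

    return {
        # "unknown", "" and False are assumed defaults, not from the spec.
        "policy_name": pick("policy", "policy_name", default="unknown"),
        "description": pick("description", default=""),
        "severity": pick("severity", default="low"),
        "blocked": bool(pick("blocked", "action_blocked", default=False)),
    }


from_dict = normalise_violation(
    {"policy": "rate_limit", "severity": "high", "blocked": False}
)
from_obj = normalise_violation(
    SimpleNamespace(policy_name="no_pii", description="PII in output", action_blocked=True)
)
```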
9.9 Close Method
close() MUST log the environment metrics and release resources. [Pure Specification]
10. Environment State
10.1 EnvironmentState Dataclass
An EnvironmentState record MUST contain:
| Field | Type | Default | Description |
|---|---|---|---|
| step_count | int | 0 | Number of steps taken in the current episode |
| total_reward | float | 0.0 | Cumulative reward for the current episode |
| violations | list | [] | All violations accumulated in the current episode |
| terminated | bool | False | Whether the episode ended due to a terminal condition (e.g., critical violation) |
| truncated | bool | False | Whether the episode ended due to step limit |
| info | dict | {} | Additional metadata from the most recent step |
[Pure Specification]
10.2 Terminated Property
The environment MUST expose a terminated property that returns True if the state is either terminated or truncated. [Pure Specification]
10.3 Episode Metrics
get_metrics() MUST return:
| Key | Type | Description |
|---|---|---|
| total_episodes | int | Number of episodes completed |
| total_steps | int | Total steps across all episodes |
| total_violations | int | Total violations across all episodes |
| successful_episodes | int | Episodes with at least one violation-free, successful step |
| success_rate | float | successful_episodes / max(total_episodes, 1) |
| violations_per_episode | float | total_violations / max(total_episodes, 1) |
| steps_per_episode | float | total_steps / max(total_episodes, 1) |
[Pure Specification]
11. Flight Recorder Emitter
11.1 Overview
The FlightRecorderEmitter adapts Agent OS Flight Recorder entries to Agent Lightning's span format. This enables:
- Complete audit trail from training to production.
- RL algorithms learning from policy violations.
- Compliance-friendly training logs.
11.2 LightningSpan
A LightningSpan record MUST contain:
| Field | Type | Default | Description |
|---|---|---|---|
| span_id | string | (required) | Unique span identifier |
| trace_id | string | (required) | Trace identifier linking related spans |
| name | string | (required) | Span name (e.g., "agent_os.policy_check") |
| start_time | datetime | (required) | When the span started |
| end_time | datetime or None | None | When the span ended |
| attributes | dict[str, Any] | {} | Key-value metadata |
| events | list[dict] | [] | Discrete events within the span |
[Pure Specification]
11.3 Serialisation
LightningSpan MUST support:
- to_dict(): Returns a dictionary with all fields; datetime values serialised to ISO 8601.
- to_json(): Returns a JSON string via json.dumps(to_dict()).
[Pure Specification]
11.4 Constructor Parameters
A FlightRecorderEmitter MUST accept:
| Parameter | Type | Default | Description |
|---|---|---|---|
| flight_recorder | FlightRecorder | (required) | Agent OS Flight Recorder instance |
| include_policy_checks | bool | True | Include policy check spans |
| include_signals | bool | True | Include signal dispatch spans |
| include_tool_calls | bool | True | Include tool call spans |
| trace_id_prefix | string | "agentos" | Prefix for generated trace IDs |
[Pure Specification]
11.5 Entry Type Filtering
The emitter MUST filter entries by type according to its configuration:
| Entry Type | Filter Flag | Included by Default |
|---|---|---|
| policy_check | include_policy_checks | Yes |
| signal | include_signals | Yes |
| tool_call | include_tool_calls | Yes |
If the corresponding flag is False, entries of that type MUST be silently dropped. [Pure Specification]
11.6 Entry Conversion
For each included entry, the emitter MUST produce a LightningSpan with:
- span_id: Derived from entry.id, entry.entry_id, or the emitted count.
- trace_id: "{trace_id_prefix}-{agent_id}".
- name: "agent_os.{entry_type}".
- start_time and end_time: Both set to the entry's timestamp.
[Pure Specification]
11.7 Type-Specific Attributes
Depending on the entry type, the following attributes MUST be set:
All entries:
| Attribute | Value |
|---|---|
| agent_os.entry_type | Entry type string |
| agent_os.agent_id | Agent identifier |
policy_check entries:
| Attribute | Value |
|---|---|
| agent_os.policy_name | Name of the evaluated policy |
| agent_os.policy_result | Result of the policy check |
| agent_os.policy_violated | Boolean -- whether the policy was violated |
signal entries:
| Attribute | Value |
|---|---|
| agent_os.signal_type | Signal type string |
| agent_os.signal_target | Target of the signal |
tool_call entries:
| Attribute | Value |
|---|---|
| agent_os.tool_name | Name of the tool invoked |
| agent_os.tool_args | String representation of arguments (truncated to 1000 chars) |
| agent_os.tool_result | String representation of result (truncated to 1000 chars) |
[Pure Specification]
11.8 Metadata Propagation
If the entry has a metadata dict, all key-value pairs MUST be copied into the span attributes with an agent_os. prefix. [Pure Specification]
11.9 Incremental Span Cursor
The emitter MUST maintain a _last_position cursor into the recorder's entry list. The get_new_spans() method MUST only convert entries added since the last call, avoiding O(n) full-list walks on every poll. [Pure Specification]
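The cursor behaviour can be illustrated with a plain list standing in for the recorder's entry list. CursorSketch and get_new_entries are hypothetical names; a real emitter would convert each new entry to a LightningSpan rather than return it as-is.

```python
class CursorSketch:
    """Hypothetical cursor over an append-only entry list."""

    def __init__(self, entries: list) -> None:
        self._entries = entries  # stands in for the recorder's entry list
        self._last_position = 0

    def get_new_entries(self) -> list:
        # Slice from the cursor so earlier entries are never revisited.
        new_entries = self._entries[self._last_position:]
        self._last_position += len(new_entries)
        return new_entries


recorder_entries = ["e1", "e2"]
cursor = CursorSketch(recorder_entries)
first_poll = cursor.get_new_entries()
recorder_entries.append("e3")
second_poll = cursor.get_new_entries()
third_poll = cursor.get_new_entries()
```

Each poll touches only the entries appended since the previous one, so the cost of a poll is proportional to the number of new entries rather than to the full log.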
12. Span Export
12.1 emit_to_store
emit_to_store(store) MUST:
- Get all spans via get_spans().
- For each span, call store.emit_span(span.to_dict()) or store.add_span(span.to_dict()).
- If the store has no recognised span emitter, log a warning and stop.
- If an individual span emission fails, log the error and continue with the next span.
- Return the total number of spans emitted.
[Pure Specification]
12.2 export_to_file
export_to_file(filepath) MUST:
- Get all spans via get_spans().
- Write the list of span dicts to the filepath as a JSON array with 2-space indentation.
- Return the number of spans exported.
[Pure Specification]
12.3 Streaming via AsyncIterator
The stream() method MUST:
- Accept an optional stop_event (asyncio.Event) and a poll_interval (default 0.1 seconds).
- In a loop, call get_new_spans() and yield each span.
- Sleep for poll_interval between polls.
- If stop_event is set, drain the current poll and exit.
- If no stop_event is provided, the caller MUST cancel the consuming task to terminate the stream.
The return type MUST be AsyncIterator[LightningSpan]. [Pure Specification]
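The polling loop can be sketched as an async generator. This version is a free function that receives the poll callable as an argument (an assumption made to keep the example self-contained; the specification defines stream() as a method), and it yields strings where the real method would yield LightningSpan objects. The fake_poll helper is hypothetical.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def stream(
    get_new_spans: Callable[[], List[str]],
    stop_event: Optional[asyncio.Event] = None,
    poll_interval: float = 0.1,
) -> AsyncIterator[str]:
    """Yield spans from repeated polls until the stop event is observed."""
    while True:
        # Read the stop flag first so this poll is still drained before exit.
        stopping = stop_event is not None and stop_event.is_set()
        for span in get_new_spans():
            yield span
        if stopping:
            return
        await asyncio.sleep(poll_interval)


async def _demo() -> List[str]:
    pending = [["a", "b"], ["c"]]
    stop = asyncio.Event()

    def fake_poll() -> List[str]:
        # Hypothetical recorder: hands out one batch per poll, then signals stop.
        batch = pending.pop(0) if pending else []
        if not pending:
            stop.set()
        return batch

    return [span async for span in stream(fake_poll, stop, poll_interval=0)]


collected = asyncio.run(_demo())
```

Checking the event before polling means a stop request never discards spans that were already recorded; without a stop event the loop runs until the consuming task is cancelled.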
13. Runner Lifecycle
13.1 Lifecycle Methods
The GovernedRunner MUST implement the following lifecycle methods:
| Method | When Called | Purpose |
|---|---|---|
| init(agent, **kwargs) | Once during setup | Store the agent reference and register kernel hooks |
| init_worker(worker_id, store, **kwargs) | Once per distributed worker | Store worker ID and LightningStore reference |
| teardown() | Once during shutdown | Log final rollout and violation counts |
| teardown_worker(worker_id) | Once per worker shutdown | Release worker-local resources |
[Pure Specification]
13.2 Worker ID Management
init_worker() MUST store the worker_id and store as instance attributes for later use by step() and iter(). [Pure Specification]
13.3 Teardown Logging
teardown() MUST log a summary including the total number of rollouts and total number of violations observed during the runner's lifetime. [Pure Specification]
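14. Violation Rate and Stats
14.1 get_violation_rate
get_violation_rate() MUST return the ratio total_violations / total_rollouts, the same value reported as violation_rate by GovernedRunner.get_stats() (Section 14.2). [Pure Specification]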
14.2 GovernedRunner.get_stats
GovernedRunner.get_stats() MUST return:
| Key | Type | Description |
|---|---|---|
| total_rollouts | int | Number of rollouts executed |
| total_violations | int | Number of violations observed |
| violation_rate | float | total_violations / total_rollouts |
[Pure Specification]
14.3 GovernedEnvironment.get_metrics
GovernedEnvironment.get_metrics() MUST return the fields specified in Section 10.3. [Pure Specification]
14.4 FlightRecorderEmitter.get_violation_summary
get_violation_summary() MUST return:
| Key | Type | Description |
|---|---|---|
| total_entries | int | Total spans scanned |
| total_violations | int | Spans where agent_os.policy_violated is True |
| violation_rate | float | total_violations / max(total_entries, 1) |
| policies_violated | dict[str, int] | Map of policy name to violation count |
[Pure Specification]
14.5 FlightRecorderEmitter.get_stats
FlightRecorderEmitter.get_stats() MUST return:
| Key | Type | Description |
|---|---|---|
| emitted_count | int | Total spans emitted |
| last_position | int | Current cursor position in the entry list |
[Pure Specification]
15. Factory Functions¶
15.1 create_policy_reward¶
create_policy_reward(
kernel,
*,
base_reward_fn=None,
severity_penalties=None,
clean_bonus=5.0,
multiplicative=False,
) -> PolicyReward
This factory MUST:
- Construct a RewardConfig with clean_bonus and multiplicative.
- If severity_penalties is provided (a dict mapping severity names to float values), override the corresponding fields on the config:
  - "critical" -> config.critical_penalty
  - "high" -> config.high_penalty
  - "medium" -> config.medium_penalty
  - "low" -> config.low_penalty
- Return PolicyReward(kernel, base_reward_fn=base_reward_fn, config=config).
[Pure Specification]
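The factory steps above might look like the following sketch. RewardConfig and PolicyReward here are simplified stand-ins, not the toolkit's classes. The critical and medium defaults match the worked examples in Section 19, while the high and low defaults are placeholders. Ignoring unrecognised severity names is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RewardConfig:
    # Stand-in config; high_penalty and low_penalty defaults are placeholders.
    critical_penalty: float = -100.0
    high_penalty: float = -50.0
    medium_penalty: float = -10.0
    low_penalty: float = -1.0
    clean_bonus: float = 5.0
    multiplicative: bool = False


@dataclass
class PolicyReward:
    # Stand-in that only stores its inputs.
    kernel: object
    base_reward_fn: Optional[Callable] = None
    config: Optional[RewardConfig] = None


def create_policy_reward(kernel, *, base_reward_fn=None, severity_penalties=None,
                         clean_bonus=5.0, multiplicative=False) -> PolicyReward:
    config = RewardConfig(clean_bonus=clean_bonus, multiplicative=multiplicative)
    for severity, value in (severity_penalties or {}).items():
        # Assumption: names outside the four known severities are ignored.
        if severity in ("critical", "high", "medium", "low"):
            setattr(config, f"{severity}_penalty", float(value))
    return PolicyReward(kernel, base_reward_fn=base_reward_fn, config=config)
```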
15.2 create_governed_env¶
This factory MUST:
- Construct a default EnvironmentConfig.
- For each key-value pair in kwargs, if the key matches a field on EnvironmentConfig, set that field.
- Return GovernedEnvironment(kernel, config=config).
[Pure Specification]
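A minimal sketch of the keyword filtering described above. EnvironmentConfig and GovernedEnvironment are stand-ins that model only the two fields used in Section 19.5, with placeholder defaults. The signature create_governed_env(kernel, **kwargs) is inferred from the steps above and is an assumption, as is dropping keys that match no field.

```python
from dataclasses import dataclass, fields


@dataclass
class EnvironmentConfig:
    # Only two illustrative fields; the defaults are placeholders.
    terminate_on_critical: bool = True
    violation_penalty: float = -10.0


class GovernedEnvironment:
    def __init__(self, kernel, config=None):
        self.kernel = kernel
        self.config = config or EnvironmentConfig()


def create_governed_env(kernel, **kwargs) -> GovernedEnvironment:
    config = EnvironmentConfig()
    known = {f.name for f in fields(EnvironmentConfig)}
    for key, value in kwargs.items():
        # Keys that do not name a config field are skipped.
        if key in known:
            setattr(config, key, value)
    return GovernedEnvironment(kernel, config=config)
```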
15.3 create_emitter¶
This factory MUST pass flight_recorder and all kwargs directly to the FlightRecorderEmitter constructor. [Pure Specification]
16. Failure Semantics¶
16.1 Fail Closed on Critical Violations¶
When fail_on_violation is True and a policy violation blocks an action, the GovernedRunner MUST raise PolicyViolationError. The rollout MUST record the violation with success = False and task_output = None. [Pure Specification]
16.2 Exception Containment in Runner¶
The step() method MUST catch all exceptions (not just PolicyViolationError). Unexpected exceptions MUST be logged via logger.exception (preserving the full traceback) and result in a rollout with success = False. The runner MUST NOT propagate unexpected kernel exceptions to the trainer. [Pure Specification]
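The containment rule can be illustrated with a small asynchronous sketch. RolloutSketch and contained_step are hypothetical names, and the execute callable stands in for whatever the kernel exposes. The point is only that a broad except logs through logger.exception and returns a failed rollout instead of raising.

```python
import asyncio
import logging
from dataclasses import dataclass
from typing import Any

logger = logging.getLogger("governed_runner_sketch")


@dataclass
class RolloutSketch:
    task_input: Any
    task_output: Any = None
    success: bool = False


async def contained_step(execute, task_input) -> RolloutSketch:
    """Run one step without letting any exception reach the caller."""
    try:
        output = await execute(task_input)
        return RolloutSketch(task_input, output, True)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback
        # of the exception currently being handled.
        logger.exception("unexpected error during step")
        return RolloutSketch(task_input, None, False)
```

From the trainer's point of view, a kernel failure then shows up as a rollout with success = False rather than as an exception.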
16.3 Exception Swallowing in Emitter¶
The FlightRecorderEmitter MUST NOT propagate exceptions from:
- emit_to_store() -- individual span emission failures are logged and the next span is attempted.
- _convert_entry() -- entries that cannot be converted are silently skipped.
- Import of agentlightning.emitter -- ImportError is caught and emission is silently skipped.
[Pure Specification]
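The log-and-continue behaviour for span emission can be sketched as a loop that isolates each span. The names emit_all and emit_one are hypothetical. The specification says failures are logged without fixing the call, so logger.exception is used here as an assumption.

```python
import logging

logger = logging.getLogger("emitter_sketch")


def emit_all(spans, emit_one) -> int:
    """Emit each span in turn and return how many were emitted."""
    emitted = 0
    for span in spans:
        try:
            emit_one(span)
            emitted += 1
        except Exception:
            # A failing span is logged, then the loop moves on.
            logger.exception("span emission failed; continuing with next span")
    return emitted
```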
16.4 Exception Swallowing in Environment¶
The GovernedEnvironment MUST catch exceptions from kernel.execute() and set success = False, result = None. The environment MUST NOT propagate kernel execution errors to the trainer. [Pure Specification]
16.5 Violation Polling Resilience¶
If kernel.get_recent_violations() raises an exception during polling, the environment MUST log the error at DEBUG level and continue with zero violations for that step. [Pure Specification]
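A minimal sketch of resilient polling, assuming the kernel exposes get_recent_violations() as named above. The helper name poll_violations and the log message are illustrative.

```python
import logging

logger = logging.getLogger("governed_env_sketch")


def poll_violations(kernel) -> list:
    """Return the kernel's recent violations, or an empty list on failure."""
    try:
        return list(kernel.get_recent_violations())
    except Exception as exc:
        # Logged at DEBUG; the step continues with zero violations.
        logger.debug("violation polling failed: %s", exc)
        return []
```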
16.6 Error Types¶
Implementations MUST define the following error type:
| Error | Context |
|---|---|
| PolicyViolationError | Raised when fail_on_violation is True and a policy blocks execution |
[Pure Specification]
16.7 Failure Behaviour Summary¶
| Operation | Failure Behaviour |
|---|---|
| Runner step (policy blocks) | PolicyViolationError if fail_on_violation; else success = False |
| Runner step (unexpected error) | success = False, logged via logger.exception |
| Emitter span emission | Log error, continue to next span |
| Emitter import failure | Silently skip emission |
| Environment step (kernel error) | success = False, result = None |
| Environment violation poll | Log at DEBUG, zero violations for step |
| Reward emission import failure | Silently skip emission |
[Pure Specification]
17. Security Considerations¶
17.1 Violation Callback Injection¶
The violation_callback parameter allows arbitrary code execution on every violation. Implementations SHOULD validate that the callback is callable and SHOULD document that the callback runs in the same execution context as the runner. Malicious callbacks could suppress violations or exfiltrate data.
17.2 Reward Manipulation¶
Custom base_reward_fn and violation_callback functions could be crafted to override penalty signals and train agents to ignore governance constraints. Operators MUST audit reward configurations in production training pipelines.
17.3 Kernel Trust Boundary¶
The GovernedRunner trusts the kernel to faithfully report violations. A compromised kernel could suppress violation callbacks, allowing unsafe actions to generate clean-bonus rewards. The Flight Recorder provides a secondary audit trail that SHOULD be reconciled against runner-observed violations.
17.4 Span Data Sensitivity¶
LightningSpan attributes may contain tool arguments and results, which could include sensitive data. The emitter truncates tool_args and tool_result to 1000 characters, but implementations SHOULD apply additional redaction for sensitive fields before emitting to external stores.
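One way to layer redaction on top of the 1000-character truncation is to redact first and truncate afterwards. This is a sketch only: the set of sensitive key names, the [REDACTED] marker, and the function name sanitise_tool_args are assumptions, and a real deployment would need its own field list.

```python
import json

MAX_ATTR_CHARS = 1000  # truncation length stated in Section 17.4
SENSITIVE_KEYS = frozenset({"password", "api_key", "token"})  # hypothetical list


def sanitise_tool_args(tool_args: dict) -> str:
    """Redact known-sensitive keys, then serialise and truncate."""
    redacted = {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in tool_args.items()
    }
    # default=str keeps serialisation from failing on non-JSON values.
    return json.dumps(redacted, default=str)[:MAX_ATTR_CHARS]
```

Redacting before truncating matters: truncation alone would still leak a secret that happens to sit within the first 1000 characters.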
17.5 File Export Security¶
export_to_file() writes span data (potentially including sensitive attributes) to the filesystem. Callers MUST ensure the output path is in a secure, access-controlled directory. To prevent path traversal attacks, implementations MUST NOT follow symlinks when opening the output path.
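On POSIX systems, one possible way to avoid following a symlink is to open the output file with O_NOFOLLOW. The helper name open_export_file and the 0o600 mode are assumptions of this sketch. Note that O_NOFOLLOW only covers the final path component, so symlinked parent directories need separate handling, and the flag is not available on every platform.

```python
import os


def open_export_file(path: str):
    """Open path for writing, refusing to follow a symlink at the final component."""
    nofollow = getattr(os, "O_NOFOLLOW", None)
    if nofollow is None:
        # Assumption: fail rather than silently fall back where the flag is missing.
        raise OSError("O_NOFOLLOW is not available on this platform")
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | nofollow
    # 0o600 restricts a newly created file to its owner.
    fd = os.open(path, flags, 0o600)
    return os.fdopen(fd, "w", encoding="utf-8")
```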
17.6 Concurrent Step Isolation¶
Context variable isolation (Section 3.6) is critical for preventing violation cross-contamination between concurrent rollouts. If context variables are not properly reset in the finally block, a leaked context could cause violations from one rollout to be attributed to another, corrupting training signals.
17.7 CORS and Network Exposure¶
If the governed environment or emitter exposes an HTTP API for span ingestion, wildcard CORS origins (*) MUST be rejected when credentials are enabled. Span ingestion endpoints SHOULD require authentication.
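The wildcard rule can be enforced with a configuration check at startup. The function name validate_cors and its parameters are hypothetical and not tied to any particular web framework.

```python
def validate_cors(allowed_origins, allow_credentials: bool) -> None:
    """Reject the wildcard origin when credentialed requests are enabled."""
    if allow_credentials and "*" in allowed_origins:
        raise ValueError(
            "wildcard CORS origin is not allowed when credentials are enabled"
        )
```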
18. Conformance Requirements¶
18.1 MUST Requirements¶
An implementation is conformant if it satisfies all MUST requirements:
- GovernedRunner accepts kernel, fail_on_violation, log_violations, and violation_callback parameters.
- step() isolates violations and signals per call via context variables.
- step() catches PolicyViolationError and general exceptions without propagating to the trainer.
- step() returns a GovernedRollout with all required fields.
- GovernedRollout auto-computes total_penalty from violations on construction.
- PolicyViolationType defines exactly BLOCKED, MODIFIED, WARNED, and SIGNAL_SENT values.
- PolicyViolation derives penalty from severity when not caller-supplied; preserves caller-supplied penalty.
- PolicyViolationError stores the violation and formats the message correctly.
- policy_penalty falls back to medium penalty for unknown severities.
- PolicyReward supports both additive and multiplicative penalty modes.
- PolicyReward clamps rewards to [min_reward, max_reward].
- CompositeReward computes weighted sums and supports normalisation.
- GovernedEnvironment returns the Gymnasium five-tuple from step().
- GovernedEnvironment terminates on critical violations when configured.
- GovernedEnvironment polls kernel violations when no push hook is wired.
- FlightRecorderEmitter filters entries by type configuration.
- FlightRecorderEmitter maintains an incremental cursor for get_new_spans().
- LightningSpan supports to_dict() and to_json() serialisation.
- All lifecycle methods (init, init_worker, teardown, teardown_worker) are implemented.
- All failure semantics follow fail-closed principles.
18.2 Test Coverage¶
Conformance tests MUST cover:
- GovernedRunner step with zero violations (clean rollout).
- GovernedRunner step with violations (penalty computation).
- GovernedRunner step with fail_on_violation = True (exception raised).
- Context variable isolation across concurrent steps.
- PolicyViolation penalty derivation from severity.
- PolicyViolation caller-supplied penalty preservation.
- PolicyViolationType enum values.
- PolicyViolationError message formatting.
- policy_penalty with known and unknown severities.
- PolicyReward additive mode computation.
- PolicyReward multiplicative mode computation.
- PolicyReward clean bonus application.
- PolicyReward min/max clamping.
- CompositeReward weighted sum computation.
- CompositeReward weight normalisation.
- GovernedEnvironment reset and step lifecycle.
- GovernedEnvironment critical violation termination.
- GovernedEnvironment violation penalty scaling.
- GovernedEnvironment kernel violation polling.
- FlightRecorderEmitter entry filtering.
- FlightRecorderEmitter incremental cursor.
- FlightRecorderEmitter span attribute population.
- LightningSpan serialisation.
- emit_to_store and export_to_file export.
- Factory functions with default and custom parameters.
- Violation rate and stats computation.
19. Worked Examples¶
19.1 Basic Governed Training¶
Given: kernel with SQLPolicy(deny=["DROP", "DELETE"])
runner = GovernedRunner(kernel)
agent attempts: "DROP TABLE users"
When: rollout = await runner.step("DROP TABLE users")
Then: rollout.success == False
rollout.violations == [PolicyViolation(
violation_type=BLOCKED,
policy_name="SQLPolicy",
severity="critical",
action_blocked=True,
penalty=100.0,
)]
rollout.total_penalty == 100.0
19.2 Reward Computation -- Additive Mode¶
Given: config = RewardConfig(
critical_penalty=-100.0,
clean_bonus=5.0,
multiplicative=False,
)
base_reward_fn returns 1.0 for success
rollout has 1 critical violation
When: reward = policy_reward(rollout)
Then: base_reward = 1.0
penalty = -100.0
final_reward = 1.0 + (-100.0) = -99.0
(no clean bonus -- violations exist)
clamped = max(-99.0, -100.0) = -99.0
result = -99.0
19.3 Reward Computation -- Multiplicative Mode¶
Given: config = RewardConfig(
multiplicative=True,
multiplicative_factor=0.5,
)
base_reward_fn returns 10.0
rollout has 1 low violation
When: reward = policy_reward(rollout)
Then: final_reward = 10.0 * 0.5 = 5.0
clamped = min(max(5.0, -100.0), 100.0) = 5.0
result = 5.0
19.4 Clean Execution Bonus¶
Given: config = RewardConfig(clean_bonus=5.0)
base_reward_fn returns 1.0
rollout has 0 violations
When: reward = policy_reward(rollout)
Then: base_reward = 1.0
penalty = 0.0
clean_bonus = 5.0
final_reward = 1.0 + 0.0 + 5.0 = 6.0
result = 6.0
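The arithmetic in Sections 19.2 to 19.4 can be reproduced with a short sketch. RewardConfig here is a stand-in, and shaped_reward is a hypothetical helper that takes severities as plain strings. Several details are assumptions: the high and low defaults are placeholders, the multiplicative factor is applied once when any violation is present, and the clean bonus is added whenever there are no violations.

```python
from dataclasses import dataclass


@dataclass
class RewardConfig:
    critical_penalty: float = -100.0
    high_penalty: float = -50.0   # placeholder default
    medium_penalty: float = -10.0
    low_penalty: float = -1.0     # placeholder default
    clean_bonus: float = 5.0
    multiplicative: bool = False
    multiplicative_factor: float = 0.5
    min_reward: float = -100.0
    max_reward: float = 100.0


def shaped_reward(base_reward: float, severities, config: RewardConfig) -> float:
    reward = base_reward
    if severities:
        if config.multiplicative:
            # Assumption: one scaling step regardless of violation count.
            reward *= config.multiplicative_factor
        else:
            for severity in severities:
                # Unknown severities fall back to the medium penalty.
                reward += getattr(config, f"{severity}_penalty", config.medium_penalty)
    else:
        reward += config.clean_bonus
    # Clamp to [min_reward, max_reward].
    return min(max(reward, config.min_reward), config.max_reward)
```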
19.5 Environment Critical Termination¶
Given: env = GovernedEnvironment(kernel, config=EnvironmentConfig(
terminate_on_critical=True,
violation_penalty=-10.0,
))
state, info = env.reset()
When: agent submits action that triggers critical violation
state, reward, terminated, truncated, info = env.step(action)
Then: terminated == True
reward includes: violation_penalty * 10 = -100.0
info["violations"] contains the critical violation record
19.6 Incremental Span Emission¶
Given: emitter = FlightRecorderEmitter(recorder)
recorder has 5 entries
When: spans1 = emitter.get_new_spans()
Then: len(spans1) == 5, emitter._last_position == 5
When: recorder receives 3 more entries
spans2 = emitter.get_new_spans()
Then: len(spans2) == 3, emitter._last_position == 8
(only new entries converted -- no O(n) re-scan)
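The cursor behaviour in this example can be sketched with a list slice. CursorSketch is a hypothetical stand-in that assumes the recorder's entries are an append-only list and that conversion simply wraps each entry in a dictionary.

```python
class CursorSketch:
    """Hypothetical emitter that converts only entries past its cursor."""

    def __init__(self, entries: list) -> None:
        self._entries = entries      # shared, append-only entry list
        self._last_position = 0

    def get_new_spans(self) -> list:
        # Slice from the cursor so earlier entries are never revisited.
        new_entries = self._entries[self._last_position:]
        self._last_position = len(self._entries)
        return [{"entry": entry} for entry in new_entries]
```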
19.7 Policy Penalty with Unknown Severity¶
Given: violations = [
PolicyViolation(severity="unknown_level", ...),
PolicyViolation(severity="critical", ...),
]
When: penalty = policy_penalty(violations)
Then: penalty = medium_penalty + critical_penalty
= (-10.0) + (-100.0) = -110.0
("unknown_level" falls back to medium_penalty)
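A minimal sketch of the fallback in this example. ViolationSketch is a stand-in carrying only a severity, and the penalty table is hypothetical: the critical and medium values come from the example, while the high and low values are placeholders.

```python
from dataclasses import dataclass

SEVERITY_PENALTIES = {
    "critical": -100.0,
    "high": -50.0,     # placeholder value
    "medium": -10.0,
    "low": -1.0,       # placeholder value
}


@dataclass
class ViolationSketch:
    severity: str


def policy_penalty(violations) -> float:
    """Sum per-severity penalties, using the medium penalty for unknown levels."""
    fallback = SEVERITY_PENALTIES["medium"]
    return sum(
        (SEVERITY_PENALTIES.get(v.severity, fallback) for v in violations),
        0.0,
    )
```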