Debugging and Troubleshooting¶
When you train your own agent with Agent-Lightning, most failures surface because the agent logic is brittle or simply incorrect. Debugging becomes easier when you peel back the stack: start by driving the rollout logic on its own, dry-run the trainer loop, and only then bring the full algorithm and runner topology online. The `examples/apo/apo_debug.py` script demonstrates these techniques; this guide expands on each approach and helps you decide when to reach for them.
Using Runner in Isolation¶
`Runner` is a long-lived worker that wraps your `LitAgent`, coordinates tracing, and talks to the `LightningStore`. In typical training flows the trainer manages runners for you, but being able to spin one up manually is invaluable while debugging.
If you define rollout logic with `@rollout` or implement a `LitAgent` directly, you end up with a `LitAgent` instance that can be executed with `LitAgentRunner`, a subclass of `Runner`. The runner needs a `Tracer` but does not instantiate one, so supply it yourself. See Working with Traces for a walkthrough of tracer options.
`Runner.run_context` prepares the runner to execute a particular agent. Besides the agent and tracer, you must provide a store that will collect spans and rollouts. `InMemoryLightningStore` keeps everything in-process, which is perfect for debugging sessions.
```python
import agentlightning as agl

tracer = agl.OtelTracer()
runner = agl.LitAgentRunner(tracer)
store = agl.InMemoryLightningStore()

with runner.run_context(agent=apo_rollout, store=store):
    ...
```
Inside the `run_context` block you can call `runner.step(...)` to execute a single rollout. The payload includes the task input and any `NamedResources` the agent expects; read the introduction to Resources and NamedResources for more details. For example, if your agent references a `PromptTemplate`, pass it through the `resources` argument:
```python
with runner.run_context(agent=apo_rollout, store=store):
    resource = agl.PromptTemplate(template="You are a helpful assistant. {any_question}", engine="f-string")
    rollout = await runner.step(
        "Explain why the sky appears blue using principles of light scattering in 100 words.",
        resources={"main_prompt": resource},
    )
```
You can perform as many operations as you like within the `Runner.run_context` block. After the rollout finishes, query the store to inspect what happened.
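A minimal sketch of such a query, assuming `query_rollouts` and `query_spans` are the store's query methods (check the `LightningStore` API in your version):

```python
# Run inside or after the run_context block; `rollout` is the result of
# runner.step(...) above. Method names here are assumptions.
print(await store.query_rollouts())
print(await store.query_spans(rollout.rollout_id))
```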
Example output (with a reward span captured):

```
[Rollout(rollout_id='ro-519769241af8', input='Explain why the sky appears blue using principles of light scattering in 100 words.', start_time=1760706315.6996238, ..., status='succeeded')]
[Span(rollout_id='ro-519769241af8', attempt_id='at-a6b62caf', sequence_id=1, ..., name='agentlightning.reward', attributes={'reward': 0.95}, ...)]
```
Swap in an `AgentOpsTracer` instead of `OtelTracer` to see the underlying LLM spans alongside reward information:

```
[
    Span(rollout_id='ro-519769241af8', attempt_id='at-a6b62caf', sequence_id=1, ..., name='openai.chat.completion', attributes={..., 'gen_ai.prompt.0.role': 'user', 'gen_ai.prompt.0.content': 'You are a helpful assistant. Explain why the sky appears blue using principles of light scattering in 100 words.', ...}),
    Span(rollout_id='ro-519769241af8', attempt_id='at-a6b62caf', sequence_id=2, ..., name='openai.chat.completion', attributes={..., 'gen_ai.prompt.0.role': 'user', 'gen_ai.prompt.0.content': 'Evaluate how well the output fulfills the task...', ...}),
    Span(rollout_id='ro-519769241af8', attempt_id='at-a6b62caf', sequence_id=3, ..., name='agentlightning.reward', attributes={'reward': 0.95}, ...)
]
```
Tip

Spans too difficult to read? Try using `Adapter` to convert them into a more readable format.
`Runner.step` executes a full rollout even though it is named "step". The companion method `Runner.iter` executes multiple "steps" by continuously pulling new rollout inputs from the store until a stop event is set. Use `iter` once you are confident the single-step path works and another worker is calling `enqueue_rollout` on the store.
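A hypothetical sketch of that arrangement; the `enqueue_rollout` keyword arguments and the `iter` call signature are assumptions to verify against the store and runner APIs:

```python
# Producer (e.g. an algorithm in another worker) enqueues work into the shared store.
await store.enqueue_rollout(
    input="What's the capital of France?",
    mode="train",
)

# Consumer: the runner drains queued rollouts until a stop event is set.
with runner.run_context(agent=apo_rollout, store=store):
    await runner.iter()
```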
Tip

You can also call `Runner.step` to inject ad-hoc rollouts into a store that a running algorithm is already using, so those rollouts can be consumed by that algorithm. This pattern is sometimes referred to as "online RL". At the moment, no algorithm in the algorithm zoo consumes externally generated rollouts, but the data flow is available if you need it.
Hook into Runner's Lifecycle¶
`Runner.run_context` accepts a `hooks` argument so you can observe or augment lifecycle events without editing your agent. Hooks subclass `Hook` and can respond to four asynchronous callbacks: `on_trace_start`, `on_rollout_start`, `on_rollout_end`, and `on_trace_end`. This is useful for:
- Capturing raw OpenTelemetry spans before they hit the store and before `LitAgentRunner` postprocesses the rollout
- Inspecting the tracer instance after it is activated
- Logging rollout inputs before the agent processes them
The `hook` mode in `examples/apo/apo_debug.py` prints every span collected during a rollout:
```python
import agentlightning as agl

# ... Same as previous example

class DebugHook(agl.Hook):
    async def on_trace_end(self, *, agent, runner, tracer, rollout):
        trace = tracer.get_last_trace()
        print("Trace spans collected during the rollout:")
        for span in trace:
            print(f"- {span.name} (status: {span.status}):\n  {span.attributes}")

with runner.run_context(
    agent=apo_rollout,
    store=store,
    hooks=[DebugHook()],
):
    await runner.step(
        "Explain why the sky appears blue using principles of light scattering in 100 words.",
        resources={"main_prompt": resource},
    )
```
Because hooks run inside the runner process, you can also attach debuggers or set breakpoints directly in the callback implementations.
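For example, a minimal sketch that pauses at the end of every rollout; the keyword arguments mirror the `on_trace_end` example above and are assumed to match `on_rollout_end`:

```python
class BreakpointHook(agl.Hook):
    async def on_rollout_end(self, *, agent, runner, tracer, rollout):
        # Drops into pdb inside the runner process, with the finished
        # rollout in scope for inspection.
        breakpoint()
```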
Note

For a better understanding of where hooks are called, here is pseudocode of the runner's control flow:
```python
resources = await store.get_latest_resources()
rollout = ...
try:
    # <-- on_rollout_start
    with tracer.trace_context(...):
        # <-- on_trace_start
        result = await agent.rollout(...)
        # <-- on_trace_end
    post_process_result(result)
except Exception:
    # <-- on_rollout_end
    await store.update_attempt(status=...)
```
Dry-Run the Trainer Loop¶
Once single rollouts behave, switch to the trainer's dry-run mode. `Trainer.dev` spins up a lightweight fast algorithm (`agentlightning.Baseline` by default) so you can exercise the same infrastructure as `Trainer.fit` without standing up complex stacks like RL or SFT.
Warning

When you enable multiple runners via `n_runners`, the trainer may execute them in separate worker processes. Attaching a debugger such as `pdb` is only practical when `n_runners=1`, and even then the runner might not live in the main process.
```python
import agentlightning as agl

dataset: agl.Dataset[str] = [
    "Explain why the sky appears blue using principles of light scattering in 100 words.",
    "What's the capital of France?",
]
resource = agl.PromptTemplate(template="You are a helpful assistant. {any_question}", engine="f-string")

trainer = agl.Trainer(
    n_runners=1,
    initial_resources={"main_prompt": resource},
)
trainer.dev(apo_rollout, dataset)
```
Just like `Runner.run_context`, `Trainer.dev` requires the `NamedResources` your agent expects. The key difference is that resources are attached to the trainer rather than the runner.
`Trainer.dev` exposes an interface nearly interchangeable with `Trainer.fit`. It also needs a dataset to iterate over, just like `fit`. Under the hood, `dev` uses the same implementation as `fit`, which means you can spin up multiple runners, observe scheduler behavior, and validate how algorithms adapt rollouts. The default `Baseline` logs detailed traces so you can see each rollout as the algorithm perceives it:
```
21:20:30 Initial resources set: {'main_prompt': PromptTemplate(resource_type='prompt_template', template='You are a helpful assistant. {any_question}', engine='f-string')}
21:20:30 Proceeding epoch 1/1.
21:20:30 Enqueued rollout ro-302fb202bd85 in train mode with sample: Explain why the sky appears blue using principles of light scattering in 100 words.
21:20:30 Enqueued rollout ro-e65a3ffaa540 in train mode with sample: What's the capital of France?
21:20:30 Waiting for 2 harvest tasks to complete...
21:20:30 [Rollout ro-302fb202bd85] Status is initialized to queuing.
21:20:30 [Rollout ro-e65a3ffaa540] Status is initialized to queuing.
21:20:35 [Rollout ro-302fb202bd85] Finished with status succeeded in 3.80 seconds.
21:20:35 [Rollout ro-302fb202bd85 | Attempt 1] ID: at-f84ad21c. Status: succeeded. Worker: Worker-0
21:20:35 [Rollout ro-302fb202bd85 | Attempt at-f84ad21c | Span 3a286a856af6bea8] #1 (openai.chat.completion) ... 1.95 seconds. Attribute keys: ['gen_ai.request.type', 'gen_ai.system', ...]
21:20:35 [Rollout ro-302fb202bd85 | Attempt at-f84ad21c | Span e2f44b775e058dd6] #2 (openai.chat.completion) ... 1.24 seconds. Attribute keys: ['gen_ai.request.type', 'gen_ai.system', ...]
21:20:35 [Rollout ro-302fb202bd85 | Attempt at-f84ad21c | Span 45ee3c94fa1070ec] #3 (agentlightning.reward) ... 0.00 seconds. Attribute keys: ['reward']
21:20:35 [Rollout ro-302fb202bd85] Adapted data: [Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=None, metadata={'response_id': '...', 'agent_name': ''}), Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=0.95, metadata={'response_id': '...', 'agent_name': ''})]
21:20:35 Finished 1 rollouts.
21:20:35 [Rollout ro-e65a3ffaa540] Status changed to preparing.
21:20:40 [Rollout ro-e65a3ffaa540] Finished with status succeeded in 6.39 seconds.
21:20:40 [Rollout ro-e65a3ffaa540 | Attempt 1] ID: at-eaefa5d4. Status: succeeded. Worker: Worker-0
21:20:40 [Rollout ro-e65a3ffaa540 | Attempt at-eaefa5d4 | Span 901dd6acc0f50147] #1 (openai.chat.completion) ... 1.30 seconds. Attribute keys: ['gen_ai.request.type', 'gen_ai.system', ...]
21:20:40 [Rollout ro-e65a3ffaa540 | Attempt at-eaefa5d4 | Span 52e0aa63e02be611] #2 (openai.chat.completion) ... 1.26 seconds. Attribute keys: ['gen_ai.request.type', 'gen_ai.system', ...]
21:20:40 [Rollout ro-e65a3ffaa540 | Attempt at-eaefa5d4 | Span 6c452de193fbffd3] #3 (agentlightning.reward) ... 0.00 seconds. Attribute keys: ['reward']
21:20:40 [Rollout ro-e65a3ffaa540] Adapted data: [Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=None, metadata={'response_id': '...', 'agent_name': ''}), Triplet(prompt={'token_ids': []}, response={'token_ids': []}, reward=1.0, metadata={'response_id': '...', 'agent_name': ''})]
21:20:40 Finished 2 rollouts.
```
The only limitation is that resources remain static and components like `LLMProxy` are not wired in. For richer dry runs you can subclass `FastAlgorithm` and override the pieces you care about.
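As a hypothetical sketch, a custom fast algorithm could be handed to the trainer like any other algorithm; the `algorithm` argument follows the standard training setup, while the override points on `Baseline` are assumptions to check against the `FastAlgorithm` base class:

```python
import agentlightning as agl

class MyFastAlgorithm(agl.Baseline):
    """Override the pieces you care about, e.g. logging or resource updates."""
    ...

trainer = agl.Trainer(
    algorithm=MyFastAlgorithm(),
    n_runners=1,
    initial_resources={"main_prompt": resource},
)
trainer.fit(apo_rollout, dataset)
```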
Debug the Algorithm-Runner Boundary¶
Debugging algorithms in Agent-Lightning is often more challenging than debugging agents. Algorithms are typically stateful and depend on several moving parts — runners, stores, and trainers — which makes it difficult to isolate and inspect their behavior. Even mocking an agent to cooperate with an algorithm can be costly and error-prone. To simplify this, Agent-Lightning provides a way to run algorithms in isolation so you can attach a debugger and inspect internal state without interference from other components.
By default, `Trainer.fit` runs the algorithm in the main process and thread, but its logs are interleaved with those from the store and runners, making it hard to follow what's happening inside the algorithm itself. In Write Your First Algorithm, we covered how to stand up a store, algorithm, and runner in isolation for your own implementations. This section extends that approach to cover two common questions:
- How can I run built-in or class-based algorithms (inheriting from `Algorithm`) in isolation?
- How can I still use `Trainer` features like `n_runners`, `adapter`, or `llm_proxy` while debugging?
The solution is to keep using a `Trainer` instance but manage the store yourself, running the algorithm and runner roles separately. This approach mirrors the internal process orchestration of `Trainer.fit`, but with more visibility and control. Below is a step-by-step guide using the `calc_agent` example.
1. Launch the store manually. In a separate terminal, start the store:
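The invocation below is a hypothetical sketch; the command name and flags are assumptions, so substitute the store entry point your agentlightning version actually ships:

```bash
# Terminal 1 – Store process (hypothetical command; the port must match the
# --external-store-address passed to the training script below)
agl store --port 4747
```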
Then, in your training script, create a `LightningStoreClient` and pass it to the trainer:
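A minimal sketch, assuming `LightningStoreClient` takes the store's HTTP address and that `Trainer` accepts the external store via a `store` argument:

```python
import agentlightning as agl

# Connect to the externally managed store launched in Terminal 1.
store = agl.LightningStoreClient("http://localhost:4747")

trainer = agl.Trainer(
    n_runners=1,
    store=store,  # hand the external store to the trainer (assumed kwarg)
    # ... the rest of your existing trainer configuration
)
```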
Set the environment variable `AGL_MANAGED_STORE=0` so the trainer doesn't attempt to manage the store automatically.
2. Start the runner and algorithm processes separately. Each process should run the same training script, but with different environment variables specifying the current role. This setup faithfully mirrors how `Trainer.fit` orchestrates these components behind the scenes.
```bash
# Terminal 2 – Runner process
AGL_MANAGED_STORE=0 AGL_CURRENT_ROLE=runner \
    python train_calc_agent.py --external-store-address http://localhost:4747 --val-file data/test_mini.parquet

# Terminal 3 – Algorithm process
AGL_MANAGED_STORE=0 AGL_CURRENT_ROLE=algorithm \
    python train_calc_agent.py --external-store-address http://localhost:4747 --val-file data/test_mini.parquet
```
3. Reuse your existing trainer configuration. You can continue using the same datasets, adapters, and proxies as usual. Because the store is now external, you can:
- Attach debuggers to either the algorithm or runner process
- Add fine-grained logging or tracing
- Simulate partial failures or latency in individual components
This setup provides a faithful reproduction of the algorithm–runner interaction while keeping the store visible for inspection. Once you've resolved the issue, simply set `AGL_MANAGED_STORE=1` (or omit it) to return to the standard managed training workflow.