The Bird's Eye View of Agent-lightning¶
This article summarizes how Agent-lightning (as of v0.2) wires algorithms, runners, and stores together and shows where auxiliary components (tracer, adapters, proxies) plug into the loop. Each section provides a diagram for a different perspective of the system.
Algorithm ↔ Runner ↔ Store data flow¶
At its heart, Agent-lightning is built on three main components that work in a coordinated loop:
- Algorithm: The "brain" of the system. It decides what tasks to run, learns from the results, and updates resources (like AI models or prompts).
- Runner: The "worker" of the system. It executes tasks assigned by the algorithm, runs the agent, and records the results.
- LightningStore: The central "database" and message queue. It acts as the single source of truth, storing tasks, results, and resources, and enabling communication between the Algorithm and Runner.
The typical data flow in a training loop is as follows: The Algorithm enqueues tasks (called Rollouts) into the Store. A Runner then dequeues a task, executes it, and streams the results (called Spans) back to the Store. Once the task is complete, the Algorithm can query the new data from the Store to learn and update its resources.
The diagram below shows this fundamental interaction in a simple, non-parallel setup.
sequenceDiagram
autonumber
participant Algo as Algorithm
participant Store as LightningStore
participant Runner
participant Agent
loop Over the dataset
Algo-->>Store: add_resources + enqueue_rollout
Store-->>Runner: dequeue_rollout → AttemptedRollout
Store-->>Runner: get_latest_resources
Runner-->>Store: update_attempt("running", worker_id)
Runner->>Agent: rollout + resources
Agent->>Runner: reward / spans
Runner-->>Store: add_span or add_otel_span
Runner-->>Store: update_attempt("finished", status)
Store-->>Algo: query_rollouts + spans
Algo-->>Algo: Update resources (optional)
end
Solid lines represent direct calls, while dashed lines are asynchronous or long-running operations.
Key Terminology¶
We define the following terms, which may be helpful for understanding the diagram above.
- Resources: A collection of assets to be tuned or trained. Agents perform rollouts against resources and collect span data. Algorithms use those data to update the resources. In RL training, the resources are a tunable model. In prompt tuning, the resources are prompt templates.
- Rollout: A unit of work that an agent performs against a resource. Before execution, a rollout (noun) is incomplete; in that state it is also known as a task, sample, or job (these terms are used interchangeably). The agent then executes its own defined workflow against it, a process also called "to rollout" (verb). After execution, the rollout (noun) is considered complete.
- Attempt: A single execution of a rollout. One rollout can have multiple attempts in case of failures or timeouts.
- Span: During the rollout, the agent can generate multiple spans (also known as "traces" or "events"). The recorded spans are collected in the store, which is crucial for understanding agent behavior and optimizing agents.
- Reward: A special span carrying a number that judges the quality of the rollout, or of some period within it.
- Dataset: A collection of incomplete rollouts (i.e., tasks) for the agent to process. The two datasets (train and val) serve as the initial input from which the algorithm enqueues the first batches of rollouts.
Store¶
As discussed previously, the store is the central hub for all data in Agent-lightning. The store exposes a set of APIs for algorithms and runners to interact with the data; the most important ones are:
from typing import List, Optional

from agentlightning.types import AttemptedRollout, ResourcesUpdate, Rollout, Span, TaskInput

class LightningStore:
    async def enqueue_rollout(self, input: TaskInput, ...) -> Rollout: ...
    async def dequeue_rollout(self) -> AttemptedRollout | None: ...
    async def add_span(self, span: Span) -> Span: ...
    async def get_latest_resources(self) -> Optional[ResourcesUpdate]: ...
    async def wait_for_rollouts(self, rollout_ids: List[str], ...): ...
    async def query_spans(self, rollout_id: str, ...): ...
    async def update_attempt(self, rollout_id: str, attempt_id: str, status: str, ...): ...
    ...
As the APIs show, the store essentially provides a queue for rollouts and storage for resources, spans, and attempts. Developers should implement the store carefully to ensure data integrity and consistency, especially when multiple runners work in parallel across multiple attempts.
The store is designed to be extensible. Users can implement their own store by inheriting from LightningStore and overriding its methods. Agent-lightning provides a few reference implementations, such as InMemoryLightningStore (the default) and SqliteLightningStore (under construction). When parallelized, the store may need special wrappers to ensure thread/process safety, or to delegate work to a store running in another process or on another machine.
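To make the loop concrete, here is a minimal sketch of how an algorithm and a runner might drive the store through the interface above. It is illustrative only: the task construction, the rollout_id / attempt_id field names, and the agent invocation are simplifying assumptions rather than the library's exact semantics.

```python
# Minimal sketch of the core loop; field names and the agent invocation are assumptions.
async def algorithm_loop(store: LightningStore, dataset) -> None:
    for task in dataset:
        rollout = await store.enqueue_rollout(input=task)                 # (1) enqueue work
        await store.wait_for_rollouts(rollout_ids=[rollout.rollout_id])   # (2) block until done
        spans = await store.query_spans(rollout.rollout_id)               # (3) fetch collected spans
        # (4) learn from the spans and publish updated resources (omitted here)

async def runner_loop(store: LightningStore, agent) -> None:
    while (attempt := await store.dequeue_rollout()) is not None:         # pull the next attempt
        resources = await store.get_latest_resources()
        # Execute the agent's workflow against the task with the current resources.
        # In practice the tracer streams spans to the store automatically during this call.
        await agent.training_rollout(attempt, resources)                  # arguments are illustrative
        await store.update_attempt(attempt.rollout_id, attempt.attempt_id, "finished")
```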
Supporting Components in the Loop¶
While the core loop is simple, Agent-lightning provides several components to make development easier and more powerful.
Tracer¶
The tracer is a component within the Runner that records detailed spans (events) during an agent's execution and sends them to the Store. Instead of requiring the agent to manually log every span, the tracer automatically instruments key methods (e.g., LLM calls) and captures their inputs, outputs, and metadata. This provides a detailed log of the agent's behavior with minimal effort.
sequenceDiagram
autonumber
participant Store
participant Runner
participant Tracer
participant Agent
Note over Runner,Tracer: Runner manages tracer as member
Tracer->>Agent: Apply instrumentation
loop Until no more rollouts
Store-->>Runner: dequeue_rollout → AttemptedRollout
Store-->>Runner: get_latest_resources
Runner->>Agent: training_rollout / validation_rollout
loop For each finished span
Agent-->>Tracer: openai.chat.completion invoked<br>agent.execute invoked<br>...
Agent->>Tracer: emit intermediate reward
Tracer-->>Store: add_otel_span(rollout_id, attempt_id, span)
end
Agent->>Runner: final reward + extra spans (if any)
Runner-->>Store: add_span(rollout_id, attempt_id, span)
Runner-->>Store: update_attempt(status)
end
Tracer->>Agent: Unapply instrumentation
The diagram above shows the overall data flow between the store, tracer, and agent. In reality, it is a bit more involved than that. Spans are not emitted actively by the agent; the tracer intercepts them by hooking and instrumenting key methods used inside the agent, and it uses a callback (called an exporter) to observe events and log them to the store. Before a rollout starts, the runner enters a trace_context before invoking the agent, wiring the store identifiers into the tracer. Each span completion streams back to the store through LightningSpanProcessor.on_end, so the agent's instrumentation lands in add_otel_span. If the agent's rollout method returns a numeric reward, the runner emits one more OpenTelemetry span before finalizing the attempt.
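To illustrate the exporter idea, the sketch below shows a simplified OpenTelemetry span processor that forwards each finished span to the store. This is not the actual LightningSpanProcessor: the constructor, the identifier wiring, and the sync-to-async bridging are assumptions made for illustration.

```python
# Conceptual sketch only; not the real LightningSpanProcessor implementation.
import asyncio

from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor


class StoreForwardingProcessor(SpanProcessor):
    """Forwards every finished OpenTelemetry span to a LightningStore."""

    def __init__(self, store, rollout_id: str, attempt_id: str, loop: asyncio.AbstractEventLoop):
        self._store = store
        self._rollout_id = rollout_id   # wired in when the runner enters the trace context
        self._attempt_id = attempt_id
        self._loop = loop               # event loop that owns the (async) store

    def on_end(self, span: ReadableSpan) -> None:
        # on_end is synchronous, so bridge the call to the async store API.
        asyncio.run_coroutine_threadsafe(
            self._store.add_otel_span(self._rollout_id, self._attempt_id, span),
            self._loop,
        )
```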
Hooks¶
Hooks are user-defined callback functions that allow you to augment a Runner's behavior at specific points in its lifecycle. You can use hooks to add custom logging, set up resources before a rollout begins, or tear them down after it ends. Hooks can be triggered at four key moments: on_rollout_start, on_trace_start, on_trace_end, and on_rollout_end.
Users should pay special attention to the difference between on_trace_end and on_rollout_end. The former is called right before the tracer exits the trace context, while the latter is called after the runner processes the final leftover rewards and spans, and finalizes the attempt in the store.
sequenceDiagram
autonumber
participant Store
participant Hooks
participant Runner
participant Tracer
participant Agent
Note over Runner,Hooks: Runner manages hooks as member
loop Until no more rollouts
Store-->>Runner: dequeue_rollout → AttemptedRollout
Store-->>Runner: get_latest_resources
Runner->>Hooks: on_rollout_start(agent, runner, rollout)
Runner->>Agent: training_rollout / validation_rollout
Tracer->>Agent: enter_trace_context
activate Tracer
Runner->>Hooks: on_trace_start(agent, runner, tracer, rollout)
Note over Runner,Agent: Agent rollout omitted
Runner->>Hooks: on_trace_end(agent, runner, tracer, rollout)
Tracer->>Agent: exit_trace_context
deactivate Tracer
Agent->>Runner: final reward + extra spans (if any)
Runner-->>Store: add_span(rollout_id, attempt_id, span)
Runner->>Hooks: on_rollout_end(agent, runner, rollout, status)
end
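As a sketch of what a hook might look like, the class below logs the duration of every rollout. The callback names and arguments follow the diagram above; whether hooks subclass a particular base class, whether the callbacks are async, and the rollout_id field are assumptions made for illustration.

```python
# Hedged sketch of a hook that measures rollout duration; rollout.rollout_id is an assumed field.
import time


class TimingHooks:
    def __init__(self) -> None:
        self._started_at: float = 0.0

    async def on_rollout_start(self, agent, runner, rollout) -> None:
        self._started_at = time.monotonic()

    async def on_trace_start(self, agent, runner, tracer, rollout) -> None:
        pass  # e.g., attach extra attributes once the trace context is entered

    async def on_trace_end(self, agent, runner, tracer, rollout) -> None:
        pass  # called right before the tracer exits the trace context

    async def on_rollout_end(self, agent, runner, rollout, status) -> None:
        elapsed = time.monotonic() - self._started_at
        print(f"Rollout {rollout.rollout_id} ended with status {status} after {elapsed:.1f}s")
```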
Adapter¶
The Adapter is a component used by the Algorithm to transform raw data from the Store into a format suitable for learning. Runners stream raw spans into the Store during execution. Later, the Algorithm queries these spans and uses an Adapter to convert them into structured data, like training examples for a reinforcement learning model.
For instance, the TraceTripletAdapter processes OpenTelemetry spans to create (prompt, response, reward) triplets, which are the fundamental data structure for many RL fine-tuning algorithms.
flowchart LR
Runner -- (1) add_otel_span --> Store
Store -- (2) query_spans --> Algorithm
Algorithm -- (3) spans --> Adapter
Adapter -- (4) transformed data --> Algorithm
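A hedged sketch of this flow on the algorithm side is shown below. TraceTripletAdapter and the store calls come from this article; the import path, the constructor arguments, and adapt() being synchronous are assumptions made for illustration.

```python
# Hedged sketch: turn raw spans into (prompt, response, reward) triplets for training.
from agentlightning import TraceTripletAdapter  # exact import path is an assumption


async def collect_triplets(store, rollout_ids):
    adapter = TraceTripletAdapter()
    triplets = []
    for rollout_id in rollout_ids:
        spans = await store.query_spans(rollout_id)   # (2) raw spans from the store
        triplets.extend(adapter.adapt(spans))         # (3)+(4) spans -> triplets
    return triplets
```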
LLM Proxy¶
The LLM Proxy is an optional bridge component that sits between an agent and the algorithms' resources. It acts as a centralized endpoint for all LLM calls. Usually the proxy URL is added to the store as a special resource, so that the runner can fetch it along with other resources when dequeuing a rollout. During rollouts, the runner invokes the proxy's HTTP endpoint instead of calling a model backend directly.
This design offers several benefits:
- Instrumentation: It automatically captures detailed traces of LLM interactions (prompts, responses, metadata) and sends them to the Store, complementing the Tracer, especially when the agent's code is hard to instrument directly.
- Backend Abstraction: It provides a unified interface for various LLM backends (OpenAI, Anthropic, local models) and can add features like retry logic, rate limiting, and caching.
- Resource Management: The Algorithm can dynamically update which LLM the agent uses (e.g., swapping to a newly fine-tuned model) by simply swapping the backend model the proxy is using, without interrupting the agent's code.
The benefits above are mostly framed in terms of model fine-tuning, but the proxy is useful for prompt tuning as well. The algorithm can register either of the following two types of endpoints with the proxy:
- Endpoint served by the algorithm: If the algorithm is internally updating the LLM weights (e.g., RL), it can launch an LLM inference engine (i.e., a model server) and register the endpoint URL with the proxy. The proxy then forwards all LLM calls to that endpoint.
- Third-party LLM endpoint: If the algorithm is not updating the LLM weights (e.g., prompt tuning), it can register a third-party LLM endpoint into the proxy.
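From the agent's point of view, either choice looks the same: the proxy URL arrives as a resource, and the agent simply points an OpenAI-compatible client at it. The sketch below assumes the proxy speaks the OpenAI chat-completions protocol; the URL and model name are illustrative values taken from the fetched resources.

```python
# Hedged sketch of an agent-side LLM call routed through the proxy.
from openai import OpenAI


def ask(proxy_url: str, model: str, question: str) -> str:
    client = OpenAI(base_url=proxy_url, api_key="not-needed")  # the proxy handles routing/auth
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```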
We show a diagram below that illustrates how the proxy fits into the overall data flow.
sequenceDiagram
autonumber
participant Algo as Algorithm
participant LLMProxy as LLM Proxy
participant Store
participant Runner
participant Agent
Note over Algo,LLMProxy: Algorithm manages LLMProxy as member
loop Over the Dataset
Algo->>Algo: Launch LLM Inference Engine<br>(optional)
Algo->>LLMProxy: Register Inference Engine<br>(optional)
Algo-->>Store: enqueue_rollout
LLMProxy->>Store: Proxy URL added as Resource
Store-->>Runner: dequeue_rollout → AttemptedRollout
Store-->>Runner: get_latest_resources
Runner->>Agent: rollout + resources<br>(LLM Proxy URL as resource)
loop Defined by Agent
Agent-->>LLMProxy: LLM calls
activate LLMProxy
LLMProxy-->>Store: add_span or add_otel_span
LLMProxy-->>Agent: LLM responses
deactivate LLMProxy
Agent-->>Runner: rewards
Runner-->>Store: add_span or add_otel_span
end
Runner-->>Store: update_attempt("finished", status)
Store-->>Algo: query_rollouts + spans
Algo-->>Algo: Update LLM Weights<br>(optional)
end
In this diagram, the store receives spans from both the proxy and the runner. This becomes important later when we discuss parallelism: the proxy and the runner may run on different machines, so spans need to obtain a special counter from the store to ensure they are ordered correctly.
Trainer¶
The Trainer is the high-level orchestrator that initializes and connects all major components -- algorithm, runner, store, tracer, adapter, LLM proxy, and hooks. Each component's lifecycle can last as long as the trainer's. The trainer manages these lifecycles and handles dependency injection, ensuring that every part of the system operates within a consistent and shared environment.
The diagram below shows how the components relate to each other and what roles they play. We first clarify the relationship types used in the diagram:
- Owns: components that the trainer constructs and manages directly (e.g., runner, tracer).
- Injects: components passed into others as dependencies.
- References: weak links for coordination without ownership.
- Uses: components that are temporarily interacted with.
For example, the store is injected into the algorithm and runner. The tracer and agent are injected into the runner. The adapter and LLM proxy are injected into the algorithm. The store is further injected into the tracer, adapter and LLM proxy by the runner and algorithm respectively.
flowchart TD
%% === Left side: Algorithm domain ===
subgraph L["Algorithm Side"]
Algorithm["Algorithm<br>(no default)"]
Adapter["Adapter<br>(TraceTripletAdapter*)"]
LLMProxy["LLM Proxy<br>(no default)"]
Algorithm -.injects.-> Adapter
Algorithm -.injects.-> LLMProxy
end
linkStyle 0,1 stroke:#896978,stroke-width:2px;
%% === Middle: Core trainer and store ===
subgraph M["Core"]
Trainer["Trainer"]
Store["LightningStore<br>(InMemory* default)"]
Trainer --has--> Algorithm
Trainer --has--> Store
Trainer --has--> Adapter
Trainer --has--> LLMProxy
end
linkStyle 2,3,4,5 stroke:#839791,stroke-width:2px;
%% === Right side: Runner side ===
subgraph R["Runner Side"]
Runner["Runner<br>(AgentRunnerV2* default)"]
Tracer["Tracer<br>(AgentOpsTracer*)"]
Hooks["Hooks (empty default)"]
Agent["Agent<br>(LitAgent*)"]
Runner -.injects.-> Tracer
Runner -.injects.-> Store
Runner -.injects.-> Agent
Runner -.injects.-> Hooks
Tracer -.injects.-> Store
Hooks -.uses.-> Runner
Hooks -.uses.-> Agent
Hooks -.uses.-> Tracer
end
linkStyle 6,7,8,9,10 stroke:#896978,stroke-width:2px;
linkStyle 11,12,13 stroke:#7a89c2,stroke-width:2px;
%% === Cross-section connections ===
Trainer --has--> Runner
Trainer --has--> Tracer
Trainer --has--> Hooks
Trainer --uses--> Agent
Algorithm -.injects.-> Store
LLMProxy -.injects.-> Store
Agent -.references.-> Trainer
Runner -.references.-> Trainer
Algorithm -.references.-> Trainer
linkStyle 14,15,16 stroke:#839791,stroke-width:2px;
linkStyle 17,20,21,22 stroke:#7a89c2,stroke-width:2px;
linkStyle 18,19 stroke:#896978,stroke-width:2px;
style L fill:none;
style M fill:none;
style R fill:none;
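To make the ownership graph concrete, here is a minimal sketch of wiring components through the Trainer. The import paths, keyword arguments, and the fit() call are assumptions for illustration; components marked with an asterisk in the diagram are filled in with defaults when omitted.

```python
# Hedged sketch of Trainer wiring; import paths, keyword names, and fit() signature are assumptions.
from agentlightning import InMemoryLightningStore, Trainer

# my_algorithm, my_agent, train_tasks, and val_tasks are placeholders defined elsewhere.
trainer = Trainer(
    algorithm=my_algorithm,          # no default: must be supplied
    store=InMemoryLightningStore(),  # could be omitted to fall back to the default store
    n_runners=4,                     # replicate the runner bundle four times
)
trainer.fit(my_agent, train_dataset=train_tasks, val_dataset=val_tasks)
```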
Putting It All Together: A Reinforcement Learning Example (VERL)¶
VERL shows how an algorithm consumes the shared infrastructure. For historical reasons, code lives in both agentlightning.algorithm.verl and agentlightning.verl. The latter is legacy and reuses terms like Trainer in confusing ways; the former is a thin wrapper that conforms to the new algorithm interface. Future versions will merge the two.
Reinforcement learning aims to learn a policy that takes actions in states to maximize expected reward. For agents, the policy is usually a language model. Inputs are prompts (state). Outputs are generated text (action). A numeric score judges quality (reward). The (state, action, reward) triplet is the basic learning unit.
In Agent-lightning, the environment is implicit in the agent's workflow, which orchestrates one or more LLM calls and often self-judges using rules or additional model calls. During a rollout, the agent emits spans that contain everything needed for RL training, including LLM call traces and numeric judge/reward signals. The "algorithm", on the other hand, has more responsibilities:
- Providing a language model deployment, which is continuously learning and improving, for the agent to interact with;
- Preparing the tasks that the agents will perform;
- Querying the spans generated, extracting triplets, and converting them into a format that the underlying RL library can consume;
- Updating the language model based on the learning signals.
In the VERL integration, the algorithm launches a chat completion endpoint using vLLM and wraps training with FSDP for distributed optimization. It enqueues tasks from the dataset. After rollouts finish, it queries spans and converts them to triplets with TraceTripletAdapter. VERL's native training loop then consumes these triplets to update model weights. The workflow can be summarized in the following diagram.
sequenceDiagram
autonumber
participant vLLM as vLLM Chat<br>Completion Endpoint
participant FSDP as FSDP / Megatron<br>Weights Optimizer
participant Algo as Algorithm<br>Main Controller<br>(Main Process)
participant Adapter as TraceTripletAdapter
participant LLMProxy as LLM Proxy
participant Store as LightningStore
participant Runner as Runner + Agent
Note over Algo,LLMProxy: LLMProxy and Adapter are injected by Trainer as member
Note over vLLM,Algo: Algorithm creates and owns vLLM and FSDP
loop Over the Dataset in Batches
Algo->>vLLM: Create Chat Completion Endpoint
activate vLLM
vLLM->>LLMProxy: Registered as Backend Endpoint
LLMProxy->>Store: Proxy URL added as Resource
par Over data samples in the batch
Algo-->>Store: enqueue_rollout
Store-->>Runner: Dequeue Rollout +<br>Resources (i.e., URL)
loop One Rollout Attempt
Runner-->>LLMProxy: LLM calls
LLMProxy-->>vLLM: Forwarded LLM calls
vLLM-->>LLMProxy: LLM responses
LLMProxy-->>Store: add_span / add_otel_span
LLMProxy-->>Runner: Forwarded LLM responses
Runner-->>Store: add_span / add_otel_span <br> (by tracer, including rewards)
end
Runner-->>Store: update_attempt("finished", status)
end
Algo-->>Store: Poll for completed rollouts + spans
Algo->>vLLM: Chat Completion Endpoint Sleeps
deactivate vLLM
Algo->>Adapter: adapt(spans)
Adapter->>FSDP: Triplets (state, action, reward)
activate FSDP
FSDP-->>Algo: Updated LLM weights
deactivate FSDP
end
Notes:

- There are interactions between different components injected into or owned by the algorithm in the diagram, such as the output of the adapter feeding into the FSDP optimizer. This is for simplicity of illustration and slightly different from the actual implementation, where it's the algorithm main controller that orchestrates the data flow between components.
- On mapping to VERL: VERL uses a classic RLHF setup where each action is a single token, the state is the full conversation history up to that token, and reward is given at the end. This is very different from our setup where each action is actually a chunk of text, although both are called RL! Therefore, after the adapter produces triplets, the algorithm converts each (state, action, reward) into a VERL trajectory (DataProto) with keys like input_ids, position_ids, attention_mask, and token_level_scores. That conversion happens after triplet generation and is not shown in the diagram.
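The sketch below illustrates what such a conversion might look like for a single triplet, under simplifying assumptions: the tokenizer choice is arbitrary, no padding or truncation is handled, and the whole reward is attributed to the last response token. The key names follow the note above; everything else is illustrative rather than VERL's actual code.

```python
# Hedged sketch of converting one (state, action, reward) triplet into token-level tensors.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # any causal-LM tokenizer


def triplet_to_trajectory(prompt: str, response: str, reward: float) -> dict:
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor(prompt_ids + response_ids)
    attention_mask = torch.ones_like(input_ids)
    position_ids = torch.arange(len(input_ids))
    # Attribute the scalar reward to the final response token; all other tokens get zero.
    token_level_scores = torch.zeros(len(input_ids))
    token_level_scores[-1] = reward
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
        "token_level_scores": token_level_scores,
    }
```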
Execution Strategies and Parallelism¶
Readers might have observed from the diagrams above that there is no direct communication between (1) the runner and agent and (2) the algorithm. Their only shared touchpoints are the trainer and the store; this is especially clear in the diagram in the Trainer section. This design allows us to flexibly scale the runner and algorithm independently, which is crucial for large-scale training.
Agent-lightning packages two executable bundles: a runner bundle (runner, tracer, hooks, agent) and an algorithm bundle (algorithm, adapter, LLM proxy). Both share the store. The trainer initializes and connects the bundles.
graph TD
subgraph Runner_Side["Runner Bundle"]
direction LR
R[Runner] --- T[Tracer] --- H[Hooks] --- A1[Agent]
end
subgraph Algorithm_Side["Algorithm Bundle"]
direction LR
ALG[Algorithm] --- AD[Adapter] --- LLM[LLM Proxy]
end
S[(Store)]
TR[Trainer]
Runner_Side <--> S
Algorithm_Side <--> S
TR --> Runner_Side
TR --> Algorithm_Side
linkStyle 0,1,2,3,4 opacity:0;
An execution strategy, defined and owned by the trainer, governs how algorithm and runner bundles are placed, connected, scaled, and aborted. It serves four primary purposes.
Execution strategies first determine bundle placement — whether the two bundles run in the same thread, process, machine, or across separate machines. They also define store management, wrapping the store and specifying how data is shared between bundles.
In terms of scalability, the strategy can replicate the runner bundle across multiple threads, processes, or machines to expand throughput on the runner side. The algorithm side remains single-process due to the complexity of parallelization. Mature frameworks such as DeepSpeed and Megatron already support distributed model training, so scaling of the algorithm bundle is delegated to those implementations.
Abort handling is another core responsibility. Aborts may be triggered by normal exits, failures in either bundle, or user interrupts. The trainer must include cancellation interfaces for the bundles so that bundles can be cleanly aborted. When the algorithm bundle exits normally, the strategy signals the runner bundle to terminate. If the runner exits first, no signal is sent to the algorithm, as it may still be processing completed rollouts. In cases of failure or user interruption, the strategy signals both bundles to abort; if a bundle fails to respond, the strategy should attempt a forceful termination.
Agent-lightning currently provides two execution strategies: shared-memory and client-server, described in the following sections.
Shared-memory Strategy¶
SharedMemoryExecutionStrategy runs the algorithm and runner bundles as threads in one process. The strategy wraps the store with LightningStoreThreaded, which guards calls with a lock for safe concurrency.
This is good for lightweight debugging because components share one Python heap and avoid serialization. It is not suitable for heavy RL training or compute-intensive agents.
flowchart TB
subgraph MainProcess
direction TB
subgraph AlgorithmThread [Thread 0]
Algorithm[Algorithm bundle]
end
subgraph RunnerThread1 [Thread 1]
Runner1[Runner bundle #1]
end
subgraph RunnerThread2 [Thread 2]
Runner2[Runner bundle #2]
end
subgraph RunnerThread3 [Thread 3]
RunnerN[Runner bundle #N]
end
LightningStoreFacade[LightningStoreThreaded]
BaseStore[Underlying LightningStore]
end
Algorithm -- async calls --> LightningStoreFacade
Runner1 -- async calls --> LightningStoreFacade
Runner2 -- async calls --> LightningStoreFacade
RunnerN -- async calls --> LightningStoreFacade
LightningStoreFacade -->|thread-safe delegates| BaseStore
You can configure which role runs on the main thread. If the main thread runs the algorithm, it is able to spawn multiple runner threads. If it runs a runner, n_runners must be 1 and the runner lives on the main thread.
Client-server Strategy¶
ClientServerExecutionStrategy splits concerns across processes. The algorithm bundle starts a LightningStoreServer (HTTP API) that wraps the underlying store. Runners connect via LightningStoreClient to call the same interface over REST. The server embeds a client to support algorithm-launched subprocesses (e.g., an LLM proxy worker) that need to talk back to the algorithm's process through the same API.
Currently this design introduces an extra wrapper on the server side (as shown in the diagram), which helps with debugging and improves fault tolerance. We might revisit this design in the future and make the client the only way to communicate with the store.
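A hedged sketch of the runner side is shown below: the client exposes the same LightningStore interface but sends each call over HTTP to the algorithm's store server. The import path, constructor arguments, and server address are assumptions.

```python
# Hedged sketch of a runner process connecting to the store server over HTTP.
import asyncio

from agentlightning import LightningStoreClient  # exact import path is an assumption


async def run_worker(server_url: str) -> None:
    store = LightningStoreClient(server_url)                   # constructor arguments are assumed
    while (attempt := await store.dequeue_rollout()) is not None:
        ...  # run the agent against the attempt, exactly as in the shared-memory case


asyncio.run(run_worker("http://algorithm-host:4747"))           # address is illustrative
```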
flowchart TD
subgraph Algorithm Process Group
subgraph StoreServer[LightningStoreServer]
StoreHttpClient[HTTP Client]
StoreHttpServer[HTTP Server]
StoreWrapper[LightningStore Wrapper]
StoreHttpClient -- HTTP --> StoreHttpServer
end
subgraph Algorithm Bundle
Algorithm[Algorithm Main Process]
subgraph Another subprocess
LLMProxy[LLM Proxy]
end
end
LLMProxy -- async calls --> StoreHttpClient
Algorithm -- async calls --> StoreWrapper
end
subgraph RunnerSide ["Runner Side"]
subgraph Runner Process 1
Runner1[Runner bundle #1]
Runner1 -- async calls --> LightningStoreClient1
LightningStoreClient1[LightningStoreClient]
end
subgraph Runner Process 2
Runner2[Runner bundle #2]
Runner2 -- async calls --> LightningStoreClient2
LightningStoreClient2[LightningStoreClient]
end
subgraph Runner Process N
RunnerN[Runner bundle #N]
RunnerN -- async calls --> LightningStoreClientN
LightningStoreClientN[LightningStoreClient]
end
end
LocalStore[Underlying LightningStore]
StoreHttpServer -->|delegates| StoreWrapper
StoreWrapper -->|delegates| LocalStore
LightningStoreClient1 -- HTTP --> StoreHttpServer
LightningStoreClient2 -- HTTP --> StoreHttpServer
LightningStoreClientN -- HTTP --> StoreHttpServer
style RunnerSide fill:none;
Online/Continuous Learning¶
Continuous learning keeps the algorithm loop running while runners report tasks and spans opportunistically. Key differences from batch mode:
- The algorithm does not enqueue rollouts from a fixed dataset. Runners report tasks/rollouts and spans spontaneously.
- The algorithm can wait for an expected set of rollout IDs, but more often it polls for new rollouts and spans, or waits for a certain count to arrive.
- The runner processes one rollout at a time via step(task) instead of exhausting a task queue. It notifies the store when starting a rollout so the store records it.
- A user or higher-level loop controls which resources the next step uses and when to retry.
Spans, adapters, and LLM proxies work the same way.
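A hedged sketch of the user-driven loop is shown below. The step() signature, the stopping condition, and receive_next_request() (a hypothetical stand-in for whatever produces live tasks) are assumptions based on the diagram that follows.

```python
# Hedged sketch of a continuous-learning driver; step() arguments are illustrative.
import asyncio


async def receive_next_request() -> dict:
    """Hypothetical source of live tasks (an API request, a message queue, ...)."""
    await asyncio.sleep(1.0)
    return {"question": "What is 2 + 2?"}  # illustrative task payload


async def serve_forever(runner, store) -> None:
    # The algorithm keeps polling the store in parallel (see the diagram below);
    # only the user/runner side is shown here.
    while True:
        task = await receive_next_request()
        resources = await store.get_latest_resources()   # pick up the newest prompts/model
        await runner.step(task, resources=resources)      # one rollout; spans stream to the store
```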
sequenceDiagram
autonumber
actor User
participant Runner
participant Agent
participant Store as LightningStore
participant Algorithm
Note over Algorithm: Algorithm is long-running and loops continuously
loop Continuous Learning Loop
activate User
opt Decide what to do next
User-->>Store: get_resources_by_id
Store-->>User: Resources
User-->>User: Prepare input for next step
end
User->>Runner: step(input, resources)
activate Runner
Runner-->>Store: Notify: start_rollout(input)
Runner->>Agent: rollout(input, resources)
Agent-->>Runner: add_span / reward spans
Runner-->>Store: add_span or add_otel_span
Runner-->>Store: update_attempt(status="finished")
deactivate Runner
deactivate User
Algorithm->>Store: poll for new rollouts and spans
opt If there is enough new data
Store-->>Algorithm: new spans
Algorithm->>Algorithm: adapt spans → learning signal
Algorithm->>Store: update_resources
end
end