Scaling out Agent-lightning¶
Agent-lightning splits training into an algorithm bundle and a runner bundle that exchange work through the LightningStore. This tutorial shows how to increase rollout throughput, place bundles across processes or machines, and keep the algorithm side scalable with external frameworks.
Parallelizing Rollouts with Trainer¶
Before we dive into the details of the bundles and execution strategies, let's first revisit how to parallelize rollouts with Trainer.
Trainer is the quickest way to dial up parallelism. Even when n_runners = 1, calling Trainer.fit runs the algorithm and runners in parallel. The algorithm enqueues rollouts; runners dequeue them and execute your LitAgent, and the algorithm collects spans via its Adapter before scheduling the next batch.
Note
One of the most important features of Trainer is its ability to abort work gracefully. For example, if you press Ctrl+C in the terminal, the algorithm aborts and the runners stop executing; if the algorithm crashes, the runners also stop.
Increase throughput by setting n_runners when constructing the trainer. The following example comes from train_calc_agent.py. Since backend LLMs usually use techniques like continuous batching to increase throughput, you do not have to worry about overwhelming the backend with too many requests.
import agentlightning as agl
from datasets import Dataset as HFDataset

from calc_agent import calc_agent

train_dataset = HFDataset.from_parquet("data/train.parquet").to_list()
val_dataset = HFDataset.from_parquet("data/test.parquet").to_list()

algorithm = agl.VERL(verl_config)
trainer = agl.Trainer(
    algorithm=algorithm,
    n_runners=8,  # launch eight rollout workers
    tracer=agl.OtelTracer(),
    adapter=agl.LlmProxyTraceToTriplet(),
)
trainer.fit(calc_agent, train_dataset=train_dataset, val_dataset=val_dataset)
Trainer exposes several other initialization parameters for customizing the training process. For example, max_rollouts keeps smoke tests short, and passing a concrete LightningStore instance gives you persistence or lets multiple scripts share the same queue.
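For instance, a minimal sketch that combines both options, reusing the MongoDB-backed store described later on this page (the URI is illustrative):

import agentlightning as agl
from agentlightning.store.mongo import MongoLightningStore

# Cap the run at a handful of rollouts and share an explicit store across scripts.
store = MongoLightningStore(mongo_uri="mongodb://localhost:27017/?replicaSet=rs0")
trainer = agl.Trainer(
    algorithm=algorithm,  # e.g. the agl.VERL(...) instance from the example above
    n_runners=2,
    max_rollouts=16,      # stop after 16 rollouts to keep the smoke test short
    store=store,
)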
Tip
Before scaling out, run Trainer.dev() with n_runners=1 to verify the rollout logic and spans without burning GPU hours.
Bundles and Execution Strategies¶
When Trainer starts, it packages its configuration into two callable bundles:
- The algorithm bundle wraps your Algorithm, adapter, and any LLM proxy into a single callable that can be aborted via a signal event.
- The runner bundle wraps the Runner, tracer, hooks, and agent into a single callable that can also be aborted via a signal event. Unlike the algorithm bundle, the runner bundle is expected to be replicated.
An execution strategy then decides where those bundles are placed (threads vs processes vs multiple machines), how many runner replicas to launch, and how lifecycle events such as shutdown are coordinated.
By default, the trainer builds an InMemoryLightningStore if you do not provide one. Because that store has no locking or cross-process transport, the execution strategy is the component that wraps it in thread-safe or HTTP-safe facades (LightningStoreThreaded, LightningStoreServer) before handing it to bundles. For a deeper look at these facades, see Understanding the Store and Birds' Eye View.
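Conceptually, the wrapping looks like the sketch below. This is purely illustrative: the import locations and constructor signatures are assumptions, and in practice the execution strategy builds these facades for you.

from agentlightning import InMemoryLightningStore, LightningStoreThreaded, LightningStoreServer

# Plain in-memory store: no locking, no cross-process transport.
base_store = InMemoryLightningStore()

# Thread-safe facade used by the shared-memory strategy (signature assumed).
threaded_store = LightningStoreThreaded(base_store)

# HTTP facade used by the client-server strategy (host and port are illustrative).
server_store = LightningStoreServer(base_store, host="127.0.0.1", port=4747)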
Agent-lightning provides two built-in execution strategies: SharedMemoryExecutionStrategy and ClientServerExecutionStrategy. You can pass a string alias, a configuration dictionary, or a pre-built strategy instance:
import agentlightning as agl

algorithm = agl.Baseline()

# Short alias for the shared-memory strategy.
# Because the runner lives on the main thread in this mode,
# n_runners must be 1 unless you move the algorithm to the main thread.
trainer = agl.Trainer(algorithm=algorithm, n_runners=1, strategy="shm")

# Dict with overrides; keep the algorithm on the main thread so multiple runner threads can spawn.
# Specifying `n_runners` inside the strategy is equivalent to passing `n_runners` to the trainer.
trainer = agl.Trainer(
    algorithm=algorithm,
    strategy={
        "type": "shm",
        "n_runners": 8,
        "main_thread": "algorithm",
    },
)

# Pass an existing strategy instance – Trainer respects the strategy's own `n_runners`.
strategy = agl.SharedMemoryExecutionStrategy(main_thread="algorithm", n_runners=4)
trainer = agl.Trainer(algorithm=algorithm, strategy=strategy)
If you omit the strategy, the trainer defaults to ClientServerExecutionStrategy(n_runners=trainer.n_runners). You can still re-specify the client-server strategy through aliases or configuration to tweak ports and other settings:
trainer = agl.Trainer(
    algorithm=algorithm,
    n_runners=8,
    strategy={"type": "cs", "server_port": 9999},
)
Environment variables give you another layer of control. For example:
import os
os.environ["AGL_SERVER_PORT"] = "10000"
os.environ["AGL_CURRENT_ROLE"] = "algorithm"
os.environ["AGL_MANAGED_STORE"] = "0"
trainer = agl.Trainer(algorithm=algorithm, n_runners=8, strategy="cs")
The resulting ClientServerExecutionStrategy picks up the port, role, and managed-store flag from the environment.
Tip
The same configuration patterns apply to other trainer components: you can pass a string alias, a configuration dictionary, or a pre-built instance for the tracer, the adapter, and the other pieces. Passing a dict lets you tweak the init parameters of the defaults without naming the class explicitly.

The next sections walk through the two built-in strategies and how they affect placement and store access.
Client-server Architecture¶
The default ClientServerExecutionStrategy starts a LightningStoreServer alongside the algorithm and spawns runner processes that talk to it through LightningStoreClient. All runners share the HTTP endpoint, so the queue and spans stay consistent across processes or machines.
If you simply instantiate Trainer (as above), it will send the algorithm bundle and runner bundle to ClientServerExecutionStrategy, which will then:
- Launch \(N+1\) processes: \(N\) runner processes and 1 algorithm process (one of them could live in the main process).
- The algorithm process takes the store received from Trainer, wraps it in a LightningStoreServer, and starts serving it over HTTP.
- The runner processes discard the received store and create a new one: a LightningStoreClient that connects to the algorithm process. They then start executing the runner bundle.
- The strategy automatically escalates shutdown (cooperative stop → SIGINT → terminate() → kill()) so long-running runners do not linger.
You can override server placement or ports, and whether to automatically wrap the store, through constructor arguments or environment variables:
trainer = agl.Trainer(
    algorithm=algorithm,
    n_runners=1,
    strategy={
        "type": "cs",
        "server_host": "0.0.0.0",
        "server_port": 9999,
        "main_process": "runner",
    },
)
Set AGL_SERVER_HOST and AGL_SERVER_PORT if you prefer environment-based configuration. You can also use AGL_MANAGED_STORE if you do not want the execution strategy to wrap the store for you. An example is shown in Debugging with External Store.
Algorithms often require dedicated compute resources such as GPU accelerators, while runners often need a specific runtime environment because many agent frameworks have fragile dependencies. A role-based launch pattern lets you place the algorithm on a dedicated machine with more GPU memory while the runners live on other machines with a more flexible dependency stack. This is controlled via the AGL_CURRENT_ROLE="algorithm" or AGL_CURRENT_ROLE="runner" environment variables. When the roles run on different machines, also set AGL_SERVER_HOST and AGL_SERVER_PORT to the IP address and port of the algorithm machine. You might recognize that this convention is very similar to MASTER_ADDR and MASTER_PORT in PyTorch distributed training.
Launching Algorithm and Runner Roles on Separate Machines¶
When you want to stretch the algorithm onto a GPU-rich machine and keep rollout workers close to the data source (or on machines with a more permissive dependency stack), launch the same training script in different terminals with role-specific environment variables. The client–server strategy will route each process to the right side of the queue as long as they share the same AGL_SERVER_HOST/AGL_SERVER_PORT pair.
1. Pick an address and port for the store. Decide which machine will host the algorithm. Choose a TCP port that can be reached by the runner machines (for example, open it in your firewall configuration). In this example we will use 10.0.0.4:4747.
2. Start the algorithm process. On the machine that should run the algorithm, expose the store by binding to all network interfaces and mark the role as algorithm.
export AGL_SERVER_HOST=0.0.0.0
export AGL_SERVER_PORT=4747
export AGL_CURRENT_ROLE=algorithm
python train_calc_agent.py
Leaving AGL_MANAGED_STORE unset (or setting it to 1) lets the strategy create the LightningStoreServer for you. Otherwise, you can use the method in the previous section to create a store on your own.
3. Start rollout workers on remote machines. Every runner machine should point to the algorithm host and declare itself as the runner role. You can start multiple processes per machine or repeat the command on additional hosts.
export AGL_SERVER_HOST=10.0.0.4
export AGL_SERVER_PORT=4747
export AGL_CURRENT_ROLE=runner
python train_calc_agent.py --n-runners 4
The runner process automatically connects via LightningStoreClient. Adjust --n-runners to spawn the desired number of worker processes on that machine.
4. Scale out as needed. Repeat step 3 on as many machines as you need. When you are done, stop the algorithm process. Because the runners live on different machines, the strategy WILL NOT send them a cooperative stop signal, so you need to stop the runner processes yourself.
This role-based launch mirrors what Trainer.fit does inside a single machine while letting you spread work across a fleet. Because every process shares the same training script, you keep a single source of truth for dataset loading, adapters, and tracers, but you can tune compute resources independently for the algorithm and rollout workers.
Shared-memory Strategy¶
SharedMemoryExecutionStrategy keeps everything inside one process. By default the runner occupies the main thread while the algorithm runs on a background Python thread, and both sides share the store through the LightningStoreThreaded wrapper.
Use it when you want easier debugging with shared breakpoints, no serialization overhead, or minimal startup time for unit tests. It is not a good choice for algorithms that spawn their own processes for heavy model training, because LightningStoreThreaded does not work across processes; using it with multiprocessing algorithms leads to undefined behavior.
Sample configuration:
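import agentlightning as agl

algorithm = agl.Baseline()

# Mirrors the strategy-instance example from earlier on this page:
# keep the algorithm on the main thread so several runner threads can run alongside it.
strategy = agl.SharedMemoryExecutionStrategy(main_thread="algorithm", n_runners=4)
trainer = agl.Trainer(algorithm=algorithm, strategy=strategy)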
You can further customize the init parameters of SharedMemoryExecutionStrategy. With main_thread="runner", the runner occupies the main thread and n_runners must be 1. The strategy respects AGL_MANAGED_STORE; set it to 0 to opt out of the LightningStoreThreaded wrapper.
Parallelizing Algorithms¶
Runner parallelism scales rollout throughput, but the algorithm itself still runs as a single process inside the execution strategy. Many algorithms bring their own parallelization; orchestrating that is outside the scope of Agent-lightning's parallelization.
Agent-lightning strives to make algorithms’ own parallelization work well under our execution strategies. The biggest challenge turns out to come from the store. For example, VERL uses Ray and launches FSDP and vLLM components internally. ClientServerExecutionStrategy has to make sure that the server is not simultaneously serving in multiple processes or Ray workers, and that there is only one single authoritative source of truth for all subprocesses to connect to. Subprocesses connect to the store via a small LightningStoreClient bundled within LightningStoreServer.
Note
The birds' eye view illustrates how adapters, proxies, and stores interact when the algorithm spawns additional workers. Use that diagram as a checklist when introducing new distributed components.
Parallelizing LightningStore¶
By default, Agent-lightning persists rollouts and spans in an in-memory store. Trainer.fit spins it up automatically, or you can launch it yourself via the agl store command. InMemoryLightningStore keeps all state inside the current process, which makes local iteration fast but introduces two production constraints:
- Spans are evicted once the process crosses its memory cap, so long runs risk data loss unless the host has abundant RAM.
- Although the store is well optimized through asynchronous programming, it lives in a single process and remains bound by the GIL, so it cannot saturate multi-core machines.
General note for all client-server stores
If your algorithm and runners communicate over HTTP (the default for 99% of cases), make sure the open-file limit is large enough to avoid "Too many open files" errors. On Linux you can raise the limit for the current shell with a command like the following:
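# Raise the per-process open-file limit for the current shell session;
# 65535 is an illustrative value, pick one that suits your workload.
ulimit -n 65535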
For resilient runs, switch to a persistent backend such as MongoLightningStore, which writes data to MongoDB instead of local RAM. Agent-lightning relies on pymongo to interact with MongoDB, which can be installed via:
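# Install the MongoDB client library used by MongoLightningStore.
pip install pymongo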
To use the MongoDB store, you need to pass the MongoDB URI to the store constructor. The URI should be in the format of mongodb://<host>:<port>/<database>?replicaSet=<replicaSet>.
import agentlightning as agl
from agentlightning.store.mongo import MongoLightningStore

trainer = agl.Trainer(
    algorithm=algorithm,
    store=MongoLightningStore(mongo_uri="mongodb://localhost:27017/?replicaSet=rs0"),
)
Setting up MongoDB
MongoDB is a popular document-oriented database. Before running Agent-lightning with MongoLightningStore, make sure you already have a MongoDB instance running. Setup can be done conveniently with Docker Compose using compose.mongo.yml. Unless you are targeting serious production use, we recommend creating the data folders and setting them to 777 permissions to avoid permission issues.
Alternatively, you can install MongoDB manually by following the official documentation. If you do, note that the instance must have the replica set feature enabled, since Agent-lightning uses transactional operations internally. The simplest approach is to run the following script in the MongoDB shell to initialize the replica set:
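// Run in the MongoDB shell (mongosh) to initialize a minimal single-node replica set.
rs.initiate()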
To scale out further, launch the store server via agl store --backend mongo (see Debugging with External Store). The CLI accepts --n-workers, which starts the server under gunicorn with multiple worker processes so concurrent runners can push and pull at higher throughput. This option applies only to persistent backends; an in-memory store, on the other hand, cannot be sharded across workers because its state lives inside one process.
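For example, using only the flags mentioned above (other options keep their defaults):

# Serve a MongoDB-backed store with four gunicorn worker processes.
agl store --backend mongo --n-workers 4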
Note
The --n-workers here is the number of worker processes for the store server, NOT related to the number of rollout runners.
Increasing Throughput of LLM Proxy¶
Agent-lightning includes an optional LLMProxy that wraps LiteLLM to provide a unified OpenAI-compatible endpoint for your agents. When rollout throughput increases, the proxy can become a bottleneck. You can scale it out using the same pattern as the store server.
To increase proxy throughput, pass num_workers when constructing the proxy:
import agentlightning as agl
proxy = agl.LLMProxy(
    port=4000,
    launch_mode="mp",  # multiprocessing mode
    num_workers=4,     # four gunicorn workers handle concurrent requests
)
You can also configure the proxy through Trainer:
trainer = agl.Trainer(
    algorithm=algorithm,
    n_runners=8,  # rollout runners; unrelated to the number of LLM proxy workers
    llm_proxy={"port": 4000, "num_workers": 4},  # launch mode defaults to "mp"
)
When num_workers > 1, the launcher starts gunicorn with the specified number of worker processes. Each worker runs its own event loop, allowing the proxy to handle many concurrent LLM requests without being blocked by Python's GIL.
Tip
When using the mp launch mode, LLMProxy starts the server in a separate process. To make sure the proxy still accesses the same store as the main process, the store must be zero-copy compatible: either it is a natively zero-copy store such as MongoLightningStore, or it is wrapped in LightningStoreServer or LightningStoreClient.
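For instance, a sketch that combines the pieces shown earlier on this page (the MongoDB URI is illustrative):

import agentlightning as agl
from agentlightning.store.mongo import MongoLightningStore

# A store that survives process boundaries keeps the "mp" proxy workers and the
# main process in sync.
trainer = agl.Trainer(
    algorithm=algorithm,
    n_runners=8,
    store=MongoLightningStore(mongo_uri="mongodb://localhost:27017/?replicaSet=rs0"),
    llm_proxy={"port": 4000, "num_workers": 4},
)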
Shared Server Infrastructure
Both LightningStoreServer and LLMProxy rely on a common utility called PythonServerLauncherArgs. This dataclass captures the settings needed to launch a FastAPI application:
from agentlightning.utils import PythonServerLauncherArgs
args = PythonServerLauncherArgs(
    port=8000,
    host="0.0.0.0",
    n_workers=4,           # spawn 4 gunicorn workers
    launch_mode="thread",  # or "mp" for multiprocessing, "asyncio" for in-loop
)
Under the hood, PythonServerLauncher reads these arguments and chooses between uvicorn (single worker) and gunicorn (multiple workers) automatically.