
Serving LLMs under Agent-lightning

Agent-lightning focuses on data, learning signals, and control flow — not on running model inference. This deep dive explains how to serve a model alongside Agent-lightning so runners can call it reliably, how the LLM Proxy fits into the loop, and why token IDs matter if you care about correctness in training and evaluation.

General background on LLM serving

Serving a model is essential if you want to train it, especially when you use the model’s own generations as training data. We’ll briefly review the general background to ensure all readers are aligned.

Modern LLM servers solve a difficult scheduling problem: keeping GPUs fully utilized while handling prompts of different lengths, streaming tokens as they arrive, and fitting large KV caches into limited memory. Techniques like continuous batching and paged attention address these challenges. Continuous batching interleaves decoding across requests to reuse weights efficiently; with careful memory planning, it achieves major throughput gains without increasing latency. PagedAttention reduces KV-cache fragmentation so batching remains effective as sequences grow. See vLLM's PagedAttention paper and industry analyses for details. Balancing inference correctness and efficiency is difficult: a recent blog post from Thinking Machines Lab highlights how inference nondeterminism ultimately affects training.

Beyond scheduling, servers expose an HTTP API, often OpenAI-compatible (/v1/chat/completions and /v1/responses), which is itself a complex stack. In addition to text prompts and chat messages, the API defines many parameters and response fields such as tool calls, structured output, and multimodal support. Substantial effort goes into implementing all of these parameters in each framework. Popular engines like vLLM and SGLang ship with OpenAI-compatible frontends so you can reuse existing client code; Ollama and llama.cpp provide similar capabilities. However, because models differ internally, each framework interprets and implements the API slightly differently. Even with identical requests, the tokens passed to the model can vary substantially across frameworks.

What Agent-lightning expects from a served LLM

Most of the issues above either have workarounds or remain open research problems. Keep them in mind, but the key question is: what does Agent-lightning expect from a served LLM? The answer includes at least two things:

  • An OpenAI-compatible Chat Completions or Responses endpoint the agent can call during rollouts.
  • Optional training and debugging signals: logprobs, usage, and ideally token IDs. (OpenAI’s public API exposes usage and logprobs, but not token IDs — more on why IDs matter later.)

Launching a serving framework

For many algorithms, you’ll start an engine (e.g., vLLM or SGLang) before rollouts, then shut it down afterward to free GPU memory. Most frameworks provide a one-line “serve” command to launch the OpenAI-compatible server. You can use those to bring up /v1/chat/completions with your checkpoint, ensuring streaming and any required tool-calling features are enabled. A working example is shown in Unsloth SFT.
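
As a rough sketch of this lifecycle, the snippet below launches a vLLM OpenAI-compatible server before rollouts and shuts it down afterward. The checkpoint path, port, and tool-calling flags are placeholders to adapt to your own setup.

```python
import subprocess
import time

import requests

# Bring up an OpenAI-compatible server before rollouts. The checkpoint path,
# port, and tool-calling flags below are placeholders for your own setup.
server = subprocess.Popen([
    "vllm", "serve", "/path/to/checkpoint",
    "--port", "8000",
    "--enable-auto-tool-choice", "--tool-call-parser", "hermes",
])

# Poll /v1/models until the server is ready to accept requests.
for _ in range(300):
    try:
        if requests.get("http://localhost:8000/v1/models", timeout=1).ok:
            break
    except requests.RequestException:
        pass
    time.sleep(1)

# ... run rollouts against http://localhost:8000/v1 ...

# Shut the engine down afterward to free GPU memory for the training step.
server.terminate()
server.wait()
```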

Weight updates — which occur after each training step — are trickier. Some frameworks like vLLM support hot-updating model weights, but it’s usually simpler and more reliable to restart the engine to load new weights. For medium-sized tasks (hundreds of rollouts taking 10+ minutes), the restart overhead (under 30 seconds) is typically negligible.

If you’re using Agent-lightning’s VERL integration, the algorithm can manage the server automatically. The VERL framework intelligently allocates compute resources and wraps vLLM/SGLang behind an AsyncLLMServer abstraction. You can directly use this as the LLM endpoint for agents. Since VERL can spawn multiple vLLM replicas, using LLMProxy to manage them adds an additional safety layer.

A full sequence diagram of how VERL interacts with the LLM server and proxy is available here.

LLM Proxy

The LLM Proxy is a utility class in Agent-lightning, built on LiteLLM, that sits between runners and your backend engine(s) or server(s). In Agent-lightning it acts as a single URL registered as a Resource in the store, offering three key benefits:

  1. Unified endpoint & hot-swaps. You can redirect traffic between OpenAI, Anthropic, local vLLM/SGLang, or canary checkpoints without modifying agent code — simply repoint the proxy.
  2. First-class tracing. The proxy emits OpenTelemetry spans for every call and sends them to the LightningStore. It includes rollout and attempt identifiers in request headers so spans are correctly attributed. Sequence numbers are allocated monotonically via the store to prevent clock-skew issues and allow reliable reconstruction of execution trees.
  3. Token IDs. The proxy can return prompt and response token IDs along with the model output. More details are available in the next section.

Operationally, running the proxy alongside the algorithm works best: the algorithm registers the backend (e.g., the vLLM URL) via LLMProxy.update_model_list, publishes the proxy URL as a resource via LightningStore.add_resources, and runners simply use that URL during rollouts. This mirrors many production client–server setups.
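
A minimal sketch of that wiring is shown below. Only update_model_list and add_resources come from the description above; the import path, constructor arguments, the LLM resource type, and the start call are assumptions, and some of these calls may be asynchronous in the actual API.

```python
from agentlightning import LLM, LLMProxy, LightningStore  # import path assumed

store = LightningStore()
proxy = LLMProxy(port=12306, store=store)  # hypothetical constructor arguments
proxy.start()                              # hypothetical; brings up the LiteLLM-backed proxy

# Register the current backend engine with the proxy. The model_list format
# here follows LiteLLM's router convention, since the proxy is built on LiteLLM.
proxy.update_model_list([
    {
        "model_name": "policy",
        "litellm_params": {
            "model": "hosted_vllm/current-checkpoint",
            "api_base": "http://localhost:8000/v1",
        },
    }
])

# Publish the proxy URL as the LLM resource that runners use during rollouts.
store.add_resources({"main_llm": LLM(endpoint="http://localhost:12306", model="policy")})
```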

Token IDs and why they matter

This section explains how Agent-lightning handles and uses token IDs — a subtle but important detail for training stability and accuracy.

Most agents interact with LLMs via Chat Completion APIs, exchanging chat messages. There are two main approaches to collecting training data from such agents.

Note

Tokenization here refers to converting chat messages into token IDs; detokenization is the reverse process of converting token IDs back into chat messages. The tokenizer is normally published along with the pretrained model and includes a vocabulary, special tokens, and a chat template for handling chat messages.
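
As an illustration of this round trip (using an arbitrary open tokenizer; any model that ships a chat template behaves similarly):

```python
from transformers import AutoTokenizer

# The model name is just an example of a tokenizer that ships a chat template.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [{"role": "user", "content": "What is 2 + 2?"}]

# Tokenization: the chat template renders the messages into text,
# which is then mapped to token IDs.
token_ids = tok.apply_chat_template(messages, add_generation_prompt=True)

# Detokenization: token IDs back to text; the template's special tokens
# (role markers, etc.) become visible in the decoded string.
print(tok.decode(token_ids))
```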

1. Retokenizing chat messages. In this approach, you store chat messages as text and let training algorithms retokenize them later, as done in many SFT workflows (e.g., HuggingFace SFT). In practice, we've found this method unstable and less accurate. The chart below compares training results: the retokenization approach is run twice, and all settings are identical apart from how the data is tokenized.

This instability has three causes. First, the chat template used by different frameworks can differ slightly. For example, a single LLaMA model can work with multiple chat templates (multiple in vLLM, one in HuggingFace), so the template used for detokenization may differ from the one used for tokenization (effectively an implementation bug).

Second, a word might be generated as two tokens (e.g., H + AVING) but later retokenized as HAV + ING. The text looks identical, but the token IDs differ from what the model originally produced.

Third, a generated tool-call string such as <tool_call>{ "name": ... }</tool_call> is parsed by the tool-call parser into the structured object required by the Chat Completions API. When that object is later rendered back into <tool_call>{ "name": ... }</tool_call> and retokenized, the parsing and re-rendering can change whitespace and formatting. In some situations, JSON errors may even be auto-corrected by the tool-call parser, masking the model's true generation errors and preventing them from being trained away.
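
The second cause is easy to reproduce. The sketch below simulates a model that emitted a word in a non-canonical split and shows that retokenizing the decoded text need not recover the original IDs; the exact splits depend on the tokenizer's vocabulary.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # example tokenizer

# Simulate a non-canonical generation: the engine sampled "HAV" and "ING"
# in two separate steps, so the recorded IDs are the two encodings concatenated.
generated_ids = (
    tok.encode("HAV", add_special_tokens=False)
    + tok.encode("ING", add_special_tokens=False)
)

# Retokenization pipeline: decode to text, then tokenize that text again.
text = tok.decode(generated_ids)                       # "HAVING"
retokenized_ids = tok.encode(text, add_special_tokens=False)

# Same text, but the ID sequences can differ, so losses computed on
# retokenized_ids may not align with what the model actually produced.
print(generated_ids, retokenized_ids)
```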


2. Saving token IDs directly. The alternative is to save the token IDs generated by the model, as done in RL setups like Tinker. This requires a training pipeline that treats tokens as first-class entities, meaning agents must communicate with the inference engine at the token level.

However, most agents — especially those built with frameworks like LangChain — rely on OpenAI-compatible APIs and can’t tokenize or detokenize themselves. As mentioned earlier, implementing this layer manually is complex and error-prone. Some frameworks implement custom solutions (e.g., VERL Agent Loop, Tinker Renderer), while others leave it to users (e.g., SkyRL Search-R1).


A better solution is to use an OpenAI-compatible API that returns token IDs directly. This lets agents continue using familiar APIs while capturing token IDs via tracing for training. The limitation, of course, is that the serving framework must actually support this capability.

When Agent-lightning was first released, we implemented an instrumented vLLM server that monkey-patched vLLM’s OpenAI server to return token IDs. Since then, the Agent-lightning and vLLM teams have collaborated to add this feature directly to vLLM core. Starting with vLLM v0.10.2, the OpenAI-compatible API includes a return_token_ids parameter, allowing token IDs to be requested alongside chat messages. SGLang has tracked similar feature requests, though its OpenAI-compatible layer doesn’t yet support them.

In short, when using vLLM v0.10.2 or newer, LLMProxy automatically adds return_token_ids to each request so that the engine includes token IDs in its response. For older vLLM versions, you still need the instrumented server (launched via the agl vllm CLI command).
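
For example, with a plain OpenAI client pointed directly at a vLLM v0.10.2+ server, you can request token IDs yourself. The response field names below follow vLLM's return_token_ids feature and may change between versions; check the vLLM documentation for your release.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/path/to/checkpoint",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"return_token_ids": True},  # vLLM-specific extension
)

# Token-level view of the exchange, suitable for training signals.
# These fields are vLLM extensions, surfaced by the OpenAI client as extra attributes.
print(resp.prompt_token_ids)        # token IDs of the rendered prompt
print(resp.choices[0].token_ids)    # token IDs the model actually generated
```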

Finally, saving only token IDs in spans has its own limitation: if you train one model on spans collected from another model with a different tokenizer, the IDs are incompatible. In practice, though, spans in Agent-lightning always store both chat messages and token IDs (in fact, the full request and response objects), so you can fall back to retokenization when necessary.