Fine-tune with Unsloth SFT

Prerequisites

Please make sure you have read Write the First Algorithm. Although that recipe is based on a simple prompt tuning algorithm, it introduces the core concepts of Agent-lightning and you should be familiar with them before proceeding.

This recipe builds on Write the First Algorithm. Instead of iterating on a prompt, we will fine-tune a large language model with Unsloth's SFT Trainer and keep the whole loop inside Agent-lightning. The new pieces you will meet are the LLM proxy, the trace-to-triplet adapter, a vLLM inference endpoint, and an agent implemented with the OpenAI Agents SDK. The full sample code is available in the examples/unsloth folder.

Warning

You need a GPU that can host the Unsloth base model and run vLLM. The sample defaults to unsloth/Qwen3-4B-Instruct-2507, which requires at least 16GB of GPU memory under 4-bit quantization.

The Data and Serving Loop

To fine-tune a large language model with Supervised Fine-Tuning (SFT), we typically need a dataset of input/output samples. For example, the TRL SFT Trainer expects a dataset with samples like the following:

{"messages": [{"role": "user", "content": "What color is the sky?"},
              {"role": "assistant", "content": "It is blue."}]}

With supervised fine-tuning, the LLM learns to generate an "assistant" response that matches the completion in the dataset as closely as possible.

Typically, the dataset used in SFT is a curated set of samples, either hand-written by humans or generated by a more powerful model (a practice known as data distillation). In this recipe, however, we use a different setup that relies on samples generated by the model itself: we use the reward emitted by the agent to select the top-performing samples.

Overall, the flow of the algorithm is an iteration of the following steps:

  1. Serve the current checkpoint (with vLLM).
  2. Publish the vLLM endpoint through the LLM proxy and let runners roll out some tasks with the current model.
  3. Collect the traces from the rollouts and transform the highest-rewarded ones into a dataset that is acceptable for Unsloth SFT Trainer.
  4. Launch Unsloth to fine-tune on the dataset and save a new checkpoint.

You will find the full source code of this iteration in sft_one_iter in sft_algorithm.py. We will elaborate on each part below.
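To make the structure concrete, here is a condensed, hypothetical sketch of one iteration. The helper names (serve_model, run_rollouts, build_sft_dataset, train_with_unsloth) are placeholders for the four steps above, not functions that exist in sft_algorithm.py.

# Hypothetical sketch of the outer loop; helper names are placeholders,
# not the actual functions in sft_algorithm.py.
async def sft_loop(store, llm_proxy, train_dataset, max_iterations: int) -> None:
    model_path = "models/version_0"
    for iteration in range(max_iterations):
        # 1. Serve the current checkpoint with vLLM.
        vllm_process, server_address = serve_model(model_path)
        # 2. Publish the endpoint through the LLM proxy and roll out tasks.
        rollouts = await run_rollouts(store, llm_proxy, server_address, train_dataset)
        vllm_process.terminate()
        # 3. Turn the highest-rewarded traces into an SFT dataset.
        sft_dataset = await build_sft_dataset(store, rollouts)
        # 4. Fine-tune with Unsloth and save the next checkpoint.
        next_model_path = f"models/version_{iteration + 1}"
        train_with_unsloth(model_path, sft_dataset, next_model_path)
        model_path = next_model_path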

Serving the Model with vLLM and Proxy

Most modern agents do not use the model directly; instead, they interact with it through an API such as the OpenAI chat completions API. Therefore, we need a vLLM-based inference server launched before the rollouts. The serving code looks like the following; see the vllm_server function in sft_algorithm.py for a more robust version.

import subprocess
import time

import httpx
from openai import OpenAI

vllm_process = subprocess.Popen([
    "vllm", "serve", model_path, "--port", str(port),
    "--enable-auto-tool-choice", "--tool-call-parser", "hermes"
])

# Wait for the server to be ready
url = f"http://localhost:{port}/health"
start = time.time()
client = httpx.Client()

while True:
    try:
        if client.get(url).status_code == 200:
            break
    except httpx.HTTPError:
        pass  # The server is not accepting connections yet.
    if time.time() - start > 300:
        raise RuntimeError("vLLM server did not become healthy in time")
    time.sleep(1.0)

server_address = f"http://localhost:{port}/v1"

# Try using the vLLM server
openai = OpenAI(base_url=server_address)
...

In this recipe, we do not expose the server address directly to the agent runners, because we want to install a "middleware" to collect the prompts and responses of all the requests. In general, it's up to you to decide whether to hide the vLLM server behind a proxy or not.

The "middleware" here is LLMProxy, which is an independent LiteLLM server that forwards the requests to the vLLM server. It also exposes an OpenAI-compatible API that the runners can target without caring about where the model lives. The benefits of using the proxy are:

  1. Traces: The proxy automatically logs the prompts and responses of all the requests into the store.
  2. Token IDs: The proxy augments the requests so that the vLLM server can return the prompt and response token IDs (see more details in Serving LLM).

The LLMProxy accepts a list of model configurations, in the same syntax as LiteLLM's model_list. Prefix the model names with hosted_vllm/ to activate LiteLLM's vLLM integration.

import agentlightning as agl

llm_proxy = agl.LLMProxy(port=port, store=store)
model_list = [
    {
        "model_name": "Qwen3-4B-Instruct",
        "litellm_params": {"model": f"hosted_vllm/{model_path}", "api_base": server_address},
    }
]
llm_proxy.update_model_list(model_list)
# If the proxy is not running, it will start automatically.
llm_proxy.restart()
# Add the proxy as a resource to the store so that the runners can access it via URL.
resources_update = await store.add_resources({"main_llm": llm_proxy.as_resource()})

Spawn Rollouts and Collect Spans

Once the proxy is registered as a resource, the algorithm schedules work for the rollout runners. Each problem from a training dataset becomes a rollout with the proxy baked into its resources:

rollouts: list[Rollout] = []
for sample in train_dataset:
    rollouts.append(
        await store.enqueue_rollout(
            input=sample,
            mode="train",
            resources_id=resources_update.resources_id,
        )
    )

The resources_id field ties every rollout to the main_llm proxy resource we just registered. On the other side, the runners poll the store (LitAgentRunner.iter()) and execute the agent for each rollout. On the algorithm side, we wait for completions with a non-blocking polling loop:

completed_rollouts: list[Rollout] = []
while True:
    completed_rollouts = await store.wait_for_rollouts(
        rollout_ids=[r.rollout_id for r in rollouts],
        timeout=0.0,
    )
    if len(completed_rollouts) == len(rollouts):
        break
    await asyncio.sleep(5.0)

Note

The timeout=0.0 is needed here because this example uses a LightningStoreClient, and wait_for_rollouts establishes an HTTP connection to that store. Currently, only non-blocking wait requests are supported, which avoids holding the store connection open.

Once the rollouts complete, we terminate the vLLM server to free up GPU memory.

vllm_process.terminate()
vllm_process.wait(timeout=10.0)

Adapt the Spans to a Hugging Face Dataset

LlmProxyTraceToTriplet converts the proxy’s spans (which might be dozens to hundreds per rollout) into Triplet objects that contain prompt/response token IDs plus an optional reward. The adapter may return multiple triplets per rollout (one per chat-completion call). To bias training toward successful reasoning chains, the algorithm walks the triplets in reverse order, carries the most recent reward backward, and turns each prompt/response pair into a Hugging Face dataset row:

all_triplets = []
data_adapter = agl.LlmProxyTraceToTriplet()

for rollout in completed_rollouts:
    spans = await store.query_spans(rollout.rollout_id, "latest")
    triplets = data_adapter.adapt(spans)

    recent_reward = None
    for triplet in reversed(triplets):
        if triplet.reward is not None:
            recent_reward = triplet.reward
        if recent_reward is None:
            continue

        input_ids = triplet.prompt["token_ids"] + triplet.response["token_ids"]
        # We don't train on prompt tokens, so they are masked out by setting to -100.
        labels = [-100] * len(triplet.prompt["token_ids"]) + triplet.response["token_ids"]
        # This matches the dataset format required by the Unsloth SFT trainer.
        all_triplets.append(
            {
                "input_ids": input_ids,
                "attention_mask": [1] * len(input_ids),
                "labels": labels,
                "reward": recent_reward,
            }
        )

Note

You might notice that the dataset format used here differs from the format described in the SFT Trainer documentation. According to the documentation, dataset samples should be provided as plain text strings or message objects.

In fact, this example leverages some undocumented behavior of the SFT Trainer implementation. When the dataset already includes an "input_ids" column, the Trainer automatically marks it as is_processed and skips the internal tokenization step.

Since we already have spans with token IDs generated by the LLMProxy, providing them directly avoids unnecessary re-tokenization and related complications. This approach both saves processing time and improves consistency between training and inference.

After aggregating every rollout, we shuffle, sort by reward, and keep the top fraction (e.g., 50%) before shuffling again. The resulting list feeds directly into datasets.Dataset.from_list, which produces the dataset format Unsloth’s SFT trainer expects.

import random

from datasets import Dataset as HuggingFaceDataset

random.shuffle(all_triplets)
all_triplets.sort(key=lambda x: x["reward"], reverse=True)
sliced_triplets = all_triplets[: max(1, int(len(all_triplets) * triplet_fraction))]
# Shuffle the sliced triplets again
random.shuffle(sliced_triplets)

sft_dataset = HuggingFaceDataset.from_list(sliced_triplets)

Launch Unsloth Training

The heavy lifting happens in trl.SFTTrainer (see unsloth_helper.py for how it's used). We launch it in a fresh process created with multiprocessing.get_context("spawn") so that CUDA memory is reliably reclaimed when training ends. Launching it in the same process also works for the first iteration, but we found that GPU memory is not freed properly for the subsequent vLLM serving.

import multiprocessing

context = multiprocessing.get_context("spawn")
unsloth_process = context.Process(
    target=unsloth_training,
    args=(model_path, sft_dataset, next_model_path),
    daemon=True,
)
unsloth_process.start()
unsloth_process.join(timeout=600.0)

Inside the unsloth_training subprocess, Unsloth loads the previous checkpoint in 4-bit, applies LoRA adapters, and forwards the Hugging Face dataset to trl.SFTTrainer with the configuration defined in SFTConfig (batch size, accumulation steps, learning rate, etc.). The merged 16-bit weights are saved under models/version_<iteration + 1> so the next iteration can immediately serve them with vLLM.

from unsloth import FastLanguageModel
# TRL is patched by unsloth.
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_path,
    load_in_4bit=True,  # 4 bit quantization to reduce memory
)

# Config the model to use LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    ...
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=sft_dataset,
    ...
)

# This is the heaviest step.
trainer_stats = trainer.train()

# Save in 16-bit for vLLM inference later
model.save_pretrained_merged(next_model_path, tokenizer, save_method="merged_16bit")
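The trainer arguments above are elided. As an illustration only, the hyperparameters passed through SFTConfig might look like the following; the concrete values here are assumptions, not the settings used in unsloth_helper.py.

# Illustrative values only; the actual settings live in unsloth_helper.py.
sft_config = SFTConfig(
    output_dir="outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=1,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=sft_dataset,
    args=sft_config,
)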

Math Agent: OpenAI Agents SDK with MCP

We build an agent with the OpenAI Agents SDK to wire a calculator MCP tool and an OpenAI-compatible chat completion model together. The agent aims to solve a math problem and returns a reward indicating whether the answer is correct or not. The runner injects the LLM resource supplied by the algorithm side:

import os
from typing import TypedDict

import agentlightning as agl
from agents import Agent, ModelSettings, OpenAIChatCompletionsModel, Runner as OpenAIRunner
from agents.mcp import MCPServerStdio
from openai import AsyncOpenAI

class GsmProblem(TypedDict):
    input: str
    target: float

def compute_reward(result: str, target: float) -> float:
    ...

@agl.rollout
async def math_agent(task: GsmProblem, llm: agl.LLM) -> float:
    async with MCPServerStdio(
        name="Calculator via uvx",
        params={"command": "uvx", "args": ["mcp-server-calculator"]},
    ) as server:
        agent = Agent(
            name="Assistant",
            instructions=(
                "Use the calculator tool for every question. "
                "Return only the numeric answer wrapped like ### <answer> ###."
            ),
            mcp_servers=[server],
            model=OpenAIChatCompletionsModel(
                model=llm.model,
                openai_client=AsyncOpenAI(
                    base_url=llm.endpoint,
                    api_key=llm.api_key or "dummy",
                ),
            ),
            model_settings=ModelSettings(
                temperature=llm.sampling_parameters.get("temperature", 0.0),
            ),
        )
        result = await OpenAIRunner.run(agent, task["input"])
    return compute_reward(result.final_output, task["target"])
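The compute_reward function is elided above. A minimal sketch of one possible implementation, assuming the answer arrives wrapped as ### <answer> ### per the agent instructions (the regex and tolerance are illustrative assumptions), might look like this:

import re

def compute_reward(result: str, target: float) -> float:
    # Illustrative only: extract the number wrapped as "### <answer> ###".
    match = re.search(r"###\s*(-?[\d.,]+)\s*###", result)
    if match is None:
        return 0.0
    try:
        answer = float(match.group(1).replace(",", ""))
    except ValueError:
        return 0.0
    # Reward 1.0 for a (near-)exact match, 0.0 otherwise.
    return 1.0 if abs(answer - target) < 1e-6 else 0.0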

Tip

You can test the agent with a dry run:

import asyncio

llm = agl.LLM(
    endpoint=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-4.1-mini",
)
asyncio.run(math_agent({"input": "What is 1 + 1?", "target": 2.0}, llm))

Run this Recipe

The full runnable script for this recipe resides in examples/unsloth folder.

Before running this example, install unsloth, vllm, and the other libraries used in the examples (the project uses CUDA tooling, TRL, rich, datasets, etc.). We tested with unsloth==2025.10.1; unsloth==2025.10.2 and 2025.10.3 do not work due to an issue we have been investigating with the Unsloth team.

We recommend downloading the base model before running the example, so that both the first and subsequent iterations can load from local checkpoints.

hf download unsloth/Qwen3-4B-Instruct-2507 --local-dir models/version_0

The repository already contains examples/unsloth/data_gsmhard.jsonl (which is a very small subset of the GSM-hard math dataset for demonstration purposes).

Run Manually

Similar to the Write the First Algorithm recipe, you can open three terminals and start each component in parallel.

agl store --port 4747
python examples/unsloth/sft_rollout_runners.py
python examples/unsloth/sft_algorithm.py

Here, sft_rollout_runners.py is a simple Python spawner that launches 4 runners in parallel. All runners connect to the same store server running in another terminal.

import asyncio
import multiprocessing

import agentlightning as agl

# math_agent is the @agl.rollout agent defined in the Math Agent section above.

def run_rollout(store: agl.LightningStore, worker_id: int) -> None:
    # Since the server side has already used LiteLLM proxy to collect traces,
    # a simple OtelTracer to collect the rewards is enough.
    tracer = agl.OtelTracer()

    runner = agl.LitAgentRunner(tracer=tracer)

    with runner.run_context(agent=math_agent, store=store, worker_id=worker_id):
        asyncio.run(runner.iter())


def spawn_runners(store: agl.LightningStore, n_runners: int) -> None:
    runners = [
        multiprocessing.Process(target=run_rollout, args=(store, worker_id))
        for worker_id in range(n_runners)
    ]
    for runner in runners:
        runner.start()

    for runner in runners:
        runner.join()


store = agl.LightningStoreClient("http://localhost:4747")
spawn_runners(store=store, n_runners=4)

Tip

Try swapping OtelTracer in the runners for another tracer such as AgentOpsTracer, or using a different adapter on the algorithm side, such as TracerTraceToTriplet, to see what happens.
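For instance, the tracer swap inside run_rollout is a one-line change (this assumes AgentOpsTracer can be constructed without arguments; check its signature first):

# Assumption: AgentOpsTracer takes no required constructor arguments.
tracer = agl.AgentOpsTracer()
runner = agl.LitAgentRunner(tracer=tracer)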

Run Everything with Trainer

We also show how to wrap everything into a single script using Trainer. sft_allinone.py wires the same components together, replacing the manual management of runners above.

class UnslothSupervisedFinetuning(agl.Algorithm):

    async def run(
        self,
        train_dataset: Optional[Dataset[GsmProblem]] = None,
        val_dataset: Optional[Dataset[GsmProblem]] = None,
    ):
        # Use the store, llm_proxy, and adapter from the trainer
        store = self.get_store()
        llm_proxy = self.get_llm_proxy()
        data_adapter = self.get_adapter()

        for iteration in range(self.max_iterations):
            ...  # Same logic as sft_algorithm.py

algo = UnslothSupervisedFinetuning(
    max_iterations=2,
    vllm_port=12316,
    train_triplet_fraction=0.5,
    initial_model_path="models/version_0",
)

# The LLM proxy can be created before Trainer
trainer = Trainer(
    n_runners=4,
    algorithm=algo,
    llm_proxy=LLMProxy(port=12358),
)

trainer.fit(math_agent, load_math_dataset())

You might wonder where the adapter is initialized in this code. TracerTraceToTriplet is the default adapter in Trainer, so we don't need to create one manually.

Now you can run the example with:

python examples/unsloth/sft_allinone.py

It starts an InMemoryLightningStore for you, launches four worker processes, iterates the SFT loop, and prints the final checkpoint path when done. Adjust max_iterations, train_triplet_fraction, n_runners, or the proxy port to match your hardware or training goals. If you already run an external store or proxy, you can also pass those objects into Trainer instead of relying on the Trainer-managed defaults.
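For example, connecting to an externally managed store might look like this. The store keyword argument is an assumption about the Trainer constructor, so verify it against the Trainer API before relying on it:

# Assumed constructor arguments; verify against the Trainer API.
external_store = agl.LightningStoreClient("http://localhost:4747")

trainer = Trainer(
    n_runners=4,
    algorithm=algo,
    store=external_store,            # assumption: external store instead of the in-memory default
    llm_proxy=LLMProxy(port=12358),  # externally configured proxy, as shown above
)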

Info

In the future, we might graduate this example into a more powerful SFT algorithm bundled into the Algorithm Zoo. For now, UnslothSupervisedFinetuning is for demonstration purposes only.