
vLLM Provider

Integrates vLLM's OpenAI-compatible Responses API for local/self-hosted LLMs.

Module ID

provider-vllm

Installation

providers:
  - module: provider-vllm
    source: git+https://github.com/microsoft/amplifier-module-provider-vllm@main
    config:
      base_url: "http://192.168.128.5:8000/v1"
      default_model: openai/gpt-oss-20b

Configuration

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| base_url | string | (required) | vLLM server URL |
| default_model | string | openai/gpt-oss-20b | Model name from vLLM |
| max_tokens | int | 4096 | Maximum output tokens |
| temperature | float | null | Sampling temperature; null = not sent (some models don't support it) |
| reasoning | string | null | Reasoning effort: minimal \| low \| medium \| high; null = not sent |
| reasoning_summary | string | detailed | Summary verbosity: auto \| concise \| detailed |
| truncation | string | auto | Automatic context management |
| enable_state | boolean | false | Enable stateful conversations (requires vLLM config) |
| timeout | float | 600.0 | API timeout (seconds) |
| priority | int | 100 | Provider priority for selection |
| debug | boolean | false | Enable standard debug events |
| raw_debug | boolean | false | Enable ultra-verbose raw API I/O logging |
| debug_truncate_length | int | 180 | Max string length in debug logs |
| max_retries | int | 5 | Retry attempts before failing |
| retry_jitter | float | 0.2 | Randomness in retry delays (0.0-1.0) |
| min_retry_delay | float | 1.0 | Minimum delay between retries (seconds) |
| max_retry_delay | float | 60.0 | Maximum delay between retries (seconds) |
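The retry settings above combine into a backoff schedule roughly like the sketch below. This is an illustration of how such parameters typically interact (exponential backoff, bounded, with jitter), not the provider's actual implementation, which isn't specified here.

```python
import random


def retry_delays(max_retries: int = 5,
                 min_retry_delay: float = 1.0,
                 max_retry_delay: float = 60.0,
                 retry_jitter: float = 0.2) -> list[float]:
    """Assumed backoff schedule: exponential growth from min_retry_delay,
    capped at max_retry_delay, with +/- retry_jitter randomness."""
    delays = []
    for attempt in range(max_retries):
        base = min(max_retry_delay, min_retry_delay * (2 ** attempt))
        jittered = base * (1 + random.uniform(-retry_jitter, retry_jitter))
        delays.append(min(max_retry_delay, jittered))
    return delays
```

With the defaults this yields roughly 1s, 2s, 4s, 8s, 16s (each varied by up to 20%) before the request fails.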

Features

  • Responses API only - Optimized for reasoning models
  • Full reasoning support - Automatic reasoning block separation
  • Tool calling - Complete tool integration
  • No API key required - Works with local vLLM servers

vLLM Server Setup

This provider requires a running vLLM server:

# Start vLLM server (basic)
vllm serve openai/gpt-oss-20b \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2

# For production (recommended - full config in /etc/vllm/model.env)
sudo systemctl start vllm

Server requirements:

  • vLLM version: ≥0.10.1 (tested with 0.10.1.1)
  • Responses API: automatically available (no special flags needed)
  • Model: any model compatible with vLLM (gpt-oss, Llama, Qwen, etc.)

Architecture

This provider uses the OpenAI SDK with a custom base_url pointing to your vLLM server. Since vLLM implements the OpenAI-compatible Responses API, the integration is clean and direct.

Response flow:

ChatRequest → VLLMProvider.complete() → AsyncOpenAI.responses.create() →
→ vLLM Server → Response → Content blocks (Thinking + Text + ToolCall) → ChatResponse
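The block-separation step at the end of that flow can be sketched as follows. This is illustrative code, not the provider's actual implementation; the item shapes follow the OpenAI Responses API, where "reasoning" items carry summary parts, "message" items carry output_text parts, and "function_call" items carry a tool invocation. The Thinking/Text/ToolCall names are stand-ins for the provider's content-block types.

```python
from dataclasses import dataclass


@dataclass
class Thinking:
    text: str


@dataclass
class Text:
    text: str


@dataclass
class ToolCall:
    name: str
    arguments: str


def to_content_blocks(output_items: list[dict]) -> list:
    """Map Responses API output items to typed content blocks."""
    blocks: list = []
    for item in output_items:
        if item["type"] == "reasoning":
            # Reasoning summaries become Thinking blocks.
            for part in item.get("summary", []):
                blocks.append(Thinking(part["text"]))
        elif item["type"] == "message":
            # Visible assistant text becomes Text blocks.
            for part in item.get("content", []):
                if part["type"] == "output_text":
                    blocks.append(Text(part["text"]))
        elif item["type"] == "function_call":
            blocks.append(ToolCall(item["name"], item["arguments"]))
    return blocks
```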

Token Accounting

For GPT-OSS models: Token accounting is automatic but requires vocab files.

How it works:

  • First use: automatically downloads vocab files to ~/.amplifier/cache/vocab/
  • Subsequent uses: uses cached files
  • No manual setup needed if you have internet access

What's computed:

  • Input tokens: accurate count using Harmony's tokenization (matches model training format)
  • Output tokens: approximate count based on visible output text
  • Limitation: output count doesn't include hidden reasoning channels (REST API limitation)
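As an illustration of why the output count is only approximate: any estimate derived from visible text alone must use a heuristic such as the common ~4-characters-per-token rule of thumb for English. The function below is a hypothetical stand-in, not the provider's actual approximation.

```python
def approx_output_tokens(visible_text: str) -> int:
    """Rough token estimate from visible output text only.

    Hidden reasoning channels are not present in the REST response, so
    any text-based estimate undercounts the model's true output tokens.
    """
    # ~4 characters per token is a common rule of thumb for English text.
    return max(1, len(visible_text) // 4)
```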

If auto-download fails (offline/air-gapped):

# Manual setup for offline environments
mkdir -p ~/.amplifier/cache/vocab

# Download vocab files (on a machine with internet)
curl -sS -o ~/.amplifier/cache/vocab/o200k_base.tiktoken \
  https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken

curl -sS -o ~/.amplifier/cache/vocab/cl100k_base.tiktoken \
  https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken

# Transfer ~/.amplifier/cache/vocab/ directory to offline machine
# Then set environment variable:
export TIKTOKEN_ENCODINGS_BASE=~/.amplifier/cache/vocab

Troubleshooting

Connection refused

Problem: Cannot connect to vLLM server

Solution:

# Check vLLM service status
sudo systemctl status vllm

# Verify server is listening
curl http://192.168.128.5:8000/health

# Check logs
sudo journalctl -u vllm -n 50

Tool calling not working

Problem: Model responds with text instead of calling tools

Verification:

  • ✅ vLLM version ≥0.10.1
  • ✅ Using Responses API (not Chat Completions)
  • ✅ Tools defined in request

Note: Tool calling works via the Responses API without special vLLM flags. If it's not working, verify that the model itself supports tool calling.
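If you're constructing requests manually, note that the Responses API expects tool definitions in a flat schema (name and parameters at the top level, unlike Chat Completions, which nests them under a "function" key). The get_weather tool below is a hypothetical example, not part of this provider:

```python
# Hypothetical tool definition in OpenAI Responses API format.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
```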

No reasoning blocks

Problem: Responses don't include reasoning/thinking

Check:

  • Is the reasoning parameter set in config? (minimal|low|medium|high)
  • Is the model a reasoning model? (gpt-oss supports reasoning)
  • Check raw debug logs to see whether reasoning is present in the API response

Token usage shows zeros

Check logs for:

  • [TOKEN_ACCOUNTING] Downloading Harmony vocab files to ~/.amplifier/cache/vocab/... (first use)
  • [TOKEN_ACCOUNTING] Loaded Harmony GPT-OSS encoder (success)
  • [TOKEN_ACCOUNTING] Injected usage: input=X, output=Y (active)

Repository

GitHub