# vLLM Provider
Integrates vLLM's OpenAI-compatible Responses API for local/self-hosted LLMs.
## Module ID

`provider-vllm`
## Installation

```yaml
providers:
  - module: provider-vllm
    source: git+https://github.com/microsoft/amplifier-module-provider-vllm@main
    config:
      base_url: "http://192.168.128.5:8000/v1"
      default_model: openai/gpt-oss-20b
```
## Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| `base_url` | string | (required) | vLLM server URL |
| `default_model` | string | - | Model name served by vLLM |
| `max_tokens` | int | `4096` | Maximum output tokens |
| `temperature` | float | `0.7` | Sampling temperature |
| `reasoning` | string | `high` | Reasoning effort: `minimal`, `low`, `medium`, `high` |
| `reasoning_summary` | string | `detailed` | Summary verbosity: `auto`, `concise`, `detailed` |
| `timeout` | float | `300.0` | API timeout (seconds) |
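For reference, a profile entry that sets every option explicitly might look like the sketch below; `base_url` and `default_model` are placeholders for your own server and model, and the remaining values simply restate the defaults from the table above.

```yaml
providers:
  - module: provider-vllm
    config:
      base_url: "http://localhost:8000/v1"  # placeholder: your vLLM server URL
      default_model: "openai/gpt-oss-20b"   # placeholder: a model served by vLLM
      max_tokens: 4096                      # defaults from the table above
      temperature: 0.7
      reasoning: "high"
      reasoning_summary: "detailed"
      timeout: 300.0
```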
## Features

- **Responses API only** - Optimized for reasoning models (see the request sketch after this list)
- **Full reasoning support** - Automatic reasoning block separation
- **Tool calling** - Complete tool integration
- **No API key required** - Works with local vLLM servers
- **OpenAI-compatible** - Uses the OpenAI SDK under the hood
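Because the provider talks to vLLM's OpenAI-compatible Responses API and local servers need no API key, you can exercise a running server directly with `curl`. A minimal sketch, assuming a server at `http://localhost:8000` serving `openai/gpt-oss-20b` (see the setup section below); the exact request and response fields depend on your vLLM version:

```bash
# POST to the Responses API endpoint; no Authorization header is needed
# for a local vLLM server.
curl http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "input": "Reply with a one-sentence greeting."
      }'
```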
## vLLM Server Setup

```bash
# Start vLLM server
vllm serve openai/gpt-oss-20b \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2
```

Requirements:

- vLLM version ≥ 0.10.1
- Any model compatible with vLLM (gpt-oss, Llama, Qwen, etc.)
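Once the server is running, a quick way to confirm it is reachable (and to see the exact model name to use for `default_model`) is the standard OpenAI-compatible model-listing endpoint; the host and port below assume the command above:

```bash
# Lists the models the server is currently serving
curl http://localhost:8000/v1/models
```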
## Usage

```yaml
# Configure in profile
providers:
  - module: provider-vllm
    config:
      base_url: "http://localhost:8000/v1"
      default_model: "openai/gpt-oss-20b"
      reasoning: "high"
```
## Debugging
## Repository

→ [GitHub](https://github.com/microsoft/amplifier-module-provider-vllm)