vLLM Provider

Integrates vLLM's OpenAI-compatible Responses API for local/self-hosted LLMs.

Module ID

provider-vllm

Installation

providers:
  - module: provider-vllm
    source: git+https://github.com/microsoft/amplifier-module-provider-vllm@main
    config:
      base_url: "http://192.168.128.5:8000/v1"
      default_model: openai/gpt-oss-20b

Configuration

Option             Type    Default     Description
base_url           string  (required)  vLLM server URL
default_model      string  -           Model name served by vLLM
max_tokens         int     4096        Maximum output tokens
temperature        float   0.7         Sampling temperature
reasoning          string  high        Reasoning effort: minimal, low, medium, high
reasoning_summary  string  detailed    Summary verbosity: auto, concise, detailed
timeout            float   300.0       API timeout (seconds)
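
Each option maps directly onto a Responses API request. Below is a minimal sketch of the equivalent call with the openai Python package; the provider makes this call for you internally, and the prompt, address, and values are only illustrative:

# Equivalent Responses API call, for illustration only.
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.128.5:8000/v1",  # base_url
    api_key="not-needed",                     # vLLM ignores the key; the SDK just needs a value
    timeout=300.0,                            # timeout
)

response = client.responses.create(
    model="openai/gpt-oss-20b",               # default_model
    input="Summarize the vLLM project in one sentence.",
    max_output_tokens=4096,                   # max_tokens
    temperature=0.7,                          # temperature
    reasoning={"effort": "high", "summary": "detailed"},  # reasoning, reasoning_summary
)

print(response.output_text)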

Features

  • Responses API only - Optimized for reasoning models
  • Full reasoning support - Automatic reasoning block separation (see the sketch after this list)
  • Tool calling - Complete tool integration
  • No API key required - Works with local vLLM servers
  • OpenAI-compatible - Uses OpenAI SDK under the hood
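
The separation works on the standard Responses API output shape, where reasoning and the final answer arrive as distinct output items. A minimal sketch of reading them directly, assuming `response` was produced by a call like the one in the Configuration section (whether reasoning items appear depends on the served model):

# Reasoning summaries and the final message come back as separate output items.
for item in response.output:
    if item.type == "reasoning":
        for part in item.summary:          # verbosity controlled by reasoning_summary
            print("[reasoning]", part.text)
    elif item.type == "message":
        for part in item.content:
            if part.type == "output_text":
                print("[answer]", part.text)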

vLLM Server Setup

# Start vLLM server
vllm serve openai/gpt-oss-20b \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2

Requirements:

  • vLLM version: ≥0.10.1
  • Any model compatible with vLLM (gpt-oss, Llama, Qwen, etc.)
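
Once the server is running, you can confirm it is reachable through the standard OpenAI-compatible /v1/models endpoint. A quick check with the openai Python package, assuming you are on the same host so localhost reaches the server started above:

# List the models the vLLM server is serving.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
for model in client.models.list():
    print(model.id)  # e.g. openai/gpt-oss-20b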

Usage

# Configure in profile
providers:
  - module: provider-vllm
    config:
      base_url: "http://localhost:8000/v1"
      default_model: "openai/gpt-oss-20b"
      reasoning: "high"
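
Tool calling uses the Responses API function-tool format; the provider registers the agent's tools for you, so the sketch below only illustrates the shape involved. get_time is a hypothetical tool, and whether function calls are emitted depends on the served model and vLLM build:

# Standalone tool-calling sketch against the local server (illustrative only).
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "name": "get_time",  # hypothetical example tool
    "description": "Return the current UTC time as an ISO-8601 string.",
    "parameters": {"type": "object", "properties": {}, "required": []},
}]

response = client.responses.create(
    model="openai/gpt-oss-20b",
    input="What time is it right now?",
    tools=tools,
)

# A tool-capable model answers with a function_call item instead of plain text.
for item in response.output:
    if item.type == "function_call":
        print(item.name, json.loads(item.arguments or "{}"))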

Debugging

config:
  debug: true        # Summary logging
  raw_debug: true    # Complete API I/O

Repository

GitHub: https://github.com/microsoft/amplifier-module-provider-vllm