Architecture

oBeaver is designed as a modular, layered system with a CLI frontend, a FastAPI server, and pluggable inference engines.

Architecture Diagram

(oBeaver architecture diagram)

Code Structure

```text
obeaver/
├── cli.py                  # Typer CLI: chat, serve, embed, serve-embed, convert, check
├── server.py               # FastAPI OpenAI-compatible server (chat + embeddings + dashboard)
├── chat.py                 # Interactive multi-turn terminal chat loop
├── engine_foundrylocal.py  # FoundryEngine — wraps foundry-local-sdk
├── engine_ort.py           # OrtEngine — wraps onnxruntime-genai
├── engine_embedding.py     # EmbeddingEngine — ONNX-only text embeddings
├── monitor.py              # System resource monitoring (CPU, GPU, NPU memory)
├── tools.py                # Tool-calling: parse_tool_call(), inject_tools_into_messages()
├── config.py               # ServerConfig dataclass
├── convert_vl.py           # Vision-Language model conversion via Olive
└── static/
    └── index.html          # Web dashboard UI with real-time memory gauges
```

Engine Selection Logic

| Condition | Engine | Model Argument |
|---|---|---|
| macOS / Windows (default) | Foundry Local | Catalog alias (e.g. `Phi-4-mini`) |
| `--engine ort`, or Linux (default) | ONNX Runtime GenAI | Local directory path |
| `embed` / `serve-embed` commands | EmbeddingEngine (ONNX) | Local ONNX model directory |
| VL model detected | ORT (auto-switched) | Local directory with `vision.onnx` |
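The selection rules above can be sketched as a small resolver. This is an illustrative helper (the function name `select_engine` and its signature are assumptions, not oBeaver's actual API):

```python
import sys
from pathlib import Path

def select_engine(model_arg, engine_flag=None, platform=None):
    """Pick an inference engine, mirroring the selection table (illustrative only).

    The embed / serve-embed commands bypass this and always use EmbeddingEngine.
    """
    platform = platform or sys.platform
    # Explicit --engine ort always wins; Linux also defaults to ORT.
    if engine_flag == "ort" or platform.startswith("linux"):
        return "ort"
    # A local directory containing vision.onnx signals a VL model -> auto-switch to ORT.
    if (Path(model_arg) / "vision.onnx").is_file():
        return "ort"
    # macOS / Windows default: Foundry Local with a catalog alias.
    return "foundry"
```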

Foundry Local Engine Flow

```text
FoundryLocalManager(bootstrap=True)
  → download_model(alias)
  → load_model(alias)
  → OpenAI client streams via manager.endpoint
```

The Foundry Local engine delegates to the Microsoft Foundry Local daemon, which handles hardware detection and model management. The SDK provides an OpenAI-compatible client that streams responses directly.

ORT Engine Flow

```text
og.Model(path) → og.Tokenizer(model) → og.GeneratorParams(model)
  → og.Generator(model, params) → generate_next_token() loop → stream tokens
```

The ORT engine loads models directly using ONNX Runtime GenAI. It tokenizes input, configures generation parameters, and runs the inference loop locally — fully offline with zero network dependency.

Tool-Calling Internals

| Component | Role |
|---|---|
| `ToolCall` | Dataclass: `id`, `name`, `arguments`; `.to_openai_dict()` |
| `ChatResponse` | Unified engine response: `content`, `tool_calls`, `finish_reason` |
| `inject_tools_into_messages()` | Appends JSON Schema tool definitions to the system prompt (ORT path) |
| `parse_tool_call()` | Extracts a `ToolCall` from raw model output; handles 6 format variants |
| `build_tool_result_message()` | Builds the `role: tool` message for the turn-2 follow-up |
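A minimal sketch of the parsing step: the real `parse_tool_call()` handles 6 format variants and carries an `id` field, while this simplified version covers only a bare JSON object and a ```json fenced block:

```python
import json
import re
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    # Simplified: the real dataclass also carries an id and .to_openai_dict().
    name: str
    arguments: dict = field(default_factory=dict)

def parse_tool_call(raw):
    """Try to extract a single tool call from raw model output (two variants only)."""
    # Variant 1: output wrapped in a ```json fenced block.
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw.strip()
    # Variant 2: the whole output is a bare JSON object.
    try:
        obj = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and "name" in obj:
        return ToolCall(name=obj["name"], arguments=obj.get("arguments", {}))
    return None
```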

Tool-Calling Data Flow

```text
User request + tools
  → inject_tools_into_messages() [ORT] / pass tools natively [Foundry]
  → Engine generates response
  → parse_tool_call() extracts ToolCall
  → Application executes the function
  → build_tool_result_message() creates turn-2 message
  → Engine generates final response with tool result context
```
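The turn-2 step can be sketched as follows. The message shape follows the OpenAI chat format; the exact fields oBeaver emits are an assumption here:

```python
def build_tool_result_message(tool_call_id, result):
    """Wrap a function's return value as a role: tool message (illustrative)."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result}

# Turn-2 follow-up: the history now contains the assistant's tool call plus
# the tool result; sending it back to the engine yields the final answer.
history = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "tool_calls": [{"id": "call_1", "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}]},
]
history.append(build_tool_result_message("call_1", '{"temp_c": 18}'))
```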

HTTP Request Lifecycle

```text
Client POST /v1/chat/completions
  → FastAPI route handler
  → Validate ChatCompletionRequest (Pydantic)
  → Resolve engine (Foundry / ORT)
  → If VL model: extract images, build VL messages
  → If tools: inject into messages or pass natively
  → If streaming:
      → StreamingResponse with SSE chunks
      → Each token yielded as data: {...}
  → If non-streaming:
      → Collect full response
      → Return JSON with choices[]
  → Log request timing & stats
```
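The streaming branch can be sketched as a generator of SSE `data:` lines. This is a minimal stand-in for what the real handler feeds into FastAPI's `StreamingResponse`; field names follow the OpenAI `chat.completion.chunk` format, and the exact payload oBeaver emits may differ:

```python
import json
import time

def sse_chunks(tokens, model="local-model"):
    """Yield OpenAI-style chat.completion.chunk events as SSE 'data:' lines."""
    created = int(time.time())
    for tok in tokens:
        chunk = {
            "object": "chat.completion.chunk",
            "created": created,
            "model": model,
            "choices": [{"index": 0, "delta": {"content": tok}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    # Terminal sentinel expected by OpenAI-compatible streaming clients.
    yield "data: [DONE]\n\n"
```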

Embedding Engine Flow

```text
EmbeddingEngine(model_path)
  → Load ONNX model via onnxruntime.InferenceSession
  → Tokenize input via transformers.AutoTokenizer
  → Run inference → extract embeddings
  → Normalize & return float[] vectors
```
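The final extract-and-normalize step can be sketched in plain Python. This assumes mean pooling over non-padding tokens (a common choice for sentence embeddings; the source does not specify oBeaver's pooling strategy), and the real engine operates on ONNX output tensors rather than Python lists:

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average per-token vectors, ignoring padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m:
            count += 1
            for i, v in enumerate(vec):
                sums[i] += v
    return [s / count for s in sums]

def l2_normalize(vec):
    """Scale to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```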