Architecture

oBeaver is designed as a modular, layered system with a CLI frontend, a FastAPI server, and pluggable inference engines.

Architecture Diagram

(oBeaver architecture diagram)

Code Structure

```text
obeaver/
├── cli.py                  # Typer CLI: chat, serve, embed, serve-embed, convert, check
├── server.py               # FastAPI OpenAI-compatible server (chat + embeddings + dashboard)
├── chat.py                 # Interactive multi-turn terminal chat loop
├── engine_foundrylocal.py  # FoundryEngine — wraps foundry-local-sdk
├── engine_ort.py           # OrtEngine — wraps onnxruntime-genai
├── engine_embedding.py     # EmbeddingEngine — ONNX-only text embeddings
├── monitor.py              # System resource monitoring (CPU, GPU, NPU memory)
├── tools.py                # Tool-calling: parse_tool_call(), inject_tools_into_messages()
├── config.py               # ServerConfig dataclass
├── convert_vl.py           # Vision-Language model conversion via Olive
└── static/
    └── index.html          # Web dashboard UI with real-time memory gauges
```

Engine Selection Logic

| Condition | Engine | Model Argument |
|---|---|---|
| macOS / Windows (default) | Foundry Local | Catalog alias (e.g. `Phi-4-mini`) |
| `--engine ort`, or Linux (default) | ONNX Runtime GenAI | Local directory path |
| `embed` / `serve-embed` commands | EmbeddingEngine (ONNX) | Local ONNX model directory |
| VL model detected | ORT (auto-switched) | Local directory with `vision.onnx` |
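The selection rules above can be sketched as a small resolver. This is an illustrative helper (the function name `select_engine` and its signature are assumptions, not oBeaver's actual API):

```python
import sys
from pathlib import Path

def select_engine(model_arg, engine_flag=None, platform=None):
    """Pick an inference engine, mirroring the selection table (illustrative only).

    The embed / serve-embed commands bypass this and always use EmbeddingEngine.
    """
    platform = platform or sys.platform
    # Explicit --engine ort always wins; Linux also defaults to ORT.
    if engine_flag == "ort" or platform.startswith("linux"):
        return "ort"
    # A local directory containing vision.onnx signals a VL model -> auto-switch to ORT.
    if (Path(model_arg) / "vision.onnx").is_file():
        return "ort"
    # macOS / Windows default: Foundry Local with a catalog alias.
    return "foundry"
```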

Foundry Local Engine Flow

```text
FoundryLocalManager(bootstrap=True)
  → download_model(alias)
  → load_model(alias)
  → OpenAI client streams via manager.endpoint
```

The Foundry Local engine delegates to the Microsoft Foundry Local daemon, which handles hardware detection and model management. The SDK provides an OpenAI-compatible client that streams responses directly.

ORT Engine Flow

```text
og.Model(path) → og.Tokenizer(model) → og.GeneratorParams(model)
  → og.Generator(model, params) → generate_next_token() loop → stream tokens
```

The ORT engine loads models directly using ONNX Runtime GenAI. It tokenizes input, configures generation parameters, and runs the inference loop locally — fully offline with zero network dependency.

Tool-Calling Internals

| Component | Role |
|---|---|
| `ToolCall` | Dataclass: `id`, `name`, `arguments`; `.to_openai_dict()` |
| `ChatResponse` | Unified engine response: `content`, `tool_calls`, `finish_reason` |
| `inject_tools_into_messages()` | Appends JSON Schema tool definitions to the system prompt (ORT path) |
| `parse_tool_call()` | Extracts a `ToolCall` from raw model output; handles 6 format variants |
| `build_tool_result_message()` | Builds the `role: tool` message for the turn-2 follow-up |
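A minimal sketch of the parsing step: the real `parse_tool_call()` handles 6 format variants and carries an `id` field, while this simplified version covers only a bare JSON object and a ```json fenced block:

```python
import json
import re
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    # Simplified: the real dataclass also carries an id and .to_openai_dict().
    name: str
    arguments: dict = field(default_factory=dict)

def parse_tool_call(raw):
    """Try to extract a single tool call from raw model output (two variants only)."""
    # Variant 1: output wrapped in a ```json fenced block.
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw.strip()
    # Variant 2: the whole output is a bare JSON object.
    try:
        obj = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and "name" in obj:
        return ToolCall(name=obj["name"], arguments=obj.get("arguments", {}))
    return None
```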

Tool-Calling Data Flow

```text
User request + tools
  → inject_tools_into_messages() [ORT] / pass tools natively [Foundry]
  → Engine generates response
  → parse_tool_call() extracts ToolCall
  → Application executes the function
  → build_tool_result_message() creates turn-2 message
  → Engine generates final response with tool result context
```
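The turn-2 step can be sketched as follows. The message shape follows the OpenAI chat format; the exact fields oBeaver emits are an assumption here:

```python
def build_tool_result_message(tool_call_id, result):
    """Wrap a function's return value as a role: tool message (illustrative)."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result}

# Turn-2 follow-up: the history now contains the assistant's tool call plus
# the tool result; sending it back to the engine yields the final answer.
history = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "tool_calls": [{"id": "call_1", "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}]},
]
history.append(build_tool_result_message("call_1", '{"temp_c": 18}'))
```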

HTTP Request Lifecycle

```text
Client POST /v1/chat/completions
  → FastAPI route handler
  → Validate ChatCompletionRequest (Pydantic)
  → Resolve engine (Foundry / ORT)
  → If VL model: extract images, build VL messages
  → If tools: inject into messages or pass natively
  → If streaming:
      → StreamingResponse with SSE chunks
      → Each token yielded as data: {...}
  → If non-streaming:
      → Collect full response
      → Return JSON with choices[]
  → Log request timing & stats
```
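The streaming branch can be sketched as a generator of SSE `data:` lines. This is a minimal stand-in for what the real handler feeds into FastAPI's `StreamingResponse`; field names follow the OpenAI `chat.completion.chunk` format, and the exact payload oBeaver emits may differ:

```python
import json
import time

def sse_chunks(tokens, model="local-model"):
    """Yield OpenAI-style chat.completion.chunk events as SSE 'data:' lines."""
    created = int(time.time())
    for tok in tokens:
        chunk = {
            "object": "chat.completion.chunk",
            "created": created,
            "model": model,
            "choices": [{"index": 0, "delta": {"content": tok}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    # Terminal sentinel expected by OpenAI-compatible streaming clients.
    yield "data: [DONE]\n\n"
```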

Embedding Engine Flow

```text
EmbeddingEngine(model_path)
  → Load ONNX model via onnxruntime.InferenceSession
  → Tokenize input via transformers.AutoTokenizer
  → Run inference → extract embeddings
  → Normalize & return float[] vectors
```
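The final extract-and-normalize step can be sketched in plain Python. This assumes mean pooling over non-padding tokens (a common choice for sentence embeddings; the source does not specify oBeaver's pooling strategy), and the real engine operates on ONNX output tensors rather than Python lists:

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average per-token vectors, ignoring padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m:
            count += 1
            for i, v in enumerate(vec):
                sums[i] += v
    return [s / count for s in sums]

def l2_normalize(vec):
    """Scale to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```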