Benchmark & Performance

oBeaver provides built-in performance metrics and real-time resource monitoring, so you can measure inference speed, memory usage, and system health from either the CLI or the web dashboard.

CLI Timing Stats

Add the --timings flag to any obeaver run command to display per-response performance metrics directly in the terminal:

# Foundry Local engine with timing stats
obeaver run phi-4-mini --timings

# ORT engine with timing stats
obeaver run --engine ort ./models/phi3-mini-int4 --timings

# VL (Vision-Language) model with timing stats
obeaver run ./models/Qwen3-VL-2B-Instruct_VL_ONNX_INT4_CPU --timings

After each response, the CLI prints a summary line with three fields:

TTFT: Time to First Token (ms)
tok/s: Tokens per second
tokens: Total tokens generated

Example output after asking a question:
TTFT 1398ms · 5.7 tok/s · 26 tokens
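
If you want to collect these numbers in a script, the summary line can be parsed with a regular expression. The sketch below assumes the output always follows the shape of the example line above; the names `Timing` and `parse_timing_line` are illustrative and are not part of oBeaver.

```python
import re
from typing import NamedTuple, Optional


class Timing(NamedTuple):
    ttft_ms: float    # time to first token, in milliseconds
    tok_per_s: float  # generation speed, in tokens per second
    tokens: int       # total tokens generated


# Pattern modelled on the example line above (assumption: the real
# output keeps this field order and the "·" separator).
_TIMING_RE = re.compile(
    r"TTFT\s+(?P<ttft>\d+(?:\.\d+)?)\s*ms\s*·\s*"
    r"(?P<tps>\d+(?:\.\d+)?)\s*tok/s\s*·\s*"
    r"(?P<tokens>\d+)\s*tokens"
)


def parse_timing_line(line: str) -> Optional[Timing]:
    """Return the metrics found in a summary line, or None if none match."""
    match = _TIMING_RE.search(line)
    if match is None:
        return None
    return Timing(float(match["ttft"]), float(match["tps"]), int(match["tokens"]))


print(parse_timing_line("TTFT 1398ms · 5.7 tok/s · 26 tokens"))
```

Returning None for non-matching lines lets you feed the whole terminal output through the function and keep only the summary lines.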

Web Dashboard Monitoring

You can select an engine and launch the dashboard with the following commands:

# Foundry Local engine (default on macOS/Windows) — lists cached Foundry models
obeaver dashboard

# ORT engine — scans ./models for local ONNX GenAI models
obeaver dashboard -e ort

Open http://127.0.0.1:1573/ in your browser, select your model, and start testing. You can use the dashboard to evaluate resource usage across different hardware configurations.

[Figure: oBeaver Dashboard home page.]

Foundry Local Dashboard

By default, the dashboard uses the Foundry Local engine and loads models from your configured Foundry Local model directory.

[Figure: Foundry Local model selector, listing cached models from your default path.]

You can chat with the model and observe the local inference benchmark metrics in real time:

[Figure: Chatting with a Foundry Local model.]

[Figure: Chat response with benchmark stats (TTFT, tok/s, token count).]

ORT Dashboard

When using obeaver dashboard -e ort, you can select any local ONNX model from your default model directory:

[Figure: ORT model selector, listing local ONNX GenAI models.]

For standard (non-VL) models, usage is the same as with Foundry Local. When you load a VL (Vision-Language) model, however, the chat interface is hidden and replaced with a dedicated VL interface. To test it, provide the URL of an image on the web, for example:

https://images-sports.now.com/sport/news/6/576/39556158576/39556718725_600x400.jpg

[Figure: VL interface, where you provide an image URL and a prompt for multimodal inference.]

[Figure: VL model response describing the image content.]

System Resource Gauges

The dashboard tracks resource utilisation across all available compute hardware with real-time gauges that refresh every 3 seconds:

Resource	Metrics Shown	Description
CPU Memory	Total, Used, Available, % gauge	System RAM utilisation with processor identification (e.g. ARMv8 Qualcomm, Apple M-series, Intel/AMD)
GPU Memory	Total, Used, Available, Active/Idle	Detected GPU device memory (NVIDIA, AMD, Intel, Qualcomm Adreno)
NPU Memory	Total, Used, Available, Active/Idle	Neural Processing Unit memory, if an NPU is available (e.g. Intel Meteor Lake, Qualcomm Hexagon)
Process Memory	PID, Resident, Virtual	Memory consumed by the oBeaver server process itself

[Figure: CPU, GPU, and NPU memory cards with real-time gauges: CPU Memory (87%, 31.6 GB total), GPU Memory (Qualcomm Adreno X1-85), NPU Memory (Snapdragon X Elite Hexagon), and process info (PID, Resident 118 MB, Virtual 87 MB).]

Inference Parameters

The dashboard sidebar exposes inference parameters that you can adjust in real time, along with three presets:

Parameter	Range	Description
Temperature	0 – 2	Controls the randomness of the output. Lower values are more deterministic
Top P	0 – 1	Nucleus sampling probability threshold
Top K	1 – 100	Limits sampling to the K most likely tokens
Max Tokens	1 – 4096	Maximum number of tokens to generate
Repetition Penalty	1.0 – 2.0	Penalises repeated tokens

Presets: Creative (high temperature, high top-k) · Balanced (default) · Precise (low temperature, greedy)
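
When scripting experiments, it can help to check a set of parameter values against the ranges in the table above before using them. The following is a minimal sketch: the snake_case key names and the `out_of_range` helper are assumptions made for illustration, not oBeaver's own identifiers, and the sample values are not the values of any built-in preset.

```python
# (low, high) bounds copied from the parameter table above.
# The key names are hypothetical; oBeaver may name these differently.
PARAM_RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "top_k": (1, 100),
    "max_tokens": (1, 4096),
    "repetition_penalty": (1.0, 2.0),
}


def out_of_range(params: dict) -> dict:
    """Return {name: (value, low, high)} for each parameter outside its range."""
    problems = {}
    for name, value in params.items():
        if name not in PARAM_RANGES:
            raise KeyError(f"unknown parameter: {name}")
        low, high = PARAM_RANGES[name]
        if not low <= value <= high:
            problems[name] = (value, low, high)
    return problems


# Hypothetical values for illustration only.
candidate = {"temperature": 0.2, "top_p": 0.9, "top_k": 150, "max_tokens": 512}
print(out_of_range(candidate))
```

Reporting every violation at once, rather than stopping at the first, makes it easier to correct a whole configuration in one pass.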

Chat Performance Metrics

The built-in chat interface shows real-time performance stats for every model response:

[Figure: Chat response to "Explain quantum computing in one sentence", showing TTFT 2532 ms, 12.7 tok/s, 26 tokens.]

Each message displays three key metrics:

TTFT (Time to First Token): how quickly the model begins generating after receiving the prompt. Lower is better; it is strongly influenced by model size and available hardware acceleration.
tok/s (Tokens per Second): the sustained generation speed. Higher is better; CPU-only inference typically ranges from 3 to 15 tok/s for small models.
Token Count: total tokens generated in the response.
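
To make the three metrics concrete, here is one way they could be derived from any token stream. This is a sketch under stated assumptions, not oBeaver's implementation: it assumes tok/s is measured over the decode phase only (tokens after the first, divided by the time between the first and last token), and oBeaver may define it differently. `measure_stream` and its `clock` parameter are hypothetical names.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Consume a token stream and return (ttft_ms, tok_per_s, token_count)."""
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        last = clock()          # arrival time of this token
        if first is None:
            first = last        # arrival time of the first token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft_ms = (first - start) * 1000.0
    decode_s = last - first
    # Assumption: speed is computed over tokens generated after the first one.
    tok_per_s = (count - 1) / decode_s if decode_s > 0 else 0.0
    return ttft_ms, tok_per_s, count


# Scripted clock readings (seconds) make the example deterministic.
ticks = iter([0.0, 0.5, 1.0, 1.5])
print(measure_stream(["a", "b", "c"], clock=lambda: next(ticks)))
```

Passing the clock as a parameter keeps the function testable; in real use you would omit it and pass the generator returned by a streaming client.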

[Figure: Chat section with conversation history sidebar and tuneable inference parameters panel.]

Full Dashboard Overview

The complete dashboard brings together system monitoring, a chat interface, inference parameter controls, conversation history, server logs, and export functionality in a single view:

[Figure: Full dashboard overview: model selector, memory gauges, process info, API reference, chat, server logs, and export buttons.]

Dashboard Features Summary

Feature	Description
Model Selector	Switch between cached models at runtime; NPU-accelerated models are marked with ⚡
System Info Bar	Model name, engine type, platform, Python version, live health status
Memory Gauges	CPU / GPU / NPU utilisation with auto-refresh every 3 seconds
Inference Parameters	Temperature, Top P, Top K, Max Tokens, Repetition Penalty with presets
Chat Interface	Streaming responses with TTFT, tok/s, and token count per message
Conversation History	Saved conversations sidebar with system prompt configuration
Server Logs	Live request log with method, path, status code, and timing
Export	Export conversations as JSON or Markdown

Performance Tips

Choose smaller models for faster TTFT: run foundry model list to see available models; small models such as Phi-4-mini can start generating in under a second on modern hardware.
Use Foundry Local on macOS/Windows: it automatically selects the best available accelerator (NPU > GPU > CPU).
Prefer INT4 quantised ONNX models: they generally offer a good balance of quality and speed for on-device inference.
Close other memory-heavy applications: inference slows down when available RAM runs low, so monitor the CPU Memory gauge in the dashboard.
Monitor the dashboard: run obeaver dashboard (Foundry Local) or obeaver dashboard -e ort (ONNX Runtime GenAI) and keep http://127.0.0.1:1573/ open while benchmarking to track resource utilisation in real time.
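
When comparing models or engines, a single response can be noisy, so it may be worth repeating the same prompt several times and summarising the readings. The sketch below assumes you have already collected (TTFT in ms, tok/s) pairs by hand or by script; `summarize_runs` is a hypothetical helper, and the sample readings are made-up values, not measured results.

```python
import statistics
from typing import Iterable, Tuple


def summarize_runs(runs: Iterable[Tuple[float, float]]) -> dict:
    """Summarise (ttft_ms, tok_per_s) pairs collected from repeated runs."""
    runs = list(runs)
    if not runs:
        raise ValueError("need at least one run")
    ttfts = [ttft for ttft, _ in runs]
    speeds = [tps for _, tps in runs]
    return {
        "runs": len(runs),
        # Medians are less sensitive than means to a single slow outlier.
        "ttft_ms_median": statistics.median(ttfts),
        "tok_per_s_median": statistics.median(speeds),
        "tok_per_s_min": min(speeds),
    }


# Hypothetical readings for illustration only; not measured results.
sample = [(1400.0, 5.5), (1200.0, 6.0), (1300.0, 5.8)]
print(summarize_runs(sample))
```

Keeping the minimum speed alongside the median gives a quick view of the worst case, which is often what matters when other applications compete for memory.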