Benchmark & Performance

oBeaver provides built-in performance metrics and real-time resource monitoring, so you can measure inference speed, memory usage, and system health from either the CLI or the web dashboard.

CLI Timing Stats

Add the --timings flag to any obeaver run command to display per-response performance metrics directly in the terminal:

```bash
# Foundry Local engine with timing stats
obeaver run phi-4-mini --timings

# ORT engine with timing stats
obeaver run --engine ort ./models/phi3-mini-int4 --timings

# VL (Vision-Language) model with timing stats
obeaver run ./models/Qwen3-VL-2B-Instruct_VL_ONNX_INT4_CPU --timings
```

After each response, the CLI prints a summary line with three metrics:

| Metric | Description |
| --- | --- |
| TTFT | Time to First Token (ms) |
| tok/s | Tokens per second |
| tokens | Total tokens generated |

Example output:

```text
TTFT 1398ms · 5.7 tok/s · 26 tokens
```
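These metrics can be derived from three wall-clock timestamps. The sketch below is illustrative (the function name is ours, and measuring tok/s over the generation phase only, from first token to last, is one common convention rather than a documented oBeaver detail):

```python
def timing_stats(request_at: float, first_token_at: float,
                 done_at: float, tokens: int) -> str:
    """Compute TTFT and tok/s from wall-clock timestamps (in seconds)."""
    # TTFT: delay between sending the request and receiving the first token.
    ttft_ms = (first_token_at - request_at) * 1000
    # Throughput is commonly measured over the generation phase only,
    # i.e. from the first token to the last.
    gen_seconds = done_at - first_token_at
    tok_per_s = tokens / gen_seconds if gen_seconds > 0 else float("inf")
    return f"TTFT {ttft_ms:.0f}ms · {tok_per_s:.1f} tok/s · {tokens} tokens"

# Reproducing the example summary line above:
print(timing_stats(0.0, 1.398, 1.398 + 26 / 5.7, 26))
```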

Web Dashboard Monitoring

You can select an engine and launch the dashboard with the following commands:

```bash
# Foundry Local engine (default on macOS/Windows) — lists cached Foundry models
obeaver dashboard

# ORT engine — scans ./models for local ONNX GenAI models
obeaver dashboard -e ort
```

Open http://127.0.0.1:1573/ in your browser, select your model, and start testing. You can use the dashboard to evaluate resource usage across different hardware configurations.

oBeaver Dashboard home page.

Foundry Local Dashboard

By default, the dashboard uses the Foundry Local engine and loads models from your configured Foundry Local model directory.

Foundry Local model selector — lists cached models from your default path.

You can chat with the model and observe the local inference benchmark metrics in real time:

Chatting with a Foundry Local model.

Chat response with benchmark stats (TTFT, tok/s, token count).

ORT Dashboard

When using obeaver dashboard -e ort, you can select any local ONNX model from your default model directory:

ORT model selector — lists local ONNX GenAI models.

For standard (non-VL) models, the usage is the same as Foundry Local. When a VL (Vision-Language) model is loaded, however, the chat interface is hidden and replaced with a dedicated VL interface. You can provide a web image URL for testing, for example:

```text
https://images-sports.now.com/sport/news/6/576/39556158576/39556718725_600x400.jpg
```
VL interface — provide an image URL and prompt for multimodal inference.

VL model response describing the image content.

System Resource Gauges

The dashboard tracks resource utilisation across all available compute hardware with real-time gauges that refresh every 3 seconds:

| Resource | Metrics Shown | Description |
| --- | --- | --- |
| CPU Memory | Total, Used, Available, % gauge | System RAM utilisation with processor identification (e.g. ARMv8 Qualcomm, Apple M-series, Intel/AMD) |
| GPU Memory | Total, Used, Available, Active/Idle | Detected GPU device memory (NVIDIA, AMD, Intel, Qualcomm Adreno) |
| NPU Memory | Total, Used, Available, Active/Idle | Neural Processing Unit, if available (Intel Meteor Lake, Qualcomm Hexagon) |
| Process Memory | PID, Resident, Virtual | Memory consumed by the oBeaver server process itself |
CPU Memory (87% — 31.6 GB total), GPU Memory (Qualcomm Adreno X1-85), NPU Memory (Snapdragon X Elite Hexagon), and process info (PID, Resident 118 MB, Virtual 87 MB).
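The percentage gauges are simple arithmetic over raw readings. A minimal sketch of how such a gauge can be computed and rendered (the `MemoryReading` type, the sample numbers, and the text-bar rendering are illustrative, not oBeaver's internal API):

```python
from dataclasses import dataclass

@dataclass
class MemoryReading:
    """One gauge sample; all values in bytes. Illustrative only."""
    total: int
    used: int

    @property
    def available(self) -> int:
        return self.total - self.used

    @property
    def percent(self) -> float:
        # Guard against devices that report zero capacity (e.g. no NPU present).
        return 100.0 * self.used / self.total if self.total else 0.0

def render_gauge(name: str, reading: MemoryReading, width: int = 20) -> str:
    """Render a text gauge similar to the dashboard's memory cards."""
    filled = round(width * reading.percent / 100)
    bar = "#" * filled + "-" * (width - filled)
    gib = 1024 ** 3
    return (f"{name}: [{bar}] {reading.percent:.0f}% "
            f"({reading.used / gib:.1f}/{reading.total / gib:.1f} GiB)")

cpu = MemoryReading(total=34_359_738_368, used=29_892_972_380)  # ~32 GiB, ~87% used
print(render_gauge("CPU Memory", cpu))
```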

Inference Parameters

The dashboard sidebar provides real-time tuneable inference parameters with three presets:

| Parameter | Range | Description |
| --- | --- | --- |
| Temperature | 0 – 2 | Controls randomness of output; lower = more deterministic |
| Top P | 0 – 1 | Nucleus sampling probability threshold |
| Top K | 1 – 100 | Top-K sampling — limits token choices |
| Max Tokens | 1 – 4096 | Maximum number of tokens to generate |
| Repetition Penalty | 1.0 – 2.0 | Penalises repeated tokens |
Presets: Creative (high temperature, high top-k) · Balanced (default) · Precise (low temperature, greedy)
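These parameters follow the standard sampling pipeline used by most inference engines. A minimal sketch of how temperature, Top K, and Top P interact when choosing the next token (illustrative of the general technique, not oBeaver's internal code):

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 1.0,
                      top_k: int = 50, top_p: float = 1.0) -> str:
    """Pick the next token from raw logits using temperature, top-k, and top-p."""
    # Temperature 0 degenerates to greedy (argmax) decoding, as in the Precise preset.
    if temperature == 0:
        return max(logits, key=logits.get)
    # Temperature scaling: <1 sharpens the distribution, >1 flattens it.
    scaled = {tok: l / temperature for tok, l in logits.items()}
    # Top-K: keep only the k highest-scoring tokens.
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors (shifted by the max for numerical stability).
    m = max(l for _, l in ranked)
    exps = [(tok, math.exp(l - m)) for tok, l in ranked]
    z = sum(e for _, e in exps)
    probs = [(tok, e / z) for tok, e in exps]
    # Top-P (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights)[0]

logits = {"beaver": 4.0, "otter": 2.5, "rock": 0.1}
print(sample_next_token(logits, temperature=0))  # greedy decoding picks "beaver"
```

Lowering temperature or tightening top_k/top_p shrinks the candidate pool, which is why the Precise preset is more deterministic than Creative.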

Chat Performance Metrics

The built-in chat interface shows real-time performance stats for every model response:

Chat response to "Explain quantum computing in one sentence" — showing TTFT 2532 ms, 12.7 tok/s, 26 tokens.

Each message displays three key metrics: TTFT, tok/s, and total token count.

Chat section with conversation history sidebar and tuneable inference parameters panel.

Full Dashboard Overview

The complete dashboard brings together system monitoring, a chat interface, inference parameter controls, conversation history, server logs, and export functionality in a single view:

Full dashboard overview: model selector, memory gauges, process info, API reference, chat, server logs, and export buttons.

Dashboard Features Summary

| Feature | Description |
| --- | --- |
| Model Selector | Switch between cached models at runtime; NPU-accelerated models are marked with ⚡ |
| System Info Bar | Model name, engine type, platform, Python version, live health status |
| Memory Gauges | CPU / GPU / NPU utilisation with auto-refresh every 3 seconds |
| Inference Parameters | Temperature, Top P, Top K, Max Tokens, Repetition Penalty with presets |
| Chat Interface | Streaming responses with TTFT, tok/s, and token count per message |
| Conversation History | Saved conversations sidebar with system prompt configuration |
| Server Logs | Live request log with method, path, status code, and timing |
| Export | Export conversations as JSON or Markdown |
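An exported JSON conversation can also be post-processed into other formats yourself. A sketch assuming a simple list of role/content messages (the actual export schema is not documented here, so treat this structure as hypothetical):

```python
import json

def conversation_to_markdown(raw: str) -> str:
    """Convert an exported conversation (assumed role/content schema) to Markdown."""
    messages = json.loads(raw)
    lines = []
    for msg in messages:
        lines.append(f"**{msg['role'].title()}:**")  # e.g. "**User:**"
        lines.append("")
        lines.append(msg["content"])
        lines.append("")
    return "\n".join(lines)

# Example with a hand-written conversation in the assumed schema:
exported = json.dumps([
    {"role": "user", "content": "Explain quantum computing in one sentence."},
    {"role": "assistant", "content": "Quantum computers exploit superposition."},
])
print(conversation_to_markdown(exported))
```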

Performance Tips