winml perf¶

Benchmark an ONNX model's latency and throughput on a target device.

When to use this¶

Use winml perf when you want a quantitative latency and throughput baseline for a model on a specific device, or when you need to compare the performance impact of different precision settings, execution providers, or batch sizes.

Synopsis¶

$ winml perf [options]

Flags¶

Flag	Short	Type	Default	Description
`--model`	`-m`	`TEXT`	—	HuggingFace model ID or path to a local `.onnx` file. Required.
`--task`		`TEXT`	auto-detected	Explicit task override (e.g., `image-classification`). Inferred from the model if omitted.
`--iterations`		`INTEGER`	`100`	Number of timed inference iterations used to compute statistics.
`--warmup`		`INTEGER`	`10`	Number of warm-up iterations run before timing begins; excluded from statistics.
`--device`	`-d`	`auto\\|cpu\\|gpu\\|npu`	`auto`	Device to run the benchmark on. `auto` selects the highest-priority available device.
`--precision`		`TEXT`	`auto`	Precision mode applied during model build: `auto`, `fp32`, `fp16`, `int8`, `int16`, or compound forms such as `w8a16`.
`--ep`		`TEXT`	—	Force a specific execution provider (e.g., `qnn`, `dml`, `vitisai`, `openvino`, `cpu`). Overrides the device-to-provider mapping.
`--output`	`-o`	`PATH`	`~/.cache/winml/perf/<slug>/<timestamp>.json`	Output JSON file path for the benchmark report.
`--batch-size`		`INTEGER`	`1`	Batch size used when generating synthetic input tensors.
`--shape-config`		`PATH`	—	Path to a JSON file containing shape overrides (e.g., `{"height": 480, "width": 480}`). Ignored for pre-exported ONNX files and in `--module` mode.
`--quantize/--no-quantize`		flag	`true`	Run quantization during model build (use `--no-quantize` to skip it). Useful for measuring the fp32 baseline.
`--rebuild/--no-rebuild`		flag	`false`	Force model rebuild even if a cached artifact already exists.
`--ignore-cache/--no-ignore-cache`		flag	`false`	Build from scratch in a temporary folder and discard the artifact after benchmarking. Implies `--rebuild`.
`--module`		`TEXT`	—	PyTorch module class name for per-module benchmarking (e.g., `BertAttention`). Builds and times each matching instance separately. See Load and export.
`--monitor/--no-monitor`		flag	`false`	Show a live NPU/CPU utilization chart while the benchmark runs and include hardware metrics in the JSON report.

How it works¶

winml perf loads the model through WinMLAutoModel — accepting both HuggingFace IDs and local ONNX files — then generates random input tensors from the model's I/O configuration. It runs the specified number of warm-up iterations (excluded from statistics) followed by the timed iterations, collecting per-sample latency. The final report includes mean, min, max, P50, P90, P95, P99, standard deviation, and throughput in samples per second. When --monitor is active, a hardware polling loop runs in parallel and records NPU / GPU utilization, CPU usage, and device memory alongside the timing data.

Examples¶

Basic benchmark on the best available device:

$ winml perf -m microsoft/resnet-50

Device:      npu
Precision:   auto
Task:        image-classification
Iterations:  100 (+ 10 warmup)
Batch Size:  1

Latency (ms)
  Avg    P50    P90    P95    P99    Min    Max    Std
 2.14   2.11   2.38   2.51   2.79   1.97   3.04   0.12

Throughput: 467.29 samples/sec

Results saved to: ~/.cache/winml/perf/microsoft_resnet-50/2026-05-27T120000.json

Benchmark a pre-exported ONNX file on CPU with more iterations:

$ winml perf -m model.onnx --device cpu --iterations 500

Benchmark a text model with an explicit task, targeting the NPU:

$ winml perf -m bert-base-uncased --task text-classification --device npu --precision w8a16

Benchmark with live hardware monitoring enabled:

$ winml perf -m microsoft/resnet-50 --device npu --monitor

Per-module benchmarking to find latency hot-spots across all attention blocks:

$ winml perf -m bert-base-uncased --module BertAttention --iterations 200

Common pitfalls¶

Warm-up too low on NPU. The first several inferences on an NPU EP can be significantly slower due to kernel compilation and caching. The default of 10 warm-up iterations is usually enough for vision models, but transformer models with many operators may need --warmup 30 or higher to reach steady-state latency.
--shape-config is silently ignored in two cases. It has no effect on pre-exported ONNX files (shapes are baked into the graph) and is ignored in --module mode. The command prints a warning in both situations.
Random inputs do not represent real data distributions. Latency numbers are accurate, but memory access patterns may differ from production because the generated tensors are uniform random values. For memory-bandwidth-sensitive models this can understate real-world latency.
Cross-device comparison. To compare performance across devices, run winml perf separately with different --device values and compare the resulting JSON reports.