winml compile

Compile an ONNX model to an EP-specific format for fast runtime loading.

When to use this

Use winml compile as the final pipeline stage after winml quantize to produce an execution-provider-native artifact (for example, a QNN EPContext model) that loads faster and avoids online graph compilation at inference time.

Synopsis

$ winml compile [options]

Flags

Flag	Short	Type	Default	Description
`--model`	`-m`	path	(required unless `--list`)	Input ONNX model file.
`--output`	`-o`	path	—	Output file path (e.g., `model_compiled.onnx`). Takes precedence over `--output-dir`.
`--output-dir`		path	same dir as input	Directory to write compiled output artifacts.
`--device`	`-d`	choice	`auto`	Target device: `auto`, `npu`, `gpu`, or `cpu`.
`--ep`		text	—	Force a specific execution provider, overriding the device-to-provider mapping. Accepts full names (e.g., `QNNExecutionProvider`) or aliases (`qnn`, `dml`, `openvino`, `vitisai`, `migraphx`, `cpu`, `nvtensorrtrtx`).
`--validate` / `--no-validate`		flag	`--validate`	Run a post-compilation validation pass on the target hardware. Enabled by default; pass `--no-validate` to skip when the target hardware or driver is unavailable.
`--compiler`		choice	`ort`	Compiler backend: `ort` (ONNX Runtime) or `qairt` (Qualcomm AI Runtime Tools).
`--qnn-sdk-root`		path	`None`	Path to the QNN SDK root directory.
`--embed` / `--no-embed`		flag	`--no-embed`	Embed the EP context blob inside the ONNX file instead of writing a separate `.bin` file.
`--list`		flag	`false`	List available compiler backends for the selected device and exit without compiling.
`--help`	`-h`	flag		Show this message and exit.
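
The precedence between --output, --output-dir, and the default location can be sketched as a small shell function. This is an illustration only, not winml's implementation: resolve_output is a hypothetical helper, and the _ctx suffix is an assumption inferred from the sample output name shown in the Examples section.

```shell
# Sketch only: models the documented precedence, not winml's actual code.
# resolve_output is a hypothetical helper; the "_ctx" suffix is an assumption
# taken from the sample output name resnet50_qdq_ctx.onnx.
resolve_output() {
  ro_model="$1"        # value of --model
  ro_output="$2"       # value of --output, or empty
  ro_output_dir="$3"   # value of --output-dir, or empty

  # --output takes precedence over --output-dir.
  if [ -n "$ro_output" ]; then
    printf '%s\n' "$ro_output"
    return 0
  fi

  # Otherwise fall back to --output-dir, then to the input model's directory.
  if [ -z "$ro_output_dir" ]; then
    ro_output_dir="$(dirname "$ro_model")"
  fi
  printf '%s/%s_ctx.onnx\n' "$ro_output_dir" "$(basename "$ro_model" .onnx)"
}

resolve_output models/resnet50_qdq.onnx "" ""                 # default location
resolve_output models/resnet50_qdq.onnx "" build              # --output-dir only
resolve_output models/resnet50_qdq.onnx out/final.onnx build  # --output wins
```

Passing both values, as in the last call, returns the --output path unchanged, which mirrors the "takes precedence" note in the table.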

How it works

winml compile resolves the target execution provider from --device and --ep, then calls the winml-cli compiler API to hand the ONNX graph to the EP's offline compilation toolchain. With --device auto (the default), the target EP is chosen by detecting the available hardware; an explicit --ep overrides the device-to-provider mapping. For NPU targets served by ONNX Runtime's QNN EP, the EP generates a binary .bin context file (or embeds it in the ONNX file with --embed) that encodes the hardware-optimized execution plan, eliminating graph partitioning at load time. The post-compilation validation pass, enabled by default, runs a forward pass through the target EP; skip it with --no-validate when the target hardware is absent.
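
The resolution order described above (explicit --ep first, then the device, with auto deferring to detection) can be sketched in shell. This is a rough model under stated assumptions, not winml's code: resolve_ep, default_ep_for_device, detect_device, and the SKETCH_DEVICE variable are hypothetical, and the device-to-provider table is invented for illustration except for npu mapping to qnn, which matches the sample output on this page.

```shell
# Sketch only: models the precedence described above, not winml's actual code.
# The device-to-provider table is an assumption for illustration; only
# npu -> qnn is taken from the sample output on this page.
default_ep_for_device() {
  case "$1" in
    npu) echo "qnn" ;;
    gpu) echo "dml" ;;   # assumed default for GPU
    cpu) echo "cpu" ;;
    *)   echo "unknown device: $1" >&2; return 1 ;;
  esac
}

# Hypothetical stand-in for hardware auto-detection: reads SKETCH_DEVICE
# and falls back to cpu.
detect_device() {
  echo "${SKETCH_DEVICE:-cpu}"
}

resolve_ep() {
  re_device="${1:-auto}"   # value of --device
  re_ep="${2:-}"           # value of --ep, or empty

  # An explicit --ep overrides the device-to-provider mapping.
  if [ -n "$re_ep" ]; then
    echo "$re_ep"
    return 0
  fi

  # --device auto defers to detection before consulting the mapping.
  if [ "$re_device" = "auto" ]; then
    re_device="$(detect_device)"
  fi
  default_ep_for_device "$re_device"
}

resolve_ep gpu openvino   # prints: openvino
resolve_ep npu            # prints: qnn
SKETCH_DEVICE=npu
resolve_ep auto           # prints: qnn
```

Checking --ep before anything else keeps the override unconditional, so the device value only matters when no provider is forced.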

Examples

# Compile with auto device detection (default compiler)
winml compile -m resnet50_qdq.onnx

Input: resnet50_qdq.onnx
Device: npu
Provider: qnn
Compiler: ort

Compiling model...

Success! Model compiled
Output: resnet50_qdq_ctx.onnx
Compile time: 12.40s
Total time: 13.05s

# List available compiler backends for NPU before committing to a run
winml compile --list --device npu

# Compile a pre-quantized BERT model for NPU with context embedded inline
winml compile -m bert-base-uncased_qdq.onnx --device npu --embed

# Compile for GPU using the OpenVINO execution provider
winml compile -m microsoft_resnet50.onnx --device gpu --ep openvino

Common pitfalls

--embed inflates the .onnx file significantly. Embedding the EP context produces a single portable file but can make it impractical to open or inspect the ONNX graph with standard tooling.
Validation requires the target hardware. The post-compilation validation step runs an actual inference pass; on a machine without the NPU driver or the relevant EP installed, always pass --no-validate.
--device auto selects the best available hardware it detects. Pass --device npu, --device gpu, or --device cpu explicitly to target specific hardware regardless of what is detected.
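
One way to guard against the validation and device pitfalls in a build script is a thin wrapper that always names the device and adds --no-validate when the hardware is missing. The sketch below is only an illustration: compile_model is a hypothetical function, and HAS_TARGET_HW and DRY_RUN are variables invented for this example. Only the winml flags come from the Flags section.

```shell
# Sketch only: compile_model is a hypothetical wrapper, and HAS_TARGET_HW and
# DRY_RUN are variables invented for this example. Only the winml flags come
# from the Flags section.
compile_model() {
  cm_model="$1"
  cm_device="$2"

  # Name the device explicitly instead of relying on --device auto.
  set -- winml compile -m "$cm_model" --device "$cm_device"

  # Validation runs real inference, so skip it when the hardware is absent.
  if [ "${HAS_TARGET_HW:-0}" != "1" ]; then
    set -- "$@" --no-validate
  fi

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"      # print the command instead of running it
  else
    "$@"
  fi
}

DRY_RUN=1
compile_model resnet50_qdq.onnx npu
# prints: winml compile -m resnet50_qdq.onnx --device npu --no-validate
```

With DRY_RUN unset the wrapper runs the command it built, and setting HAS_TARGET_HW=1 on a machine with the NPU driver leaves validation enabled.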