Skip to content

winml compile

Compile an ONNX model to an EP-specific format for fast runtime loading.

When to use this

Use winml compile as the final pipeline stage after winml quantize to produce an execution-provider-native artifact (for example, a QNN EPContext model) that loads faster and avoids online graph compilation at inference time.

Synopsis

$ winml compile [options]

Flags

Flag Short Type Default Description
--model -m path (required unless --list) Input ONNX model file.
--output -o path Output file path (e.g., model_compiled.onnx). Takes precedence over --output-dir.
--output-dir path same dir as input Directory to write compiled output artifacts.
--device -d choice auto Target device: auto, npu, gpu, or cpu.
--ep TEXT Force a specific execution provider, overriding device-to-provider mapping. Accepts full names (e.g., QNNExecutionProvider) or aliases (qnn, dml, openvino, vitisai, migraphx, cpu, nvtensorrtrtx).
--validate / --no-validate flag --validate Run a post-compilation validation pass on the target hardware. Enabled by default; pass --no-validate to skip when the target hardware or driver is unavailable.
--compiler choice ort Compiler backend: ort (ONNX Runtime) or qairt (Qualcomm AI Runtime Tools).
--qnn-sdk-root path None Path to the QNN SDK root directory.
--embed/--no-embed flag false Embed the EP context blob inside the ONNX file instead of writing a separate .bin file.
--list flag false List available compiler backends for the selected device and exit without compiling.
--help -h flag Show this message and exit.

How it works

winml compile resolves the target execution provider from --device and --ep, then calls the winml-cli compiler API to hand the ONNX graph to the EP's offline compilation toolchain. When --device auto (the default), the target EP is determined by auto-detecting available hardware. For NPU targets, ONNX Runtime's QNN EP generates a binary .bin context file (or embeds it inline with --embed) that encodes the hardware-optimized execution plan, eliminating graph partitioning at load time. An optional post-compilation validation pass runs a forward pass through the target EP; skip it with --no-validate when the target hardware is absent.

Examples

# Compile with auto device detection (default compiler)
winml compile -m resnet50_qdq.onnx
Input: resnet50_qdq.onnx
Device: npu
Provider: qnn
Compiler: ort

Compiling model...

Success! Model compiled
Output: resnet50_qdq_ctx.onnx
Compile time: 12.40s
Total time: 13.05s
# List available compiler backends for NPU before committing to a run
winml compile --list --device npu
# Compile a pre-quantized BERT model for NPU with context embedded inline
winml compile -m bert-base-uncased_qdq.onnx --embed
# Compile for GPU using the OpenVINO execution provider
winml compile -m microsoft_resnet50.onnx --device gpu --ep openvino

Common pitfalls

  • --embed inflates the .onnx file significantly. Embedding the EP context produces a single portable file but can make it impractical to open or inspect the ONNX graph with standard tooling.
  • Validation requires the target hardware. The post-compilation validation step runs an actual inference pass; on a machine without the NPU driver or the relevant EP installed, always pass --no-validate.
  • --device auto auto-detects the best available hardware. Pass --device npu, --device gpu, or --device cpu explicitly when targeting specific hardware regardless of what is auto-detected.

See also