How winml-cli Works

winml-cli is a toolkit for converting PyTorch and Hugging Face models into ONNX artifacts that are optimized and compiled for Windows ML execution providers (EPs). Starting from a model identifier or a pre-exported ONNX file, winml-cli runs a staged pipeline — export, optimize, quantize, compile — and produces a final model.onnx ready for inference via a Windows ML session.

Each stage is independently controllable. Quantization and compilation are optional and can be bypassed with a flag or by setting the corresponding section of the build configuration to null. The pipeline API that powers winml build is also used programmatically by WinMLAutoModel.from_pretrained().

The Pipeline at a Glance

[Figure: winml-cli workflow]

The stages run in order, and each one writes an intermediate ONNX file to the output directory. All intermediate artifacts are preserved so you can inspect any stage's output or feed a pre-processed file into a later stage directly.

Pipeline Stages

Export — winml export

winml export loads a Hugging Face model (pretrained or random-weight), traces it with torch.export or an Optimum-based exporter, and writes a portable, device-agnostic ONNX file. The output at this stage is a plain ONNX graph with float32 weights and no EP-specific nodes.

Analyze — winml analyze

winml analyze performs static compatibility analysis on an ONNX graph against a target execution provider. It classifies every node as Supported, Partial, Unsupported, or Unknown, and it does so without running the model on the device. Use it before building to check whether your model (or an intermediate artifact from any pipeline stage) will run cleanly on the target EP:

winml analyze -m model.onnx --ep qnn --device npu

Add --optim-config optim.json to output auto-discovered optimization recommendations that can be fed directly into winml optimize. The same analyzer also drives the autoconf feedback loop inside winml build.
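To make the four-way classification concrete, here is a minimal Python sketch. The support table, its entries, and the classify_nodes and summarize helpers are all hypothetical. They illustrate the idea of a static, table-driven check and do not reflect winml-cli's actual analyzer or any real EP's support matrix.

```python
from collections import Counter

# Hypothetical support table for an imaginary EP (NOT real QNN data).
SUPPORT_TABLE = {
    "Conv": "Supported",
    "MatMul": "Supported",
    "Resize": "Partial",       # e.g. only some attribute combinations work
    "NonZero": "Unsupported",
}


def classify_nodes(op_types, table=SUPPORT_TABLE):
    """Pair each op type with a status; ops missing from the table are Unknown."""
    return [(op, table.get(op, "Unknown")) for op in op_types]


def summarize(classified):
    """Count how many nodes fall into each status."""
    return Counter(status for _, status in classified)


# Op types as they might be read from a graph, in node order.
graph_ops = ["Conv", "Resize", "MatMul", "NonZero", "MyCustomOp", "Conv"]
report = classify_nodes(graph_ops)
summary = summarize(report)
```

Because the check only looks up op types, it needs neither the target device nor model weights, which is what makes this kind of analysis cheap enough to run on every intermediate artifact.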

Optimize — winml optimize

winml optimize runs graph-level transformations on the exported ONNX: operator fusion (attention, layer norm, GeLU), constant folding, and graph pruning. The optimize stage also contains an autoconf loop: a static analyzer inspects the graph for nodes that the target EP cannot dispatch natively, and the stage re-runs optimization with adjusted fusion flags until no further improvements are found or a configurable iteration limit is reached.
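The autoconf loop described above can be sketched as a bounded improvement loop. This is an illustration under stated assumptions, not winml-cli's implementation: autoconf_loop, the two callbacks, the flag names fuse_attention and fuse_gelu, and the default limit of 3 are all hypothetical.

```python
def autoconf_loop(flags, count_unsupported, propose_flags, max_iters=3):
    """Re-run with adjusted flags until nothing improves or the limit is hit.

    count_unsupported(flags) -> int : stand-in for "optimize, then analyze"
    propose_flags(flags) -> dict    : stand-in for the analyzer's suggestion
    """
    best_flags = dict(flags)
    best_score = count_unsupported(best_flags)
    for _ in range(max_iters):
        candidate = propose_flags(best_flags)
        score = count_unsupported(candidate)
        if score >= best_score:  # no further improvement: stop early
            break
        best_flags, best_score = candidate, score
    return best_flags, best_score


def toy_count(flags):
    # Pretend the fused GeLU node cannot be dispatched on the target EP.
    return 4 if flags["fuse_gelu"] else 0


def toy_propose(flags):
    # Pretend the analyzer suggests turning GeLU fusion off.
    return {**flags, "fuse_gelu": False}


final_flags, remaining = autoconf_loop(
    {"fuse_attention": True, "fuse_gelu": True}, toy_count, toy_propose
)
```

In this toy run the first iteration removes all unsupported nodes, and the second iteration finds no further gain and exits before reaching the limit.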

Quantize — winml quantize

winml quantize inserts Quantize-Dequantize (QDQ) nodes into the optimized graph to reduce weights and activations to lower-precision types (for example, int8 weights with uint8 activations). Calibration data is used to compute quantization parameters per tensor. If the input model already contains QDQ nodes, this stage is skipped automatically.
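As an illustration of what "quantization parameters per tensor" means, the sketch below derives a scale and zero point for uint8 activations from the minimum and maximum observed in calibration data, using the standard asymmetric (affine) formula. Whether winml quantize uses min/max calibration or another method is not stated here, so treat the calibration strategy and the helper names as assumptions.

```python
def qparams_uint8(x_min, x_max):
    """Scale and zero point for asymmetric uint8 quantization."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / 255.0
    if scale == 0.0:  # degenerate all-zero tensor
        return 1.0, 0
    zero_point = int(round(-x_min / scale))
    return scale, max(0, min(255, zero_point))


def quantize(x, scale, zero_point):
    """Float -> uint8, i.e. what a QuantizeLinear node computes."""
    q = int(round(x / scale)) + zero_point
    return max(0, min(255, q))


def dequantize(q, scale, zero_point):
    """uint8 -> float, i.e. what a DequantizeLinear node computes."""
    return (q - zero_point) * scale


# Values observed for one tensor during a (toy) calibration run.
calibration = [-1.0, 0.25, 1.0, 3.0]
scale, zp = qparams_uint8(min(calibration), max(calibration))
roundtrip = dequantize(quantize(1.0, scale, zp), scale, zp)
```

For in-range values the round-trip error is bounded by half the scale, which is why a tighter calibration range gives better accuracy at the same bit width.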

Compile — winml compile

winml compile invokes an EP-specific compiler (for example, the QNN compiler for NPU targets) to embed a pre-compiled binary cache inside the ONNX graph as an EPContext node. At inference time, the EP loads the cached binary directly, bypassing per-session compilation. Compilation is optional; omitting it produces a portable ONNX file that the runtime compiles on first load.

Perf and Eval — winml perf / winml eval

After the model is built, winml perf benchmarks inference latency and throughput using a Windows ML session, and winml eval runs task-specific accuracy evaluation. Neither command modifies the model; they consume the final model.onnx produced by the pipeline.

winml build as the One-Shot Wrapper

Running each stage individually is useful when iterating on a specific step, but the normal workflow is winml build, which orchestrates the full pipeline in a single command:

winml build -m microsoft/resnet-50 -o output/

A config file, passed with -c config.json, is optional. If it is omitted, winml build generates a default config internally. To customize pipeline settings, generate a config first with winml config and then pass it:

winml config -m microsoft/resnet-50 -o config.json
winml build -c config.json -m microsoft/resnet-50 -o output/

winml build auto-detects whether the input is a Hugging Face model ID or an existing ONNX file and calls the appropriate internal API (build_hf_model or build_onnx_model). When given an ONNX file directly, the export stage is skipped and the pipeline starts at optimize.
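The dispatch described above might look roughly like the following. The .onnx suffix check is an assumed heuristic chosen for illustration, since the actual detection logic is not documented here, and plan_build is a hypothetical helper that only returns the name of the internal API rather than calling it.

```python
from pathlib import Path

FULL_PIPELINE = ["export", "optimize", "quantize", "compile"]


def plan_build(model_arg):
    """Return (entry_point_name, stages) for a value passed to -m.

    Assumption: anything ending in '.onnx' is treated as an ONNX file,
    and everything else as a Hugging Face model ID.
    """
    if Path(model_arg).suffix.lower() == ".onnx":
        # Export is skipped; the pipeline starts at optimize.
        return "build_onnx_model", FULL_PIPELINE[1:]
    return "build_hf_model", list(FULL_PIPELINE)


hf_plan = plan_build("microsoft/resnet-50")
onnx_plan = plan_build("model_qdq.onnx")
```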

Individual stages can be bypassed from the command line without editing the config file:

# Skip quantization and compilation
winml build -m bert-base-uncased -o output/ --no-quant --no-compile

# Skip optimization (for pre-quantized input)
winml build -m model_qdq.onnx -o output/ --no-optimize
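When driving builds from a script, the same invocations can be assembled as an argument list. The build_command helper below is hypothetical and uses only the flags shown in this section (-c, -m, -o, --no-optimize, --no-quant, --no-compile). It builds the command without running it.

```python
def build_command(model, out_dir, config=None,
                  no_optimize=False, no_quant=False, no_compile=False):
    """Assemble a 'winml build' argv list from the flags documented above."""
    argv = ["winml", "build"]
    if config is not None:
        argv += ["-c", config]
    argv += ["-m", model, "-o", out_dir]
    if no_optimize:
        argv.append("--no-optimize")
    if no_quant:
        argv.append("--no-quant")
    if no_compile:
        argv.append("--no-compile")
    return argv


cmd = build_command("bert-base-uncased", "output/",
                    no_quant=True, no_compile=True)
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string to subprocess.run avoids shell quoting problems with model IDs and paths.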

Configuration: WinMLBuildConfig vs CLI Flags

Pipeline behavior is primarily governed by a WinMLBuildConfig JSON file generated by winml config. The config is a hierarchical structure with one section per stage:

WinMLBuildConfig
├── loader    — model type, task, input constraints
├── export    — input tensor specs, opset, backend
├── optim     — fusion flags, optimization level
├── quant     — precision, calibration settings (null = skip stage)
├── compile   — target EP, device (null = skip stage)
└── eval      — evaluation settings

Setting quant or compile to null in the JSON file is equivalent to passing --no-quant or --no-compile on the command line; both result in the corresponding stage being skipped. CLI flags override the config at runtime without modifying the file, which is convenient for one-off experiments.
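The equivalence between a null section and a --no-* flag can be sketched as follows. The apply_cli_overrides and enabled_optional_stages helpers and the precision key inside quant are hypothetical. Only the top-level section names quant and compile come from the structure above.

```python
import copy
import json


def apply_cli_overrides(config, no_quant=False, no_compile=False):
    """Return a copy with stages disabled; the loaded config is not modified."""
    effective = copy.deepcopy(config)
    if no_quant:
        effective["quant"] = None
    if no_compile:
        effective["compile"] = None
    return effective


def enabled_optional_stages(config):
    """List the optional stages whose section is present and not null."""
    return [s for s in ("quant", "compile") if config.get(s) is not None]


# JSON null becomes Python None, so 'compile' is already skipped here.
config = json.loads('{"quant": {"precision": "int8"}, "compile": null}')
effective = apply_cli_overrides(config, no_quant=True)
```

Working on a deep copy mirrors the behavior described above: the flag changes what runs for this invocation, while the config as loaded stays unchanged.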

The config file is written to (or updated in) the output directory after the optimize stage completes, capturing any autoconf-adjusted fusion flags so the build is reproducible. This persisted winml_build_config.json is a self-contained pipeline specification that you can check into version control and run in CI/CD (winml build -c winml_build_config.json -m <model> -o output/) for repeatable, unattended builds across environments.

For the full field-by-field schema, see Reference — Config Schema.

See Also