Skip to content

Datatype and Quantization

Every ONNX tensor carries data in a specific numeric type — float32, float16, int8, int16 — and every winml-cli pipeline makes deliberate choices about which type to use where. This page covers both halves of that decision: the datatype family winml-cli understands, and the quantization workflow that converts a model from one datatype to another to shrink it and run it faster on integer-native hardware.

Quantization is the headline use of datatypes in winml-cli. By replacing float32 weights and activations with int8 or mixed precisions, you typically get a 2–4× smaller model artifact and a 2–8× latency speedup on NPU hardware. The trade-off is a potential reduction in model accuracy, the degree of which depends on the precision chosen and the sensitivity of the model.

Datatypes

winml-cli exposes a precision shorthand on the --precision flag that encodes the weight/activation dtype pair as a single string. The table below lists every precision from _NAMED_PRECISIONS in config/precision.py, together with the resolved quantization types. Float precisions (fp32, fp16) carry no quantization types because weights and activations remain in floating point throughout.

Precision Weight dtype Activation dtype Notes
auto device-dependent device-dependent Resolves to w8a16 (NPU), fp16 (GPU/CPU) at runtime
fp32 float32 float32 No quantization; baseline accuracy
fp16 float16 float16 Half-precision float; no QDQ nodes inserted
int8 uint8 uint8 Static quantization; valid for QNN EP
int16 int16 uint16 Higher-accuracy quantization; larger model than int8
w8a8 uint8 uint8 Equivalent to int8; explicit mixed-precision notation
w8a16 uint8 uint16 Mixed: compact weights, wider activations for accuracy
w4a16 n/a n/a Not supported. Rejected at validation — is_quantized_precision("w4a16") returns False because 4-bit weight types are absent from _BITS_TO_WEIGHT_TYPE in precision.py. The string is not a recognized precision.

The --weight-type and --activation-type flags on winml quantize accept uint8, int8, uint16, or int16 and override whatever the --precision shorthand would have resolved. This is useful when you need an unsigned weight type for QNN compatibility but a signed activation type for a specific operator constraint. See Weight and Activation for why the two need separate flags in the first place.

How quantization works in winml-cli

winml-cli applies quantization by inserting QDQ (Quantize/Dequantize) nodes into the ONNX graph. The resulting file is a standard ONNX model that any ONNX Runtime execution provider can consume and optimize for its target hardware — the EP reads the QDQ pattern and fuses adjacent operations into true integer kernels.

Calibration

Static quantization — the kind winml-cli applies — requires a calibration pass before inserting QDQ nodes. During calibration, a small set of representative inputs runs through the original floating-point model so that winml-cli can observe the actual range of values each tensor takes at runtime. Those observed ranges are then used to choose the scale and zero-point constants baked into the QDQ nodes.

The --samples flag controls how many calibration inputs are used (default: 10). More samples generally produce better range estimates but take longer. The --method flag selects the algorithm used to summarize the observed ranges:

  • minmax (default) — uses the absolute minimum and maximum observed values. Fast and predictable; can be sensitive to outliers.
  • entropy — minimizes the KL-divergence between the original and quantized distribution. Often yields better accuracy on models with heavy-tailed activation distributions.
  • percentile — clips a small fraction of extreme values before computing the range. A practical middle ground when outliers are present but entropy calibration is slow.

Example using entropy calibration with more samples:

winml quantize -m model.onnx --precision int8 --samples 128 --method entropy

The QDQ pattern

The QDQ pattern is the standard ONNX representation for static quantization. winml-cli wraps the inputs and outputs of quantizable operators with pairs of QuantizeLinear and DequantizeLinear nodes. At the graph level the model still operates in floating-point; the QDQ nodes encode the scale and zero-point metadata that a runtime needs to fuse adjacent operations into true integer kernels.

When the model runs under ONNX Runtime, the execution provider — whether CPU, DirectML, or a dedicated NPU EP — reads those QDQ patterns and performs its own graph fusion. This means the EP is free to apply hardware-specific optimizations without winml-cli needing to know anything about the target device's internal ISA or operator library. The QDQ model produced by winml quantize is a single portable artifact that can be deployed to any EP that supports integer execution.

When quantization is lossy

Not all precision choices carry equal accuracy risk:

  • fp16 is usually lossless in practice. Rounding errors relative to fp32 are small enough that most models show no measurable accuracy difference.
  • int8 and int16 are inherently lossy. Compressing a 32-bit float into 8 or 16 bits discards information, and the magnitude of accuracy degradation depends on how well the calibration data represents the deployment distribution.
  • Compound precisions like w8a16 reduce the risk compared to full int8 by preserving more precision in activations, but they are still lossy relative to fp32.

Always validate accuracy after quantizing an integer-precision model. Run winml eval on a representative dataset and compare the metrics against the original floating-point baseline before shipping the quantized artifact.

See also