winml eval

Evaluate ONNX model accuracy on a standard dataset.

When to use this

Use winml eval to measure how accurately a model performs on real data — especially after quantization, where comparing the quantized model against the floating-point baseline reveals any accuracy regression introduced by precision reduction.

Synopsis

$ winml eval [options]

Flags

| Flag | Short | Type | Default | Description |
| --- | --- | --- | --- | --- |
| --model | -m | TEXT | | HuggingFace model ID, or path to a local .onnx file. Required (unless --model-id is provided directly). |
| --model-id | | TEXT | | HuggingFace model ID used for preprocessor and config resolution when -m points to an .onnx file. Required when -m is an ONNX file. |
| --task | | TEXT | auto-detected | Task name (e.g., image-classification). Auto-detected from --model-id when not provided. Required when -m is an ONNX file and the task cannot be inferred. |
| --precision | | TEXT | auto | Precision used when building the model from a HuggingFace ID. One of auto, fp32, fp16, int8, int16, or a mixed w{x}a{y} spec (e.g., w8a16). fp16/fp32 skip quantization. Ignored when -m is a pre-built .onnx file, because the precision is already baked in. |
| --device | | choice | auto | Target device. Choices: auto, npu, gpu, cpu. auto selects the best available device. Combined with --precision, this drives the build when -m is a HuggingFace ID. |
| --ep / --execution-provider | | TEXT | | Target ONNX Runtime execution provider when finer control than --device is needed. Full names (e.g., QNNExecutionProvider, OpenVINOExecutionProvider, VitisAIExecutionProvider) and aliases (qnn, ov/openvino, vitis/vitisai) are accepted. |
| --dataset | | TEXT | task default | HuggingFace dataset path (e.g., imagenet-1k, nyu-mll/glue). If omitted, a default dataset is selected based on the task. |
| --dataset-name | | TEXT | | Dataset configuration name for multi-config datasets. |
| --dataset-revision | | TEXT | | Git revision (branch, tag, or commit) of the dataset to load. Use refs/convert/parquet for HF datasets that are only served via the parquet mirror. |
| --dataset-script | | TEXT | | Path to a Python script that builds the evaluation dataset locally. Requires --trust-remote-code. |
| --trust-remote-code / --no-trust-remote-code | | flag | false | Allow executing custom code from model repositories or dataset scripts. Required with --dataset-script. Use only with trusted sources. |
| --samples | | INTEGER | 100 | Number of dataset samples to evaluate. |
| --split | | TEXT | validation | Dataset split to use (e.g., validation, test, train). |
| --shuffle / --no-shuffle | | flag | shuffle | Shuffle the dataset before sampling. Disable with --no-shuffle for reproducible sample ordering. |
| --streaming / --no-streaming | | flag | false | Stream the dataset from the Hub instead of downloading the full split. Useful for large datasets. |
| --column | | TEXT (multiple) | | Column mapping as key=value pairs (e.g., --column input_column=image). Can be specified multiple times. |
| --label-mapping | | PATH | | Path to a JSON file mapping dataset label names to the integer class IDs the model emits: {"label_name": id}. |
| --output | -o | PATH | | Output JSON file path for the evaluation results. |
| --schema | | flag | false | Print the expected dataset schema for the given --task and exit. Does not run evaluation. |
| --mode | | onnx\|compare | onnx | Evaluation mode. onnx evaluates the ONNX candidate on a dataset. compare runs the ONNX candidate and the HuggingFace reference on identical random inputs and reports per-tensor similarity metrics; no dataset is required. |
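
To make the compare mode more concrete, here is a minimal Python sketch of what a per-tensor similarity report could look like. The specific metrics (cosine similarity and maximum absolute difference), the function names, and the sample values are assumptions chosen for illustration; this page does not say which metrics winml eval actually reports.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat two all-zero tensors as identical; otherwise report no similarity.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference (0.0 for empty tensors)."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def compare_outputs(reference, candidate):
    """Build a per-tensor report from two {tensor_name: flat_values} dicts."""
    report = {}
    for name, ref_values in reference.items():
        cand_values = candidate[name]
        if len(ref_values) != len(cand_values):
            raise ValueError(f"size mismatch for tensor {name!r}")
        report[name] = {
            "cosine": cosine_similarity(ref_values, cand_values),
            "max_abs_diff": max_abs_diff(ref_values, cand_values),
        }
    return report


# Hypothetical outputs from the reference and the ONNX candidate for one input.
reference = {"logits": [1.0, 2.0, 3.0]}
candidate = {"logits": [1.0, 2.0, 3.5]}
report = compare_outputs(reference, candidate)
print(report)
```

A cosine value close to 1.0 together with a small maximum difference suggests the candidate tracks the reference closely on that tensor.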

How it works

winml eval loads the model (a HuggingFace ID or a local ONNX file), pulls the requested number of samples from a HuggingFace dataset, and runs the evaluation pipeline through the internal evaluate function. Each sample is preprocessed with the tokenizer or image processor associated with the model ID and passed through the ONNX Runtime session, and the output is compared against the ground-truth label. Aggregated metrics (accuracy, F1, etc.) are printed to the console and optionally written to a JSON file. When -m is an ONNX file, --model-id must be provided so the command knows which preprocessor and label vocabulary to use.
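
The preprocess, predict, compare, and aggregate steps described above can be sketched in a few lines of Python. This is a simplified illustration, not the internal evaluate function: the preprocess and predict callables are toy stand-ins for the real tokenizer or image processor and the ONNX Runtime session, and the sample data is made up.

```python
def evaluate(samples, preprocess, predict):
    """Return the fraction of samples whose prediction matches the label."""
    correct = 0
    total = 0
    for sample in samples:
        features = preprocess(sample["input"])  # tokenizer / image processor stand-in
        predicted = predict(features)           # ONNX Runtime session stand-in
        correct += int(predicted == sample["label"])
        total += 1
    return correct / total if total else 0.0


def toy_preprocess(text):
    """Hypothetical preprocessor: normalize case and whitespace."""
    return text.strip().lower()


def toy_predict(text):
    """Hypothetical model: predict class 1 when the text mentions 'good'."""
    return 1 if "good" in text else 0


samples = [
    {"input": "Good movie", "label": 1},
    {"input": "Bad plot", "label": 0},
    {"input": "good acting", "label": 1},
    {"input": "Not good", "label": 0},  # the toy model gets this one wrong
]

accuracy = evaluate(samples, toy_preprocess, toy_predict)
print(f"Accuracy: {accuracy:.2%}")  # prints "Accuracy: 75.00%"
```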

Examples

Evaluate a HuggingFace model using the task-default dataset:

$ winml eval -m microsoft/resnet-50
Task:     image-classification
Dataset:  timm/mini-imagenet (test, 100 samples)
Device:   auto

Accuracy: 76.00%

Results saved to: microsoft_resnet-50_eval.json

Evaluate a pre-exported ONNX file, providing the source model ID for preprocessing:

$ winml eval -m model.onnx --model-id microsoft/resnet-50 --dataset timm/mini-imagenet

Evaluate a BERT model on the MRPC paraphrase task with column remapping:

$ winml eval -m Intel/bert-base-uncased-mrpc --dataset nyu-mll/glue --dataset-name mrpc --column input_column=sentence1 --column second_input_column=sentence2 --samples 500

Check what dataset columns are expected before running, then remap them to match your dataset:

$ winml eval --schema --task text-classification
Input schema for text-classification models
==================================================

--column option schema

Evaluating needs a dataset with the following columns:
  input_column
      input text (default: text)
  label_column
      class label (ClassLabel or integer) (default: label)
  second_input_column
      second text for sentence-pair tasks (optional) (default: None)

Override any default with --column:
  --column input_column=<your_text_column>
  --column label_column=<your_label_column>
  --column second_input_column=<your_pair_column>

The GLUE SST-2 dataset uses sentence instead of the default text column, so remap it with a single --column override:

$ winml eval -m distilbert/distilbert-base-uncased-finetuned-sst-2-english --dataset nyu-mll/glue --dataset-name sst2 --column input_column=sentence --samples 500
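
As a rough mental model of what a --column override does, the following Python sketch parses key=value pairs and applies them on top of the text-classification defaults printed by --schema. The helper names are hypothetical and the logic is an assumption about the behavior, not the tool's actual implementation.

```python
# Defaults taken from the --schema output for text-classification shown above.
DEFAULT_COLUMNS = {
    "input_column": "text",
    "label_column": "label",
    "second_input_column": None,
}


def parse_column_overrides(pairs):
    """Turn ["input_column=sentence", ...] into {"input_column": "sentence", ...}."""
    overrides = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {pair!r}")
        overrides[key] = value
    return overrides


def resolve_columns(pairs, defaults=DEFAULT_COLUMNS):
    """Apply the overrides on top of the task defaults."""
    resolved = dict(defaults)
    resolved.update(parse_column_overrides(pairs))
    return resolved


def remap_row(row, columns):
    """Pick the dataset fields each role points at, skipping unset optional roles."""
    return {role: row[name] for role, name in columns.items() if name is not None}


columns = resolve_columns(["input_column=sentence"])
row = {"sentence": "a charming film", "label": 1, "idx": 0}  # made-up SST-2 style row
print(remap_row(row, columns))
```

With the single override, the sentence field fills the input_column role while label_column keeps its default.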

Evaluate against a custom dataset whose label names differ from the model's class IDs. The --label-mapping flag points to a JSON file whose keys are the label name strings as they appear in the dataset and whose values are the integer class IDs the model emits. For example, ResNet-50 outputs ImageNet-1k class IDs (0 to 999), so if your custom dataset uses readable strings like "tabby cat" or "golden retriever", labels.json translates each dataset label to the corresponding ImageNet ID the model predicts:

{
  "tabby cat": 281,
  "Egyptian cat": 285,
  "golden retriever": 207
}

$ winml eval -m microsoft/resnet-50 --dataset my-org/my-pets-dataset --label-mapping labels.json -o results/resnet_eval.json
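
Before running the command, it can help to check that the mapping covers every label in your dataset. The Python sketch below shows one way to do that; the helper names are hypothetical and the translation logic is an assumption about how such a mapping is applied, not a description of the tool's internals.

```python
import json

# Same content as the labels.json example above, inlined so the sketch is self-contained.
LABELS_JSON = """
{
  "tabby cat": 281,
  "Egyptian cat": 285,
  "golden retriever": 207
}
"""


def load_label_mapping(text):
    """Parse the JSON and check that every value is an integer class ID."""
    mapping = json.loads(text)
    for name, class_id in mapping.items():
        if isinstance(class_id, bool) or not isinstance(class_id, int):
            raise ValueError(f"class ID for {name!r} must be an integer")
    return mapping


def to_class_ids(labels, mapping):
    """Translate dataset label names to class IDs, failing loudly on gaps."""
    missing = sorted({label for label in labels if label not in mapping})
    if missing:
        raise KeyError(f"labels missing from mapping: {missing}")
    return [mapping[label] for label in labels]


mapping = load_label_mapping(LABELS_JSON)
class_ids = to_class_ids(["tabby cat", "golden retriever", "tabby cat"], mapping)
print(class_ids)  # prints [281, 207, 281]
```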

Evaluate a composite model from pre-exported ONNX files. Some tasks (e.g., image-to-text, encoder-decoder, dual-encoder) split the model across multiple ONNX files, one per role. Pass -m once per role as <role>=<path>.onnx and supply --model-id so the preprocessor and tokenizer can be resolved. Run winml eval --schema --task image-to-text to see the expected roles for a task:

$ winml eval -m encoder=encoder.onnx -m decoder=decoder.onnx --model-id microsoft/trocr-base-printed

Common pitfalls

  • ONNX file without --model-id fails. When -m is a .onnx path, --model-id is mandatory. Without it the command cannot resolve the preprocessor or label vocabulary and will exit with a usage error.
  • The task-default dataset may not match every model. Classification and detection models in particular need a dataset whose label space and domain match what the model was trained on. Using the default may produce misleadingly low scores, missing-label errors, or a dataset-schema error. Always pass --dataset (and --label-mapping if needed) when evaluating a model whose label space or domain differs from the task default.
  • Gated datasets require Hub credentials. Some datasets (e.g., imagenet-1k) require a HuggingFace account with accepted terms of use. Log in with huggingface-cli login before running winml eval on gated data.
  • --shuffle is on by default. The random 100-sample slice changes between runs unless you pass --no-shuffle. Use --no-shuffle when comparing two model variants to ensure they see identical samples.
  • --streaming skips the local cache. Streaming mode avoids downloading the full split but prevents random shuffling on large datasets. For reproducible evaluation, download the split once and omit --streaming.
  • Column names vary across datasets. If the evaluator raises a missing-column error, run winml eval --schema --task <task> to inspect the expected schema and use --column to remap dataset field names to the expected names.
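
To illustrate the --no-shuffle advice above, here is a small Python sketch that builds two winml eval invocations differing only in --precision and output path, so that both variants are evaluated with the same flags. Only flags documented on this page are used; the output file names and the helper function are hypothetical, and the commands are printed rather than executed.

```python
import shlex


def build_eval_command(model, precision, output, samples=100):
    """Assemble a winml eval argument list with shuffling disabled."""
    return [
        "winml", "eval",
        "-m", model,
        "--precision", precision,
        "--samples", str(samples),
        "--no-shuffle",  # keep sample order fixed across both runs
        "-o", output,
    ]


# Hypothetical output file names, one per precision variant.
commands = [
    build_eval_command("microsoft/resnet-50", precision, f"resnet50_{precision}.json")
    for precision in ("fp32", "int8")
]

for command in commands:
    print(shlex.join(command))
```

Each list could be passed to subprocess.run to execute the comparison once the printed commands look right.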

See also