# Examples

Models are grouped by scenario (NLP, Vision). For each model, Link opens the example code, and the bullets list the hardware-targeted optimizations it covers.

## NLP

### deepseek (Link)

- QDQ: QDQ Model with 4-bit Weights & 16-bit Activations (see the sketch after this list)
- QNN EP: PTQ + AOT Compilation for Qualcomm NPUs using QNN EP
- Vitis AI EP: PTQ + AOT Compilation for AMD NPUs using Vitis AI EP
- OpenVINO EP: PTQ + AOT Compilation for OpenVINO EP
- Intel® NPU: PTQ + AWQ with 4-bit weight compression using Intel® Optimum OpenVINO for ONNX OpenVINO IR Encapsulated Model
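
These entries, like the rest of this page, are Olive workflow recipes. As a minimal sketch of driving one from Python (the config file name is a placeholder; the entry point follows Olive's documented Python API, so check your installed version):

```python
# Minimal sketch: drive an Olive workflow from Python, equivalent to
# `olive run --config <file>` on the command line. The JSON config describes
# the input model and the pass sequence (quantization, graph optimization,
# AOT compilation for a target EP, ...).
from olive.workflows import run as olive_run

results = olive_run("deepseek_w4a16_qdq.json")  # placeholder config name
```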

### llama2 (Link)

- CPU: with ONNX Runtime optimizations for optimized FP32 ONNX model (see the sketch below)
- CPU: with ONNX Runtime optimizations for optimized INT8 ONNX model
- CPU: with ONNX Runtime optimizations for optimized INT4 ONNX model
- GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model
- GPU: with ONNX Runtime optimizations for optimized INT4 ONNX model
- GPU: with QLoRA for model fine-tuning and ONNX Runtime optimizations for optimized ONNX model
- AzureML compute: with AzureML compute to fine-tune and optimize for your local GPUs
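
The CPU and GPU rows lean on ONNX Runtime's graph optimizations. A minimal sketch of applying them offline and saving the optimized graph, with placeholder file names:

```python
# Sketch: run ONNX Runtime's built-in graph optimizations once, offline, and
# write the optimized model to disk so it is not re-optimized at every load.
import onnxruntime as ort

so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "llama2_optimized.onnx"  # placeholder output

# Creating the session triggers optimization and writes the optimized model.
ort.InferenceSession("llama2.onnx", so, providers=["CPUExecutionProvider"])
```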

### llama3 (Link)

- QDQ: QDQ Model with 4-bit Weights & 16-bit Activations (see the sketch below)
- QNN EP: PTQ + AOT Compilation for Qualcomm NPUs using QNN EP
- Vitis AI EP: PTQ + AOT Compilation for AMD NPUs using Vitis AI EP
- OpenVINO EP: PTQ + AOT Compilation for OpenVINO EP
- Intel® NPU: PTQ + AWQ with 4-bit weight compression using Intel® Optimum OpenVINO for ONNX OpenVINO IR Encapsulated Model
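
The 4-bit-weight recipe is assembled from Olive passes; one underlying building block is ONNX Runtime's block-wise INT4 weight-only quantizer for MatMul nodes. A hedged sketch (constructor arguments vary across ORT releases, and this covers only the weight side of the W4A16 recipe):

```python
# Hedged sketch of ONNX Runtime's INT4 weight-only quantization for MatMul
# nodes (block-wise, symmetric). Argument names can differ across ORT releases.
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("llama3.onnx")  # placeholder path
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()
quantizer.model.save_model_to_file("llama3_int4.onnx", use_external_data_format=True)
```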

### mistral (Link)

- CPU: with Optimum conversion, ONNX Runtime optimizations, and Intel® Neural Compressor static quantization for optimized INT8 ONNX model
- GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model

### open llama (Link)

- GPU: with Optimum conversion and merging, and ONNX Runtime optimizations for optimized ONNX model (see the sketch below)
- GPU: with SparseGPT and TorchTRT conversion for an optimized PyTorch model with sparsity
- AzureML compute: with Optimum conversion and merging, and ONNX Runtime optimizations in AzureML
- CPU: with Optimum conversion and merging, ONNX Runtime optimizations, and Intel® Neural Compressor 4-bit weight-only quantization for optimized INT4 ONNX model
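
"Optimum conversion" here means exporting the Hugging Face checkpoint to ONNX with Optimum. A minimal sketch (checkpoint id and output directory are illustrative, not taken from the example itself):

```python
# Sketch: export a Hugging Face causal-LM checkpoint to ONNX via Optimum's
# ONNX Runtime integration. Checkpoint id and output path are placeholders.
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b", export=True
)
model.save_pretrained("open_llama_onnx")
```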

### phi2 (Link)

- CPU: with ONNX Runtime optimizations for FP32/INT4
- GPU: with ONNX Runtime optimizations for FP16/INT4, with PyTorch QLoRA for model fine-tuning
- GPU: with SliceGPT for an optimized PyTorch model with sparsity

### phi3.5 (Link)

- QDQ: QDQ Model with 4-bit Weights & 16-bit Activations
- QNN EP: PTQ + AOT Compilation for Qualcomm NPUs using QNN EP
- Vitis AI EP: PTQ + AOT Compilation for AMD NPUs using Vitis AI EP
- OpenVINO EP: PTQ + AOT Compilation for OpenVINO EP
- Intel® NPU: PTQ + AWQ with 4-bit weight compression using Intel® Optimum OpenVINO for ONNX OpenVINO IR Encapsulated Model

### phi4 (Link)

- Intel® NPU: PTQ + AWQ with 4-bit weight compression using Intel® Optimum OpenVINO for ONNX OpenVINO IR Encapsulated Model (see the sketch below)
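
A hedged sketch of the 4-bit weight compression path through Optimum Intel's OpenVINO backend (API names follow recent optimum-intel releases and may differ in yours; the model id is illustrative):

```python
# Hedged sketch: export to OpenVINO IR with 4-bit weight compression via
# Optimum Intel. Check your optimum-intel version for the exact config class.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(
    "microsoft/phi-4",  # illustrative checkpoint id
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained("phi4_ov_int4")  # writes OpenVINO IR (.xml/.bin)
```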

### qwen2.5 (Link)

- QDQ: QDQ Model with 4-bit Weights & 16-bit Activations
- QNN EP: PTQ + AOT Compilation for Qualcomm NPUs using QNN EP
- Vitis AI EP: PTQ + AOT Compilation for AMD NPUs using Vitis AI EP
- OpenVINO EP: PTQ + AOT Compilation for OpenVINO EP
- Intel® NPU: PTQ + AWQ with 4-bit weight compression using Intel® Optimum OpenVINO for ONNX OpenVINO IR Encapsulated Model

### falcon (Link)

- GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model

### red pajama (Link)

- CPU: with Optimum conversion and merging, and ONNX Runtime optimizations for a single optimized ONNX model

### bert (Link)

- CPU: with ONNX Runtime optimizations and quantization for optimized INT8 ONNX model
- CPU: with ONNX Runtime optimizations and Intel® Neural Compressor quantization for optimized INT8 ONNX model
- CPU: with PyTorch QAT Customized Training Loop and ONNX Runtime optimizations for optimized INT8 ONNX model
- GPU: with ONNX Runtime optimizations for CUDA EP
- GPU: with ONNX Runtime optimizations for TRT EP
- QNN EP: with ONNX Runtime optimizations for QNN EP
- Vitis AI EP: with ONNX Runtime optimizations for Vitis AI EP
- OpenVINO EP: with ONNX Runtime optimizations for OpenVINO EP
- QDQ: with ONNX Runtime optimizations and INT8 quantization encoded in QDQ format (see the sketch below)
- Intel® NPU: PTQ using Intel® NNCF for ONNX OpenVINO IR encapsulated model
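
The QDQ row uses ONNX Runtime static quantization with QDQ encoding. A minimal sketch, assuming the exported model's input names and shapes and a dummy one-batch calibration reader (real calibration needs representative data):

```python
# Sketch of ONNX Runtime static INT8 quantization in QDQ format. The reader
# yields a single dummy batch purely for illustration; input names/shapes are
# assumptions about the exported bert model.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class DummyBertReader(CalibrationDataReader):
    def __init__(self):
        batch = {
            "input_ids": np.zeros((1, 128), dtype=np.int64),
            "attention_mask": np.ones((1, 128), dtype=np.int64),
            "token_type_ids": np.zeros((1, 128), dtype=np.int64),
        }
        self._it = iter([batch])

    def get_next(self):
        return next(self._it, None)

quantize_static(
    "bert.onnx",           # placeholder input model
    "bert_int8_qdq.onnx",  # placeholder output model
    DummyBertReader(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)
```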

### deberta (Link)

- GPU: optimize an AzureML registry model with ONNX Runtime optimizations and quantization

### gptj (Link)

- CPU: with Intel® Neural Compressor static/dynamic quantization for INT8 ONNX model

### bge (Link)

- NPU: with ONNX Runtime optimizations for QNN EP (see the sketch below)
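
Running the result on the Qualcomm NPU means creating an ONNX Runtime session with the QNN EP. A minimal sketch with CPU fallback (the backend library shown is the Windows HTP backend and is platform/SDK dependent; the model path is a placeholder):

```python
# Sketch: load a QDQ-quantized model on the QNN EP with CPU fallback.
# "backend_path" selects the QNN backend library; adjust per platform/SDK.
import onnxruntime as ort

session = ort.InferenceSession(
    "bge_qdq.onnx",  # placeholder model path
    providers=[
        ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
        "CPUExecutionProvider",
    ],
)
```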

### audio spectrogram transformer (Link)

- CPU: with ONNX Runtime optimizations and quantization for optimized INT8 ONNX model

## Vision

### stable diffusion (Link)

- GPU: with ONNX Runtime optimizations for DirectML EP
- GPU: with ONNX Runtime optimizations for CUDA EP
- Intel CPU: with OpenVINO toolkit
- QDQ: with ONNX Runtime static quantization for ONNX INT8 model with QDQ format

### stable diffusion XL (Link)

- GPU: with ONNX Runtime optimizations for DirectML EP
- GPU: with ONNX Runtime optimizations for CUDA EP

### squeezenet (Link)

- GPU: with ONNX Runtime optimizations for DirectML EP

### mobilenet (Link)

- QNN EP: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP

### clip (Link)

- QNN EP: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP
- Vitis AI EP: with ONNX Runtime static QDQ quantization for ONNX Runtime Vitis AI EP
- QDQ: with ONNX Runtime static quantization for ONNX INT8 model with QDQ format
- Intel® NPU: PTQ using Intel® NNCF for ONNX OpenVINO IR encapsulated model (see the sketch below)
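
The Intel® NPU row quantizes the OpenVINO IR with NNCF post-training quantization. A hedged sketch with placeholder paths, an assumed input name, and random calibration data (real calibration uses representative images):

```python
# Hedged sketch of Intel NNCF post-training quantization over an OpenVINO IR.
# File names, the "pixel_values" input name, and the shape are assumptions.
import nncf
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("clip_visual.xml")  # placeholder IR path

# nncf.Dataset wraps an iterable of calibration samples keyed by input name.
calibration = nncf.Dataset(
    [{"pixel_values": np.random.rand(1, 3, 224, 224).astype(np.float32)}
     for _ in range(8)]
)
quantized = nncf.quantize(model, calibration)
ov.save_model(quantized, "clip_visual_int8.xml")
```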

### resnet (Link)

- CPU: with ONNX Runtime static/dynamic quantization for ONNX INT8 model
- QDQ: with ONNX Runtime static quantization for ONNX INT8 model with QDQ format
- CPU: with PyTorch QAT Default Training Loop and ONNX Runtime optimizations for ONNX INT8 model
- CPU: with PyTorch QAT Lightning Module and ONNX Runtime optimizations for ONNX INT8 model
- AMD DPU: with AMD Vitis AI quantization
- Intel GPU: with ONNX Runtime optimizations with multiple EPs
- QNN EP: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP
- Intel® NPU: PTQ using Intel® NNCF for ONNX OpenVINO IR encapsulated model

### VGG (Link)

- Qualcomm NPU: with SNPE toolkit

### super resolution (Link)

- CPU: with ONNX Runtime pre/post-processing integration for a single ONNX model

### Vision Transformer (Link)

- QNN EP: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP
- Vitis AI EP: with ONNX Runtime static QDQ quantization for ONNX Runtime Vitis AI EP
- QDQ: with ONNX Runtime static quantization for ONNX INT8 model with QDQ format
- Intel® NPU: PTQ using Intel® NNCF for ONNX OpenVINO IR encapsulated model

### Table Transformer Detection (Link)

- QNN EP: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP