# Examples

| Scenario | Model | Examples | Hardware Targeted Optimization |
| --- | --- | --- | --- |
| NLP | deepseek | Link | QDQ: QDQ model with 4-bit weights & 16-bit activations<br>Qualcomm NPU: PTQ + AOT compilation for Qualcomm NPUs using QNN EP |
| | llama2 | Link | CPU: with ONNX Runtime optimizations for optimized FP32 ONNX model<br>CPU: with ONNX Runtime optimizations for optimized INT8 ONNX model<br>CPU: with ONNX Runtime optimizations for optimized INT4 ONNX model<br>GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model<br>GPU: with ONNX Runtime optimizations for optimized INT4 ONNX model<br>GPU: with QLoRA for model fine-tuning and ONNX Runtime optimizations for optimized ONNX model<br>AzureML compute: with AzureML compute to fine-tune and optimize for your local GPUs |
| | llama3 | Link | QDQ: QDQ model with 4-bit weights & 16-bit activations<br>Qualcomm NPU: PTQ + AOT compilation for Qualcomm NPUs using QNN EP |
| | mistral | Link | CPU: with Optimum conversion, ONNX Runtime optimizations, and Intel® Neural Compressor static quantization for optimized INT8 ONNX model<br>GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model |
| | open llama | Link | GPU: with Optimum conversion and merging and ONNX Runtime optimizations for optimized ONNX model<br>GPU: with SparseGPT and TorchTRT conversion for an optimized PyTorch model with sparsity<br>AzureML compute: with Optimum conversion and merging and ONNX Runtime optimizations in AzureML<br>CPU: with Optimum conversion and merging, ONNX Runtime optimizations, and Intel® Neural Compressor 4-bit weight-only quantization for optimized INT4 ONNX model |
| | phi2 | Link | CPU: with ONNX Runtime optimizations for FP32/INT4<br>GPU: with ONNX Runtime optimizations for FP16/INT4, plus PyTorch QLoRA for model fine-tuning<br>GPU: with SliceGPT for an optimized PyTorch model with sparsity |
| | phi3.5 | Link | QDQ: QDQ model with 4-bit weights & 16-bit activations<br>Qualcomm NPU: PTQ + AOT compilation for Qualcomm NPUs using QNN EP |
| | qwen2.5 | Link | QDQ: QDQ model with 4-bit weights & 16-bit activations<br>Qualcomm NPU: PTQ + AOT compilation for Qualcomm NPUs using QNN EP |
| | falcon | Link | GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model |
| | red pajama | Link | CPU: with Optimum conversion and merging and ONNX Runtime optimizations for a single optimized ONNX model |
| | bert | Link | CPU: with ONNX Runtime optimizations and quantization for optimized INT8 ONNX model<br>CPU: with ONNX Runtime optimizations and Intel® Neural Compressor quantization for optimized INT8 ONNX model<br>CPU: with PyTorch QAT customized training loop and ONNX Runtime optimizations for optimized INT8 ONNX model<br>GPU: with ONNX Runtime optimizations for CUDA EP<br>GPU: with ONNX Runtime optimizations for TRT EP<br>NPU: with ONNX Runtime optimizations for QNN EP<br>QDQ: with ONNX Runtime optimizations and INT8 quantization encoded in QDQ format |
| | deberta | Link | GPU: optimize AzureML registry model with ONNX Runtime optimizations and quantization |
| | gptj | Link | CPU: with Intel® Neural Compressor static/dynamic quantization for INT8 ONNX model |
| | bge | Link | NPU: with ONNX Runtime optimizations for QNN EP |
| Audio | whisper | Link | CPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP32<br>CPU: with ONNX Runtime optimizations for all-in-one ONNX model in INT8<br>CPU: with ONNX Runtime optimizations and Intel® Neural Compressor dynamic quantization for all-in-one ONNX model in INT8<br>GPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP32<br>GPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP16<br>GPU: with ONNX Runtime optimizations for all-in-one ONNX model in INT8 |
| | audio spectrogram transformer | Link | CPU: with ONNX Runtime optimizations and quantization for optimized INT8 ONNX model |
| Vision | stable diffusion<br>stable diffusion XL | Link | GPU: with ONNX Runtime optimization for DirectML EP<br>GPU: with ONNX Runtime optimization for CUDA EP<br>Intel CPU: with OpenVINO toolkit |
| | squeezenet | Link | GPU: with ONNX Runtime optimizations for DirectML EP |
| | mobilenet | Link | Qualcomm NPU: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP |
| | clip | Link | Qualcomm NPU: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP<br>QDQ: with ONNX Runtime static quantization for ONNX INT8 model in QDQ format |
| | resnet | Link | CPU: with ONNX Runtime static/dynamic quantization for ONNX INT8 model<br>QDQ: with ONNX Runtime static quantization for ONNX INT8 model in QDQ format<br>CPU: with PyTorch QAT default training loop and ONNX Runtime optimizations for ONNX INT8 model<br>CPU: with PyTorch QAT Lightning Module and ONNX Runtime optimizations for ONNX INT8 model<br>AMD DPU: with AMD Vitis-AI quantization<br>Intel GPU: with ONNX Runtime optimizations with multiple EPs<br>Qualcomm NPU: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP |
| | VGG | Link | Qualcomm NPU: with SNPE toolkit |
| | inception | Link | Qualcomm NPU: with SNPE toolkit |
| | super resolution | Link | CPU: with ONNX Runtime pre/post-processing integration for a single ONNX model |
| | Vision Transformer | Link | Qualcomm NPU: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP<br>QDQ: with ONNX Runtime static quantization for ONNX INT8 model in QDQ format |
| | Table Transformer Detection | Link | Qualcomm NPU: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP |