Examples

Scenario | Model | Examples | Hardware Targeted Optimization

NLP

llama2

Link

CPU: with ONNX Runtime optimizations for optimized FP32 ONNX model
CPU: with ONNX Runtime optimizations for optimized INT8 ONNX model
CPU: with ONNX Runtime optimizations for optimized INT4 ONNX model
GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model
GPU: with ONNX Runtime optimizations for optimized INT4 ONNX model
GPU: with QLoRA for model fine-tuning and ONNX Runtime optimizations for optimized INT4 ONNX model
AzureML compute: fine-tune on AzureML compute and optimize for your local GPUs

mistral

Link

CPU: with Optimum conversion and ONNX Runtime optimizations and Intel® Neural Compressor static quantization for optimized INT8 ONNX model
GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model

open llama

Link

GPU: with Optimum conversion and merging and ONNX Runtime optimizations for optimized ONNX model
GPU: with SparseGPT and TorchTRT conversion for an optimized PyTorch model with sparsity
GPU: with PyTorch LoRA/QLoRA/LoftQ for model fine-tuning
GPU: with ONNX Runtime QLoRA for model fine-tuning
AzureML compute: with Optimum conversion and merging and ONNX Runtime optimizations in AzureML
CPU: with Optimum conversion and merging and ONNX Runtime optimizations and Intel® Neural Compressor 4-bit weight-only quantization for optimized INT4 ONNX model

phi

Link

GPU: with PyTorch QLoRA for model fine-tuning

phi2

Link

CPU: with ONNX Runtime optimizations for FP32/INT4
GPU: with ONNX Runtime optimizations for FP16/INT4, with PyTorch QLoRA for model fine-tuning
GPU: with SliceGPT for an optimized PyTorch model with sparsity

falcon

Link

GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model

red pajama

Link

CPU: with Optimum conversion and merging and ONNX Runtime optimizations for a single optimized ONNX model

bert

Link

CPU: with ONNX Runtime optimizations and quantization for optimized INT8 ONNX model
CPU: with ONNX Runtime optimizations and Intel® Neural Compressor quantization for optimized INT8 ONNX model
CPU: with PyTorch QAT Customized Training Loop and ONNX Runtime optimizations for optimized INT8 ONNX model
GPU: with ONNX Runtime optimizations for CUDA EP
GPU: with ONNX Runtime optimizations for TRT EP

deberta

Link

GPU: optimize an AzureML Registry model with ONNX Runtime optimizations and quantization

gptj

Link

CPU: with Intel® Neural Compressor static/dynamic quantization for INT8 ONNX model

Audio

whisper

Link

CPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP32
CPU: with ONNX Runtime optimizations for all-in-one ONNX model in INT8
CPU: with ONNX Runtime optimizations and Intel® Neural Compressor dynamic quantization for all-in-one ONNX model in INT8
GPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP32
GPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP16
GPU: with ONNX Runtime optimizations for all-in-one ONNX model in INT8

audio spectrogram transformer

Link

CPU: with ONNX Runtime optimizations and quantization for optimized INT8 ONNX model

Vision

stable diffusion
stable diffusion XL

Link

GPU: with ONNX Runtime optimizations for DirectML EP
GPU: with ONNX Runtime optimizations for CUDA EP
Intel CPU: with OpenVINO toolkit

squeezenet

Link

GPU: with ONNX Runtime optimizations for DirectML EP

mobilenet

Link

Qualcomm NPU: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP

resnet

Link

CPU: with ONNX Runtime static/dynamic quantization for INT8 ONNX model
CPU: with PyTorch QAT Default Training Loop and ONNX Runtime optimizations for INT8 ONNX model
CPU: with PyTorch QAT Lightning Module and ONNX Runtime optimizations for INT8 ONNX model
AMD DPU: with AMD Vitis-AI Quantization
Intel GPU: with ONNX Runtime optimizations with multiple EPs

VGG

Link

Qualcomm NPU: with SNPE toolkit

inception

Link

Qualcomm NPU: with SNPE toolkit

super resolution

Link

CPU: with ONNX Runtime pre/post processing integration for a single ONNX model