| Category | Model | Link | Description |
|----------|-------|------|-------------|
| NLP | llama2 | Link | CPU: with ONNX Runtime optimizations for optimized FP32 ONNX model<br>CPU: with ONNX Runtime optimizations for optimized INT8 ONNX model<br>CPU: with ONNX Runtime optimizations for optimized INT4 ONNX model<br>GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model<br>GPU: with ONNX Runtime optimizations for optimized INT4 ONNX model<br>GPU: with QLoRA for model fine-tuning and ONNX Runtime optimizations for optimized ONNX model<br>AzureML compute: with AzureML compute to fine-tune and optimize for your local GPUs |
| | mistral | Link | CPU: with Optimum conversion, ONNX Runtime optimizations, and Intel® Neural Compressor static quantization for optimized INT8 ONNX model<br>GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model |
| | open llama | Link | GPU: with Optimum conversion and merging plus ONNX Runtime optimizations for optimized ONNX model<br>GPU: with SparseGPT and TorchTRT conversion for an optimized PyTorch model with sparsity<br>AzureML compute: with Optimum conversion and merging plus ONNX Runtime optimizations in AzureML<br>CPU: with Optimum conversion and merging, ONNX Runtime optimizations, and Intel® Neural Compressor 4-bit weight-only quantization for optimized INT4 ONNX model |
| | phi2 | Link | CPU: with ONNX Runtime optimizations for optimized FP32/INT4 ONNX models<br>GPU: with ONNX Runtime optimizations for optimized FP16/INT4 ONNX models, and with PyTorch QLoRA for model fine-tuning<br>GPU: with SliceGPT for an optimized PyTorch model with sparsity |
| | falcon | Link | GPU: with ONNX Runtime optimizations for optimized FP16 ONNX model |
| | red pajama | Link | CPU: with Optimum conversion and merging plus ONNX Runtime optimizations for a single optimized ONNX model |
| | bert | Link | CPU: with ONNX Runtime optimizations and quantization for optimized INT8 ONNX model<br>CPU: with ONNX Runtime optimizations and Intel® Neural Compressor quantization for optimized INT8 ONNX model<br>CPU: with PyTorch QAT (customized training loop) and ONNX Runtime optimizations for optimized INT8 ONNX model<br>GPU: with ONNX Runtime optimizations for CUDA EP<br>GPU: with ONNX Runtime optimizations for TRT EP |
| | deberta | Link | GPU: optimize an AzureML registry model with ONNX Runtime optimizations and quantization |
| | gptj | Link | CPU: with Intel® Neural Compressor static/dynamic quantization for INT8 ONNX model |
| Audio | whisper | Link | CPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP32<br>CPU: with ONNX Runtime optimizations for all-in-one ONNX model in INT8<br>CPU: with ONNX Runtime optimizations and Intel® Neural Compressor dynamic quantization for all-in-one ONNX model in INT8<br>GPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP32<br>GPU: with ONNX Runtime optimizations for all-in-one ONNX model in FP16<br>GPU: with ONNX Runtime optimizations for all-in-one ONNX model in INT8 |
| | audio spectrogram transformer | Link | CPU: with ONNX Runtime optimizations and quantization for optimized INT8 ONNX model |
| Vision | stable diffusion<br>stable diffusion XL | Link | GPU: with ONNX Runtime optimizations for DirectML EP<br>GPU: with ONNX Runtime optimizations for CUDA EP<br>Intel CPU: with OpenVINO toolkit |
| | squeezenet | Link | GPU: with ONNX Runtime optimizations for DirectML EP |
| | mobilenet | Link | Qualcomm NPU: with ONNX Runtime static QDQ quantization for ONNX Runtime QNN EP |
| | resnet | Link | CPU: with ONNX Runtime static/dynamic quantization for INT8 ONNX model<br>CPU: with PyTorch QAT (default training loop) and ONNX Runtime optimizations for INT8 ONNX model<br>CPU: with PyTorch QAT (Lightning module) and ONNX Runtime optimizations for INT8 ONNX model<br>AMD DPU: with AMD Vitis AI quantization<br>Intel GPU: with ONNX Runtime optimizations with multiple EPs |
| | VGG | Link | Qualcomm NPU: with SNPE toolkit |
| | inception | Link | Qualcomm NPU: with SNPE toolkit |
| | super resolution | Link | CPU: with ONNX Runtime pre/post-processing integration for a single ONNX model |
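Many of the recipes in the table produce INT8 or INT4 models through quantization. As a rough illustration of what that step does numerically, here is a minimal sketch of uniform affine (scale and zero-point) quantization in plain Python, assuming per-tensor granularity and simple min/max range selection. The function names `quantize_affine` and `dequantize_affine` are hypothetical; this is not the API of Olive, ONNX Runtime, or Intel® Neural Compressor, which choose ranges and granularity with their own calibration logic.

```python
def quantize_affine(values, bits=8, signed=True):
    """Map floats to integers in [qmin, qmax] using one scale and zero point.

    Hypothetical helper: per-tensor, min/max range, zero kept exactly representable.
    """
    if signed:
        qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        qmin, qmax = 0, (1 << bits) - 1
    # Extend the range to include 0.0 so zero (e.g. padding) dequantizes exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    # Integer that represents 0.0; clamped to the representable range.
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_affine(q, scale, zero_point):
    """Recover approximate floats: x ~= (q - zero_point) * scale."""
    return [(x - zero_point) * scale for x in q]


if __name__ == "__main__":
    weights = [-1.2, -0.5, 0.0, 0.3, 2.1]
    for bits in (8, 4):
        q, scale, zp = quantize_affine(weights, bits=bits)
        restored = dequantize_affine(q, scale, zp)
        err = max(abs(a - b) for a, b in zip(weights, restored))
        print(f"INT{bits}: q={q} scale={scale:.5f} zero_point={zp} max_err={err:.5f}")
```

With this scheme the round-trip error for in-range values stays within half a quantization step (`scale / 2`), and the step is much coarser at 4 bits than at 8 bits, which is why INT4 recipes generally quantize weights only. Static quantization fixes `scale` and `zero_point` ahead of time from calibration data, while dynamic quantization computes them for activations at inference time; INT4 weight-only schemes usually apply the mapping per block of weights rather than per tensor.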