ONNX related – General¶
Olive provides multiple Passes that execute optimization tools related to ONNX. ONNX is an open format built to represent machine learning models. ONNX Runtime is a cross-platform machine-learning model accelerator, with a flexible interface to integrate hardware-specific libraries.
Olive provides easy access to the model optimization tools available in ONNX Runtime.
Model Conversion¶
The user might not have a model ready in the ONNX format. OnnxConversion converts PyTorch models to ONNX using torch.onnx.
Please refer to OnnxConversion for more details about the pass and its config parameters.
Example Configuration¶
{
"type": "OnnxConversion",
"config": {
"target_opset": 13
}
}
Model Optimizer¶
OnnxModelOptimizer optimizes an ONNX model by fusing nodes. Fusing nodes involves merging multiple nodes in a model into a single node to reduce the computational cost and improve the performance of the model.
The optimization process involves analyzing the structure of the ONNX model and identifying nodes that can be fused.
Please refer to OnnxModelOptimizer for more details about the pass and its config parameters.
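To make the idea of node fusion concrete, here is a toy sketch that merges a MatMul followed by an Add into a single Gemm node. The list-of-dicts graph and the `fuse_matmul_add` helper are hypothetical stand-ins for illustration only; they are not the ONNX API and not how OnnxModelOptimizer is implemented, and the sketch assumes 2-D inputs for which this rewrite is valid.

```python
def fuse_matmul_add(nodes):
    """Return a new node list where adjacent MatMul -> Add pairs become Gemm."""
    # Count consumers of each tensor so a MatMul is fused only when its
    # output feeds the following Add and nothing else.
    consumers = {}
    for node in nodes:
        for name in node["inputs"]:
            consumers[name] = consumers.get(name, 0) + 1

    fused = []
    skip = set()
    for i, node in enumerate(nodes):
        if i in skip:
            continue
        nxt = nodes[i + 1] if i + 1 < len(nodes) else None
        if (
            node["op"] == "MatMul"
            and nxt is not None
            and nxt["op"] == "Add"
            and node["outputs"][0] in nxt["inputs"]
            and consumers.get(node["outputs"][0], 0) == 1
        ):
            # The remaining Add input is treated as the bias of the fused node.
            bias = [n for n in nxt["inputs"] if n != node["outputs"][0]]
            fused.append(
                {"op": "Gemm", "inputs": node["inputs"] + bias, "outputs": nxt["outputs"]}
            )
            skip.add(i + 1)
        else:
            fused.append(node)
    return fused


graph = [
    {"op": "MatMul", "inputs": ["x", "W"], "outputs": ["xw"]},
    {"op": "Add", "inputs": ["xw", "b"], "outputs": ["y"]},
    {"op": "Relu", "inputs": ["y"], "outputs": ["out"]},
]
optimized = fuse_matmul_add(graph)
print([n["op"] for n in optimized])
```

The fused graph has one node fewer and no intermediate `xw` tensor, which is where the savings in computation and memory traffic come from.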
Example Configuration¶
{
"type": "OnnxModelOptimizer"
}
ORT Transformers Optimization¶
While ONNX Runtime automatically applies most optimizations when loading transformer models, some of the latest optimizations have not yet been integrated into ONNX Runtime.
OrtTransformersOptimization provides an offline capability to optimize transformer models in scenarios where ONNX Runtime does not apply the optimization at load time.
These optimizations are provided by onnxruntime through onnxruntime.transformers. Please refer to the corresponding documentation for more details on the optimizations done by this tool.
Please refer to OrtTransformersOptimization for more details about the pass and its config parameters.
Example Configuration¶
{
"type": "OrtTransformersOptimization",
"config": {"model_type": "bert"}
}
Append Pre/Post Processing Ops¶
AppendPrePostProcessingOps inserts pre- and post-processing ops into the ONNX graph.
Example Configuration¶
{
"type": "AppendPrePostProcessingOps",
"config": {
"tool_command": "superresolution",
"tool_command_args": {
"output_format": "png"
}
}
}
{
"type": "AppendPrePostProcessingOps",
"config": {
"tool_command": "whisper",
"tool_command_args": {
"use_audio_decoder": true
}
}
}
Insert Beam Search Op¶
InsertBeamSearch chains two model components (for example, an encoder and a decoder) together by inserting a beam search op between them.
Example Configuration¶
{
"type": "InsertBeamSearch",
"config": {"no_repeat_ngram_size": 4}
}
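The `no_repeat_ngram_size` option in the configuration above is commonly understood in beam search as "never generate an n-gram of this size twice". The sketch below illustrates that rule in plain Python; `banned_next_tokens` is a hypothetical helper written for this page, and it is an assumption that the inserted beam search op follows the same convention.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would repeat an n-gram already present in `tokens`."""
    if ngram_size <= 0 or len(tokens) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tokens[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


# Token ids generated so far; the sequence 5, 6, 7 has just been repeated.
history = [5, 6, 7, 8, 5, 6, 7]
print(banned_next_tokens(history, 4))
```

With a size of 4, emitting token 8 next would recreate the 4-gram 5, 6, 7, 8, so a beam search applying this rule would exclude it at this step.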
Post Training Quantization (PTQ)¶
Quantization is a technique to compress deep learning models by reducing the precision of the model weights from 32 bits to 8 bits. This technique is used to reduce the memory footprint and improve the inference performance of the model. Quantization can be applied to the weights of the model, the activations of the model, or both.
There are two ways to quantize a model in onnxruntime:
Dynamic Quantization: Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically, so no calibration dataset is required.
These calculations increase the cost of inference but usually achieve higher accuracy than static quantization.
Static Quantization: Static quantization runs the model on a set of inputs called calibration data. The user must therefore provide a calibration dataset, which is used to calculate the quantization parameters (scale and zero point) for activations before the model is quantized.
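The difference between the two methods comes down to when the scale and zero point are computed. The following is a minimal sketch of asymmetric 8-bit quantization in plain Python, assuming a simple min/max range; the helper names are hypothetical, and onnxruntime's quantizer offers more options than this illustrates.

```python
def quant_params(values, qmin=0, qmax=255):
    """Compute (scale, zero_point) from the min/max of `values`."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(quantized, scale, zero_point):
    return [(q - zero_point) * scale for q in quantized]


def max_error(values, scale, zero_point):
    restored = dequantize(quantize(values, scale, zero_point), scale, zero_point)
    return max(abs(a - b) for a, b in zip(values, restored))


# Static: parameters are fixed ahead of time from calibration data.
calibration = [-1.0, 0.5, 3.0]
static_scale, static_zp = quant_params(calibration)

# Dynamic: parameters are recomputed from the activations seen at inference.
activations = [-0.2, 0.13, 0.4]
dynamic_scale, dynamic_zp = quant_params(activations)

static_err = max_error(activations, static_scale, static_zp)
dynamic_err = max_error(activations, dynamic_scale, dynamic_zp)
print(static_err, dynamic_err)
```

In this example the dynamic range fits the actual activations more tightly than the calibration range does, so the round-trip error is smaller, at the price of computing the parameters on every inference call.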
Quantize with onnxruntime¶
Olive consolidates dynamic and static quantization into a single pass called OnnxQuantization, and provides the user with the ability to tune both the quantization method and its hyperparameters at the same time.
If the user wants to tune only dynamic or only static quantization, Olive also supports them through OnnxDynamicQuantization and OnnxStaticQuantization respectively.
Please refer to OnnxQuantization, OnnxDynamicQuantization and OnnxStaticQuantization for more details about the passes and their config parameters.
Quantize with Intel® Neural Compressor¶
In addition to the default onnxruntime quantization tool, Olive also integrates Intel® Neural Compressor.
Intel® Neural Compressor is a model compression tool for popular deep learning frameworks including TensorFlow, PyTorch, ONNX Runtime (ORT) and MXNet. It supports a variety of powerful model compression techniques, e.g., quantization, pruning and distillation. As a user-experience-driven and hardware-friendly tool, Intel® Neural Compressor focuses on providing users with an easy-to-use interface and strives to reach the “quantize once, run everywhere” goal.
Olive consolidates the Intel® Neural Compressor dynamic and static quantization into a single pass called IncQuantization, and provides the user with the ability to tune both the quantization method and its hyperparameters at the same time.
If the user wants to tune only dynamic or only static quantization, Olive also supports them through IncDynamicQuantization and IncStaticQuantization respectively.
Please refer to IncQuantization, IncDynamicQuantization and IncStaticQuantization for more details about the passes and their config parameters.
Quantize with AMD Vitis AI Quantizer¶
Olive also integrates AMD Vitis AI Quantizer for quantization.
The Vitis™ AI development environment accelerates AI inference on AMD® hardware platforms. The Vitis AI quantizer can reduce computing complexity by converting 32-bit floating-point weights and activations to a fixed-point format such as INT8. The fixed-point network model requires less memory bandwidth, thus providing faster speed and higher power efficiency than the floating-point model. Olive consolidates Vitis™ AI quantization into a single pass called VitisAIQuantization, which supports power-of-2 scale quantization methods and the Vitis AI Execution Provider.
Please refer to VitisAIQuantization for more details about the pass and its config parameters.
Example Configuration¶
a. Tune the parameters of the OlivePass with pre-defined searchable values
{
"type": "OnnxQuantization",
"config": {
"user_script": "./user_script.py",
"dataloader_func": "glue_calibration_reader"
}
}
b. Select parameters to tune
{
"type": "OnnxQuantization",
"config": {
// select per_channel to tune with "SEARCHABLE_VALUES".
// other parameters will use the default value, not to be tuned.
"per_channel": "SEARCHABLE_VALUES",
"user_script": "./user_script.py",
"dataloader_func": "glue_calibration_reader"
},
"disable_search": true
}
c. Use default values of the OlivePass (no tuning in this way)
{
"type": "OnnxQuantization",
"config": {
// set per_channel to "DEFAULT_VALUE"
"per_channel": "DEFAULT_VALUE",
"user_script": "./user_script.py",
"dataloader_func": "glue_calibration_reader"
},
"disable_search": true
}
d. Specify parameters with user defined values
"onnx_quantization": {
"type": "OnnxQuantization",
"config": {
// set per_channel to true.
"per_channel": true,
"user_script": "./user_script.py",
"dataloader_func": "glue_calibration_reader"
},
"disable_search": true
}
Check out this file for an example implementation of "user_script.py" and "glue_calibration_reader".
Check out this file for an example of Intel® Neural Compressor quantization.
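The four OnnxQuantization variants above differ only in the value given to per_channel and in whether disable_search is set. As a minimal sketch, assuming the pass configs are assembled in Python before being written to a JSON file, they could be generated as follows; `quantization_pass` is a hypothetical helper for illustration and is not part of Olive.

```python
import json


def quantization_pass(per_channel=None, disable_search=None):
    """Build an OnnxQuantization pass config dict like the variants above."""
    config = {
        "user_script": "./user_script.py",
        "dataloader_func": "glue_calibration_reader",
    }
    if per_channel is not None:
        config["per_channel"] = per_channel
    pass_config = {"type": "OnnxQuantization", "config": config}
    if disable_search is not None:
        pass_config["disable_search"] = disable_search
    return pass_config


variants = {
    # a. every parameter uses its pre-defined searchable values
    "a_search_all": quantization_pass(),
    # b. only per_channel is marked for tuning
    "b_search_selected": quantization_pass("SEARCHABLE_VALUES", True),
    # c. per_channel uses the pass default, nothing is tuned
    "c_defaults": quantization_pass("DEFAULT_VALUE", True),
    # d. per_channel is fixed to a user-defined value
    "d_user_value": quantization_pass(True, True),
}

print(json.dumps(variants["d_user_value"], indent=4))
```

Serializing with `json.dumps` produces strict JSON without comments, which any standard JSON parser will accept.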
ORT Performance Tuning¶
ONNX Runtime provides high performance across a range of hardware options through its Execution Providers interface for different execution
environments.
For each model running with each execution provider, there are settings that can be tuned (e.g., thread count and execution mode) to improve performance.
OrtPerfTuning covers basic knobs that can be leveraged to find the best performance for your model and hardware.
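Conceptually, tuning such knobs amounts to measuring the model under each candidate combination and keeping the fastest one. The sketch below shows that idea as a plain grid search; it is not how OrtPerfTuning is implemented. The `tune` function, the dictionary keys, and `fake_benchmark` (a stand-in for a real latency measurement) are all assumptions made for illustration.

```python
import itertools


def tune(benchmark, thread_counts, execution_modes):
    """Try every combination and return (best_settings, best_latency)."""
    best = None
    for threads, mode in itertools.product(thread_counts, execution_modes):
        settings = {"intra_op_num_threads": threads, "execution_mode": mode}
        latency = benchmark(settings)
        if best is None or latency < best[1]:
            best = (settings, latency)
    return best


def fake_benchmark(settings):
    # Hypothetical latency model: more threads help until overhead dominates,
    # and parallel execution costs an extra 10 percent.
    threads = settings["intra_op_num_threads"]
    base = 8.0 / threads + 0.5 * threads
    return base * (1.1 if settings["execution_mode"] == "parallel" else 1.0)


best_settings, best_latency = tune(
    fake_benchmark, [1, 2, 4, 8], ["sequential", "parallel"]
)
print(best_settings, best_latency)
```

In practice the benchmark would run the model on representative inputs, which is why the pass takes a dataloader in its configuration.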
Example Configuration¶
{
"type": "OrtPerfTuning",
"config": {
"user_script": "user_script.py",
"dataloader_func": "create_dataloader",
"batch_size": 1
}
}
Check out this file for an example implementation of "user_script.py" and "create_dataloader".
Float16 Conversion¶
Converting a model to use Float16 instead of Float32 can decrease the model size and improve performance on some GPUs. The OnnxFloatToFloat16 pass wraps onnxconverter_common.float16.convert_float_to_float16, which converts most nodes/operators to use Float16 instead of Float32.
Conversion to Float16 is often exposed at multiple stages of optimization, including model conversion and transformer optimization. This stand-alone pass is best suited for models that are not transformer architectures, because fusions in transformer models may rely on specific data types in node patterns.
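The trade-off behind Float16 conversion can be seen without any ML library. The sketch below uses Python's standard `struct` module to round-trip a few values through IEEE 754 half precision; it only illustrates the number format and is not what the pass itself executes. The sample values are arbitrary.

```python
import struct


def to_float16_and_back(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Arbitrary sample weights: a typical small value, an exactly representable
# value, a large value, and a value below the float16 subnormal range.
weights = [0.1, 1.0, 1234.5678, 1e-8]
for w in weights:
    print(w, "->", to_float16_and_back(w))

# Bytes per element: float32 versus float16.
print(struct.calcsize("<f"), struct.calcsize("<e"))
```

Each element takes half the storage, but large values lose their fractional digits and very small values flush to zero, which is why some models need certain inputs, outputs, or ops kept in Float32.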
Example Configuration¶
a. The most basic configuration, which is suitable for many models, leaves all configuration options set to their default values:
{
"type": "OnnxFloatToFloat16"
}
b. More fine-grained control of the conversion conditions is also possible:
{
"type": "OnnxFloatToFloat16",
"config": {
// Don't convert input/output nodes to Float16
"keep_io_types": true
}
}
See Float16 Conversion for a more detailed description of the available configuration parameters.
Mixed Precision Conversion¶
This pass converts a model to mixed precision.
If float16 conversion gives poor results, you can convert most of the ops to float16 but leave some in float32. The OrtMixedPrecision pass finds a minimal set of ops to skip while retaining a certain level of accuracy.
The default value for op_block_list is ["SimplifiedLayerNormalization", "SkipSimplifiedLayerNormalization", "Relu", "Add"].
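The effect of op_block_list can be pictured as a simple partition of the model's ops: anything on the list stays in float32, everything else is converted to float16. The sketch below shows only that partitioning step; the `assign_precision` helper and the sample op sequence are hypothetical, and it does not reproduce how the pass searches for a minimal set of ops.

```python
DEFAULT_OP_BLOCK_LIST = [
    "SimplifiedLayerNormalization",
    "SkipSimplifiedLayerNormalization",
    "Relu",
    "Add",
]


def assign_precision(op_types, op_block_list=DEFAULT_OP_BLOCK_LIST):
    """Pair each op type with the precision it would keep."""
    blocked = set(op_block_list)
    return [(op, "float32" if op in blocked else "float16") for op in op_types]


# Hypothetical op sequence from a small model.
model_ops = ["MatMul", "Add", "Relu", "MatMul", "Softmax"]
plan = assign_precision(model_ops)
for op, precision in plan:
    print(op, precision)
```

Passing a custom list as the second argument mirrors overriding op_block_list in the pass configuration.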
Example Configuration¶
a. The most basic configuration, which is suitable for many models, leaves all configuration options set to their default values:
{
"type": "OrtMixedPrecision"
}
b. More fine-grained control of the conversion conditions is also possible:
{
"type": "OrtMixedPrecision",
"config": {
"op_block_list": [
"Add",
"LayerNormalization",
"SkipLayerNormalization",
"FastGelu",
"EmbedLayerNormalization"
]
}
}