ONNX
ONNX is an open graph format to represent machine learning models. ONNX Runtime is a cross-platform machine-learning model accelerator, with a flexible interface to integrate hardware-specific libraries.
Olive provides multiple transformations and optimizations based on various ONNX tools to improve model performance.
Model Optimizer
OnnxPeepholeOptimizer optimizes an ONNX model by fusing nodes. Fusing nodes merges multiple nodes into a single node to reduce computational cost and improve model performance. The optimization process analyzes the structure of the ONNX model and identifies nodes that can be fused.
The pass also inserts a Cast operation for cases where the ArgMax input type is not supported by the execution provider. For example, before ONNX Runtime 1.20, TensorProto.INT64 isn't supported on the CPU or CUDA EP, so a Cast operator is inserted to cast the inputs to TensorProto.INT32.
Please refer to OnnxPeepholeOptimizer for more details about the pass and its config parameters.
Example Configuration
{
"type": "OnnxPeepholeOptimizer"
}
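To illustrate the Cast insertion described above, here is a minimal sketch of the rewritten pattern built with the onnx helper API; this is not Olive's internal implementation, and the tensor names are hypothetical:

from onnx import TensorProto, helper

# Before: ArgMax consumes an INT64 tensor directly, which older CPU/CUDA EPs reject.
argmax_int64 = helper.make_node("ArgMax", inputs=["scores_int64"], outputs=["indices"], axis=-1)

# After: a Cast to INT32 is inserted in front of ArgMax so the EP can run it.
cast = helper.make_node("Cast", inputs=["scores_int64"], outputs=["scores_int32"], to=TensorProto.INT32)
argmax_int32 = helper.make_node("ArgMax", inputs=["scores_int32"], outputs=["indices"], axis=-1)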
ORT Transformers Optimization
While ONNX Runtime automatically applies most optimizations when loading transformer models, some of the latest optimizations have not yet been integrated into ONNX Runtime.
OrtTransformersOptimization provides an offline capability to optimize transformer models in scenarios where ONNX Runtime does not apply the optimization at load time.
These optimizations are provided by onnxruntime through
onnxruntime.transformers. Please
refer to the corresponding documentation
for more details on the optimizations done by this tool.
Please refer to OrtTransformersOptimization for more details about the pass and its config parameters.
Example Configuration
{
"type": "OrtTransformersOptimization",
"model_type": "bert"
}
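Since these optimizations come from onnxruntime.transformers, the underlying offline optimizer can also be invoked directly. A minimal sketch, where the model path and the num_heads/hidden_size values are illustrative placeholders for your own model:

from onnxruntime.transformers import optimizer

# Offline graph optimization for a BERT-style model.
optimized_model = optimizer.optimize_model(
    "model.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)
optimized_model.save_model_to_file("model_optimized.onnx")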
Append Pre/Post Processing Ops
AppendPrePostProcessingOps
inserts pre and post processing ops into the ONNX graph.
Example Configuration
{
"type": "AppendPrePostProcessingOps",
"tool_command": "superresolution",
"tool_command_args": {
"output_format": "png"
}
}
{
"type": "AppendPrePostProcessingOps",
"tool_command": "whisper",
"tool_command_args": {
"use_audio_decoder": true
}
}
AppendPrePostProcessingOps also supports custom pre/post processing ops by leveraging the onnxruntime-extensions steps and PrePostProcessor.
You can refer to here to see how to leverage PrePostProcessor to customize pre and post processing ops.
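As a rough, version-dependent sketch of what such steps map to in onnxruntime-extensions (the exact imports, step arguments, and model paths below are assumptions and may differ between releases):

import onnx
from onnxruntime_extensions.tools.pre_post_processing import (
    PrePostProcessor,
    create_named_value,
    Resize,
    CenterCrop,
)

# One named uint8 input whose length is a symbolic dimension.
inputs = [create_named_value("image", onnx.TensorProto.UINT8, ["num_bytes"])]

# Build the pipeline at opset 16 and attach resize/crop pre-processing steps.
pipeline = PrePostProcessor(inputs, 16)
pipeline.add_pre_processing([Resize(256), CenterCrop(224, 224)])

model = onnx.load("model.onnx")
new_model = pipeline.run(model)
onnx.save_model(new_model, "model_with_pre_post.onnx")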
Olive introduces two placeholders to represent the model input/output shape dimension value: __model_input__ and __model_output__.
To support the IoMapEntry, the step needs to use the full form. For example:
"YCbCrToPixels": {
"params": {
"layout": "BGR",
},
"io_map": [
["Y1_uint8", 0, 0],
["Cb1_uint8", 0, 1],
["Cr1_uint8", 0, 2],
],
}
The tool_command_args will be used to describe the input parameters to create the PrePostProcessor instance. It is a list of PrePostProcessorInput. The name is the tensor name. The data_type and shape will be used to create the tensor type. The shape can be a list of integers or a list of strings.
Users who write their own pre/post processing steps need to know whether each step uses operators that are built into ONNX Runtime or only supported through onnxruntime-extensions.
For example, some ops such as ConvertImageToBGR require custom operators from onnxruntime-extensions and may be incompatible with ort-web; such ops need to be excluded to generate proper models for that target.
Here is an example that describes the same pre/post processing as the superresolution tool command:
{
"pre": [
{"ConvertImageToBGR": {}},
{
"Resize": {
"resize_to": [
{"type": "__model_input__", "input_index": 0, "dim_index": -2},
{"type": "__model_input__", "input_index": 0, "dim_index": -1},
]
}
},
{
"CenterCrop": {
"height": {"type": "__model_input__", "input_index": 0, "dim_index": -2},
"width": {"type": "__model_input__", "input_index": 0, "dim_index": -1},
}
},
{"PixelsToYCbCr": {"layout": "BGR"}},
{"ImageBytesToFloat": {}},
{"Unsqueeze": {"axes": [0, 1]}},
],
"post": [
{"Squeeze": {"axes": [0, 1]}},
{"FloatToImageBytes": {"name": "Y1_uint8"}},
{
"Resize": {
"params": {
"resize_to": [
{"type": "__model_output__", "output_index": 0, "dim_index": -2},
{"type": "__model_output__", "output_index": 0, "dim_index": -1},
],
"layout": "HW",
},
"io_map": [["PixelsToYCbCr", 1, 0]],
}
},
{"FloatToImageBytes": {"multiplier": 1.0, "name": "Cb1_uint8"}},
{
"Resize": {
"params": {
"resize_to": [
{"type": "__model_output__", "output_index": 0, "dim_index": -2},
{"type": "__model_output__", "output_index": 0, "dim_index": -1},
],
"layout": "HW",
},
"io_map": [["PixelsToYCbCr", 2, 0]],
}
},
{"FloatToImageBytes": {"multiplier": 1.0, "name": "Cr1_uint8"}},
{
"YCbCrToPixels": {
"params": {
"layout": "BGR",
},
"io_map": [
["Y1_uint8", 0, 0],
["Cb1_uint8", 0, 1],
["Cr1_uint8", 0, 2],
],
}
},
{"ConvertBGRToImage": {"image_format": "png"}},
],
"tool_command_args": [
{
"name": "image",
"data_type": "uint8",
"shape": ["num_bytes"],
}
],
"target_opset": 16,
}
Insert Beam Search Op
InsertBeamSearch
chains two model components (for example, encoder and decoder) together by inserting a beam search op between them.
Example Configuration
{
"type": "InsertBeamSearch",
"no_repeat_ngram_size": 4
}
ORT Performance Tuning
ONNX Runtime provides high performance across a range of hardware options through its Execution Providers interface for different execution
environments.
For each model running with each execution provider, there are settings that can be tuned (e.g., thread count, execution mode) to improve performance.
OrtSessionParamsTuning
covers basic knobs that can be leveraged to find the best performance for your model and hardware.
Example Configuration
{
"type": "OrtSessionParamsTuning",
"data_config": "session_params_tuning_data_config",
"batch_size": 1,
"providers_list" : [
[
"CUDAExecutionProvider",
{
"device_id": 0,
"arena_extend_strategy": "kNextPowerOfTwo",
"gpu_mem_limit": 2147483648, // 2 * 1024 * 1024 * 1024,
"cudnn_conv_algo_search": "EXHAUSTIVE",
"do_copy_in_default_stream": true,
},
],
"CPUExecutionProvider",
],
"enable_profiling": false
}
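The knobs the pass searches over correspond to standard ONNX Runtime session creation parameters. A minimal sketch of applying such settings manually (the thread count and model path are illustrative; gpu_mem_limit is the same 2 GiB value, 2 * 1024 * 1024 * 1024 bytes, used in the configuration above):

import onnxruntime as ort

# Session-level knobs of the kind the pass tunes; the values are illustrative.
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 4
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

providers = [
    (
        "CUDAExecutionProvider",
        {
            "device_id": 0,
            "arena_extend_strategy": "kNextPowerOfTwo",
            "gpu_mem_limit": 2 * 1024 * 1024 * 1024,  # 2 GiB
            "cudnn_conv_algo_search": "EXHAUSTIVE",
            "do_copy_in_default_stream": True,
        },
    ),
    "CPUExecutionProvider",
]

# "model.onnx" stands in for the model being tuned.
session = ort.InferenceSession("model.onnx", sess_options, providers=providers)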
Check out this file
for an example implementation of "user_script.py"
and "calib_data_config/dataloader_config/type"
.
Extract Adapters
LoRA, QLoRA and related techniques allow us to fine-tune a pre-trained model by adding a small number of trainable matrices called adapters. The same base model can be used for multiple tasks by adding different adapters for each task. To support using multiple adapters with the same optimized ONNX model, the ExtractAdapters pass extracts the adapter weights from the model and saves them to a separate file. The model graph is then modified in one of the following ways:
Adapter weights are set as external tensors pointing to a non-existent file. The ONNX model is thus invalid by itself as it cannot be loaded. In order to create an inference session using this model, the adapter weights must be added to a session options object using add_initializer or add_external_initializers.
Adapter weights are converted into model inputs. The ONNX model is valid. During inference, the adapter weights must be provided as part of the inputs. We call them constant inputs here since these weights don't change between runs when using one set of adapters. A sketch of supplying the weights in either mode is shown after the example configurations below.
Example Configuration
a. As external initializers
{
"type": "ExtractAdapters",
"make_inputs": false
}
b. As constant inputs with packed weights
{
"type": "ExtractAdapters",
"make_inputs": true,
"pack_inputs": true
}
Please refer to ExtractAdapters for more details about the pass and its config parameters.
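For reference, here is a minimal sketch of consuming the extracted weights at inference time with ONNX Runtime. The file and input names are hypothetical, the weights are assumed to have been saved as a NumPy .npz archive, and each option requires the corresponding model produced by the configurations above:

import numpy as np
import onnxruntime as ort

# Hypothetical archive of the extracted adapter weights, keyed by initializer/input name.
npz = np.load("adapter_weights.npz")
adapters = {name: npz[name] for name in npz.files}

# Option a (make_inputs=false): register the weights as external initializers
# on the session options before loading the model.
sess_options = ort.SessionOptions()
names = list(adapters.keys())
values = [ort.OrtValue.ortvalue_from_numpy(v) for v in adapters.values()]
sess_options.add_external_initializers(names, values)
session_a = ort.InferenceSession("model_external_initializers.onnx", sess_options)

# Option b (make_inputs=true): pass the weights as constant inputs on every run,
# alongside the regular model inputs ("input_ids" is a placeholder input name).
session_b = ort.InferenceSession("model_constant_inputs.onnx")
outputs = session_b.run(None, {"input_ids": np.zeros((1, 8), dtype=np.int64), **adapters})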
Olive also provides a command line tool to convert adapters saved after peft fine-tuning to a format compatible with a model that has been optimized with the ExtractAdapters
pass. More details on the olive convert-adapters
command can be found at Command Line Tools.