ONNX#

ONNX is an open graph format to represent machine learning models. ONNX Runtime is a cross-platform machine-learning model accelerator, with a flexible interface to integrate hardware-specific libraries.

Olive provides multiple transformations and optimizations based on various ONNX to improve model performance.

Peeophole Optimizer#

OnnxPeepholeOptimizer optimizes an ONNX model. The optimization process involves analyzing the structure of the ONNX model and identifying opportunities.

The OnnxPeepholeOptimizer leverages onnxscript (https://onnxscript.ai/tutorial/optimizer/optimize.html) and onnxoptimizer(https://github.com/onnx/optimizer) underneath.

Optimization	Description
Constant Folding	Applies constant folding optimization to the model.
Constant Propagation	Applies constant propagation optimization to the model. Applied as part of constant folding.
Sequence Simplification	Simplifies Sequence-based ops (e.g., SequenceConstruct, ConcatFromSequence). Part of constant folding.
Remove Unused Nodes	Removes unused nodes from the model.
Remove Unused Functions	Removes unused function protos from the model.
Inline Functions with Unused Outputs	Inlines function nodes with unused outputs.
Inline Simple Functions	Inlines simple functions based on a node count threshold.
Eliminate Nop Cast	Eliminates no-operation (nop) Casts.
Eliminate Nop Dropout	Eliminates no-operation Dropouts.
Eliminate Nop Flatten	Eliminates no-operation Flattens.
Extract Constant to Initializer	Extracts constants to initializers.
Eliminate If with Const Cond	Eliminates If nodes with constant conditions.
Eliminate Nop Monotone ArgMax	Eliminates nop monotone ArgMax.
Eliminate Nop Pad	Eliminates no-operation Pads.
Eliminate Nop Concat	Eliminates no-operation Concats.
Eliminate Nop Split	Eliminates no-operation Splits.
Eliminate Nop Expand	Eliminates no-operation Expands.
Eliminate Shape Gather	Eliminates Shape Gather operations.
Eliminate Slice after Shape	Eliminates Slice nodes that occur after Shape nodes.
Eliminate Nop Transpose	Eliminates no-operation Transposes.
Fuse Add Bias into Conv	Fuses Add operations as biases into Conv layers.
Fuse BN into Conv	Fuses BatchNormalization into Conv layers.
Fuse Consecutive Concats	Fuses consecutive Concat operations.
Fuse Consecutive LogSoftmax	Fuses consecutive LogSoftmax operations.
Fuse Consecutive Reduce+Unsqueeze	Fuses consecutive Reduce and Unsqueeze operations.
Fuse Consecutive Squeezes	Fuses consecutive Squeeze operations.
Fuse Consecutive Transposes	Fuses consecutive Transpose operations.
Fuse MatMul+Add Bias into GEMM	Fuses MatMul and Add operations into GEMM layers.
Fuse Pad into Conv	Fuses Pad operations into Conv layers.
Fuse Pad into Pool	Fuses Pad operations into Pool layers.
Fuse Transpose into GEMM	Fuses Transpose operations into GEMM layers.
Fuse Concat into Reshape	Fuses Concat operations into Reshape layers.
Eliminate Nop Reshape	Eliminates no-operation Reshapes.
Eliminate Nop with Unit	Eliminates no-operation nodes with unit values.
Eliminate Common Subexpression	Eliminates common sub-expressions.
Fuse QKV	Fuses query, key, and value layers in transformer models.
Fuse Consecutive Unsqueezes	Fuses consecutive Unsqueeze operations.
Eliminate Deadend Nodes	Eliminates dead-end nodes.
Eliminate Identity Nodes	Eliminates Identity nodes.
Eliminate Shape Ops	Eliminates Shape operations where possible.
Fuse Consecutive Slices	Fuses consecutive Slice operations.
Eliminate Unused Initializer	Eliminates unused initializers.
Eliminate Duplicate Initializer	Eliminates duplicate initializers.
Broadcast to MatMul	Converts broadcast patterns into MatMul operations for better efficiency.
Cast Constant of Shape	Simplifies constant casting for shape operations.
GEMM to MatMul+Add	Converts GEMM operations into MatMul and Add for improved compatibility.
No-Op Removal	Removes redundant or no-op operations in the computation graph.

Please refer to OnnxPeepholeOptimizer for more details about the pass and its config parameters.

Example Configuration#

{
    "type": "OnnxPeepholeOptimizer"
}

ORT Transformers Optimization#

While ONNX Runtime automatically applies most optimizations while loading transformer models, some of the latest optimizations that have not yet been integrated into ONNX Runtime. OrtTransformersOptimization provides an offline capability to optimize transformers models in scenarios where ONNX Runtime does not apply the optimization at load time. These optimizations are provided by onnxruntime through onnxruntime.transformers. Please refer to the corresponding documentation for more details on the optimizations done by this tool.

Please refer to OrtTransformersOptimization for more details about the pass and its config parameters.

Example Configuration#

{
    "type": "OrtTransformersOptimization",
    "model_type": "bert"
}

ONNX Surgeries#

Olive provides ability to apply many graph surgeries on the ONNX model. In the examples, the surgeries Pass represents a list of surgeries to be applied sequentially on the ONNX model. You can combine multiple surgeries in the list to perform consecutive modifications in a single operation.

Example#

"surgeries": {
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "RenameInputs",
            "old_names": ["input1", "input2"],
            "new_names": ["renamed_input1", "renamed_input2"]
        },
        {
            "surgeon": "RenameOutputs",
            "old_names": ["output1", "output2"],
            "new_names": ["renamed_output1", "renamed_output2"]
        },
        {
            "surgeon": "InferShapes"
        }
    ]
}

`RenameInputs`#

Description#

Renames specific inputs in the ONNX model.

Configurations#

old_names: List of original input names to rename.
new_names: List of new input names.

Example#

Initial ONNX model graph:

graph {
  input: "input1"
  input: "input2"
  node {
    op_type: "Add"
    input: ["input1", "input2"]
    output: ["add_output"]
  }
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "RenameInputs",
            "old_names": ["input1", "input2"],
            "new_names": ["renamed_input1", "renamed_input2"]
        }
    ]
}

Transformed ONNX model graph:

graph {
  input: "renamed_input1"
  input: "renamed_input2"
  node {
    op_type: "Add"
    input: ["renamed_input1", "renamed_input2"]
    output: ["add_output"]
  }
}

`RenameOutputs`#

Description#

Renames specific outputs in the ONNX model.

Configurations#

old_names: List of original output names to rename.
new_names: List of new output names.

Example#

Initial ONNX model graph:

graph {
  node {
    op_type: "Add"
    input: ["input1", "input2"]
    output: ["output1"]
  }
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "RenameOutputs",
            "old_names": ["output1"],
            "new_names": ["renamed_output1"]
        }
    ]
}

Transformed ONNX model graph:

graph {
  node {
    op_type: "Add"
    input: ["input1", "input2"]
    output: ["renamed_output1"]
  }
}

`InferShapes`#

Description#

Performs shape inference on the ONNX model to populate missing shape information.

Example#

Initial ONNX model graph:

graph {
  input: "input1" (FLOAT) shape: [1]
  input: "input2" (FLOAT) shape: [1]
  node {
    op_type: "Add"
    input: ["input1", "input2"]
    output: ["add_output"]
  }
  output: "add_output" (FLOAT)
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "InferShapes"
        }
    ]
}

Transformed ONNX model graph (with inferred shapes):

graph {
  input: "input1" (FLOAT) shape: [1]
  input: "input2" (FLOAT) shape: [1]
  node {
    op_type: "Add"
    input: ["input1", "input2"]
    output: ["add_output"]
  }
  output: "add_output" (FLOAT) shape: [1]
}

`RemoveShapes`#

Description#

Removes all shape information (value_info) from the ONNX model.

Example#

Initial ONNX model graph:

graph test-model {
  # Inputs
  input: "input1" (FLOAT) shape: [1]
  input: "input2" (FLOAT) shape: [1]

  # Value Info
  value_info: "intermediate" (FLOAT) shape: [1]

  # Nodes
  node {
    op_type: "Add"
    name: "Add"
    input: ["input1", "input2"]
    output: ["intermediate"]
  }

  node {
    op_type: "Relu"
    name: "Relu"
    input: ["intermediate"]
    output: ["output1"]
  }

  # Outputs
  output: "output1" (FLOAT) shape: [1]
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "RemoveShapes"
        }
    ]
}

Transformed ONNX model graph:

graph test-model {
  # Inputs
  input: "input1" (FLOAT) shape: [1]
  input: "input2" (FLOAT) shape: [1]

  # Nodes
  node {
    op_type: "Add"
    name: "Add"
    input: ["input1", "input2"]
    output: ["intermediate"]
  }

  node {
    op_type: "Relu"
    name: "Relu"
    input: ["intermediate"]
    output: ["output1"]
  }

  # Outputs
  output: "output1" (FLOAT) shape: [1]
}

`RemoveInitializerFromInputs`#

Description#

Removes initializers from the input list (graph.input) of an ONNX model.

Example#

Initial ONNX model graph:

graph {
  input: "input1"
  input: "input2"
  initializer {
    name: "input2"
    data_type: FLOAT
    dims: [1, 3]
    raw_data: "\000\000\200?\000\000\000@\000\000@@"
  }
  node {
    op_type: "Add"
    input: ["input1", "input2"]
    output: ["add_output"]
  }
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "RemoveInitializerFromInputs"
        }
    ]
}

Transformed ONNX model graph:

graph {
  input: "input1"
  initializer {
    name: "input2"
    data_type: FLOAT
    dims: [1, 3]
    raw_data: "\000\000\200?\000\000\000@\000\000@@"
  }
  node {
    op_type: "Add"
    input: ["input1", "input2"]
    output: ["add_output"]
  }
}

`ReorderInputs`#

Description#

Reorders the inputs of the ONNX model according to a specified permutation.

Configurations#

permutation: A list specifying the new order of inputs, as a permutation of the original indices.

Example#

Initial ONNX model graph:

graph {
  input: "input1"
  input: "input2"
  node {
    op_type: "Add"
    input: ["input1", "input2"]
    output: ["add_output"]
  }
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "ReorderInputs",
            "permutation": [1, 0]
        }
    ]
}

Transformed ONNX model graph:

graph {
  input: "input2"
  input: "input1"
  node {
    op_type: "Add"
    input: ["input1", "input2"]
    output: ["add_output"]
  }
}

`ReplaceErfWithTanh`#

Description#

Replaces Erf nodes in the ONNX model with an equivalent computation using Tanh. The replacement involves scaling the input and applying the Tanh function to produce a result that approximates the Erf behavior.

Example#

Initial ONNX model graph:

graph {
  input: "input"
  output: "erf_output"
  node {
    op_type: "Erf"
    input: ["input"]
    output: ["erf_output"]
  }
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "ReplaceErfWithTanh"
        }
    ]
}

Transformed ONNX model graph:

graph {
  input: "input"
  initializer: "scale_0" (FLOAT, value: 1.203)
  node {
    op_type: "Mul"
    input: ["input", "scale_0"]
    output: ["mul_0"]
    name: "Sub_Mul_0"
  }
  node {
    op_type: "Tanh"
    input: ["mul_0"]
    output: ["erf_output"]
    name: "Sub_Tanh_0"
  }
  output: "erf_output"
}

`ZeroOutInput`#

Description#

Replaces a specific input of a node with a constant tensor of zeros.

Configurations#

node_name: Name of the node whose input is to be replaced.
input_idx: Index of the input to be replaced.

Example#

Initial ONNX model graph:

graph {
  input: "input1"
  input: "input2"
  node {
    op_type: "Add"
    input: ["input1", "input2"]
    output: ["add_output"]
  }
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "ZeroOutInput",
            "node_name": "AddNode",
            "input_idx": 1
        }
    ]
}

Transformed ONNX model graph:

graph {
  input: "input1"
  node {
    op_type: "Constant"
    output: ["Add_zero_output_0"]
    value: [0.0]
  }
  node {
    op_type: "Add"
    input: ["input1", "Add_zero_output_0"]
    output: ["add_output"]
  }
}

`RemoveInputs`#

Description#

Removes specific inputs from the ONNX model.

Configurations#

names: List of input names to remove.

Example#

Initial ONNX model graph:

graph {
  input: "input1"
  input: "input2"
  node {
    op_type: "Add"
    input: ["input1", "input2"]
    output: ["add_output"]
  }
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "RemoveInputs",
            "names": ["input2"]
        }
    ]
}

Transformed ONNX model graph:

graph {
  input: "input1"
  node {
    op_type: "Add"
    input: ["input1"]
    output: ["add_output"]
  }
}

`ExposeOutputs`#

Description#

Exposes the outputs of specific nodes in the ONNX model as graph outputs.

Configurations#

names: List of node names whose outputs should be exposed as graph outputs.

Example#

Initial ONNX model graph:

graph {
  node {
    op_type: "Relu"
    input: ["input1"]
    output: ["relu_output"]
  }
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "ExposeOutputs",
            "names": ["ReluNode"]
        }
    ]
}

Transformed ONNX model graph:

graph {
  node {
    op_type: "Relu"
    input: ["input1"]
    output: ["relu_output"]
  }
  output: "relu_output"
}

`ExposeQuantizedOutput`#

Description#

Replaces a specified output with its quantized version and exposes its scale and zero_point as graph outputs.

Configurations#

output_name: Name of the output to replace with its quantized version.

Example#

Initial ONNX model graph:

graph {
  node {
    op_type: "DequantizeLinear"
    input: ["quantized_tensor", "scale", "zero_point"]
    output: ["output1"]
  }
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "ExposeQuantizedOutput",
            "output_name": "output1"
        }
    ]
}

Transformed ONNX model graph:

graph {
  output: "quantized_tensor"
  output: "scale"
  output: "zero_point"
}

`RMSNormToL2Norm`#

Description#

Replace RMSNorm subgraph with L2Norm subgraph.

Example#

Initial model graph:

RMSNorm pattern:
    +-----------------------------------------------+
    |                                               |
    |                                               v
[Root] --> Pow --> ReduceMean --> Add --> Sqrt --> Div --> Mul
          (y=2)     (axis=-1)   (B=E-6)

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "RMSNormToL2Norm"
        }
    ]
}

Transformed model graph:

[Root] --> LpNormalization --> Mul
           (p=2, axis=-1)

`SimplifiedLayerNormToL2Norm`#

Description#

Replace Skip/SimplifiedLayerNormalization nodes with L2Norm subgraph.

Example#

Initial model graph:

SimplifiedLayerNorm pattern:
[Root] --> SimplifiedLayerNormalization
             (axis=-1, epsilon=1e-6)

SkipSimpleLayerNorm pattern:
[Root1] ------------+
                    v
[Root2] --> SkipSimpleLayerNormalization
               (epsilon=1e-6)

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "SimplifiedLayerNormToL2Norm"
        }
    ]
}

Transformed model graph:

[Root] --> LpNormalization -> Mul
           (p=2, axis=-1)

[Root1] -------> Add
                  |
                  v
[Root2] --> LpNormalization --> Mul
            (p=2, axis=-1)

`ReplaceAttentionMaskValue`#

Description#

Replace the value of extended attention mask with a new value. This surgery is useful if the default mask value does not quantize well due to numerical instability.

Example#

Initial model graph:

graph {
  node {
    input: "input1"
    output: "output1"
    name: "ConstantOfShape"
    op_type: "ConstantOfShape"
    attribute {
      name: "value"
      t {
        dims: 1
        data_type: 1
        float_data: -3.4028234663852886e+38
        name: ""
      }
      type: TENSOR
    }
  }
  node {
    output: "Constant_output"
    name: "Constant"
    op_type: "Constant"
    attribute {
      name: "value"
      t {
        data_type: 1
        float_data: -3.4028234663852886e+38
        name: ""
      }
      type: TENSOR
    }
  }
  initializer {
    data_type: 1
    float_data: -3.4028234663852886e+38
    name: "init"
  }
}

After applying:

{
    "type": "GraphSurgeries",
    "surgeries": [
        {
            "surgeon": "ReplaceAttentionMaskValue"
        }
    ]
}

Transformed model graph:

graph {
  node {
    input: "input1"
    output: "output1"
    name: "ConstantOfShape"
    op_type: "ConstantOfShape"
    attribute {
      name: "value"
      t {
        dims: 1
        data_type: 1
        float_data: -10000.0
        name: ""
      }
      type: TENSOR
    }
  }
  node {
    output: "Constant_output"
    name: "Constant"
    op_type: "Constant"
    attribute {
      name: "value"
      t {
        data_type: 1
        float_data: -10000.0
        name: ""
      }
      type: TENSOR
    }
  }
  initializer {
    data_type: 1
    float_data: -10000.0
    name: "init"
  }
}

Append Pre/Post Processing Ops#

AppendPrePostProcessingOps inserts pre and post processing ops into the ONNX graph.

Example Configuration#

{
    "type": "AppendPrePostProcessingOps",
    "tool_command": "superresolution",
    "tool_command_args": {
        "output_format": "png"
    }
}

{
    "type": "AppendPrePostProcessingOps",
    "tool_command": "whisper",
    "tool_command_args": {
        "use_audio_decoder": true
    }
}

AppendPrePostProcessingOps also supports pre/post processing ops by leveraging the onnxruntime-extension steps and PrePostProcessor. You can refer to here to see how to leverage PrePostProcessor to customize pre and post processing ops.

Olive introduces two placeholders to represent the model input/output shape dimension value: __model_input__ and __model_output__.
To support the IoMapEntry, the step need choose use the full form. For example:

    "YCbCrToPixels": {
        "params": {
            "layout": "BGR",
        },
        "io_map": [
            ["Y1_uint8", 0, 0],
            ["Cb1_uint8", 0, 1],
            ["Cr1_uint8", 0, 2],
        ],
    }

The tool_command_args will be used to describe the input parameters to create the PrePostProcessor instance. It is list of PrePostProcessorInput. The name is the tensor name. The data_type and shape will be used to create the tensor type. The shape can be a list of integers or a list of string.

Users that write their own pre/post processing steps need to have the knowledge about whether the step includes the operators that is built-in support or supported in onnxruntime-extensions. For example, for some ops like ConvertImageToBGR which requires other extensions may be incompatible with ort-web, user need to exclude this kind of ops to generate proper models.

Here are some examples to describe the pre/post processing which is exactly same with superresolution

{
    "pre": [
        {"ConvertImageToBGR": {}},
        {
            "Resize": {
                "resize_to": [
                    {"type": "__model_input__", "input_index": 0, "dim_index": -2},
                    {"type": "__model_input__", "input_index": 0, "dim_index": -1},
                ]
            }
        },
        {
            "CenterCrop": {
                "height": {"type": "__model_input__", "input_index": 0, "dim_index": -2},
                "width": {"type": "__model_input__", "input_index": 0, "dim_index": -1},
            }
        },
        {"PixelsToYCbCr": {"layout": "BGR"}},
        {"ImageBytesToFloat": {}},
        {"Unsqueeze": {"axes": [0, 1]}},
    ],
    "post": [
        {"Squeeze": {"axes": [0, 1]}},
        {"FloatToImageBytes": {"name": "Y1_uint8"}},
        {
            "Resize": {
                "params": {
                    "resize_to": [
                        {"type": "__model_output__", "output_index": 0, "dim_index": -2},
                        {"type": "__model_output__", "output_index": 0, "dim_index": -1},
                    ],
                    "layout": "HW",
                },
                "io_map": [["PixelsToYCbCr", 1, 0]],
            }
        },
        {"FloatToImageBytes": {"multiplier": 1.0, "name": "Cb1_uint8"}},
        {
            "Resize": {
                "params": {
                    "resize_to": [
                        {"type": "__model_output__", "output_index": 0, "dim_index": -2},
                        {"type": "__model_output__", "output_index": 0, "dim_index": -1},
                    ],
                    "layout": "HW",
                },
                "io_map": [["PixelsToYCbCr", 2, 0]],
            }
        },
        {"FloatToImageBytes": {"multiplier": 1.0, "name": "Cr1_uint8"}},
        {
            "YCbCrToPixels": {
                "params": {
                    "layout": "BGR",
                },
                "io_map": [
                    ["Y1_uint8", 0, 0],
                    ["Cb1_uint8", 0, 1],
                    ["Cr1_uint8", 0, 2],
                ],
            }
        },
        {"ConvertBGRToImage": {"image_format": "png"}},
    ],
    "tool_command_args": [
        {
            "name": "image",
            "data_type": "uint8",
            "shape": ["num_bytes"],
        }
    ],
    "target_opset": 16,
}

Insert Beam Search Op#

InsertBeamSearch chains two model components (for example, encoder and decoder) together by inserting beam search op in between them.

Example Configuration#

{
    "type": "InsertBeamSearch",
    "no_repeat_ngram_size": 4
}

ORT Performance Tuning#

ONNX Runtime provides high performance across a range of hardware options through its Execution Providers interface for different execution environments. For each model running with each execution provider, there are settings that can be tuned (e.g. thread number, execution mode, etc) to improve performance. OrtSessionParamsTuning covers basic knobs that can be leveraged to find the best performance for your model and hardware.

Example Configuration#

{
    "type": "OrtSessionParamsTuning",
    "data_config": "session_params_tuning_data_config",
    "batch_size": 1,
    "providers_list" : [
        [
            "CUDAExecutionProvider",
            {
                "device_id": 0,
                "arena_extend_strategy": "kNextPowerOfTwo",
                "gpu_mem_limit": 2147483648, // 2 * 1024 * 1024 * 1024,
                "cudnn_conv_algo_search": "EXHAUSTIVE",
                "do_copy_in_default_stream": true,
            },
        ],
        "CPUExecutionProvider",
    ],
    "enable_profiling": false
}

Check out this file for an example implementation of "user_script.py" and "calib_data_config/dataloader_config/type".

ONNX#

Peeophole Optimizer#

Example Configuration#

ORT Transformers Optimization#

Example Configuration#

ONNX Surgeries#

Example#

RenameInputs#

Description#

Configurations#

Example#

RenameOutputs#

Description#

Configurations#

Example#

InferShapes#

Description#

Example#

RemoveShapes#

Description#

Example#

RemoveInitializerFromInputs#

Description#

Example#

ReorderInputs#

Description#

Configurations#

Example#

ReplaceErfWithTanh#

Description#

Example#

ZeroOutInput#

Description#

Configurations#

Example#

RemoveInputs#

Description#

Configurations#

Example#

ExposeOutputs#

Description#

Configurations#

Example#

ExposeQuantizedOutput#

Description#

Configurations#

Example#

RMSNormToL2Norm#

Description#

Example#

SimplifiedLayerNormToL2Norm#

Description#

Example#

ReplaceAttentionMaskValue#

Description#

Example#

Append Pre/Post Processing Ops#

Example Configuration#

Insert Beam Search Op#

Example Configuration#

ORT Performance Tuning#

Example Configuration#

`RenameInputs`#

`RenameOutputs`#

`InferShapes`#

`RemoveShapes`#

`RemoveInitializerFromInputs`#

`ReorderInputs`#

`ReplaceErfWithTanh`#

`ZeroOutInput`#

`RemoveInputs`#

`ExposeOutputs`#

`ExposeQuantizedOutput`#

`RMSNormToL2Norm`#

`SimplifiedLayerNormToL2Norm`#

`ReplaceAttentionMaskValue`#