Passes

The following passes are available in Olive.

Each pass is listed below with a description and its configuration options.
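
Passes are typically declared in the “passes” section of an Olive workflow configuration and executed with the Python run API. The sketch below is illustrative only; the model path, pass names, and option values are placeholder assumptions, not recommendations.

    # A minimal, hedged sketch of an Olive workflow that chains two passes (requires olive-ai).
    from olive.workflows import run as olive_run

    workflow = {
        "input_model": {"type": "PyTorchModel", "config": {"model_path": "model.pt"}},  # placeholder model
        "passes": {
            "convert": {"type": "OnnxConversion", "config": {"target_opset": 13}},
            "quantize": {"type": "OnnxDynamicQuantization"},
        },
        "engine": {"output_dir": "outputs"},
    }

    olive_run(workflow)  # runs the passes in order and writes results to output_dir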

OnnxConversion

Convert a PyTorch model to an ONNX model using torch.onnx.export on CPU.

Input: handler.pytorch.PyTorchModelHandler

Output: handler.composite.CompositeModelHandler | handler.onnx.DistributedOnnxModelHandler | handler.onnx.ONNXModelHandler
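
To show how the options below fit together, here is a hypothetical pass entry for a workflow’s “passes” section; every value shown is an illustrative assumption.

    # Hypothetical OnnxConversion entry; values are examples, not defaults.
    conversion_pass = {
        "type": "OnnxConversion",
        "config": {
            "target_opset": 17,             # ai.onnx opset to target (pass default is 13)
            "torch_dtype": "float32",       # cast the model before export
            "save_as_external_data": True,  # keep tensor data out of the .onnx file
            "all_tensors_to_one_file": True,
        },
    }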

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

target_opset

The version of the default (ai.onnx) opset to target.

type: int

default_value: 13

searchable_values: None

use_dynamo_exporter

Whether to use the dynamo_export API to export the ONNX model.

type: bool

default_value: False

searchable_values: None

device

The device to use for conversion, e.g., ‘cuda’ or ‘cpu’. If not specified, ‘cpu’ is used for a PyTorch model and ‘cuda’ for a DistributedPyTorchModel.

type: str

default_value: None

searchable_values: None

torch_dtype

The dtype to cast the model to before conversion, e.g., ‘float32’ or ‘float16’. If not specified, will use the model as is.

type: str

default_value: None

searchable_values: None

parallel_jobs

Number of parallel jobs. Defaults to the number of CPUs. Set to 0 to disable.

type: int

default_value: None

searchable_values: None

merge_components

Whether to merge the converted components.

type: bool

default_value: False

searchable_values: None

merge_adapter_weights

Whether to merge adapter weights before conversion. After merging, the model structure is consistent with the base model. This is useful if conversion fails for some fine-tuned models with adapter weights.

type: bool

default_value: False

searchable_values: None

save_metadata_for_token_generation

Whether to save metadata for token generation or not. Includes config.json, generation_config.json, and tokenizer related files.

type: bool

default_value: False

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

OnnxOpVersionConversion

Convert an ONNX model to a target version of the default (ai.onnx) opset.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler

target_opset

The version of the default (ai.onnx) opset to target. Default: latest opset version.

type: int

default_value: 21

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

OnnxModelOptimizer

Optimize ONNX model by fusing nodes.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

OrtTransformersOptimization

Use the ONNX Runtime transformers optimizer to optimize transformer-based models in scenarios where ONNX Runtime does not apply the optimizations at load time. It is based on onnxruntime.transformers.optimizer.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler
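
For illustration, a hedged sketch of a configuration for a BERT-style model follows; the head count and hidden size are assumptions that must match the actual model.

    # Hypothetical OrtTransformersOptimization entry for a BERT-base-like graph.
    transformers_opt = {
        "type": "OrtTransformersOptimization",
        "config": {
            "model_type": "bert",   # graph exported by PyTorch
            "num_heads": 12,
            "hidden_size": 768,
            "float16": True,        # convert to half precision
            "keep_io_types": True,  # keep float32 graph inputs/outputs
        },
    }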

model_type

Transformer based model type, including bert (exported by PyTorch), gpt2 (exported by PyTorch), bert_tf (BERT exported by tf2onnx), bert_keras (BERT exported by keras2onnx), and unet/vae/clip (stable diffusion).

type: str

default_value: None

searchable_values: None

num_heads

Number of attention heads.

type: int

default_value: 0

searchable_values: None

num_key_value_heads

Number of key/value attention heads.

type: int

default_value: 0

searchable_values: None

hidden_size

Number of hidden nodes.

type: int

default_value: 0

searchable_values: None

optimization_options

Optimization options that turn on/off some fusions.

type: Dict[str, Any] | onnxruntime.transformers.fusion_options.FusionOptions

default_value: None

searchable_values: None

opt_level

Graph optimization level of ONNX Runtime: 0 - disable all (default), 1 - basic, 2 - extended, 99 - all.

type: Any

default_value: None

searchable_values: Categorical([0, 1, 2, 99])

use_gpu

Flag for GPU inference.

type: bool

default_value: False

searchable_values: None

only_onnxruntime

Whether to use only onnxruntime to optimize the model, with no Python fusion. Disables some optimizers that might cause failure in symbolic shape inference or attention fusion when opt_level > 1.

type: bool

default_value: False

searchable_values: Conditional(parents: (‘opt_level’,), support: {(2,): Categorical([False]), (99,): Categorical([False])}, default: Categorical([True, False]))

float16

Whether half-precision float will be used.

type: bool

default_value: False

searchable_values: None

keep_io_types

Keep input and output tensors in their original data type. Only used when float16 is True.

type: bool

default_value: True

searchable_values: None

force_fp32_ops

Operators that are forced to run in float32. Only used when float16 is True.

type: List[str]

default_value: None

searchable_values: None

force_fp32_nodes

Nodes that are forced to run in float32. Only used when float16 is True.

type: List[str]

default_value: None

searchable_values: None

force_fp16_inputs

Force the conversion of the inputs of some operators to float16, even if the ‘convert_float_to_float16’ tool prefers to keep them in float32.

type: Dict[str, List[int]]

default_value: None

searchable_values: None

use_gqa

Replace MultiHeadAttention with GroupQueryAttention. Only supported when float16 is True.

type: bool

default_value: False

searchable_values: None

input_int32

Whether int32 tensors will be used as input.

type: bool

default_value: False

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

OrtPerfTuning

Optimize ONNX Runtime inference settings.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler
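
A hedged sketch of a tuning configuration follows; the data config name and the candidate lists are assumptions.

    # Hypothetical OrtPerfTuning entry; "latency_data" is an assumed data config defined elsewhere.
    perf_tuning = {
        "type": "OrtPerfTuning",
        "config": {
            "data_config": "latency_data",
            "providers_list": ["CPUExecutionProvider"],
            "execution_mode_list": [0, 1],  # assumed ORT ExecutionMode values: 0 sequential, 1 parallel
            "opt_level_list": [99],
            "io_bind": False,
        },
    }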

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

data_dir

Directory of sample inference data.

type: str | pathlib.Path | olive.resource_path.ResourcePath | olive.resource_path.ResourcePathConfig | None

default_value: None

searchable_values: None

dataloader_func

Dataloader function to load data from given data_dir with given batch size.

type: Callable | str

default_value: None

searchable_values: None

dataloader_func_kwargs

Keyword arguments for dataloader_func.

type: Dict[str, Any]

default_value: None

searchable_values: None

batch_size

Batch size for inference.

type: int

default_value: None

searchable_values: None

data_config

Data config to load data for computing latency.

type: olive.data.config.DataConfig | Dict

default_value: None

searchable_values: None

input_names

Input names list for ONNX model.

type: list

default_value: None

searchable_values: None

input_shapes

Input shapes list for ONNX model.

type: list

default_value: None

searchable_values: None

input_types

Input types list for ONNX model.

type: list

default_value: None

searchable_values: None

device

Device selected for tuning process.

type: str

default_value: cpu

searchable_values: None

cpu_cores

CPU cores used for thread tuning.

type: int

default_value: None

searchable_values: None

io_bind

Whether to enable IOBinding search for ONNX Runtime inference.

type: bool

default_value: False

searchable_values: None

enable_cuda_graph

Whether to enable CUDA Graph for the CUDA execution provider.

type: bool

default_value: False

searchable_values: None

providers_list

List of execution providers to use for running the ONNX model.

type: list

default_value: [‘CPUExecutionProvider’]

searchable_values: None

execution_mode_list

List of execution modes to test (parallelism between operators).

type: list

default_value: None

searchable_values: None

opt_level_list

List of graph optimization levels to test for the ONNX model.

type: list

default_value: None

searchable_values: None

trt_fp16_enable

Whether to enable FP16 mode for the TensorRT execution provider.

type: bool

default_value: False

searchable_values: None

intra_thread_num_list

List of intra-op thread counts to test.

type: list

default_value: [None]

searchable_values: None

inter_thread_num_list

List of inter-op thread counts to test.

type: list

default_value: [None]

searchable_values: None

extra_session_config

Extra customized session options during tuning process.

type: Dict[str, Any]

default_value: None

searchable_values: None

force_evaluate_other_eps

Whether to force evaluation of all execution providers that differ from the associated execution provider.

type: bool

default_value: False

searchable_values: None

enable_profiling

Whether to enable profiling for ONNX Runtime inference.

type: bool

default_value: False

searchable_values: None

OnnxFloatToFloat16

Converts a model to float16. It is based on onnxconverter-common.convert_float_to_float16. See https://onnxruntime.ai/docs/performance/model-optimizations/float16.html#float16-conversion

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler
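
A hedged sketch of a typical float16 conversion setup follows; the blocked op types are illustrative examples of precision-sensitive operators, not a required list.

    # Hypothetical OnnxFloatToFloat16 entry that keeps a few ops in float32.
    fp16_pass = {
        "type": "OnnxFloatToFloat16",
        "config": {
            "keep_io_types": True,                 # leave model inputs/outputs as float32
            "op_block_list": ["Resize", "Where"],  # illustrative precision-sensitive ops
        },
    }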

min_positive_val

Constant values will be clipped against this minimum positive value.

type: float

default_value: 1e-07

searchable_values: None

max_finite_val

Constant values will be clipped against this maximum finite value.

type: float

default_value: 10000.0

searchable_values: None

keep_io_types

Whether model inputs/outputs should be left as float32

type: bool

default_value: False

searchable_values: None

disable_shape_infer

Skips running onnx shape/type inference.

type: bool

default_value: False

searchable_values: None

op_block_list

List of op types to leave as float32

type: List[str]

default_value: None

searchable_values: None

node_block_list

List of node names to leave as float32

type: List[str]

default_value: None

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

OrtMixedPrecision

Convert model to mixed precision.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler

op_block_list

List of op types to leave as float32

type: List[str]

default_value: [‘SimplifiedLayerNormalization’, ‘SkipSimplifiedLayerNormalization’, ‘Relu’, ‘Add’]

searchable_values: None

atol

Absolute tolerance for checking float16 conversion

type: float

default_value: 1e-06

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

QNNPreprocess

Preprocess ONNX model for quantization targeting QNN Execution Provider.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler
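
A hedged sketch of a preprocessing setup follows; the input name is a placeholder for a real graph input.

    # Hypothetical QNNPreprocess entry; "input0" is a placeholder graph input name.
    qnn_preprocess = {
        "type": "QNNPreprocess",
        "config": {
            "fuse_layernorm": True,
            "inputs_to_make_channel_last": ["input0"],
        },
    }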

fuse_layernorm

Whether to fuse ReduceMean sequence into a single LayerNormalization node.

type: bool

default_value: False

searchable_values: None

inputs_to_make_channel_last

List of graph input names to transpose to be “channel-last”. For example, if “input0” originally has the shape (N, C, D1, D2, …, Dn), the resulting model will change input0’s shape to (N, D1, D2, …, Dn, C) and add a transpose node after it. Original: input0 (N, C, D1, D2, …, Dn) -> <Nodes>. Updated: input0 (N, D1, D2, …, Dn, C) -> Transpose -> input0_chanfirst (N, C, D1, D2, …, Dn) -> <Nodes>. This can potentially improve inference latency for QDQ models running on QNN EP because the additional transpose node may allow other transpose nodes inserted during ORT layout transformation to cancel out.

type: list

default_value: None

searchable_values: None

outputs_to_make_channel_last

List of graph output names to transpose to be “channel-last”. For example, if “output0” originally has the shape (N, C, D1, D2, …, Dn), the resulting model will change output0’s shape to (N, D1, D2, …, Dn, C) and add a transpose node before it. Original: <Nodes> -> output0 (N, C, D1, D2, …, Dn). Updated: <Nodes> -> output0_chanfirst (N, C, D1, D2, …, Dn) -> Transpose -> output0 (N, D1, D2, …, Dn, C). This can potentially improve inference latency for QDQ models running on QNN EP because the additional transpose node may allow other transpose nodes inserted during ORT layout transformation to cancel out.

type: list

default_value: None

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

OnnxDynamicQuantization

ONNX Dynamic Quantization Pass.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler
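
Dynamic quantization needs no calibration data, so a pass entry can be as small as the hedged sketch below; the option values are illustrative.

    # Hypothetical OnnxDynamicQuantization entry restricted to a couple of op types.
    dynamic_quant = {
        "type": "OnnxDynamicQuantization",
        "config": {
            "weight_type": "QInt8",
            "per_channel": True,
            "op_types_to_quantize": ["MatMul", "Gemm"],
        },
    }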

quant_mode

dynamic quantization mode

type: str

default_value: dynamic

searchable_values: None

weight_type

Data type for quantizing weights which is used both in dynamic and static quantization. ‘QInt8’ for signed 8-bit integer, ‘QUInt8’ for unsigned 8-bit integer.

type: str

default_value: QInt8

searchable_values: Categorical([‘QInt8’, ‘QUInt8’])

op_types_to_quantize

List of operator types to quantize. If None, all quantizable.

type: list

default_value: None

searchable_values: None

nodes_to_quantize

List of node names to quantize. If None, all quantizable.

type: list

default_value: None

searchable_values: None

nodes_to_exclude

List of node names to exclude from quantization. If None, no nodes are excluded.

type: list

default_value: None

searchable_values: None

per_channel

Quantize weights per channel. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization

type: bool

default_value: False

searchable_values: Categorical([True, False])

reduce_range

Quantize weights with 7-bits. It may improve the accuracy for some models running on non-VNNI machine, especially for per-channel mode. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization

type: bool

default_value: False

searchable_values: Categorical([True, False])

quant_preprocess

Shape inference and model optimization, in preparation for quantization. https://onnxruntime.ai/docs/performance/quantization.html#pre-processing

type: bool

default_value: True

searchable_values: Categorical([True, False])

extra.Sigmoid.nnapi

type: bool

default_value: False

searchable_values: None

ActivationSymmetric

symmetrize calibration data for activations

type: bool

default_value: False

searchable_values: None

WeightSymmetric

symmetrize calibration data for weights

type: bool

default_value: True

searchable_values: None

EnableSubgraph

If enabled, subgraphs will be quantized. Only dynamic mode is currently supported.

type: bool

default_value: False

searchable_values: None

ForceQuantizeNoInputCheck

By default, some latent operators such as MaxPool and Transpose are not quantized if their input is not already quantized. Set to True to force such operators to always quantize their input and thus generate quantized output. This behavior can still be disabled per node using nodes_to_exclude.

type: bool

default_value: False

searchable_values: None

MatMulConstBOnly

If enabled, only MatMul with const B will be quantized.

type: bool

default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘dynamic’,): True, (‘static’,): False}, default: OLIVE_INVALID_PARAM_VALUE)

searchable_values: None

extra_options

Key value pair dictionary for extra_options in quantization. Please refer to https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/quantize.py for details about the supported options. If an option is one of [‘extra.Sigmoid.nnapi’, ‘ActivationSymmetric’, ‘WeightSymmetric’, ‘EnableSubgraph’, ‘ForceQuantizeNoInputCheck’, ‘MatMulConstBOnly’], it will be overwritten by the corresponding config parameter value.

type: dict

default_value: None

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

OnnxStaticQuantization

ONNX Static Quantization Pass.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler
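
Static quantization requires calibration data, supplied via data_config or dataloader_func. A hedged sketch, assuming a data config named “calib_data” is defined elsewhere in the workflow:

    # Hypothetical OnnxStaticQuantization entry using an assumed calibration data config.
    static_quant = {
        "type": "OnnxStaticQuantization",
        "config": {
            "data_config": "calib_data",
            "quant_format": "QDQ",
            "weight_type": "QInt8",
            "activation_type": "QInt8",   # must pair with weight_type under QDQ (see search space below)
            "calibrate_method": "MinMax",
        },
    }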

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

quant_mode

static quantization mode

type: str

default_value: static

searchable_values: None

weight_type

Data type for quantizing weights which is used both in dynamic and static quantization. ‘QInt8’ for signed 8-bit integer, ‘QUInt8’ for unsigned 8-bit integer.

type: str

default_value: QInt8

searchable_values: Categorical([‘QInt8’, ‘QUInt8’])

op_types_to_quantize

List of operator types to quantize. If None, all quantizable.

type: list

default_value: None

searchable_values: None

nodes_to_quantize

List of node names to quantize. If None, all quantizable.

type: list

default_value: None

searchable_values: None

nodes_to_exclude

List of node names to exclude from quantization. If None, no nodes are excluded.

type: list

default_value: None

searchable_values: None

per_channel

Quantize weights per channel. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization

type: bool

default_value: False

searchable_values: Categorical([True, False])

reduce_range

Quantize weights with 7-bits. It may improve the accuracy for some models running on non-VNNI machine, especially for per-channel mode. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization

type: bool

default_value: False

searchable_values: Categorical([True, False])

quant_preprocess

Shape inference and model optimization, in preparation for quantization. https://onnxruntime.ai/docs/performance/quantization.html#pre-processing

type: bool

default_value: True

searchable_values: Categorical([True, False])

data_dir

Path to the directory containing the dataset. For local data, it is required if quant_mode is ‘static’ and dataloader_func is provided.

type: str | pathlib.Path | olive.resource_path.ResourcePath | olive.resource_path.ResourcePathConfig | None

default_value: None

searchable_values: None

batch_size

Batch size for calibration, only used if dataloader_func is provided.

type: int

default_value: 1

searchable_values: None

dataloader_func

Function/function name to generate dataloader for calibration, required if quant_mode is ‘static’ and data_config is None.

type: Callable | str

default_value: None

searchable_values: None

dataloader_func_kwargs

Keyword arguments for dataloader_func.

type: Dict[str, Any]

default_value: None

searchable_values: None

data_config

Data config for calibration, required if quant_mode is ‘static’ and dataloader_func is None.

type: olive.data.config.DataConfig | Dict

default_value: None

searchable_values: None

calibrate_method

Currently supported calibration methods are MinMax and Entropy; use CalibrationMethod.MinMax or CalibrationMethod.Entropy as options. Percentile is not supported for onnxruntime==1.16.0, so avoid setting or searching it.

type: str

default_value: MinMax

searchable_values: Categorical([‘MinMax’, ‘Entropy’, ‘Percentile’])

quant_format

QOperator format quantizes the model with quantized operators directly. QDQ format quantizes the model by inserting QuantizeLinear/DeQuantizeLinear nodes on the tensors.

type: str

default_value: QDQ

searchable_values: Categorical([‘QOperator’, ‘QDQ’])

activation_type

Quantization data type of activation. Please refer to https://onnxruntime.ai/docs/performance/quantization.html for more details on data type selection

type: str

default_value: QInt8

searchable_values: Conditional(parents: (‘quant_format’, ‘weight_type’), support: {(‘QDQ’, ‘QInt8’): Categorical([‘QInt8’]), (‘QDQ’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘QOperator’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘QOperator’, ‘QInt8’): Categorical([<SpecialParamValue.INVALID: ‘OLIVE_INVALID_PARAM_VALUE’>])}, default: Categorical([<SpecialParamValue.INVALID: ‘OLIVE_INVALID_PARAM_VALUE’>]))

prepare_qnn_config

Whether to generate a suitable quantization config for the input model. Should be set to True if model is targeted for QNN EP.

type: bool

default_value: False

searchable_values: None

extra.Sigmoid.nnapi

type: bool

default_value: False

searchable_values: None

ActivationSymmetric

symmetrize calibration data for activations

type: bool

default_value: False

searchable_values: None

WeightSymmetric

symmetrize calibration data for weights

type: bool

default_value: True

searchable_values: None

EnableSubgraph

If enabled, subgraphs will be quantized. Only dynamic mode is currently supported.

type: bool

default_value: False

searchable_values: None

ForceQuantizeNoInputCheck

By default, some latent operators such as MaxPool and Transpose are not quantized if their input is not already quantized. Set to True to force such operators to always quantize their input and thus generate quantized output. This behavior can still be disabled per node using nodes_to_exclude.

type: bool

default_value: False

searchable_values: None

MatMulConstBOnly

If enabled, only MatMul with const B will be quantized.

type: bool

default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘dynamic’,): True, (‘static’,): False}, default: OLIVE_INVALID_PARAM_VALUE)

searchable_values: None

extra_options

Key value pair dictionary for extra_options in quantization. Please refer to https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/quantize.py for details about the supported options. If an option is one of [‘extra.Sigmoid.nnapi’, ‘ActivationSymmetric’, ‘WeightSymmetric’, ‘EnableSubgraph’, ‘ForceQuantizeNoInputCheck’, ‘MatMulConstBOnly’], it will be overwritten by the corresponding config parameter value.

type: dict

default_value: None

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

OnnxQuantization

Quantize ONNX model with static/dynamic quantization techniques.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

quant_mode

ONNX quantization mode. ‘dynamic’ for dynamic quantization, ‘static’ for static quantization.

type: str

default_value: static

searchable_values: Categorical([‘dynamic’, ‘static’])

weight_type

Data type for quantizing weights which is used both in dynamic and static quantization. ‘QInt8’ for signed 8-bit integer, ‘QUInt8’ for unsigned 8-bit integer.

type: str

default_value: QInt8

searchable_values: Categorical([‘QInt8’, ‘QUInt8’])

op_types_to_quantize

List of operator types to quantize. If None, all quantizable.

type: list

default_value: None

searchable_values: None

nodes_to_quantize

List of node names to quantize. If None, all quantizable.

type: list

default_value: None

searchable_values: None

nodes_to_exclude

List of node names to exclude from quantization. If None, no nodes are excluded.

type: list

default_value: None

searchable_values: None

per_channel

Quantize weights per channel. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization

type: bool

default_value: False

searchable_values: Categorical([True, False])

reduce_range

Quantize weights with 7-bits. It may improve the accuracy for some models running on non-VNNI machine, especially for per-channel mode. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization

type: bool

default_value: False

searchable_values: Categorical([True, False])

quant_preprocess

Shape inference and model optimization, in preparation for quantization. https://onnxruntime.ai/docs/performance/quantization.html#pre-processing

type: bool

default_value: True

searchable_values: Categorical([True, False])

data_dir

Path to the directory containing the dataset. For local data, it is required if quant_mode is ‘static’ and dataloader_func is provided.

type: str | pathlib.Path | olive.resource_path.ResourcePath | olive.resource_path.ResourcePathConfig | None

default_value: None

searchable_values: None

batch_size

Batch size for calibration, only used if dataloader_func is provided.

type: int

default_value: 1

searchable_values: None

dataloader_func

Function/function name to generate dataloader for calibration, required if quant_mode is ‘static’ and data_config is None.

type: Callable | str

default_value: None

searchable_values: None

dataloader_func_kwargs

Keyword arguments for dataloader_func.

type: Dict[str, Any]

default_value: None

searchable_values: None

data_config

Data config for calibration, required if quant_mode is ‘static’ and dataloader_func is None.

type: olive.data.config.DataConfig | Dict

default_value: None

searchable_values: None

calibrate_method

Currently supported calibration methods are MinMax and Entropy; use CalibrationMethod.MinMax or CalibrationMethod.Entropy as options. Percentile is not supported for onnxruntime==1.16.0, so avoid setting or searching it.

type: str

default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘static’,): ‘MinMax’, (‘dynamic’,): <SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>}, default: OLIVE_INVALID_PARAM_VALUE)

searchable_values: Conditional(parents: (‘quant_mode’,), support: {(‘static’,): Categorical([‘MinMax’, ‘Entropy’, ‘Percentile’])}, default: Categorical([<SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>]))

quant_format

QOperator format quantizes the model with quantized operators directly. QDQ format quantizes the model by inserting QuantizeLinear/DeQuantizeLinear nodes on the tensors.

type: str

default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘static’,): ‘QDQ’, (‘dynamic’,): <SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>}, default: OLIVE_INVALID_PARAM_VALUE)

searchable_values: Conditional(parents: (‘quant_mode’,), support: {(‘static’,): Categorical([‘QOperator’, ‘QDQ’])}, default: Categorical([<SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>]))

activation_type

Quantization data type of activation. Please refer to https://onnxruntime.ai/docs/performance/quantization.html for more details on data type selection

type: str

default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘static’,): ‘QInt8’, (‘dynamic’,): <SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>}, default: OLIVE_INVALID_PARAM_VALUE)

searchable_values: Conditional(parents: (‘quant_mode’, ‘quant_format’, ‘weight_type’), support: {(‘static’, ‘QDQ’, ‘QInt8’): Categorical([‘QInt8’]), (‘static’, ‘QDQ’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘static’, ‘QOperator’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘static’, ‘QOperator’, ‘QInt8’): Categorical([<SpecialParamValue.INVALID: ‘OLIVE_INVALID_PARAM_VALUE’>])}, default: Categorical([<SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>]))

prepare_qnn_config

Whether to generate a suitable quantization config for the input model. Should be set to True if model is targeted for QNN EP.

type: bool

default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘static’,): False, (‘dynamic’,): <SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>}, default: OLIVE_INVALID_PARAM_VALUE)

searchable_values: None

extra.Sigmoid.nnapi

type: bool

default_value: False

searchable_values: None

ActivationSymmetric

symmetrize calibration data for activations

type: bool

default_value: False

searchable_values: None

WeightSymmetric

symmetrize calibration data for weights

type: bool

default_value: True

searchable_values: None

EnableSubgraph

If enabled, subgraphs will be quantized. Only dynamic mode is currently supported.

type: bool

default_value: False

searchable_values: None

ForceQuantizeNoInputCheck

By default, some latent operators such as MaxPool and Transpose are not quantized if their input is not already quantized. Set to True to force such operators to always quantize their input and thus generate quantized output. This behavior can still be disabled per node using nodes_to_exclude.

type: bool

default_value: False

searchable_values: None

MatMulConstBOnly

If enabled, only MatMul with const B will be quantized.

type: bool

default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘dynamic’,): True, (‘static’,): False}, default: OLIVE_INVALID_PARAM_VALUE)

searchable_values: None

extra_options

Key value pair dictionary for extra_options in quantization. Please refer to https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/quantize.py for details about the supported options. If an option is one of [‘extra.Sigmoid.nnapi’, ‘ActivationSymmetric’, ‘WeightSymmetric’, ‘EnableSubgraph’, ‘ForceQuantizeNoInputCheck’, ‘MatMulConstBOnly’], it will be overwritten by the corresponding config parameter value.

type: dict

default_value: None

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

DynamicToFixedShape

Convert dynamic shape to fixed shape for ONNX model.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler
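
The pass pairs dim_param with dim_value (or input_name with input_shape) by position. A hedged sketch with placeholder symbolic dimension names:

    # Hypothetical DynamicToFixedShape entry; the symbolic dim names are placeholders.
    fix_shapes = {
        "type": "DynamicToFixedShape",
        "config": {
            "dim_param": ["batch_size", "sequence_length"],  # symbolic dims in the model
            "dim_value": [1, 128],                           # fixed values, matched by position
        },
    }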

dim_param

Symbolic parameter name. Provide dim_value if specified.

type: List[str]

default_value: None

searchable_values: None

dim_value

Value to replace dim_param with in the model. Must be > 0.

type: List[int]

default_value: None

searchable_values: None

input_name

Model input name to replace shape of. Provide input_shape if specified.

type: List[str]

default_value: None

searchable_values: None

input_shape

Shape to use for the corresponding input_name. Provide a comma-separated list for the shape. All values must be > 0, e.g. [1,3,256,256].

type: List[List[int]]

default_value: None

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

IncDynamicQuantization

Intel® Neural Compressor Dynamic Quantization Pass.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler

approach

dynamic quantization mode

type: str

default_value: dynamic

searchable_values: None

device

Intel® Neural Compressor quantization device. Supports ‘cpu’ and ‘gpu’.

type: str

default_value: cpu

searchable_values: None

backend

Backend for model execution. Supports ‘default’, ‘onnxrt_trt_ep’ and ‘onnxrt_cuda_ep’.

type: str

default_value: default

searchable_values: None

domain

Model domain. Supports ‘auto’, ‘cv’, ‘object_detection’, ‘nlp’ and ‘recommendation_system’. The Intel® Neural Compressor adaptor automatically applies domain-specific quantization settings; explicitly specified quantization settings override the automatic ones. If domain is set to ‘auto’, the domain is detected automatically.

type: str

default_value: auto

searchable_values: None

workspace

Workspace for Intel® Neural Compressor quantization where intermediate files and tuning history file are stored. Default value is: “./nc_workspace/{}/”.format(datetime.datetime.now().strftime(“%Y-%m-%d_%H-%M-%S”))

type: str

default_value: None

searchable_values: None

recipes

Recipes for Intel® Neural Compressor quantization. The supported entries are:
‘smooth_quant’: whether to do smooth quant;
‘smooth_quant_args’: parameters for smooth_quant;
‘fast_bias_correction’: whether to do fast bias correction;
‘weight_correction’: whether to do weight correction;
‘gemm_to_matmul’: whether to convert Gemm to MatMul and Add, only valid for ONNX models;
‘graph_optimization_level’: supports ‘DISABLE_ALL’, ‘ENABLE_BASIC’, ‘ENABLE_EXTENDED’, ‘ENABLE_ALL’, only valid for ONNX models;
‘first_conv_or_matmul_quantization’: whether to quantize the first conv or matmul;
‘last_conv_or_matmul_quantization’: whether to quantize the last conv or matmul;
‘pre_post_process_quantization’: whether to quantize the ops in preprocessing and postprocessing;
‘add_qdq_pair_to_weight’: whether to add a QDQ pair for weights, only valid for onnxrt_trt_ep;
‘optypes_to_exclude_output_quant’: don’t quantize the output of specified optypes;
‘dedicated_qdq_pair’: whether to use dedicated QDQ pairs, only valid for onnxrt_trt_ep.

type: dict

default_value: {}

searchable_values: None

reduce_range

Whether to use 7-bit quantization.

type: bool

default_value: False

searchable_values: Categorical([True, False])

quant_level

Intel® Neural Compressor allows users to choose different tuning processes by specifying the quantization level (quant_level). Currently three quant_levels are supported: 0 is a conservative strategy, 1 is a basic or user-specified strategy, and auto (the default) is the combination of 0 and 1. Please refer to https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-process and https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-algorithms for more details.

type: str

default_value: auto

searchable_values: None

excluded_precisions

Precisions to be excluded. The default value is an empty list. By default, Intel® Neural Compressor enables mixed precision with fp32 + bf16 (only when device is ‘gpu’ and backend is ‘onnxrt_cuda_ep’) + int8. To disable the bf16 data type, specify excluded_precisions = [‘bf16’].

type: list

default_value: []

searchable_values: None

tuning_criterion

Instance of TuningCriterion class. In this class you can set strategy, strategy_kwargs, timeout, max_trials and objective.

type: dict

default_value: {‘strategy’: ‘basic’, ‘strategy_kwargs’: None, ‘timeout’: 0, ‘max_trials’: 5, ‘objective’: ‘performance’}

searchable_values: None

metric

Accuracy metric to generate an evaluation function for Intel® Neural Compressor accuracy aware tuning.

type: olive.evaluator.metric.Metric | None

default_value: None

searchable_values: None

weight_only_config

INC weight only quantization config.

type: dict

default_value: {}

searchable_values: None

op_type_dict

INC weight only quantization config.

type: dict

default_value: {}

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data; only a tensor whose data is >= size_threshold is converted to external data. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

IncStaticQuantization

Intel® Neural Compressor Static Quantization Pass.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

approach

static quantization mode

type: str

default_value: static

searchable_values: None

diagnosis

Whether to enable diagnosis mode. If enabled, Intel® Neural Compressor will print the quantization summary.

type: bool

default_value: False

searchable_values: None

device

Intel® Neural Compressor quantization device. Supports ‘cpu’ and ‘gpu’.

type: str

default_value: cpu

searchable_values: None

backend

Backend for model execution. Supports ‘default’, ‘onnxrt_trt_ep’ and ‘onnxrt_cuda_ep’.

type: str

default_value: default

searchable_values: None

domain

Model domain. Supports ‘auto’, ‘cv’, ‘object_detection’, ‘nlp’ and ‘recommendation_system’. The Intel® Neural Compressor adaptor automatically applies domain-specific quantization settings; explicitly specified quantization settings override the automatic ones. If domain is set to ‘auto’, the domain is detected automatically.

type: str

default_value: auto

searchable_values: None

workspace

Workspace for Intel® Neural Compressor quantization where intermediate files and tuning history file are stored. Default value is: “./nc_workspace/{}/”.format(datetime.datetime.now().strftime(“%Y-%m-%d_%H-%M-%S”))

type: str

default_value: None

searchable_values: None

recipes

Recipes for Intel® Neural Compressor quantization. The supported entries are:
‘smooth_quant’: whether to do smooth quant;
‘smooth_quant_args’: parameters for smooth_quant;
‘fast_bias_correction’: whether to do fast bias correction;
‘weight_correction’: whether to do weight correction;
‘gemm_to_matmul’: whether to convert Gemm to MatMul and Add, only valid for ONNX models;
‘graph_optimization_level’: supports ‘DISABLE_ALL’, ‘ENABLE_BASIC’, ‘ENABLE_EXTENDED’, ‘ENABLE_ALL’, only valid for ONNX models;
‘first_conv_or_matmul_quantization’: whether to quantize the first conv or matmul;
‘last_conv_or_matmul_quantization’: whether to quantize the last conv or matmul;
‘pre_post_process_quantization’: whether to quantize the ops in preprocessing and postprocessing;
‘add_qdq_pair_to_weight’: whether to add a QDQ pair for weights, only valid for onnxrt_trt_ep;
‘optypes_to_exclude_output_quant’: don’t quantize the output of specified optypes;
‘dedicated_qdq_pair’: whether to use dedicated QDQ pairs, only valid for onnxrt_trt_ep.

type: dict

default_value: {}

searchable_values: None

reduce_range

Whether to use 7-bit quantization.

type: bool

default_value: False

searchable_values: Categorical([True, False])

quant_level

Intel® Neural Compressor allows users to choose different tuning processes by specifying the quantization level (quant_level). Currently three quant_levels are supported: 0 is a conservative strategy, 1 is a basic or user-specified strategy, and auto (the default) is the combination of 0 and 1. Please refer to https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-process and https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-algorithms for more details.

type: str

default_value: auto

searchable_values: None

excluded_precisions

Precisions to be excluded. The default value is an empty list. By default, Intel® Neural Compressor enables mixed precision with fp32 + bf16 (only when device is ‘gpu’ and backend is ‘onnxrt_cuda_ep’) + int8. To disable the bf16 data type, specify excluded_precisions = [‘bf16’].

type: list

default_value: []

searchable_values: None

tuning_criterion

Instance of TuningCriterion class. In this class you can set strategy, strategy_kwargs, timeout, max_trials and objective.

type: dict

default_value: {‘strategy’: ‘basic’, ‘strategy_kwargs’: None, ‘timeout’: 0, ‘max_trials’: 5, ‘objective’: ‘performance’}

searchable_values: None

metric

Accuracy metric to generate an evaluation function for Intel® Neural Compressor accuracy aware tuning.

type: olive.evaluator.metric.Metric | None

default_value: None

searchable_values: None

weight_only_config

INC weight only quantization config.

type: dict

default_value: {‘bits’: 4, ‘group_size’: 4, ‘scheme’: ‘asym’, ‘algorithm’: ‘RTN’}

searchable_values: None

op_type_dict

INC weight only quantization config.

type: dict

default_value: {}

searchable_values: None

data_dir

Path to the directory containing the dataset. For local data, it is required if approach is ‘static’ and dataloader_func is provided.

type: str | pathlib.Path | olive.resource_path.ResourcePath | olive.resource_path.ResourcePathConfig | None

default_value: None

searchable_values: None

batch_size

Batch size for calibration, only used if dataloader_func is provided.

type: int

default_value: 1

searchable_values: None

dataloader_func

Function/function name to generate dataloader for calibration, required if approach is ‘static’ and data_config is None.

type: Callable | str

default_value: None

searchable_values: None

dataloader_func_kwargs

Keyword arguments for dataloader_func.

type: Dict[str, Any]

default_value: None

searchable_values: None

data_config

Data config for calibration, required if approach is ‘static’ and dataloader_func is None.

type: olive.data.config.DataConfig | Dict

default_value: None

searchable_values: None

quant_format

Quantization format. Supports ‘QDQ’ and ‘QOperator’.

type: str

default_value: QOperator

searchable_values: Categorical([‘QOperator’, ‘QDQ’])

calibration_sampling_size

Number of calibration samples.

type: list | int

default_value: [100]

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None

IncQuantization

Quantize ONNX model with Intel® Neural Compressor.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

approach

Intel® Neural Compressor quantization mode: ‘dynamic’ for dynamic quantization, ‘static’ for static quantization, and ‘weight_only’ for 4-bit weight-only quantization.

type: str

default_value: static

searchable_values: Categorical([‘dynamic’, ‘static’, ‘weight_only’])

device

Intel® Neural Compressor quantization device. Supported values are ‘cpu’ and ‘gpu’.

type: str

default_value: cpu

searchable_values: None

backend

Backend for model execution. Supported values are ‘default’, ‘onnxrt_trt_ep’, and ‘onnxrt_cuda_ep’.

type: str

default_value: default

searchable_values: None

domain

Model domain. Supported values are ‘auto’, ‘cv’, ‘object_detection’, ‘nlp’, and ‘recommendation_system’. The Intel® Neural Compressor adaptor automatically applies quantization settings specific to each domain; explicitly specified quantization settings override the automatic ones. If domain is set to ‘auto’, automatic domain detection is performed.

type: str

default_value: auto

searchable_values: None

workspace

Workspace for Intel® Neural Compressor quantization where intermediate files and tuning history file are stored. Default value is: “./nc_workspace/{}/”.format(datetime.datetime.now().strftime(“%Y-%m-%d_%H-%M-%S”))

type: str

default_value: None

searchable_values: None

recipes

Recipes for Intel® Neural Compressor quantization. The supported keys are:

‘smooth_quant’: whether to do smooth quant

‘smooth_quant_args’: parameters for smooth_quant

‘fast_bias_correction’: whether to do fast bias correction

‘weight_correction’: whether to do weight correction

‘gemm_to_matmul’: whether to convert gemm to matmul and add, only valid for onnx models

‘graph_optimization_level’: supports ‘DISABLE_ALL’, ‘ENABLE_BASIC’, ‘ENABLE_EXTENDED’, ‘ENABLE_ALL’, only valid for onnx models

‘first_conv_or_matmul_quantization’: whether to quantize the first conv or matmul

‘last_conv_or_matmul_quantization’: whether to quantize the last conv or matmul

‘pre_post_process_quantization’: whether to quantize the ops in preprocessing and postprocessing

‘add_qdq_pair_to_weight’: whether to add a QDQ pair for weights, only valid for onnxrt_trt_ep

‘optypes_to_exclude_output_quant’: don’t quantize the output of specified optypes

‘dedicated_qdq_pair’: whether to use a dedicated QDQ pair, only valid for onnxrt_trt_ep

type: dict

default_value: {}

searchable_values: None

reduce_range

Whether to use the reduced 7-bit range for quantization.

type: bool

default_value: False

searchable_values: Categorical([True, False])

quant_level

Intel® Neural Compressor allows users to choose different tuning processes by specifying the quantization level (quant_level). Currently 3 quant_levels are supported: 0 is the conservative strategy, 1 is the basic or user-specified strategy, and auto (default) is the combination of 0 and 1. For more details, please refer to https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-process and https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-algorithms.

type: str

default_value: auto

searchable_values: None

excluded_precisions

Precisions to be excluded. Default value is an empty list. By default, Intel® Neural Compressor enables mixed precision with fp32 + bf16 (only when device is ‘gpu’ and backend is ‘onnxrt_cuda_ep’) + int8. To disable the bf16 data type, specify excluded_precisions = [‘bf16’].

type: list

default_value: []

searchable_values: None

tuning_criterion

Instance of TuningCriterion class. In this class you can set strategy, strategy_kwargs, timeout, max_trials and objective.

type: dict

default_value: {‘strategy’: ‘basic’, ‘strategy_kwargs’: None, ‘timeout’: 0, ‘max_trials’: 5, ‘objective’: ‘performance’}

searchable_values: None

metric

Accuracy metric to generate an evaluation function for Intel® Neural Compressor accuracy aware tuning.

type: olive.evaluator.metric.Metric | None

default_value: None

searchable_values: None

weight_only_config

INC weight-only quantization config.

type: dict

default_value: {‘bits’: 4, ‘group_size’: 4, ‘scheme’: ‘asym’, ‘algorithm’: ‘RTN’}

searchable_values: None

op_type_dict

INC weight-only quantization config, specified per operator type.

type: dict

default_value: {}

searchable_values: None

data_dir

Path to the directory containing the dataset. For local data, it is required if approach is ‘static’ and dataloader_func is provided.

type: str | pathlib.Path | olive.resource_path.ResourcePath | olive.resource_path.ResourcePathConfig | None

default_value: None

searchable_values: None

batch_size

Batch size for calibration, only used if dataloader_func is provided.

type: int

default_value: 1

searchable_values: None

dataloader_func

Function/function name to generate dataloader for calibration, required if approach is ‘static’ and data_config is None.

type: Callable | str

default_value: None

searchable_values: None

dataloader_func_kwargs

Keyword arguments for dataloader_func.

type: Dict[str, Any]

default_value: None

searchable_values: None

data_config

Data config for calibration, required if approach is ‘static’ and dataloader_func is None.

type: olive.data.config.DataConfig | Dict

default_value: None

searchable_values: None

quant_format

Quantization format. Supported values are ‘QDQ’ and ‘QOperator’.

type: str

default_value: QOperator

searchable_values: Conditional(parents: (‘approach’,), support: {(‘static’,): Categorical([‘QOperator’, ‘QDQ’])}, default: Categorical([‘default’]))

calibration_sampling_size

Number of calibration samples.

type: list | int

default_value: [100]

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None
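
For illustration, a minimal sketch of how this pass might be declared in an Olive workflow config; the pass alias “inc_quant” and the data config name “calib_data” are hypothetical placeholders, and only the option keys come from the tables above.

```json
{
  "passes": {
    "inc_quant": {
      "type": "IncQuantization",
      "config": {
        "approach": "static",
        "quant_format": "QDQ",
        "data_config": "calib_data",
        "calibration_sampling_size": [100],
        "excluded_precisions": ["bf16"]
      }
    }
  }
}
```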

VitisAIQuantization

Quantize ONNX model with onnxruntime. Searches for the best parameters for vai_q_onnx quantization at the same time.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

quant_mode

ONNX quantization mode. ‘static’ for Vitis AI quantization.

type: str

default_value: static

searchable_values: Categorical([‘static’])

data_dir

Path to the directory containing the dataset.

type: str | pathlib.Path | olive.resource_path.ResourcePath | olive.resource_path.ResourcePathConfig | None

default_value: None

searchable_values: None

batch_size

Batch size for calibration, required.

type: int

default_value: 1

searchable_values: None

dataloader_func

Function/function name to generate dataloader for calibration. Required.

type: Callable | str

required: True

dataloader_func_kwargs

Keyword arguments for dataloader_func.

type: Dict[str, Any]

default_value: None

searchable_values: None

weight_type

Data type for quantizing weights, used in vai_q_onnx quantization. ‘QInt8’ for signed 8-bit integer.

type: str

default_value: QInt8

searchable_values: Categorical([‘QInt8’])

input_nodes

Start node that needs quantization. If None, all quantizable.

type: list

default_value: None

searchable_values: None

output_nodes

End node that needs quantization. If None, all quantizable.

type: list

default_value: None

searchable_values: None

op_types_to_quantize

List of operator types to quantize. If None, all quantizable.

type: list

default_value: None

searchable_values: None

nodes_to_quantize

List of node names to quantize. If None, all quantizable.

type: list

default_value: None

searchable_values: None

nodes_to_exclude

List of node names to exclude from quantization. If None, no nodes are excluded.

type: list

default_value: None

searchable_values: None

per_channel

Quantize weights per channel.

type: bool

default_value: False

searchable_values: Categorical([True, False])

optimize_model

Deprecating Soon in ONNX! Optimize model before quantization. NOT recommended, optimization will change the computation graph, making debugging of quantization loss difficult.

type: bool

default_value: False

searchable_values: Categorical([True, False])

use_external_data_format

Option used for large (>2GB) models. Set to True by default.

type: bool

default_value: True

searchable_values: None

quant_preprocess

Shape inference and model optimization, in preparation for quantization. https://onnxruntime.ai/docs/performance/quantization.html#pre-processing

type: bool

default_value: True

searchable_values: Categorical([True, False])

calibrate_method

Supported calibration methods are ‘NonOverflow’ and ‘MinMSE’.

type: str

default_value: MinMSE

searchable_values: Categorical([‘NonOverflow’, ‘MinMSE’])

quant_format

Quantization format. ‘QDQ’ quantizes the model by inserting QuantizeLinear/DeQuantizeLinear nodes on the tensors.

type: str

default_value: QDQ

searchable_values: Categorical([‘QDQ’, ‘QOperator’])

need_layer_fusing

Perform layer fusion for conv-relu type operations

type: bool

default_value: False

searchable_values: Categorical([True, False])

activation_type

Quantization data type of activation.

type: str

default_value: QUInt8

searchable_values: Conditional(parents: (‘quant_format’, ‘weight_type’), support: {(‘QDQ’, ‘QInt8’): Categorical([‘QInt8’]), (‘QDQ’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘QOperator’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘QOperator’, ‘QInt8’): Categorical([<SpecialParamValue.INVALID: ‘OLIVE_INVALID_PARAM_VALUE’>])}, default: Categorical([<SpecialParamValue.INVALID: ‘OLIVE_INVALID_PARAM_VALUE’>]))

enable_dpu

Use QDQ format optimized specifically for DPU.

type: bool

default_value: False

searchable_values: Categorical([True, False])

ActivationSymmetric

Symmetrize calibration data for activations.

type: bool

default_value: False

searchable_values: None

WeightSymmetric

Symmetrize calibration data for weights.

type: bool

default_value: True

searchable_values: None

AddQDQPairToWeight

Keep the weight in floating point and insert both QuantizeLinear/DeQuantizeLinear nodes for it.

type: bool

default_value: False

searchable_values: None

extra_options

Key value pair dictionary for extra_options in quantization. If an option is one of [‘ActivationSymmetric’, ‘WeightSymmetric’, ‘AddQDQPairToWeight’], it will be overwritten by the corresponding config parameter value.

type: dict

default_value: None

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None
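
A hedged example of a possible pass declaration; user_script.py, create_calib_dataloader, and the data directory are hypothetical user-supplied names.

```json
{
  "passes": {
    "vitis_quant": {
      "type": "VitisAIQuantization",
      "config": {
        "user_script": "user_script.py",
        "dataloader_func": "create_calib_dataloader",
        "data_dir": "data/calibration",
        "calibrate_method": "MinMSE",
        "enable_dpu": true
      }
    }
  }
}
```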

AppendPrePostProcessingOps

Add Pre/Post nodes to the input model.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler

pre

List of pre-processing commands to add.

type: List[Dict[str, Any]]

default_value: None

searchable_values: None

post

List of post-processing commands to add.

type: List[Dict[str, Any]]

default_value: None

searchable_values: None

tool_command

Composited tool commands to invoke.

type: str

default_value: None

searchable_values: None

tool_command_args

Arguments to pass to the tool command or to PrePostProcessor. When used for PrePostProcessor, each entry follows a schema like: { “name”: “image”, “data_type”: “uint8”, “shape”: [“num_bytes”] }.

type: Dict[str, Any] | List[olive.passes.onnx.append_pre_post_processing_ops.PrePostProcessorInput]

default_value: None

searchable_values: None

target_opset

The version of the default (ai.onnx) opset to target.

type: int

default_value: 16

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None
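
For example, a sketch that composes a tool command; the command name “whisper” and its arguments are assumptions to verify against your installed onnxruntime-extensions version.

```json
{
  "passes": {
    "prepost": {
      "type": "AppendPrePostProcessingOps",
      "config": {
        "tool_command": "whisper",
        "tool_command_args": {"use_audio_decoder": true},
        "target_opset": 17
      }
    }
  }
}
```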

InsertBeamSearch

Insert Beam Search Op. Only used for Whisper models. Uses the WhisperBeamSearch contrib op if ORT version >= 1.17.1, else uses the BeamSearch contrib op.

Input: handler.base.OliveModelHandler

Output: handler.onnx.ONNXModelHandler

no_repeat_ngram_size

If set to int > 0, all ngrams of that size can only occur once.

type: int

default_value: 0

searchable_values: None

use_vocab_mask

Use vocab_mask as an extra graph input to the beam search op. Only supported in ORT >= 1.16.0

type: bool

default_value: False

searchable_values: None

use_prefix_vocab_mask

Use prefix_vocab_mask as an extra graph input to the beam search op. Only supported in ORT >= 1.16.0

type: bool

default_value: False

searchable_values: None

use_forced_decoder_ids

Use decoder_input_ids as an extra graph input to the beam search op. Only supported in ORT >= 1.16.0

type: bool

default_value: False

searchable_values: None

use_logits_processor

Use logits_processor as an extra graph input to the beam search op. Only supported in ORT >= 1.16.0

type: bool

default_value: False

searchable_values: None

use_temperature

Use temperature as an extra graph input to the beam search op. Only supported in ORT >= 1.17.1

type: bool

default_value: False

searchable_values: None

fp16

Is the model in fp16 precision.

type: bool

default_value: False

searchable_values: None

use_gpu

Use GPU for beam search op.

type: bool

default_value: False

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None
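
An illustrative sketch for a Whisper export pipeline; all option values are examples, not requirements.

```json
{
  "passes": {
    "beam_search": {
      "type": "InsertBeamSearch",
      "config": {
        "use_forced_decoder_ids": true,
        "use_logits_processor": true,
        "fp16": false
      }
    }
  }
}
```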

ExtractAdapters

Extract adapter weights from model and save them as external weights file. If make_inputs is False, model proto is invalid after this pass as the adapter weights point to non-existent external files. Inference session must be created by first loading the adapter weights using SessionOptions.add_external_initializers. If make_inputs is True, the adapter weights are inputs to the model and must be provided during inference.

Input: handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler

make_inputs

Convert adapter weights to inputs. If false, the adapter weights will be set as initializers with external data.

type: bool

default_value: False

searchable_values: None

pack_inputs

Pack adapter weights for the same module type into a single input tensor. Only used if make_inputs is True.

type: bool

default_value: True

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None
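
A brief sketch; with make_inputs set to true, the extracted adapter weights become model inputs that must be supplied at inference time.

```json
{
  "passes": {
    "extract_adapters": {
      "type": "ExtractAdapters",
      "config": {
        "make_inputs": true,
        "pack_inputs": true,
        "save_as_external_data": true
      }
    }
  }
}
```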

LoRA

Run LoRA fine-tuning on a Hugging Face PyTorch model. This pass only supports PyTorchModelHandler with hf_config.

Input: handler.pytorch.PyTorchModelHandler

Output: handler.pytorch.PyTorchModelHandler

target_modules

Target modules

type: List[str]

default_value: None

searchable_values: None

use_ort_trainer

Whether or not to use ORTTrainer.

type: bool

default_value: False

searchable_values: None

ortmodule_onnx_opset_version

The opset version to use for ONNX export when using ORTTrainer. Only used if use_ort_trainer is True. 16+ is required when using bfloat16 and the model has operators such as Where.

type: int

default_value: 16

searchable_values: None

lora_r

Lora attention dimension.

type: int

default_value: 64

searchable_values: None

lora_alpha

The alpha parameter for Lora scaling.

type: float

default_value: 16

searchable_values: None

lora_dropout

The dropout probability for Lora layers.

type: float

default_value: 0.0

searchable_values: None

bias

Bias type for Lora

type: str

default_value: none

searchable_values: None

modules_to_save

List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint.

type: None

default_value: None

searchable_values: None

torch_dtype

Data type to use for training. Should be one of bfloat16, float16 or float32. If float16, fp16 mixed-precision training will be used.

type: str

default_value: bfloat16

searchable_values: None

allow_tf32

Whether or not to allow TF32 on Ampere GPUs. Can be used to speed up training. For more information, see https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices

type: bool

default_value: True

searchable_values: None

train_data_config

Data config for fine-tuning training. If eval_data_config is not provided and eval_dataset_size is not None, the data will be split into train and eval. Otherwise, the data will be used for training only.

type: olive.data.config.DataConfig | Dict

required: True

eval_data_config

Data config for fine-tuning evaluation. Optional if eval_dataset_size is provided or evaluation is not needed.

type: olive.data.config.DataConfig | Dict

default_value: None

searchable_values: None

eval_dataset_size

Size of the validation dataset. Should be either a positive integer smaller than the number of training samples or a float in the (0, 1) range. If eval_data_config is provided, this parameter will be ignored.

type: float

default_value: None

searchable_values: None

training_args

Training arguments. If None, will use default arguments. See HFTrainingArguments for more details.

type: olive.passes.pytorch.lora.HFTrainingArguments | Dict

default_value: None

searchable_values: None
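
A minimal sketch; the data config name “train_data” is a placeholder, and training_args takes the HFTrainingArguments fields documented below.

```json
{
  "passes": {
    "lora": {
      "type": "LoRA",
      "config": {
        "lora_r": 64,
        "lora_alpha": 16,
        "torch_dtype": "bfloat16",
        "train_data_config": "train_data",
        "training_args": {"max_steps": 1000, "learning_rate": 0.0002}
      }
    }
  }
}
```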

QLoRA

Run QLoRA fine-tuning on a Hugging Face PyTorch model. This pass only supports PyTorchModelHandler with hf_config.

Input: handler.pytorch.PyTorchModelHandler

Output: handler.pytorch.PyTorchModelHandler

double_quant

Whether to use nested quantization where the quantization constants from the first quantization are quantized again.

type: bool

default_value: False

searchable_values: None

quant_type

Quantization data type to use. Should be one of fp4 or nf4.

type: str

default_value: nf4

searchable_values: None

compute_dtype

Computation data type for the quantized modules. If not provided, will use the same dtype as torch_dtype

type: str

default_value: None

searchable_values: None

use_ort_trainer

Whether or not to use ORTTrainer.

type: bool

default_value: False

searchable_values: None

ortmodule_onnx_opset_version

The opset version to use for ONNX export when using ORTTrainer. Only used if use_ort_trainer is True. 16+ is required when using bfloat16 and the model has operators such as Where.

type: int

default_value: 16

searchable_values: None

lora_r

Lora attention dimension.

type: int

default_value: 64

searchable_values: None

lora_alpha

The alpha parameter for Lora scaling.

type: float

default_value: 16

searchable_values: None

lora_dropout

The dropout probability for Lora layers.

type: float

default_value: 0.0

searchable_values: None

bias

Bias type for Lora

type: str

default_value: none

searchable_values: None

modules_to_save

List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint.

type: None

default_value: None

searchable_values: None

torch_dtype

Data type to use for training. Should be one of bfloat16, float16 or float32. If float16, fp16 mixed-precision training will be used.

type: str

default_value: bfloat16

searchable_values: None

allow_tf32

Whether or not to allow TF32 on Ampere GPUs. Can be used to speed up training. For more information, see https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices

type: bool

default_value: True

searchable_values: None

train_data_config

Data config for fine-tuning training. If eval_data_config is not provided and eval_dataset_size is not None, the data will be split into train and eval. Otherwise, the data will be used for training only.

type: olive.data.config.DataConfig | Dict

required: True

eval_data_config

Data config for fine-tuning evaluation. Optional if eval_dataset_size is provided or evaluation is not needed.

type: olive.data.config.DataConfig | Dict

default_value: None

searchable_values: None

eval_dataset_size

Size of the validation dataset. Should be either a positive integer smaller than the number of training samples or a float in the (0, 1) range. If eval_data_config is provided, this parameter will be ignored.

type: float

default_value: None

searchable_values: None

training_args

Training arguments. If None, will use default arguments. See HFTrainingArguments for more details.

type: olive.passes.pytorch.lora.HFTrainingArguments | Dict

default_value: None

searchable_values: None
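
QLoRA layers its quantization options on top of the shared LoRA options; a hedged sketch with a placeholder data config name:

```json
{
  "passes": {
    "qlora": {
      "type": "QLoRA",
      "config": {
        "quant_type": "nf4",
        "double_quant": true,
        "compute_dtype": "bfloat16",
        "lora_r": 64,
        "train_data_config": "train_data"
      }
    }
  }
}
```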

LoftQ

Run LoftQ fine-tuning on a Hugging Face PyTorch model. This pass only supports PyTorchModelHandler with hf_config.

Input: handler.pytorch.PyTorchModelHandler

Output: handler.pytorch.PyTorchModelHandler

loftq_iter

Number of LoftQ iterations.

type: int

default_value: 1

searchable_values: None

compute_dtype

Computation data type for the quantized modules. If not provided, will use the same dtype as torch_dtype

type: str

default_value: None

searchable_values: None

use_ort_trainer

Whether or not to use ORTTrainer.

type: bool

default_value: False

searchable_values: None

ortmodule_onnx_opset_version

The opset version to use for ONNX export when using ORTTrainer. Only used if use_ort_trainer is True. 16+ is required when using bfloat16 and the model has operators such as Where.

type: int

default_value: 16

searchable_values: None

lora_r

Lora attention dimension.

type: int

default_value: 64

searchable_values: None

lora_alpha

The alpha parameter for Lora scaling.

type: float

default_value: 16

searchable_values: None

lora_dropout

The dropout probability for Lora layers.

type: float

default_value: 0.0

searchable_values: None

bias

Bias type for Lora

type: str

default_value: none

searchable_values: None

modules_to_save

List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint.

type: None

default_value: None

searchable_values: None

torch_dtype

Data type to use for training. Should be one of bfloat16, float16 or float32. If float16, fp16 mixed-precision training will be used.

type: str

default_value: bfloat16

searchable_values: None

allow_tf32

Whether or not to allow TF32 on Ampere GPUs. Can be used to speed up training. For more information, see https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices

type: bool

default_value: True

searchable_values: None

train_data_config

Data config for fine-tuning training. If eval_data_config is not provided and eval_dataset_size is not None, the data will be split into train and eval. Otherwise, the data will be used for training only.

type: olive.data.config.DataConfig | Dict

required: True

eval_data_config

Data config for fine-tuning evaluation. Optional if eval_dataset_size is provided or evaluation is not needed.

type: olive.data.config.DataConfig | Dict

default_value: None

searchable_values: None

eval_dataset_size

Size of the validation dataset. Should be either a positive integer smaller than the number of training samples or a float in the (0, 1) range. If eval_data_config is provided, this parameter will be ignored.

type: float

default_value: None

searchable_values: None

training_args

Training arguments. If None, will use default arguments. See HFTrainingArguments for more details.

type: olive.passes.pytorch.lora.HFTrainingArguments | Dict

default_value: None

searchable_values: None
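
A hedged LoftQ sketch (placeholder data config name):

```json
{
  "passes": {
    "loftq": {
      "type": "LoftQ",
      "config": {
        "loftq_iter": 1,
        "lora_r": 64,
        "train_data_config": "train_data"
      }
    }
  }
}
```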

LoRA/QLoRA/LoftQ HFTrainingArguments

pydantic settings olive.passes.pytorch.lora.HFTrainingArguments

Training arguments for transformers.Trainer.

Has the same fields as transformers.TrainingArguments with recommended default values for QLoRA fine-tuning.

field seed: int = 42

Random seed for initialization.

field data_seed: int = 42

Random seed to be used with data samplers.

field optim: str = 'paged_adamw_32bit'

The optimizer to use.

field per_device_train_batch_size: int = 1

The batch size per GPU for training.

field per_device_eval_batch_size: int = 1

The batch size per GPU for evaluation.

field gradient_accumulation_steps: int = 16

Number of update steps to accumulate the gradients for, before performing a backward/update pass.

field max_steps: int = 10000

The total number of training steps to perform.

field weight_decay: float = 0.0

The L2 weight decay rate of AdamW

field learning_rate: float = 0.0002

The initial learning rate for AdamW.

field gradient_checkpointing: bool = True

Use gradient checkpointing. Recommended.

field lr_scheduler_type: str = 'constant'

Learning rate schedule. Constant is a bit better than cosine, and has advantages for analysis.

field warmup_ratio: float = 0.03

Fraction of steps to do a warmup for.

field logging_steps: int = 10

Number of update steps between two logs.

field evaluation_strategy: str = 'no'

The evaluation strategy to use. Will be forced to ‘no’ if there is no eval dataset.

field eval_steps: float = None

Number of update steps between two evaluations if evaluation_strategy=’steps’. Will default to the same value as logging_steps if not set

field group_by_length: bool = True

Whether or not to group samples of roughly the same length together when batching.

field report_to: str | List[str] = 'none'

The list of integrations to report the results and logs to.

field output_dir: str = None

The output dir for logs and checkpoints. If None, will use a temp dir.

field overwrite_output_dir: bool = False

If True, overwrite the content of output_dir. Otherwise, will continue training if output_dir points to a checkpoint directory.

field resume_from_checkpoint: str = None

The path to a folder with a valid checkpoint for the model. Supersedes any checkpoint found in output_dir.

field extra_args: Dict[str, Any] = None

Extra arguments to pass to the trainer. Values can be provided directly to this field as a dict or as keyword arguments to the config. See transformers.TrainingArguments for more details on the available arguments.

create_training_args(use_ort_trainer: bool) → TrainingArguments
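
Because HFTrainingArguments mirrors transformers.TrainingArguments, fields not listed above can be passed through extra_args. An illustrative training_args value for a LoRA/QLoRA/LoftQ pass config (all numbers are examples):

```json
{
  "training_args": {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "max_steps": 1500,
    "learning_rate": 0.0002,
    "extra_args": {"save_total_limit": 1}
  }
}
```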

QuantizationAwareTraining

Run quantization aware training on PyTorch model.

Input: handler.pytorch.PyTorchModelHandler

Output: handler.pytorch.PyTorchModelHandler

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

train_data_dir

Directory of training data.

type: str

default_value: None

searchable_values: None

val_data_dir

Directory of validation data.

type: str

default_value: None

searchable_values: None

train_dataloader_func

Dataloader function to load training data from given train_data_dir with given train_batch_size.

type: Callable | str

default_value: None

searchable_values: None

training_loop_func

Customized training loop function.

type: Callable | str

default_value: None

searchable_values: None

ptl_module

LightningModule for PyTorch Lightning trainer. It is a way of encapsulating all the logic related to the training, validation, and testing of a PyTorch model. Please refer to https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html for more details.

type: Callable | str

default_value: None

searchable_values: None

ptl_data_module

LightningDataModule for PyTorch Lightning trainer. It is a way of encapsulating all the data-related logic for training, validation, and testing of a PyTorch model. Please refer to https://pytorch-lightning.readthedocs.io/en/stable/data/datamodule.html for more details.

type: Callable | str

default_value: None

searchable_values: None

train_batch_size

Batch size for training.

type: int

default_value: None

searchable_values: None

num_epochs

Maximum number of epochs for training.

type: int

default_value: None

searchable_values: None

num_steps

Maximum number of steps for training.

type: int

default_value: -1

searchable_values: None

do_validate

Whether to perform one evaluation epoch over the validation set after training.

type: bool

default_value: False

searchable_values: None

modules_to_fuse

List of lists of module names to fuse.

type: List[List[str]]

default_value: None

searchable_values: None

qconfig_func

Customized function to create a QConfig for QAT. Please refer to https://pytorch.org/docs/stable/generated/torch.ao.quantization.qconfig.QConfig.html for details.

type: Callable | str

default_value: None

searchable_values: None

logger

Logger for training.

type: pytorch_lightning.loggers.logger.Logger | Iterable[pytorch_lightning.loggers.logger.Logger] | Callable | bool

default_value: False

searchable_values: None

gpus

Number of GPUs to use.

type: int

default_value: None

searchable_values: None

seed

Random seed for training.

type: int

default_value: None

searchable_values: None

checkpoint_path

Path to save checkpoints.

type: str

default_value: None

searchable_values: None
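
A hedged sketch driving QAT through a PyTorch Lightning module; user_script.py and create_ptl_module are hypothetical user-supplied names.

```json
{
  "passes": {
    "qat": {
      "type": "QuantizationAwareTraining",
      "config": {
        "user_script": "user_script.py",
        "ptl_module": "create_ptl_module",
        "num_epochs": 2,
        "train_batch_size": 128
      }
    }
  }
}
```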

OpenVINOConversion

Converts a PyTorch, ONNX, or TensorFlow model to an OpenVINO model.

Input: handler.pytorch.PyTorchModelHandler | handler.onnx.ONNXModelHandler

Output: handler.openvino.OpenVINOModelHandler

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

input

Set or override shapes for model inputs. It configures dynamic and static dimensions in model inputs depending on your inference requirements.

type: Callable | str | List

default_value: None

searchable_values: None

example_input_func

Function/function name to generate a sample of model input in the original framework. For PyTorch it can be torch.Tensor. For TensorFlow it can be tf.Tensor or numpy.ndarray.

type: Callable | str

default_value: None

searchable_values: None

compress_to_fp16

Compress weights in output OpenVINO model to FP16. Default is True.

type: bool

default_value: True

searchable_values: None

extra_configs

Extra configurations for OpenVINO model conversion. extra_config can be set by passing a dictionary where key is the parameter name, and the value is the parameter value. Please check Conversion Parameters documentation for more details: https://docs.openvino.ai/2023.3/openvino_docs_OV_Converter_UG_Conversion_Options.html

type: Dict

default_value: None

searchable_values: None

output_model

Name of the output OpenVINO model.

type: str

default_value: ov_model

searchable_values: None
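
An illustrative sketch; the exact form accepted by the input option should be checked against the OpenVINO conversion parameters linked above, so treat the shape list below as an assumption.

```json
{
  "passes": {
    "ov_convert": {
      "type": "OpenVINOConversion",
      "config": {
        "input": [[1, 3, 224, 224]],
        "compress_to_fp16": true,
        "output_model": "ov_model"
      }
    }
  }
}
```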

OpenVINOQuantization

Input: handler.openvino.OpenVINOModelHandler

Output: handler.openvino.OpenVINOModelHandler

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

dataloader_func

Function/function name to generate dataloader for calibration, required if data_config is None.

type: Callable | str

default_value: None

searchable_values: None

dataloader_func_kwargs

Keyword arguments for dataloader_func.

type: Dict[str, Any]

default_value: None

searchable_values: None

data_dir

Path to the directory containing the dataset. For local data, it is required if dataloader_func is provided.

type: str | pathlib.Path | olive.resource_path.ResourcePath | olive.resource_path.ResourcePathConfig | None

default_value: None

searchable_values: None

batch_size

Batch size for calibration, only used if dataloader_func is provided.

type: int

default_value: 1

searchable_values: None

data_config

Data config for calibration, required if dataloader_func is None.

type: olive.data.config.DataConfig | Dict

default_value: None

searchable_values: None

model_type

Used to specify quantization scheme required for specific type of the model. ‘TRANSFORMER’ is the only supported special quantization scheme to preserve accuracy after quantization of Transformer models (BERT, DistilBERT, etc.). None is default.

type: olive.passes.openvino.quantization.ModelTypeEnum

default_value: None

searchable_values: None

preset

Defines quantization scheme for the model. Supported values: ‘PERFORMANCE’, ‘MIXED’.

type: olive.passes.openvino.quantization.PresetEnum

default_value: PERFORMANCE

searchable_values: None

ignored_scope

This parameter can be used to exclude some layers from the quantization process to preserve the model accuracy. Please refer to https://docs.openvino.ai/2023.3/basic_quantization_flow.html#tune-quantization-parameters.

type: str | List[str]

default_value: None

searchable_values: None

ignored_scope_type

Defines the type of the ignored scope. Supported values: ‘names’, ‘types’, ‘patterns’.

type: olive.passes.openvino.quantization.IgnoreScopeTypeEnum

default_value: None

searchable_values: None

target_device

Target device for the model. Supported values: ‘any’, ‘cpu’, ‘gpu’, ‘cpu_spr’, ‘vpu’. Default value is the same as the accelerator type of this workflow run.

type: olive.hardware.accelerator.Device

default_value: cpu

searchable_values: None

extra_configs

Extra configurations for OpenVINO model quantization. Please refer to https://docs.openvino.ai/2023.3/basic_quantization_flow.html#tune-quantization-parameters.

type: List[Dict]

default_value: None

searchable_values: None
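
A hedged sketch using a calibration data config (placeholder name):

```json
{
  "passes": {
    "ov_quant": {
      "type": "OpenVINOQuantization",
      "config": {
        "data_config": "calib_data",
        "model_type": "TRANSFORMER",
        "preset": "PERFORMANCE",
        "target_device": "cpu"
      }
    }
  }
}
```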

SNPEConversion

Convert ONNX or TensorFlow model to SNPE DLC. Uses snpe-tensorflow-to-dlc or snpe-onnx-to-dlc tools from the SNPE SDK.

Input: handler.onnx.ONNXModelHandler | handler.tensorflow.TensorFlowModelHandler

Output: handler.snpe.SNPEModelHandler

input_names

List of input names.

type: List[str]

required: True

input_shapes

List of input shapes. Must be the same length as input_names.

type: List[List[int]]

required: True

output_names

List of output names.

type: List[str]

required: True

input_types

List of input types. If not None, it must be a list of the same length as input_names. List members can be None to use default value. Refer to olive.platform_sdk.qualcomm.constants.InputType for valid values.

type: List[str | None]

default_value: None

searchable_values: None

input_layouts

List of input layouts. If not None, it must be a list of the same length as input_names. List members can be None to use inferred value. Refer to olive.platform_sdk.qualcomm.constants.InputLayout for valid values.

type: List[str | None]

default_value: None

searchable_values: None

extra_args

Extra arguments to pass to the SNPE conversion tool. Refer to snpe-onnx-to-dlc and snpe-tensorflow-to-dlc at https://developer.qualcomm.com/sites/default/files/docs/snpe/tools.html for additional arguments. The value is a string that is passed as-is to the tool, e.g.: --enable_cpu_fallback --priority_hint low

type: str

default_value: None

searchable_values: None
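
An illustrative sketch; the input/output names and shapes must match the actual model graph.

```json
{
  "passes": {
    "snpe_convert": {
      "type": "SNPEConversion",
      "config": {
        "input_names": ["input"],
        "input_shapes": [[1, 3, 224, 224]],
        "output_names": ["output"]
      }
    }
  }
}
```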

SNPEQuantization

Quantize SNPE model. Uses snpe-dlc-quantize tool from the SNPE SDK.

Input: handler.snpe.SNPEModelHandler

Output: handler.snpe.SNPEModelHandler

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

data_dir

Path to the data directory. Required if data_config is None.

type: str | pathlib.Path | olive.resource_path.ResourcePath | olive.resource_path.ResourcePathConfig | None

default_value: None

searchable_values: None

dataloader_func

Function or function name to create a dataloader for quantization. The function should take the data directory as an argument and return a FileListDataLoader or torch.utils.data.DataLoader-like object. Required if data_config is None.

type: Callable | str

default_value: None

searchable_values: None

dataloader_func_kwargs

Keyword arguments for dataloader_func.

type: Dict[str, Any]

default_value: None

searchable_values: None

data_config

Data config for quantization, required if dataloader_func is None

type: olive.data.config.DataConfig | Dict

default_value: None

searchable_values: None

use_enhanced_quantizer

Use the enhanced quantizer feature when quantizing the model. Uses an algorithm to determine optimal range instead of min and max range of data. It can be useful for quantizing models that have long tails in the distribution of the data being quantized.

type: bool

default_value: False

searchable_values: Categorical([True, False])

enable_htp

Pack HTP information in the quantized DLC. This option is not available on Windows.

type: bool

default_value: False

searchable_values: Categorical([True, False])

htp_socs

List of SoCs to generate HTP Offline cache for.

type: List[str]

default_value: None

searchable_values: None

extra_args

Extra arguments to pass to the snpe-dlc-quantize tool. Refer to https://developer.qualcomm.com/sites/default/files/docs/snpe/tools.html#tools_snpe-dlc-quantize for additional arguments. The value is a string that is passed as-is to the tool, e.g.: --bias_bitwidth 16 --overwrite_cache_records

type: str

default_value: None

searchable_values: None
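
A hedged sketch; user_script.py and create_quant_dataloader are hypothetical user-supplied names.

```json
{
  "passes": {
    "snpe_quant": {
      "type": "SNPEQuantization",
      "config": {
        "user_script": "user_script.py",
        "dataloader_func": "create_quant_dataloader",
        "data_dir": "data/quant",
        "enable_htp": true
      }
    }
  }
}
```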

SNPEtoONNXConversion

Convert a SNPE DLC to ONNX to use with the SNPE Execution Provider. Creates an ONNX graph with the SNPE DLC as a node.

Input: handler.snpe.SNPEModelHandler

Output: handler.onnx.ONNXModelHandler

target_device

Target device for the ONNX model. Refer to olive.platform_sdk.qualcomm.constants.SNPEDevice for valid values.

type: str

default_value: cpu

searchable_values: None

target_opset

Target ONNX opset version.

type: int

default_value: 12

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None
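
A short illustrative declaration:

```json
{
  "passes": {
    "snpe_to_onnx": {
      "type": "SNPEtoONNXConversion",
      "config": {
        "target_device": "cpu",
        "target_opset": 12
      }
    }
  }
}
```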

QNNConversion

Convert ONNX, TensorFlow, or PyTorch model to QNN C++ model. Quantizes the model if --input_list is provided in extra_args. Uses the qnn-[framework]-converter tool from the QNN SDK.

Input: handler.tensorflow.TensorFlowModelHandler | handler.pytorch.PyTorchModelHandler | handler.onnx.ONNXModelHandler

Output: handler.qnn.QNNModelHandler

input_dim

The names and dimensions of the network input layers, specified in the format [input_name comma-separated-dimensions], for example: [“data 1,224,224,3”]. Note that the quotes should always be included in order to handle special characters, spaces, etc. For multiple inputs, specify multiple --input_dim arguments, like: [“data 1,224,224,3”, “data2 1,224,224,3”]. If --input_dim is not specified, the input dimensions will be inferred from the model. If --input_dim is specified, the input dimensions will be used as-is.

type: List[str]

default_value: None

searchable_values: None

out_node

The names of the output nodes. If not specified, the output nodes will be inferred from the model. If specified, they will be used as-is. Example: [“out_1”, “out_2”]

type: List[str]

default_value: None

searchable_values: None

extra_args

Extra arguments to pass to the qnn-[framework]-converter tool, e.g. --show_unconsumed_nodes --custom_io CUSTOM_IO. See the documentation for more details: https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/tools.html

type: str

default_value: None

searchable_values: None
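
A hedged sketch; the input name and dimensions are examples only.

```json
{
  "passes": {
    "qnn_convert": {
      "type": "QNNConversion",
      "config": {
        "input_dim": ["data 1,224,224,3"],
        "out_node": ["output"]
      }
    }
  }
}
```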

QNNModelLibGenerator

Compile QNN C++ model source code into QNN model library for a specific target. Uses qnn-model-lib-generator tool from the QNN SDK.

Input: handler.qnn.QNNModelHandler

Output: handler.qnn.QNNModelHandler

lib_targets

Specifies the targets to build the models for. Default: aarch64-android x86_64-linux-clang

type: str

default_value: None

searchable_values: None

lib_name

Specifies the name to use for libraries. Default: uses name in <model.bin> if provided, else generic qnn_model.so

type: str

default_value: None

searchable_values: None
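
An illustrative declaration targeting a single platform:

```json
{
  "passes": {
    "qnn_lib": {
      "type": "QNNModelLibGenerator",
      "config": {
        "lib_targets": "x86_64-linux-clang"
      }
    }
  }
}
```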

QNNContextBinaryGenerator

Create QNN context binary from a QNN model library using a particular backend. Uses qnn-context-binary-generator tool from the QNN SDK.

Input: handler.qnn.QNNModelHandler

Output: handler.qnn.QNNModelHandler

backend

Path to a QNN backend .so library to create the context binary.

type: str

required: True

binary_file

Name of the binary file to save the context binary to. Saved in the same path as the --output_dir option with .bin as the binary file extension. If not provided, no backend binary is created.

type: str

default_value: None

searchable_values: None

extra_args

Extra arguments to qnn-context-binary-generator

type: str

default_value: None

searchable_values: None
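
A hedged sketch; the backend library path and binary file name are assumptions that depend on your QNN SDK installation.

```json
{
  "passes": {
    "qnn_ctx_bin": {
      "type": "QNNContextBinaryGenerator",
      "config": {
        "backend": "libQnnHtp.so",
        "binary_file": "model_ctx"
      }
    }
  }
}
```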

SparseGPT

Run SparseGPT on a Hugging Face PyTorch model. See https://arxiv.org/abs/2301.00774 for more details on the algorithm. This pass only supports PyTorchModelHandler with hf_config. The transformers model type must be one of [bloom, gpt2, gpt_neox, llama, opt].

Input: handler.pytorch.PyTorchModelHandler

Output: handler.pytorch.PyTorchModelHandler

sparsity

Target sparsity. This can be a float or a list of two integers. Float is the target sparsity per layer. List [n,m] applies semi-structured (n:m) sparsity patterns. Refer to https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/ for more details on 2:4 sparsity pattern.

type: float | List[int]

default_value: None

searchable_values: None

blocksize

Blocksize to use for adaptive mask selection.

type: int

default_value: 128

searchable_values: None

percdamp

Percentage of the average Hessian diagonal to use for dampening. Must be in [0,1].

type: float

default_value: 0.01

searchable_values: None

min_layer

Prune all layers with id >= min_layer.

type: int

default_value: None

searchable_values: None

max_layer

Prune all layers with id < max_layer.

type: int

default_value: None

searchable_values: None

layer_name_filter

Only prune layers whose name contains the given string(s).

type: str | List[str]

default_value: None

searchable_values: None

device

Device to use for performing computations. Can be ‘auto’, ‘cpu’, ‘cuda’, ‘cuda:0’, etc. If ‘auto’, cuda will be used if available. Does not affect the final model.

type: str

default_value: auto

searchable_values: None

data_config

Data config to use for pruning weights. All samples in the data are expected to be of the same length, most likely the max sequence length of the model.

type: olive.data.config.DataConfig | Dict

required: True
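
A hedged sketch applying a 2:4 semi-structured sparsity pattern (placeholder data config name):

```json
{
  "passes": {
    "sparsegpt": {
      "type": "SparseGPT",
      "config": {
        "sparsity": [2, 4],
        "blocksize": 128,
        "device": "auto",
        "data_config": "pruning_data"
      }
    }
  }
}
```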

SliceGPT

Run SliceGPT on a Hugging Face PyTorch model. See https://arxiv.org/pdf/2401.15024.pdf for more details on the algorithm. This pass only supports PyTorchModelHandler with hf_config.

Input: handler.pytorch.PyTorchModelHandler

Output: handler.pytorch.PyTorchModelHandler

calibration_data_config

Data config for Dataset to calibrate and calculate perplexity on.

type: olive.data.config.DataConfig | Dict

required: True

calibration_nsamples

Number of samples of the calibration data to load.

type: int

default_value: 128

searchable_values: None

calibration_batch_size

Batch size for loading the calibration data.

type: int

default_value: 16

searchable_values: None

calibration_max_seqlen

Maximum sequence length for the calibration data.

type: int

default_value: 2048

searchable_values: None

varied_seqlen

Whether to use varied sequence lengths in the calibration data.

type: bool

default_value: False

searchable_values: None

seed

Seed for sampling the calibration data.

type: int

default_value: 42

searchable_values: None

sparsity

A measure of how much slicing is applied (in the range [0, 1))

type: float

default_value: 0.0

searchable_values: None

round_interval

Interval for rounding the weights (the best value may depend on your hardware)

type: int

default_value: 8

searchable_values: None

final_orientation

Final orientation of the sliced weights. Choices are random or pca.

type: str

default_value: random

searchable_values: None
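
An illustrative sketch (placeholder calibration data config name):

```json
{
  "passes": {
    "slicegpt": {
      "type": "SliceGPT",
      "config": {
        "calibration_data_config": "calib_data",
        "sparsity": 0.25,
        "final_orientation": "pca"
      }
    }
  }
}
```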

GptqQuantizer

GPTQ quantization using Hugging Face Optimum, exporting the model with onnxruntime optimized kernels.

Input: handler.pytorch.PyTorchModelHandler

Output: handler.pytorch.PyTorchModelHandler

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

nsamples

Number of samples in the calibration dataset used for quantization. Default value is 128.

type: int

default_value: 128

searchable_values: None

bits

quantization bits. Default value is 4

type: int

default_value: 4

searchable_values: None

layers_block_name

Block name to quantize. Default value is model.layers. For models that cannot be auto-filled, refer to this link to fill these parameters: https://github.com/AutoGPTQ/AutoGPTQ/blob/896d8204bc89a7cfbda42bf3314e13cf4ce20b02/auto_gptq/modeling/llama.py#L19-L26

type: str

default_value: model.layers

searchable_values: None

outside_layer_modules

Names of other nn modules at the same level as the transformer layer block. Default value is None.

type: List[str]

default_value: None

searchable_values: None

inside_layer_modules

Names of linear layers in transformer layer module. Default value is None.

type: List[List[str]]

default_value: None

searchable_values: None

group_size

Block size for quantization. Default value is 128.

type: int

default_value: 128

searchable_values: None

batch_size

Batch size for quantization. Default value is 1.

type: int

default_value: 1

searchable_values: None

seed

Random seed for sampling calibration dataset. Default value is 0.

type: int

default_value: 0

searchable_values: None

damp_percent

Damping factor for quantization. Default value is 0.01.

type: float

default_value: 0.01

searchable_values: None

static_groups

Use static groups for quantization. Default value is False.

type: bool

default_value: False

searchable_values: None

true_sequential

Use true sequential for quantization. Default value is False.

type: bool

default_value: False

searchable_values: None

desc_act

Use descending activation order (act-order) for quantization. Default value is False.

type: bool

default_value: False

searchable_values: None

sym

Symmetric quantization. Default value is False.

type: bool

default_value: False

searchable_values: None

data_config

Data config for quantization. Default value is None.

type: olive.data.config.DataConfig | Dict

default_value: None

searchable_values: None

dataloader_func

Function/function name to generate the dataset for quantization. The returned dataset is a list of tokenized data (e.g. [{‘input_ids’: [1, 100, 15, …], ‘attention_mask’: [1, 1, 1, …]}, …]). Default value is None.

type: Callable | str

default_value: None

searchable_values: None

dataloader_func_kwargs

Keyword arguments for dataloader_func. Default value is None.

type: Dict[str, Any]

default_value: None

searchable_values: None
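
A hedged sketch of 4-bit GPTQ quantization with a placeholder data config name:

```json
{
  "passes": {
    "gptq": {
      "type": "GptqQuantizer",
      "config": {
        "bits": 4,
        "group_size": 128,
        "desc_act": false,
        "data_config": "calib_data"
      }
    }
  }
}
```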

TorchTRTConversion

Convert torch.nn.Linear modules in the transformer layers of a HuggingFace PyTorch model to TensorRT modules. The conversion would include fp16 precision and sparse weights, if applicable. The entire model is saved using torch.save and can be loaded using torch.load. Loading the model requires torch-tensorrt and Olive to be installed. This pass only supports PyTorchModelHandler with hf_config. The transformers model type must be one of [bloom, gpt2, gpt_neox, llama, opt].

Input: handler.pytorch.PyTorchModelHandler

Output: handler.pytorch.PyTorchModelHandler

min_layer

Convert all layers with id >= min_layer.

type: int

default_value: None

searchable_values: None

max_layer

Convert all layers with id < max_layer.

type: int

default_value: None

searchable_values: None

layer_name_filter

Only convert layers whose name contains the given string(s).

type: str | List[str]

default_value: None

searchable_values: None

float16

Convert entire model to fp16. If False, only the sparse modules are converted to fp16.

type: bool

default_value: False

searchable_values: None

data_config

Data config to use for compiling module to TensorRT. The batch size of the compiled module is set to the batch size of the first batch of the dataloader.

type: olive.data.config.DataConfig | Dict

required: True
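
An illustrative sketch (placeholder data config name):

```json
{
  "passes": {
    "torch_trt": {
      "type": "TorchTRTConversion",
      "config": {
        "float16": true,
        "data_config": "calib_data"
      }
    }
  }
}
```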

OptimumConversion

Convert a Hugging Face PyTorch model to ONNX model using the Optimum export function.

Input: handler.pytorch.PyTorchModelHandler

Output: handler.onnx.ONNXModelHandler | handler.composite.CompositeModelHandler

script_dir

Directory containing user script dependencies.

type: str

default_value: None

searchable_values: None

user_script

Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.

type: str

default_value: None

searchable_values: None

target_opset

The version of the default (ai.onnx) opset to target.

type: int

default_value: 14

searchable_values: None

components

List of component models to export. E.g. [‘decoder_model’, ‘decoder_with_past_model’]. None means export all components.

type: List[str]

default_value: None

searchable_values: None

fp16

Whether to load the torch model in fp16 precision and then convert it to ONNX.

type: bool

default_value: False

searchable_values: None

device

The device to use for the export. Defaults to ‘cpu’.

type: str

default_value: cpu

searchable_values: None

extra_args

Extra arguments to pass to the optimum.exporters.onnx.main_export function.

type: dict

default_value: None

searchable_values: None
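
A minimal sketch of an OptimumConversion pass entry under the same assumed layout. Keys in extra_args are forwarded to optimum.exporters.onnx.main_export; the specific key shown is illustrative, not prescribed by this reference.

    optimum_conversion_pass = {
        "type": "OptimumConversion",
        "config": {
            "target_opset": 14,
            "components": ["decoder_model", "decoder_with_past_model"],
            "fp16": False,
            "device": "cpu",
            "extra_args": {"no_post_process": True},  # illustrative extra argument
        },
    }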

OptimumMerging

Merges a decoder_model with its decoder_with_past_model via the Optimum library.

Input: handler.composite.CompositeModelHandler

Output: handler.onnx.ONNXModelHandler | handler.composite.CompositeModelHandler

strict

When True, the decoder and decoder_with_past are expected to have strictly the same number of outputs. When False, the decoder is allowed to have more outputs than decoder_with_past, in which case constant outputs are added to match the number of outputs.

type: bool

default_value: True

searchable_values: None

save_as_external_data

Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.

type: bool

default_value: False

searchable_values: None

all_tensors_to_one_file

Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.

type: bool

default_value: True

searchable_values: None

external_data_name

Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data

type: str

default_value: None

searchable_values: None

size_threshold

Effective only if save_as_external_data is True. Threshold for the size of tensor data. A tensor is converted to external data only when its data is >= size_threshold. To convert every tensor with raw data to external data, set size_threshold=0.

type: int

default_value: 1024

searchable_values: None

convert_attribute

Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.

type: bool

default_value: False

searchable_values: None
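
A minimal sketch of an OptimumMerging pass entry with relaxed output matching and external-data serialization; the external data file name is a placeholder and the layout assumptions are the same as above.

    optimum_merging_pass = {
        "type": "OptimumMerging",
        "config": {
            "strict": False,                # allow the decoder to have more outputs than decoder_with_past
            "save_as_external_data": True,  # keep tensor data out of the .onnx file
            "all_tensors_to_one_file": True,
            "external_data_name": "merged_model.onnx.data",  # placeholder file name
        },
    }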

ModelBuilder

Converts a Hugging Face generative PyTorch model to an ONNX model using the Generative AI Model Builder. See https://github.com/microsoft/onnxruntime-genai

Input: handler.pytorch.PyTorchModelHandler | handler.onnx.ONNXModelHandler

Output: handler.onnx.ONNXModelHandler

precision

Precision of model.

type: olive.passes.onnx.model_builder.ModelBuilder.Precision

required: True

metadata_only

Whether to export the model or generate required metadata only.

type: bool

default_value: False

searchable_values: None

search

Search options to use for the generate loop.

type: Dict[str, Any]

default_value: None

searchable_values: None

int4_block_size

Specify the block_size for int4 quantization. Acceptable values: 16/32/64/128/256.

type: int

default_value: None

searchable_values: None

int4_accuracy_level

Specify the minimum accuracy level for activation of MatMul in int4 quantization.

type: olive.passes.onnx.model_builder.ModelBuilder.AccuracyLevel

default_value: None

searchable_values: None

exclude_embeds

Remove embedding layer from your ONNX model.

type: bool

default_value: False

searchable_values: None

exclude_lm_head

Remove language modeling head from your ONNX model.

type: bool

default_value: False

searchable_values: None

enable_cuda_graph

Whether the model can use CUDA graph capture with the CUDA execution provider. If enabled, all nodes must be placed on the CUDA EP for the CUDA graph to be used correctly.

type: bool

default_value: False

searchable_values: None
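
A minimal sketch of a ModelBuilder pass entry that produces an int4 model, under the same assumed layout. The search keys are assumptions about onnxruntime-genai generation options rather than values taken from this reference.

    model_builder_pass = {
        "type": "ModelBuilder",
        "config": {
            "precision": "int4",    # required; a ModelBuilder.Precision value
            "int4_block_size": 32,  # one of 16/32/64/128/256
            "exclude_embeds": False,
            "enable_cuda_graph": False,
            "search": {"max_length": 2048},  # assumed search option key
        },
    }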