Passes
The following passes are available in Olive.
Each pass is followed by a description of the pass and a list of the pass’s configuration options.
OnnxConversion
Convert a PyTorch model to an ONNX model using torch.onnx.export on CPU.
Input: handler.hf.DistributedHfModelHandler | handler.hf.HfModelHandler | handler.pytorch.PyTorchModelHandler
Output: handler.onnx.DistributedOnnxModelHandler | handler.onnx.ONNXModelHandler
- user_script
Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.
type: pathlib.Path | str
default_value: None
search_defaults: None
- script_dir
Directory containing user script dependencies.
type: pathlib.Path | str
default_value: None
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
- target_opset
The version of the default (ai.onnx) opset to target.
type: int
default_value: 14
search_defaults: None
- use_dynamo_exporter
Whether to use dynamo_export API to export ONNX model.
type: bool
default_value: False
search_defaults: None
- past_key_value_name
The name of the argument that holds the past key values. For models loaded from Hugging Face, it is ‘past_key_values’. It is used only when use_dynamo_exporter is True.
type: str
default_value: past_key_values
search_defaults: None
- device
The device to use for conversion, e.g., ‘cuda’ or ‘cpu’. If not specified, will use ‘cpu’ for PyTorch model and ‘cuda’ for DistributedHfModel.
type: str
default_value: None
search_defaults: None
- torch_dtype
The dtype to cast the model to before conversion, e.g., ‘float32’ or ‘float16’. If not specified, will use the model as is.
type: str
default_value: None
search_defaults: None
- parallel_jobs
Number of parallel jobs. Defaults to the number of CPUs. Set it to 0 to disable.
type: int
default_value: None
search_defaults: None
- merge_adapter_weights
Whether to merge adapter weights before conversion. After merging, the model structure is consistent with the base model. This is useful if conversion fails for some fine-tuned models with adapter weights.
type: bool
default_value: False
search_defaults: None
- save_metadata_for_token_generation
Whether to save metadata for token generation or not. Includes config.json, generation_config.json, and tokenizer related files.
type: bool
default_value: False
search_defaults: None
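As a rough illustration, the OnnxConversion options above could be collected into a pass entry like the one below. This is a minimal sketch: the surrounding layout (a `passes` mapping, the label `conversion`, and options placed directly in the entry rather than under a nested key) is an assumption that may differ between Olive versions, and the external data file name is hypothetical. Only the pass name and the option names come from the list above.

```python
import json

# Hypothetical pass entry; "conversion" is an arbitrary label chosen for this sketch.
onnx_conversion = {
    "type": "OnnxConversion",
    "target_opset": 14,                       # documented default
    "save_as_external_data": True,
    "all_tensors_to_one_file": True,          # documented default
    "external_data_name": "model.onnx.data",  # hypothetical file name
    "torch_dtype": "float16",
    "device": "cuda",
}

# Assumed layout of the enclosing workflow fragment.
workflow_fragment = {"passes": {"conversion": onnx_conversion}}

print(json.dumps(workflow_fragment, indent=2))
```

Building the entry as a plain dictionary keeps it easy to serialize to JSON and to compare against the documented defaults.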
OnnxOpVersionConversion
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- target_opset
The version of the default (ai.onnx) opset to target. Default: latest opset version.
type: int
default_value: 22
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
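The external data options recur in most passes, so it can help to see how they interact. The sketch below encodes only the rules stated in the descriptions above. It is not Olive's or ONNX's implementation: the function name `plan_external_data` and its `tensor_sizes` input are hypothetical, sizes are assumed to be in the same unit as size_threshold, and neither the forced external data for models over 2GB nor convert_attribute is modeled.

```python
from typing import Dict, Optional


def plan_external_data(
    tensor_sizes: Dict[str, int],
    model_path_name: str,
    save_as_external_data: bool = False,
    all_tensors_to_one_file: bool = True,
    external_data_name: Optional[str] = None,
    size_threshold: int = 1024,
) -> Dict[str, Optional[str]]:
    """Map each tensor name to its external file, or None if it stays in the ONNX file."""
    plan: Dict[str, Optional[str]] = {}
    for name, size in tensor_sizes.items():
        if not save_as_external_data or size < size_threshold:
            # Other options are "effective only if save_as_external_data is True";
            # tensors below the threshold stay inline.
            plan[name] = None
        elif all_tensors_to_one_file:
            # One shared file; falls back to <model_path_name>.data when unnamed.
            plan[name] = external_data_name or f"{model_path_name}.data"
        else:
            # One file per tensor, named with the tensor name.
            plan[name] = name
    return plan


sizes = {"embed.weight": 4096, "bias": 16}  # hypothetical tensors
print(plan_external_data(sizes, "model.onnx", save_as_external_data=True))
print(plan_external_data(sizes, "model.onnx", save_as_external_data=True, size_threshold=0))
```

With the default threshold only the larger tensor is externalized; with size_threshold=0 both are.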
OnnxModelOptimizer
Optimize ONNX model by fusing nodes.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
OrtTransformersOptimization
Use ONNX Transformer Optimizer to optimize transformer based models. Optimize transformer based models in scenarios where ONNX Runtime does not apply the optimization at load time. It is based on onnxruntime.transformers.optimizer.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- model_type
Transformer based model type, including bert (exported by PyTorch), gpt2 (exported by PyTorch), bert_tf (BERT exported by tf2onnx), bert_keras (BERT exported by keras2onnx), and unet/vae/clip (stable diffusion).
type: str
default_value: None
search_defaults: None
- num_heads
Number of attention heads.
type: int
default_value: 0
search_defaults: None
- num_key_value_heads
Number of key/value attention heads.
type: int
default_value: 0
search_defaults: None
Number of hidden nodes.
type: int
default_value: 0
search_defaults: None
- optimization_options
Optimization options that turn on/off some fusions.
type: Dict[str, Any] | onnxruntime.transformers.fusion_options.FusionOptions
default_value: None
search_defaults: None
- opt_level
Graph optimization level of ONNX Runtime: 0 - disable all (default), 1 - basic, 2 - extended, 99 - all.
type: int
default_value: None
search_defaults: None
- use_gpu
Flag for GPU inference.
type: bool
default_value: False
search_defaults: None
- only_onnxruntime
Whether to use only onnxruntime to optimize the model, with no Python fusion. When opt_level > 1, this disables some optimizers that might cause failures in symbolic shape inference or attention fusion.
type: bool
default_value: False
search_defaults: None
- float16
Whether half-precision float will be used.
type: bool
default_value: False
search_defaults: None
- keep_io_types
Keep input and output tensors in their original data type. Only used when float16 is True.
type: bool
default_value: True
search_defaults: None
- force_fp32_ops
Operators that are forced to run in float32. Only used when float16 is True.
type: List[str]
default_value: None
search_defaults: None
- force_fp32_nodes
Nodes that are forced to run in float32. Only used when float16 is True.
type: List[str]
default_value: None
search_defaults: None
- force_fp16_inputs
Force the conversion of the inputs of some operators to float16, even if the ‘convert_float_to_float16’ tool prefers to keep them in float32.
type: Dict[str, List[int]]
default_value: None
search_defaults: None
- use_gqa
Replace MultiHeadAttention with GroupQueryAttention. True is only supported when float16 is True.
type: bool
default_value: False
search_defaults: None
- input_int32
Whether int32 tensors will be used as input.
type: bool
default_value: False
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
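Several OrtTransformersOptimization options depend on float16. The sketch below checks a set of options against the dependencies stated in the descriptions above. The helper `check_transformers_options` is hypothetical and is not part of Olive; it only restates the documented constraints as code.

```python
from typing import Any, Dict, List

# Options documented above as "Only used when float16 is True".
FLOAT16_ONLY = ("keep_io_types", "force_fp32_ops", "force_fp32_nodes")


def check_transformers_options(options: Dict[str, Any]) -> List[str]:
    """Return human-readable notes about option combinations that have no effect."""
    notes: List[str] = []
    float16 = options.get("float16", False)  # documented default: False
    if options.get("use_gqa", False) and not float16:
        notes.append("use_gqa=True is only supported when float16 is True")
    if not float16:
        for key in FLOAT16_ONLY:
            if key in options:
                notes.append(f"{key} is only used when float16 is True")
    return notes


# Hypothetical option set for a BERT-style model.
options = {
    "model_type": "bert",
    "num_heads": 12,
    "use_gqa": True,
    "force_fp32_ops": ["Softmax"],
}
for note in check_transformers_options(options):
    print(note)
```

Adding `"float16": True` to the same dictionary produces no notes.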
OrtSessionParamsTuning
Optimize ONNX Runtime inference settings.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- user_script
Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.
type: pathlib.Path | str
default_value: None
search_defaults: None
- script_dir
Directory containing user script dependencies.
type: pathlib.Path | str
default_value: None
search_defaults: None
- data_config
Data config to load data for computing latency.
type: olive.data.config.DataConfig | Dict
default_value: None
search_defaults: None
- device
Device selected for tuning process.
type: str
default_value: cpu
search_defaults: None
- cpu_cores
CPU cores used for thread tuning.
type: int
default_value: None
search_defaults: None
- io_bind
Whether to enable IOBinding search for ONNX Runtime inference.
type: bool
default_value: False
search_defaults: None
- enable_cuda_graph
Whether to enable CUDA Graph for the CUDA execution provider.
type: bool
default_value: False
search_defaults: None
- providers_list
Execution providers framework list to execute the ONNX models.
type: list
default_value: [‘CPUExecutionProvider’]
search_defaults: None
- execution_mode_list
Parallelism list between operators.
type: list
default_value: None
search_defaults: None
- opt_level_list
Optimization level list for ONNX model.
type: list
default_value: None
search_defaults: None
- trt_fp16_enable
Whether to enable FP16 mode for the TensorRT execution provider.
type: bool
default_value: False
search_defaults: None
- intra_thread_num_list
List of intra thread number for test.
type: list
default_value: [None]
search_defaults: None
- inter_thread_num_list
List of inter thread number for test.
type: list
default_value: [None]
search_defaults: None
- extra_session_config
Extra customized session options during tuning process.
type: Dict[str, Any]
default_value: None
search_defaults: None
- force_evaluate_other_eps
Whether to force evaluation of all execution providers that differ from the associated execution provider.
type: bool
default_value: False
search_defaults: None
- enable_profiling
Whether to enable profiling for ONNX Runtime inference.
type: bool
default_value: False
search_defaults: None
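One way to picture the list-valued tuning options is as axes of a grid of candidate session settings. The sketch below only enumerates the Cartesian product of such lists. It does not reflect how Olive actually schedules, prunes, or measures candidates, and the values chosen for the execution mode and optimization level axes are hypothetical.

```python
from itertools import product

# Axes of the illustrative grid. Only the provider list and the [None] thread
# lists mirror documented defaults; the other values are hypothetical.
search_axes = {
    "provider": ["CPUExecutionProvider"],
    "execution_mode": ["sequential", "parallel"],  # hypothetical labels
    "opt_level": [0, 99],                          # hypothetical choices
    "intra_op_threads": [None],
    "inter_op_threads": [None],
}

# Every combination of one value per axis becomes one candidate setting.
candidates = [
    dict(zip(search_axes, combo)) for combo in product(*search_axes.values())
]

for candidate in candidates:
    print(candidate)
```

The number of candidates is the product of the list lengths (here 1 x 2 x 2 x 1 x 1 = 4), which is why long lists on several axes can make tuning slow.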
OnnxFloatToFloat16
Converts a model to float16. It uses the float16 converter from onnxruntime to convert the model to float16.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- min_positive_val
Constant values will be clipped against this value
type: float
default_value: 1e-07
search_defaults: None
- max_finite_val
Constant values will be clipped against this value
type: float
default_value: 10000.0
search_defaults: None
- keep_io_types
Whether model inputs/outputs should be left as float32
type: bool
default_value: False
search_defaults: None
- use_symbolic_shape_infer
Use symbolic shape inference instead of onnx shape inference. Defaults to True.
type: bool
default_value: True
search_defaults: None
- op_block_list
List of op types to leave as float32
type: List[str]
default_value: None
search_defaults: None
- node_block_list
List of node names to leave as float32
type: List[str]
default_value: None
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
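The min_positive_val and max_finite_val options are both described as clipping bounds for constants. The sketch below shows one reading of that description: nonzero magnitudes below min_positive_val are raised to it, magnitudes above max_finite_val are lowered to it, and the sign is kept. This interpretation and the helper name `clip_constant` are assumptions; the sketch works on plain Python floats and is not the converter's code.

```python
import math


def clip_constant(
    value: float,
    min_positive_val: float = 1e-7,  # documented default
    max_finite_val: float = 1e4,     # documented default
) -> float:
    """Clamp the magnitude of a constant into [min_positive_val, max_finite_val]."""
    if value == 0.0 or math.isnan(value):
        # Zeros and NaNs are left untouched in this sketch.
        return value
    magnitude = min(max(abs(value), min_positive_val), max_finite_val)
    # Restore the original sign.
    return math.copysign(magnitude, value)


for v in (0.0, 1e-9, -1e-9, 0.5, 123456.0, float("-inf")):
    print(v, "->", clip_constant(v))
```

Clamping tiny values away from zero avoids underflow to zero in half precision, and capping large values avoids overflow to infinity.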
OnnxIOFloat16ToFloat32
Converts float16 model inputs/outputs to float32.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- name_pattern
Only convert inputs/outputs whose name matches this pattern. By default, it looks for logits names.
type: str
default_value: logits
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
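To make the effect of name_pattern concrete, the sketch below filters a list of input/output names. The description above does not say how the pattern is matched; treating it as a regular expression searched anywhere in the name is an assumption, as are the helper name `select_io_names` and the example tensor names.

```python
import re
from typing import Iterable, List


def select_io_names(names: Iterable[str], name_pattern: str = "logits") -> List[str]:
    """Return the names that contain a match for name_pattern (assumed regex search)."""
    pattern = re.compile(name_pattern)
    return [name for name in names if pattern.search(name)]


# Hypothetical graph input/output names.
io_names = ["input_ids", "attention_mask", "logits", "start_logits"]
print(select_io_names(io_names))
```

For the default value `logits`, a plain substring check would select the same names, so the assumption only matters for patterns that use regex syntax.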
OrtMixedPrecision
Convert model to mixed precision.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- op_block_list
List of op types to leave as float32
type: List[str]
default_value: [‘SimplifiedLayerNormalization’, ‘SkipSimplifiedLayerNormalization’, ‘Relu’, ‘Add’]
search_defaults: None
- atol
Absolute tolerance for checking float16 conversion
type: float
default_value: 1e-06
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
QNNPreprocess
Preprocess ONNX model for quantization targeting QNN Execution Provider.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- fuse_layernorm
Whether to fuse ReduceMean sequence into a single LayerNormalization node.
type: bool
default_value: False
search_defaults: None
- inputs_to_make_channel_last
List of graph input names to transpose to be “channel-last”. For example, if “input0” originally has the shape (N, C, D1, D2, …, Dn), the resulting model will change input0’s shape to (N, D1, D2, …, Dn, C) and add a transpose node after it.
Original: input0 (N, C, D1, D2, …, Dn) –> <Nodes>
Updated: input0 (N, D1, D2, …, Dn, C) –> Transpose –> input0_chanfirst (N, C, D1, D2, …, Dn) –> <Nodes>
This can potentially improve inference latency for QDQ models running on QNN EP because the additional transpose node may allow other transpose nodes inserted during ORT layout transformation to cancel out.
type: list
default_value: None
search_defaults: None
- outputs_to_make_channel_last
List of graph output names to transpose to be “channel-last”. For example, if “output0” originally has the shape (N, C, D1, D2, …, Dn), the resulting model will change output0’s shape to (N, D1, D2, …, Dn, C) and add a transpose node before it.
Original: <Nodes> –> output0 (N, C, D1, D2, …, Dn)
Updated: <Nodes> –> output0_chanfirst (N, C, D1, D2, …, Dn) –> Transpose –> output0 (N, D1, D2, …, Dn, C)
This can potentially improve inference latency for QDQ models running on QNN EP because the additional transpose node may allow other transpose nodes inserted during ORT layout transformation to cancel out.
type: list
default_value: None
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
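The shape changes described for inputs_to_make_channel_last and outputs_to_make_channel_last can be worked through with plain tuples. The sketch below computes the channel-last shape and the permutation a Transpose node would need in each direction. The helper names are hypothetical, and the code only illustrates the shape arithmetic, not the pass itself.

```python
from typing import List, Sequence, Tuple


def to_channel_last(shape: Sequence[int]) -> Tuple[int, ...]:
    """(N, C, D1, ..., Dn) -> (N, D1, ..., Dn, C)."""
    n, c, *spatial = shape
    return (n, *spatial, c)


def perm_last_to_first(rank: int) -> List[int]:
    """Transpose perm taking (N, D1, ..., Dn, C) back to (N, C, D1, ..., Dn)."""
    return [0, rank - 1, *range(1, rank - 1)]


def perm_first_to_last(rank: int) -> List[int]:
    """Transpose perm taking (N, C, D1, ..., Dn) to (N, D1, ..., Dn, C)."""
    return [0, *range(2, rank), 1]


def apply_perm(shape: Sequence[int], perm: Sequence[int]) -> Tuple[int, ...]:
    """Output dimension i of a transpose is input dimension perm[i]."""
    return tuple(shape[i] for i in perm)


original = (1, 3, 224, 224)               # (N, C, H, W), a hypothetical input0
channel_last = to_channel_last(original)  # new shape of the graph input
print(channel_last)
# The transpose added after the input restores the channel-first layout.
print(apply_perm(channel_last, perm_last_to_first(4)))
# For an output, the added transpose goes from channel-first to channel-last.
print(apply_perm(original, perm_first_to_last(4)))
```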
MixedPrecisionOverrides
QNN mixed precision overrides pass. Pre-processes the model for mixed precision quantization by resolving the constraints that each operator has when it is converted to a QNN operator. Constraints refer to situations where a tensor cannot be quantized to 16 bits on its own; its neighboring tensors must be quantized as well for the operators to remain valid. A specific problem arises when one tensor is an input to multiple nodes and each node requires a different precision. NOTE: This pass handles only initializer tensors, as activation tensors are handled by onnxruntime.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- overrides_config
Path/Dict to mixed precision overrides json, with the format of {tensor_name: quant_type}
type: str | Dict
required: True
- element_wise_binary_ops
List of element wise binary ops, if not provided defaults to [‘Add’, ‘Sub’, ‘Mul’, ‘Div’]
type: list
default_value: None
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
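The overrides_config option takes either a dictionary or a path to a JSON file in the {tensor_name: quant_type} format. The sketch below shows both forms. The tensor names, the quant type strings, and the layout of the pass entry are assumptions made for illustration; only the option name and the mapping format come from the description above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical tensor names and quant type strings.
overrides = {
    "encoder.layer0.weight": "QUInt16",
    "encoder.layer0.bias": "QUInt8",
}

# Form 1: pass the mapping directly as a dictionary.
pass_with_dict = {"type": "MixedPrecisionOverrides", "overrides_config": overrides}

# Form 2: write the mapping to a JSON file and pass its path.
with tempfile.TemporaryDirectory() as tmp_dir:
    overrides_path = Path(tmp_dir) / "overrides.json"
    overrides_path.write_text(json.dumps(overrides, indent=2))
    pass_with_path = {
        "type": "MixedPrecisionOverrides",
        "overrides_config": str(overrides_path),
    }
    # Reading the file back yields the same mapping.
    loaded = json.loads(overrides_path.read_text())

print(loaded)
```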
OnnxDynamicQuantization
ONNX Dynamic Quantization Pass.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- quant_mode
dynamic quantization mode
type: str
default_value: dynamic
search_defaults: None
- weight_type
Data type for quantizing weights which is used both in dynamic and static quantization. ‘QInt8’ for signed 8-bit integer, ‘QUInt8’ for unsigned 8-bit integer.
type: str
default_value: QInt8
search_defaults: Categorical([‘QInt8’, ‘QUInt8’])
- op_types_to_quantize
List of operator types to quantize. If None, all quantizable.
type: list
default_value: None
search_defaults: None
- append_first_op_types_to_quantize_list
If True, append operator types which firstly appear in the model to op_types_to_quantize.
type: bool
default_value: False
search_defaults: None
- nodes_to_quantize
List of node names to quantize. If None, all quantizable.
type: list
default_value: None
search_defaults: None
- nodes_to_exclude
List of node names to exclude from quantization. If None, all quantizable.
type: list
default_value: None
search_defaults: None
- per_channel
Quantize weights per channel. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization
type: bool
default_value: False
search_defaults: Categorical([True, False])
- reduce_range
Quantize weights with 7-bits. It may improve the accuracy for some models running on non-VNNI machine, especially for per-channel mode. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization
type: bool
default_value: False
search_defaults: Categorical([True, False])
- quant_preprocess
Shape inference and model optimization, in preparation for quantization. https://onnxruntime.ai/docs/performance/quantization.html#pre-processing
type: bool
default_value: True
search_defaults: Categorical([True, False])
- extra.Sigmoid.nnapi
type: bool
default_value: False
search_defaults: None
- ActivationSymmetric
symmetrize calibration data for activations
type: bool
default_value: False
search_defaults: None
- WeightSymmetric
symmetrize calibration data for weights
type: bool
default_value: True
search_defaults: None
- EnableSubgraph
If enabled, subgraphs will be quantized. Dynamic mode is currently supported.
type: bool
default_value: False
search_defaults: None
- ForceQuantizeNoInputCheck
By default, some latent operators, such as maxpool and transpose, are not quantized if their input is not already quantized. Set this to True to force such operators to always quantize their input and so generate quantized output. This behavior can still be disabled per node using nodes_to_exclude.
type: bool
default_value: False
search_defaults: None
- MatMulConstBOnly
If enabled, only MatMul with const B will be quantized.
type: bool
default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘dynamic’,): True, (‘static’,): False}, default: OLIVE_INVALID_PARAM_VALUE)
search_defaults: None
- extra_options
Key value pair dictionary for extra_options in quantization. Please refer to https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/quantize.py for details about the supported options. If an option is one of [‘extra.Sigmoid.nnapi’, ‘ActivationSymmetric’, ‘WeightSymmetric’, ‘EnableSubgraph’, ‘ForceQuantizeNoInputCheck’, ‘MatMulConstBOnly’], it will be overwritten by the corresponding config parameter value.
type: dict
default_value: None
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
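The extra_options description says that the six named options are overwritten by the corresponding config parameters, and MatMulConstBOnly has a default that depends on quant_mode. The sketch below combines both rules. It assumes the overwrite applies even when a parameter is left at its default, which is one reading of the description. The helper `effective_extra_options` and the key `SomeOtherOption` are hypothetical.

```python
from typing import Any, Dict


def effective_extra_options(config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge extra_options with the named parameters, letting the parameters win."""
    quant_mode = config.get("quant_mode", "dynamic")
    if quant_mode not in ("dynamic", "static"):
        raise ValueError(f"unsupported quant_mode: {quant_mode!r}")

    # Documented defaults of the named parameters.
    defaults = {
        "extra.Sigmoid.nnapi": False,
        "ActivationSymmetric": False,
        "WeightSymmetric": True,
        "EnableSubgraph": False,
        "ForceQuantizeNoInputCheck": False,
        # Conditional default: True for dynamic, False for static.
        "MatMulConstBOnly": quant_mode == "dynamic",
    }

    merged = dict(config.get("extra_options") or {})
    for key, default in defaults.items():
        # The named parameter (or its default) overwrites any extra_options entry.
        merged[key] = config.get(key, default)
    return merged


config = {
    "quant_mode": "dynamic",
    "WeightSymmetric": False,
    "extra_options": {"MatMulConstBOnly": False, "SomeOtherOption": 1},
}
print(effective_extra_options(config))
```

In this example the `MatMulConstBOnly: False` entry inside extra_options is replaced by the dynamic-mode default, while the unrelated key passes through unchanged.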
OnnxStaticQuantization
ONNX Static Quantization Pass.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- quant_mode
static quantization mode
type: str
default_value: static
search_defaults: None
- weight_type
Data type for quantizing weights which is used both in dynamic and static quantization. ‘QInt8’ for signed 8-bit integer, ‘QUInt8’ for unsigned 8-bit integer.
type: str
default_value: QInt8
search_defaults: Categorical([‘QInt8’, ‘QUInt8’])
- op_types_to_quantize
List of operator types to quantize. If None, all quantizable.
type: list
default_value: None
search_defaults: None
- append_first_op_types_to_quantize_list
If True, append operator types which firstly appear in the model to op_types_to_quantize.
type: bool
default_value: False
search_defaults: None
- nodes_to_quantize
List of node names to quantize. If None, all quantizable.
type: list
default_value: None
search_defaults: None
- nodes_to_exclude
List of node names to exclude from quantization. If None, all quantizable.
type: list
default_value: None
search_defaults: None
- per_channel
Quantize weights per channel. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization
type: bool
default_value: False
search_defaults: Categorical([True, False])
- reduce_range
Quantize weights with 7-bits. It may improve the accuracy for some models running on non-VNNI machine, especially for per-channel mode. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization
type: bool
default_value: False
search_defaults: Categorical([True, False])
- quant_preprocess
Shape inference and model optimization, in preparation for quantization. https://onnxruntime.ai/docs/performance/quantization.html#pre-processing
type: bool
default_value: True
search_defaults: Categorical([True, False])
- data_config
Data config for calibration, required if quant_mode is ‘static’
type: olive.data.config.DataConfig | Dict
default_value: None
search_defaults: None
- calibrate_method
The calibration methods currently supported are MinMax and Entropy. Use CalibrationMethod.MinMax or CalibrationMethod.Entropy as options. Percentile is not supported for onnxruntime==1.16.0, so avoid setting or searching it.
type: str
default_value: MinMax
search_defaults: Categorical([‘MinMax’, ‘Entropy’, ‘Percentile’])
- quant_format
QOperator format quantizes the model with quantized operators directly. QDQ format quantizes the model by inserting QuantizeLinear/DeQuantizeLinear on the tensor.
type: str
default_value: QDQ
search_defaults: Categorical([‘QOperator’, ‘QDQ’])
- activation_type
Quantization data type of activation. Please refer to https://onnxruntime.ai/docs/performance/quantization.html for more details on data type selection
type: str
default_value: QInt8
search_defaults: Conditional(parents: (‘quant_format’, ‘weight_type’), support: {(‘QDQ’, ‘QInt8’): Categorical([‘QInt8’]), (‘QDQ’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘QOperator’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘QOperator’, ‘QInt8’): Categorical([<SpecialParamValue.INVALID: ‘OLIVE_INVALID_PARAM_VALUE’>])}, default: Categorical([<SpecialParamValue.INVALID: ‘OLIVE_INVALID_PARAM_VALUE’>]))
- prepare_qnn_config
Whether to generate a suitable quantization config for the input model. Should be set to True if model is targeted for QNN EP.
type: bool
default_value: False
search_defaults: None
- qnn_extra_options
Extra options for QNN quantization. Please refer to onnxruntime.quantization.execution_providers.qnn.get_qnn_qdq_config. By default, the options are set to None. Options are only used if prepare_qnn_config is set to True. Available options are:
  - init_overrides: dict = None: Initial tensor-level quantization overrides. Defaults to None. The function updates a copy of these overrides with any necessary adjustments and includes them in the returned configuration object (i.e., config.extra_options[‘TensorQuantOverrides’]). The key is a tensor name and the value is a list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For per-channel quantization, the list contains either a dictionary for each channel in the tensor or a single dictionary that is assumed to apply to all channels. An ‘axis’ key must be present in the first dictionary for per-channel quantization. Each dictionary contains optional overrides with the following keys and values:
    ‘quant_type’ = QuantType : The tensor’s quantization data type.
    ‘axis’ = Int : The per-channel axis. Must be present for per-channel weights.
    ‘scale’ = Float : The scale value to use. Must also specify zero_point if set.
    ‘zero_point’ = Int : The zero-point value to use. Must also specify scale if set.
    ‘symmetric’ = Bool : If the tensor should use symmetric quantization. Invalid if scale or zero_point is also set.
    ‘reduce_range’ = Bool : If the quantization range should be reduced. Invalid if scale or zero_point is also set. Only valid for initializers.
    ‘rmax’ = Float : Override the maximum real tensor value in calibration data. Invalid if scale or zero_point is also set.
    ‘rmin’ = Float : Override the minimum real tensor value in calibration data. Invalid if scale or zero_point is also set.
    ‘convert’ = Dict : A nested dictionary with the same keys for an activation tensor that should be converted to another quantization type.
    ‘convert’[“recv_nodes”] = Set : Set of node names that consume the converted activation; other nodes get the original type. If not specified, all consumer nodes are assumed to get the converted type.
  - add_qtype_converts: bool = True: True if the function should automatically add “convert” entries to the provided init_overrides to ensure that operators use valid input/output types (activations only). Example: if you override the output of an Add to 16-bit, this option ensures that the activation inputs of the Add are also up-converted to 16-bit and that data types for surrounding ops are converted appropriately. Refer to the documentation in mixed_precision_overrides_utils.py for additional details.
Note that these options might change in future versions of onnxruntime.
type: dict
default_value: None
search_defaults: None
- extra.Sigmoid.nnapi
type: bool
default_value: False
search_defaults: None
- ActivationSymmetric
symmetrize calibration data for activations
type: bool
default_value: False
search_defaults: None
- WeightSymmetric
symmetrize calibration data for weights
type: bool
default_value: True
search_defaults: None
- EnableSubgraph
If enabled, subgraphs will be quantized. Dynamic mode is currently supported.
type: bool
default_value: False
search_defaults: None
- ForceQuantizeNoInputCheck
By default, some latent operators such as MaxPool and Transpose are not quantized if their input is not already quantized. Set to True to force such operators to always quantize their input and so generate quantized output. The True behavior can be disabled per node using nodes_to_exclude.
type: bool
default_value: False
search_defaults: None
- MatMulConstBOnly
If enabled, only MatMul with const B will be quantized.
type: bool
default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘dynamic’,): True, (‘static’,): False}, default: OLIVE_INVALID_PARAM_VALUE)
search_defaults: None
- extra_options
Key value pair dictionary for extra_options in quantization. Please refer to https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/quantize.py for details about the supported options. If an option is one of [‘extra.Sigmoid.nnapi’, ‘ActivationSymmetric’, ‘WeightSymmetric’, ‘EnableSubgraph’, ‘ForceQuantizeNoInputCheck’, ‘MatMulConstBOnly’], it will be overwritten by the corresponding config parameter value.
type: dict
default_value: None
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
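The init_overrides rules described under qnn_extra_options above can be checked with a few lines of plain Python. The following is a minimal sketch under stated assumptions: `validate_overrides` is a hypothetical helper written for illustration, not a function from Olive or onnxruntime, and it only encodes the constraints quoted in the description (scale and zero_point go together, range-related keys are invalid alongside them, and multi-dictionary per-channel overrides need an ‘axis’ in the first dictionary).

```python
# Sketch only: a plain-Python check of the override rules described above.
# `validate_overrides` is a hypothetical helper, not part of Olive or onnxruntime.

# Keys that the description marks as invalid when scale or zero_point is set.
RANGE_KEYS = {"symmetric", "reduce_range", "rmax", "rmin"}


def validate_overrides(init_overrides):
    """Return a list of human-readable problems found in init_overrides."""
    problems = []
    for tensor_name, entries in init_overrides.items():
        if not isinstance(entries, list) or not entries:
            problems.append(f"{tensor_name}: value must be a non-empty list of dicts")
            continue
        # Several dictionaries mean per-channel overrides: the first needs 'axis'.
        if len(entries) > 1 and "axis" not in entries[0]:
            problems.append(f"{tensor_name}: per-channel overrides need 'axis' in the first dict")
        for i, entry in enumerate(entries):
            has_scale = "scale" in entry
            has_zero_point = "zero_point" in entry
            if has_scale != has_zero_point:
                problems.append(f"{tensor_name}[{i}]: 'scale' and 'zero_point' must be set together")
            clash = RANGE_KEYS & entry.keys()
            if (has_scale or has_zero_point) and clash:
                problems.append(f"{tensor_name}[{i}]: {sorted(clash)} invalid with scale/zero_point")
    return problems


# Tensor names below are made up for the example.
qnn_extra_options = {
    "init_overrides": {
        "conv1.weight": [{"axis": 0, "symmetric": True}],  # per-channel, one dict for all channels
        "add_out": [{"scale": 0.02, "zero_point": 0}],     # per-tensor with explicit parameters
    },
    "add_qtype_converts": True,
}

bad_overrides = {"x": [{"scale": 0.1, "symmetric": True}]}

print(validate_overrides(qnn_extra_options["init_overrides"]))
print(validate_overrides(bad_overrides))
```

The real validation happens inside onnxruntime when the QNN config is generated; a pre-check like this only helps catch obvious mistakes before a long quantization run.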
OnnxQuantization
Quantize ONNX model with static/dynamic quantization techniques.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- quant_mode
Onnx Quantization mode. ‘dynamic’ for dynamic quantization, ‘static’ for static quantization.
type: str
default_value: static
search_defaults: Categorical([‘dynamic’, ‘static’])
- weight_type
Data type for quantizing weights which is used both in dynamic and static quantization. ‘QInt8’ for signed 8-bit integer, ‘QUInt8’ for unsigned 8-bit integer.
type: str
default_value: QInt8
search_defaults: Categorical([‘QInt8’, ‘QUInt8’])
- op_types_to_quantize
List of operator types to quantize. If None, all quantizable operator types are quantized.
type: list
default_value: None
search_defaults: None
- append_first_op_types_to_quantize_list
If True, append operator types which first appear in the model to op_types_to_quantize.
type: bool
default_value: False
search_defaults: None
- nodes_to_quantize
List of node names to quantize. If None, all quantizable nodes are quantized.
type: list
default_value: None
search_defaults: None
- nodes_to_exclude
List of node names to exclude from quantization. If None, no nodes are excluded.
type: list
default_value: None
search_defaults: None
- per_channel
Quantize weights per channel. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization
type: bool
default_value: False
search_defaults: Categorical([True, False])
- reduce_range
Quantize weights with 7 bits. It may improve accuracy for some models running on non-VNNI machines, especially in per-channel mode. Tips: When to use reduce_range and per-channel quantization: https://onnxruntime.ai/docs/performance/quantization.html#when-to-use-reduce-range-and-per-channel-quantization
type: bool
default_value: False
search_defaults: Categorical([True, False])
- quant_preprocess
Shape inference and model optimization, in preparation for quantization. https://onnxruntime.ai/docs/performance/quantization.html#pre-processing
type: bool
default_value: True
search_defaults: Categorical([True, False])
- data_config
Data config for calibration, required if quant_mode is ‘static’
type: olive.data.config.DataConfig | Dict
default_value: None
search_defaults: None
- calibrate_method
Calibration method. MinMax and Entropy are currently supported; use CalibrationMethod.MinMax or CalibrationMethod.Entropy as options. Percentile is not supported for onnxruntime==1.16.0, so avoid setting or searching it with that version.
type: str
default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘static’,): ‘MinMax’, (‘dynamic’,): <SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>}, default: OLIVE_INVALID_PARAM_VALUE)
search_defaults: Conditional(parents: (‘quant_mode’,), support: {(‘static’,): Categorical([‘MinMax’, ‘Entropy’, ‘Percentile’])}, default: Categorical([<SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>]))
- quant_format
QOperator format quantizes the model with quantized operators directly. QDQ format quantizes the model by inserting QuantizeLinear/DeQuantizeLinear on the tensor.
type: str
default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘static’,): ‘QDQ’, (‘dynamic’,): <SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>}, default: OLIVE_INVALID_PARAM_VALUE)
search_defaults: Conditional(parents: (‘quant_mode’,), support: {(‘static’,): Categorical([‘QOperator’, ‘QDQ’])}, default: Categorical([<SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>]))
- activation_type
Quantization data type of activation. Please refer to https://onnxruntime.ai/docs/performance/quantization.html for more details on data type selection
type: str
default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘static’,): ‘QInt8’, (‘dynamic’,): <SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>}, default: OLIVE_INVALID_PARAM_VALUE)
search_defaults: Conditional(parents: (‘quant_mode’, ‘quant_format’, ‘weight_type’), support: {(‘static’, ‘QDQ’, ‘QInt8’): Categorical([‘QInt8’]), (‘static’, ‘QDQ’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘static’, ‘QOperator’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘static’, ‘QOperator’, ‘QInt8’): Categorical([<SpecialParamValue.INVALID: ‘OLIVE_INVALID_PARAM_VALUE’>])}, default: Categorical([<SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>]))
- prepare_qnn_config
Whether to generate a suitable quantization config for the input model. Should be set to True if model is targeted for QNN EP.
type: bool
default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘static’,): False, (‘dynamic’,): <SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>}, default: OLIVE_INVALID_PARAM_VALUE)
search_defaults: None
- qnn_extra_options
Extra options for QNN quantization. Please refer to onnxruntime.quantization.execution_providers.qnn.get_qnn_qdq_config. By default, the options are set to None. Options are only used if prepare_qnn_config is set to True. Available options are:
  - init_overrides: dict = None: Initial tensor-level quantization overrides. Defaults to None. The function updates a copy of these overrides with any necessary adjustments and includes them in the returned configuration object (i.e., config.extra_options[‘TensorQuantOverrides’]). The key is a tensor name and the value is a list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For per-channel quantization, the list contains either a dictionary for each channel in the tensor or a single dictionary that is assumed to apply to all channels. An ‘axis’ key must be present in the first dictionary for per-channel quantization. Each dictionary contains optional overrides with the following keys and values:
    ‘quant_type’ = QuantType : The tensor’s quantization data type.
    ‘axis’ = Int : The per-channel axis. Must be present for per-channel weights.
    ‘scale’ = Float : The scale value to use. Must also specify zero_point if set.
    ‘zero_point’ = Int : The zero-point value to use. Must also specify scale if set.
    ‘symmetric’ = Bool : If the tensor should use symmetric quantization. Invalid if scale or zero_point is also set.
    ‘reduce_range’ = Bool : If the quantization range should be reduced. Invalid if scale or zero_point is also set. Only valid for initializers.
    ‘rmax’ = Float : Override the maximum real tensor value in calibration data. Invalid if scale or zero_point is also set.
    ‘rmin’ = Float : Override the minimum real tensor value in calibration data. Invalid if scale or zero_point is also set.
    ‘convert’ = Dict : A nested dictionary with the same keys for an activation tensor that should be converted to another quantization type.
    ‘convert’[“recv_nodes”] = Set : Set of node names that consume the converted activation; other nodes get the original type. If not specified, all consumer nodes are assumed to get the converted type.
  - add_qtype_converts: bool = True: True if the function should automatically add “convert” entries to the provided init_overrides to ensure that operators use valid input/output types (activations only). Example: if you override the output of an Add to 16-bit, this option ensures that the activation inputs of the Add are also up-converted to 16-bit and that data types for surrounding ops are converted appropriately. Refer to the documentation in mixed_precision_overrides_utils.py for additional details.
Note that these options might change in future versions of onnxruntime.
type: dict
default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘static’,): None, (‘dynamic’,): <SpecialParamValue.IGNORED: ‘OLIVE_IGNORED_PARAM_VALUE’>}, default: OLIVE_INVALID_PARAM_VALUE)
search_defaults: None
- extra.Sigmoid.nnapi
type: bool
default_value: False
search_defaults: None
- ActivationSymmetric
symmetrize calibration data for activations
type: bool
default_value: False
search_defaults: None
- WeightSymmetric
symmetrize calibration data for weights
type: bool
default_value: True
search_defaults: None
- EnableSubgraph
If enabled, subgraphs will be quantized. Dynamic mode is currently supported.
type: bool
default_value: False
search_defaults: None
- ForceQuantizeNoInputCheck
By default, some latent operators such as MaxPool and Transpose are not quantized if their input is not already quantized. Set to True to force such operators to always quantize their input and so generate quantized output. The True behavior can be disabled per node using nodes_to_exclude.
type: bool
default_value: False
search_defaults: None
- MatMulConstBOnly
If enabled, only MatMul with const B will be quantized.
type: bool
default_value: ConditionalDefault(parents: (‘quant_mode’,), support: {(‘dynamic’,): True, (‘static’,): False}, default: OLIVE_INVALID_PARAM_VALUE)
search_defaults: None
- extra_options
Key value pair dictionary for extra_options in quantization. Please refer to https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/quantize.py for details about the supported options. If an option is one of [‘extra.Sigmoid.nnapi’, ‘ActivationSymmetric’, ‘WeightSymmetric’, ‘EnableSubgraph’, ‘ForceQuantizeNoInputCheck’, ‘MatMulConstBOnly’], it will be overwritten by the corresponding config parameter value.
type: dict
default_value: None
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
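Several OnnxQuantization parameters above have a ConditionalDefault or Conditional search space keyed on quant_mode (and, for activation_type, also on quant_format and weight_type). The sketch below restates those tables as plain Python lookups to make the resolution rule concrete. It is an illustration only: `resolve_defaults` and `activation_choices` are hypothetical helpers, and the lookup logic is an assumption about how such tables are read, not Olive's implementation.

```python
# Sketch only: the conditional tables listed above, expressed as dictionary lookups.
# The helper names are hypothetical; the values are copied from the listing.

IGNORED = "OLIVE_IGNORED_PARAM_VALUE"
INVALID = "OLIVE_INVALID_PARAM_VALUE"

# default_value tables, keyed by quant_mode.
CONDITIONAL_DEFAULTS = {
    "calibrate_method": {"static": "MinMax", "dynamic": IGNORED},
    "quant_format": {"static": "QDQ", "dynamic": IGNORED},
    "activation_type": {"static": "QInt8", "dynamic": IGNORED},
    "prepare_qnn_config": {"static": False, "dynamic": IGNORED},
    "qnn_extra_options": {"static": None, "dynamic": IGNORED},
    "MatMulConstBOnly": {"static": False, "dynamic": True},
}

# search_defaults for activation_type, keyed by (quant_mode, quant_format, weight_type).
ACTIVATION_SEARCH = {
    ("static", "QDQ", "QInt8"): ["QInt8"],
    ("static", "QDQ", "QUInt8"): ["QUInt8"],
    ("static", "QOperator", "QUInt8"): ["QUInt8"],
    ("static", "QOperator", "QInt8"): [INVALID],
}


def resolve_defaults(quant_mode):
    """Pick each parameter's default for the given quant_mode (INVALID if unsupported)."""
    return {name: table.get(quant_mode, INVALID) for name, table in CONDITIONAL_DEFAULTS.items()}


def activation_choices(quant_mode, quant_format, weight_type):
    """Return the searchable activation types; unlisted combinations fall back to IGNORED."""
    return ACTIVATION_SEARCH.get((quant_mode, quant_format, weight_type), [IGNORED])


print(resolve_defaults("static"))
print(resolve_defaults("dynamic"))
print(activation_choices("static", "QOperator", "QInt8"))
```

Read this way, the tables say that static-only parameters are ignored in dynamic mode, and that the combination of QOperator format with QInt8 weights has no valid activation type to search.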
OnnxMatMul4Quantizer
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- block_size
Block size for quantization. Default value is 32.
type: int
default_value: 32
search_defaults: None
- is_symmetric
Symmetric quantization. Default value is True.
type: bool
default_value: True
search_defaults: None
- nodes_to_exclude
List of node names to exclude from quantization.
type: list
default_value: None
search_defaults: None
- accuracy_level
Available from onnxruntime>=1.17.0. The minimum accuracy level of input A. Can be 0 (unset), 1 (fp32), 2 (fp16), 3 (bf16), or 4 (int8); 0 or None means unset, which is the default. It controls how input A is quantized or downcast internally during computation. For example, 0 means input A will not be quantized or downcast during computation, and 4 means input A can be quantized internally to int8 with the same block_size from type T1. Refer to the MatMulNBits contrib op’s ‘accuracy_level’ attribute for details (https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftmatmulnbits).
type: int
default_value: None
search_defaults: None
- algorithm
If ‘None’, MatMul nodes with fp32 constant weights will be quantized to int4.
  1. ‘RTN’ and ‘GPTQ’ are available from onnxruntime>=1.17.0. They quantize a model to 4 bits with the RTN or GPTQ algorithm. Please refer to https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md for more details on weight-only quantization using Intel® Neural Compressor.
  2. ‘DEFAULT’ and ‘HQQ’ are available from onnxruntime>=1.18.0. ‘DEFAULT’ has the same effect as None. For HQQ, please refer to onnxruntime for more details: https://github.com/microsoft/onnxruntime/blob/7e613ee821405b1192d0b71b9434a4f94643f1e4/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py#L102C1-L126C25
type: str
default_value: None
search_defaults: None
- weight_only_quant_configs
Available from onnxruntime>=1.17.0. If None, the default behavior of the given algorithm will be used. The config is bound to the algorithm as follows:
  1. “algorithm” is “DEFAULT”. By default, the weight_only_quant_configs is:
     “weight_only_quant_configs”: { “block_size”: 128, “is_symmetric”: False, “accuracy_level”: None }
     https://github.com/microsoft/onnxruntime/blob/7e613ee821405b1192d0b71b9434a4f94643f1e4/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py#L129C1-L140C45
  2. “algorithm” is “HQQ”. By default, the weight_only_quant_configs is:
     “weight_only_quant_configs”: {
       “block_size”: 128, // channel number in one block to execute a GPTQ quantization iteration.
       “bits”: 4, // how many bits to represent weight.
       “axis”: 1, // 0 or 1. which axis to quantize. https://arxiv.org/pdf/2309.15531.pdf
     }
     https://github.com/microsoft/onnxruntime/blob/7e613ee821405b1192d0b71b9434a4f94643f1e4/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py#L129C1-L140C45
  3. “algorithm” is “RTN”. By default, the weight_only_quant_configs is:
     “weight_only_quant_configs”: {
       “ratios”: None, // type: dict, percentile of clip. Defaults to None.
     }
     https://github.com/microsoft/onnxruntime/blob/7e613ee821405b1192d0b71b9434a4f94643f1e4/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py#L42C1-L60C29
  4. “algorithm” is “GPTQ”. By default, the weight_only_quant_configs is:
     “weight_only_quant_configs”: {
       “percdamp”: 0.01, // percent of the average Hessian diagonal to use for dampening.
       “block_size”: 128,
       “actorder”: False, // whether to rearrange the Hessian matrix considering the diagonal’s value.
       “mse”: False, // whether to get scale and zero point with MSE error.
       “perchannel”: True, // whether to quantize weights per channel.
     }
     For GPTQ’s “calibration_data_reader”, you can provide a dataloader function or a data config, as is done for ONNX static quantization.
     https://github.com/microsoft/onnxruntime/blob/7e613ee821405b1192d0b71b9434a4f94643f1e4/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py#L63C1-L99C37
type: dict
default_value: None
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
- data_config
Data config for calibration, required if quant_mode is ‘static’
type: olive.data.config.DataConfig | Dict
default_value: None
search_defaults: None
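The per-algorithm defaults listed under weight_only_quant_configs above can be kept in one table and combined with user overrides. The sketch below is a hedged illustration: `effective_weight_only_config` is a hypothetical helper, and the idea that user-supplied keys simply override the listed defaults is an assumption made for the example, not a statement about how Olive or onnxruntime merge these values.

```python
# Sketch only: default weight_only_quant_configs per algorithm, as listed above.
# The merge rule (user keys override defaults) is an assumption for illustration.

ALGORITHM_DEFAULTS = {
    "DEFAULT": {"block_size": 128, "is_symmetric": False, "accuracy_level": None},
    "HQQ": {"block_size": 128, "bits": 4, "axis": 1},
    "RTN": {"ratios": None},
    "GPTQ": {
        "percdamp": 0.01,
        "block_size": 128,
        "actorder": False,
        "mse": False,
        "perchannel": True,
    },
}


def effective_weight_only_config(algorithm, weight_only_quant_configs=None):
    """Combine the listed defaults for `algorithm` with optional user overrides."""
    if algorithm not in ALGORITHM_DEFAULTS:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    merged = dict(ALGORITHM_DEFAULTS[algorithm])  # copy so the table stays untouched
    merged.update(weight_only_quant_configs or {})
    return merged


# Pass parameters as they might appear in a config; only names from the listing are used.
matmul4_params = {
    "block_size": 32,
    "is_symmetric": True,
    "algorithm": "GPTQ",
    "weight_only_quant_configs": {"block_size": 64},
}

print(effective_weight_only_config(
    matmul4_params["algorithm"], matmul4_params["weight_only_quant_configs"]
))
```

Note that GPTQ additionally needs calibration data, which is why the pass also accepts data_config.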
MatMulNBitsToQDQ
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- use_transpose_op
Whether to use a Transpose operator after the DequantizeLinear operator. If False, the weight initializer will be transposed instead. Default is False. True might be more efficient on some EPs such as DirectML.
type: bool
default_value: False
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
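The use_transpose_op description above offers two ways to obtain a transposed weight: add a Transpose after DequantizeLinear, or transpose the stored initializer beforehand. The toy below, in plain Python with a single per-tensor scale and zero point, shows why the two give the same values. It is only an illustration of the equivalence under that simplifying assumption; it does not reproduce the pass, and `dequantize` and `transpose` are helper names invented for the example.

```python
# Sketch only: dequantize-then-transpose equals transpose-then-dequantize
# when one scale and zero point apply to the whole tensor (a simplifying assumption).


def dequantize(q, scale, zero_point):
    """Element-wise (q - zero_point) * scale on a 2-D list."""
    return [[(v - zero_point) * scale for v in row] for row in q]


def transpose(m):
    """Transpose a 2-D list."""
    return [list(col) for col in zip(*m)]


q_weight = [[1, 2, 3], [4, 5, 6]]  # toy quantized weight, shape (2, 3)
scale, zero_point = 0.5, 2

# Analogous to use_transpose_op=True: Transpose runs after DequantizeLinear.
with_transpose_op = transpose(dequantize(q_weight, scale, zero_point))

# Analogous to use_transpose_op=False: the initializer is transposed ahead of time.
pre_transposed = dequantize(transpose(q_weight), scale, zero_point)

print(with_transpose_op == pre_transposed)
```

Since the numerical result is the same, the choice between the two is about graph structure and execution-provider efficiency, as the description notes for DirectML.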
DynamicToFixedShape
Convert dynamic shape to fixed shape for ONNX model.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- dim_param
Symbolic parameter name. Provide dim_value if specified.
type: List[str]
default_value: None
search_defaults: None
- dim_value
Value to replace dim_param with in the model. Must be > 0.
type: List[int]
default_value: None
search_defaults: None
- input_name
Model input name to replace shape of. Provide input_shape if specified.
type: List[str]
default_value: None
search_defaults: None
- input_shape
Shape to use for input_name. Provide a comma-separated list for the shape. All values must be > 0, e.g. [1,3,256,256].
type: List[List[int]]
default_value: None
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
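The DynamicToFixedShape parameters above come in pairs: dim_param with dim_value, and input_name with input_shape. The sketch below checks those pairing rules and shows the effect of a dim_param replacement on a toy shape. Both helpers are hypothetical and written only for illustration; the requirement that paired lists have equal length is an assumption, and the dimension and input names are made up.

```python
# Sketch only: pairing checks for the parameters above, plus a toy substitution.
# Helper names and the equal-length rule are assumptions, not Olive's implementation.


def check_fixed_shape_config(dim_param=None, dim_value=None, input_name=None, input_shape=None):
    """Return a list of problems with the given parameter combination."""
    problems = []
    if (dim_param is None) != (dim_value is None):
        problems.append("dim_param and dim_value must be provided together")
    elif dim_param is not None:
        if len(dim_param) != len(dim_value):
            problems.append("dim_param and dim_value must have the same length")
        if any(v <= 0 for v in dim_value):
            problems.append("every dim_value must be > 0")
    if (input_name is None) != (input_shape is None):
        problems.append("input_name and input_shape must be provided together")
    elif input_name is not None:
        if len(input_name) != len(input_shape):
            problems.append("input_name and input_shape must have the same length")
        if any(d <= 0 for shape in input_shape for d in shape):
            problems.append("every input_shape value must be > 0")
    return problems


def apply_dim_params(shape, dim_param, dim_value):
    """Replace symbolic dimensions in a shape with their fixed values."""
    mapping = dict(zip(dim_param, dim_value))
    return [mapping.get(d, d) for d in shape]


# Hypothetical symbolic names for a 4-D image input.
fixed = apply_dim_params(
    ["batch_size", 3, "height", "width"],
    dim_param=["batch_size", "height", "width"],
    dim_value=[1, 256, 256],
)
print(fixed)
print(check_fixed_shape_config(input_name=["pixel_values"], input_shape=[[1, 3, 256, 256]]))
```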
IncDynamicQuantization
Intel® Neural Compressor Dynamic Quantization Pass.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- approach
dynamic quantization mode
type: str
default_value: dynamic
search_defaults: None
- device
Intel® Neural Compressor quantization device. Supports ‘cpu’ and ‘gpu’.
type: str
default_value: cpu
search_defaults: None
- backend
Backend for model execution. Supports ‘default’, ‘onnxrt_trt_ep’ and ‘onnxrt_cuda_ep’.
type: str
default_value: default
search_defaults: None
- domain
Model domain. Supports ‘auto’, ‘cv’, ‘object_detection’, ‘nlp’ and ‘recommendation_system’. The Intel® Neural Compressor Adaptor automatically uses domain-specific quantization settings, and explicitly specified quantization settings override the automatic ones. If domain is set to ‘auto’, the domain is detected automatically.
type: str
default_value: auto
search_defaults: None
- workspace
Workspace for Intel® Neural Compressor quantization where intermediate files and tuning history file are stored. Default value is: “./nc_workspace/{}/”.format(datetime.datetime.now().strftime(“%Y-%m-%d_%H-%M-%S”))
type: str
default_value: None
search_defaults: None
- recipes
Recipes for Intel® Neural Compressor quantization. The supported recipes are:
  ‘smooth_quant’: whether to apply smooth quant.
  ‘smooth_quant_args’: parameters for smooth_quant.
  ‘fast_bias_correction’: whether to apply fast bias correction.
  ‘weight_correction’: whether to apply weight correction.
  ‘gemm_to_matmul’: whether to convert Gemm to MatMul and Add; only valid for ONNX models.
  ‘graph_optimization_level’: supports ‘DISABLE_ALL’, ‘ENABLE_BASIC’, ‘ENABLE_EXTENDED’ and ‘ENABLE_ALL’; only valid for ONNX models.
  ‘first_conv_or_matmul_quantization’: whether to quantize the first Conv or MatMul.
  ‘last_conv_or_matmul_quantization’: whether to quantize the last Conv or MatMul.
  ‘pre_post_process_quantization’: whether to quantize the ops in preprocessing and postprocessing.
  ‘add_qdq_pair_to_weight’: whether to add a QDQ pair for weights; only valid for onnxrt_trt_ep.
  ‘optypes_to_exclude_output_quant’: do not quantize the output of the specified op types.
  ‘dedicated_qdq_pair’: whether to use a dedicated QDQ pair; only valid for onnxrt_trt_ep.
type: dict
default_value: {}
search_defaults: None
- reduce_range
Whether to use 7 bits for quantization.
type: bool
default_value: False
search_defaults: Categorical([True, False])
- quant_level
Intel® Neural Compressor allows users to choose different tuning processes by specifying the quantization level (quant_level). Currently 3 quant_levels are supported: 0 is the conservative strategy, 1 is the basic or user-specified strategy, and auto (default) is a combination of 0 and 1. Please refer to https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-process and https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-algorithms for more details.
type: str
default_value: auto
search_defaults: None
- excluded_precisions
Precisions to be excluded. Default value is an empty list. Intel® Neural Compressor enables mixed precision with fp32 + bf16 (only when device is ‘gpu’ and backend is ‘onnxrt_cuda_ep’) + int8 by default. To disable the bf16 data type, specify excluded_precisions = [‘bf16’].
type: list
default_value: []
search_defaults: None
- tuning_criterion
Instance of TuningCriterion class. In this class you can set strategy, strategy_kwargs, timeout, max_trials and objective.
type: dict
default_value: {‘strategy’: ‘basic’, ‘strategy_kwargs’: None, ‘timeout’: 0, ‘max_trials’: 5, ‘objective’: ‘performance’}
search_defaults: None
- metric
Accuracy metric to generate an evaluation function for Intel® Neural Compressor accuracy aware tuning.
type: olive.evaluator.metric.Metric | None
default_value: None
search_defaults: None
- weight_only_config
INC weight only quantization config.
type: dict
default_value: {}
search_defaults: None
- op_type_dict
INC weight only quantization config.
type: dict
default_value: {}
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
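To make the IncDynamicQuantization parameters above more concrete, the sketch below assembles a parameter dictionary from the documented names and defaults, and reproduces the workspace default path format quoted in the workspace description. It is a minimal illustration: `default_workspace` is a hypothetical helper, the chosen recipe values are examples rather than recommendations, and how the dictionary is wrapped into a full Olive workflow config is deliberately left out.

```python
# Sketch only: parameter values taken from the listing above.
# `default_workspace` is a hypothetical helper mirroring the quoted default format.
import datetime


def default_workspace(now=None):
    """Build the workspace path in the format given in the workspace description."""
    now = now or datetime.datetime.now()
    return "./nc_workspace/{}/".format(now.strftime("%Y-%m-%d_%H-%M-%S"))


inc_dynamic_params = {
    "approach": "dynamic",
    "device": "cpu",
    "backend": "default",
    "domain": "auto",
    "workspace": default_workspace(),
    "quant_level": "auto",
    # Exclude bf16 from the mixed-precision set, as described under excluded_precisions.
    "excluded_precisions": ["bf16"],
    # Documented default tuning criterion.
    "tuning_criterion": {
        "strategy": "basic",
        "strategy_kwargs": None,
        "timeout": 0,
        "max_trials": 5,
        "objective": "performance",
    },
    # Example recipe choices; keys come from the recipes list above.
    "recipes": {"gemm_to_matmul": True, "graph_optimization_level": "ENABLE_BASIC"},
}

print(inc_dynamic_params["workspace"])
```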
IncStaticQuantization
Intel® Neural Compressor Static Quantization Pass.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- approach
static quantization mode
type: str
default_value: static
search_defaults: None
- device
Intel® Neural Compressor quantization device. Supports ‘cpu’ and ‘gpu’.
type: str
default_value: cpu
search_defaults: None
- backend
Backend for model execution. Supports ‘default’, ‘onnxrt_trt_ep’ and ‘onnxrt_cuda_ep’.
type: str
default_value: default
search_defaults: None
- domain
Model domain. Supports ‘auto’, ‘cv’, ‘object_detection’, ‘nlp’ and ‘recommendation_system’. The Intel® Neural Compressor Adaptor automatically uses domain-specific quantization settings, and explicitly specified quantization settings override the automatic ones. If domain is set to ‘auto’, the domain is detected automatically.
type: str
default_value: auto
search_defaults: None
- workspace
Workspace for Intel® Neural Compressor quantization where intermediate files and tuning history file are stored. Default value is: “./nc_workspace/{}/”.format(datetime.datetime.now().strftime(“%Y-%m-%d_%H-%M-%S”))
type: str
default_value: None
search_defaults: None
- recipes
Recipes for Intel® Neural Compressor quantization. The supported keys are:
‘smooth_quant’: whether to apply smooth quant.
‘smooth_quant_args’: parameters for smooth_quant.
‘fast_bias_correction’: whether to apply fast bias correction.
‘weight_correction’: whether to apply weight correction.
‘gemm_to_matmul’: whether to convert gemm to matmul and add; only valid for ONNX models.
‘graph_optimization_level’: one of ‘DISABLE_ALL’, ‘ENABLE_BASIC’, ‘ENABLE_EXTENDED’, ‘ENABLE_ALL’; only valid for ONNX models.
‘first_conv_or_matmul_quantization’: whether to quantize the first conv or matmul.
‘last_conv_or_matmul_quantization’: whether to quantize the last conv or matmul.
‘pre_post_process_quantization’: whether to quantize the ops in preprocessing and postprocessing.
‘add_qdq_pair_to_weight’: whether to add a QDQ pair for weights; only valid for onnxrt_trt_ep.
‘optypes_to_exclude_output_quant’: op types whose output is not quantized.
‘dedicated_qdq_pair’: whether to use a dedicated QDQ pair; only valid for onnxrt_trt_ep.
type: dict
default_value: {}
search_defaults: None
- reduce_range
Whether to use 7-bit quantization.
type: bool
default_value: False
search_defaults: Categorical([True, False])
- quant_level
Intel® Neural Compressor allows users to choose different tuning processes by specifying the quantization level (quant_level). Three quant_levels are currently supported: 0 is the conservative strategy, 1 is the basic or user-specified strategy, and auto (default) combines 0 and 1. For more details, see https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-process and https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-algorithms
type: str
default_value: auto
search_defaults: None
- excluded_precisions
Precisions to be excluded. The default value is an empty list. By default, Intel® Neural Compressor enables mixed precision with fp32 + bf16 (only when device is ‘gpu’ and backend is ‘onnxrt_cuda_ep’) + int8. To disable the bf16 data type, specify excluded_precisions = [‘bf16’].
type: list
default_value: []
search_defaults: None
- tuning_criterion
Instance of TuningCriterion class. In this class you can set strategy, strategy_kwargs, timeout, max_trials and objective.
type: dict
default_value: {‘strategy’: ‘basic’, ‘strategy_kwargs’: None, ‘timeout’: 0, ‘max_trials’: 5, ‘objective’: ‘performance’}
search_defaults: None
- metric
Accuracy metric to generate an evaluation function for Intel® Neural Compressor accuracy aware tuning.
type: olive.evaluator.metric.Metric | None
default_value: None
search_defaults: None
- weight_only_config
INC weight only quantization config.
type: dict
default_value: {‘bits’: 4, ‘group_size’: 4, ‘scheme’: ‘asym’, ‘algorithm’: ‘RTN’}
search_defaults: None
- op_type_dict
INC weight only quantization config.
type: dict
default_value: {}
search_defaults: None
- data_config
Data config for calibration, required if approach is ‘static’.
type: olive.data.config.DataConfig | Dict
required: True
- quant_format
Quantization format. Supports ‘QDQ’ and ‘QOperator’.
type: str
default_value: QOperator
search_defaults: Categorical([‘QOperator’, ‘QDQ’])
- calibration_sampling_size
Number of calibration samples.
type: list | int
default_value: [100]
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Size threshold for tensor data: a tensor is converted to external data only when its data size is >= size_threshold. To convert every tensor with raw data to external data, set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
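As a rough illustration of how the IncStaticQuantization options fit together, the sketch below builds a pass entry as a Python dict and checks the string options against the supported values listed above. The `"type"` key layout, the data config name `calibration_data`, and the helper `unsupported_values` are assumptions made for this example; the documented type of data_config is DataConfig | Dict, so a full data config may be needed in place of the name.

```python
# Supported values copied from the option descriptions above.
SUPPORTED_VALUES = {
    "device": {"cpu", "gpu"},
    "backend": {"default", "onnxrt_trt_ep", "onnxrt_cuda_ep"},
    "domain": {"auto", "cv", "object_detection", "nlp", "recommendation_system"},
    "quant_format": {"QDQ", "QOperator"},
}

# Hypothetical pass entry; the key layout and data config name are assumptions.
inc_static_quantization = {
    "type": "IncStaticQuantization",
    "data_config": "calibration_data",  # required
    "device": "cpu",
    "backend": "default",
    "quant_format": "QDQ",
    "calibration_sampling_size": [100],
    "excluded_precisions": ["bf16"],  # disable bf16 mixed precision
    "recipes": {"smooth_quant": True, "gemm_to_matmul": False},
    "tuning_criterion": {
        "strategy": "basic",
        "strategy_kwargs": None,
        "timeout": 0,
        "max_trials": 5,
        "objective": "performance",
    },
}


def unsupported_values(config):
    """Return the option names whose value is not in the documented set."""
    return sorted(
        key
        for key, allowed in SUPPORTED_VALUES.items()
        if key in config and config[key] not in allowed
    )


print(unsupported_values(inc_static_quantization))
```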
IncQuantization
Quantize ONNX model with Intel® Neural Compressor.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- approach
Intel® Neural Compressor quantization mode: ‘dynamic’ for dynamic quantization, ‘static’ for static quantization, ‘weight_only’ for 4-bit weight-only quantization.
type: str
default_value: static
search_defaults: Categorical([‘dynamic’, ‘static’, ‘weight_only’])
- device
Intel® Neural Compressor quantization device. Supports ‘cpu’ and ‘gpu’.
type: str
default_value: cpu
search_defaults: None
- backend
Backend for model execution. Supports ‘default’, ‘onnxrt_trt_ep’ and ‘onnxrt_cuda_ep’.
type: str
default_value: default
search_defaults: None
- domain
Model domain. Supports ‘auto’, ‘cv’, ‘object_detection’, ‘nlp’ and ‘recommendation_system’. The Intel® Neural Compressor Adaptor automatically uses domain-specific quantization settings, and explicitly specified quantization settings override the automatic ones. If domain is set to ‘auto’, the domain is detected automatically.
type: str
default_value: auto
search_defaults: None
- workspace
Workspace for Intel® Neural Compressor quantization where intermediate files and tuning history file are stored. Default value is: “./nc_workspace/{}/”.format(datetime.datetime.now().strftime(“%Y-%m-%d_%H-%M-%S”))
type: str
default_value: None
search_defaults: None
- recipes
Recipes for Intel® Neural Compressor quantization. The supported keys are:
‘smooth_quant’: whether to apply smooth quant.
‘smooth_quant_args’: parameters for smooth_quant.
‘fast_bias_correction’: whether to apply fast bias correction.
‘weight_correction’: whether to apply weight correction.
‘gemm_to_matmul’: whether to convert gemm to matmul and add; only valid for ONNX models.
‘graph_optimization_level’: one of ‘DISABLE_ALL’, ‘ENABLE_BASIC’, ‘ENABLE_EXTENDED’, ‘ENABLE_ALL’; only valid for ONNX models.
‘first_conv_or_matmul_quantization’: whether to quantize the first conv or matmul.
‘last_conv_or_matmul_quantization’: whether to quantize the last conv or matmul.
‘pre_post_process_quantization’: whether to quantize the ops in preprocessing and postprocessing.
‘add_qdq_pair_to_weight’: whether to add a QDQ pair for weights; only valid for onnxrt_trt_ep.
‘optypes_to_exclude_output_quant’: op types whose output is not quantized.
‘dedicated_qdq_pair’: whether to use a dedicated QDQ pair; only valid for onnxrt_trt_ep.
type: dict
default_value: {}
search_defaults: None
- reduce_range
Whether to use 7-bit quantization.
type: bool
default_value: False
search_defaults: Categorical([True, False])
- quant_level
Intel® Neural Compressor allows users to choose different tuning processes by specifying the quantization level (quant_level). Three quant_levels are currently supported: 0 is the conservative strategy, 1 is the basic or user-specified strategy, and auto (default) combines 0 and 1. For more details, see https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-process and https://github.com/intel/neural-compressor/blob/master/docs/source/tuning_strategies.md#tuning-algorithms
type: str
default_value: auto
search_defaults: None
- excluded_precisions
Precisions to be excluded. The default value is an empty list. By default, Intel® Neural Compressor enables mixed precision with fp32 + bf16 (only when device is ‘gpu’ and backend is ‘onnxrt_cuda_ep’) + int8. To disable the bf16 data type, specify excluded_precisions = [‘bf16’].
type: list
default_value: []
search_defaults: None
- tuning_criterion
Instance of TuningCriterion class. In this class you can set strategy, strategy_kwargs, timeout, max_trials and objective.
type: dict
default_value: {‘strategy’: ‘basic’, ‘strategy_kwargs’: None, ‘timeout’: 0, ‘max_trials’: 5, ‘objective’: ‘performance’}
search_defaults: None
- metric
Accuracy metric to generate an evaluation function for Intel® Neural Compressor accuracy aware tuning.
type: olive.evaluator.metric.Metric | None
default_value: None
search_defaults: None
- weight_only_config
INC weight only quantization config.
type: dict
default_value: {‘bits’: 4, ‘group_size’: 4, ‘scheme’: ‘asym’, ‘algorithm’: ‘RTN’}
search_defaults: None
- op_type_dict
INC weight only quantization config.
type: dict
default_value: {}
search_defaults: None
- data_config
Data config for calibration, required if approach is ‘static’.
type: olive.data.config.DataConfig | Dict
required: True
- quant_format
Quantization format. Supports ‘QDQ’ and ‘QOperator’.
type: str
default_value: QOperator
search_defaults: Conditional(parents: (‘approach’,), support: {(‘static’,): Categorical([‘QOperator’, ‘QDQ’])}, default: Categorical([‘default’]))
- calibration_sampling_size
Number of calibration samples.
type: list | int
default_value: [100]
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Size threshold for tensor data: a tensor is converted to external data only when its data size is >= size_threshold. To convert every tensor with raw data to external data, set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
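The conditional search_defaults of quant_format can be read as a small lookup: the choice depends on the parent parameter approach. The sketch below mirrors that reading in plain Python and enumerates the resulting combinations with the other searchable options of this pass. It does not use Olive's search-space classes; `quant_format_choices` and `search_points` are hypothetical names.

```python
from itertools import product

# Categorical search defaults copied from the option list above.
APPROACHES = ["dynamic", "static", "weight_only"]
REDUCE_RANGE = [True, False]


def quant_format_choices(approach):
    """Mirror the Conditional: only 'static' exposes a real format choice."""
    if approach == "static":
        return ["QOperator", "QDQ"]
    return ["default"]


def search_points():
    """Enumerate every (approach, reduce_range, quant_format) combination."""
    points = []
    for approach, reduce_range in product(APPROACHES, REDUCE_RANGE):
        for quant_format in quant_format_choices(approach):
            points.append(
                {
                    "approach": approach,
                    "reduce_range": reduce_range,
                    "quant_format": quant_format,
                }
            )
    return points


for point in search_points():
    print(point)
```

Because only the static approach has two formats, the enumeration yields eight points rather than the twelve a plain Cartesian product would give.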
VitisAIQuantization
Quantize ONNX model with onnxruntime. The best parameters for vai_q_onnx quantization can be searched for at the same time.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- quant_mode
ONNX quantization mode. ‘static’ for Vitis AI quantization.
type: str
default_value: static
search_defaults: Categorical([‘static’])
- data_config
Data config for calibration.
type: olive.data.config.DataConfig | Dict
required: True
- weight_type
Data type for quantizing weights, used in vai_q_onnx quantization. ‘QInt8’ for signed 8-bit integer.
type: str
default_value: QInt8
search_defaults: Categorical([‘QInt8’])
- input_nodes
Start node that needs quantization. If None, all quantizable nodes are quantized.
type: list
default_value: None
search_defaults: None
- output_nodes
End node that needs quantization. If None, all quantizable nodes are quantized.
type: list
default_value: None
search_defaults: None
- op_types_to_quantize
List of operator types to quantize. If None, all quantizable operator types are quantized.
type: list
default_value: None
search_defaults: None
- nodes_to_quantize
List of node names to quantize. If None, all quantizable nodes are quantized.
type: list
default_value: None
search_defaults: None
- nodes_to_exclude
List of node names to exclude from quantization. If None, no nodes are excluded.
type: list
default_value: None
search_defaults: None
- per_channel
Quantize weights per channel.
type: bool
default_value: False
search_defaults: Categorical([True, False])
- optimize_model
Soon to be deprecated in ONNX. Optimize the model before quantization. Not recommended: optimization changes the computation graph, which makes quantization loss difficult to debug.
type: bool
default_value: False
search_defaults: Categorical([True, False])
- use_external_data_format
Option used for large models (>2GB). Set to True by default.
type: bool
default_value: True
search_defaults: None
- quant_preprocess
Shape inference and model optimization, in preparation for quantization. https://onnxruntime.ai/docs/performance/quantization.html#pre-processing
type: bool
default_value: True
search_defaults: Categorical([True, False])
- calibrate_method
Calibration method. The supported methods are NonOverflow and MinMSE.
type: str
default_value: MinMSE
search_defaults: Categorical([‘NonOverflow’, ‘MinMSE’])
- quant_format
Quantization format. The QDQ format quantizes the model by inserting QuantizeLinear/DeQuantizeLinear on the tensor.
type: str
default_value: QDQ
search_defaults: Categorical([‘QDQ’, ‘QOperator’])
- need_layer_fusing
Perform layer fusion for conv-relu type operations.
type: bool
default_value: False
search_defaults: Categorical([True, False])
- activation_type
Quantization data type of activation.
type: str
default_value: QUInt8
search_defaults: Conditional(parents: (‘quant_format’, ‘weight_type’), support: {(‘QDQ’, ‘QInt8’): Categorical([‘QInt8’]), (‘QDQ’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘QOperator’, ‘QUInt8’): Categorical([‘QUInt8’]), (‘QOperator’, ‘QInt8’): Categorical([<SpecialParamValue.INVALID: ‘OLIVE_INVALID_PARAM_VALUE’>])}, default: Categorical([<SpecialParamValue.INVALID: ‘OLIVE_INVALID_PARAM_VALUE’>]))
- enable_dpu
Use QDQ format optimized specifically for DPU.
type: bool
default_value: False
search_defaults: Categorical([True, False])
- ActivationSymmetric
Symmetrize calibration data for activations.
type: bool
default_value: False
search_defaults: None
- WeightSymmetric
Symmetrize calibration data for weights.
type: bool
default_value: True
search_defaults: None
- AddQDQPairToWeight
Keep the weights in floating point and insert both QuantizeLinear and DeQuantizeLinear nodes on the weight.
type: bool
default_value: False
search_defaults: None
- extra_options
Dictionary of key-value pairs for extra_options in quantization. If an option is one of [‘ActivationSymmetric’, ‘WeightSymmetric’, ‘AddQDQPairToWeight’], it is overwritten by the corresponding config parameter value.
type: dict
default_value: None
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Size threshold for tensor data: a tensor is converted to external data only when its data size is >= size_threshold. To convert every tensor with raw data to external data, set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
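Two rules in the VitisAIQuantization options are easier to follow as code: the conditional search space of activation_type, and the way the three dedicated parameters override matching keys in extra_options. The sketch below restates both rules from the descriptions above. It is not Olive's implementation; the helper names and the key `SomeOtherOption` are hypothetical.

```python
INVALID = "OLIVE_INVALID_PARAM_VALUE"

# (quant_format, weight_type) -> allowed activation_type values,
# copied from the Conditional search_defaults of activation_type.
ACTIVATION_SUPPORT = {
    ("QDQ", "QInt8"): ["QInt8"],
    ("QDQ", "QUInt8"): ["QUInt8"],
    ("QOperator", "QUInt8"): ["QUInt8"],
    ("QOperator", "QInt8"): [INVALID],
}


def activation_type_choices(quant_format, weight_type):
    """Unknown parent combinations fall back to the invalid marker."""
    return ACTIVATION_SUPPORT.get((quant_format, weight_type), [INVALID])


def merge_extra_options(extra_options, config):
    """Config parameters win over same-named keys in extra_options."""
    merged = dict(extra_options or {})
    for key in ("ActivationSymmetric", "WeightSymmetric", "AddQDQPairToWeight"):
        if key in config:
            merged[key] = config[key]
    return merged


config = {"ActivationSymmetric": False, "WeightSymmetric": True, "AddQDQPairToWeight": False}
print(merge_extra_options({"ActivationSymmetric": True, "SomeOtherOption": 1}, config))
```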
AppendPrePostProcessingOps
Add pre- and post-processing nodes to the input model.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- pre
List of pre-processing commands to add.
type: List[Dict[str, Any]]
default_value: None
search_defaults: None
- post
List of post-processing commands to add.
type: List[Dict[str, Any]]
default_value: None
search_defaults: None
- tool_command
Composited tool commands to invoke.
type: str
default_value: None
search_defaults: None
- tool_command_args
Arguments to pass to the tool command or to PrePostProcessor. When used for PrePostProcessor, the schema looks like: { “name”: “image”, “data_type”: “uint8”, “shape”: [“num_bytes”],
type: Dict[str, Any] | List[olive.passes.onnx.append_pre_post_processing_ops.PrePostProcessorInput]
default_value: None
search_defaults: None
- target_opset
The version of the default (ai.onnx) opset to target.
type: int
default_value: 16
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Size threshold for tensor data: a tensor is converted to external data only when its data size is >= size_threshold. To convert every tensor with raw data to external data, set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
InsertBeamSearch
Insert Beam Search Op. Only used for whisper models. Uses WhisperBeamSearch contrib op if ORT version >= 1.17.1, else uses BeamSearch contrib op.
Input: handler.base.OliveModelHandler
Output: handler.onnx.ONNXModelHandler
- no_repeat_ngram_size
If set to int > 0, all ngrams of that size can only occur once.
type: int
default_value: 0
search_defaults: None
- use_vocab_mask
Use vocab_mask as an extra graph input to the beam search op. Only supported in ORT >= 1.16.0
type: bool
default_value: False
search_defaults: None
- use_prefix_vocab_mask
Use prefix_vocab_mask as an extra graph input to the beam search op. Only supported in ORT >= 1.16.0
type: bool
default_value: False
search_defaults: None
- use_forced_decoder_ids
Use decoder_input_ids as an extra graph input to the beam search op. Only supported in ORT >= 1.16.0
type: bool
default_value: False
search_defaults: None
- use_logits_processor
Use logits_processor as an extra graph input to the beam search op. Only supported in ORT >= 1.16.0
type: bool
default_value: False
search_defaults: None
- use_temperature
Use temperature as an extra graph input to the beam search op. Only supported in ORT >= 1.17.1
type: bool
default_value: False
search_defaults: None
- fp16
Whether the model is in fp16 precision.
type: bool
default_value: False
search_defaults: None
- use_gpu
Use GPU for beam search op.
type: bool
default_value: False
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Size threshold for tensor data: a tensor is converted to external data only when its data size is >= size_threshold. To convert every tensor with raw data to external data, set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
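The ORT version requirements of InsertBeamSearch can be checked before running the pass. The sketch below encodes the thresholds stated above as version tuples. It is an illustration only: `parse_version`, `beam_search_op` and `unsupported_options` are hypothetical helpers, and the parser assumes plain `major.minor.patch` version strings without pre-release suffixes.

```python
# Minimum ORT version per option, copied from the option descriptions above.
MIN_ORT_VERSION = {
    "use_vocab_mask": (1, 16, 0),
    "use_prefix_vocab_mask": (1, 16, 0),
    "use_forced_decoder_ids": (1, 16, 0),
    "use_logits_processor": (1, 16, 0),
    "use_temperature": (1, 17, 1),
}


def parse_version(version):
    """Turn '1.17.1' into (1, 17, 1); assumes numeric components only."""
    return tuple(int(part) for part in version.split(".")[:3])


def beam_search_op(ort_version):
    """Pick the contrib op name described in the pass summary."""
    if parse_version(ort_version) >= (1, 17, 1):
        return "WhisperBeamSearch"
    return "BeamSearch"


def unsupported_options(config, ort_version):
    """Return enabled options that need a newer ORT than `ort_version`."""
    installed = parse_version(ort_version)
    return sorted(
        name
        for name, minimum in MIN_ORT_VERSION.items()
        if config.get(name) and installed < minimum
    )


print(beam_search_op("1.16.3"))
print(unsupported_options({"use_temperature": True, "use_vocab_mask": True}, "1.16.3"))
```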
ExtractAdapters
Extract adapter weights from the model and save them as an external weights file. If make_inputs is False, the model proto is invalid after this pass because the adapter weights point to non-existent external files; the inference session must be created by first loading the adapter weights using SessionOptions.add_external_initializers. If make_inputs is True, the adapter weights are inputs to the model and must be provided during inference.
Input: handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- make_inputs
Convert adapter weights to inputs. If false, the adapter weights will be set as initializers with external data.
type: bool
default_value: True
search_defaults: None
- dynamic_lora_r
Whether the model uses dynamic shape for lora_r. Only used if make_inputs is True. Valid only for float modules.
type: bool
default_value: True
search_defaults: None
- optional_inputs
Create default initializers (empty tensor with lora_r dimension set to 0) for the adapter weights, if inputs not provided during inference. Only used if make_inputs is True. Valid only for float modules.
type: bool
default_value: True
search_defaults: None
- save_format
Format to save the weights in.
type: olive.common.utils.WeightsFileFormat
default_value: numpy
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Size threshold for tensor data: a tensor is converted to external data only when its data size is >= size_threshold. To convert every tensor with raw data to external data, set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
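For the make_inputs=False case, the session has to receive the adapter weights before the model is loaded. The following is a minimal sketch of that flow with ONNX Runtime, under several assumptions: the weights were saved in the numpy format as an `.npz` archive keyed by initializer name, and the file paths are placeholders. `split_names_and_arrays` and `create_session` are hypothetical helpers, not part of Olive.

```python
def split_names_and_arrays(weights):
    """Return parallel lists of initializer names and their arrays."""
    names = sorted(weights)
    return names, [weights[name] for name in names]


def create_session(model_path, weights_path):
    """Create an InferenceSession with adapter weights as external initializers."""
    # Imported lazily so the helper above can be used without these packages.
    import numpy as np
    import onnxruntime as ort

    # Assumption: the weights file is an .npz archive keyed by initializer name.
    with np.load(weights_path) as archive:
        weights = {name: archive[name] for name in archive.files}

    names, arrays = split_names_and_arrays(weights)
    options = ort.SessionOptions()
    # The initializers must be registered before the session is created.
    options.add_external_initializers(
        names, [ort.OrtValue.ortvalue_from_numpy(array) for array in arrays]
    )
    return ort.InferenceSession(model_path, sess_options=options)
```

A call such as `create_session("model.onnx", "adapter_weights.npz")` would then stand in for a plain `InferenceSession("model.onnx")`; both file names are placeholders.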
CaptureSplitInfo
Capture the split information of the model layers. Only splits the transformer layers.
Input: handler.hf.HfModelHandler | handler.pytorch.PyTorchModelHandler
Output: handler.hf.HfModelHandler | handler.pytorch.PyTorchModelHandler
- num_splits
Number of splits to divide the model layers into.
type: int
required: True
- block_to_split
Name of the model block to split. Children of the block will be divided into the splits. For supported transformers models, the default value is the transformers layer block name. Refer to olive.common.hf.mappings.MODELS_TO_LAYERS_MAPPING for supported models.
type: str
default_value: None
search_defaults: None
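One plausible way to divide the children of a block into num_splits groups is to cut the ordered layer list into contiguous, near-equal chunks. The sketch below shows that scheme; it is an assumption about how such a split could work, not a description of Olive's algorithm, and the layer names and the helper `assign_splits` are hypothetical.

```python
def assign_splits(layer_names, num_splits):
    """Map each layer name to a split index using contiguous chunks."""
    if not 1 <= num_splits <= len(layer_names):
        raise ValueError("num_splits must be between 1 and the number of layers")
    base, extra = divmod(len(layer_names), num_splits)
    assignment = {}
    start = 0
    for split_id in range(num_splits):
        # The first `extra` splits take one additional layer.
        size = base + (1 if split_id < extra else 0)
        for name in layer_names[start:start + size]:
            assignment[name] = split_id
        start += size
    return assignment


layers = [f"model.layers.{i}" for i in range(5)]  # hypothetical layer names
print(assign_splits(layers, num_splits=2))
```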
SplitModel
Input: handler.onnx.ONNXModelHandler
Output: handler.composite.CompositeModelHandler
- include_all_nodes
Include all nodes in the split model. Nodes outside the splits or before the first split will be assigned to the first split. Nodes after the last split will be assigned to the last split. If False, these nodes will not be included in the split models.
type: bool
default_value: True
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Size threshold for tensor data: a tensor is converted to external data only when its data size is >= size_threshold. To convert every tensor with raw data to external data, set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
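The include_all_nodes rule can be illustrated with a toy graph in which each node either carries a split index or has none. This is a simplified reading of the description above, not Olive's implementation: nodes are given in graph order as `(name, split_id)` pairs, and `place_nodes` is a hypothetical helper.

```python
def place_nodes(nodes, include_all_nodes=True):
    """Assign nodes to splits; `nodes` is a list of (name, split_id or None)."""
    labelled = [split_id for _, split_id in nodes if split_id is not None]
    if not labelled:
        raise ValueError("at least one node must belong to a split")
    first, last = min(labelled), max(labelled)
    last_labelled_index = max(
        index for index, (_, split_id) in enumerate(nodes) if split_id is not None
    )
    placement = {}
    for index, (name, split_id) in enumerate(nodes):
        if split_id is not None:
            placement[name] = split_id
        elif include_all_nodes:
            # Nodes after the last split go to the last split,
            # every other unassigned node goes to the first split.
            placement[name] = last if index > last_labelled_index else first
    return placement


graph = [("embed", None), ("layer0", 0), ("layer1", 1), ("lm_head", None)]
print(place_nodes(graph))
print(place_nodes(graph, include_all_nodes=False))
```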
LoRA
Run LoRA fine-tuning on a Hugging Face PyTorch model.
Input: handler.hf.HfModelHandler
Output: handler.hf.HfModelHandler
- target_modules
Target modules
type: List[str]
default_value: None
search_defaults: None
- lora_r
Lora R dimension.
type: int
default_value: 64
search_defaults: Categorical([16, 32, 64])
- lora_alpha
The alpha parameter for Lora scaling.
type: float
default_value: 16
search_defaults: None
- lora_dropout
The dropout probability for Lora layers.
type: float
default_value: 0.05
search_defaults: None
- modules_to_save
List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint.
type: None
default_value: None
search_defaults: None
- torch_dtype
Data type to use for training. Should be one of bfloat16, float16 or float32. If float16, fp16 mixed-precision training is used.
type: str
default_value: bfloat16
search_defaults: None
- allow_tf32
Whether or not to allow TF32 on Ampere GPUs. Can be used to speed up training. For more information, see ‘https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices’
type: bool
default_value: True
search_defaults: None
- train_data_config
Data config for fine-tuning training.
type: olive.data.config.DataConfig | Dict
required: True
- eval_data_config
Data config for fine-tuning evaluation. Optional if evaluation is not needed.
type: olive.data.config.DataConfig | Dict
default_value: None
search_defaults: None
- training_args
Training arguments. If None, will use default arguments. See HFTrainingArguments for more details.
type: olive.passes.pytorch.lora.HFTrainingArguments | Dict
default_value: None
search_defaults: None
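To get a feel for lora_r and lora_alpha, the sketch below computes two quantities for each value in the lora_r search defaults: the scaling factor lora_alpha / lora_r that the standard LoRA formulation applies to the adapter output, and the number of trainable parameters one adapter adds to a linear layer. The 4096 x 4096 layer shape is an arbitrary example, and the helper names are hypothetical.

```python
def lora_scaling(lora_alpha, lora_r):
    """Scaling applied to the low-rank update in the standard LoRA formulation."""
    return lora_alpha / lora_r


def lora_trainable_params(in_features, out_features, lora_r):
    """Parameters of the two low-rank matrices A (r x in) and B (out x r)."""
    return lora_r * (in_features + out_features)


lora_alpha = 16  # default_value from the option list above
for lora_r in (16, 32, 64):  # search_defaults of lora_r
    scaling = lora_scaling(lora_alpha, lora_r)
    params = lora_trainable_params(4096, 4096, lora_r)
    print(f"lora_r={lora_r}: scaling={scaling}, trainable params per layer={params}")
```

With the default lora_alpha of 16, a larger lora_r adds capacity but shrinks the scaling factor, so the two options are usually tuned together.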
QLoRA
Run QLoRA fine-tuning on a Hugging Face PyTorch model.
Input: handler.hf.HfModelHandler
Output: handler.hf.HfModelHandler
- double_quant
Whether to use nested quantization where the quantization constants from the first quantization are quantized again.
type: bool
default_value: False
search_defaults: None
- quant_type
Quantization data type to use. Should be one of fp4 or nf4.
type: str
default_value: nf4
search_defaults: None
- compute_dtype
Computation data type for the quantized modules. If not provided, will use the same dtype as torch_dtype
type: str
default_value: None
search_defaults: None
- save_quant_config
Whether to save the output model with the bitsandbytes quantization config. If False, the base model will be in the original precision. If True, the base model will be quantized on load.
type: bool
default_value: True
search_defaults: None
- lora_r
Lora R dimension.
type: int
default_value: 64
search_defaults: Categorical([16, 32, 64])
- lora_alpha
The alpha parameter for Lora scaling.
type: float
default_value: 16
search_defaults: None
- lora_dropout
The dropout probability for Lora layers.
type: float
default_value: 0.05
search_defaults: None
- modules_to_save
List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint.
type: None
default_value: None
search_defaults: None
- torch_dtype
Data type to use for training. Should be one of bfloat16, float16 or float32. If float16, fp16 mixed-precision training is used.
type: str
default_value: bfloat16
search_defaults: None
- allow_tf32
Whether or not to allow TF32 on Ampere GPUs. Can be used to speed up training. For more information, see ‘https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices’
type: bool
default_value: True
search_defaults: None
- train_data_config
Data config for fine-tuning training.
type: olive.data.config.DataConfig | Dict
required: True
- eval_data_config
Data config for fine-tuning evaluation. Optional if evaluation is not needed.
type: olive.data.config.DataConfig | Dict
default_value: None
search_defaults: None
- training_args
Training arguments. If None, will use default arguments. See HFTrainingArguments for more details.
type: olive.passes.pytorch.lora.HFTrainingArguments | Dict
default_value: None
search_defaults: None
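The QLoRA quantization options correspond closely to the 4-bit settings of the bitsandbytes integration in transformers. The sketch below builds the keyword arguments one might pass to `transformers.BitsAndBytesConfig`. The mapping is an assumption made for illustration, not taken from Olive's source, and `bnb_kwargs` is a hypothetical helper. The fallback of compute_dtype to torch_dtype follows the option description above.

```python
def bnb_kwargs(quant_type="nf4", double_quant=False, compute_dtype=None, torch_dtype="bfloat16"):
    """Build keyword arguments for a 4-bit bitsandbytes quantization config."""
    if quant_type not in ("fp4", "nf4"):
        raise ValueError("quant_type should be one of fp4 or nf4")
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": quant_type,
        "bnb_4bit_use_double_quant": double_quant,
        # If compute_dtype is not provided, use the same dtype as torch_dtype.
        "bnb_4bit_compute_dtype": compute_dtype or torch_dtype,
    }


print(bnb_kwargs())
print(bnb_kwargs(quant_type="fp4", double_quant=True, compute_dtype="float16"))
```

If transformers is installed, the resulting dict could be expanded as `BitsAndBytesConfig(**bnb_kwargs())`.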
LoftQ
Run LoftQ fine-tuning on a Hugging Face PyTorch model.
Input: handler.hf.HfModelHandler
Output: handler.hf.HfModelHandler
- loftq_iter
Number of LoftQ iterations.
type: int
default_value: 1
search_defaults: None
- compute_dtype
Computation data type for the quantized modules. If not provided, will use the same dtype as torch_dtype
type: str
default_value: None
search_defaults: None
- save_quant_config
Whether to save the output model with the bitsandbytes quantization config. If False, the base model will be in the original precision. If True, the base model will be quantized on load.
type: bool
default_value: True
search_defaults: None
- lora_r
Lora R dimension.
type: int
default_value: 64
search_defaults: Categorical([16, 32, 64])
- lora_alpha
The alpha parameter for Lora scaling.
type: float
default_value: 16
search_defaults: None
- lora_dropout
The dropout probability for Lora layers.
type: float
default_value: 0.05
search_defaults: None
- modules_to_save
List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint.
type: None
default_value: None
search_defaults: None
- torch_dtype
Data type to use for training. Should be one of bfloat16, float16 or float32. If float16, fp16 mixed-precision training is used.
type: str
default_value: bfloat16
search_defaults: None
- allow_tf32
Whether or not to allow TF32 on Ampere GPUs. Can be used to speed up training. For more information, see ‘https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices’
type: bool
default_value: True
search_defaults: None
- train_data_config
Data config for fine-tuning training.
type: olive.data.config.DataConfig | Dict
required: True
- eval_data_config
Data config for fine-tuning evaluation. Optional if evaluation is not needed.
type: olive.data.config.DataConfig | Dict
default_value: None
search_defaults: None
- training_args
Training arguments. If None, will use default arguments. See HFTrainingArguments for more details.
type: olive.passes.pytorch.lora.HFTrainingArguments | Dict
default_value: None
search_defaults: None
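As a rough illustration of how lora_r and lora_alpha interact: in the standard LoRA formulation the low-rank update is scaled by lora_alpha / lora_r, so the defaults above give 16 / 64 = 0.25. The sketch below computes that factor for each value in the lora_r search defaults. The helper name lora_scaling and the dictionary layout are illustrative assumptions, not part of Olive, and the required train_data_config is omitted for brevity.

```python
# Hypothetical helper: standard LoRA scaling factor (lora_alpha / lora_r).
def lora_scaling(lora_alpha: float, lora_r: int) -> float:
    if lora_r <= 0:
        raise ValueError("lora_r must be a positive integer")
    return lora_alpha / lora_r


# Option names and defaults are taken from the LoftQ list above;
# the dict layout itself is an assumption made for illustration.
loftq_options = {
    "loftq_iter": 1,
    "lora_r": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "torch_dtype": "bfloat16",
}

# Scaling factor for each lora_r in the documented search defaults.
scaling_by_rank = {
    r: lora_scaling(loftq_options["lora_alpha"], r) for r in (16, 32, 64)
}
print(scaling_by_rank)
```

With lora_alpha held fixed, a smaller rank yields a larger scaling factor, which is worth keeping in mind when searching over lora_r.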
LoRA/QLoRA/LoftQ HFTrainingArguments
- pydantic settings olive.passes.pytorch.lora.HFTrainingArguments[source]
Training arguments for transformers.Trainer.
Has the same fields as transformers.TrainingArguments with recommended default values for QLoRA fine-tuning.
- field optim: str = 'paged_adamw_32bit'
The optimizer to use.
- field learning_rate: float = 0.0002
The initial learning rate for AdamW.
- field gradient_checkpointing: bool = True
Use gradient checkpointing. Recommended.
- field lr_scheduler_type: str = 'constant'
Learning rate schedule. A constant schedule performs slightly better than cosine and is easier to analyze.
- field warmup_ratio: float = 0.03
Fraction of steps to do a warmup for.
- field evaluation_strategy: str = None
The evaluation strategy to use. Forced to ‘no’ if eval_dataset is not provided. Otherwise, ‘steps’ unless set to ‘epoch’.
- field report_to: str | List[str] = 'none'
The list of integrations to report the results and logs to.
- field output_dir: str = None
The output dir for logs and checkpoints. If None, will use a temp dir.
- field overwrite_output_dir: bool = False
If True, overwrite the content of output_dir. Otherwise, will continue training if output_dir points to a checkpoint directory.
- field resume_from_checkpoint: str = None
The path to a folder with a valid checkpoint for the model. Supersedes any checkpoint found in output_dir.
- field deepspeed: bool | str | Dict = None
Use DeepSpeed (https://github.com/microsoft/deepspeed). If True, the default DeepSpeed config is used. Otherwise, this is a path to a DeepSpeed config file or a dict containing the DeepSpeed config.
- field extra_args: Dict[str, Any] = None
Extra arguments to pass to the trainer. Values can be provided directly to this field as a dict or as keyword arguments to the config. See transformers.TrainingArguments for more details on the available arguments.
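To make warmup_ratio concrete, the sketch below converts it into a number of warmup steps. It assumes the step count is rounded up, which matches how transformers.TrainingArguments derives warmup steps from a ratio. The helper name and the total step counts are hypothetical values chosen for illustration.

```python
import math


def warmup_steps(total_steps: int, warmup_ratio: float = 0.03) -> int:
    """Number of warmup steps for a given ratio, rounded up."""
    if not 0.0 <= warmup_ratio <= 1.0:
        raise ValueError("warmup_ratio must be in [0, 1]")
    return math.ceil(total_steps * warmup_ratio)


# training_args supplied as a plain dict; the keys are fields documented above.
training_args = {
    "optim": "paged_adamw_32bit",
    "learning_rate": 0.0002,
    "lr_scheduler_type": "constant",
    "warmup_ratio": 0.03,
}

print(warmup_steps(1000, training_args["warmup_ratio"]))
```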
QuantizationAwareTraining
Run quantization aware training on PyTorch model.
Input: handler.pytorch.PyTorchModelHandler
Output: handler.pytorch.PyTorchModelHandler
- user_script
Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.
type: pathlib.Path | str
default_value: None
search_defaults: None
- script_dir
Directory containing user script dependencies.
type: pathlib.Path | str
default_value: None
search_defaults: None
- train_data_config
Data config for training.
type: olive.data.config.DataConfig | Dict
required: True
- val_data_config
Data config for validation.
type: olive.data.config.DataConfig | Dict
default_value: None
search_defaults: None
- training_loop_func
Customized training loop function.
type: Callable | str
default_value: None
search_defaults: None
- ptl_module
LightningModule for PyTorch Lightning trainer. It is a way of encapsulating all the logic related to the training, validation, and testing of a PyTorch model. Please refer to https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html for more details.
type: Callable | str
default_value: None
search_defaults: None
- ptl_data_module
LightningDataModule for PyTorch Lightning trainer. It is a way of encapsulating all the data-related logic for training, validation, and testing of a PyTorch model. Please refer to https://pytorch-lightning.readthedocs.io/en/stable/data/datamodule.html for more details.
type: Callable | str
default_value: None
search_defaults: None
- num_epochs
Maximum number of epochs for training.
type: int
default_value: None
search_defaults: None
- num_steps
Maximum number of steps for training.
type: int
default_value: -1
search_defaults: None
- do_validate
Whether to perform one evaluation epoch over the validation set after training.
type: bool
default_value: False
search_defaults: None
- modules_to_fuse
List of lists of module names to fuse.
type: List[List[str]]
default_value: None
search_defaults: None
- qconfig_func
Customized function to create a QConfig for QAT. Please refer to https://pytorch.org/docs/stable/generated/torch.ao.quantization.qconfig.QConfig.html for details.
type: Callable | str
default_value: None
search_defaults: None
- logger
Logger for training.
type: pytorch_lightning.loggers.logger.Logger | Iterable[pytorch_lightning.loggers.logger.Logger] | Callable | bool
default_value: False
search_defaults: None
- gpus
Number of GPUs to use.
type: int
default_value: None
search_defaults: None
- seed
Random seed for training.
type: int
default_value: None
search_defaults: None
- checkpoint_path
Path to save checkpoints.
type: str
default_value: None
search_defaults: None
OpenVINOConversion
Converts a PyTorch, ONNX or TensorFlow model to an OpenVINO model.
Input: handler.hf.HfModelHandler | handler.pytorch.PyTorchModelHandler | handler.onnx.ONNXModelHandler
Output: handler.openvino.OpenVINOModelHandler
- user_script
Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.
type: pathlib.Path | str
default_value: None
search_defaults: None
- script_dir
Directory containing user script dependencies.
type: pathlib.Path | str
default_value: None
search_defaults: None
- input
Set or override shapes for model inputs. It configures dynamic and static dimensions in model inputs depending on your inference requirements.
type: Callable | str | List
default_value: None
search_defaults: None
- example_input_func
Function or function name that generates a sample model input in the original framework. For PyTorch it can be torch.Tensor. For TensorFlow it can be tf.Tensor or numpy.ndarray.
type: Callable | str
default_value: None
search_defaults: None
- compress_to_fp16
Compress weights in output OpenVINO model to FP16. Default is True.
type: bool
default_value: True
search_defaults: None
- extra_configs
Extra configurations for OpenVINO model conversion. extra_configs can be set by passing a dictionary where each key is a parameter name and each value is the parameter value. See the Conversion Parameters documentation for more details: https://docs.openvino.ai/2023.3/openvino_docs_OV_Converter_UG_Conversion_Options.html
type: Dict
default_value: None
search_defaults: None
- output_model
Name of the output OpenVINO model.
type: str
default_value: ov_model
search_defaults: None
OpenVINOQuantization
Input: handler.openvino.OpenVINOModelHandler
Output: handler.openvino.OpenVINOModelHandler
- user_script
Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.
type: pathlib.Path | str
default_value: None
search_defaults: None
- script_dir
Directory containing user script dependencies.
type: pathlib.Path | str
default_value: None
search_defaults: None
- data_config
Data config for calibration.
type: olive.data.config.DataConfig | Dict
required: True
- model_type
Used to specify quantization scheme required for specific type of the model. ‘TRANSFORMER’ is the only supported special quantization scheme to preserve accuracy after quantization of Transformer models (BERT, DistilBERT, etc.). None is default.
type: olive.passes.openvino.quantization.ModelTypeEnum
default_value: None
search_defaults: None
- preset
Defines quantization scheme for the model. Supported values: ‘PERFORMANCE’, ‘MIXED’.
type: olive.passes.openvino.quantization.PresetEnum
default_value: PERFORMANCE
search_defaults: None
- ignored_scope
This parameter can be used to exclude some layers from the quantization process to preserve the model accuracy. Please refer to https://docs.openvino.ai/2023.3/basic_quantization_flow.html#tune-quantization-parameters.
type: str | List[str]
default_value: None
search_defaults: None
- ignored_scope_type
Defines the type of the ignored scope. Supported values: ‘names’, ‘types’, ‘patterns’.
type: olive.passes.openvino.quantization.IgnoreScopeTypeEnum
default_value: None
search_defaults: None
- target_device
Target device for the model. Supported values: ‘any’, ‘cpu’, ‘gpu’, ‘cpu_spr’, ‘vpu’. Default value is the same as the accelerator type of this workflow run.
type: olive.hardware.accelerator.Device
default_value: cpu
search_defaults: None
- extra_configs
Extra configurations for OpenVINO model quantization. Please refer to https://docs.openvino.ai/2023.3/basic_quantization_flow.html#tune-quantization-parameters.
type: List[Dict]
default_value: None
search_defaults: None
SNPEConversion
Convert ONNX or TensorFlow model to SNPE DLC. Uses snpe-tensorflow-to-dlc or snpe-onnx-to-dlc tools from the SNPE SDK.
Input: handler.onnx.ONNXModelHandler | handler.tensorflow.TensorFlowModelHandler
Output: handler.snpe.SNPEModelHandler
- input_names
List of input names.
type: List[str]
required: True
- input_shapes
List of input shapes. Must be the same length as input_names.
type: List[List[int]]
required: True
- output_names
List of output names.
type: List[str]
required: True
- input_types
List of input types. If not None, it must be a list of the same length as input_names. List members can be None to use default value. Refer to olive.platform_sdk.qualcomm.constants.InputType for valid values.
type: List[str | None]
default_value: None
search_defaults: None
- input_layouts
List of input layouts. If not None, it must be a list of the same length as input_names. List members can be None to use inferred value. Refer to olive.platform_sdk.qualcomm.constants.InputLayout for valid values.
type: List[str | None]
default_value: None
search_defaults: None
- extra_args
Extra arguments to pass to the SNPE conversion tool. Refer to snpe-onnx-to-dlc and snpe-tensorflow-to-dlc at https://developer.qualcomm.com/sites/default/files/docs/snpe/tools.html for additional arguments. The value is a string that is passed as-is to the tool, e.g.: --enable_cpu_fallback --priority_hint low
type: str
default_value: None
search_defaults: None
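The length constraints described above are easy to break when the lists are edited by hand. The following sketch checks them before a config is handed to the pass. The function name check_snpe_io and the example tensor names are hypothetical; only the constraints themselves come from the option descriptions.

```python
from typing import List, Optional


def check_snpe_io(
    input_names: List[str],
    input_shapes: List[List[int]],
    input_types: Optional[List[Optional[str]]] = None,
    input_layouts: Optional[List[Optional[str]]] = None,
) -> None:
    """Raise ValueError if a per-input list does not match input_names in length."""
    expected = len(input_names)
    if len(input_shapes) != expected:
        raise ValueError("input_shapes must be the same length as input_names")
    for label, values in (
        ("input_types", input_types),
        ("input_layouts", input_layouts),
    ):
        # None means "not provided"; individual members may also be None.
        if values is not None and len(values) != expected:
            raise ValueError(f"{label} must be the same length as input_names")


# Two inputs, two shapes, and per-input types left at their defaults.
check_snpe_io(
    ["image", "mask"],
    [[1, 224, 224, 3], [1, 224, 224, 1]],
    input_types=[None, None],
)
```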
SNPEQuantization
Quantize SNPE model. Uses snpe-dlc-quantize tool from the SNPE SDK.
Input: handler.snpe.SNPEModelHandler
Output: handler.snpe.SNPEModelHandler
- data_config
Data config for quantization.
type: olive.data.config.DataConfig | Dict
required: True
- use_enhanced_quantizer
Use the enhanced quantizer feature when quantizing the model. Uses an algorithm to determine optimal range instead of min and max range of data. It can be useful for quantizing models that have long tails in the distribution of the data being quantized.
type: bool
default_value: False
search_defaults: Categorical([True, False])
- enable_htp
Pack HTP information in quantized DLC, which is not available in Windows.
type: bool
default_value: False
search_defaults: Categorical([True, False])
- htp_socs
List of SoCs to generate HTP Offline cache for.
type: List[str]
default_value: None
search_defaults: None
- extra_args
Extra arguments to pass to the snpe-dlc-quantize tool. Refer to https://developer.qualcomm.com/sites/default/files/docs/snpe/tools.html#tools_snpe-dlc-quantize for additional arguments. The value is a string that is passed as-is to the tool, e.g.: --bias_bitwidth 16 --overwrite_cache_records
type: str
default_value: None
search_defaults: None
SNPEtoONNXConversion
Convert an SNPE DLC to ONNX for use with the SNPE Execution Provider. Creates an ONNX graph with the SNPE DLC as a node.
Input: handler.snpe.SNPEModelHandler
Output: handler.onnx.ONNXModelHandler
- target_device
Target device for the ONNX model. Refer to olive.platform_sdk.qualcomm.constants.SNPEDevice for valid values.
type: str
default_value: cpu
search_defaults: None
- target_opset
Target ONNX opset version.
type: int
default_value: 12
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
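The interaction between save_as_external_data and size_threshold can be sketched as a simple filter. This illustrates the rule stated above and is not the pass's implementation; the function name, tensor names and byte sizes are made up, and the forced externalization of models larger than 2GB is not modeled.

```python
from typing import Dict, List


def tensors_to_externalize(
    tensor_sizes: Dict[str, int],
    save_as_external_data: bool = False,
    size_threshold: int = 1024,
) -> List[str]:
    """Return the names of tensors whose raw data would move to external files."""
    if not save_as_external_data:
        return []
    # A tensor is moved only when its data size is >= size_threshold.
    return [name for name, size in tensor_sizes.items() if size >= size_threshold]


sizes = {"embed.weight": 4_000_000, "norm.bias": 512, "proj.weight": 1024}
default_threshold = tensors_to_externalize(sizes, save_as_external_data=True)
zero_threshold = tensors_to_externalize(
    sizes, save_as_external_data=True, size_threshold=0
)
print(default_threshold)
print(zero_threshold)
```

Setting size_threshold=0 selects every tensor, which matches the note that a threshold of 0 converts every tensor with raw data.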
QNNConversion
Convert an ONNX, TensorFlow, or PyTorch model to a QNN C++ model. The model is quantized if --input_list is provided in extra_args. Uses the qnn-[framework]-converter tool from the QNN SDK.
Input: handler.tensorflow.TensorFlowModelHandler | handler.pytorch.PyTorchModelHandler | handler.onnx.ONNXModelHandler
Output: handler.qnn.QNNModelHandler
- input_dim
The names and dimensions of the network input layers, specified in the format [input_name comma-separated-dimensions], for example: ["data 1,224,224,3"]. The quotes should always be included in order to handle special characters, spaces, etc. For multiple inputs, provide one entry per input (equivalent to multiple --input_dim options on the command line), for example: ["data 1,224,224,3", "data2 1,224,224,3"]. If --input_dim is not specified, the input dimensions are inferred from the model. If it is specified, the given dimensions are used as-is.
type: List[str]
default_value: None
search_defaults: None
- out_node
The names of the output nodes. If not specified, the output nodes are inferred from the model. If specified, they are used as-is. Example: ["out_1", "out_2"]
type: List[str]
default_value: None
search_defaults: None
- extra_args
Extra arguments to pass to the qnn-[framework]-converter tool, e.g. --show_unconsumed_nodes --custom_io CUSTOM_IO. See the documentation for more details: https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/tools.html
type: str
default_value: None
search_defaults: None
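The input_dim entries follow a fixed "name dims" layout, so they can be generated from shapes rather than typed by hand. The sketch below does that; format_input_dim is a hypothetical helper, and the input names mirror the example in the option description.

```python
from typing import Dict, List, Sequence


def format_input_dim(shapes: Dict[str, Sequence[int]]) -> List[str]:
    """Build input_dim entries of the form 'input_name d1,d2,...'."""
    return [
        f"{name} {','.join(str(d) for d in dims)}" for name, dims in shapes.items()
    ]


input_dim = format_input_dim(
    {"data": (1, 224, 224, 3), "data2": (1, 224, 224, 3)}
)
print(input_dim)
```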
QNNModelLibGenerator
Compile QNN C++ model source code into QNN model library for a specific target. Uses qnn-model-lib-generator tool from the QNN SDK.
Input: handler.qnn.QNNModelHandler
Output: handler.qnn.QNNModelHandler
- lib_targets
Specifies the targets to build the models for. Default: aarch64-android x86_64-linux-clang
type: str
default_value: None
search_defaults: None
- lib_name
Specifies the name to use for libraries. Default: uses name in <model.bin> if provided, else generic qnn_model.so
type: str
default_value: None
search_defaults: None
QNNContextBinaryGenerator
Create QNN context binary from a QNN model library using a particular backend. Uses qnn-context-binary-generator tool from the QNN SDK.
Input: handler.qnn.QNNModelHandler | handler.snpe.SNPEModelHandler
Output: handler.qnn.QNNModelHandler
- backend
Path to a QNN backend .so library to create the context binary.
type: str
required: True
- binary_file
Name of the binary file to save the context binary to. Saved in the same path as the --output_dir option with .bin as the binary file extension. If not provided, no backend binary is created.
type: str
default_value: None
search_defaults: None
- extra_args
Extra arguments to pass to qnn-context-binary-generator.
type: str
default_value: None
search_defaults: None
MergeAdapterWeights
Merge adapter weights into the base model.
Input: handler.hf.HfModelHandler
Output: handler.hf.HfModelHandler
SparseGPT
Run SparseGPT on a Hugging Face PyTorch model. See https://arxiv.org/abs/2301.00774 for more details on the algorithm. This pass only supports HfModelHandler. The transformers model type must be one of [bloom, gpt2, gpt_neox, llama, opt].
Input: handler.hf.HfModelHandler
Output: handler.hf.HfModelHandler
- sparsity
Target sparsity. This can be a float or a list of two integers. Float is the target sparsity per layer. List [n,m] applies semi-structured (n:m) sparsity patterns. Refer to https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/ for more details on 2:4 sparsity pattern.
type: float | List[int]
default_value: None
search_defaults: None
- blocksize
Blocksize to use for adaptive mask selection.
type: int
default_value: 128
search_defaults: None
- percdamp
Percentage of the average Hessian diagonal to use for dampening. Must be in [0,1].
type: float
default_value: 0.01
search_defaults: None
- min_layer
Prune all layers with id >= min_layer.
type: int
default_value: None
search_defaults: None
- max_layer
Prune all layers with id < max_layer.
type: int
default_value: None
search_defaults: None
- layer_name_filter
Only prune layers whose name contains the given string(s).
type: str | List[str]
default_value: None
search_defaults: None
- device
Device to use for performing computations. Can be ‘auto’, ‘cpu’, ‘cuda’, ‘cuda:0’, etc. If ‘auto’, will use cuda if available. Does not affect the final model.
type: str
default_value: auto
search_defaults: None
- data_config
Data config to use for pruning weights. All samples in the data are expected to be of the same length, most likely the max sequence length of the model.
type: olive.data.config.DataConfig | Dict
required: True
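To show what an n:m pattern such as [2, 4] means, the sketch below zeroes n weights in every group of m consecutive weights, assuming n counts the zeroed weights as in the 2:4 pattern. It picks the smallest-magnitude weights, which is a deliberate simplification: SparseGPT itself selects and compensates weights using Hessian information, so this only illustrates the resulting pattern, not the algorithm. The function name is hypothetical.

```python
from typing import List


def apply_n_m_sparsity(weights: List[float], n: int, m: int) -> List[float]:
    """Zero the n smallest-magnitude weights in each group of m consecutive weights."""
    if m <= 0 or not 0 <= n <= m:
        raise ValueError("require m > 0 and 0 <= n <= m")
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")
    pruned = list(weights)
    for start in range(0, len(weights), m):
        group = range(start, start + m)
        # Indices of the n smallest absolute values in this group.
        for idx in sorted(group, key=lambda i: abs(weights[i]))[:n]:
            pruned[idx] = 0.0
    return pruned


row = [0.1, -0.5, 0.3, 0.9, 1.0, 0.2, -0.05, 0.4]
pruned_row = apply_n_m_sparsity(row, 2, 4)
print(pruned_row)
```

Every group of four keeps exactly two non-zero entries, so the overall sparsity of the row is 50%.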
SliceGPT
Run SliceGPT on a Hugging Face PyTorch model. See https://arxiv.org/pdf/2401.15024.pdf for more details on the algorithm. This pass only supports HfModelHandler.
Input: handler.hf.HfModelHandler
Output: handler.pytorch.PyTorchModelHandler
- calibration_data_config
Data config for Dataset to calibrate and calculate perplexity on.
type: olive.data.config.DataConfig | Dict
required: True
- calibration_nsamples
Number of samples of the calibration data to load.
type: int
default_value: 128
search_defaults: None
- calibration_batch_size
Batch size for loading the calibration data.
type: int
default_value: 16
search_defaults: None
- seed
Seed for sampling the calibration data.
type: int
default_value: 42
search_defaults: None
- sparsity
A measure of how much slicing is applied, in the range [0, 1).
type: float
default_value: 0.0
search_defaults: None
- round_interval
Interval for rounding the weights. The best value may depend on your hardware.
type: int
default_value: 8
search_defaults: None
- final_orientation
Final orientation of the sliced weights. Choices are random or pca.
type: str
default_value: random
search_defaults: None
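The effect of sparsity and round_interval on the hidden dimension can be sketched with simple arithmetic. The sketch assumes the sliced dimension is (1 - sparsity) times the original, rounded down to a multiple of round_interval. That rounding rule is an assumption, since this reference does not state it, and the hidden size of 4096 is a hypothetical example.

```python
def sliced_dimension(hidden_size: int, sparsity: float, round_interval: int = 8) -> int:
    """Hidden dimension after slicing, rounded down to a multiple of round_interval."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    if round_interval <= 0:
        raise ValueError("round_interval must be positive")
    new_dim = int((1.0 - sparsity) * hidden_size)
    # Drop the remainder so the result is a multiple of round_interval.
    return new_dim - new_dim % round_interval


for sparsity in (0.0, 0.25, 0.3):
    print(sparsity, sliced_dimension(4096, sparsity))
```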
QuaRot
Run QuaRot, a quantization scheme based on rotations that can quantize LLMs end-to-end. See https://arxiv.org/pdf/2404.00456 for more details on the algorithm. This pass only supports HfModelHandler.
Input: handler.hf.HfModelHandler
Output: handler.pytorch.PyTorchModelHandler
- input_model_dtype
Input model’s data type.
type: olive.passes.pytorch.quarot.QuaRot.ModelDtype
default_value: fp16
search_defaults: None
- calibration_data_config
Data config for Dataset to calibrate and calculate perplexity on.
type: olive.data.config.DataConfig | Dict
default_value: None
search_defaults: None
- calibration_nsamples
Number of samples of the calibration data to load.
type: int
default_value: 128
search_defaults: None
- calibration_batch_size
Batch size for loading the calibration data.
type: int
default_value: 16
search_defaults: None
- seed
Seed for sampling the calibration data.
type: int
default_value: 42
search_defaults: None
- rotate
Apply QuaRot/Hadamard rotation to the model.
type: bool
default_value: True
search_defaults: None
- rotation_seed
Seed for generating random matrix. Use 0 to replicate paper results.
type: int
default_value: 0
search_defaults: None
- w_rtn
Apply RTN quantization to the weights.
type: bool
default_value: False
search_defaults: None
- w_gptq
Apply GPTQ quantization to the weights. It requires flash_attention_2 which only supports Ampere GPUs or newer.
type: bool
default_value: False
search_defaults: None
- gptq_damping
Damping factor for GPTQ. (ignored for RTN quantization)
type: float
default_value: 0.01
search_defaults: None
- gptq_opt_scales
Optimize scales for GPTQ (ignored for RTN quantization)
type: bool
default_value: False
search_defaults: None
- w_bits
Number of bits for quantizing weights.
type: int
default_value: 16
search_defaults: None
- w_asym
Asymmetric weight quantization (else symmetric by default).
type: bool
default_value: False
search_defaults: None
- w_groupsize
Group size for groupwise weight quantization.
type: int
default_value: None
search_defaults: None
- a_bits
Number of bits for quantizing activations.
type: int
default_value: 16
search_defaults: None
- a_asym
Asymmetric activation quantization (else symmetric by default).
type: bool
default_value: False
search_defaults: None
- a_clip_ratio
Clip ratio for activation quantization: new_max = max * clip_ratio.
type: float
default_value: 1.0
search_defaults: None
- a_groupsize
Group size for groupwise activation quantization, default is None.
type: int
default_value: None
search_defaults: None
- k_bits
Number of bits to quantize the keys to.
type: int
default_value: 16
search_defaults: None
- k_clip_ratio
Clip ratio for keys quantization: new_max = max * clip_ratio.
type: float
default_value: 1.0
search_defaults: None
- k_groupsize
Group size for groupwise key quantization.
type: int
default_value: None
search_defaults: None
- v_bits
Number of bits to quantize the values to.
type: int
default_value: 16
search_defaults: None
- v_clip_ratio
Clip ratio for values quantization: new_max = max * clip_ratio.
type: float
default_value: 1.0
search_defaults: None
- v_groupsize
Group size for groupwise value quantization.
type: int
default_value: None
search_defaults: None
- s_bits
Number of bits to quantize the values to.
type: int
default_value: 16
search_defaults: None
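The clip ratio options all use the rule new_max = max * clip_ratio. The sketch below applies that rule in a generic symmetric quantizer to show how clipping trades range for resolution. It is a plain-Python illustration under that assumption, not QuaRot's implementation, and the function name and sample values are hypothetical.

```python
from typing import List, Tuple


def quantize_symmetric(
    values: List[float], bits: int, clip_ratio: float = 1.0
) -> Tuple[List[int], float]:
    """Symmetric quantization with the clipping rule new_max = max * clip_ratio."""
    if bits < 2:
        raise ValueError("bits must be at least 2")
    new_max = max(abs(v) for v in values) * clip_ratio
    qmax = 2 ** (bits - 1) - 1  # largest positive integer code
    if new_max == 0:
        return [0] * len(values), 1.0
    scale = new_max / qmax
    # Values beyond new_max saturate at +/- qmax.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


activations = [1.0, -0.6, 0.2]
codes_full, _ = quantize_symmetric(activations, bits=4)
codes_clipped, _ = quantize_symmetric(activations, bits=4, clip_ratio=0.5)
print(codes_full)
print(codes_clipped)
```

With clip_ratio=0.5 the two largest values saturate, while the small value is represented with a finer step.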
GptqQuantizer
GPTQ quantization using Hugging Face Optimum and export model with onnxruntime optimized kernel.
Input: handler.hf.HfModelHandler | handler.pytorch.PyTorchModelHandler
Output: handler.pytorch.PyTorchModelHandler
- user_script
Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.
type: pathlib.Path | str
default_value: None
search_defaults: None
- script_dir
Directory containing user script dependencies.
type: pathlib.Path | str
default_value: None
search_defaults: None
- bits
Number of quantization bits. Default value is 4.
type: int
default_value: 4
search_defaults: None
- layers_block_name
Block name to quantize. For models where this cannot be filled in automatically, refer to the following link for how to set these parameters: https://github.com/AutoGPTQ/AutoGPTQ/blob/896d8204bc89a7cfbda42bf3314e13cf4ce20b02/auto_gptq/modeling/llama.py#L19-L26
type: str
default_value: None
search_defaults: None
- outside_layer_modules
Names of other nn modules that are at the same level as the transformer layer block. Default value is None.
type: List[str]
default_value: None
search_defaults: None
- inside_layer_modules
Names of linear layers in transformer layer module. Default value is None.
type: List[List[str]]
default_value: None
search_defaults: None
- group_size
Block size for quantization. Default value is 128.
type: int
default_value: 128
search_defaults: None
- damp_percent
Damping factor for quantization. Default value is 0.01.
type: float
default_value: 0.01
search_defaults: None
- static_groups
Use static groups for quantization. Default value is False.
type: bool
default_value: False
search_defaults: None
- true_sequential
Use true sequential for quantization. Default value is False.
type: bool
default_value: False
search_defaults: None
- desc_act
Quantize columns in order of decreasing activation size (also known as act-order). Default value is False.
type: bool
default_value: False
search_defaults: None
- sym
Symmetric quantization. Default value is False.
type: bool
default_value: False
search_defaults: None
- data_config
Data config for quantization. Default value is None.
type: olive.data.config.DataConfig | Dict
default_value: None
search_defaults: None
AutoAWQQuantizer
AWQ quantization.
Input: handler.hf.HfModelHandler
Output: handler.hf.HfModelHandler | handler.pytorch.PyTorchModelHandler
- user_script
Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.
type: pathlib.Path | str
default_value: None
search_defaults: None
- script_dir
Directory containing user script dependencies.
type: pathlib.Path | str
default_value: None
search_defaults: None
- input_model_dtype
The input model data type.
type: olive.passes.pytorch.autoawq.AutoAWQQuantizer.ModelDtype
default_value: fp16
search_defaults: None
- zero_point
Whether to use zero point quantization to calculate the scales and zeros. If False, symmetric quantization is used.
type: bool
default_value: True
search_defaults: None
- q_group_size
The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
type: int
default_value: 128
search_defaults: None
- w_bit
The number of bits to quantize to.
type: int
default_value: 4
search_defaults: None
- version
The version of the quantization algorithm to use. gemm is better for large batch sizes (e.g. >= 8); otherwise, gemv is better (e.g. < 8). gemm models are compatible with Exllama kernels.
type: str
default_value: gemm
search_defaults: None
- duo_scaling
Whether to scale using both w/x(True) or just x(False).
type: bool
default_value: True
search_defaults: None
- modules_to_not_convert
The list of modules to not quantize, useful for quantizing models that explicitly require to have some modules left in their original precision (e.g. Whisper encoder, Llava encoder, Mixtral gate layers). Please refer to AutoAWQ documentation for quantizing HF models.
type: list
default_value: []
search_defaults: None
- export_compatible
If True, this argument avoids real quantization by only applying the scales, without quantizing down to FP16.
type: bool
default_value: False
search_defaults: None
- data_config
Data config for quantization. Default value is None.
type: olive.data.config.DataConfig | Dict
default_value: None
search_defaults: None
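The guidance for version can be encoded as a small helper. The threshold of 8 comes from the option description above; the helper name and the idea of choosing by an expected batch size in code are illustrative assumptions, as is the dictionary layout.

```python
def choose_awq_version(expected_batch_size: int) -> str:
    """Pick 'gemm' for batch sizes >= 8 and 'gemv' otherwise."""
    if expected_batch_size < 1:
        raise ValueError("expected_batch_size must be at least 1")
    return "gemm" if expected_batch_size >= 8 else "gemv"


# Option names and defaults are taken from the AutoAWQQuantizer list above.
awq_options = {
    "w_bit": 4,
    "q_group_size": 128,
    "zero_point": True,
    "version": choose_awq_version(expected_batch_size=1),
}
print(awq_options["version"])
```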
TorchTRTConversion
Convert torch.nn.Linear modules in the transformer layers of a HuggingFace PyTorch model to TensorRT modules. The conversion would include fp16 precision and sparse weights, if applicable. The entire model is saved using torch.save and can be loaded using torch.load. Loading the model requires torch-tensorrt and Olive to be installed. This pass only supports HfModelHandler. The transformers model type must be one of [bloom, gpt2, gpt_neox, llama, opt].
Input: handler.hf.HfModelHandler
Output: handler.pytorch.PyTorchModelHandler
- min_layer
Convert all layers with id >= min_layer.
type: int
default_value: None
search_defaults: None
- max_layer
Convert all layers with id < max_layer.
type: int
default_value: None
search_defaults: None
- layer_name_filter
Only convert layers whose name contains the given string(s).
type: str | List[str]
default_value: None
search_defaults: None
- float16
Convert entire model to fp16. If False, only the sparse modules are converted to fp16.
type: bool
default_value: False
search_defaults: None
- data_config
Data config to use for compiling module to TensorRT. The batch size of the compiled module is set to the batch size of the first batch of the dataloader.
type: olive.data.config.DataConfig | Dict
required: True
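One plausible reading of min_layer, max_layer and layer_name_filter is that a layer is converted only if it passes every filter that is set. The sketch below encodes that reading. The function name and layer names are hypothetical, and treating a list filter as "matches any of the strings" is an assumption rather than documented behavior.

```python
from typing import List, Optional, Union


def is_selected(
    layer_id: int,
    layer_name: str,
    min_layer: Optional[int] = None,
    max_layer: Optional[int] = None,
    layer_name_filter: Optional[Union[str, List[str]]] = None,
) -> bool:
    """Return True if the layer passes all filters that are not None."""
    if min_layer is not None and layer_id < min_layer:
        return False  # only ids >= min_layer are kept
    if max_layer is not None and layer_id >= max_layer:
        return False  # only ids < max_layer are kept
    if layer_name_filter is not None:
        filters = (
            [layer_name_filter]
            if isinstance(layer_name_filter, str)
            else layer_name_filter
        )
        return any(f in layer_name for f in filters)
    return True


print(is_selected(3, "mlp.fc1", min_layer=2, max_layer=4))
print(is_selected(4, "mlp.fc1", min_layer=2, max_layer=4))
```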
OptimumConversion
Convert a Hugging Face PyTorch model to ONNX model using the Optimum export function.
Input: handler.hf.HfModelHandler
Output: handler.onnx.ONNXModelHandler | handler.composite.CompositeModelHandler
- user_script
Path to user script. The values for other parameters which were assigned function or object names will be imported from this script.
type: pathlib.Path | str
default_value: None
search_defaults: None
- script_dir
Directory containing user script dependencies.
type: pathlib.Path | str
default_value: None
search_defaults: None
- target_opset
The version of the default (ai.onnx) opset to target.
type: int
default_value: 14
search_defaults: None
- components
List of component models to export. E.g. [‘decoder_model’, ‘decoder_with_past_model’]. None means export all components.
type: List[str]
default_value: None
search_defaults: None
- fp16
Whether to use fp16 precision to load torch model and then convert it to onnx.
type: bool
default_value: False
search_defaults: None
- device
The device to use to do the export. Defaults to ‘cpu’.
type: str
default_value: cpu
search_defaults: None
- extra_args
Extra arguments to pass to the optimum.exporters.onnx.main_export function.
type: dict
default_value: None
search_defaults: None
OptimumMerging
Merges a decoder_model with its decoder_with_past_model via the Optimum library.
Input: handler.composite.CompositeModelHandler
Output: handler.onnx.ONNXModelHandler | handler.composite.CompositeModelHandler
- strict
When set, the decoder and decoder_with_past are expected to have strictly the same number of outputs. When False, the decoder is allowed to have more outputs than decoder_with_past, in which case constant outputs are added to match the number of outputs.
type: bool
default_value: True
search_defaults: None
- save_as_external_data
Serializes tensor data to separate files instead of directly in the ONNX file. Large models (>2GB) may be forced to save external data regardless of the value of this parameter.
type: bool
default_value: False
search_defaults: None
- all_tensors_to_one_file
Effective only if save_as_external_data is True. If true, save all tensors to one external file specified by ‘external_data_name’. If false, save each tensor to a file named with the tensor name.
type: bool
default_value: True
search_defaults: None
- external_data_name
Effective only if all_tensors_to_one_file is True and save_as_external_data is True. If not specified, the external data file will be named with <model_path_name>.data
type: str
default_value: None
search_defaults: None
- size_threshold
Effective only if save_as_external_data is True. Threshold for size of data. Only when tensor’s data is >= the size_threshold it will be converted to external data. To convert every tensor with raw data to external data set size_threshold=0.
type: int
default_value: 1024
search_defaults: None
- convert_attribute
Effective only if save_as_external_data is True. If true, convert all tensors to external data. If false, convert only non-attribute tensors to external data.
type: bool
default_value: False
search_defaults: None
ModelBuilder
Converts a Hugging Face generative PyTorch model to an ONNX model using the Generative AI builder. See https://github.com/microsoft/onnxruntime-genai for details.
Input: handler.hf.HfModelHandler | handler.onnx.ONNXModelHandler
Output: handler.onnx.ONNXModelHandler
- precision
Precision of model.
type: olive.passes.onnx.model_builder.ModelBuilder.Precision
required: True
- metadata_only
Whether to export the model or generate required metadata only.
type: bool
default_value: False
search_defaults: None
- search
Search options to use for generate loop.
type: Dict[str, Any]
default_value: None
search_defaults: None
- int4_block_size
Specify the block_size for int4 quantization. Acceptable values: 16/32/64/128/256.
type: int
default_value: None
search_defaults: None
- int4_accuracy_level
Specify the minimum accuracy level for activation of MatMul in int4 quantization.
type: olive.passes.onnx.model_builder.ModelBuilder.AccuracyLevel
default_value: None
search_defaults: None
- exclude_embeds
Remove embedding layer from your ONNX model.
type: bool
default_value: False
search_defaults: None
- exclude_lm_head
Remove language modeling head from your ONNX model.
type: bool
default_value: False
search_defaults: None
- enable_cuda_graph
Allow the model to use CUDA graph capture with the CUDA execution provider. For the CUDA graph to be used correctly, all nodes must be placed on the CUDA EP.
type: bool
default_value: None
search_defaults: None
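Because int4_block_size accepts only a fixed set of values, a quick check before building a config can catch typos early. The sketch below validates the value against the list given above. The helper name, the dictionary layout and the precision string "int4" are assumptions made for illustration.

```python
from typing import Optional

# Acceptable values as listed in the int4_block_size description.
ALLOWED_INT4_BLOCK_SIZES = (16, 32, 64, 128, 256)


def validate_int4_block_size(block_size: Optional[int]) -> Optional[int]:
    """Return block_size unchanged if it is None or an accepted value."""
    if block_size is not None and block_size not in ALLOWED_INT4_BLOCK_SIZES:
        raise ValueError(
            f"int4_block_size must be one of {ALLOWED_INT4_BLOCK_SIZES}, got {block_size}"
        )
    return block_size


# Option names come from the ModelBuilder list above; values are examples.
model_builder_options = {
    "precision": "int4",  # assumed spelling of the Precision value
    "int4_block_size": validate_int4_block_size(32),
    "exclude_embeds": False,
    "exclude_lm_head": False,
}
print(model_builder_options["int4_block_size"])
```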