Olive Options#
Olive enables users to easily compose and customize their own model optimization pipelines from a set of built-in passes. Olive receives the input model, target hardware, performance requirements, and the list of optimization techniques to apply from the user in the form of a JSON dictionary. This document describes the options the user can set in this dictionary.
Note:
The JSON schema for the config file can be found here. It can be used in IDEs like VSCode to provide IntelliSense by adding the following line at the top of the config file:
"$schema": "https://microsoft.github.io/Olive/schema.json"
The config file can also be provided as a YAML file with the extension .yaml or .yml.
The options are organized into the following sections:
- Workflow ID: workflow_id
- Azure ML client: azureml_client
- Input Model Information: input_model
- Systems Information: systems
- Evaluators Information: evaluators
- Passes Information: passes
- Engine Information: engine
- Workflow Host: workflow_host
Workflow ID#
You can name the workflow run by specifying the workflow_id field in your config file. Olive will save the cache under the <cache_dir>/<workflow_id> folder and automatically save the currently running config in that cache folder.
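For example, a minimal sketch that names the run (the workflow name here is illustrative):

"workflow_id": "llama2_cpu_optimization"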
Workflow Host#
The workflow host is where the Olive workflow will be run. The default value is None. If the workflow host is None, Olive will run the workflow locally. Only the AzureML system is currently supported as a remote workflow host.
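For example, the following sketch runs the workflow on an AzureML system, assuming workflow_host accepts the name of a system defined in the systems section (here aml_system, as defined in the Systems example later in this document):

"workflow_host": "aml_system"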
Azure ML Client#
If you will use Azure ML resources and assets, you need to provide your Azure ML client configuration. For example:
- You use an AzureML system as a target or host.
- You use an Azure ML model as the input model.
AzureML authentication credentials are needed. Refer to this for more details.
azureml_client: [Dict]
- subscription_id: [str] Azure account subscription id.
- resource_group: [str] Azure account resource group name.
- workspace_name: [str] Azure ML workspace name.
- aml_config_path: [str] The path to the Azure config file, if the Azure ML client config is in a separate file.
- read_timeout: [int] Read timeout in seconds for HTTP requests; users can increase it if they find the default value too small. The default value from the azureml sdk is 3000, which is too large and can cause evaluations and pass runs to sometimes hang for a long time between retries of job stream and download steps.
- max_operation_retries: [int] The maximum number of retries for Azure ML operations such as resource creation and download. The default value is 3. Users can increase it if there are network issues and the operations fail.
- operation_retry_interval: [int] The initial interval in seconds between retries for Azure ML operations such as resource creation and download. The interval doubles after each retry. The default value is 5. Users can increase it if there are network issues and the operations fail.
- default_auth_params: [Dict[str, Any]] Default auth parameters for the AzureML client. Please refer to azure DefaultAzureCredential for more details. For example, if you want to exclude the managed identity credential, you can set the following:
  "azureml_client": {
      // ...
      "default_auth_params": {
          "exclude_managed_identity_credential": true
      }
  }
- keyvault_name: [str] The keyvault name to retrieve secrets from.
Example#
azureml_client with aml_config_path:#
aml_config.json:#
{
"subscription_id": "<subscription_id>",
"resource_group": "<resource_group>",
"workspace_name": "<workspace_name>",
}
azureml_client:#
"azureml_client": {
"aml_config_path": "aml_config.json",
"read_timeout" : 4000,
"max_operation_retries" : 4,
"operation_retry_interval" : 5
},
azureml_client with azureml config fields:#
"azureml_client": {
"subscription_id": "<subscription_id>",
"resource_group": "<resource_group>",
"workspace_name": "<workspace_name>",
"read_timeout" : 4000,
"max_operation_retries" : 4,
"operation_retry_interval" : 5
},
Input Model Information#
input_model: [Dict]
The user should specify the input model type and configuration using the input_model dictionary. It contains the following items:
- type: [str] The type of the input model, which is case insensitive. The supported types include HfModelHandler, PyTorchModelHandler, ONNXModelHandler, OpenVINOModelHandler, SNPEModelHandler, etc. You can find more details in Olive Models.
- config: [Dict] The configuration of the model. Its fields can be provided directly to the parent dictionary. For example, for HfModelHandler, the input model config dictionary specifies the following items:
  - model_path: [str | Dict] The model path can be a string or a dictionary. If it is a string, it is a Hugging Face hub model id or a local directory. If it is a dictionary, it contains information about the model path. Please refer to Configuring Model Path for more information on the model path dictionary.
  - task: [str] The task of the model. The default task is text-generation-with-past, which is equivalent to a causal language model with key-value cache enabled.
  - io_config: [Dict] The inputs and outputs information of the model. If not provided, Olive will try to infer the input and output information from the model (see the sketch after this list). The dictionary contains the following items:
    - input_names: [List[str]] The input names of the model.
    - input_types: [List[str]] The input types of the model.
    - input_shapes: [List[List[int]]] The input shapes of the model.
    - output_names: [List[str]] The output names of the model.
    - dynamic_axes: [Dict[str, Dict[str, str]]] The dynamic axes of the model. The key is the name of the input or output and the value is a dictionary that contains the dynamic axes of that input or output. The key of the value dictionary is the index of the dynamic axis and the value is the name of the dynamic axis. For example, {"input": {"0": "batch_size"}, "output": {"0": "batch_size"}} means the first dimension of the input and output is dynamic and the name of the dynamic axis is batch_size.
    - string_to_int_dim_params: [List[str]] The list of input names in dynamic axes that need to be converted to int values.
    - kv_cache: [Union[bool, Dict[str, str]]] The key value cache configuration. If not provided, it is assumed to be True if the task ends with -with-past.
      - If it is False, Olive will not use the key value cache.
      - If it is True, Olive will infer the cache configuration from the input_names/input_shapes and the input model based on the default kv_cache.
      - If it is a dictionary, it should contain the key value cache configuration. Here is a default configuration example:
        - ort_past_key_name: "past_key_values.<id>.key" Template for the past key name. The <id> will be replaced by the id of the past key.
        - ort_past_value_name: "past_key_values.<id>.value" Template for the past value name. The <id> will be replaced by the id of the past value.
        - ort_present_key_name: "present.<id>.key" Template for the present key name. The <id> will be replaced by the id of the present key.
        - ort_present_value_name: "present.<id>.value" Template for the present value name. The <id> will be replaced by the id of the present value.
        - world_size: 1 Only used for distributed models.
        - num_hidden_layers: null If null, Olive will infer the number of hidden layers from the model.
        - num_attention_heads: null If null, Olive will infer the number of attention heads from the model.
        - hidden_size: null If null, Olive will infer the hidden size from the model.
        - past_sequence_length: null If null, Olive will infer the past sequence length from the model.
        - batch_size: 0 The batch size of the model. If it is 0, Olive will use the batch size from the input_shapes of input_ids.
        - dtype: "float32" The data type of the model.
        - shared_kv: false Whether to share the key value cache between the past and present key value caches. If true, the dynamic axes of the past and present key value caches will be the same.
        - sequence_length_idx: 2 For most cases, the input shape for kv_cache is (batch_size, num_attention_heads/world_size, sequence_length, hidden_size/num_attention_heads). This is the index of the sequence length in the input shape.
        - past_kv_dynamic_axis: null The dynamic axis of the past key value cache. If null, Olive will infer the dynamic axis.
        - present_kv_dynamic_axis: null The dynamic axis of the present key value cache. If null, Olive will infer the dynamic axis.
  - load_kwargs: [dict] Arguments to pass to the from_pretrained method of the model class. Refer to this documentation.
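The following is a minimal sketch of an HfModel input with io_config and load_kwargs; the input names, shapes, and keyword arguments shown here are illustrative, not required values:

"input_model": {
    "type": "HfModel",
    "model_path": "meta-llama/Llama-2-7b-hf",
    "task": "text-generation-with-past",
    "io_config": {
        // illustrative inputs/outputs; Olive can also infer these from the model
        "input_names": ["input_ids", "attention_mask"],
        "input_types": ["int64", "int64"],
        "input_shapes": [[1, 128], [1, 128]],
        "output_names": ["logits"],
        "dynamic_axes": {
            "input_ids": {"0": "batch_size", "1": "sequence_length"},
            "attention_mask": {"0": "batch_size", "1": "sequence_length"},
            "logits": {"0": "batch_size", "1": "sequence_length"}
        },
        "kv_cache": true
    },
    "load_kwargs": {"torch_dtype": "float16"}
}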
Please find the detailed config options for each model type in Olive Models. The supported model types include:
- Hf model
- Distributed Hf Model
- PyTorch model
- ONNX model
- OpenVINO IR model
- SNPE DLC model
- Composite Model
Example#
"input_model": {
"type": "HfModel",
"model_path": "meta-llama/Llama-2-7b-hf"
}
Systems Information#
systems: [Dict]
This is a dictionary that contains the information of the systems that are referenced by the engine, passes, and evaluators. The key of the dictionary is the name of the system. The value of the dictionary is another dictionary that contains the information of the system, which includes the following items:
- type: [str] The type of the system. The supported types are LocalSystem, AzureML and Docker. There are some built-in system aliases that can also be used as the type, for example AzureNDV2System. Please refer to the System alias list for the complete list of system aliases.
- config: [Dict] The system config dictionary that contains the system-specific information. The fields can be provided directly under the parent dictionary.
  - accelerators: [List[str]] The accelerators that will be used for this workflow.
  - hf_token: [bool] Whether to use a Hugging Face token to access Hugging Face resources. If set to True, for the local, Docker, and PythonEnvironment systems, Olive will retrieve the token from the HF_TOKEN environment variable or from the token file located at ~/.huggingface/token. For an AzureML system, Olive will retrieve the token from the user keyvault secret. If set to False, no token will be used during this workflow run. The default value is False.
Please refer to How To Configure System for more information on the system config dictionary.
Example#
"systems": {
"local_system": {"type": "LocalSystem"},
"aml_system": {
"type": "AzureML",
"aml_compute": "cpu-cluster",
"aml_docker_config": {
"base_image": "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
"conda_file_path": "conda.yaml"
}
}
}
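A sketch that adds the accelerators and hf_token fields described above to a local system; it follows the List[str] type described above, and the accelerator value is illustrative:

"systems": {
    "local_system": {
        "type": "LocalSystem",
        "accelerators": ["gpu"], // illustrative accelerator value
        "hf_token": true
    }
}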
Evaluators Information#
evaluators: [Dict]
This is a dictionary that contains the information of the evaluators that are referenced by the engine and passes. The key of the dictionary is the name of the evaluator. The value of the dictionary is another dictionary that contains the information of the evaluator, which includes the following items:
- metrics: [List] This is a list of metrics that the evaluator will use to evaluate the model. Each metric is a dictionary that contains the following items:
  - name: [str] The name of the metric. This must be a unique name among all metrics in the evaluator.
  - type: [str] The type of the metric. The supported types are accuracy, latency, throughput and custom.
  - backend: [str] The metric's backend. Olive implements the torch_metrics and huggingface_metrics backends. The default value is torch_metrics.
    - The torch_metrics backend uses the torchmetrics (>=0.1.0) library to compute metrics. It supports the accuracy_score, f1_score, precision, recall and auroc metrics, which are used for the binary task (equal to metric_config: {"task": "binary"}) by default. You need to alter the task if needed. Please refer to torchmetrics for more details.
    - The huggingface_metrics backend uses the Hugging Face evaluate library to compute metrics. The supported metrics can be found at huggingface metrics.
  - sub_types: [List[Dict]] The subtypes of the metric. Cannot be null or empty. Each subtype is a dictionary that contains the following items:
    - name: [str] The name of the subtype. For the custom type, if the result of the evaluation is a dictionary, the name of the subtype should be a key of the dictionary. Otherwise, the name of the subtype can be any unique string the user gives.
    - metric_config: The parameter config used to measure detailed metrics. Please note that when the backend is huggingface_metrics, metric_config is a dictionary of:
      - load_params: The parameters used to load the metric, run as evaluator = evaluate.load("word_length", **load_params).
      - compute_params: The parameters used to compute the metric, run as evaluator.compute(predictions=preds, references=target, **compute_params).
      - result_key: The key used to extract the metric result in the given format. For example, if the metric result is {"accuracy": {"value": 0.9}}, then the result_key should be "accuracy.value".
    - priority: [int] The priority of the subtype; the higher-priority subtype takes precedence during evaluation. Note that it should be unique among all subtypes in the metric.
    - higher_is_better: [Boolean] True if the metric is better when it is higher. It is true for the accuracy type and false for the latency type.
    - goal: [Dict] The goal of the metric. It is a dictionary that contains the following items:
      - type: [str] The type of the goal. The supported types are threshold, min-improvement, percent-min-improvement, max-degradation, and percent-max-degradation.
      - value: [float] The value of the goal. It is the threshold value for the threshold type, the minimum improvement value for the min-improvement type, the minimum improvement percentage for the percent-min-improvement type, the maximum degradation value for the max-degradation type, and the maximum degradation percentage for the percent-max-degradation type.
  - user_config: [Dict] The user config dictionary that contains the user-specific information for the metric. The dictionary contains the following items:
    - user_script: [str] The name of the script provided by the user to assist with metric evaluation.
    - script_dir: [str] The directory that contains dependencies for the user script.
    - inference_settings: [Dict] Inference settings for the different runtimes.
    - evaluate_func: [str] The name of the function provided by the user to evaluate the model. The function should take the model, data_dir, batch_size, device, and execution_providers as input and return the evaluation result. Only valid for the custom type.
    - evaluate_func_kwargs: [Dict[str, Any]] Keyword arguments for evaluate_func provided by the user. The function must be able to take the keyword arguments either through the function signature as keyword/positional parameters after the required positional parameters or through **kwargs.
    - metric_func: [str] The name of the function provided by the user to compute the metric from the model output. The function should take the post-processed output and target as input and return the metric result. Only valid for the custom type when evaluate_func is not provided.
    - metric_func_kwargs: [Dict[str, Any]] Keyword arguments for metric_func provided by the user. The function must be able to take the keyword arguments either through the function signature as keyword/positional parameters after the required positional parameters or through **kwargs.
Note that for the data_dir config above, which is a resource path, Olive supports a local file, local folder, or AML Datastore. Taking AML Datastore as an example, Olive can parse the resource type automatically from a config dict or a url. Please refer to our Resnet example for more details.

"data_dir": {
    "type": "azureml_datastore",
    "azureml_client": "azureml_client",
    "datastore_name": "test",
    "relative_path": "cifar-10-batches-py"
}
// provide azureml datastore url
"data_dir": "azureml://subscriptions/test/resourcegroups/test/workspaces/test/datastores/test/cifar-10-batches-py"
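For the custom type, a metric might be configured as in the following sketch; the script name, function name, and keyword arguments are illustrative placeholders for user-provided code:

"metrics": [
    {
        "name": "quality",
        "type": "custom",
        "sub_types": [
            {"name": "accuracy", "priority": 1, "higher_is_better": true}
        ],
        "user_config": {
            "user_script": "user_script.py",
            "evaluate_func": "eval_accuracy", // illustrative user-defined function
            "evaluate_func_kwargs": {"num_samples": 100} // illustrative kwargs
        }
    }
]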
Example#
"data_configs": [
{
"name": "accuracy_data_config",
"user_script": "user_script.py",
"post_process_data_config": { "type": "post_process" },
"dataloader_config": { "type": "create_dataloader", "params": { "batch_size": 1 } }
}
],
"evaluators": {
"common_evaluator": {
"metrics":[
{
"name": "accuracy",
"type": "accuracy",
"data_config": "accuracy_data_config",
"sub_types": [
{"name": "accuracy_score", "priority": 1, "goal": {"type": "max-degradation", "value": 0.01}},
{"name": "f1_score"},
{"name": "auroc"}
]
},
{
"name": "accuracy",
"type": "accuracy",
"backend": "huggingface_metrics",
"data_config": "accuracy_data_config",
"sub_types": [
{"name": "accuracy", "priority": -1},
{"name": "f1"}
]
},
{
"name": "latency",
"type": "latency",
"sub_types": [
{"name": "avg", "priority": 2, "goal": {"type": "percent-min-improvement", "value": 20}},
{"name": "max"},
{"name": "min"}
],
"user_config":{
"inference_settings" : {
"onnx": {
"session_options": {
"enable_profiling": true
}
}
}
}
}
]
}
}
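When the backend is huggingface_metrics, a metric_config can be attached to a subtype as in the following sketch; the load_params, compute_params, and result_key values are illustrative and depend on the chosen evaluate metric:

{
    "name": "accuracy",
    "type": "accuracy",
    "backend": "huggingface_metrics",
    "data_config": "accuracy_data_config",
    "sub_types": [
        {
            "name": "accuracy",
            "priority": 1,
            "metric_config": {
                "load_params": {},
                "compute_params": {"normalize": true}, // illustrative compute parameter
                "result_key": "accuracy"
            }
        }
    ]
}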
Passes Information#
passes: [Dict]
This is a dictionary that contains the information of the passes that are executed by the engine. The passes are executed in the order of their definition in this dictionary if pass_flows is not specified. The key of the dictionary is the name of the pass. The value of the dictionary is another dictionary that contains the information of the pass, which includes the following items:
- type: [str] The type of the pass.
- config: [Dict] The configuration of the pass. Its fields can be provided directly to the parent dictionary.
- host: [str | Dict] The host of the pass. It can be a string or a dictionary. If it is a string, it is the name of a system in systems. If it is a dictionary, it contains the system information. If not specified, the host of the engine will be used.
- evaluator: [str | Dict] The evaluator of the pass. It can be a string or a dictionary. If it is a string, it is the name of an evaluator in evaluators. If it is a dictionary, it contains the evaluator information. If not specified, the evaluator of the engine will be used.
Please refer to Configuring Pass for more details on type and config.
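For example, a pass can override the engine's host and evaluator by referencing names defined in the systems and evaluators sections, as in this sketch (reusing names from the examples elsewhere in this document):

"passes": {
    "onnx_conversion": {
        "type": "OnnxConversion",
        "target_opset": 13,
        "host": "local_system",
        "evaluator": "common_evaluator"
    }
}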
Please also find the detailed options for each pass in the corresponding pass documentation. The available passes include:
- Convert a PyTorch model to ONNX model.
- Convert an ONNX model to a target op version.
- Convert a generative PyTorch model to ONNX model using the ONNX Runtime Generative AI module.
- Optimize ONNX model by fusing nodes.
- Optimize transformer based models in scenarios where ONNX Runtime does not apply the optimization at load time. It is based on onnxruntime.transformers.optimizer.
- Optimize ONNX Runtime inference settings.
- Converts a model to float16. It uses the float16 converter from onnxruntime to convert the model to float16.
- Converts model inputs/outputs from a source dtype to a target dtype based on a name pattern.
- Convert model to mixed precision.
- Preprocess ONNX model for quantization targeting the QNN Execution Provider.
- Pre-processes the model for mixed precision quantization with QNN configs.
- ONNX Dynamic Quantization Pass.
- ONNX Static Quantization Pass.
- Quantize ONNX model with onnxruntime, where we can search for the best parameters for static/dynamic quantization at the same time.
- Quantize ONNX models' MatMul operations to 4-bit weights.
- ONNX graph surgeries collection.
- Convert ONNX MatMulNBits nodes to standard ONNX quantize-dequantize (QDQ) format.
- Convert dynamic shapes to fixed shapes for an ONNX model.
- Intel® Neural Compressor Dynamic Quantization Pass.
- Intel® Neural Compressor Static Quantization Pass.
- Quantize ONNX model with Intel® Neural Compressor, where we can search for the best parameters for static/dynamic quantization at the same time.
- AMD-Xilinx Vitis-AI Quantization Pass.
- Add pre/post processing nodes to the input model.
- Insert Beam Search op. Only used for Whisper models. Uses the WhisperBeamSearch contrib op if the ORT version >= 1.17.1, else uses the BeamSearch contrib op.
- Extract adapters from an ONNX model.
- Capture the split information of the model layers. Only splits the transformer layers.
- Split an ONNX model into multiple smaller sub-models based on predefined assignments.
- Run LoRA fine-tuning on a Hugging Face PyTorch model.
- Run LoHa fine-tuning on a Hugging Face PyTorch model.
- Run LoKr fine-tuning on a Hugging Face PyTorch model.
- Run QLoRA fine-tuning on a Hugging Face PyTorch model.
- Run DoRA fine-tuning on a Hugging Face PyTorch model.
- Run LoftQ fine-tuning on a Hugging Face PyTorch model.
- Run quantization aware training on a PyTorch model.
- Convert a PyTorch, ONNX or TensorFlow model to an OpenVINO model.
- Post-training quantization for an OpenVINO model.
- Convert an ONNX or TensorFlow model to SNPE DLC. Uses the snpe-tensorflow-to-dlc or snpe-onnx-to-dlc tools from the SNPE SDK.
- Quantize an SNPE model. Uses the snpe-dlc-quantize tool from the SNPE SDK.
- Convert a SNPE DLC to ONNX to use with the SNPE Execution Provider. Creates an ONNX graph with the SNPE DLC as a node.
- Convert an ONNX, TensorFlow, or PyTorch model to a QNN C++ model. Quantizes the model if --input_list is provided as extra_args. Uses the qnn-[framework]-converter tool from the QNN SDK.
- Compile QNN C++ model source code into a QNN model library for a specific target. Uses the qnn-model-lib-generator tool from the QNN SDK.
- Merge adapter weights into the base model and save transformer context files.
- Run SparseGPT on a Hugging Face PyTorch model.
- Run SliceGPT on a Hugging Face PyTorch model.
- Rotate model using QuaRot.
- GPTQ quantization pass on a PyTorch model.
- AWQ quantization pass on a PyTorch model.
- Convert torch.nn.Linear modules in the transformer layers of a Hugging Face PyTorch model to TensorRT modules.
- Convert Hugging Face models to ONNX via the Optimum library.
- Merge 2 models together with an …
Example#
"passes": {
"onnx_conversion": {
"type": "OnnxConversion",
"target_opset": 13
},
"onnx_quantization": {
"type": "OnnxQuantization",
"data_config": "calib_data_coonfig",
"weight_type": "QUInt8"
}
}
Engine Information#
engine: [Dict]
This is a dictionary that contains the information of the engine. Its fields can be provided directly to the parent dictionary. The information of the engine contains the following items:
- search_strategy: [Dict | Boolean | None], None by default. The search strategy of the engine. It contains the following items:
  - execution_order: [str] The execution order of the pass optimizations. The options are pass-by-pass and joint.
  - sampler: [str] The search sampler to use while traversing the search space. The available samplers are random, sequential and tpe.
  - sampler_config: [Dict] The configuration of the sampler. The options depend on the chosen sampler. Its fields can be provided directly to the parent dictionary.
  - stop_when_goals_met: [Boolean] This decides whether to stop the search when the metric goals, if any, are met. This is false by default.
  - include_pass_params: [Boolean] Include individual pass parameters to build the search space. Defaults to true.
  - max_iter: [int] The maximum number of iterations of the search. Only valid for the joint execution order. By default, there is no maximum number of iterations.
  - max_time: [int] The maximum time of the search in seconds. Only valid for the joint execution order. By default, there is no maximum time.

  If search_strategy is null or false, the engine will run the passes in the order they were registered without searching. Thus, the passes must have empty search spaces. The output of the final pass will be evaluated if there is a valid evaluator. The output of the engine will be the output model of the final pass and its evaluation result.

  If search_strategy is true, the search strategy will be the default search strategy, which is the sequential search sampler with the joint execution order.
- evaluate_input_model: [Boolean], true by default. If true, the engine will evaluate the input model using the engine's evaluator and return the results. If the engine has no evaluator, it will skip the evaluation.
- host: [str | Dict | None], None by default. The host of the engine. It can be a string or a dictionary. If it is a string, it is the name of a system in systems. If it is a dictionary, it contains the system information. If not specified, it is the local system.
- target: [str | Dict | None], None by default. The target to run model evaluations on. It can be a string or a dictionary. If it is a string, it is the name of a system in systems. If it is a dictionary, it contains the system information. If not specified, it is the local system.
- evaluator: [str | Dict | None], None by default. The evaluator of the engine. It can be a string or a dictionary. If it is a string, it is the name of an evaluator in evaluators. If it is a dictionary, it contains the evaluator information. This evaluator will be used to evaluate the input model if needed. It is also used to evaluate the output models of passes that don't have their own evaluators. If it is None, the evaluation is skipped for the input model and any output models.
- cache_dir: [str], .olive-cache by default. The directory to store the cache of the engine. If not specified, the cache will be stored in the .olive-cache directory under the current working directory.
- clean_cache: [Boolean], false by default. This decides whether to clean the cache of the engine before running the engine.
- clean_evaluation_cache: [Boolean], false by default. This decides whether to clean the evaluation cache of the engine before running the engine.
- plot_pareto_frontier: [Boolean], false by default. This decides whether to plot the Pareto frontier of the search results.
- output_dir: [str], None by default. The directory to store the output of the engine. If not specified, the output will be stored in the current working directory. For a run with no search, the output is the output model of the final pass and its evaluation result. For a run with search, the output is a json file with the search results.
- output_name: [str], None by default. The name of the output. This string will be used as the prefix of the output file name. If not specified, there is no prefix.
- packaging_config: [PackagingConfig], None by default. Olive artifacts packaging configurations. If not specified, Olive will not package artifacts.
- log_severity_level: [int], 1 by default. The log severity level of Olive. The options are 0 for VERBOSE, 1 for INFO, 2 for WARNING, 3 for ERROR, and 4 for FATAL.
- ort_log_severity_level: [int], 3 by default. The log severity level of ONNX Runtime C++ logs. The options are 0 for VERBOSE, 1 for INFO, 2 for WARNING, 3 for ERROR, and 4 for FATAL.
- ort_py_log_severity_level: [int], 3 by default. The log severity level of ONNX Runtime Python logs. The options are 0 for VERBOSE, 1 for INFO, 2 for WARNING, 3 for ERROR, and 4 for FATAL.
- log_to_file: [Boolean], false by default. This decides whether to log to file. If true, the log will be stored in an olive-.log file under the current working directory.
Please find the detailed config options for each search sampler in the following table. Note that if max_samples is set to zero, each of the below samplers will be exhaustive.
Sampler | Description
---|---
random | Samples random points from the search space.
sequential | Iterates over the entire search space sequentially.
tpe | Samples using the TPE (Tree-structured Parzen Estimator) algorithm.
Example#
"engine": {
"search_strategy": {
"execution_order": "joint",
"sampler": "tpe",
"max_samples": 5,
"seed": 0
},
"evaluator": "common_evaluator",
"host": "local_system",
"target": "local_system",
"clean_cache": true,
"cache_dir": "cache"
}
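As noted above, setting search_strategy to false disables search: the passes run in the order they were registered and only the final output model is evaluated. A minimal sketch of that mode, reusing names from the earlier examples (the directory names are illustrative):

"engine": {
    "search_strategy": false,
    "evaluator": "common_evaluator",
    "host": "local_system",
    "target": "local_system",
    "cache_dir": "cache", // illustrative directory name
    "output_dir": "models" // illustrative directory name
}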