Olive provides command line tools that can be invoked using the olive command.

Run

Run an Olive workflow defined in the input JSON configuration file.

usage: olive run [-h] --run-config RUN_CONFIG [--setup] [--packages]
                 [--tempdir TEMPDIR] [--package-config PACKAGE_CONFIG]

Named Arguments

--run-config, --config

Path to json config file

--setup

Set up the environment needed to run the workflow

Default: False

--packages

List packages required to run the workflow

Default: False

--tempdir

Root directory for tempfile directories and files

--package-config

For advanced users. Path to an optional package (JSON) config file that specifies the location of each pass module implementation and its corresponding dependencies. The configuration may also include user-owned/proprietary/private pass implementations.
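
For example, assuming the workflow is defined in a local file named workflow.json (the file name is illustrative):

    olive run --run-config workflow.json

The --setup and --packages flags can be combined with the same config to prepare or inspect the workflow's dependencies before running it.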

Finetune

Fine-tune a model on a dataset using HuggingFace PEFT. HuggingFace training arguments can be provided along with the defined options.

usage: olive finetune [-h] -m MODEL_NAME_OR_PATH [-t TASK]
                      [--trust_remote_code]
                      [--is_generative_model IS_GENERATIVE_MODEL]
                      [-o OUTPUT_PATH] [--method {lora,qlora}]
                      [--lora_r LORA_R] [--lora_alpha LORA_ALPHA]
                      [--target_modules TARGET_MODULES]
                      [--torch_dtype {bfloat16,float16,float32}] -d DATA_NAME
                      [--train_subset TRAIN_SUBSET]
                      [--train_split TRAIN_SPLIT] [--eval_subset EVAL_SUBSET]
                      [--eval_split EVAL_SPLIT] [--data_files DATA_FILES]
                      [--text_field TEXT_FIELD | --text_template TEXT_TEMPLATE]
                      [--max_seq_len MAX_SEQ_LEN]
                      [--add_special_tokens ADD_SPECIAL_TOKENS]
                      [--max_samples MAX_SAMPLES] [--batch_size BATCH_SIZE]
                      [--resource_group RESOURCE_GROUP]
                      [--workspace_name WORKSPACE_NAME]
                      [--keyvault_name KEYVAULT_NAME]
                      [--aml_compute AML_COMPUTE]
                      [--account_name ACCOUNT_NAME]
                      [--container_name CONTAINER_NAME]
                      [--log_level LOG_LEVEL]

Named Arguments

-m, --model_name_or_path

Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.

-t, --task

Task for which the huggingface model is used.

--trust_remote_code

Trust remote code when loading a huggingface model.

Default: False

--is_generative_model

Is this a generative model?

Default: True

-o, --output_path

Path to save the command output.

Default: finetuned-adapter

--torch_dtype

Possible choices: bfloat16, float16, float32

The torch dtype to use for training.

Default: “bfloat16”

-d, --data_name

The dataset name.

--train_subset

The subset to use for training.

--train_split

The split to use for training.

Default: “train”

--eval_subset

The subset to use for evaluation.

--eval_split

The dataset split to evaluate on.

Default: “”

--data_files

The dataset files. If multiple files, separate by comma.

--text_field

The text field to use for fine-tuning.

--text_template

Template to generate the text field from. E.g. '### Question: {prompt} \n### Answer: {response}'

--max_seq_len

Maximum sequence length for the data.

Default: 1024

--add_special_tokens

Whether to add special tokens during preprocessing.

Default: False

--max_samples

Maximum samples to select from the dataset.

Default: 256

--batch_size

Batch size.

Default: 1

--resource_group

Resource group for the AzureML workspace to run the workflow remotely.

--workspace_name

Workspace name for the AzureML workspace to run the workflow remotely.

--keyvault_name

The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.

--aml_compute

The compute name to run the workflow on.

--account_name

Azure storage account name for shared cache.

--container_name

Azure storage container name for shared cache.

--log_level

Logging level. Default is 3. Level 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL

Default: 3

LoRA options

--method

Possible choices: lora, qlora

The method to use for fine-tuning

Default: “lora”

--lora_r

LoRA R value.

Default: 64

--lora_alpha

LoRA alpha value.

Default: 16

--target_modules

The target modules for LoRA. If multiple, separate by comma.
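
As an illustration, the command below sketches a LoRA fine-tuning run; the model id, dataset name, and template are illustrative:

    olive finetune \
        -m microsoft/Phi-3-mini-4k-instruct \
        -d <dataset_name> \
        --text_template "### Question: {prompt} \n### Answer: {response}" \
        --method lora --lora_r 64 --lora_alpha 16 \
        -o finetuned-adapter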

Auto-Optimization

Automatically optimize the input model for the given target and precision.

usage: olive auto-opt [-h] [-m MODEL_NAME_OR_PATH] [-t TASK]
                      [--trust_remote_code] [-a ADAPTER_PATH]
                      [--model_script MODEL_SCRIPT] [--script_dir SCRIPT_DIR]
                      [--is_generative_model IS_GENERATIVE_MODEL]
                      [-o OUTPUT_PATH] [--device {gpu,cpu,npu}]
                      [--provider {CPUExecutionProvider,CUDAExecutionProvider,DmlExecutionProvider,JsExecutionProvider,MIGraphXExecutionProvider,OpenVINOExecutionProvider,QNNExecutionProvider,ROCMExecutionProvider,TensorrtExecutionProvider,VitisAIExecutionProvider}]
                      [--memory MEMORY] [-d DATA_NAME] [--split SPLIT]
                      [--subset SUBSET]
                      [--input_cols [INPUT_COLS [INPUT_COLS ...]]]
                      [--batch_size BATCH_SIZE]
                      [--precision {fp4,fp8,fp16,fp32,int4,int8,int16,int32,nf4}]
                      [--use_dynamo_exporter] [--use_model_builder]
                      [--use_qdq_encoding]
                      [--dynamic-to-fixed-shape-dim-param [DYNAMIC_TO_FIXED_SHAPE_DIM_PARAM [DYNAMIC_TO_FIXED_SHAPE_DIM_PARAM ...]]]
                      [--dynamic-to-fixed-shape-dim-value [DYNAMIC_TO_FIXED_SHAPE_DIM_VALUE [DYNAMIC_TO_FIXED_SHAPE_DIM_VALUE ...]]]
                      [--num-splits NUM_SPLITS | --cost-model COST_MODEL]
                      [--mixed-precision-overrides-config [MIXED_PRECISION_OVERRIDES_CONFIG [MIXED_PRECISION_OVERRIDES_CONFIG ...]]]
                      [--use_ort_genai]
                      [--enable_search [{exhaustive,tpe,random}]]
                      [--seed SEED] [--resource_group RESOURCE_GROUP]
                      [--workspace_name WORKSPACE_NAME]
                      [--keyvault_name KEYVAULT_NAME]
                      [--aml_compute AML_COMPUTE]
                      [--account_name ACCOUNT_NAME]
                      [--container_name CONTAINER_NAME]
                      [--log_level LOG_LEVEL]

Named Arguments

-m, --model_name_or_path

Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.

-t, --task

Task for which the huggingface model is used.

--trust_remote_code

Trust remote code when loading a huggingface model.

Default: False

-a, --adapter_path

Path to the adapter weights saved after PEFT fine-tuning. Local folder or HuggingFace id.

--model_script

The script file containing the model definition. Required for the local PyTorch model.

--script_dir

The directory containing the local PyTorch model script file. See https://microsoft.github.io/Olive/features/cli.html#model-script-file-information for more information.

--is_generative_model

Is this a generative model?

Default: True

-o, --output_path

Path to save the command output.

Default: auto-opt-output

--device

Possible choices: gpu, cpu, npu

Target device to run the model. Default is cpu.

Default: “cpu”

--provider

Possible choices: CPUExecutionProvider, CUDAExecutionProvider, DmlExecutionProvider, JsExecutionProvider, MIGraphXExecutionProvider, OpenVINOExecutionProvider, QNNExecutionProvider, ROCMExecutionProvider, TensorrtExecutionProvider, VitisAIExecutionProvider

Execution provider to use for ONNX model. Default is CPUExecutionProvider.

Default: “CPUExecutionProvider”

--memory

Memory limit for the accelerator in bytes. Default is None.

-d, --data_name

The dataset name.

--split

The dataset split to use for evaluation.

--subset

The dataset subset to use for evaluation.

--input_cols

The input columns to use for evaluation.

--batch_size

Batch size for evaluation.

Default: 1

--precision

Possible choices: fp4, fp8, fp16, fp32, int4, int8, int16, int32, nf4

The output precision of the optimized model. If not specified, the default precision is fp32 for cpu and fp16 for gpu

Default: “fp32”

--use_dynamo_exporter

Whether to use dynamo_export API to export ONNX model.

Default: False

--use_model_builder

Whether to use the Model Builder pass for optimization. Enable only when the model is supported by Model Builder.

Default: False

--use_qdq_encoding

Whether to use QDQ encoding for quantized operators instead of ONNXRuntime contrib operators such as MatMulNBits.

Default: False

--dynamic-to-fixed-shape-dim-param

Symbolic parameter names to use for dynamic to fixed shape pass. Required only when using QNNExecutionProvider.

--dynamic-to-fixed-shape-dim-value

Symbolic parameter values to use for dynamic to fixed shape pass. Required only when using QNNExecutionProvider.

--num-splits

Number of splits to use for model splitting. Input model must be an HfModel.

--cost-model

Path to the cost model CSV file to use for model splitting. Mutually exclusive with num-splits. Must be a CSV with headers module,num_params,num_bytes, where each row gives the name of a module (with no children), its number of parameters, and the number of bytes the module uses in the desired precision.

--mixed-precision-overrides-config

Dictionary of name to precision. Must contain an even number of entries, with even-indexed entries as keys and odd-indexed entries as values. Required only when the output precision is “fp16” and the MixedPrecisionOverrides pass is enabled.

--use_ort_genai

Use OnnxRuntime generate() API to run the model

Default: False

--enable_search

Possible choices: exhaustive, tpe, random

Enable search to produce an optimal model for the given criteria. Optionally provide a search algorithm from the available choices. Uses the exhaustive search algorithm by default.

--seed

Random seed for search algorithm

Default: 0

--resource_group

Resource group for the AzureML workspace to run the workflow remotely.

--workspace_name

Workspace name for the AzureML workspace to run the workflow remotely.

--keyvault_name

The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.

--aml_compute

The compute name to run the workflow on.

--account_name

Azure storage account name for shared cache.

--container_name

Azure storage container name for shared cache.

--log_level

Logging level. Default is 3. Level 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL

Default: 3
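
For example, a sketch that optimizes a HuggingFace model for CPU with int4 output precision (the model id and output folder are illustrative):

    olive auto-opt \
        -m microsoft/Phi-3-mini-4k-instruct \
        --device cpu \
        --provider CPUExecutionProvider \
        --precision int4 \
        --use_ort_genai \
        -o optimized-model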

Quantization

Quantize a PyTorch or ONNX model using various quantization algorithms.

usage: olive quantize [-h] [-m MODEL_NAME_OR_PATH] [-t TASK]
                      [--trust_remote_code] [-a ADAPTER_PATH]
                      [--model_script MODEL_SCRIPT] [--script_dir SCRIPT_DIR]
                      [--is_generative_model IS_GENERATIVE_MODEL]
                      [-o OUTPUT_PATH] --algorithm {awq,dynamic,gptq,hqq,rtn}
                      [--precision {int4,int8,int16,uint4,uint8,uint16,fp4,fp8,fp16,nf4}]
                      [--implementation {awq,bnb4,gptq,inc_dynamic,matmul4,mnb_to_qdq,nvmo,onnx_dynamic,quarot}]
                      [--enable-qdq-encoding] [--quarot_rotate] [-d DATA_NAME]
                      [--subset SUBSET] [--split SPLIT]
                      [--data_files DATA_FILES]
                      [--text_field TEXT_FIELD | --text_template TEXT_TEMPLATE]
                      [--max_seq_len MAX_SEQ_LEN]
                      [--add_special_tokens ADD_SPECIAL_TOKENS]
                      [--max_samples MAX_SAMPLES] [--batch_size BATCH_SIZE]
                      [--resource_group RESOURCE_GROUP]
                      [--workspace_name WORKSPACE_NAME]
                      [--keyvault_name KEYVAULT_NAME]
                      [--aml_compute AML_COMPUTE]
                      [--account_name ACCOUNT_NAME]
                      [--container_name CONTAINER_NAME]
                      [--log_level LOG_LEVEL]

Named Arguments

-m, --model_name_or_path

Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.

-t, --task

Task for which the huggingface model is used.

--trust_remote_code

Trust remote code when loading a huggingface model.

Default: False

-a, --adapter_path

Path to the adapter weights saved after PEFT fine-tuning. Local folder or HuggingFace id.

--model_script

The script file containing the model definition. Required for the local PyTorch model.

--script_dir

The directory containing the local PyTorch model script file. See https://microsoft.github.io/Olive/features/cli.html#model-script-file-information for more information.

--is_generative_model

Is this a generative model?

Default: True

-o, --output_path

Path to save the command output.

Default: quantized-model

--algorithm

Possible choices: awq, dynamic, gptq, hqq, rtn

List of quantization algorithms to run.

--precision

Possible choices: int4, int8, int16, uint4, uint8, uint16, fp4, fp8, fp16, nf4

The precision of the quantized model.

Default: “int4”

--implementation

Possible choices: awq, bnb4, gptq, inc_dynamic, matmul4, mnb_to_qdq, nvmo, onnx_dynamic, quarot

The specific implementation of quantization algorithms to use.

--enable-qdq-encoding

Use QDQ encoding in ONNX model for the quantized nodes.

Default: False

--quarot_rotate

Apply QuaRot/Hadamard rotation to the model.

Default: False

-d, --data_name

The dataset name.

--subset

The subset of the dataset to use.

--split

The dataset split to use.

--data_files

The dataset files. If multiple files, separate by comma.

--text_field

The text field to use for fine-tuning.

--text_template

Template to generate the text field from. E.g. '### Question: {prompt} \n### Answer: {response}'

--max_seq_len

Maximum sequence length for the data.

Default: 1024

--add_special_tokens

Whether to add special tokens during preprocessing.

Default: False

--max_samples

Maximum samples to select from the dataset.

Default: 256

--batch_size

Batch size.

Default: 1

--resource_group

Resource group for the AzureML workspace to run the workflow remotely.

--workspace_name

Workspace name for the AzureML workspace to run the workflow remotely.

--keyvault_name

The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.

--aml_compute

The compute name to run the workflow on.

--account_name

Azure storage account name for shared cache.

--container_name

Azure storage container name for shared cache.

--log_level

Logging level. Default is 3. Level 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL

Default: 3
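
For example, a sketch that applies AWQ quantization to a HuggingFace model (the model id is illustrative; a calibration dataset can be supplied with -d when the chosen algorithm needs one):

    olive quantize \
        -m microsoft/Phi-3-mini-4k-instruct \
        --algorithm awq \
        --precision int4 \
        -o quantized-model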

Capture Onnx Graph

Capture the ONNX graph of a HuggingFace or PyTorch model using the PyTorch exporter or Model Builder.

usage: olive capture-onnx-graph [-h] [-m MODEL_NAME_OR_PATH] [-t TASK]
                                [--trust_remote_code] [-a ADAPTER_PATH]
                                [--model_script MODEL_SCRIPT]
                                [--script_dir SCRIPT_DIR]
                                [--is_generative_model IS_GENERATIVE_MODEL]
                                [-o OUTPUT_PATH]
                                [--conversion_device {cpu,gpu}]
                                [--use_dynamo_exporter]
                                [--past_key_value_name PAST_KEY_VALUE_NAME]
                                [--torch_dtype TORCH_DTYPE]
                                [--target_opset TARGET_OPSET]
                                [--use_model_builder]
                                [--precision {fp16,fp32,int4}]
                                [--int4_block_size {16,32,64,128,256}]
                                [--int4_accuracy_level INT4_ACCURACY_LEVEL]
                                [--exclude_embeds EXCLUDE_EMBEDS]
                                [--exclude_lm_head EXCLUDE_LM_HEAD]
                                [--enable_cuda_graph ENABLE_CUDA_GRAPH]
                                [--use_ort_genai]
                                [--resource_group RESOURCE_GROUP]
                                [--workspace_name WORKSPACE_NAME]
                                [--keyvault_name KEYVAULT_NAME]
                                [--aml_compute AML_COMPUTE]
                                [--log_level LOG_LEVEL]
                                [--account_name ACCOUNT_NAME]
                                [--container_name CONTAINER_NAME]

Named Arguments

-m, --model_name_or_path

Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.

-t, --task

Task for which the huggingface model is used.

--trust_remote_code

Trust remote code when loading a huggingface model.

Default: False

-a, --adapter_path

Path to the adapter weights saved after PEFT fine-tuning. Local folder or HuggingFace id.

--model_script

The script file containing the model definition. Required for the local PyTorch model.

--script_dir

The directory containing the local PyTorch model script file. See https://microsoft.github.io/Olive/features/cli.html#model-script-file-information for more information.

--is_generative_model

Is this a generative model?

Default: True

-o, --output_path

Path to save the command output.

Default: onnx-model

--conversion_device

Possible choices: cpu, gpu

The device used to run the model to capture the ONNX graph.

Default: “cpu”

--use_ort_genai

Use OnnxRuntime generate() API to run the model

Default: False

--resource_group

Resource group for the AzureML workspace to run the workflow remotely.

--workspace_name

Workspace name for the AzureML workspace to run the workflow remotely.

--keyvault_name

The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.

--aml_compute

The compute name to run the workflow on.

--log_level

Logging level. Default is 3. Level 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL

Default: 3

--account_name

Azure storage account name for shared cache.

--container_name

Azure storage container name for shared cache.

PyTorch Exporter options

--use_dynamo_exporter

Whether to use dynamo_export API to export ONNX model.

Default: False

--past_key_value_name

The argument name that points to the past key values. For a model loaded from HuggingFace, it is ‘past_key_values’. Used only when use_dynamo_exporter is True.

Default: “past_key_values”

--torch_dtype

The dtype to cast the model to before capturing the ONNX graph, e.g., ‘float32’ or ‘float16’. If not specified, the model is used as is.

--target_opset

The target opset version for the ONNX model. Default is 17.

Default: 17

Model Builder options

--use_model_builder

Whether to use Model Builder to capture ONNX model.

Default: False

--precision

Possible choices: fp16, fp32, int4

The precision of the ONNX model. This is used by Model Builder

Default: “fp16”

--int4_block_size

Possible choices: 16, 32, 64, 128, 256

Specify the block_size for int4 quantization. Acceptable values: 16/32/64/128/256.

--int4_accuracy_level

Specify the minimum accuracy level for activation of MatMul in int4 quantization.

--exclude_embeds

Remove embedding layer from your ONNX model.

Default: False

--exclude_lm_head

Remove language modeling head from your ONNX model.

Default: False

--enable_cuda_graph

Whether the model can use CUDA graph capture with the CUDA execution provider. If enabled, all nodes must be placed on the CUDA EP for the CUDA graph to work correctly.
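
For instance, a sketch that captures the ONNX graph of a HuggingFace model with Model Builder in int4 precision (the model id and output folder are illustrative):

    olive capture-onnx-graph \
        -m microsoft/Phi-3-mini-4k-instruct \
        --use_model_builder \
        --precision int4 \
        -o onnx-model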

Generate Adapters

Generate ONNX model with adapters as inputs.

usage: olive generate-adapter [-h] -m MODEL_NAME_OR_PATH
                              [--is_generative_model IS_GENERATIVE_MODEL]
                              [-o OUTPUT_PATH]
                              [--adapter_format {pt,numpy,safetensors,onnx_adapter}]
                              [--resource_group RESOURCE_GROUP]
                              [--workspace_name WORKSPACE_NAME]
                              [--keyvault_name KEYVAULT_NAME]
                              [--aml_compute AML_COMPUTE]
                              [--log_level LOG_LEVEL]
                              [--account_name ACCOUNT_NAME]
                              [--container_name CONTAINER_NAME]

Named Arguments

-m, --model_name_or_path

Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.

--is_generative_model

Is this a generative model?

Default: True

-o, --output_path

Path to save the command output.

Default: optimized-model

--adapter_format

Possible choices: pt, numpy, safetensors, onnx_adapter

Format to save the weights in. Default is onnx_adapter.

Default: “onnx_adapter”

--resource_group

Resource group for the AzureML workspace to run the workflow remotely.

--workspace_name

Workspace name for the AzureML workspace to run the workflow remotely.

--keyvault_name

The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.

--aml_compute

The compute name to run the workflow on.

--log_level

Logging level. Default is 3. Level 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL

Default: 3

--account_name

Azure storage account name for shared cache.

--container_name

Azure storage container name for shared cache.
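
For example, a sketch assuming the input is an ONNX model with LoRA adapters produced by a previous Olive command in a folder named onnx-model:

    olive generate-adapter -m onnx-model --adapter_format onnx_adapter -o optimized-model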

Convert Adapters

Convert LoRA adapter weights to a file that can be consumed by ONNX models generated by the Olive ExtractedAdapters pass.

usage: olive convert-adapters [-h] -a ADAPTER_PATH
                              [--adapter_format {pt,numpy,safetensors,onnx_adapter}]
                              -o OUTPUT_PATH [--dtype {float32,float16}]
                              [--quantize_int4]
                              [--int4_block_size {16,32,64,128,256}]
                              [--int4_quantization_mode {symmetric,asymmetric}]
                              [--log_level LOG_LEVEL]

Named Arguments

-a, --adapter_path

Path to the adapter weights saved after PEFT fine-tuning. Can be a local folder or HuggingFace id.

--adapter_format

Possible choices: pt, numpy, safetensors, onnx_adapter

Format to save the weights in. Default is onnx_adapter.

Default: “onnx_adapter”

-o, --output_path

Path to save the exported weights. Will be saved in the adapter_format format.

--dtype

Possible choices: float32, float16

Data type to save float adapter weights as. If quantize_int4 is True, this is the data type of the quantization scales. Default is float32.

Default: “float32”

--quantize_int4

Quantize the adapter weights to int4 using blockwise quantization.

Default: False

--int4_block_size

Possible choices: 16, 32, 64, 128, 256

Block size for int4 quantization of adapter weights. Default is 32.

Default: 32

--int4_quantization_mode

Possible choices: symmetric, asymmetric

Quantization mode for int4 quantization of adapter weights. Default is symmetric.

Default: “symmetric”

--log_level

Logging level. Default is 3. Level 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL

Default: 3
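
For example, a sketch that exports fine-tuned adapter weights to the onnx_adapter format (the paths are illustrative):

    olive convert-adapters \
        -a finetuned-adapter \
        --adapter_format onnx_adapter \
        --dtype float32 \
        -o adapter-weights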

Tune OnnxRuntime Session Params

Automatically tune the OnnxRuntime session parameters for a given ONNX model. Currently, for an ONNX model converted from a HuggingFace model and used for generative tasks, the user can simply provide --model onnx_model_path --hf_model_name hf_model_name --device device_type to get the tuned session parameters.

usage: olive tune-session-params [-h] -m MODEL_NAME_OR_PATH
                                 [--is_generative_model IS_GENERATIVE_MODEL]
                                 [-o OUTPUT_PATH] [--cpu_cores CPU_CORES]
                                 [--io_bind] [--enable_cuda_graph]
                                 [--execution_mode_list [EXECUTION_MODE_LIST [EXECUTION_MODE_LIST ...]]]
                                 [--opt_level_list [OPT_LEVEL_LIST [OPT_LEVEL_LIST ...]]]
                                 [--trt_fp16_enable]
                                 [--intra_thread_num_list [INTRA_THREAD_NUM_LIST [INTRA_THREAD_NUM_LIST ...]]]
                                 [--inter_thread_num_list [INTER_THREAD_NUM_LIST [INTER_THREAD_NUM_LIST ...]]]
                                 [--extra_session_config EXTRA_SESSION_CONFIG]
                                 [--disable_force_evaluate_other_eps]
                                 [--enable_profiling]
                                 [--predict_with_kv_cache]
                                 [--device {gpu,cpu,npu}]
                                 [--providers_list [{CPUExecutionProvider,CUDAExecutionProvider,DmlExecutionProvider,JsExecutionProvider,MIGraphXExecutionProvider,OpenVINOExecutionProvider,QNNExecutionProvider,ROCMExecutionProvider,TensorrtExecutionProvider,VitisAIExecutionProvider} [{CPUExecutionProvider,CUDAExecutionProvider,DmlExecutionProvider,JsExecutionProvider,MIGraphXExecutionProvider,OpenVINOExecutionProvider,QNNExecutionProvider,ROCMExecutionProvider,TensorrtExecutionProvider,VitisAIExecutionProvider} ...]]]
                                 [--memory MEMORY]
                                 [--resource_group RESOURCE_GROUP]
                                 [--workspace_name WORKSPACE_NAME]
                                 [--keyvault_name KEYVAULT_NAME]
                                 [--aml_compute AML_COMPUTE]
                                 [--log_level LOG_LEVEL]
                                 [--account_name ACCOUNT_NAME]
                                 [--container_name CONTAINER_NAME]

Named Arguments

-m, --model_name_or_path

Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.

--is_generative_model

Is this a generative model?

Default: True

-o, --output_path

Path to save the command output.

Default: tuned-inference-settings

--cpu_cores

CPU cores used for thread tuning.

--io_bind

Whether to enable IOBinding search for ONNX Runtime inference.

Default: False

--enable_cuda_graph

Whether to enable CUDA Graph for the CUDA execution provider.

Default: False

--execution_mode_list

Parallelism list between operators.

--opt_level_list

Optimization level list for ONNX Model.

--trt_fp16_enable

Enable TensorRT FP16 mode.

Default: False

--intra_thread_num_list

List of intra-op thread numbers to test.

--inter_thread_num_list

List of inter-op thread numbers to test.

--extra_session_config

Extra customized session options for the tuning process. It should be a JSON string, e.g. --extra_session_config '{"key1": "value1", "key2": "value2"}'

--disable_force_evaluate_other_eps

Whether to force evaluation of all execution providers that are different from the associated execution provider.

Default: False

--enable_profiling

Whether to enable profiling for ONNX Runtime inference.

Default: False

--predict_with_kv_cache

Whether to use key-value cache for ORT session parameter tuning

Default: False

--device

Possible choices: gpu, cpu, npu

Target device to run the model. Default is cpu.

Default: “cpu”

--providers_list

Possible choices: CPUExecutionProvider, CUDAExecutionProvider, DmlExecutionProvider, JsExecutionProvider, MIGraphXExecutionProvider, OpenVINOExecutionProvider, QNNExecutionProvider, ROCMExecutionProvider, TensorrtExecutionProvider, VitisAIExecutionProvider

List of execution providers to use for ONNX model. They are case sensitive. If not provided, all available providers will be used.

--memory

Memory limit for the accelerator in bytes. Default is None.

--resource_group

Resource group for the AzureML workspace to run the workflow remotely.

--workspace_name

Workspace name for the AzureML workspace to run the workflow remotely.

--keyvault_name

The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.

--aml_compute

The compute name to run the workflow on.

--log_level

Logging level. Default is 3. Level 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL

Default: 3

--account_name

Azure storage account name for shared cache.

--container_name

Azure storage container name for shared cache.
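
For example, a sketch that tunes session parameters of a local ONNX model for CPU (the model path is illustrative):

    olive tune-session-params \
        -m model.onnx \
        --device cpu \
        --providers_list CPUExecutionProvider \
        -o tuned-inference-settings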

Generate Cost Model for Model Splitting

Generate a cost model for a given model and save it as a csv file. This cost model is consumed by the CaptureSplitInfo pass. Only supports HfModel.

usage: olive generate-cost-model [-h] -m MODEL_NAME_OR_PATH [-t TASK]
                                 [--trust_remote_code]
                                 [--is_generative_model IS_GENERATIVE_MODEL]
                                 [-o OUTPUT_PATH]
                                 [-p {fp32,fp16,fp8,int32,uint32,int16,uint16,int8,uint8,int4,uint4,nf4,fp4}]

Named Arguments

-m, --model_name_or_path

Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.

-t, --task

Task for which the huggingface model is used.

--trust_remote_code

Trust remote code when loading a huggingface model.

Default: False

--is_generative_model

Is this a generative model?

Default: True

-o, --output_path

Path to save the command output.

Default: “cost-model.csv”

-p, --weight_precision

Possible choices: fp32, fp16, fp8, int32, uint32, int16, uint16, int8, uint8, int4, uint4, nf4, fp4

Weight precision

Default: “fp16”
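
For example, a sketch assuming a HuggingFace model id:

    olive generate-cost-model -m microsoft/Phi-3-mini-4k-instruct -p fp16

The resulting CSV can then be passed to olive auto-opt via its --cost-model option.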

Qualcomm SDK

Configure Qualcomm SDK.

usage: olive configure-qualcomm-sdk [-h] --py_version {3.6,3.8} --sdk
                                    {snpe,qnn}

Named Arguments

--py_version

Possible choices: 3.6, 3.8

Python version: Use 3.6 for tensorflow 1.15 and 3.8 otherwise

--sdk

Possible choices: snpe, qnn

Qualcomm SDK: snpe or qnn
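
For example, to configure the QNN SDK with Python 3.8:

    olive configure-qualcomm-sdk --py_version 3.8 --sdk qnn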

AzureML

Manage the AzureML Compute resources.

usage: olive manage-aml-compute [-h] (--create | --delete)
                                [--subscription_id SUBSCRIPTION_ID]
                                [--resource_group RESOURCE_GROUP]
                                [--workspace_name WORKSPACE_NAME]
                                [--aml_config_path AML_CONFIG_PATH]
                                --compute_name COMPUTE_NAME
                                [--vm_size VM_SIZE] [--location LOCATION]
                                [--min_nodes MIN_NODES]
                                [--max_nodes MAX_NODES]
                                [--idle_time_before_scale_down IDLE_TIME_BEFORE_SCALE_DOWN]

Named Arguments

--create, -c

Create new compute

Default: False

--delete, -d

Delete existing compute

Default: False

--subscription_id

Azure subscription ID

--resource_group

Name of the Azure resource group

--workspace_name

Name of the AzureML workspace

--aml_config_path

Path to AzureML config file. If provided, subscription_id, resource_group and workspace_name are ignored

--compute_name

Name of the new compute

--vm_size

VM size of the new compute. This is required if you are creating a compute instance

--location

Location of the new compute. This is required if you are creating a compute instance

--min_nodes

Minimum number of nodes

Default: 0

--max_nodes

Maximum number of nodes

Default: 2

--idle_time_before_scale_down

Idle seconds before scaledown

Default: 120
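
For instance, a sketch that creates a small autoscaling compute cluster (the subscription, resource group, workspace, compute name, and VM size are illustrative):

    olive manage-aml-compute --create \
        --subscription_id <subscription_id> \
        --resource_group <resource_group> \
        --workspace_name <workspace_name> \
        --compute_name cpu-cluster \
        --vm_size Standard_D2s_v3 \
        --min_nodes 0 --max_nodes 2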

Shared Cache

Delete Olive model cache stored in the cloud.

usage: olive shared-cache [-h] [--delete] [--all] [-y] --account ACCOUNT
                          --container CONTAINER [--model_hash MODEL_HASH]

Named Arguments

--delete

Delete a model cache from the shared cache.

Default: False

--all

Delete all model cache from the cloud cache.

Default: False

-y, --yes

Confirm the deletion without prompting for confirmation.

Default: False

--account

The account name for the shared cache.

--container

The container name for the shared cache.

--model_hash

The model hash to remove from the shared cache.
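
For example, a sketch that deletes a single cached model from the shared cache (the account, container, and hash are placeholders):

    olive shared-cache --delete \
        --account <account_name> \
        --container <container_name> \
        --model_hash <model_hash>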

Providing Input Models

There is more than one way to supply an input model to the Olive commands.

  1. A HuggingFace model can be used directly as an input model. For example, -m microsoft/Phi-3-mini-4k-instruct.

  2. A model produced by an Olive command can be used directly as an input model. Specify the model path using the -m <output_model> option, where <output_model> is the output folder set by -o <output_model> in the previous Olive command.

  3. Olive commands also accept a local PyTorch model as an input model. You can specify the model file path using the -m model.pt option, and the associated model script using the --model_script script.py option. For example, olive capture-onnx-graph -m model.pt --model_script script.py.

  4. A model from an AzureML registry can be used directly as an input model. For example, -m azureml://registries/<registry_name>/models/<model_name>/versions/<version>.

  5. An ONNX model available locally can also be used as input to the Olive commands that accept an ONNX model.

Model Script File Information

Olive commands support a custom PyTorch model as input. Olive requires users to define specific functions to load and process the custom PyTorch model. These functions should be defined in the model script you provide.

  • Model Loader Function (`_model_loader`): Loads the PyTorch model. If the model file path is provided using the -m option, it takes higher priority than the model loader function.

    def _model_loader():
        ...
        return model
    
  • IO Config Function (`_io_config`): Returns the IO configuration for the model. Either _io_config or _dummy_inputs is required for the capture-onnx-graph CLI command.

    def _io_config(model: PyTorchModelHandler):
        ...
        return io_config
    
  • Dummy Inputs Function (`_dummy_inputs`): Provides dummy input tensors for the model. Either _io_config or _dummy_inputs is required for the capture-onnx-graph CLI command.

    def _dummy_inputs(model: PyTorchModelHandler):
        ...
        return dummy_inputs
    
  • Model Format Function (`_model_file_format`): Specifies the format of the model. The default value is PyTorch.EntireModel. For more available options, refer to this.

    def _model_file_format():
        ...
        return model_file_format
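
Below is a minimal sketch of a complete model script. It assumes PyTorchModelHandler can be imported from olive.model for the type annotations; the tiny network and its input shape are purely illustrative:

    import torch

    # Assumed import path for the handler type used in the annotations.
    from olive.model import PyTorchModelHandler


    def _model_loader():
        # Build or load the PyTorch model to be captured/optimized.
        return torch.nn.Sequential(
            torch.nn.Linear(16, 32),
            torch.nn.ReLU(),
            torch.nn.Linear(32, 4),
        )


    def _io_config(model: PyTorchModelHandler):
        # Describe the model's inputs and outputs for ONNX export (illustrative shapes).
        return {
            "input_names": ["input"],
            "input_shapes": [[1, 16]],
            "output_names": ["output"],
        }


    def _dummy_inputs(model: PyTorchModelHandler):
        # Example tensor used to trace/export the model.
        return torch.randn(1, 16)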