Command Line Tools

Olive provides command line tools that can be invoked using the `olive` command. The command line tools are used to perform various tasks such as running an Olive workflow, managing AzureML compute, and more.

If `olive` is not in your PATH, you can run the command line tools by replacing `olive` with `python -m olive`.
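
For example, assuming a workflow configuration file named config.json (a placeholder for your own config), the following two invocations are equivalent:

    olive run --run-config config.json
    python -m olive run --run-config config.json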

Input Model

Olive CLI Produced Model

The Olive command-line tools support using a model produced by the Olive CLI as an input model. You can specify it using the -m <output_model> option, where <output_model> is the output folder defined by -o <output_model> in the previous CLI command.
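
For example, the output folder of one command can be passed as the input model of the next. The folder names and arguments below are placeholders for illustration:

    # Produce an optimized model in the finetuned-model folder ...
    olive finetune -m <model_name_or_path> -d <dataset_name> --text_field text -o finetuned-model
    # ... then use that folder as the input model of the next command.
    olive capture-onnx-graph -m finetuned-model -o onnx-model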

Local PyTorch Model

Olive command line tools accept a local PyTorch model as an input model. You can specify the model file path using the -m model.pt option, and the associated model script using the --model_script script.py option.

Olive reserves several function names to provide specific inputs for the PyTorch model. These functions should be defined in your model script:

Available Functions

Below are the functions that Olive expects in the model script and their purposes:

  • Model Loader Function (`_model_loader`): Loads the PyTorch model. If a model file path is provided using the -m option, it takes precedence over the model loader function.

    def _model_loader():
        ...
        return model
    
  • IO Config Function (`_io_config`): Returns the IO configuration for the model. Either _io_config or _dummy_inputs is required for the capture-onnx-graph CLI command.

    def _io_config(model: PyTorchModelHandler):
        ...
        return io_config
    
  • Dummy Inputs Function (`_dummy_inputs`): Provides dummy input tensors for the model. Either _io_config or _dummy_inputs is required for the capture-onnx-graph CLI command.

    def _dummy_inputs(model: PyTorchModelHandler):
        ...
        return dummy_inputs
    
  • Model Format Function (`_model_file_format`): Specifies the format of the model. The default value is PyTorch.EntireModel. For more available options, refer to this.

    def _model_file_format():
        ...
        return model_file_format
    

Example Usage

To use the Olive CLI with a local PyTorch model:

  1. Provide the model path and the script:

    python -m olive capture-onnx-graph -m model.pt --model_script script.py
    
  2. Ensure that the script contains the above functions to handle loading, input/output configuration, dummy inputs, and model format specification as needed.
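
For reference, a minimal model script might look like the sketch below. The network, shapes, and io_config fields shown here are placeholders for illustration; the exact fields you need depend on your own model.

    # script.py -- a minimal, illustrative model script
    import torch


    def _model_loader():
        # Load or construct the PyTorch model to be captured.
        model = torch.nn.Linear(16, 4)
        model.eval()
        return model


    def _io_config(model):
        # Describe the model inputs and outputs for ONNX export.
        # The fields below are common ones; adapt them to your model.
        return {
            "input_names": ["input"],
            "input_shapes": [[1, 16]],
            "output_names": ["output"],
            "dynamic_axes": {"input": {0: "batch_size"}, "output": {0: "batch_size"}},
        }


    def _dummy_inputs(model):
        # Dummy tensors matching the io_config, used when tracing the model.
        return torch.randn(1, 16)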

Argparse Documentation

Below is the argparse documentation for the Olive command-line interface:

usage: olive

Sub-commands

capture-onnx-graph

Capture the ONNX graph of a Huggingface model using the PyTorch Exporter or Model Builder.

olive capture-onnx-graph [-h] [--log_level LOG_LEVEL] [-m MODEL_NAME_OR_PATH]
                         [--trust_remote_code] [-t TASK]
                         [--model_script MODEL_SCRIPT]
                         [--script_dir SCRIPT_DIR] [--device {cpu,gpu}]
                         [-o OUTPUT_PATH] [--tempdir TEMPDIR]
                         [--use_dynamo_exporter] [--use_ort_genai]
                         [--past_key_value_name PAST_KEY_VALUE_NAME]
                         [--torch_dtype TORCH_DTYPE]
                         [--target_opset TARGET_OPSET] [--use_model_builder]
                         [--precision {fp16,fp32,int4}]
                         [--int4_block_size {16,32,64,128,256}]
                         [--int4_accuracy_level INT4_ACCURACY_LEVEL]
                         [--exclude_embeds EXCLUDE_EMBEDS]
                         [--exclude_lm_head EXCLUDE_LM_HEAD]
                         [--enable_cuda_graph ENABLE_CUDA_GRAPH]
                         [--resource_group RESOURCE_GROUP]
                         [--workspace_name WORKSPACE_NAME]
                         [--keyvault_name KEYVAULT_NAME]
                         [--aml_compute AML_COMPUTE]
Named Arguments
--device

Possible choices: cpu, gpu

The device to use to convert the model to ONNX. If ‘gpu’ is selected, the execution_providers will be set to CUDAExecutionProvider. If ‘cpu’ is selected, the execution_providers will be set to CPUExecutionProvider. For the PyTorch Exporter, the model is cast to this device before capturing the ONNX graph.

Default: “cpu”

-o, --output_path

Output path

Default: “onnx-model”

--tempdir

Root directory for tempfile directories and files

logging options
--log_level

Logging level. Default is 3. level 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL

Default: 3

Model options
-m, --model_name_or_path

The model checkpoint for weights initialization. If using an AzureML Registry model, provide the model path as ‘registry_name:model_name:version’.

--trust_remote_code

Trust remote code when loading a model.

Default: False

-t, --task

Task for which the model is used.

--model_script

The script file containing the model definition. Required for PyTorch model.

--script_dir

The directory containing the model script file.

PyTorch Exporter options
--use_dynamo_exporter

Whether to use dynamo_export API to export ONNX model.

Default: False

--use_ort_genai

Use the ONNX Runtime generate() API to run the model

Default: False

--past_key_value_name

The argument name that points to the past key values. For models loaded from Huggingface, it is ‘past_key_values’. It is only used when use_dynamo_exporter is True.

Default: “past_key_values”

--torch_dtype

The dtype to cast the model to before capturing the ONNX graph, e.g., ‘float32’ or ‘float16’. If not specified, the model is used as is.

--target_opset

The target opset version for the ONNX model. Default is 17.

Default: 17

Model Builder options
--use_model_builder

Whether to use Model Builder to capture ONNX model.

Default: False

--precision

Possible choices: fp16, fp32, int4

The precision of the ONNX model. This is used by Model Builder.

Default: “fp16”

--int4_block_size

Possible choices: 16, 32, 64, 128, 256

Specify the block_size for int4 quantization. Acceptable values: 16/32/64/128/256.

--int4_accuracy_level

Specify the minimum accuracy level for activation of MatMul in int4 quantization.

--exclude_embeds

Remove embedding layer from your ONNX model.

Default: False

--exclude_lm_head

Remove language modeling head from your ONNX model.

Default: False

--enable_cuda_graph

The model can use CUDA graph capture for the CUDA execution provider. If enabled, all nodes must be placed on the CUDA EP for the CUDA graph to be used correctly.

remote options
--resource_group

Resource group for the AzureML workspace.

--workspace_name

Workspace name for the AzureML workspace.

--keyvault_name

The AzureML keyvault name with the Huggingface token to use for a remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.

--aml_compute

The compute name to run the workflow on.
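
For example, a hypothetical invocation that captures an int4 ONNX graph with Model Builder might look like this; the model path is a placeholder:

    olive capture-onnx-graph -m <model_name_or_path> --use_model_builder --precision int4 -o onnx-model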

run

Run an olive workflow

olive run [-h] [--package-config PACKAGE_CONFIG] --run-config RUN_CONFIG
          [--setup] [--packages] [--tempdir TEMPDIR]
Named Arguments
--package-config

For advanced users. Path to optional package (json) config file with location of individual pass module implementation and corresponding dependencies. Configuration might also include user owned/proprietary/private pass implementations.

--run-config, --config

Path to json config file

--setup

Whether to run environment setup

Default: False

--packages

List required packages

Default: False

--tempdir

Root directory for tempfile directories and files
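
A typical sequence is to set up the environment first and then run the workflow; config.json below is a placeholder for your own workflow config:

    olive run --run-config config.json --setup
    olive run --run-config config.json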

finetune

Fine-tune a model on a dataset using peft and optimize the model for ONNX Runtime with adapters as inputs. Huggingface training arguments can be provided along with the defined options.

olive finetune [-h] [--log_level LOG_LEVEL] [--precision {float16,float32}]
               [-m MODEL_NAME_OR_PATH] [--trust_remote_code] [-t TASK]
               [--model_script MODEL_SCRIPT] [--script_dir SCRIPT_DIR]
               [--torch_dtype {bfloat16,float16,float32}] [--use_ort_genai] -d
               DATA_NAME [--data_files DATA_FILES] [--train_split TRAIN_SPLIT]
               [--eval_split EVAL_SPLIT]
               (--text_field TEXT_FIELD | --text_template TEXT_TEMPLATE)
               [--max_seq_len MAX_SEQ_LEN] [--method {lora,qlora}]
               [--lora_r LORA_R] [--lora_alpha LORA_ALPHA]
               [--target_modules TARGET_MODULES] [-o OUTPUT_PATH]
               [--tempdir TEMPDIR] [--clean] [--resource_group RESOURCE_GROUP]
               [--workspace_name WORKSPACE_NAME]
               [--keyvault_name KEYVAULT_NAME] [--aml_compute AML_COMPUTE]
Named Arguments
--precision

Possible choices: float16, float32

The precision of the optimized model and adapters.

Default: “float16”

--torch_dtype

Possible choices: bfloat16, float16, float32

The torch dtype to use for training.

Default: “bfloat16”

--use_ort_genai

Use the ONNX Runtime generate() API to run the model

Default: False

-o, --output_path

Output path

Default: “optimized-model”

--tempdir

Root directory for tempfile directories and files

--clean

Run in a clean cache directory

Default: False

logging options
--log_level

Logging level. Default is 3. level 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL

Default: 3

Model options
-m, --model_name_or_path

The model checkpoint for weights initialization. If using an AzureML Registry model, provide the model path as ‘registry_name:model_name:version’.

--trust_remote_code

Trust remote code when loading a model.

Default: False

-t, --task

Task for which the model is used.

--model_script

The script file containing the model definition. Required for PyTorch model.

--script_dir

The directory containing the model script file.

dataset options
-d, --data_name

The dataset name.

--data_files

The dataset files. If multiple files, separate by comma.

--train_split

The split to use for training.

Default: “train”

--eval_split

The dataset split to evaluate on.

Default: “”

--text_field

The text field to use for fine-tuning.

--text_template

Template to generate the text field from. E.g. ‘### Question: {prompt} \n### Answer: {response}’

--max_seq_len

Maximum sequence length for the data.

Default: 1024

lora options
--method

Possible choices: lora, qlora

The method to use for fine-tuning

Default: “lora”

--lora_r

LoRA R value.

Default: 64

--lora_alpha

LoRA alpha value.

Default: 16

--target_modules

The target modules for LoRA. If multiple, separate by comma.

remote options
--resource_group

Resource group for the AzureML workspace.

--workspace_name

Workspace name for the AzureML workspace.

--keyvault_name

The AzureML keyvault name with the Huggingface token to use for a remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.

--aml_compute

The compute name to run the workflow on.
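
For example, a hypothetical QLoRA fine-tuning run might look like the following; the model, dataset, and field names are placeholders:

    olive finetune -m <model_name_or_path> -d <dataset_name> --text_field text \
        --method qlora --max_seq_len 1024 -o finetuned-model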

export-adapters

Export LoRA adapter weights to a file that will be consumed by ONNX models generated by the Olive ExtractedAdapters pass.

olive export-adapters [-h] --adapter_path ADAPTER_PATH
                      [--save_format {pt,numpy,safetensors}] --output_path
                      OUTPUT_PATH [--dtype {float32,float16}] [--pack_weights]
                      [--quantize_int4] [--int4_block_size {16,32,64,128,256}]
                      [--int4_quantization_mode {symmetric,asymmetric}]
Named Arguments
--adapter_path

Path to the adapter weights saved after peft fine-tuning. Can be a local folder or a Huggingface id.

--save_format

Possible choices: pt, numpy, safetensors

Format to save the weights in. Default is numpy.

Default: “numpy”

--output_path

Path to save the exported weights. Will be saved in the save_format format.

--dtype

Possible choices: float32, float16

Data type to save float weights as. If quantize_int4 is True, this is the data type of the quantization scales. Default is float32.

Default: “float32”

--pack_weights

Whether to pack the weights. If True, the weights for each module type will be packed into a single array.

Default: False

--quantize_int4

Quantize the weights to int4 using blockwise quantization.

Default: False

int4 quantization options
--int4_block_size

Possible choices: 16, 32, 64, 128, 256

Block size for int4 quantization. Default is 32.

Default: 32

--int4_quantization_mode

Possible choices: symmetric, asymmetric

Quantization mode for int4 quantization. Default is symmetric.

Default: “symmetric”
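
For example, a hypothetical export of fine-tuned adapters to int4-quantized safetensors might look like this; the paths are placeholders:

    olive export-adapters --adapter_path <adapter_path> --save_format safetensors \
        --output_path <output_path> --dtype float16 --quantize_int4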

configure-qualcomm-sdk

Configure Qualcomm SDK for Olive

olive configure-qualcomm-sdk [-h] --py_version {3.6,3.8} --sdk {snpe,qnn}
Named Arguments
--py_version

Possible choices: 3.6, 3.8

Python version: Use 3.6 for tensorflow 1.15 and 3.8 otherwise

--sdk

Possible choices: snpe, qnn

Qualcomm SDK: snpe or qnn
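
For example, to configure the QNN SDK with Python 3.8:

    olive configure-qualcomm-sdk --py_version 3.8 --sdk qnn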

manage-aml-compute

Create new compute in your AzureML workspace

olive manage-aml-compute [-h] (--create | --delete)
                         [--subscription_id SUBSCRIPTION_ID]
                         [--resource_group RESOURCE_GROUP]
                         [--workspace_name WORKSPACE_NAME]
                         [--aml_config_path AML_CONFIG_PATH] --compute_name
                         COMPUTE_NAME [--vm_size VM_SIZE]
                         [--location LOCATION] [--min_nodes MIN_NODES]
                         [--max_nodes MAX_NODES]
                         [--idle_time_before_scale_down IDLE_TIME_BEFORE_SCALE_DOWN]
Named Arguments
--create, -c

Create new compute

Default: False

--delete, -d

Delete existing compute

Default: False

--subscription_id

Azure subscription ID

--resource_group

Name of the Azure resource group

--workspace_name

Name of the AzureML workspace

--aml_config_path

Path to AzureML config file. If provided, subscription_id, resource_group and workspace_name are ignored

--compute_name

Name of the new compute

--vm_size

VM size of the new compute. This is required if you are creating a compute instance

--location

Location of the new compute. This is required if you are creating a compute instance

--min_nodes

Minimum number of nodes

Default: 0

--max_nodes

Maximum number of nodes

Default: 2

--idle_time_before_scale_down

Idle seconds before scaledown

Default: 120
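
For example, a hypothetical invocation that creates a new compute cluster might look like this; all names and values are placeholders:

    olive manage-aml-compute --create --subscription_id <subscription_id> \
        --resource_group <resource_group> --workspace_name <workspace_name> \
        --compute_name <compute_name> --vm_size <vm_size> --location <location> \
        --min_nodes 0 --max_nodes 2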

tune-session-params

Automatically tune the session parameters for a given ONNX model. Currently, for an ONNX model converted from a Huggingface model and used for generative tasks, the user can simply provide --model onnx_model_path --hf_model_name hf_model_name --device device_type to get the tuned session parameters.

olive tune-session-params [-h] [--log_level LOG_LEVEL] [-m MODEL_NAME_OR_PATH]
                          [--trust_remote_code] [-t TASK]
                          [--model_script MODEL_SCRIPT]
                          [--script_dir SCRIPT_DIR]
                          [--data_config_path DATA_CONFIG_PATH]
                          [--predict_with_kv_cache]
                          [--hf_model_name HF_MODEL_NAME]
                          [--batch_size BATCH_SIZE] [--seq_len SEQ_LEN]
                          [--past_seq_len PAST_SEQ_LEN]
                          [--max_seq_len MAX_SEQ_LEN] [--shared_kv]
                          [--generative]
                          [--ort_past_key_name ORT_PAST_KEY_NAME]
                          [--ort_past_value_name ORT_PAST_VALUE_NAME]
                          [--max_samples MAX_SAMPLES]
                          [--fields_no_batch [FIELDS_NO_BATCH [FIELDS_NO_BATCH ...]]]
                          [--device {gpu,cpu}] [--cpu_cores CPU_CORES]
                          [--io_bind] [--enable_cuda_graph]
                          [--providers_list [PROVIDERS_LIST [PROVIDERS_LIST ...]]]
                          [--execution_mode_list [EXECUTION_MODE_LIST [EXECUTION_MODE_LIST ...]]]
                          [--opt_level_list [OPT_LEVEL_LIST [OPT_LEVEL_LIST ...]]]
                          [--trt_fp16_enable]
                          [--intra_thread_num_list [INTRA_THREAD_NUM_LIST [INTRA_THREAD_NUM_LIST ...]]]
                          [--inter_thread_num_list [INTER_THREAD_NUM_LIST [INTER_THREAD_NUM_LIST ...]]]
                          [--extra_session_config EXTRA_SESSION_CONFIG]
                          [--disable_force_evaluate_other_eps]
                          [--enable_profiling] [--output_path OUTPUT_PATH]
                          [--tempdir TEMPDIR]
                          [--resource_group RESOURCE_GROUP]
                          [--workspace_name WORKSPACE_NAME]
                          [--keyvault_name KEYVAULT_NAME]
                          [--aml_compute AML_COMPUTE]
Named Arguments
--output_path

Path to save the tuned inference settings.

Default: “perf_tuning_output”

--tempdir

Root directory for tempfile directories and files

logging options
--log_level

Logging level. Default is 3. level 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL

Default: 3

Model options
-m, --model_name_or_path

The model checkpoint for weights initialization. If using an AzureML Registry model, provide the model path as ‘registry_name:model_name:version’.

--trust_remote_code

Trust remote code when loading a model.

Default: False

-t, --task

Task for which the model is used.

--model_script

The script file containing the model definition. Required for PyTorch model.

--script_dir

The directory containing the model script file.

dataset options, which are mutually exclusive with the huggingface dataset options
--data_config_path

Path to the data config file. It allows customizing the data config (json/yaml) for the model.

huggingface dataset options. If the dataset options are not provided, the user should provide the following options to modify the default data config. Please refer to olive.data.container.TransformersTokenDummyDataContainer for more details.
--predict_with_kv_cache

Whether to use key-value cache for perf_tuning

Default: False

--hf_model_name

Huggingface model name used to load model configs from huggingface.

--batch_size

Batch size of the input data.

--seq_len

Sequence length to use for the input data.

--past_seq_len

Past sequence length to use for the input data.

--max_seq_len

Max sequence length to use for the input data.

--shared_kv

Whether to enable shared kv cache in the input data.

Default: False

--generative

Whether to enable generative mode in the input data.

Default: False

--ort_past_key_name

Past key name for the input data.

--ort_past_value_name

Past value name for the input data.

--max_samples

Max samples to use for the input data.

--fields_no_batch

List of fields that should not be batched.

pass options
--device

Possible choices: gpu, cpu

Device to use for the model.

Default: “cpu”

--cpu_cores

CPU cores used for thread tuning.

--io_bind

Whether to enable IOBinding search for ONNX Runtime inference.

Default: False

--enable_cuda_graph

Whether to enable CUDA Graph for the CUDA execution provider.

Default: False

--providers_list

List of execution providers to use for ONNX model. They are case sensitive. If not provided, all available providers will be used.

--execution_mode_list

Parallelism list between operators.

--opt_level_list

Optimization level list for ONNX Model.

--trt_fp16_enable

Enable TensorRT FP16 mode.

Default: False

--intra_thread_num_list

List of intra thread number for test.

--inter_thread_num_list

List of inter thread number for test.

--extra_session_config

Extra customized session options during the tuning process. It should be a json string. E.g. --extra_session_config '{"key1": "value1", "key2": "value2"}'

--disable_force_evaluate_other_eps

Whether to force evaluation of all execution providers that are different from the associated execution provider.

Default: False

--enable_profiling

Whether to enable profiling for ONNX Runtime inference.

Default: False

remote options
--resource_group

Resource group for the AzureML workspace.

--workspace_name

Workspace name for the AzureML workspace.

--keyvault_name

The AzureML keyvault name with the Huggingface token to use for a remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.

--aml_compute

The compute name to run the workflow on.
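
For example, a hypothetical tuning run for an ONNX model converted from a Huggingface model might look like this; the paths and names are placeholders:

    olive tune-session-params -m <onnx_model_path> --hf_model_name <hf_model_name> --device gpu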

cloud-cache

Cloud cache model operations

olive cloud-cache [-h] [--delete] --account ACCOUNT --container CONTAINER
                  --model_hash MODEL_HASH
Named Arguments
--delete

Delete a model cache from the cloud cache.

Default: False

--account

The account name for the cloud cache.

--container

The container name for the cloud cache.

--model_hash

The model hash to remove from the cloud cache.
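
For example, a hypothetical deletion of a cached model might look like this; the account, container, and hash are placeholders:

    olive cloud-cache --delete --account <account_name> --container <container_name> --model_hash <model_hash>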