Olive provides command line tools that can be invoked using the olive
command.
Run
Run the Olive workflow defined in the input .json configuration file.
usage: olive run [-h] --run-config RUN_CONFIG [--setup] [--packages]
[--tempdir TEMPDIR] [--package-config PACKAGE_CONFIG]
Named Arguments
- --run-config, --config
Path to json config file
- --setup
Set up the environment needed to run the workflow
Default: False
- --packages
List packages required to run the workflow
Default: False
- --tempdir
Root directory for tempfile directories and files
- --package-config
For advanced users. Path to optional package (json) config file with location of individual pass module implementation and corresponding dependencies. Configuration might also include user owned/proprietary/private pass implementations.
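For example, assuming my_workflow.json is a placeholder for your own workflow configuration file:

# Run the workflow defined in the config file
olive run --run-config my_workflow.json
# Set up the environment needed to run the workflow
olive run --run-config my_workflow.json --setup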
Finetune
Fine-tune a model on a dataset using HuggingFace peft. Huggingface training arguments can be provided along with the defined options.
usage: olive finetune [-h] -m MODEL_NAME_OR_PATH [-t TASK]
[--trust_remote_code]
[--is_generative_model IS_GENERATIVE_MODEL]
[-o OUTPUT_PATH] [--method {lora,qlora}]
[--lora_r LORA_R] [--lora_alpha LORA_ALPHA]
[--target_modules TARGET_MODULES]
[--torch_dtype {bfloat16,float16,float32}] -d DATA_NAME
[--train_subset TRAIN_SUBSET]
[--train_split TRAIN_SPLIT] [--eval_subset EVAL_SUBSET]
[--eval_split EVAL_SPLIT] [--data_files DATA_FILES]
[--text_field TEXT_FIELD | --text_template TEXT_TEMPLATE]
[--max_seq_len MAX_SEQ_LEN]
[--add_special_tokens ADD_SPECIAL_TOKENS]
[--max_samples MAX_SAMPLES] [--batch_size BATCH_SIZE]
[--resource_group RESOURCE_GROUP]
[--workspace_name WORKSPACE_NAME]
[--keyvault_name KEYVAULT_NAME]
[--aml_compute AML_COMPUTE]
[--account_name ACCOUNT_NAME]
[--container_name CONTAINER_NAME]
[--log_level LOG_LEVEL]
Named Arguments
- -m, --model_name_or_path
Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.
- -t, --task
Task for which the huggingface model is used.
- --trust_remote_code
Trust remote code when loading a huggingface model.
Default: False
- --is_generative_model
Is this a generative model?
Default: True
- -o, --output_path
Path to save the command output.
Default: finetuned-adapter
- --torch_dtype
Possible choices: bfloat16, float16, float32
The torch dtype to use for training.
Default: “bfloat16”
- -d, --data_name
The dataset name.
- --train_subset
The subset to use for training.
- --train_split
The split to use for training.
Default: “train”
- --eval_subset
The subset to use for evaluation.
- --eval_split
The dataset split to evaluate on.
Default: “”
- --data_files
The dataset files. If multiple files, separate by comma.
- --text_field
The text field to use for fine-tuning.
- --text_template
Template to generate text field from. E.g. '### Question: {prompt} \n### Answer: {response}'
- --max_seq_len
Maximum sequence length for the data.
Default: 1024
- --add_special_tokens
Whether to add special tokens during preprocessing.
Default: False
- --max_samples
Maximum samples to select from the dataset.
Default: 256
- --batch_size
Batch size.
Default: 1
- --resource_group
Resource group for the AzureML workspace to run the workflow remotely.
- --workspace_name
Workspace name for the AzureML workspace to run the workflow remotely.
- --keyvault_name
The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.
- --aml_compute
The compute name to run the workflow on.
- --account_name
Azure storage account name for shared cache.
- --container_name
Azure storage container name for shared cache.
- --log_level
Logging level. 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL.
Default: 3
LoRA options
- --method
Possible choices: lora, qlora
The method to use for fine-tuning
Default: “lora”
- --lora_r
LoRA R value.
Default: 64
- --lora_alpha
LoRA alpha value.
Default: 16
- --target_modules
The target modules for LoRA. If multiple, separate by comma.
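For example, a minimal sketch of a QLoRA fine-tuning run. The dataset name is a placeholder, and the template assumes the dataset has prompt and response columns:

olive finetune \
    -m microsoft/Phi-3-mini-4k-instruct \
    -d my-org/my-dataset \
    --text_template "### Question: {prompt} \n### Answer: {response}" \
    --method qlora \
    --max_samples 256 \
    -o finetuned-adapter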
Auto-Optimization
Automatically optimize the input model for the given target and precision.
usage: olive auto-opt [-h] [-m MODEL_NAME_OR_PATH] [-t TASK]
[--trust_remote_code] [-a ADAPTER_PATH]
[--model_script MODEL_SCRIPT] [--script_dir SCRIPT_DIR]
[--is_generative_model IS_GENERATIVE_MODEL]
[-o OUTPUT_PATH] [--device {gpu,cpu,npu}]
[--provider {CPUExecutionProvider,CUDAExecutionProvider,DmlExecutionProvider,JsExecutionProvider,MIGraphXExecutionProvider,OpenVINOExecutionProvider,QNNExecutionProvider,ROCMExecutionProvider,TensorrtExecutionProvider,VitisAIExecutionProvider}]
[--memory MEMORY] [-d DATA_NAME] [--split SPLIT]
[--subset SUBSET]
[--input_cols [INPUT_COLS [INPUT_COLS ...]]]
[--batch_size BATCH_SIZE]
[--precision {fp4,fp8,fp16,fp32,int4,int8,int16,int32,nf4}]
[--use_dynamo_exporter] [--use_model_builder]
[--use_qdq_encoding]
[--dynamic-to-fixed-shape-dim-param [DYNAMIC_TO_FIXED_SHAPE_DIM_PARAM [DYNAMIC_TO_FIXED_SHAPE_DIM_PARAM ...]]]
[--dynamic-to-fixed-shape-dim-value [DYNAMIC_TO_FIXED_SHAPE_DIM_VALUE [DYNAMIC_TO_FIXED_SHAPE_DIM_VALUE ...]]]
[--num-splits NUM_SPLITS | --cost-model COST_MODEL]
[--mixed-precision-overrides-config [MIXED_PRECISION_OVERRIDES_CONFIG [MIXED_PRECISION_OVERRIDES_CONFIG ...]]]
[--use_ort_genai]
[--enable_search [{exhaustive,tpe,random}]]
[--seed SEED] [--resource_group RESOURCE_GROUP]
[--workspace_name WORKSPACE_NAME]
[--keyvault_name KEYVAULT_NAME]
[--aml_compute AML_COMPUTE]
[--account_name ACCOUNT_NAME]
[--container_name CONTAINER_NAME]
[--log_level LOG_LEVEL]
Named Arguments
- -m, --model_name_or_path
Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.
- -t, --task
Task for which the huggingface model is used.
- --trust_remote_code
Trust remote code when loading a huggingface model.
Default: False
- -a, --adapter_path
Path to the adapter weights saved after peft fine-tuning. Local folder or huggingface id.
- --model_script
The script file containing the model definition. Required for the local PyTorch model.
- --script_dir
The directory containing the local PyTorch model script file. See https://microsoft.github.io/Olive/features/cli.html#model-script-file-information for more information.
- --is_generative_model
Is this a generative model?
Default: True
- -o, --output_path
Path to save the command output.
Default: auto-opt-output
- --device
Possible choices: gpu, cpu, npu
Target device to run the model. Default is cpu.
Default: “cpu”
- --provider
Possible choices: CPUExecutionProvider, CUDAExecutionProvider, DmlExecutionProvider, JsExecutionProvider, MIGraphXExecutionProvider, OpenVINOExecutionProvider, QNNExecutionProvider, ROCMExecutionProvider, TensorrtExecutionProvider, VitisAIExecutionProvider
Execution provider to use for ONNX model. Default is CPUExecutionProvider.
Default: “CPUExecutionProvider”
- --memory
Memory limit for the accelerator in bytes. Default is None.
- -d, --data_name
The dataset name.
- --split
The dataset split to use for evaluation.
- --subset
The dataset subset to use for evaluation.
- --input_cols
The input columns to use for evaluation.
- --batch_size
Batch size for evaluation.
Default: 1
- --precision
Possible choices: fp4, fp8, fp16, fp32, int4, int8, int16, int32, nf4
The output precision of the optimized model. If not specified, the default precision is fp32 for cpu and fp16 for gpu
Default: “fp32”
- --use_dynamo_exporter
Whether to use dynamo_export API to export ONNX model.
Default: False
- --use_model_builder
Whether to use the Model Builder pass for optimization. Enable only when the model is supported by Model Builder.
Default: False
- --use_qdq_encoding
Whether to use QDQ encoding for quantized operators instead of ONNXRuntime contrib operators like MatMulNBits
Default: False
- --dynamic-to-fixed-shape-dim-param
Symbolic parameter names to use for dynamic to fixed shape pass. Required only when using QNNExecutionProvider.
- --dynamic-to-fixed-shape-dim-value
Symbolic parameter values to use for dynamic to fixed shape pass. Required only when using QNNExecutionProvider.
- --num-splits
Number of splits to use for model splitting. Input model must be an HfModel.
- --cost-model
Path to the cost model csv file to use for model splitting. Mutually exclusive with num-splits. Must be a csv with headers module,num_params,num_bytes where each row corresponds to the name of a module (with no children), the number of parameters, and the number of bytes the module uses when in the desired precision.
- --mixed-precision-overrides-config
Dictionary of name to precision. Must contain an even number of entries, where even-indexed entries are the keys and odd-indexed entries are the values. Required only when the output precision is "fp16" and the MixedPrecisionOverrides pass is enabled.
- --use_ort_genai
Use OnnxRuntime generate() API to run the model
Default: False
- --enable_search
Possible choices: exhaustive, tpe, random
Enable search to produce the optimal model for the given criteria. Optionally provide a search algorithm from the available choices. The exhaustive search algorithm is used by default.
- --seed
Random seed for search algorithm
Default: 0
- --resource_group
Resource group for the AzureML workspace to run the workflow remotely.
- --workspace_name
Workspace name for the AzureML workspace to run the workflow remotely.
- --keyvault_name
The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.
- --aml_compute
The compute name to run the workflow on.
- --account_name
Azure storage account name for shared cache.
- --container_name
Azure storage container name for shared cache.
- --log_level
Logging level. 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL.
Default: 3
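For example, a sketch that optimizes a HuggingFace model for CPU. The model name matches the example under Providing Input Models; the output path is a placeholder:

olive auto-opt \
    -m microsoft/Phi-3-mini-4k-instruct \
    --device cpu \
    --provider CPUExecutionProvider \
    --precision int4 \
    -o optimized-model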
Quantization
Quantize PyTorch or ONNX model using various Quantization algorithms.
usage: olive quantize [-h] [-m MODEL_NAME_OR_PATH] [-t TASK]
[--trust_remote_code] [-a ADAPTER_PATH]
[--model_script MODEL_SCRIPT] [--script_dir SCRIPT_DIR]
[--is_generative_model IS_GENERATIVE_MODEL]
[-o OUTPUT_PATH] --algorithm {awq,dynamic,gptq,hqq,rtn}
[--precision {int4,int8,int16,uint4,uint8,uint16,fp4,fp8,fp16,nf4}]
[--implementation {awq,bnb4,gptq,inc_dynamic,matmul4,mnb_to_qdq,nvmo,onnx_dynamic,quarot}]
[--enable-qdq-encoding] [--quarot_rotate] [-d DATA_NAME]
[--subset SUBSET] [--split SPLIT]
[--data_files DATA_FILES]
[--text_field TEXT_FIELD | --text_template TEXT_TEMPLATE]
[--max_seq_len MAX_SEQ_LEN]
[--add_special_tokens ADD_SPECIAL_TOKENS]
[--max_samples MAX_SAMPLES] [--batch_size BATCH_SIZE]
[--resource_group RESOURCE_GROUP]
[--workspace_name WORKSPACE_NAME]
[--keyvault_name KEYVAULT_NAME]
[--aml_compute AML_COMPUTE]
[--account_name ACCOUNT_NAME]
[--container_name CONTAINER_NAME]
[--log_level LOG_LEVEL]
Named Arguments
- -m, --model_name_or_path
Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.
- -t, --task
Task for which the huggingface model is used.
- --trust_remote_code
Trust remote code when loading a huggingface model.
Default: False
- -a, --adapter_path
Path to the adapter weights saved after peft fine-tuning. Local folder or huggingface id.
- --model_script
The script file containing the model definition. Required for the local PyTorch model.
- --script_dir
The directory containing the local PyTorch model script file. See https://microsoft.github.io/Olive/features/cli.html#model-script-file-information for more information.
- --is_generative_model
Is this a generative model?
Default: True
- -o, --output_path
Path to save the command output.
Default: quantized-model
- --algorithm
Possible choices: awq, dynamic, gptq, hqq, rtn
List of quantization algorithms to run.
- --precision
Possible choices: int4, int8, int16, uint4, uint8, uint16, fp4, fp8, fp16, nf4
The precision of the quantized model.
Default: “int4”
- --implementation
Possible choices: awq, bnb4, gptq, inc_dynamic, matmul4, mnb_to_qdq, nvmo, onnx_dynamic, quarot
The specific implementation of quantization algorithms to use.
- --enable-qdq-encoding
Use QDQ encoding in ONNX model for the quantized nodes.
Default: False
- --quarot_rotate
Apply QuaRot/Hadamard rotation to the model.
Default: False
- -d, --data_name
The dataset name.
- --subset
The subset of the dataset to use.
- --split
The dataset split to use.
- --data_files
The dataset files. If multiple files, separate by comma.
- --text_field
The text field to use for fine-tuning.
- --text_template
Template to generate text field from. E.g. '### Question: {prompt} \n### Answer: {response}'
- --max_seq_len
Maximum sequence length for the data.
Default: 1024
- --add_special_tokens
Whether to add special tokens during preprocessing.
Default: False
- --max_samples
Maximum samples to select from the dataset.
Default: 256
- --batch_size
Batch size.
Default: 1
- --resource_group
Resource group for the AzureML workspace to run the workflow remotely.
- --workspace_name
Workspace name for the AzureML workspace to run the workflow remotely.
- --keyvault_name
The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.
- --aml_compute
The compute name to run the workflow on.
- --account_name
Azure storage account name for shared cache.
- --container_name
Azure storage container name for shared cache.
- --log_level
Logging level. 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL.
Default: 3
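For example, a sketch using the data-free rtn algorithm (model name and output path are placeholders). Data-dependent algorithms such as awq or gptq typically also need a calibration dataset supplied via -d:

olive quantize \
    -m microsoft/Phi-3-mini-4k-instruct \
    --algorithm rtn \
    --precision int4 \
    -o quantized-model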
Capture Onnx Graph
Capture ONNX graph using PyTorch Exporter or Model Builder from the Huggingface model or PyTorch model.
usage: olive capture-onnx-graph [-h] [-m MODEL_NAME_OR_PATH] [-t TASK]
[--trust_remote_code] [-a ADAPTER_PATH]
[--model_script MODEL_SCRIPT]
[--script_dir SCRIPT_DIR]
[--is_generative_model IS_GENERATIVE_MODEL]
[-o OUTPUT_PATH]
[--conversion_device {cpu,gpu}]
[--use_dynamo_exporter]
[--past_key_value_name PAST_KEY_VALUE_NAME]
[--torch_dtype TORCH_DTYPE]
[--target_opset TARGET_OPSET]
[--use_model_builder]
[--precision {fp16,fp32,int4}]
[--int4_block_size {16,32,64,128,256}]
[--int4_accuracy_level INT4_ACCURACY_LEVEL]
[--exclude_embeds EXCLUDE_EMBEDS]
[--exclude_lm_head EXCLUDE_LM_HEAD]
[--enable_cuda_graph ENABLE_CUDA_GRAPH]
[--use_ort_genai]
[--resource_group RESOURCE_GROUP]
[--workspace_name WORKSPACE_NAME]
[--keyvault_name KEYVAULT_NAME]
[--aml_compute AML_COMPUTE]
[--log_level LOG_LEVEL]
[--account_name ACCOUNT_NAME]
[--container_name CONTAINER_NAME]
Named Arguments
- -m, --model_name_or_path
Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.
- -t, --task
Task for which the huggingface model is used.
- --trust_remote_code
Trust remote code when loading a huggingface model.
Default: False
- -a, --adapter_path
Path to the adapter weights saved after peft fine-tuning. Local folder or huggingface id.
- --model_script
The script file containing the model definition. Required for the local PyTorch model.
- --script_dir
The directory containing the local PyTorch model script file. See https://microsoft.github.io/Olive/features/cli.html#model-script-file-information for more information.
- --is_generative_model
Is this a generative model?
Default: True
- -o, --output_path
Path to save the command output.
Default: onnx-model
- --conversion_device
Possible choices: cpu, gpu
The device used to run the model to capture the ONNX graph.
Default: “cpu”
- --use_ort_genai
Use OnnxRuntime generate() API to run the model
Default: False
- --resource_group
Resource group for the AzureML workspace to run the workflow remotely.
- --workspace_name
Workspace name for the AzureML workspace to run the workflow remotely.
- --keyvault_name
The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.
- --aml_compute
The compute name to run the workflow on.
- --log_level
Logging level. 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL.
Default: 3
- --account_name
Azure storage account name for shared cache.
- --container_name
Azure storage container name for shared cache.
PyTorch Exporter options
- --use_dynamo_exporter
Whether to use dynamo_export API to export ONNX model.
Default: False
- --past_key_value_name
The argument name that points to the past key values. For a model loaded from huggingface, it is 'past_key_values'. Used only when use_dynamo_exporter is True.
Default: “past_key_values”
- --torch_dtype
The dtype to cast the model to before capturing the ONNX graph, e.g. 'float32' or 'float16'. If not specified, the model is used as is.
- --target_opset
The target opset version for the ONNX model. Default is 17.
Default: 17
Model Builder options
- --use_model_builder
Whether to use Model Builder to capture ONNX model.
Default: False
- --precision
Possible choices: fp16, fp32, int4
The precision of the ONNX model. This is used by Model Builder
Default: “fp16”
- --int4_block_size
Possible choices: 16, 32, 64, 128, 256
Specify the block_size for int4 quantization. Acceptable values: 16/32/64/128/256.
- --int4_accuracy_level
Specify the minimum accuracy level for activation of MatMul in int4 quantization.
- --exclude_embeds
Remove embedding layer from your ONNX model.
Default: False
- --exclude_lm_head
Remove language modeling head from your ONNX model.
Default: False
- --enable_cuda_graph
Allow the model to use CUDA graph capture with the CUDA execution provider. If enabled, all nodes must be placed on the CUDA EP for the CUDA graph to work correctly.
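For example (model.pt and script.py are placeholders; the second command is the local PyTorch example repeated under Providing Input Models):

# Capture the ONNX graph of a HuggingFace model
olive capture-onnx-graph -m microsoft/Phi-3-mini-4k-instruct -o onnx-model
# Capture a local PyTorch model together with its model script
olive capture-onnx-graph -m model.pt --model_script script.py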
Generate Adapters
Generate ONNX model with adapters as inputs.
usage: olive generate-adapter [-h] -m MODEL_NAME_OR_PATH
[--is_generative_model IS_GENERATIVE_MODEL]
[-o OUTPUT_PATH]
[--adapter_format {pt,numpy,safetensors,onnx_adapter}]
[--resource_group RESOURCE_GROUP]
[--workspace_name WORKSPACE_NAME]
[--keyvault_name KEYVAULT_NAME]
[--aml_compute AML_COMPUTE]
[--log_level LOG_LEVEL]
[--account_name ACCOUNT_NAME]
[--container_name CONTAINER_NAME]
Named Arguments
- -m, --model_name_or_path
Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.
- --is_generative_model
Is this a generative model?
Default: True
- -o, --output_path
Path to save the command output.
Default: optimized-model
- --adapter_format
Possible choices: pt, numpy, safetensors, onnx_adapter
Format to save the weights in. Default is onnx_adapter.
Default: “onnx_adapter”
- --resource_group
Resource group for the AzureML workspace to run the workflow remotely.
- --workspace_name
Workspace name for the AzureML workspace to run the workflow remotely.
- --keyvault_name
The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.
- --aml_compute
The compute name to run the workflow on.
- --log_level
Logging level. 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL.
Default: 3
- --account_name
Azure storage account name for shared cache.
- --container_name
Azure storage container name for shared cache.
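For example, assuming onnx-model is the output folder of a previous capture-onnx-graph run on a model with LoRA adapters (a placeholder path):

olive generate-adapter -m onnx-model --adapter_format onnx_adapter -o optimized-model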
Convert Adapters
Convert LoRA adapter weights to a file that can be consumed by ONNX models generated by the Olive ExtractedAdapters pass.
usage: olive convert-adapters [-h] -a ADAPTER_PATH
[--adapter_format {pt,numpy,safetensors,onnx_adapter}]
-o OUTPUT_PATH [--dtype {float32,float16}]
[--quantize_int4]
[--int4_block_size {16,32,64,128,256}]
[--int4_quantization_mode {symmetric,asymmetric}]
[--log_level LOG_LEVEL]
Named Arguments
- -a, --adapter_path
Path to the adapter weights saved after peft fine-tuning. Can be a local folder or huggingface id.
- --adapter_format
Possible choices: pt, numpy, safetensors, onnx_adapter
Format to save the weights in. Default is onnx_adapter.
Default: “onnx_adapter”
- -o, --output_path
Path to save the exported weights. The weights will be saved in the format specified by adapter_format.
- --dtype
Possible choices: float32, float16
Data type to save float adapter weights as. If quantize_int4 is True, this is the data type of the quantization scales. Default is float32.
Default: “float32”
- --quantize_int4
Quantize the adapter weights to int4 using blockwise quantization.
Default: False
- --int4_block_size
Possible choices: 16, 32, 64, 128, 256
Block size for int4 quantization of adapter weights. Default is 32.
Default: 32
- --int4_quantization_mode
Possible choices: symmetric, asymmetric
Quantization mode for int4 quantization of adapter weights. Default is symmetric.
Default: “symmetric”
- --log_level
Logging level. 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL.
Default: 3
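For example, converting the adapter produced by olive finetune (finetuned-adapter is that command's default output folder; the output path is a placeholder):

olive convert-adapters \
    -a finetuned-adapter \
    --adapter_format onnx_adapter \
    -o adapter-weights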
Tune OnnxRuntime Session Params
Automatically tune the OnnxRuntime session parameters for a given onnx model. Currently, for an onnx model converted from a huggingface model and used for generative tasks, the user can simply provide --model onnx_model_path --hf_model_name hf_model_name --device device_type to get the tuned session parameters.
usage: olive tune-session-params [-h] -m MODEL_NAME_OR_PATH
[--is_generative_model IS_GENERATIVE_MODEL]
[-o OUTPUT_PATH] [--cpu_cores CPU_CORES]
[--io_bind] [--enable_cuda_graph]
[--execution_mode_list [EXECUTION_MODE_LIST [EXECUTION_MODE_LIST ...]]]
[--opt_level_list [OPT_LEVEL_LIST [OPT_LEVEL_LIST ...]]]
[--trt_fp16_enable]
[--intra_thread_num_list [INTRA_THREAD_NUM_LIST [INTRA_THREAD_NUM_LIST ...]]]
[--inter_thread_num_list [INTER_THREAD_NUM_LIST [INTER_THREAD_NUM_LIST ...]]]
[--extra_session_config EXTRA_SESSION_CONFIG]
[--disable_force_evaluate_other_eps]
[--enable_profiling]
[--predict_with_kv_cache]
[--device {gpu,cpu,npu}]
[--providers_list [{CPUExecutionProvider,CUDAExecutionProvider,DmlExecutionProvider,JsExecutionProvider,MIGraphXExecutionProvider,OpenVINOExecutionProvider,QNNExecutionProvider,ROCMExecutionProvider,TensorrtExecutionProvider,VitisAIExecutionProvider} [{CPUExecutionProvider,CUDAExecutionProvider,DmlExecutionProvider,JsExecutionProvider,MIGraphXExecutionProvider,OpenVINOExecutionProvider,QNNExecutionProvider,ROCMExecutionProvider,TensorrtExecutionProvider,VitisAIExecutionProvider} ...]]]
[--memory MEMORY]
[--resource_group RESOURCE_GROUP]
[--workspace_name WORKSPACE_NAME]
[--keyvault_name KEYVAULT_NAME]
[--aml_compute AML_COMPUTE]
[--log_level LOG_LEVEL]
[--account_name ACCOUNT_NAME]
[--container_name CONTAINER_NAME]
Named Arguments
- -m, --model_name_or_path
Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.
- --is_generative_model
Is this a generative model?
Default: True
- -o, --output_path
Path to save the command output.
Default: tuned-inference-settings
- --cpu_cores
CPU cores used for thread tuning.
- --io_bind
Whether to enable IOBinding search for ONNX Runtime inference.
Default: False
- --enable_cuda_graph
Whether to enable CUDA Graph for the CUDA execution provider.
Default: False
- --execution_mode_list
List of execution modes to test (sequential or parallel execution of operators).
- --opt_level_list
List of graph optimization levels to test for the ONNX model.
- --trt_fp16_enable
Enable TensorRT FP16 mode.
Default: False
- --intra_thread_num_list
List of intra-op thread counts to test.
- --inter_thread_num_list
List of inter-op thread counts to test.
- --extra_session_config
Extra customized session options to use during the tuning process, given as a JSON string. E.g. --extra_session_config '{"key1": "value1", "key2": "value2"}'
- --disable_force_evaluate_other_eps
Whether to force the evaluation of all execution providers that are different from the associated execution provider.
Default: False
- --enable_profiling
Whether to enable profiling for ONNX Runtime inference.
Default: False
- --predict_with_kv_cache
Whether to use key-value cache for ORT session parameter tuning
Default: False
- --device
Possible choices: gpu, cpu, npu
Target device to run the model. Default is cpu.
Default: “cpu”
- --providers_list
Possible choices: CPUExecutionProvider, CUDAExecutionProvider, DmlExecutionProvider, JsExecutionProvider, MIGraphXExecutionProvider, OpenVINOExecutionProvider, QNNExecutionProvider, ROCMExecutionProvider, TensorrtExecutionProvider, VitisAIExecutionProvider
List of execution providers to use for ONNX model. They are case sensitive. If not provided, all available providers will be used.
- --memory
Memory limit for the accelerator in bytes. Default is None.
- --resource_group
Resource group for the AzureML workspace to run the workflow remotely.
- --workspace_name
Workspace name for the AzureML workspace to run the workflow remotely.
- --keyvault_name
The azureml keyvault name with huggingface token to use for remote run. Refer to https://microsoft.github.io/Olive/features/huggingface_model_optimization.html#huggingface-login for more details.
- --aml_compute
The compute name to run the workflow on.
- --log_level
Logging level. 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR, 4: CRITICAL.
Default: 3
- --account_name
Azure storage account name for shared cache.
- --container_name
Azure storage container name for shared cache.
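For example, a sketch that tunes session parameters for a local ONNX model on CPU (onnx-model is a placeholder path):

olive tune-session-params \
    -m onnx-model \
    --device cpu \
    --providers_list CPUExecutionProvider \
    -o tuned-inference-settings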
Generate Cost Model for Model Splitting
Generate a cost model for a given model and save it as a csv file. This cost model is consumed by the CaptureSplitInfo pass. Only supports HfModel.
usage: olive generate-cost-model [-h] -m MODEL_NAME_OR_PATH [-t TASK]
[--trust_remote_code]
[--is_generative_model IS_GENERATIVE_MODEL]
[-o OUTPUT_PATH]
[-p {fp32,fp16,fp8,int32,uint32,int16,uint16,int8,uint8,int4,uint4,nf4,fp4}]
Named Arguments
- -m, --model_name_or_path
Path to the input model. See https://microsoft.github.io/Olive/features/cli.html#providing-input-models for more information.
- -t, --task
Task for which the huggingface model is used.
- --trust_remote_code
Trust remote code when loading a huggingface model.
Default: False
- --is_generative_model
Is this a generative model?
Default: True
- -o, --output_path
Path to save the command output.
Default: “cost-model.csv”
- -p, --weight_precision
Possible choices: fp32, fp16, fp8, int32, uint32, int16, uint16, int8, uint8, int4, uint4, nf4, fp4
Weight precision
Default: “fp16”
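For example (the model name is the example model used elsewhere on this page; the output path is the default file name):

olive generate-cost-model -m microsoft/Phi-3-mini-4k-instruct -p fp16 -o cost-model.csv

The resulting csv can then be passed to olive auto-opt through its --cost-model option.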
Qualcomm SDK
Configure Qualcomm SDK.
usage: olive configure-qualcomm-sdk [-h] --py_version {3.6,3.8} --sdk
{snpe,qnn}
Named Arguments
- --py_version
Possible choices: 3.6, 3.8
Python version: Use 3.6 for tensorflow 1.15 and 3.8 otherwise
- --sdk
Possible choices: snpe, qnn
Qualcomm SDK: snpe or qnn
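For example, to configure the QNN SDK with Python 3.8:

olive configure-qualcomm-sdk --py_version 3.8 --sdk qnn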
AzureML
Manage the AzureML Compute resources.
usage: olive manage-aml-compute [-h] (--create | --delete)
[--subscription_id SUBSCRIPTION_ID]
[--resource_group RESOURCE_GROUP]
[--workspace_name WORKSPACE_NAME]
[--aml_config_path AML_CONFIG_PATH]
--compute_name COMPUTE_NAME
[--vm_size VM_SIZE] [--location LOCATION]
[--min_nodes MIN_NODES]
[--max_nodes MAX_NODES]
[--idle_time_before_scale_down IDLE_TIME_BEFORE_SCALE_DOWN]
Named Arguments
- --create, -c
Create new compute
Default: False
- --delete, -d
Delete existing compute
Default: False
- --subscription_id
Azure subscription ID
- --resource_group
Name of the Azure resource group
- --workspace_name
Name of the AzureML workspace
- --aml_config_path
Path to AzureML config file. If provided, subscription_id, resource_group and workspace_name are ignored
- --compute_name
Name of the new compute
- --vm_size
VM size of the new compute. This is required if you are creating a compute instance
- --location
Location of the new compute. This is required if you are creating a compute instance
- --min_nodes
Minimum number of nodes
Default: 0
- --max_nodes
Maximum number of nodes
Default: 2
- --idle_time_before_scale_down
Idle seconds before scaledown
Default: 120
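For example, a sketch that creates a compute target and later deletes it. The subscription, resource group, workspace, compute name, VM size, and location are all placeholders:

# Create a new compute target
olive manage-aml-compute --create \
    --subscription_id <subscription_id> \
    --resource_group my-resource-group \
    --workspace_name my-workspace \
    --compute_name my-compute \
    --vm_size Standard_NC6s_v3 \
    --location eastus \
    --min_nodes 0 --max_nodes 2
# Delete the compute when it is no longer needed
olive manage-aml-compute --delete \
    --subscription_id <subscription_id> \
    --resource_group my-resource-group \
    --workspace_name my-workspace \
    --compute_name my-compute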
Providing Input Models
There is more than one way to supply an input model to the Olive commands:
- A HuggingFace model can be used directly as an input model. For example, -m microsoft/Phi-3-mini-4k-instruct.
- A model produced by an Olive command can be used directly as an input model. Specify the model path using the -m <output_model> option, where <output_model> is the output folder defined by -o <output_model> in the previous Olive command.
- Olive commands also accept a local PyTorch model as an input model. Specify the model file path using the -m model.pt option and the associated model script using the --model_script script.py option. For example, olive capture-onnx-graph -m model.pt --model_script script.py.
- A model from the AzureML registry can be used directly as an input model. For example, -m azureml://registries/<registry_name>/models/<model_name>/versions/<version>.
- A locally available ONNX model can also be used as an input for the Olive commands that accept an ONNX model as input.
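For example, a sketch of chaining two commands, where the output folder of the first command is passed as the input model to the second (the dataset name is a placeholder):

olive finetune -m microsoft/Phi-3-mini-4k-instruct -d my-org/my-dataset -o finetuned-model
olive auto-opt -m finetuned-model --device cpu -o optimized-model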
Model Script File Information
Olive commands support a custom PyTorch model as an input. Olive requires users to define specific functions to load and process the custom PyTorch model. These functions should be defined in the model script you provide.
- Model Loader Function (`_model_loader`): Loads the PyTorch model. If the model file path is provided using the -m option, it takes priority over the model loader function.

  def _model_loader():
      ...
      return model

- IO Config Function (`_io_config`): Returns the IO configuration for the model. Either `_io_config` or `_dummy_inputs` is required for the capture-onnx-graph CLI command.

  def _io_config(model: PyTorchModelHandler):
      ...
      return io_config

- Dummy Inputs Function (`_dummy_inputs`): Provides dummy input tensors for the model. Either `_io_config` or `_dummy_inputs` is required for the capture-onnx-graph CLI command.

  def _dummy_inputs(model: PyTorchModelHandler):
      ...
      return dummy_inputs

- Model Format Function (`_model_file_format`): Specifies the format of the model. The default value is PyTorch.EntireModel. For the other available options, refer to the Olive documentation.

  def _model_file_format():
      ...
      return model_file_format