Python Interface#
Olive provides a Python API for transforming models. All functions are available directly from the olive package.
run(...)#
This is the most generic way to run an Olive workflow from a configuration file.
Arguments:
run_config (str, Path, or dict): Path to config file or config dictionary.
setup (bool): Setup environment needed to run the workflow. Defaults to False.
package_config (str, Path, dict, optional): Path to optional package config file.
tempdir (str or Path, optional): Root directory for tempfile directories and files.
packages (bool): List packages required to run the workflow. Defaults to False.
from olive import run
# Run workflow from a configuration file
workflow_output = run("config.json")
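Because run_config may also be a dictionary, the workflow can be loaded or assembled in Python before running it. The sketch below simply loads the same JSON file into a dictionary; the file name is a placeholder.
import json
from olive import run
# run_config also accepts a dictionary, so the workflow can be loaded or built in Python
with open("config.json") as f:
    config = json.load(f)
workflow_output = run(config)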
The rest of the functions are specialized workflows for common tasks.
optimize(...)#
Optimize the input model with comprehensive pass scheduling.
Arguments:
model_name_or_path (str): Path to the input model (file path or HuggingFace model name).
task (str, optional): Task for which the huggingface model is used. Default task is text-generation-with-past.
trust_remote_code (bool): Trust remote code when loading a huggingface model. Defaults to False.
adapter_path (str, optional): Path to the adapters weights saved after peft fine-tuning. Local folder or huggingface id.
model_script (str, optional): The script file containing the model definition. Required for the local PyTorch model.
script_dir (str, optional): The directory containing the local PyTorch model script file.
output_path (str): Output directory path. Defaults to "optimized-model".
provider (str): Execution provider ("CPUExecutionProvider", "CUDAExecutionProvider", "QNNExecutionProvider", "VitisAIExecutionProvider", "OpenVINOExecutionProvider"). Defaults to "CPUExecutionProvider".
device (str, optional): Target device ("cpu", "gpu", "npu").
precision (str): Target precision ("int4", "int8", "int16", "int32", "uint4", "uint8", "uint16", "uint32", "fp4", "fp8", "fp16", "fp32", "nf4"). Defaults to "fp32".
act_precision (str, optional): Activation precision for quantization.
num_split (int, optional): Number of splits for model splitting.
memory (int, optional): Available device memory in MB.
exporter (str, optional): Exporter to use ("model_builder", "dynamo_exporter", "torchscript_exporter", "optimum_exporter").
dim_param (str, optional): Dynamic parameter names for dynamic to fixed shape conversion.
dim_value (str, optional): Fixed dimension values for dynamic to fixed shape conversion.
use_qdq_format (bool): Use QDQ format for quantization. Defaults to False.
surgeries (list[str], optional): List of graph surgeries to apply.
block_size (int, optional): Block size for quantization. Use -1 for per-channel quantization.
modality (str): Model modality ("text"). Defaults to "text".
enable_aot (bool): Enable Ahead-of-Time (AOT) compilation. Defaults to False.
qnn_env_path (str, optional): Path to QNN environment directory (required when using AOT with QNN).
log_level (int): Logging level (0-4: DEBUG, INFO, WARNING, ERROR, CRITICAL). Defaults to 3.
save_config_file (bool): Generate and save the config file for the command. Defaults to False.
from olive import optimize
workflow_output = optimize(model_name_or_path="path/to/model")
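A fuller call might look like the following sketch, which targets the CUDA execution provider at int4 precision; the model path and output directory are placeholders.
from olive import optimize
# Sketch: optimize a model for GPU with int4 weights (placeholder paths)
workflow_output = optimize(
    model_name_or_path="path/to/model",
    provider="CUDAExecutionProvider",
    device="gpu",
    precision="int4",
    output_path="optimized-model",
)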
quantize(...)#
Quantize a PyTorch or ONNX model.
Arguments:
model_name_or_path (str): Path to the input model (file path or HuggingFace model name).
task (str, optional): Task for which the huggingface model is used. Default task is text-generation-with-past.
trust_remote_code (bool): Trust remote code when loading a huggingface model. Defaults to False.
adapter_path (str, optional): Path to the adapters weights saved after peft fine-tuning. Local folder or huggingface id.
model_script (str, optional): The script file containing the model definition. Required for the local PyTorch model.
script_dir (str, optional): The directory containing the local PyTorch model script file.
output_path (str): Output directory path. Defaults to "quantized-model".
algorithm (str): Quantization algorithm ("awq", "gptq", "hqq", "rtn", "spinquant", "quarot"). Defaults to "rtn".
precision (str): Quantization precision (int4, int8, int16, int32, uint4, uint8, uint16, uint32, fp4, fp8, fp16, fp32, nf4, PrecisionBits.BITS4, PrecisionBits.BITS8, PrecisionBits.BITS16, PrecisionBits.BITS32). Defaults to "int8".
act_precision (str): Activation precision for static quantization (int4, int8, int16, int32, uint4, uint8, uint16, uint32, fp4, fp8, fp16, fp32, nf4). Defaults to "int8".
implementation (str, optional): Specific implementation of quantization algorithms to use.
use_qdq_encoding (bool): Use QDQ encoding in ONNX model for quantized nodes. Defaults to False.
data_name (str, optional): Dataset name for static quantization.
subset (str, optional): The subset of the dataset to use.
split (str, optional): The dataset split to use.
data_files (str, optional): The dataset files (comma-separated if multiple).
text_field (str, optional): The text field to use for fine-tuning.
text_template (str, optional): Template to generate text field from (e.g., '### Question: {prompt} \n### Answer: {response}').
use_chat_template (bool): Use chat template for text field. Defaults to False.
max_seq_len (int, optional): Maximum sequence length for the data.
add_special_tokens (bool, optional): Whether to add special tokens during preprocessing.
max_samples (int, optional): Maximum samples to select from the dataset.
batch_size (int, optional): Batch size.
input_cols (list[str], optional): List of input column names.
account_name (str, optional): Azure storage account name for shared cache.
container_name (str, optional): Azure storage container name for shared cache.
log_level (int): Logging level (0-4: DEBUG, INFO, WARNING, ERROR, CRITICAL). Defaults to 3.
save_config_file (bool): Generate and save the config file for the command. Defaults to False.
from olive import quantize
workflow_output = quantize(model_name_or_path="path/to/model")
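For example, a sketch of int4 quantization with AWQ calibrated on a dataset; the model path and dataset name are placeholders.
from olive import quantize
# Sketch: AWQ int4 quantization with a calibration dataset (placeholder names)
workflow_output = quantize(
    model_name_or_path="path/to/model",
    algorithm="awq",
    precision="int4",
    data_name="dataset-name",
    output_path="quantized-model",
)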
capture_onnx_graph(...)#
Capture ONNX graph for a PyTorch model.
Arguments:
model_name_or_path (str): Path to the input model (file path or HuggingFace model name).
task (str, optional): Task for which the huggingface model is used. Default task is text-generation-with-past.
trust_remote_code (bool): Trust remote code when loading a huggingface model. Defaults to False.
adapter_path (str, optional): Path to the adapters weights saved after peft fine-tuning. Local folder or huggingface id.
model_script (str, optional): The script file containing the model definition. Required for the local PyTorch model.
script_dir (str, optional): The directory containing the local PyTorch model script file.
output_path (str): Output directory path. Defaults to "captured-model".
conversion_device (str): The device used to run the model to capture the ONNX graph ("cpu", "gpu"). Defaults to "cpu".
use_ort_genai (bool): Use OnnxRuntime generate() API to run the model. Defaults to False.
account_name (str, optional): Azure storage account name for shared cache.
container_name (str, optional): Azure storage container name for shared cache.
log_level (int): Logging level (0-4: DEBUG, INFO, WARNING, ERROR, CRITICAL). Defaults to 3.
save_config_file (bool): Generate and save the config file for the command. Defaults to False.
use_dynamo_exporter (bool): Use dynamo export API to export ONNX model. Defaults to False.
fixed_param_dict (str, optional): Fix dynamic input shapes by providing dimension names and values (e.g., 'batch_size=1,max_length=128').
past_key_value_name (str, optional): The arguments name to point to past key values (used with dynamo exporter).
torch_dtype (str, optional): The dtype to cast the model to before capturing ONNX graph (e.g., 'float32', 'float16').
target_opset (int): The target opset version for the ONNX model. Defaults to 17.
use_model_builder (bool): Use Model Builder to capture ONNX model. Defaults to False.
precision (str): The precision of the ONNX model for Model Builder ("fp16", "fp32", "int4"). Defaults to "fp32".
int4_block_size (int): Block size for int4 quantization (16, 32, 64, 128, 256). Defaults to 32.
int4_accuracy_level (int, optional): Minimum accuracy level for activation of MatMul in int4 quantization.
exclude_embeds (bool): Remove embedding layer from ONNX model. Defaults to False.
exclude_lm_head (bool): Remove language modeling head from ONNX model. Defaults to False.
enable_cuda_graph (bool): Enable CUDA graph capture for CUDA execution provider. Defaults to False.
from olive import capture_onnx_graph
workflow_output = capture_onnx_graph(model_name_or_path="path/to/model")
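For instance, a sketch that captures the graph with Model Builder at fp16; the model path is a placeholder.
from olive import capture_onnx_graph
# Sketch: capture the ONNX graph with Model Builder at fp16 (placeholder path)
workflow_output = capture_onnx_graph(
    model_name_or_path="path/to/model",
    use_model_builder=True,
    precision="fp16",
    output_path="captured-model",
)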
finetune(...)#
Fine-tune a model using LoRA or QLoRA.
Arguments:
model_name_or_path (str): Path to the input model (file path or HuggingFace model name).
task (str, optional): Task for which the huggingface model is used. Default task is text-generation-with-past.
trust_remote_code (bool): Trust remote code when loading a huggingface model. Defaults to False.
output_path (str): Output directory path. Defaults to "finetuned-adapter".
method (str): Fine-tuning method ("lora", "qlora"). Defaults to "lora".
lora_r (int): LoRA R value. Defaults to 64.
lora_alpha (int): LoRA alpha value. Defaults to 16.
target_modules (str, optional): Target modules for LoRA (comma-separated).
torch_dtype (str): PyTorch dtype for training ("bfloat16", "float16", "float32"). Defaults to "bfloat16".
data_name (str): Dataset name (required).
train_subset (str, optional): The subset to use for training.
train_split (str, optional): The split to use for training.
eval_subset (str, optional): The subset to use for evaluation.
eval_split (str, optional): The dataset split to evaluate on.
data_files (str, optional): The dataset files (comma-separated if multiple).
text_field (str, optional): The text field to use for fine-tuning.
text_template (str, optional): Template to generate text field from (e.g., '### Question: {prompt} \n### Answer: {response}').
use_chat_template (bool): Use chat template for text field. Defaults to False.
max_seq_len (int, optional): Maximum sequence length for the data.
add_special_tokens (bool, optional): Whether to add special tokens during preprocessing.
max_samples (int, optional): Maximum samples to select from the dataset.
batch_size (int, optional): Batch size.
input_cols (list[str], optional): List of input column names.
account_name (str, optional): Azure storage account name for shared cache.
container_name (str, optional): Azure storage container name for shared cache.
log_level (int): Logging level (0-4: DEBUG, INFO, WARNING, ERROR, CRITICAL). Defaults to 3.
save_config_file (bool): Generate and save the config file for the command. Defaults to False.
**training_kwargs: HuggingFace training arguments.
from olive import finetune
workflow_output = finetune(model_name_or_path="hf_model_name")
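A sketch of a QLoRA run on a hypothetical dataset; the model id, dataset name, and training keyword arguments are placeholders, with the extra keyword arguments forwarded as HuggingFace training arguments.
from olive import finetune
# Sketch: QLoRA fine-tuning (placeholder model and dataset names)
workflow_output = finetune(
    model_name_or_path="hf_model_name",
    method="qlora",
    data_name="dataset-name",
    text_template="### Question: {prompt} \n### Answer: {response}",
    max_seq_len=1024,
    # forwarded to HuggingFace training arguments via **training_kwargs
    num_train_epochs=1,
    learning_rate=2e-4,
)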
generate_adapter(...)#
Generate adapter for an ONNX model.
Arguments:
model_name_or_path (str): Path to the input model (file path or HuggingFace model name).
output_path (str): Output directory path. Defaults to "generated-adapter".
adapter_type (str): Type of adapters to extract ("lora", "dora", "loha"). Defaults to "lora".
adapter_format (str): Format to save weights in ("pt", "numpy", "safetensors", "onnx_adapter"). Defaults to "onnx_adapter".
account_name (str, optional): Azure storage account name for shared cache.
container_name (str, optional): Azure storage container name for shared cache.
log_level (int): Logging level (0-4: DEBUG, INFO, WARNING, ERROR, CRITICAL). Defaults to 3.
save_config_file (bool): Generate and save the config file for the command. Defaults to False.
from olive import generate_adapter
workflow_output = generate_adapter(model_name_or_path="path/to/onnx/model")
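For example, a sketch that extracts LoRA adapters into the onnx_adapter format; the model path is a placeholder.
from olive import generate_adapter
# Sketch: extract LoRA adapters from an ONNX model (placeholder path)
workflow_output = generate_adapter(
    model_name_or_path="path/to/onnx/model",
    adapter_type="lora",
    adapter_format="onnx_adapter",
    output_path="generated-adapter",
)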
tune_session_params(...)#
Tune ONNX Runtime session parameters for optimal performance.
Arguments:
model_name_or_path (str): Path to the input model (file path or HuggingFace model name).
output_path (str): Output directory path. Defaults to "tuned-params".
cpu_cores (int, optional): CPU cores used for thread tuning.
io_bind (bool): Enable IOBinding search for ONNX Runtime inference. Defaults to False.
enable_cuda_graph (bool): Enable CUDA graph for CUDA execution provider. Defaults to False.
execution_mode_list (list[int], optional): Parallelism list between operators.
opt_level_list (list[int], optional): Optimization level list for ONNX Model.
trt_fp16_enable (bool): Enable TensorRT FP16 mode. Defaults to False.
intra_thread_num_list (list[int], optional): List of intra thread number for test.
inter_thread_num_list (list[int], optional): List of inter thread number for test.
extra_session_config (str, optional): Extra customized session options during tuning process (JSON string).
disable_force_evaluate_other_eps (bool): Whether to disable force evaluation of all execution providers. Defaults to False.
enable_profiling (bool): Enable profiling for ONNX Runtime inference. Defaults to False.
predict_with_kv_cache (bool): Use key-value cache for ORT session parameter tuning. Defaults to False.
device (str): Target device ("gpu", "cpu", "npu"). Defaults to "cpu".
providers_list (list[str], optional): List of execution providers to use for ONNX model (case sensitive).
memory (int, optional): Memory limit for the accelerator in bytes.
account_name (str, optional): Azure storage account name for shared cache.
container_name (str, optional): Azure storage container name for shared cache.
log_level (int): Logging level (0-4: DEBUG, INFO, WARNING, ERROR, CRITICAL). Defaults to 3.
save_config_file (bool): Generate and save the config file for the command. Defaults to False.
from olive import tune_session_params
workflow_output = tune_session_params(model_name_or_path="path/to/onnx/model")
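A sketch that searches session parameters on GPU across two execution providers with IO binding enabled; the model path is a placeholder.
from olive import tune_session_params
# Sketch: tune ORT session parameters for GPU (placeholder path)
workflow_output = tune_session_params(
    model_name_or_path="path/to/onnx/model",
    device="gpu",
    providers_list=["CUDAExecutionProvider", "CPUExecutionProvider"],
    io_bind=True,
    output_path="tuned-params",
)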
Utility Functions#
There are also utility functions that do not produce a WorkflowOutput:
convert_adapters(...)#
Convert LoRA adapter weights to a file that can be consumed by ONNX models generated by the Olive ExtractedAdapters pass.
Arguments:
adapter_path (str): Path to the adapters weights saved after peft fine-tuning. Can be a local folder or huggingface id.
adapter_format (str): Format to save weights in ("pt", "numpy", "safetensors", "onnx_adapter"). Defaults to "onnx_adapter".
output_path (str): Path to save the exported weights. Will be saved in the specified format.
dtype (str): Data type to save float adapter weights as ("float32", "float16"). If quantize_int4 is True, this is the data type of the quantization scales. Defaults to "float32".
quantize_int4 (bool): Quantize the adapter weights to int4 using blockwise quantization. Defaults to False.
int4_block_size (int): Block size for int4 quantization of adapter weights (16, 32, 64, 128, 256). Defaults to 32.
int4_quantization_mode (str): Quantization mode for int4 quantization ("symmetric", "asymmetric"). Defaults to "symmetric".
log_level (int): Logging level (0-4: DEBUG, INFO, WARNING, ERROR, CRITICAL). Defaults to 3.
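For example, a sketch converting PEFT adapter weights to the onnx_adapter format with int4 quantization; the paths are placeholders.
from olive import convert_adapters
# Sketch: convert PEFT adapter weights for use with Olive-generated ONNX models (placeholder paths)
convert_adapters(
    adapter_path="path/to/adapter",
    output_path="adapter_weights.onnx_adapter",
    adapter_format="onnx_adapter",
    quantize_int4=True,
)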
extract_adapters(...)#
Extract LoRA adapters from a PyTorch model to separate files.
Arguments:
model_name_or_path (str): Path to the PyTorch model. Can be a local folder or Hugging Face id.
format (str): Format to save the LoRAs in ("pt", "numpy", "safetensors", "onnx_adapter").
output (str): Output folder to save the LoRAs in the requested format.
dtype (str): Data type to save LoRAs as ("float32", "float16"). Defaults to "float32".
cache_dir (str, optional): Cache dir to store temporary files in. Default is Hugging Face's default cache dir.
log_level (int): Logging level (0-4: DEBUG, INFO, WARNING, ERROR, CRITICAL). Defaults to 3.
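A sketch that saves the LoRA weights of a PEFT model as safetensors; the model path and output folder are placeholders.
from olive import extract_adapters
# Sketch: save LoRA weights from a PyTorch (PEFT) model as safetensors (placeholder names)
extract_adapters(
    model_name_or_path="path/to/peft/model",
    format="safetensors",
    output="extracted-adapters",
    dtype="float16",
)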
generate_cost_model(...)#
Generate a cost model for a given model and save it as a csv file. This cost model is consumed by the CaptureSplitInfo pass. Only supports HfModel.
Arguments:
model_name_or_path (str): Path to the input model (file path or HuggingFace model name).
task (str, optional): Task for which the huggingface model is used. Default task is text-generation-with-past.
trust_remote_code (bool): Trust remote code when loading a huggingface model. Defaults to False.
output_path (str): Output directory path. Defaults to current directory.
weight_precision (str): Weight precision ("fp32", "fp16", "fp8", "int32", "uint32", "int16", "uint16", "int8", "uint8", "int4", "uint4", "nf4", "fp4"). Defaults to "fp32".
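For instance, a sketch generating a cost model CSV for a Hugging Face model with fp16 weights; the model id is a placeholder.
from olive import generate_cost_model
# Sketch: produce a cost model CSV for use by the CaptureSplitInfo pass (placeholder model id)
generate_cost_model(
    model_name_or_path="hf_model_name",
    weight_precision="fp16",
)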
Accessing API Results#
Most API functions return a WorkflowOutput object. You can use it to access the optimized models and their metrics.
from olive import run, WorkflowOutput
workflow_output: WorkflowOutput = run("config.json")
# Check if optimization produced any results
if workflow_output.has_output_model():
    # Get the best model overall
    best_model = workflow_output.get_best_candidate()
    print(f"Model path: {best_model.model_path}")
    print(f"Model type: {best_model.model_type}")
    print(f"Device: {best_model.from_device()}")
    print(f"Execution provider: {best_model.from_execution_provider()}")
    print(f"Metrics: {best_model.metrics_value}")
    # Get the best model for CPU
    best_cpu_model = workflow_output.get_best_candidate_by_device("CPU")
    # Get all models for GPU
    gpu_models = workflow_output.get_output_models_by_device("GPU")
Output Class Hierarchy#
Olive organizes optimization results in a hierarchical structure: a WorkflowOutput is the top-level container for all results across devices. It contains multiple DeviceOutput instances (one per device), and each DeviceOutput contains multiple ModelOutput instances.
Output Classes in Detail#
WorkflowOutput#
The WorkflowOutput
class organizes results from an entire optimization workflow, containing outputs across different hardware devices and execution providers.
from olive import WorkflowOutput
Key Methods#
get_input_model_metrics() - Get the metrics for the input model
get_available_devices() - Get a list of devices that the workflow ran on
has_output_model() - Check if any optimized models are available
get_output_models_by_device(device) - Get all optimized models for a specific device
get_output_model_by_id(model_id) - Get a specific model by its ID
get_output_models() - Get all optimized models sorted by metrics
get_best_candidate_by_device(device) - Get the best model for a specific device
get_best_candidate() - Get the best model across all devices
trace_back_run_history(model_id) - Get the optimization history for a specific model
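As a sketch of how these methods fit together, assuming workflow_output comes from an earlier run(...) call:
# Sketch: inspect a completed workflow (assumes workflow_output from an earlier run)
print(workflow_output.get_input_model_metrics())   # metrics of the unoptimized input model
print(workflow_output.get_available_devices())     # devices the workflow ran on
if workflow_output.has_output_model():
    best = workflow_output.get_best_candidate()
    print(workflow_output.trace_back_run_history(best.model_id))  # how this model was produced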
DeviceOutput#
The DeviceOutput class groups model outputs for a specific device, containing results for different execution providers on that device.
from olive import DeviceOutput
Key Attributes#
device - The device type (e.g., "cpu", "gpu")
Key Methods#
has_output_model() - Check if any model outputs are available for this device
get_output_models() - Get all model outputs for this device
get_best_candidate() - Get the best model output for this device based on metrics
get_best_candidate_by_execution_provider(execution_provider) - Get the best model for a specific execution provider
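A sketch of working with a DeviceOutput; how you obtain the instance depends on the WorkflowOutput API, so device_output below is assumed to already hold one.
# Sketch: device_output is assumed to be a DeviceOutput instance for a single device
if device_output.has_output_model():
    print(device_output.device)  # e.g., "cpu"
    for model in device_output.get_output_models():
        print(model.model_path)
    best = device_output.get_best_candidate_by_execution_provider("CPUExecutionProvider")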
ModelOutput#
The ModelOutput class represents an individual optimized model, containing its path, metrics, and configuration.
from olive import ModelOutput
Key Attributes#
metrics - Dictionary containing the model's performance metrics
metrics_value - Simplified version of metrics with just the values
model_path - Path to the optimized model file
model_id - Unique identifier for this model
model_type - Type of the model (e.g., "onnxmodel")
model_config - Configuration details for the model
Key Methods#
from_device() - Get the device this model was optimized for
from_execution_provider() - Get the execution provider this model was optimized for
from_pass() - Get the Olive optimization pass that generated this model
get_parent_model_id() - Get the ID of the parent model this was derived from
use_ort_extension() - Check if the model uses the ONNXRuntime extension
get_inference_config() - Get the model's inference configuration
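A sketch of reading a ModelOutput, assuming best_model was obtained from get_best_candidate() as in the example above:
# Sketch: best_model is a ModelOutput, e.g., from workflow_output.get_best_candidate()
print(best_model.model_path, best_model.model_type)
print(best_model.from_device(), best_model.from_execution_provider())
print(best_model.from_pass())            # the Olive pass that produced this model
print(best_model.get_parent_model_id())  # the model it was derived from
print(best_model.metrics_value)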