Run Olive workflows#
The Olive run command allows you to execute any of the 40+ optimizations available in Olive in a sequence you define in a YAML/JSON file called a workflow.
Quickstart#
In this quickstart, you’ll execute the following Olive workflow:
Llama-3.2-1B-Instruct (from Hugging Face) → OnnxConversion → IncDynamicQuantization → OrtSessionParamsTuning → ZipFile
The input to the workflow is the Llama-3.2-1B-Instruct model from Hugging Face. The workflow consists of the following passes (steps):
1. Convert the model into the ONNX format using the OnnxConversion pass.
2. Quantize the model using the IncDynamicQuantization pass (Intel® Neural Compressor dynamic quantization).
3. Optimize the ONNX Runtime inference settings using the OrtSessionParamsTuning pass.
The output of the workflow is a Zip file containing the ONNX model and ORT configuration settings.
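Once the workflow has been defined and run (see the steps below), the extracted archive can be loaded directly with ONNX Runtime. The following is a minimal sketch in Python; the archive name OutputModel.zip and the model file name model.onnx are assumptions, since the actual file names depend on the packaging and output settings:

import zipfile

import onnxruntime as ort

# Extract the packaged workflow output. "OutputModel.zip" is an assumed
# name; check your output directory for the actual archive.
with zipfile.ZipFile("OutputModel.zip") as archive:
    archive.extractall("output_model")

# Load the optimized model with ONNX Runtime on CPU, matching the
# accelerator configured in the workflow. "model.onnx" is illustrative.
session = ort.InferenceSession(
    "output_model/model.onnx",
    providers=["CPUExecutionProvider"],
)
print([inp.name for inp in session.get_inputs()])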
Define the workflow in a YAML file#
First, define the “quickstart workflow” in a YAML file:
# quickstart-workflow.yaml
input_model:
  type: HfModel
  model_path: meta-llama/Llama-3.2-1B-Instruct
systems:
  local_system:
    type: LocalSystem
    accelerators:
      - device: cpu
        execution_providers:
          - CPUExecutionProvider
data_configs:
  - name: transformer_token_dummy_data
    type: TransformersTokenDummyDataContainer
passes:
  # Pass 1: convert the Hugging Face model to ONNX
  conversion:
    type: OnnxConversion
    target_opset: 16
    save_as_external_data: true
    all_tensors_to_one_file: true
    save_metadata_for_token_generation: true
  # Pass 2: dynamic quantization with Intel Neural Compressor
  quantize:
    type: IncDynamicQuantization
  # Pass 3: tune ONNX Runtime session parameters
  session_params_tuning:
    type: OrtSessionParamsTuning
    data_config: transformer_token_dummy_data
    io_bind: true
packaging_config:
  - type: Zipfile
    name: OutputModel
log_severity_level: 0
host: local_system
target: local_system
cache_dir: cache
output_dir: null
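Because the workflow is plain YAML, it can also be inspected or tweaked programmatically before running it. Here is a minimal sketch using PyYAML; the library choice and the modified opset value are illustrative, not part of the quickstart:

import yaml

# Load the workflow definition from the file above.
with open("quickstart-workflow.yaml") as f:
    workflow = yaml.safe_load(f)

# List the configured passes (conversion, quantize, session_params_tuning).
print(list(workflow["passes"]))

# Example tweak: raise the ONNX opset used by the conversion pass.
workflow["passes"]["conversion"]["target_opset"] = 17

with open("quickstart-workflow-modified.yaml", "w") as f:
    yaml.safe_dump(workflow, f, sort_keys=False)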
Run the workflow#
The workflow is executed using the run command:
olive run --config quickstart-workflow.yaml
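Olive also exposes a Python entry point for running workflows. The following is a minimal sketch; depending on your Olive version, you may need to supply the configuration as a JSON file or a Python dict rather than a YAML path:

from olive.workflows import run as olive_run

# Run the same workflow programmatically. The return value describes the
# produced output models; see the Olive documentation for details.
workflow_output = olive_run("quickstart-workflow.yaml")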