Run Olive workflows#

The Olive run command executes any of the 40+ optimizations available in Olive in a sequence that you define in a YAML or JSON file called a workflow.

Quickstart#

In this quickstart, you’ll execute the following Olive workflow:

graph LR
    A[/"`Llama-3.2-1B-Instruct (from Hugging Face)`"/]
    C["`IncDynamicQuantization`"]
    A --> B[OnnxConversion]
    B --> C
    C --> E[OrtSessionParamsTuning]
    E --> F[/ZipFile/]

The input to the workflow is the Llama-3.2-1B-Instruct model from Hugging Face. The workflow applies the following passes (steps):

  1. Convert the model into the ONNX format using the OnnxConversion pass.

  2. Quantize using the IncDynamicQuantization pass (Intel® Neural Compressor Dynamic Quantization).

  3. Optimize the ONNX Runtime inference settings using the OrtSessionParamsTuning pass.

The output of the workflow is a ZIP file containing the ONNX model and the ONNX Runtime (ORT) configuration settings.
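
Once the workflow finishes, you can sanity-check the exported model with ONNX Runtime. The following is a minimal sketch, not part of the workflow itself; the extracted path and file name ("OutputModel/model.onnx") are assumptions and will depend on your packaging and output settings.

# smoke_test.py -- minimal sketch; the path "OutputModel/model.onnx" is an
# assumption, not guaranteed by the workflow output layout.
import onnxruntime as ort

# Create an inference session on CPU, matching the CPUExecutionProvider
# configured in the workflow below.
session = ort.InferenceSession(
    "OutputModel/model.onnx",
    providers=["CPUExecutionProvider"],
)

# Print the model's expected inputs to confirm the export looks sane.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)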

Define the workflow in a YAML file#

First, define the “quickstart workflow” in a YAML file:

# quickstart-workflow.yaml
input_model:
  type: HfModel
  model_path: meta-llama/Llama-3.2-1B-Instruct
systems:
  local_system:
    type: LocalSystem
    accelerators:
      - device: cpu
        execution_providers:
          - CPUExecutionProvider
data_configs:
  - name: transformer_token_dummy_data
    type: TransformersTokenDummyDataContainer
passes:
  conversion:
    type: OnnxConversion
    target_opset: 16
    save_as_external_data: true
    all_tensors_to_one_file: true
    save_metadata_for_token_generation: true
  quantize:
    type: IncDynamicQuantization
  session_params_tuning:
    type: OrtSessionParamsTuning
    data_config: transformer_token_dummy_data
    io_bind: true
packaging_config:
  - type: Zipfile
    name: OutputModel
log_severity_level: 0
host: local_system
target: local_system
cache_dir: cache
output_dir: null

Run the workflow#

The workflow is executed using the run command:

olive run --config quickstart-workflow.yaml
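
Workflows can also be launched from Python instead of the CLI. The snippet below is a minimal sketch, assuming the olive-ai package is installed and that olive.workflows.run accepts a path to the workflow file.

# run_workflow.py -- minimal sketch; assumes the olive-ai package is installed
# and that olive.workflows.run accepts a workflow file path.
from olive.workflows import run as olive_run

# Execute the same quickstart workflow programmatically.
olive_run("quickstart-workflow.yaml")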