How to use finetune
Command#
The olive finetune command will finetune a PyTorch/Hugging Face model and output a Hugging Face PEFT adapter. If you want to convert the PEFT adapter into a format for ONNX Runtime, you can execute the olive generate-adapter command after finetuning.
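For example, the conversion step could look like the following. This is a minimal sketch rather than the authoritative syntax: the paths and flag names are assumptions that mirror the olive finetune and olive auto-opt examples below, so check olive generate-adapter --help for the options your Olive version supports.
# Sketch only: flag names and paths are assumptions; see `olive generate-adapter --help`.
olive generate-adapter \
--model_name_or_path models/llama/ft/model \
--adapter_path models/llama/ft/adapter \
--output_path models/llama/adapter-onnx \
--log_level 1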
Quickstart#
The following example shows how to finetune Llama-3.2-1B-Instruct from Hugging Face, either on your local computer (if you have a GPU device) or on remote compute through Olive's Azure AI integration.
Note
You’ll need a GPU device on your local machine to fine-tune a model.
olive finetune \
--model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
--trust_remote_code \
--output_path models/llama/ft \
--data_name xxyyzzz/phrase_classification \
--text_template "<|start_header_id|>user<|end_header_id|>\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n{tone}" \
--method lora \
--max_steps 100 \
--log_level 1
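The --text_template option defines how each row of the dataset is rendered into a training example: the {phrase} and {tone} placeholders are filled in from the corresponding dataset columns. When finetuning completes, the output path contains the base model and the PEFT adapter, referenced below as models/llama/ft/model and models/llama/ft/adapter.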
Auto-Optimize the model and adapters#
If you would like your fine-tuned model to run on ONNX Runtime, you can execute the olive auto-opt command to produce an optimized ONNX model and adapter:
olive auto-opt \
--model_name_or_path models/llama/ft/model \
--adapter_path models/llama/ft/adapter \
--device cpu \
--provider CPUExecutionProvider \
--use_ort_genai \
--output_path models/llama/onnx \
--log_level 1
Once the olive auto-opt command has completed successfully, you'll have:
- The base model in an optimized ONNX format.
- The adapter weights in a format for ONNX Runtime.
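Since the command set --use_ort_genai, the output is also set up to be loaded with the ONNX Runtime generate() API (onnxruntime-genai).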
Olive and ONNX Runtime support the multi-LoRA model serving pattern, which greatly reduces the compute footprint of serving many adapters:

Figure: Multi-LoRA serving versus single-LoRA serving