PEFT Adapters#
Parameter Efficient Finetuning (PEFT) techniques, such as LoRA, enable users to efficiently finetune a model.
LoRA#
Low-Rank Adaptation, or LoRA, is a fine-tuning approach which freezes the pre-trained model weights and injects trainable rank decomposition matrices (called adapters) into the layers of the model.
It is based on the LoRA paper.
The output model is the input transformers model along with the fine-tuned LoRA adapters. The adapters can be loaded and/or merged into the original model using the peft library from Hugging Face.
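For example, the adapters produced by this pass can be attached to the base model with peft; a minimal sketch, where the base model and adapter paths are placeholders:
from peft import PeftModel
from transformers import AutoModelForCausalLM

# placeholder paths: the original Hugging Face model and the directory holding the fine-tuned adapters
base_model = AutoModelForCausalLM.from_pretrained("path/to/base-model")
model = PeftModel.from_pretrained(base_model, "path/to/lora-adapters")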
This pass only supports HfModels. Please refer to LoRA for more details about the pass and its config parameters.
Example Configuration#
{
    "type": "LoRA",
    "alpha": 16,
    "train_data_config": // ...,
    "training_args": {
        "learning_rate": 0.0002,
        // ...
    }
}
Please refer to LoRA HFTrainingArguments for more details on the supported "training_args" and their default values.
QLoRA#
QLoRA is an efficient finetuning approach that reduces memory usage by backpropagating gradients through a frozen, 4-bit quantized pretrained model into Low Rank Adapters (LoRA). It is based on the QLoRA paper and code. More information on LoRA can be found in the paper.
The output model is the input transformers model along with the quantization config and the fine-tuned LoRA adapters. The adapters can be loaded and/or merged into the original model using the peft library from Hugging Face.
This pass only supports HfModels. Please refer to QLoRA for more details about the pass and its config parameters.
Note: QLoRA requires a GPU to run.
Example Configuration#
{
    "type": "QLoRA",
    "compute_dtype": "bfloat16",
    "quant_type": "nf4",
    "training_args": {
        "learning_rate": 0.0002,
        // ...
    },
    "train_data_config": // ...
}
Please refer to QLoRA HFTrainingArguments for more details on the supported "training_args" and their default values.
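For intuition, the "compute_dtype" and "quant_type" options above mirror the 4-bit quantization settings of a typical QLoRA setup built with transformers, bitsandbytes, and peft. The following is a minimal sketch of such a setup, not the pass's internal implementation; the model path and LoRA hyperparameters are placeholders:
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, matching the example configuration above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# load the frozen, quantized base model and prepare it for k-bit training
model = AutoModelForCausalLM.from_pretrained("path/to/base-model", quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# attach trainable LoRA adapters on top of the frozen, quantized base model
lora_config = LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)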
LoftQ#
LoftQ is a quantization framework which simultaneously quantizes and finds a proper low-rank initialization for LoRA fine-tuning. It is based on the LoftQ paper and code. More information on LoRA can be found in the paper.
The LoftQ pass initializes the quantized LoRA model using the LoftQ initialization method and then fine-tunes the adapters. The output model has new quantization-aware master weights and the fine-tuned LoRA adapters.
This pass only supports HfModels. Please refer to LoftQ for more details about the pass and its config parameters.
Note: LoftQ requires a GPU to run.
Example Configuration#
{
    "type": "LoftQ",
    "compute_dtype": "bfloat16",
    "training_args": {
        "learning_rate": 0.0002,
        // ...
    },
    "train_data_config": // ...
}
Please refer to LoftQ HFTrainingArguments for more details on the supported "training_args" and their default values.
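For reference, a LoftQ-style initialization can also be set up directly with the peft library; a minimal sketch, not the pass's internal implementation, with placeholder model path and hyperparameters:
from peft import LoftQConfig, LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/base-model")

# initialize the LoRA adapters with LoftQ so they compensate for the quantization error
loftq_config = LoftQConfig(loftq_bits=4)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    task_type="CAUSAL_LM",
    init_lora_weights="loftq",
    loftq_config=loftq_config,
)
model = get_peft_model(model, lora_config)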
MergeAdapterWeights#
Merges LoRA adapter weights into the base model to produce a complete model. After running the LoRA pass, the output model only contains the LoRA adapters. This pass merges the LoRA adapters into the original model and downloads the model's context (config, generation_config, tokenizer).
Example Configuration#
{
    "type": "MergeAdapterWeights"
}
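Conceptually, the merge performed by this pass corresponds to the following peft operations; a minimal sketch with placeholder paths:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("path/to/base-model")
model = PeftModel.from_pretrained(base_model, "path/to/lora-adapters")

# fold the adapter weights into the base weights and save a standalone model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("path/to/merged-model")

# save the accompanying context so the merged model is self-contained
tokenizer = AutoTokenizer.from_pretrained("path/to/base-model")
tokenizer.save_pretrained("path/to/merged-model")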
Extract Adapters#
LoRA, QLoRA, and related techniques allow us to fine-tune a pre-trained model by adding a small number of trainable matrices called adapters. The same base model can be used for multiple tasks by adding different adapters for each task. To support using multiple adapters with the same optimized ONNX model, the ExtractAdapters pass extracts the adapter weights from the model and saves them to a separate file. The model graph is then modified in one of the following ways:
- Adapter weights are set as external tensors pointing to a non-existent file. The ONNX model is thus invalid by itself since it cannot be loaded. To create an inference session using this model, the adapter weights must be added to a session options object using add_initializer or add_external_initializers, as shown in the sketch after this list.
- Adapter weights are converted into model inputs. The ONNX model is valid on its own. During inference, the adapter weights must be provided as part of the inputs. We call them constant inputs since these weights don't change between runs when using the same set of adapters.
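For example, with the external-initializer layout the extracted weights can be registered on the session options before creating the session. The following is a minimal sketch; the weight file name, its format (a NumPy archive here), and the tensor names are assumptions rather than what the pass necessarily emits:
import numpy as np
import onnxruntime as ort

# hypothetical archive holding the extracted adapter weights, keyed by initializer name
adapter_weights = dict(np.load("adapter_weights.npz"))

names = list(adapter_weights.keys())
values = [ort.OrtValue.ortvalue_from_numpy(np.ascontiguousarray(w)) for w in adapter_weights.values()]

sess_options = ort.SessionOptions()
# make the externally referenced adapter tensors available when the model is loaded
sess_options.add_external_initializers(names, values)
session = ort.InferenceSession("model.onnx", sess_options=sess_options)
When the weights are made model inputs instead, the model loads on its own and the same adapter weights are simply passed to session.run together with the regular inputs.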
Example Configuration#
a. As external initializers
{
    "type": "ExtractAdapters",
    "make_inputs": false
}
b. As constant inputs with packed weights
{
    "type": "ExtractAdapters",
    "make_inputs": true,
    "pack_inputs": true
}
Please refer to ExtractAdapters for more details about the pass and its config parameters.
Olive also provides a command line tool to convert adapters saved after peft fine-tuning to a format compatible with a model that has been optimized with the ExtractAdapters pass. More details on the olive convert-adapters command can be found at Command Line Tools.