PyTorch

PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.

LoRA

Low-Rank Adaptation, or LoRA, is a fine-tuning approach which freezes the pre-trained model weights and injects trainable rank decomposition matrices (called adapters) into the layers of the model. It is based on the LoRA paper.

The output model is the input transformers model along with the fine-tuned LoRA adapters. The adapters can be loaded and/or merged into the original model using the peft library from Hugging Face.
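
For reference, below is a minimal sketch of loading and merging the adapters with peft; "base_model_path" and "adapter_path" are placeholders for the input model and the pass output directory.

from transformers import AutoModelForCausalLM
from peft import PeftModel

# load the frozen base model, then attach the fine-tuned LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained("base_model_path")
model = PeftModel.from_pretrained(base_model, "adapter_path")

# optionally fold the adapters back into the base weights for adapter-free inference
model = model.merge_and_unload()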

This pass only supports Hugging Face transformers PyTorch models. Please refer to LoRA for more details about the pass and its config parameters.

Example Configuration

{
    "type": "LoRA",
    "config": {
        "lora_alpha": 16,
        "train_data_config": // ...,
        "training_args": {
            "learning_rate": 0.0002,
            // ...
        }
    }
}

Please refer to LoRA HFTrainingArguments for more details on the supported "training_args" and their default values.

QLoRA

QLoRA is an efficient fine-tuning approach that reduces memory usage by backpropagating gradients through a frozen, 4-bit quantized pretrained model into Low Rank Adapters (LoRA). It is based on the QLoRA paper and code. More information on LoRA can be found in the paper.

The output model is the input transformers model along with the quantization config and the fine-tuned LoRA adapters. The adapters can be loaded and/or merged into the original model using the peft library from Hugging Face.
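
For reference, below is a minimal sketch of loading such a model with peft and bitsandbytes; the quantization settings mirror the compute_dtype and quant_type options in the example configuration below, and the paths are placeholders.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# load the frozen base model in 4 bits, then attach the fine-tuned adapters
base_model = AutoModelForCausalLM.from_pretrained("base_model_path", quantization_config=bnb_config)
model = PeftModel.from_pretrained(base_model, "adapter_path")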

This pass only supports Hugging Face transformers PyTorch models. Please refer to QLoRA for more details about the pass and its config parameters.

Note: QLoRA requires a GPU to run.

Example Configuration

{
    "type": "QLoRA",
    "config": {
        "compute_dtype": "bfloat16",
        "quant_type": "nf4",
        "training_args": {
            "learning_rate": 0.0002,
            // ...
        },
        "train_data_config": // ...,
    }
}

Please refer to QLoRA HFTrainingArguments for more details on the supported "training_args" and their default values.

LoftQ

LoftQ is a quantization framework that simultaneously quantizes a model and finds a proper low-rank initialization for LoRA fine-tuning. It is based on the LoftQ paper and code. More information on LoRA can be found in the paper.

The LoftQ pass initializes the quantized LoRA model using the LoftQ initialization method and then fine-tunes the adapters. The output model has new quantization aware master weights and the fine-tuned LoRA adapters.
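
For context, peft also exposes a LoftQ-style initialization on its own. The sketch below shows that standalone peft route rather than the Olive pass; the model path, rank, and target modules are illustrative placeholders and the API may differ across peft versions.

from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("base_model_path")

# initialize the LoRA adapters so they compensate for the quantization error
loftq_config = LoftQConfig(loftq_bits=4)
lora_config = LoraConfig(
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    r=16,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(base_model, lora_config)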

This pass only supports Hugging Face transformers PyTorch models. Please refer to LoftQ for more details about the pass and its config parameters.

Note: LoftQ requires a GPU to run.

Example Configuration

{
    "type": "LoftQ",
    "config": {
        "compute_dtype": "bfloat16",
        "training_args": {
            "learning_rate": 0.0002,
            // ...
        },
        "train_data_config": // ...,
    }
}

Please refer to LoftQ HFTrainingArguments for more details on the supported "training_args" and their default values.

Quantization Aware Training

The Quantization Aware Training (QAT) technique is used to improve the performance and efficiency of deep learning models by quantizing their weights and activations to lower bit-widths. The technique is applied during training, where the weights and activations are fake quantized to lower bit-widths using the specified QConfig.
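
As an illustration of the idea in plain PyTorch (not the Olive pass itself), the QConfig attaches fake-quantize observers to weights and activations so the model learns to tolerate low-bit rounding during training:

import torch
from torch.ao.quantization import get_default_qat_qconfig, prepare_qat, convert

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
model.train()
model.qconfig = get_default_qat_qconfig("fbgemm")

qat_model = prepare_qat(model)          # insert fake-quantize modules per the QConfig
# ... run the usual training loop on qat_model ...
int8_model = convert(qat_model.eval())  # fold observers into real quantized modules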

Olive provides QuantizationAwareTraining that performs QAT on a PyTorch model.

Please refer to QuantizationAwareTraining for more details about the pass and its config parameters.

Example Configuration

Olive provides three ways to run the QAT training process:

a. Run QAT training with a customized training loop.

{
    "type": "QuantizationAwareTraining",
    "config":{
        "user_script": "user_script.py",
        "training_loop_func": "training_loop_func"
    }
}

Check out this file for an example implementation of "user_script.py" and "training_loop_func".
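
As a rough orientation only, a hypothetical "user_script.py" could look like the sketch below. It assumes the pass calls "training_loop_func" with the QAT-prepared model and expects the trained model back; check the linked example for the exact signature and data handling.

import torch

def training_loop_func(model):
    # toy in-memory data; a real script would build a DataLoader over the training set
    batches = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(10)]
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    model.train()
    for inputs, labels in batches:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()
    return model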

b. Run QAT training with PyTorch Lightning.

{
    "type": "QuantizationAwareTraining",
    "config":{
        "user_script": "user_script.py",
        "num_epochs": 5,
        "ptl_data_module": "PTLDataModule",
        "ptl_module": "PTLModule",
    }
}

Check out this file for an example implementation of "user_script.py", "PTLDataModule" and "PTLModule".
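
For orientation, hypothetical "PTLDataModule" and "PTLModule" classes might follow the standard pytorch_lightning interfaces as sketched below; the linked example shows the exact classes and constructor arguments Olive expects.

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class PTLDataModule(pl.LightningDataModule):
    def train_dataloader(self):
        # toy dataset; a real module would wrap the actual training data
        dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
        return DataLoader(dataset, batch_size=8)

class PTLModule(pl.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.loss_fn = torch.nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        return self.loss_fn(self.model(inputs), labels)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)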

c. Run QAT training with the default training loop.

{
    "type": "QuantizationAwareTraining",
    "config":{
        "user_script": "user_script.py",
        "num_epochs": 5,
        "train_dataloader_func": "create_train_dataloader",
    }
}

Check out this file for an example implementation of "user_script.py" and "create_train_dataloader".
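
For orientation, a hypothetical "create_train_dataloader" might simply return a torch DataLoader as sketched below; the linked example shows the exact signature (for instance, which arguments such as batch size are passed in).

import torch
from torch.utils.data import DataLoader, TensorDataset

def create_train_dataloader(data_dir, batch_size, *args, **kwargs):
    # toy dataset; a real script would load the training data from data_dir
    dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
    return DataLoader(dataset, batch_size=batch_size)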

AutoGPTQ

Olive also integrates AutoGPTQ for quantization.

AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization). With GPTQ quantization, you can quantize your favorite language model to 8, 4, 3 or even 2 bits without a significant drop in performance and with faster inference speed. This is supported by most GPU hardware.
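
For context, this is roughly how the underlying auto_gptq library is used on its own (outside Olive), following its documented flow; the model name is a placeholder and argument names may differ between library versions.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weight-only quantization with group size 128
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# GPTQ calibrates on a small set of example sequences
examples = [tokenizer("GPTQ uses a small calibration set like this sentence.")]
model.quantize(examples)
model.save_quantized("opt-125m-gptq-4bit")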

Olive consolidates GPTQ quantization into a single pass called GptqQuantizer, which supports tuning the GPTQ quantization hyperparameters to trade off between accuracy and speed.

Please refer to GptqQuantizer for more details about the pass and its config parameters.

Example Configuration

{
    "type": "GptqQuantizer",
    "config": {
        "data_config": "wikitext2_train"
    }
}

Check out this file for an example implementation of "wikitext2_train".

AutoAWQ

AutoAWQ is an easy-to-use package for 4-bit quantized models; it speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created and improved upon based on the original work from MIT.
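
For context, this is roughly how the standalone AutoAWQ library is used (outside Olive), following its documented flow; the model name is a placeholder and the quant_config keys may differ between library versions.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"  # placeholder model
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# calibrate and quantize the weights to 4 bits, then save the quantized model
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("opt-125m-awq")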

Olive integrates AutoAWQ for quantization and makes it possible to convert the AWQ-quantized torch model to an ONNX model. You can enable pack_model_for_onnx_conversion to pack the model for ONNX conversion.

Please refer to AutoAWQQuantizer for more details about the pass and its config parameters.

Example Configuration

{
    "type": "AutoAWQQuantizer",
    "config": {
        "w_bit": 4,
        "pack_model_for_onnx_conversion": true
    }
}

SparseGPT

SparseGPT prunes GPT-like models using the one-shot pruning method of the same name. It can efficiently reach up to 60% unstructured sparsity on large models such as OPT-175B and BLOOM-176B with a negligible increase in perplexity. It also supports semi-structured sparsity patterns such as 2:4 and 4:8.
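
As a toy illustration of the 2:4 semi-structured pattern (not the SparseGPT algorithm itself): in every contiguous group of four weights, at most two are non-zero, which gives 50% sparsity in a hardware-friendly layout.

import torch

weights = torch.randn(1, 8)
groups = weights.view(-1, 4)

# keep the two largest-magnitude weights in each group of four, zero out the rest
idx = groups.abs().topk(2, dim=1).indices
mask = torch.zeros_like(groups).scatter_(1, idx, 1.0)
pruned = (groups * mask).view_as(weights)  # 2:4 pattern, 50% sparsity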

Please refer to the original paper linked above for more details on the algorithm and performance results for different models, sparsities and datasets.

This pass only supports Hugging Face transformers PyTorch models. Please refer to SparseGPT for more details on the types of transformers models supported.

Note: TensorRT can accelerate inference on 2:4 sparse models as described in this blog.

Example Configuration

{
    "type": "SparseGPT",
    "config": {"sparsity": 0.5}
}
{
    "type": "SparseGPT",
    "config": {"sparsity": [2,4]}
}

SliceGPT

SliceGPT is a post-training sparsification scheme that makes transformer networks smaller by applying orthogonal transformations to each transformer layer and slicing off the least-significant rows and columns of the weight matrices, which reduces the model size. This results in speedups and a reduced memory footprint.
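
As a toy numeric sketch of the idea (not the actual algorithm): rotating a weight matrix with an orthogonal matrix is lossless when the neighboring layer absorbs the inverse rotation, and the rotated weights can then be sliced to a smaller dimension. Here Q comes from a plain QR decomposition purely for illustration, whereas the paper derives it from the model's activations.

import torch

d, d_small = 8, 6
W = torch.randn(d, d)

# random orthogonal matrix standing in for the learned rotation
Q, _ = torch.linalg.qr(torch.randn(d, d))

W_rotated = W @ Q                  # lossless if the next layer absorbs Q transposed
W_sliced = W_rotated[:, :d_small]  # drop the least-significant columns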

Please refer to the original paper for more details on the algorithm and expected results for different models, sparsities and datasets.

This pass only supports Hugging Face transformers PyTorch models. Please refer to SliceGPT for more details on the types of transformers models supported.

Example Configuration

{
    "type": "SliceGPT",
    "config": {
        "sparsity": 0.4,
        "calibration_data_config": "wikitext2"
    }
}

TorchTRTConversion

TorchTRTConversion converts the torch.nn.Linear modules in the transformer layers in a Hugging Face PyTorch model to TRTModules from torch_tensorrt with fp16 precision and sparse weights, if applicable. torch_tensorrt is an extension to torch where TensorRT compiled engines can be used like regular torch.nn.Modules. This pass can be used to accelerate inference on transformer models with sparse weights by taking advantage of the 2:4 structured sparsity pattern supported by TensorRT.
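
For context, this is roughly how torch_tensorrt compiles a module into a TensorRT engine that can be called like a regular torch.nn.Module; it is illustrative only and not the Olive pass, which targets the Linear modules inside the transformer layers.

import torch
import torch_tensorrt

module = torch.nn.Linear(1024, 1024).eval().half().cuda()
example_input = torch.randn(1, 1024, dtype=torch.half, device="cuda")

trt_module = torch_tensorrt.compile(
    module,
    inputs=[example_input],
    enabled_precisions={torch.half},  # build fp16 engines
)
output = trt_module(example_input)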

This pass only supports Hugging Face transformers PyTorch models. Please refer to TorchTRTConversion for more details on the types of transformers models supported.

Example Configuration

{
    "type": "TorchTRTConversion"
}