PyTorch

PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.

AutoGPTQ

Olive also integrates AutoGPTQ for quantization.

AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization). With GPTQ quantization, you can quantize your favorite language model to 8, 4, 3, or even 2 bits without a significant drop in accuracy, while gaining faster inference speed. It is supported by most GPU hardware.

Olive consolidates GPTQ quantization into a single pass called GptqQuantizer, which supports tuning the GPTQ hyperparameters to trade off accuracy and speed.

Please refer to GptqQuantizer for more details about the pass and its config parameters.

Example Configuration

{
    "type": "GptqQuantizer",
    "data_config": "wikitext2_train"
}
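
Hyperparameters can be set alongside the data config to tune the accuracy/speed trade-off. The following is a sketch; the "bits" and "group_size" parameter names are assumptions based on common GPTQ tooling, so consult the GptqQuantizer reference for the exact names and defaults:

{
    "type": "GptqQuantizer",
    "data_config": "wikitext2_train",
    "bits": 4,
    "group_size": 128
}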

Check out this file for an example implementation of "wikitext2_train".
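
For reference, a "wikitext2_train" entry in the workflow's "data_configs" list might look like the following sketch. The field names assume Olive's Hugging Face data container conventions and the exact schema may differ, so treat it as illustrative rather than definitive:

{
    "name": "wikitext2_train",
    "type": "HuggingfaceContainer",
    "load_dataset_config": {
        "data_name": "wikitext",
        "subset": "wikitext-2-raw-v1",
        "split": "train"
    }
}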

AutoAWQ

AutoAWQ is an easy-to-use package for 4-bit quantized models; compared to FP16, it speeds up models by 3x and reduces memory requirements by 3x. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs, building on the original AWQ work from MIT.

Olive integrates AutoAWQ for quantization and makes it possible to convert the AWQ-quantized PyTorch model to an ONNX model (see the sketch after the example configuration below).

Please refer to AutoAWQQuantizer for more details about the pass and its config parameters.

Example Configuration

{
    "type": "AutoAWQQuantizer",
    "w_bit": 4
}
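
Since the AWQ-quantized model can then be converted to ONNX, a workflow typically chains the quantizer with Olive's OnnxConversion pass. A minimal sketch of the relevant "passes" fragment (the pass names "awq" and "conversion" are arbitrary, and the "target_opset" value is an assumption):

"passes": {
    "awq": {
        "type": "AutoAWQQuantizer",
        "w_bit": 4
    },
    "conversion": {
        "type": "OnnxConversion",
        "target_opset": 17
    }
}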

QuaRot

QuaRot is a quantization technique that combines rotation with quantization to reduce the number of bits required to represent a model's weights, activations, and KV cache. It is based on the QuaRot paper.
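
The key idea, as described in the QuaRot paper, is computational invariance: for an orthogonal matrix Q (with Q Qᵀ = I), a linear layer's output is unchanged when the rotation is folded into both the activations and the weights,

    Y = X Wᵀ = X Q Qᵀ Wᵀ = (X Q)(W Q)ᵀ

Rotating with a Hadamard-based Q suppresses activation outliers, which makes low-bit quantization of weights, activations, and the KV cache much easier.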

This pass only supports HuggingFace transformer PyTorch models. Please refer to QuaRot for more details on the types of transformers models supported.

Example Configuration

{
    "type": "QuaRot",
    "w_rtn": true,
    "rotate": true,
    "w_bits": 4,
    "a_bits": 4,
    "k_bits": 4,
    "v_bits": 4
}
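
In this example, "rotate" enables the rotation step, "w_rtn" selects simple round-to-nearest weight quantization (instead of GPTQ-style weight quantization), and "w_bits", "a_bits", "k_bits", and "v_bits" set the bit widths for weights, activations, and the KV cache keys and values. These interpretations follow the QuaRot paper's terminology; verify them against the QuaRot pass reference.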