Quantize#

Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision. A quantized model executes some or all of the operations on tensors with reduced precision rather than full precision (floating point) values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms.
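
For intuition, here is a minimal NumPy sketch of 8-bit affine (scale and zero-point) quantization of a single tensor. It is illustrative only, not Olive's implementation, and the function names are hypothetical:

import numpy as np

def quantize_uint8(x: np.ndarray):
    """Map float values to uint8 with an affine (scale + zero-point) scheme."""
    qmin, qmax = 0, 255
    scale = (x.max() - x.min()) / (qmax - qmin)       # float step per integer level
    zero_point = int(round(qmin - x.min() / scale))   # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original float values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.5, 0.0, 0.75, 1.5], dtype=np.float32)
q, scale, zp = quantize_uint8(x)
print(q, dequantize(q, scale, zp))  # 4x smaller storage than float32, small rounding error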

Olive encapsulates the latest cutting-edge quantization techniques into a single command-line tool, making it easy to experiment with different techniques and test their impact.

Supported quantization techniques#

Currently, Olive supports the following techniques:

Note

Some methods require a GPU and/or a calibration dataset.

| Method | Description | GPU required | Calibration dataset required | Input model format(s) | Output model format |
|---|---|---|---|---|---|
| AWQ | Activation-aware Weight Quantization (AWQ) creates 4-bit quantized models, speeding up models by 3x and reducing memory requirements by 3x compared to FP16. | ✔️ |  | PyTorch, Hugging Face | PyTorch |
| GPTQ | Generative Pre-trained Transformer Quantization (GPTQ) is a one-shot weight quantization method. You can quantize your favorite language model to 8, 4, 3, or even 2 bits. | ✔️ | ✔️ | PyTorch, Hugging Face | PyTorch |
| QuaRot | A quantization technique that combines rotation with quantization to reduce the number of bits required to represent the weights of a model. | ✔️ | ✔️ | Hugging Face | PyTorch |
| bnb4 | Quantizes MatMul weights to N bits (e.g., 2, 3, 4, 5, 6, 7). |  |  | ONNX | ONNX |
| ONNX Dynamic | Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically at runtime (see the standalone sketch after this table). |  |  | ONNX | ONNX |
| INC Dynamic | Dynamic quantization using the Intel® Neural Compressor model compression tool. |  |  | ONNX | ONNX |
| NVMO | NVIDIA TensorRT Model Optimizer is a library comprising state-of-the-art model optimization techniques, including quantization, sparsity, distillation, and pruning, to compress models. |  |  | ONNX | ONNX |
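
The ONNX Dynamic technique above corresponds to ONNX Runtime's dynamic quantization tooling, which Olive drives for you. For reference, the sketch below shows the equivalent standalone ONNX Runtime call outside of Olive; the file paths are placeholders:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Weights are quantized offline; activation scale/zero-point values are computed at
# runtime, which is why no calibration dataset is needed (matching the table above).
quantize_dynamic(
    model_input="model.onnx",        # placeholder: path to a float32 ONNX model
    model_output="model.int8.onnx",  # placeholder: output path
    weight_type=QuantType.QInt8,
)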

Quickstart#

To use AWQ quantization on Llama-3.2-1B-Instruct, run the following command:

Note

  • You’ll need to execute this command on a GPU machine.

  • If you want to quantize a different model, update --model_name_or_path to a different Hugging Face repo ID (`{username}/{model}`).

olive quantize \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --algorithm awq \
    --output_path models/llama/awq \
    --log_level 1
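
The command writes an AWQ-quantized PyTorch model under models/llama/awq. As a quick sanity check, you can load it back with Hugging Face transformers; this sketch assumes the autoawq package is installed, a CUDA GPU is available, and that the checkpoint sits directly in the output folder (adjust the path if Olive nests it in a subfolder):

from transformers import AutoModelForCausalLM, AutoTokenizer

path = "models/llama/awq"  # assumption: adjust if the checkpoint is in a subfolder such as .../model
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")

inputs = tokenizer("What is model quantization?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))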

Quantization with ONNX Optimizations#

As articulated in Supported quantization techniques, you may wish to take the PyTorch/Hugging Face output of the AWQ/GPTQ/QuaRot quantization methods and convert it into an optimized ONNX format so that you can run inference with ONNX Runtime.

You can use Olive’s automatic optimizer (auto-opt) to create an optimized ONNX model from a quantized model:

# Step 1: AWQ (will output a PyTorch model)
olive quantize \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --algorithm awq \
    --output_path models/llama/awq \
    --log_level 1

# Step 2: Create an optimized ONNX model
olive auto-opt \
   --model_name_or_path models/llama/awq \
   --device cpu \
   --provider CPUExecutionProvider \
   --use_ort_genai \
   --output_path models/llama/onnx \
   --log_level 1
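
Because --use_ort_genai is passed, the output is packaged for ONNX Runtime GenAI. A generation loop then looks roughly like the sketch below; the exact onnxruntime_genai API differs between releases and the model folder name depends on where auto-opt places the ONNX files, so treat the path and calls as assumptions to verify against your installed version:

import onnxruntime_genai as og

# Assumption: auto-opt placed the ONNX model and genai_config.json in this folder.
model = og.Model("models/llama/onnx/model")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is model quantization?"))

while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))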

Pre-processing for Finetuning#

Quantizing a model as a pre-processing step for finetuning, rather than as a post-processing step, leads to more accurate quantized models because the loss introduced by quantization can be recovered during finetuning. The chain of Olive CLI commands required to quantize, finetune, and output an ONNX model for ONNX Runtime is:

# Step 1: AWQ (will output a PyTorch model)
olive quantize \
   --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
   --trust_remote_code \
   --algorithm awq \
   --output_path models/llama/awq \
   --log_level 1

# Step 2: Finetune (will output a PEFT adapter)
olive finetune \
    --method lora \
    --model_name_or_path models/llama/awq \
    --data_name xxyyzzz/phrase_classification \
    --text_template "<|start_header_id|>user<|end_header_id|>\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n{tone}" \
    --max_steps 100 \
    --output_path ./models/llama/ft \
    --log_level 1

# Step 3: Optimized ONNX model (will output an ONNX Model)
olive auto-opt \
   --model_name_or_path models/llama/ft/model \
   --adapter_path models/llama/ft/adapter \
   --device cpu \
   --provider CPUExecutionProvider \
   --use_ort_genai \
   --output_path models/llama/onnx \
   --log_level 1

Once the automatic optimizer has successfully completed, you’ll have:

  1. The base model in an optimized ONNX format.

  2. The adapter weights in a format for ONNX Runtime.
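
At inference time, ONNX Runtime GenAI can load the exported adapter alongside the base ONNX model. The rough sketch below shows the idea; the adapter file name, folder layout, and the onnxruntime_genai adapter API (og.Adapters, set_active_adapter) vary by Olive and onnxruntime-genai version, so treat them as assumptions:

import onnxruntime_genai as og

model = og.Model("models/llama/onnx/model")  # assumption: base ONNX model folder from auto-opt
adapters = og.Adapters(model)
adapters.load("models/llama/onnx/adapter_weights.onnx_adapter", "phrase_classification")  # assumption: adapter file name

tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)
generator.set_active_adapter(adapters, "phrase_classification")

generator.append_tokens(tokenizer.encode(
    "<|start_header_id|>user<|end_header_id|>\nThat was a fantastic experience!<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n"))
while not generator.is_done():
    generator.generate_next_token()
print(tokenizer.decode(generator.get_sequence(0)))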