# VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

[![License](https://img.shields.io/badge/license-mit-blue)](https://github.com/microsoft/VPTQ/blob/main/LICENSE) [![PyPi](https://img.shields.io/pypi/v/vptq)](https://pypi.org/project/vptq/) [![Algorithm](https://img.shields.io/badge/Algorithm-OpenSource-blue)](https://github.com/microsoft/VPTQ/tree/algorithm)

**Efficient, Flexible and Compressing LLM in less than 2bits**

[Get Started](#installation) | [Technical Report](https://arxiv.org/pdf/2409.17066)

TL;DR

Vector Post-Training Quantization (VPTQ) is a novel post-training quantization method that leverages vector quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy.

News

Installation

Dependencies

Install VPTQ on your machine

Recommended: to save the time needed to build the package, install VPTQ directly from the latest release:

pip install vptq

or from

https://github.com/microsoft/VPTQ/releases

Build from source

[Skip this section if you installed a release package]

Preparation steps that might be needed: Set up CUDA_HOME and PATH.

Replace cuda-12 with your own CUDA version and install location. Run nvcc --version to find your version, and which nvcc to check your CUDA path.

# example
export CUDA_HOME=/usr/local/cuda-12
export PATH=/usr/local/cuda-12/bin/:$PATH  # set dependent on your environment
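
The two exports above can also be derived from wherever nvcc is installed. The following is a minimal sketch, not part of VPTQ: the helper name `cuda_home_from_nvcc` is hypothetical, and it assumes the conventional layout in which nvcc lives at `<CUDA_HOME>/bin/nvcc`.

```python
import os
import shutil
from typing import Optional


def cuda_home_from_nvcc(nvcc_path: Optional[str] = None) -> Optional[str]:
    """Infer CUDA_HOME from the nvcc location (hypothetical helper, not part of VPTQ).

    Assumes the conventional layout <CUDA_HOME>/bin/nvcc.
    """
    # Fall back to searching PATH, which is what `which nvcc` does.
    nvcc_path = nvcc_path or shutil.which("nvcc")
    if nvcc_path is None:
        return None  # nvcc is not on PATH; the CUDA toolkit may be missing
    # Drop the trailing "bin/nvcc" to get the toolkit root.
    return os.path.dirname(os.path.dirname(os.path.abspath(nvcc_path)))


if __name__ == "__main__":
    home = cuda_home_from_nvcc()
    if home is None:
        print("nvcc not found; install the CUDA toolkit or fix PATH")
    else:
        print(f"export CUDA_HOME={home}")
        print(f"export PATH={home}/bin:$PATH")
```

If nvcc on your system is a symlink (for example under /usr/bin), the inferred directory may not be the real toolkit root, so check the printed paths before using them.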

Compiling the CUDA kernels takes several minutes, so please be patient. By default the build targets SM 7.0, 7.5, 8.0, 8.6, and 9.0 to reduce compilation time. You can set TORCH_CUDA_ARCH_LIST to your specific architecture.

pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation

You can configure the required CUDA architectures and the number of nvcc compile threads by setting

TORCH_CUDA_ARCH_LIST=8.0,9.0 NVCC_THREADS=16 pip install -e . --no-build-isolation

to reduce compilation time.

Example: Run Llama 3.1 70B on an RTX 4090 (24 GB @ ~2 bits) in real time
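
A rough back-of-the-envelope check shows why a 70B model at about 2 bits per weight can fit in 24 GB. The sketch below is illustrative only: the function name is hypothetical, the parameter count is the nominal 70e9, and the estimate covers quantized weights alone. It ignores codebooks, any layers kept at higher precision, the KV cache, and activations.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the quantized weights only, in GiB."""
    total_bytes = num_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 2**30


if __name__ == "__main__":
    params_70b = 70e9  # nominal parameter count (assumption)
    for bits in (16, 4, 2):
        gib = weight_memory_gib(params_70b, bits)
        print(f"{bits:>2} bits per weight: {gib:6.1f} GiB")
```

At 2 bits per weight the weights come to roughly 16 GiB, which leaves some headroom on a 24 GB card, whereas 4 bits per weight already exceeds it.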


VPTQ is an ongoing project. If the open-source community is interested in optimizing and expanding VPTQ, please feel free to submit an issue or DM.


Evaluation

Models from Open Source Community

⚠️ This repository only provides the model quantization algorithm.

⚠️ The open-source community VPTQ-community provides models based on the technical report and quantization algorithm.

⚠️ The repository cannot guarantee the performance of those models.

Quick Estimation of Model Bitwidth (Excluding Codebook Overhead):

| Model Series | Collections | (Estimated) Bit per weight |
|---|---|---|
| Llama 3.3 70B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 1.875 bits, 1.625 bits |
| Llama 3.1 Nemotron 70B Instruct HF | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 1.875 bits, 1.625 bits, 1.5 bits |
| Llama 3.1 8B Instruct | HF 🤗 | 4 bits, 3.5 bits, 3 bits, 2.3 bits |
| Llama 3.1 70B Instruct | HF 🤗 | 4 bits, 3 bits, 2.25 bits, 2 bits (1), 2 bits (2), 1.93 bits, 1.875 bits, 1.75 bits |
| Llama 3.1 405B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits, 1.875 bits, 1.625 bits, 1.5 bits (1), 1.5 bits (2), 1.43 bits, 1.375 bits |
| Mistral Large Instruct 2407 (123B) | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 1.875 bits, 1.75 bits, 1.625 bits, 1.5 bits |
| Qwen 2.5 7B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3) |
| Qwen 2.5 14B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3) |
| Qwen 2.5 32B Instruct | HF 🤗 | 4 bits, 3 bits, 2 bits (1), 2 bits (2), 2 bits (3) |
| Qwen 2.5 72B Instruct | HF 🤗 | 4 bits, 3 bits, 2.38 bits, 2.25 bits (1), 2.25 bits (2), 2 bits (1), 2 bits (2), 1.94 bits |
| Reproduced from the tech report | HF 🤗 | Results from the open source community for reference only, please use them responsibly. |
| Hessian and Inverse Hessian Matrix | HF 🤗 | Collected from RedPajama-Data-1T-Sample, following Quip# |
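
The estimated bit-widths can be approximated from a community model name. The sketch below rests on an assumption that this section does not confirm: that a suffix such as `v8-k65536-0` encodes the vector length (`v`), the number of centroids (`k`), and the number of residual centroids (the final number, with 0 meaning none). Under that reading, each vector of `v` weights is stored as one index of log2(k) bits, plus log2(residual) bits when a residual codebook is used. The function name is hypothetical, and codebook storage is excluded, as in the table.

```python
import math
import re


def estimate_bits_per_weight(model_name: str) -> float:
    """Estimate index bits per weight from a name like '...-v8-k65536-0-woft'.

    Assumed naming (not confirmed by the README text): v<vector length>,
    k<number of centroids>, then <number of residual centroids> (0 = none).
    Codebook storage is ignored.
    """
    match = re.search(r"-v(\d+)-k(\d+)-(\d+)", model_name)
    if match is None:
        raise ValueError(f"unrecognized model name: {model_name}")
    vector_len, centroids, residual = (int(g) for g in match.groups())
    index_bits = math.log2(centroids)  # bits for the main codebook index
    if residual > 0:
        index_bits += math.log2(residual)  # extra bits for the residual index
    return index_bits / vector_len  # one index set is shared by a whole vector


if __name__ == "__main__":
    for name in (
        "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft",
        "VPTQ-community/Meta-Llama-3.3-70B-Instruct-v16-k65536-65536-woft",
    ):
        print(name, "->", estimate_bits_per_weight(name), "bits per weight")
```

Both names used elsewhere in this README come out at 2 bits per weight under this assumption, which matches the "~2 bit" label given for the first one.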

Language Generation Example

To generate text using the pre-trained model, you can use the following code snippet:

The model VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft (~2 bit) is provided by the open-source community. The repository cannot guarantee the performance of community models.

python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --prompt="Explain: Do Not Go Gentle into That Good Night"

Terminal Chatbot Example

Launch a chatbot. Note that you must use a chat model for this to work:

python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --chat

Huggingface Transformers API Example:

The Hugging Face Transformers main branch now supports VPTQ:

#! pip install git+https://github.com/huggingface/transformers.git -U
#! pip install vptq -U

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "VPTQ-community/Meta-Llama-3.3-70B-Instruct-v16-k65536-65536-woft"
# Load VPTQ-quantized model directly from HuggingFace Hub
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Simple inference
prompt = "Explain: Do not go gentle into that good night."
output = model.generate(
    **tokenizer(prompt, return_tensors="pt").to(model.device)
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Python API Example from VPTQ package:

Using the Python API from the VPTQ package:

import vptq
import transformers

model_name = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
m = vptq.AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Explain: Do Not Go Gentle into That Good Night"
out = m.generate(
    **tokenizer(prompt, return_tensors="pt").to("cuda"),
    max_new_tokens=100,
    pad_token_id=2
)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Gradio Web App Example

Set the SHARE_LINK environment variable to control whether a share link is created, for example: export SHARE_LINK=1

python -m vptq.app

VPTQ Algorithm Early Release

The VPTQ algorithm is available as an early release on the algorithm branch; check out the tutorial for details.

Tech Report

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extremely low bit-widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
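
The idea of compressing vectors into indices with a lookup table can be illustrated with a toy example. This is a generic vector quantization sketch under simple assumptions (a tiny hand-written codebook and plain nearest-neighbour assignment); it is not VPTQ's actual procedure, whose codebook construction is described in the technical report. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(weights: Sequence[float], codebook: Sequence[Vector]) -> List[int]:
    """Split weights into vectors and store each as its nearest codebook index."""
    v = len(codebook[0])  # vector length
    if len(weights) % v != 0:
        raise ValueError("weight count must be a multiple of the vector length")
    indices = []
    for start in range(0, len(weights), v):
        vec = weights[start:start + v]
        nearest = min(range(len(codebook)),
                      key=lambda i: squared_distance(vec, codebook[i]))
        indices.append(nearest)
    return indices


def dequantize(indices: Sequence[int], codebook: Sequence[Vector]) -> List[float]:
    """Rebuild approximate weights by looking each index up in the codebook."""
    return [x for i in indices for x in codebook[i]]


if __name__ == "__main__":
    # 4 centroids of length 2: each pair of weights costs log2(4) = 2 bits,
    # i.e. 1 bit per weight, excluding the codebook itself.
    codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0)]
    weights = [0.9, 1.1, -0.8, 1.2, 0.1, -0.1, 1.05, -0.95]
    idx = quantize(weights, codebook)
    print("indices:", idx)
    print("reconstructed:", dequantize(idx, codebook))
```

Only the indices and the shared codebook need to be stored, which is how the per-weight cost can drop below what scalar quantization allows.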

Read the full Tech Report and the arXiv paper.

Early Results from Tech Report

VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes under reasonable parameters, especially in terms of model accuracy and inference speed.

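| Model | bitwidth | W2↓ | C4↓ | AvgQA↑ | tok/s↑ | mem(GB) | cost/h↓ |
|---|---|---|---|---|---|---|---|
| LLaMA-2 7B | 2.02 | 6.13 | 8.07 | 58.2 | 39.9 | 2.28 | 2 |
| | 2.26 | 5.95 | 7.87 | 59.4 | 35.7 | 2.48 | 3.1 |
| LLaMA-2 13B | 2.02 | 5.32 | 7.15 | 62.4 | 26.9 | 4.03 | 3.2 |
| | 2.18 | 5.28 | 7.04 | 63.1 | 18.5 | 4.31 | 3.6 |
| LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7 | 19.54 | 19 |
| | 2.11 | 3.92 | 5.71 | 68.7 | 9.7 | 20.01 | 19 |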

Road Map

Project main members:

Acknowledgement

Publication

EMNLP 2024 Main

@inproceedings{
  vptq,
  title={VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models},
  author={Yifei Liu and
          Jicheng Wen and
          Yang Wang and
          Shengyu Ye and
          Li Lyna Zhang and
          Ting Cao and
          Cheng Li and
          Mao Yang},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024},
}


Limitation of VPTQ

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.