Quantizing Models with PyTorch#
Quantizing an NLP-based model in PyTorch reduces the precision of the model's parameters to improve inference speed and shrink its memory footprint. In essence, floating-point parameters are converted to integers, and the whole process can be implemented by adding a few lines of code.
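To make the idea concrete, the sketch below (a stand-alone illustration, not part of Archai's API) quantizes a tensor to 8-bit integers with an affine scale and zero-point, then dequantizes it back:

import torch

x = torch.randn(4)

# Map the float range onto the 256 available uint8 levels.
scale = (x.max() - x.min()) / 255
zero_point = (-x.min() / scale).round().clamp(0, 255)

x_qnt = (x / scale + zero_point).round().clamp(0, 255).to(torch.uint8)
x_rec = (x_qnt.float() - zero_point) * scale  # dequantized approximation

print(x)      # original values
print(x_rec)  # close to x, up to quantization error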
Loading the Model#
The first step is to load a pre-trained NLP model. In this notebook, we will be using a pre-trained GPT-2 model from the Hugging Face Hub.
[1]:
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("gpt2")
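Since quantization targets inference, it is also common practice (an optional step, not shown in the original notebook) to switch the model to evaluation mode before quantizing it:

model.eval()  # disables dropout; quantization is applied at inference time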
Post-Training Quantization (PTQ)#
Post-Training Quantization (PTQ) quantizes a model after it has been trained, without any re-training. Here we rely on dynamic quantization, which converts weights to integers ahead of time and computes the quantization parameters of activations on the fly at runtime, helping to preserve accuracy and performance.
Archai offers a wrapper function, dynamic_quantization_torch(), which takes care of dynamically quantizing the pre-trained model.
Note that we set PyTorch’s number of threads to 1 because quantized models will only use a single thread.
[2]:
import torch
from archai.quantization.ptq import dynamic_quantization_torch
torch.set_num_threads(1)
model_qnt = dynamic_quantization_torch(model)
2023-03-21 15:18:12,480 - archai.quantization.ptq - INFO - Quantizing model ...
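For reference, Archai's wrapper is conceptually similar to PyTorch's built-in dynamic quantization, sketched below under the assumption that only nn.Linear modules are converted. Note that GPT-2 implements most of its projections with a custom Conv1D module rather than nn.Linear, which is one reason a dedicated wrapper is convenient:

import torch
from transformers import GPT2LMHeadModel

model_native = GPT2LMHeadModel.from_pretrained("gpt2")

# Weights of nn.Linear modules are stored as int8, while activations
# are quantized dynamically at runtime, one batch at a time.
model_qnt_native = torch.quantization.quantize_dynamic(
    model_native, {torch.nn.Linear}, dtype=torch.qint8
)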
Comparing Default and Quantized Models#
Finally, we can compare the sizes of the default and quantized models, as well as the difference between their logits. Note that if the model has not been pre-trained with Quantization Aware Training (QAT), it may produce noticeably different logits and its performance may be diminished.
[3]:
from archai.common.file_utils import calculate_torch_model_size
print(f"Model: {calculate_torch_model_size(model)}MB")
print(f"Model-QNT: {calculate_torch_model_size(model_qnt)}MB")
# Random token ids act as a dummy batch for comparing the two models
inputs = {"input_ids": torch.randint(1, 10, (1, 192))}
logits = model(**inputs).logits
logits_qnt = model_qnt(**inputs).logits
print(f"Difference between logits: {logits_qnt - logits}")
Model: 510.391647MB
Model-QNT: 431.250044MB
Difference between logits: tensor([[[-0.2147, -0.0618, -0.2794, ..., 1.0471, 1.0807, -0.8749],
[-1.4394, -1.5974, -5.1243, ..., -3.5922, -2.7616, -1.6151],
[-4.1445, -3.5687, -6.8751, ..., -3.9694, -4.0689, -3.0092],
...,
[-2.2967, -4.1277, -9.3187, ..., -1.6556, -3.2380, -1.3445],
[-2.0462, -4.3560, -9.2828, ..., -2.0148, -2.9403, -1.1727],
[-1.5593, -4.3758, -8.6710, ..., -0.7250, -2.5097, -0.7405]]],
grad_fn=<SubBackward0>)
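Since the raw tensor of differences is hard to read, one option (an addition to the original notebook, not part of Archai) is to collapse it into a single scalar, such as the mean absolute difference between the two sets of logits:

with torch.no_grad():
    diff = (model_qnt(**inputs).logits - model(**inputs).logits).abs()
print(f"Mean absolute logit difference: {diff.mean().item():.4f}")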