Quantizing Models with PyTorch#
Quantizing an NLP-based model in PyTorch reduces the precision of the model's parameters to improve inference speed and shrink its memory footprint. In essence, floating-point parameters are converted to integers, and the whole process can be implemented by adding a few lines of code.
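To make the idea concrete, the sketch below (a stand-alone illustration, not part of Archai's API) quantizes a tensor to 8-bit integers with an affine scale and zero-point, then dequantizes it back:

import torch

x = torch.randn(4)

# Map the float range onto the 256 available uint8 levels.
scale = (x.max() - x.min()) / 255
zero_point = (-x.min() / scale).round().clamp(0, 255)

x_qnt = (x / scale + zero_point).round().clamp(0, 255).to(torch.uint8)
x_rec = (x_qnt.float() - zero_point) * scale  # dequantized approximation

print(x)      # original values
print(x_rec)  # close to x, up to quantization error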
Loading the Model#
The first step is to load a pre-trained NLP model. In this notebook, we will be using a pre-trained GPT-2 model from the Hugging Face Hub.
[1]:
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("gpt2")
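Since quantization targets inference, it is also common practice (an optional step, not shown in the original notebook) to switch the model to evaluation mode before quantizing it:

model.eval()  # disables dropout; quantization is applied at inference time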
Post-Training Quantization (PTQ)#
Post-Training Quantization (PTQ) quantizes a model after it has been trained, without any re-training. Here we rely on dynamic quantization, which converts weights to integers ahead of time and computes the quantization parameters of activations on the fly at runtime, helping to preserve accuracy and performance.
Archai offers a wrapper function, dynamic_quantization_torch(), which takes care of dynamically quantizing the pre-trained model.
Note that we set PyTorch’s number of threads to 1 because quantized models will only use a single thread.
[2]:
import torch
from archai.quantization.ptq import dynamic_quantization_torch
torch.set_num_threads(1)
model_qnt = dynamic_quantization_torch(model)
2023-03-21 15:18:12,480 - archai.quantization.ptq - INFO - Quantizing model ...
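For reference, Archai's wrapper is conceptually similar to PyTorch's built-in dynamic quantization, sketched below under the assumption that only nn.Linear modules are converted. Note that GPT-2 implements most of its projections with a custom Conv1D module rather than nn.Linear, which is one reason a dedicated wrapper is convenient:

import torch
from transformers import GPT2LMHeadModel

model_native = GPT2LMHeadModel.from_pretrained("gpt2")

# Weights of nn.Linear modules are stored as int8, while activations
# are quantized dynamically at runtime, one batch at a time.
model_qnt_native = torch.quantization.quantize_dynamic(
    model_native, {torch.nn.Linear}, dtype=torch.qint8
)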
Comparing Default and Quantized Models#
Finally, we can compare the sizes of the default and quantized models, as well as the difference between their logits. Note that if the model has not been pre-trained with Quantization Aware Training (QAT), it may produce noticeably different logits and its performance may be diminished.
[3]:
from archai.common.file_utils import calculate_torch_model_size
print(f"Model: {calculate_torch_model_size(model)}MB")
print(f"Model-QNT: {calculate_torch_model_size(model_qnt)}MB")
# Random token ids act as a dummy batch for comparing the two models
inputs = {"input_ids": torch.randint(1, 10, (1, 192))}
logits = model(**inputs).logits
logits_qnt = model_qnt(**inputs).logits
print(f"Difference between logits: {logits_qnt - logits}")
Model: 510.391647MB
Model-QNT: 431.250044MB
Difference between logits: tensor([[[-0.2147, -0.0618, -0.2794, ..., 1.0471, 1.0807, -0.8749],
[-1.4394, -1.5974, -5.1243, ..., -3.5922, -2.7616, -1.6151],
[-4.1445, -3.5687, -6.8751, ..., -3.9694, -4.0689, -3.0092],
...,
[-2.2967, -4.1277, -9.3187, ..., -1.6556, -3.2380, -1.3445],
[-2.0462, -4.3560, -9.2828, ..., -2.0148, -2.9403, -1.1727],
[-1.5593, -4.3758, -8.6710, ..., -0.7250, -2.5097, -0.7405]]],
grad_fn=<SubBackward0>)
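Since the raw tensor of differences is hard to read, one option (an addition to the original notebook, not part of Archai) is to collapse it into a single scalar, such as the mean absolute difference between the two sets of logits:

with torch.no_grad():
    diff = (model_qnt(**inputs).logits - model(**inputs).logits).abs()
print(f"Mean absolute logit difference: {diff.mean().item():.4f}")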