Quantization#

Modules#

class archai.quantization.modules.FakeQuantEmbedding(*args, **kwargs)[source]#

Translate a torch-based Embedding layer into a QAT-ready Embedding layer.

property fake_quant_weight: Tensor#

Return a fake quantization over the weight matrix.

forward(x: Tensor) Tensor[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance itself rather than this method, since the instance takes care of running the registered hooks while calling forward directly silently ignores them.

classmethod from_float(mod: Module, qconfig: Dict[Module, Any] | None = None, **kwargs) FakeQuantEmbedding[source]#

Map module from float to QAT-ready.

Parameters:
  • mod – Module to be mapped.

  • qconfig – Quantization configuration.

Returns:

QAT-ready module.

to_float() Module[source]#

Map module from QAT-ready to float.

Returns:

Float-based module.

num_embeddings: int#
embedding_dim: int#
padding_idx: int | None#
max_norm: float | None#
norm_type: float#
scale_grad_by_freq: bool#
weight: Tensor#
freeze: bool#
sparse: bool#
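
Example (a hypothetical minimal sketch; assumes from_float works with the default qconfig for embeddings):

>>> import torch
>>> from archai.quantization.modules import FakeQuantEmbedding
>>> embedding = torch.nn.Embedding(100, 32)  # hypothetical sizes
>>> qat_embedding = FakeQuantEmbedding.from_float(embedding)
>>> out = qat_embedding(torch.tensor([1, 2, 3]))  # lookup uses the fake-quantized weight matrix
>>> float_embedding = qat_embedding.to_float()  # map back to a regular float Embedding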
class archai.quantization.modules.FakeQuantEmbeddingForOnnx(*args, **kwargs)[source]#

Allow a QAT-ready Embedding layer to be exported with ONNX.

num_embeddings: int#
embedding_dim: int#
padding_idx: int | None#
max_norm: float | None#
norm_type: float#
scale_grad_by_freq: bool#
weight: Tensor#
freeze: bool#
sparse: bool#
training: bool#
class archai.quantization.modules.FakeDynamicQuantLinear(*args, dynamic_weight: bool | None = True, activation_reduce_range: bool | None = True, bits: int | None = 8, onnx_compatible: bool | None = False, qconfig: Dict[Module, Any] | None = None, **kwargs)[source]#

Translate a torch-based Linear layer into a QAT-ready Linear layer.

property fake_quant_weight: Tensor#

Return a fake quantization over the weight matrix.

forward(x: Tensor) Tensor[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance itself rather than this method, since the instance takes care of running the registered hooks while calling forward directly silently ignores them.

classmethod from_float(mod: Module, qconfig: Dict[Module, Any] | None = None, activation_reduce_range: bool | None = True, **kwargs) FakeDynamicQuantLinear[source]#

Map module from float to QAT-ready.

Parameters:
  • mod – Module to be mapped.

  • qconfig – Quantization configuration.

  • activation_reduce_range – Whether to reduce the range of activations.

Returns:

QAT-ready module.

to_float() Module[source]#

Map module from QAT-ready to float.

Returns:

Float-based module.

in_features: int#
out_features: int#
weight: Tensor#
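
Example (a hypothetical minimal sketch; assumes the default arguments build the internal fake quantizers; in a full model these layers are normally created through prepare_with_qat or float_to_qat_modules):

>>> import torch
>>> from archai.quantization.modules import FakeDynamicQuantLinear
>>> qat_linear = FakeDynamicQuantLinear(16, 8, dynamic_weight=True, bits=8)  # hypothetical sizes
>>> y = qat_linear(torch.randn(4, 16))  # forward uses fake-quantized weights and activations
>>> w_fq = qat_linear.fake_quant_weight  # fake-quantized view of the weight matrix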
class archai.quantization.modules.FakeDynamicQuantLinearForOnnx(*args, **kwargs)[source]#

Allow a QAT-ready Linear layer to be exported with ONNX.

in_features: int#
out_features: int#
weight: Tensor#
training: bool#
class archai.quantization.modules.FakeDynamicQuantConv1d(*args, dynamic_weight: bool | None = True, activation_reduce_range: bool | None = True, bits: int | None = 8, onnx_compatible: bool | None = False, qconfig: Dict[Module, Any] | None = None, **kwargs)[source]#

Translate a torch-based Conv1d layer into a QAT-ready Conv1d layer.

property fake_quant_weight: Tensor#

Return a fake quantization over the weight matrix.

forward(x: Tensor) Tensor[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance itself rather than this method, since the instance takes care of running the registered hooks while calling forward directly silently ignores them.

classmethod from_float(mod: Module, qconfig: Dict[Module, Any] | None = None, activation_reduce_range: bool | None = True, **kwargs) FakeDynamicQuantConv1d[source]#

Map module from float to QAT-ready.

Parameters:
  • mod – Module to be mapped.

  • qconfig – Quantization configuration.

  • activation_reduce_range – Whether to reduce the range of activations.

Returns:

QAT-ready module.

to_float() Module[source]#

Map module from QAT-ready to float.

Returns:

Float-based module.

bias: Tensor | None#
in_channels: int#
out_channels: int#
kernel_size: Tuple[int, ...]#
stride: Tuple[int, ...]#
padding: str | Tuple[int, ...]#
dilation: Tuple[int, ...]#
transposed: bool#
output_padding: Tuple[int, ...]#
groups: int#
padding_mode: str#
weight: Tensor#
class archai.quantization.modules.FakeDynamicQuantConv1dForOnnx(*args, **kwargs)[source]#

Allow a QAT-ready Conv1d layer to be exported with ONNX.

bias: Tensor | None#
in_channels: int#
out_channels: int#
kernel_size: Tuple[int, ...]#
stride: Tuple[int, ...]#
padding: str | Tuple[int, ...]#
dilation: Tuple[int, ...]#
transposed: bool#
output_padding: Tuple[int, ...]#
groups: int#
padding_mode: str#
weight: Tensor#
training: bool#

Observers#

class archai.quantization.observers.OnnxDynamicObserver(dtype: str)[source]#

DynamicObserver that is compliant with ONNX-based graphs.

This class can be used to perform symmetric or asymmetric quantization, depending on the dtype provided: qint8 is usually used for symmetric quantization, while quint8 is used for asymmetric quantization.

calculate_qparams() None[source]#

Calculate the quantization parameters.

Quantizers#

class archai.quantization.quantizers.FakeDynamicQuant(reduce_range: bool | None = True, dtype: dtype | None = torch.quint8, bits: int | None = 8, onnx_compatible: bool | None = False)[source]#

Fake dynamic quantizer to allow for scale/zero point calculation during Quantization-Aware Training.

This class allows inserting a fake dynamic quantization operator in a PyTorch model, in order to calculate scale and zero point values that can be used to quantize the model during training. The operator can be customized to use different quantization types (quint8 or qint8) and bit widths, and it can be made compatible with ONNX.

Note: This module is only meant to be used during training, and should not be present in the final, deployed model.

training: bool#
forward(x: Tensor) Tensor[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance itself rather than this method, since the instance takes care of running the registered hooks while calling forward directly silently ignores them.
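
Example (a hypothetical minimal sketch):

>>> import torch
>>> from archai.quantization.quantizers import FakeDynamicQuant
>>> fake_quant = FakeDynamicQuant(dtype=torch.quint8, bits=8)
>>> x = torch.randn(2, 4)
>>> x_fq = fake_quant(x)  # same shape as x, values snapped to a simulated 8-bit grid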

Quantization (Utilities)#

archai.quantization.quantization_utils.rgetattr(obj: Any, attr: str, *args) Any[source]#

Recursively get an attribute from an object.

This function allows accessing nested attributes by separating each level with a dot (e.g., “attr1.attr2.attr3”). If any attribute along the chain does not exist, the function returns the default value specified in the *args parameter.

Parameters:
  • obj – Object from which the attribute will be retrieved.

  • attr – Name of the attribute to be retrieved, with each level separated by a dot.

Returns:

Attribute from the object.

Example

>>> obj = MyObject()
>>> rgetattr(obj, "attr1.attr2.attr3")
Reference:

https://stackoverflow.com/questions/31174295/getattr-and-setattr-on-nested-subobjects-chained-properties/31174427#31174427

archai.quantization.quantization_utils.rsetattr(obj: Any, attr: str, value: Any) None[source]#

Recursively set an attribute on an object.

This function allows setting nested attributes by separating each level with a dot (e.g., “attr1.attr2.attr3”).

Parameters:
  • obj – Object on which the attribute will be set.

  • attr – Name of the attribute to be set, with each level separated by a dot.

  • value – New value for the attribute.

Example

>>> obj = MyObject()
>>> rsetattr(obj, "attr1.attr2.attr3", new_value)
Reference:

https://stackoverflow.com/questions/31174295/getattr-and-setattr-on-nested-subobjects-chained-properties/31174427#31174427

Post-Training Quantization (PTQ)#

class archai.quantization.ptq.GemmQuant(onnx_quantizer: ONNXQuantizer, onnx_node: NodeProto)[source]#

Quantized version of the Gemm operator.

quantize() None[source]#

Quantize a Gemm node into QGemm.

This method replaces the original Gemm node with a QGemm node, which is a quantized version of the Gemm operator. It also adds a Cast node to cast the output of QGemm to float, and an Add node to add the remaining bias to the Gemm output.

archai.quantization.ptq.add_new_quant_operators() None[source]#

Add support for new quantization operators by changing internal onnxruntime registry dictionaries.

archai.quantization.ptq.dynamic_quantization_onnx(onnx_model_path: str) str[source]#

Perform dynamic quantization on an ONNX model.

The quantized model is saved to a new file with “-int8” appended to the original file name.

Parameters:

onnx_model_path – Path to the ONNX model to be quantized.

Returns:

Path to the dynamically quantized ONNX model.
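
Example (a hypothetical minimal sketch; "model.onnx" is a placeholder path to an exported ONNX model):

>>> from archai.quantization.ptq import dynamic_quantization_onnx
>>> quantized_model_path = dynamic_quantization_onnx("model.onnx")  # path of the new file with "-int8" appended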

archai.quantization.ptq.dynamic_quantization_torch(model: Module, embedding_layers: List[str] | None = ['word_emb', 'transformer.wpe', 'transformer.wte']) None[source]#

Perform dynamic quantization on a PyTorch model.

This function performs dynamic quantization on the input PyTorch model, including any specified embedding layers.

Parameters:
  • model – PyTorch model to be quantized.

  • embedding_layers – List of string-based identifiers of embedding layers to be quantized.
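
Example (a hypothetical minimal sketch; assumes a GPT-2 style model from transformers, whose embeddings match the default "transformer.wte" and "transformer.wpe" identifiers):

>>> from transformers import GPT2LMHeadModel
>>> from archai.quantization.ptq import dynamic_quantization_torch
>>> model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
>>> dynamic_quantization_torch(model, embedding_layers=["transformer.wte", "transformer.wpe"])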

Quantization-Aware Training (QAT)#

archai.quantization.qat.qat_to_float_modules(model: Module) None[source]#

Convert QAT-ready modules to float-based modules.

This function converts all QAT-ready modules in the input model to float-based modules. It does this recursively, so all sub-modules within the input model will also be converted if applicable.

Parameters:

model – QAT-ready module to be converted.

archai.quantization.qat.float_to_qat_modules(model: Module, module_mapping: Dict[Module, Module] | None = {torch.nn.Conv1d: FakeDynamicQuantConv1d, torch.nn.Linear: FakeDynamicQuantLinear, torch.nn.Embedding: FakeQuantEmbedding, transformers.pytorch_utils.Conv1D: FakeDynamicQuantHFConv1D}, qconfig: Dict[Module, Any] | None = None, **kwargs) None[source]#

Convert float-based modules to QAT-ready modules.

This function converts all float-based modules in the input model to QAT-ready modules using the provided module mapping. It does this recursively, so all sub-modules within the input model will also be converted if applicable.

A quantization configuration can also be supplied.

Parameters:
  • model – Float-based module to be converted.

  • module_mapping – Maps between float and QAT-ready modules.

  • qconfig – Quantization configuration to be used for the conversion.

archai.quantization.qat.prepare_with_qat(model: Module, inplace: bool | None = True, onnx_compatible: bool | None = False, backend: str | None = 'qnnpack', **kwargs) Module[source]#

Prepare a float-based PyTorch model for quantization-aware training (QAT).

This function modifies the input model in place by inserting QAT-based modules and configurations.

Parameters:
  • model – Float-based PyTorch module to be prepared for QAT.

  • inplace – Whether the prepared QAT model should replace the original model.

  • onnx_compatible – Whether the prepared QAT model should be compatible with ONNX.

  • backend – Quantization backend to be used.

Returns:

The QAT-ready model (the input model itself, modified in place, when inplace is True).
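
Example (a hypothetical minimal sketch of the QAT round trip on a toy model):

>>> import torch
>>> from archai.quantization.qat import prepare_with_qat, qat_to_float_modules
>>> model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4))
>>> model = prepare_with_qat(model, onnx_compatible=True)  # swaps float layers for QAT-ready ones
>>> # ... fine-tune `model` here so the fake quantizers observe realistic scales ...
>>> qat_to_float_modules(model)  # swap the QAT-ready layers back to float modules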

Mixed-QAT#

class archai.quantization.mixed_qat.MixedQAT(model: Module, qat_weight: float | None = 0.2)[source]#

Mixed QAT (Quantization-Aware Training) model, which can be fine-tuned using a linear combination of regular and QAT losses.

forward(input_ids: LongTensor, labels: LongTensor, *args, **kwargs) Tuple[Tensor, ...][source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance itself rather than this method, since the instance takes care of running the registered hooks while calling forward directly silently ignores them.

training: bool#
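
Example (a hypothetical minimal sketch; assumes a causal language model from transformers and that the combined loss is the first element of the returned tuple):

>>> import torch
>>> from transformers import GPT2LMHeadModel
>>> from archai.quantization.mixed_qat import MixedQAT
>>> model = GPT2LMHeadModel.from_pretrained("gpt2")
>>> mixed_model = MixedQAT(model, qat_weight=0.2)
>>> input_ids = torch.randint(0, model.config.vocab_size, (1, 16))
>>> outputs = mixed_model(input_ids, labels=input_ids)  # tuple of outputs; combined loss assumed first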