Section 11: Plans - GPU Tensorization
In this section we will look more closely at how we can utilize tensor cores on supported GPUs to accelerate matrix multiplication operations.
Related Concepts
Since tensor cores on the GPU can only perform matrix multiplications of certain standard shapes, we first need to familiarize ourselves with the associated terminology:
- MMA shape - the smallest tensorizable matrix multiplication shape; a nest of this shape, or a multiple of it, can be executed on tensor cores. Accera supports MMA shapes of the form `MmxNnxKk_Bb`, which perform a matrix multiplication of shape {m, n, k}, i.e., `C += A x B`, where matrix `A` is of shape {m, k}, matrix `B` is of shape {k, n}, and the result matrix `C` is of shape {m, n}. The MMA shape can be specified by setting the `mma_shape` parameter in the `plan.tensorize` function call.
- Tensor pass - A single tensor pass refers to a single unit of tensor operation. For example, a single pass of the MMA shape `M16xN16xK4_B1` performs a matrix multiplication of shape {16, 16, 4}, whereas 4 passes of the same MMA shape perform a matmul of shape {16, 16, 16} in 4 iterations (passes), where each pass performs a matmul of shape {16, 16, 4}. The number of passes can be controlled by setting the `num_total_passes` parameter in the `plan.tensorize` function call.
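The pass decomposition above can be illustrated in plain Python, with no tensor cores involved, just the shape bookkeeping: 4 passes of a {16, 16, 4} matmul accumulate into the same result as a single {16, 16, 16} matmul.

```python
# Illustrative sketch only: shows how splitting the K dimension into
# passes preserves the result of C += A x B.
M, N, K, PASS_K = 16, 16, 16, 4

# Arbitrary small integer inputs.
A = [[(i * K + k) % 7 for k in range(K)] for i in range(M)]
B = [[(k * N + j) % 5 for j in range(N)] for k in range(K)]

# Reference: one {M, N, K} matmul.
C_ref = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
         for i in range(M)]

# Tensor-pass view: K is covered by K // PASS_K passes, each of which
# performs a {M, N, PASS_K} matmul and accumulates into C.
num_total_passes = K // PASS_K  # 4 passes
C = [[0] * N for _ in range(M)]
for p in range(num_total_passes):
    for i in range(M):
        for j in range(N):
            for kk in range(PASS_K):
                k = p * PASS_K + kk
                C[i][j] += A[i][k] * B[k][j]

assert C == C_ref  # the per-pass accumulation matches the single matmul
```

Hardware tensor cores perform each pass as one instruction rather than a loop nest, but the accumulation semantics are the same.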
Tuning Parameters
- Pass fusing/grouping - A group of passes can be fused together to control the allocation of registers required for input data (the `A` and `B` matrices) and the memory I/O density during tensor matmul. This is explained in more detail in the Multi-Pass Tensorized MatMul with Pass Fusion tutorial.
- Scheduling policy - This parameter can be used to tune register usage for accumulator data (the `C` matrix) for multi-block tensor shapes. This is explained in more detail in the Tensor MatMul on GPU: Scheduling Policy experiments tutorial.
- Prologue/Epilogue Ops - These parameters can be set to perform element-wise operations before and after the matmul on tensor cores in an optimized way. Examples of this usage are presented in the Tensor MatMul on GPU: Fused Element-wise Operations tutorial.