Section 11: Plans - GPU Tensorization

In this section we will look more closely at how we can utilize tensor cores on supported GPUs to accelerate matrix multiplication operations.

Since tensor cores on the GPU can perform matrix multiplication only of certain standard shapes, we first need to familiarize ourselves with some of the associated terminology:

  • MMA shape - The smallest tensorizable matrix multiplication shape. In other words, a nest of this shape, or of a multiple of it, can be executed on tensor cores. Accera supports MMA shapes of the form MmxNnxKk_Bb, which perform matrix multiplication of shape {m, n, k}, i.e., C += A x B, where matrix A is of shape {m, k}, matrix B is of shape {k, n}, and the result matrix C is of shape {m, n}. The MMA shape can be specified by setting the mma_shape parameter in the plan.tensorize function call.
  • Tensor pass - A single tensor pass refers to a single unit of tensor operation. For example, a single pass of the MMA shape M16xN16xK4_B1 performs matrix multiplication of shape {16, 16, 4}, whereas 4 passes of the same MMA shape perform a matmul of shape {16, 16, 16} in 4 iterations (passes), where each pass performs a matmul of shape {16, 16, 4}. The number of passes can be controlled by setting the num_total_passes parameter in the plan.tensorize function call.
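To make the relationship between the MMA shape and the number of passes concrete, the following is a minimal pure-Python sketch. It does not use Accera or a GPU. The helper names parse_mma_shape and multi_pass_matmul are hypothetical and exist only for this illustration, and the trailing b value of the shape name is parsed but otherwise ignored, since its meaning is not needed here. The sketch decomposes a {16, 16, 16} matmul into 4 passes of shape {16, 16, 4} and accumulates each pass into C.

```python
import re


def parse_mma_shape(name):
    """Split a name of the form MmxNnxKk_Bb into the integers (m, n, k, b).

    Hypothetical helper for illustration; not part of the Accera API.
    """
    match = re.fullmatch(r"M(\d+)xN(\d+)xK(\d+)_B(\d+)", name)
    if match is None:
        raise ValueError(f"not an MmxNnxKk_Bb shape name: {name!r}")
    m, n, k, b = (int(group) for group in match.groups())
    return m, n, k, b


def multi_pass_matmul(A, B, mma_shape, num_total_passes):
    """Compute C = A x B as num_total_passes passes of shape {m, n, k}.

    A must have shape {m, k * num_total_passes} and B must have shape
    {k * num_total_passes, n}. Each pass accumulates one k-wide slice.
    """
    m, n, k, _ = parse_mma_shape(mma_shape)  # b is unused in this sketch
    C = [[0] * n for _ in range(m)]
    for p in range(num_total_passes):
        k_start = p * k
        # One pass: C += A[:, k_start:k_start+k] x B[k_start:k_start+k, :]
        for i in range(m):
            for j in range(n):
                for kk in range(k_start, k_start + k):
                    C[i][j] += A[i][kk] * B[kk][j]
    return C


# Integer test matrices so that results can be compared exactly.
A = [[i + j for j in range(16)] for i in range(16)]
B = [[i * j + 1 for j in range(16)] for i in range(16)]

# Four passes of M16xN16xK4_B1 cover the full {16, 16, 16} product.
C = multi_pass_matmul(A, B, "M16xN16xK4_B1", num_total_passes=4)

# Reference result computed in a single sweep over the whole k dimension.
C_ref = [[sum(A[i][kk] * B[kk][j] for kk in range(16)) for j in range(16)]
         for i in range(16)]
```

Here C and C_ref hold the same values, which shows that splitting the k dimension into passes changes only the order in which partial products are accumulated, not the result.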

Tuning parameters

  • Pass fusing/grouping - A group of passes can be fused together to control allocation of registers required for input data (A and B matrices) and memory I/O density during tensor matmul. This is explained in more detail in the Multi-Pass Tensorized MatMul with Pass Fusion tutorial.
  • Scheduling policy - This parameter can be used to tune register usage for accumulator data (C matrix) for multi-block tensor shapes. This is explained in more detail in the Tensor MatMul on GPU: Scheduling Policy experiments tutorial.
  • Prologue/Epilogue Ops - These parameters can be set to perform element-wise ops before and after matmul operations on tensor cores in an optimized way. Examples of this usage are presented in the Tensor MatMul on GPU: Fused Element-wise Operations tutorial.
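As a rough illustration of the pass fusing/grouping trade-off described above, the sketch below models how grouping passes changes the amount of A and B input data that must be held at once. This is a conceptual model only and not how Accera implements fusion: the function names and the num_fused_passes parameter name are assumptions made for this example, and the count is in matrix elements rather than actual registers.

```python
def fused_pass_groups(num_total_passes, num_fused_passes):
    """Split pass indices into consecutive groups of num_fused_passes.

    Assumption of this sketch: if the total is not a multiple of the
    group size, the last group is simply shorter.
    """
    return [
        list(range(start, min(start + num_fused_passes, num_total_passes)))
        for start in range(0, num_total_passes, num_fused_passes)
    ]


def input_elements_per_group(m, n, k, num_fused_passes):
    """Count the A and B elements needed by one group of fused passes.

    One pass reads an {m, k} slice of A and a {k, n} slice of B, so a
    group of f passes reads {m, k * f} of A and {k * f, n} of B.
    """
    fused_k = k * num_fused_passes
    return m * fused_k + fused_k * n


# MMA shape M16xN16xK4_B1 with 4 total passes, for several group sizes.
m, n, k = 16, 16, 4
num_total_passes = 4
for fused in (1, 2, 4):
    groups = fused_pass_groups(num_total_passes, fused)
    resident = input_elements_per_group(m, n, k, fused)
    print(f"fused={fused}: groups={groups}, input elements per group={resident}")
```

In this model, larger groups mean fewer but larger loads of input data, while smaller groups keep less input data resident at a time. The linked tutorial covers the actual behavior and how to measure it.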

Last update: 2023-04-17