Tensor MatMul on GPU: Fused Element-wise Operations

In this tutorial, you will learn how to implement a composite operation that performs Matrix Multiplication (MatMul) on GPU tensor cores together with element-wise operations on the result data, in an optimized way.

Prerequisites

Accera implementation

Oftentimes we want to perform an auxiliary, element-wise operation on the data before or after the main operation. For example, for non-unit alpha (α != 1) in GEMM, we want to scale each element of the matrix multiplication result by a constant value. For these scenarios, we enable the use of prologue and epilogue ops on tensor data for pre- and post-processing of the data held in tensor fragments. Note that when tensorization is not used, this can simply be achieved by fusion, as sketched below.
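For the non-tensorized path, the fusion alternative can be sketched roughly as follows: one nest computes the matmul, a second nest applies the element-wise scaling, and the two schedules are fused over the shared i and j dimensions. This is a minimal sketch assuming the standard acc.fuse API; the sizes and the α value of 5.0 are illustrative:

import accera as acc

M, N, K = 1024, 1024, 1024  # illustrative sizes
alpha = 5.0                 # illustrative non-unit alpha

A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(M, K))
B = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(K, N))
C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(M, N))

# Nest 0: the matrix multiplication
nest0 = acc.Nest(shape=(M, N, K))
i0, j0, k0 = nest0.get_indices()

@nest0.iteration_logic
def _():
    C[i0, j0] += A[i0, k0] * B[k0, j0]

# Nest 1: the element-wise epilogue (alpha scaling)
nest1 = acc.Nest(shape=(M, N))
i1, j1 = nest1.get_indices()

@nest1.iteration_logic
def _():
    C[i1, j1] *= alpha

# Fuse the two schedules over the first two (shared) dimensions, so the
# scaling runs as part of the same fused iteration space as the matmul
schedule0 = nest0.create_schedule()
schedule1 = nest1.create_schedule()
fused = acc.fuse((schedule0, schedule1), partial=2)

# 'fused' can now be transformed, planned, and built like any other Accera schedule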

Prologue Tensor Operation

This operation is applied to each element of the tensor fragment data before the matrix multiplication is done. Here is an example where we use the SET operation to zero-initialize the result fragment before the matmul is performed:

plan.tensorize(indices=tensor_indices, mma_shape=mma_shape, prologue_op=acc.MMAFragmentOp.SET, prologue_arg=0.0)
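For context, the tensorize call above sits inside a larger Accera program. The following is a rough, self-contained sketch of what such a program can look like, modeled on the other Accera GPU tutorials; the problem size, tile sizes, target, and MMA shape are illustrative assumptions rather than the exact values used by the linked script:

import accera as acc

M, N, K = 1024, 1024, 1024  # illustrative problem size

A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(M, K))
B = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(K, N))
C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(M, N))

nest = acc.Nest(shape=(M, N, K))
i, j, k = nest.get_indices()

@nest.iteration_logic
def _():
    C[i, j] += A[i, k] * B[k, j]

schedule = nest.create_schedule()
# Tile for thread blocks, then again for the tensor-core fragments (sizes are illustrative)
ii, jj, kk = schedule.tile({i: 64, j: 64, k: 64})
iii, jjj, kkk = schedule.tile({ii: 16, jj: 16, kk: 16})

target = acc.Target(acc.Target.Model.AMD_MI100)  # any tensor-core-capable target
plan = schedule.create_plan(target)
plan.bind(mapping={
    i: target.GridUnit.BLOCK_Y,
    j: target.GridUnit.BLOCK_X,
    ii: target.GridUnit.THREAD_Y,
    jj: target.GridUnit.THREAD_X,
})

# Tensorize the fragment-level indices and zero-initialize the accumulator fragment
tensor_indices = (iii, jjj, kkk)
mma_shape = acc.MMAShape.M16xN16xK4_B1  # illustrative; must suit the chosen target
plan.tensorize(indices=tensor_indices, mma_shape=mma_shape,
               prologue_op=acc.MMAFragmentOp.SET, prologue_arg=0.0)

package = acc.Package()
package.add(plan, args=(A, B, C), base_name="tensor_zero_init_matmul_gpu")
package.build("tensor_zero_init_matmul_gpu")  # pick format/output_dir to suit your toolchain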

The complete Python script with 0-init of the result data can be found here. The generated source code looks something like this (note the fill_fragment call below, which takes the place of a load_matrix_sync for the result fragment):

extern "C" __global__  __launch_bounds__(256) void tensor_zero_init_matmul_gpu_d6f8bdcfb87f53cc__gpu__(float *arg0, float *arg1, float *arg2) {
    // Calculate threadid offsets and other locals
    /*...*/

    // Main K-loop
    for (int32_t idx14 = 0; idx14 < 512; idx14 += 1) {
        // Declare matrix fragments for A, B and C
        /*...*/

        // Tensor Prologue: Fill result fragment with 0s
        rocwmma::fill_fragment(mmaMatrix_18, float{});

        // Matrix multiplication
        rocwmma::load_matrix_sync<2048>(var7, mmaMatrix_16, arg0 + ...);
        rocwmma::load_matrix_sync<1024>(var7, mmaMatrix_17, arg1 + ...);
        rocwmma::mma_sync<0, 0, 0>(mmaMatrix_18, mmaMatrix_16, mmaMatrix_17, mmaMatrix_18);

        // Store result into global memory
        rocwmma::store_matrix_sync<0, rocwmma::layout_t::mem_row_major, 1024>(var7, arg2 + ..., mmaMatrix_18);
    }
}

Epilogue Tensor Operation

This operation is applied to each element of the tensor fragment data after the matrix multiplication is done. Similar to prologue_op, we can set the epilogue_op argument to achieve this. Here is how alpha scaling of the matmul result can be done:

plan.tensorize(indices=tensor_indices, mma_shape=mma_shape, prologue_op=acc.MMAFragmentOp.SET, prologue_arg=0.0, epilogue_op=acc.MMAFragmentOp.SCALE, epilogue_arg=5.0)

The complete Python script with alpha scaling of the result data can be found here. The generated source code looks something like this:

extern "C" __global__  __launch_bounds__(64) void tensor_alpha_scaling_matmul_gpu_e5c7114024bfca18__gpu__(float *arg0, float *arg1, float *arg2) {
    // Calculate threadid offsets and other locals
    /*...*/

    // Main K-loop
    for (int32_t idx14 = 0; idx14 < 512; idx14 += 1) {
        // Declare matrix fragments for A, B and C
        /*...*/

        // Tensor Prologue: Fill result fragment with 0s
        rocwmma::fill_fragment(mmaMatrix_18, float{});

        // Matrix multiplication
        rocwmma::load_matrix_sync<2048>(var7, mmaMatrix_16, arg0 + ...);
        rocwmma::load_matrix_sync<1024>(var7, mmaMatrix_17, arg1 + ...);
        rocwmma::mma_sync<0, 0, 0>(mmaMatrix_18, mmaMatrix_16, mmaMatrix_17, mmaMatrix_18);

        // Tensor Epilogue: Alpha scaling of matmul result
        {
            float* mmaMatrix_18_data = reinterpret_cast<float*>(&mmaMatrix_18);
            for (int i = 0; i < sizeof(mmaMatrix_18) / sizeof(float); ++i) {
                mmaMatrix_18_data[i] *= float{5.000000e+00};
            }
        }

        // Store result into global memory
        rocwmma::store_matrix_sync<0, rocwmma::layout_t::mem_row_major, 1024>(var7, arg2 + ..., mmaMatrix_18);
    }
}

General Matrix Multiply (GEMM) using Prologue and Epilogue Ops

The general matrix multiplication problem can be formulated as below, where A, B, and C are matrices and α and β are scalar constants:

C = α.(A @ B) + β.C

GEMM with some of the different combinations of α and β can be computed using prologue and epilogue ops along with tensorization, as summarized in the table below:

|        | β == 0 | β == 1 | β != 1 |
|--------|--------|--------|--------|
| α == 1 | C = A @ B: prologue_op = MMAFragmentOp.SET, prologue_arg = 0 | C += A @ B: no need to set prologue and epilogue args | C = A @ B + β.C: prologue_op = MMAFragmentOp.SCALE, prologue_arg = β |
| α != 1 | C = α.(A @ B): prologue_op = MMAFragmentOp.SET, prologue_arg = 0, epilogue_op = MMAFragmentOp.SCALE, epilogue_arg = α | C += α.(A @ B): use fusion | C = α.(A @ B) + β.C: use fusion |
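For instance, the α == 1 row of the table maps onto single tensorize calls, as sketched below; the β value of 0.5 is illustrative, and tensor_indices and mma_shape are assumed to be defined as in the earlier examples:

# β == 1: C += A @ B, accumulate into C with no prologue or epilogue
plan.tensorize(indices=tensor_indices, mma_shape=mma_shape)

# β != 1: C = A @ B + β.C, pre-scale the accumulator fragment by β
# instead of zero-filling it
plan.tensorize(indices=tensor_indices, mma_shape=mma_shape,
               prologue_op=acc.MMAFragmentOp.SCALE, prologue_arg=0.5)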
