# Section 7: Plans - Operations and Optimizations
We can control target-specific operations and optimizations using a plan. Examples include instruction pipelining, applying SIMD vector instructions, and so on.
## `unroll`
By default, each dimension of the iteration space is implemented as a for-loop. The `unroll` instruction marks a dimension for unrolling rather than looping. Imagine the following nest, which multiplies the entries of an array by a constant:
```python
import accera as acc

my_target = acc.Target(type=acc.Target.Category.CPU)

# Assumed declaration of A (not shown in the original snippet): a 3x5 float32 array
A = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(3, 5))

nest = acc.Nest(shape=(3, 5))
i, j = nest.get_indices()

@nest.iteration_logic
def _():
    A[i, j] *= 2.0

plan = nest.create_plan(my_target)
```
If we build `plan` as is, the resulting implementation would be equivalent to the following Python code:
```python
for i in range(3):
    for j in range(5):
        A[i, j] *= 2.0
```
If we add the instruction `plan.unroll(index=j)`, the resulting implementation becomes equivalent to:
```python
for i in range(3):
    A[i, 0] *= 2.0
    A[i, 1] *= 2.0
    A[i, 2] *= 2.0
    A[i, 3] *= 2.0
    A[i, 4] *= 2.0
```
If, instead of unrolling `j`, we add the instruction `plan.unroll(index=i)`, the resulting implementation becomes equivalent to:
```python
for j in range(5):
    A[0, j] *= 2.0
for j in range(5):
    A[1, j] *= 2.0
for j in range(5):
    A[2, j] *= 2.0
```
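`unroll` also composes with splitting a schedule: splitting a dimension and unrolling only the inner split index yields a partially unrolled loop. A minimal sketch, assuming a 3×10 array and a split size of 5 (both illustrative, not part of the example above):

```python
import accera as acc

my_target = acc.Target(type=acc.Target.Category.CPU)
# Illustrative 3x10 float32 array (assumption for this sketch)
A = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(3, 10))

nest = acc.Nest(shape=(3, 10))
i, j = nest.get_indices()

@nest.iteration_logic
def _():
    A[i, j] *= 2.0

schedule = nest.create_schedule()
jj = schedule.split(j, 5)  # j now advances in steps of 5; jj covers the 5 steps in between

plan = schedule.create_plan(my_target)
plan.unroll(index=jj)  # unroll only the inner index
```

The result keeps `i` and `j` as loops, while the body of each `j` iteration contains five explicit statements.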
## `vectorize`
Modern target platforms support SIMD vector instructions. These instructions perform the same operation on an entire vector of elements at once. By default, each dimension of an iteration space becomes a for-loop. The `vectorize` instruction instead labels a dimension for vectorized execution rather than for-looping.
For example, assume that a host supports 256-bit vector instructions. In other words, its vector instructions operate on eight 32-bit floating-point elements at once. Also assume that arrays `A`, `B`, and `C` have already been declared, and that we write the following code:
```python
nest = acc.Nest(shape=(64,))
i = nest.get_indices()

@nest.iteration_logic
def _():
    C[i] = A[i] * B[i]

schedule = nest.create_schedule()
ii = schedule.split(i, 8)  # 8 float32 elements fill one 256-bit vector register
plan = schedule.create_plan()  # the plan must come from the schedule that owns ii
plan.vectorize(index=ii)
```
This results in the following vectorized instructions on an Intel Haswell CPU:

```asm
00000001400010B0: C5 FC 10 0C 11        vmovups     ymm1,ymmword ptr [rcx+rdx]
00000001400010B5: C5 F4 59 0A           vmulps      ymm1,ymm1,ymmword ptr [rdx]
00000001400010B9: C4 C1 7C 11 0C 10     vmovups     ymmword ptr [r8+rdx],ymm1
00000001400010BF: 48 8D 52 20           lea         rdx,[rdx+20h]
00000001400010C3: 48 83 E8 01           sub         rax,1
00000001400010C7: 75 E7                 jne         00000001400010B0
```
Different targets support different vector instructions with different vector sizes. The following table lists iteration logic that vectorizes correctly on most targets with vectorization support, such as Intel Haswell, Broadwell or newer, and ARM v7/A32. Other iteration logic may or may not vectorize correctly. Variables prefixed with `v` are vector types, and those prefixed with `s` are scalar types.

Vector pseudocode | Equivalent to | Supported types |
---|---|---|
`v1 += s0 * v0` | `for i in range(vector_size): v1[i] += s0 * v0[i]` | float32 |
`v1 += v0 * s0` | `for i in range(vector_size): v1[i] += v0[i] * s0` | float32 |
`v1 += v0 / s0` | `for i in range(vector_size): v1[i] += v0[i] / s0` | float32 |
`v1 -= s0 * v0` | `for i in range(vector_size): v1[i] -= s0 * v0[i]` | float32 |
`v1 -= v0 * s0` | `for i in range(vector_size): v1[i] -= v0[i] * s0` | float32 |
`v1 -= v0 / s0` | `for i in range(vector_size): v1[i] -= v0[i] / s0` | float32 |
`v2 += v0 * v1` | `for i in range(vector_size): v2[i] += v0[i] * v1[i]` | float32 |
vector inner (dot) product: `s0 += dot(v0, v1)` | `for i in range(vector_size): s0 += v0[i] * v1[i]` | float32 |
`v2 = v0 + v1` | `for i in range(vector_size): v2[i] = v0[i] + v1[i]` | int8/16/32/64, float32 |
`v2 = v0 - v1` | `for i in range(vector_size): v2[i] = v0[i] - v1[i]` | int8/16/32/64, float32 |
`v2 = v0 * v1` | `for i in range(vector_size): v2[i] = v0[i] * v1[i]` | int8/16/32/64, float32 |
`v2 = v0 / v1` | `for i in range(vector_size): v2[i] = v0[i] / v1[i]` | float32 |
`v1 = abs(v0)` | `for i in range(vector_size): v1[i] = abs(v0[i])` | int8/16/32/64, float32 |
`v2 = (v0 == v1)` | `for i in range(vector_size): v2[i] = 0xF..F if v0[i] == v1[i] else 0` | int8/16/32/64, float32 |
`v2 = (v0 > v1)` | `for i in range(vector_size): v2[i] = 0xF..F if v0[i] > v1[i] else 0` | int8/16/32/64, float32 |
`v2 = (v0 >= v1)` | `for i in range(vector_size): v2[i] = 0xF..F if v0[i] >= v1[i] else 0` | int8/16/32/64, float32 |
`v2 = (v0 < v1)` | `for i in range(vector_size): v2[i] = 0xF..F if v0[i] < v1[i] else 0` | int8/16/32/64, float32 |
`v2 = (v0 <= v1)` | `for i in range(vector_size): v2[i] = 0xF..F if v0[i] <= v1[i] else 0` | int8/16/32/64, float32 |
`v1 = v0 << s0` | `for i in range(vector_size): v1[i] = v0[i] << s0` | int16/32/64 |
`v1 = v0 >> s0` | `for i in range(vector_size): v1[i] = v0[i] >> s0` | int16/32/64 |
`s0 = sum(v0)` | `for i in range(vector_size): s0 += v0[i]` | int8/16/32/64, float32 |
`s0 = max(v0 + v1)` | `for i in range(vector_size): s0 = max(v0[i] + v1[i], s0)` | int8/16/32/64, float32 |
`s0 = max(v0 - v1)` | `for i in range(vector_size): s0 = max(v0[i] - v1[i], s0)` | int8/16/32/64, float32 |
Additionally, Accera can perform vectorized load and store operations between vector registers and memory when the memory locations are contiguous. To vectorize dimension `i`, the number of active elements corresponding to dimension `i` must exactly match the vector instruction width of the target processor. For example, if the target processor has vector instructions that operate on either 4 or 8 floating-point elements at once, then the number of active elements must be either 4 or 8. Additionally, those active elements must occupy adjacent memory locations (they cannot be spread out).
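As an illustration of the reduction row in the table (`s0 += v0[i] * v1[i]`), here is a hedged sketch of a vectorized dot product. The array shapes, the accumulator-as-1-element-array pattern, and the split size are assumptions for this sketch; whether a scalar reduction written this way vectorizes can depend on the target and the Accera version:

```python
import accera as acc

# Illustrative 1-D float32 inputs of length 64 (assumption)
A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(64,))
B = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(64,))
# Scalar accumulator modeled as a 1-element array (assumption)
S = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(1,))

nest = acc.Nest(shape=(64,))
i = nest.get_indices()

@nest.iteration_logic
def _():
    S[0] += A[i] * B[i]  # the s0 += v0[i] * v1[i] pattern from the table

schedule = nest.create_schedule()
ii = schedule.split(i, 8)  # 8 float32 lanes per 256-bit vector register
plan = schedule.create_plan()
plan.vectorize(index=ii)
```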
## `tensorize`
Some hardware also has specialized instructions for performing matrix multiplications. These instructions operate on specific matrix dimensions with specific data types. Tensorization instructions take tiles of the `A`, `B`, and `C` matrices and compute the `C = A * B + C` operation.

The `tensorize` operation takes three indices and a tensorization shape (MMA shape), along with some tuning parameters:
```python
plan.tensorize(indices=(i, j, k), mma_shape=MMAShape.M16xN16xK4_B1)
```
Tensorization is limited: it is only valid on loop structures of the form

```python
for i in range(M):
    for k in range(K):
        for j in range(N):
            C[i, j] += A[i, k] * B[k, j]
```
where the target provides `MxNxK` tensorization hardware support for the `A`, `B`, and `C` element data types. Tensorization support for GPUs in Accera is explained in more detail here.
## Convenience syntax: `kernelize`
The `kernelize` instruction is convenience syntax that does not provide any unique functionality. Specifically, `kernelize` is equivalent to a sequence of `unroll` instructions, followed by an optional `vectorize` instruction.

A typical Accera design pattern is to first break a loop nest into tiles and then apply an optimized kernel to each tile. For example, imagine that the loop nest multiplies two 256×256 matrices and the kernel is a highly optimized procedure for multiplying 4×4 matrices. Accera will introduce different ways to write highly optimized kernels in the future; currently, it only supports automatic kernelization via the `kernelize` instruction. As mentioned above, `kernelize` is shorthand for unrolling and vectorizing, which structure the code in a way that makes it easy for downstream compiler heuristics to automatically generate kernels.
Consider, once again, the matrix multiplication example we discussed previously in Section 2. Assume that we declare the schedule and reorder as follows:
```python
schedule = nest.create_schedule()
schedule.reorder(i, k, j)
```
`i, k, j` are now the last three dimensions in the iteration space, and the resulting implementation becomes equivalent to:
```python
for i in range(M):
    for k in range(S):
        for j in range(N):
            C[i, j] += A[i, k] * B[k, j]
```
The instruction:

```python
plan.kernelize(unroll_indices=(i, k), vectorize_indices=j)
```

is equivalent to the following sequence, applied to the last three dimensions `i, k, j`:

```python
plan.unroll(i)
plan.unroll(k)
plan.vectorize(j)
```
For simplicity, assume that the matrix sizes defined by M, N, and S are 3, 4, and 2, respectively. After applying `kernelize`, the schedule is equivalent to the following Python code:
```python
C[0, 0:4] += A[0, 0] * B[0, 0:4]  # vectorized
C[0, 0:4] += A[0, 1] * B[1, 0:4]  # vectorized
C[1, 0:4] += A[1, 0] * B[0, 0:4]  # vectorized
C[1, 0:4] += A[1, 1] * B[1, 0:4]  # vectorized
C[2, 0:4] += A[2, 0] * B[0, 0:4]  # vectorized
C[2, 0:4] += A[2, 1] * B[1, 0:4]  # vectorized
```
This would result in the following vectorized instructions on an Intel Haswell CPU:
```asm
0000000000000200: C4 C1 78 10 00        vmovups       xmm0,xmmword ptr [r8]
0000000000000205: C4 E2 79 18 09        vbroadcastss  xmm1,dword ptr [rcx]
000000000000020A: C5 F8 10 12           vmovups       xmm2,xmmword ptr [rdx]
000000000000020E: C4 E2 69 A8 C8        vfmadd213ps   xmm1,xmm2,xmm0
0000000000000213: C5 F8 10 5A 10        vmovups       xmm3,xmmword ptr [rdx+10h]
0000000000000218: C4 E2 79 18 61 04     vbroadcastss  xmm4,dword ptr [rcx+4]
000000000000021E: C4 E2 61 A8 E1        vfmadd213ps   xmm4,xmm3,xmm1
0000000000000223: C4 E2 79 18 49 08     vbroadcastss  xmm1,dword ptr [rcx+8]
0000000000000229: C4 E2 69 A8 C8        vfmadd213ps   xmm1,xmm2,xmm0
000000000000022E: C4 E2 79 18 69 0C     vbroadcastss  xmm5,dword ptr [rcx+0Ch]
0000000000000234: C4 E2 61 A8 E9        vfmadd213ps   xmm5,xmm3,xmm1
0000000000000239: C4 E2 79 18 49 10     vbroadcastss  xmm1,dword ptr [rcx+10h]
000000000000023F: C4 E2 69 A8 C8        vfmadd213ps   xmm1,xmm2,xmm0
0000000000000244: C4 E2 79 18 41 14     vbroadcastss  xmm0,dword ptr [rcx+14h]
000000000000024A: C4 E2 61 A8 C1        vfmadd213ps   xmm0,xmm3,xmm1
000000000000024F: C4 C1 58 58 09        vaddps        xmm1,xmm4,xmmword ptr [r9]
0000000000000254: C4 C1 50 58 51 10     vaddps        xmm2,xmm5,xmmword ptr [r9+10h]
000000000000025A: C4 C1 78 58 41 20     vaddps        xmm0,xmm0,xmmword ptr [r9+20h]
```
## `parallelize`
The `parallelize` instruction executes one or more loops in parallel on multiple cores. For example:

```python
xeonPlat = acc.Target("Intel 9221", num_threads=16)
plan = schedule.create_plan(xeonPlat)
plan.parallelize(indices=(i, j, k))
```
Specifying multiple dimensions is equivalent to the `collapse` argument in OpenMP. Therefore, the dimensions must be contiguous in the iteration space dimension order.
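For instance, with the iteration-space order `i, j, k` from the example above, only contiguous runs of dimensions are valid (a hedged illustration):

```python
# Valid: (i, j) is a contiguous run of dimensions in iteration-space order
plan.parallelize(indices=(i, j))

# By contrast, (i, k) skips over j and would not be a valid combination
```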
### Static scheduling policy
The static scheduling strategy is invoked by setting the argument `policy="static"` in the call to `parallelize`. If `n` iterations are parallelized across `c` cores, static scheduling partitions the work into `c` fixed parts, some of size `floor(n/c)` and some of size `ceil(n/c)`, and executes each part on a different core. For example, 10 iterations across 4 cores are partitioned into parts of sizes 3, 3, 2, and 2.
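A hedged usage sketch, reusing the plan and index from the example above:

```python
# Each core receives a fixed, contiguous chunk of the i iterations
plan.parallelize(indices=(i,), policy="static")
```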
### Dynamic scheduling policy
The dynamic scheduling strategy is invoked by setting the argument `policy="dynamic"` in the call to `parallelize`. Dynamic scheduling creates a single work queue that is shared across the cores: whenever a core becomes free, it takes the next piece of work from the queue.
### Specifying thread limit
Setting the argument `max_threads` to a positive integer places an upper bound on the number of threads used to distribute the workload.
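A sketch combining the dynamic policy with a thread limit; the parameter names are as documented above, and the values are illustrative:

```python
# Cores pull work items from a shared queue, using at most 8 threads
plan.parallelize(indices=(i,), policy="dynamic", max_threads=8)
```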
### Not yet implemented: Pinning to specific cores
The `pin` argument will allow the parallel work to be pinned to specific cores.
## `bind`
Some target platforms, such as GPUs, are specifically designed to execute nested loops: they can take an entire grid of work and schedule its execution on multiple cores. On a GPU, this grid is broken up into multiple blocks, where each block contains multiple threads. Block iterators and thread iterators are identified by special variables in the `Target` object. To take advantage of a target platform's ability to execute grids, we must bind dimensions of the iteration space to these special iterator variables. For example:
```python
v100 = acc.Target("Tesla V100")
plan.bind(mapping={
    i: v100.GridUnit.BLOCK_X,
    j: v100.GridUnit.THREAD_X,
    k: v100.GridUnit.THREAD_Y
})
```
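For context, here is a hedged end-to-end sketch of `bind`. The matmul nest follows the earlier sections; the sizes are deliberately small so that the thread extents plausibly fit within a single GPU block, and the `"Tesla V100"` target string is taken from the example above:

```python
import accera as acc

# Small illustrative sizes (assumption): j and k become threads within a block
M, N, S = 128, 16, 16
A = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(M, S))
B = acc.Array(role=acc.Array.Role.INPUT, element_type=acc.ScalarType.float32, shape=(S, N))
C = acc.Array(role=acc.Array.Role.INPUT_OUTPUT, element_type=acc.ScalarType.float32, shape=(M, N))

nest = acc.Nest(shape=(M, N, S))
i, j, k = nest.get_indices()

@nest.iteration_logic
def _():
    C[i, j] += A[i, k] * B[k, j]

schedule = nest.create_schedule()

v100 = acc.Target("Tesla V100")
plan = schedule.create_plan(v100)  # the plan must be created from the GPU target

# Each value of i becomes a block; each (j, k) pair becomes a thread in that block
plan.bind(mapping={
    i: v100.GridUnit.BLOCK_X,
    j: v100.GridUnit.THREAD_X,
    k: v100.GridUnit.THREAD_Y
})
```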