v1.0 2019-10-15

This doc covers the new shader model 6.5 for the Vibranium release (20H1). With the exception of the `WaveMatch` and `WaveMultiPrefix*` intrinsics, the new features are defined in separate documents.

DirectX Raytracing (DXR) Tier 1.1 adds:

`uint GeometryIndex()`: a new intrinsic, added to the existing raytracing shader types with intersection information (intersection, any hit, and closest hit), for retrieving the generated geometry index.

`RayQuery`: a new object, available to every shader stage, that enables inline access to raytracing operations.

Feature support is indicated by the Raytracing Tier.

See the DirectX Raytracing (DXR) Functional spec for details.

Sampler Feedback is an optional feature available in shader model 6.5
that adds two new texture resource types to HLSL:
`FeedbackTexture2D<type>` and `FeedbackTexture2DArray<type>`.
These resource types have a template argument
specifying the format of the feedback map.

See the HLSL section of the Sampler Feedback spec for details.

Two new shader types are added to HLSL for the new mesh shader graphics pipeline.
These are Mesh Shaders (`ms_6_5`) and Amplification Shaders (`as_6_5`).

See the Mesh Shader spec for details.

`WaveMatch()` and a number of `WaveMultiPrefix*()` intrinsics
are available starting with shader model 6.5,
when the optional Wave Intrinsics feature is supported.

All shader stages, except for Raytracing shaders, support wave intrinsics as of shader model 6.5.

Shader model 6 introduced a set of data-parallel wave intrinsics, which implement fundamental computational primitives such as voting, reductions, and prefix operations among the lanes in a wave. These intrinsics are valuable tools for many compute algorithms because they exploit the efficiency of the SIMD execution model of modern GPUs.

Shader model 6.5 adds two new classes of wave intrinsics. They are useful for many data-parallel algorithms involving deduplication of data streams, coalescing of memory operations, and implementing efficient concurrent data structures such as hash tables and maps. In particular, these new intrinsics are important building blocks for texture-space shading, and provide support for index buffer deduplication in the context of the meshlet programming model.

We believe these data parallel primitives will grow in importance as applications embrace more compute-centric algorithms.

`uint4 WaveMatch( <type> val )`

The `WaveMatch()` intrinsic compares the value of the expression in the current lane
to its value in all other active lanes in the current wave and returns a
bitmask representing the set of lanes matching the current lane’s value.

`val` can be any expression which evaluates to any of the currently
supported primitive data types (e.g. float4, uint2, etc.).

The return value is a uint4 representing a 128-bit mask which
identifies lanes in the current wave matching the current lane’s value of `val`.
Bits in the mask corresponding to inactive lanes, or at positions
beyond the current implementation’s wave width, are set to 0.
Bits in the mask corresponding to active lanes
which match the value of `val` in the current lane are set to 1.
The bit in the mask corresponding to the current lane is always set to 1.

The following diagram (see figure 1) demonstrates the action of
`WaveMatch()` assuming an implementation with a wave width of 8 lanes.
Inactive lanes are depicted in gray. Bits in the mask beyond bit
position 7 are guaranteed to be cleared (effective mask width is 8).

Figure 1. The action of WaveMatch() function.

TODO: insert here
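The mask semantics described above can be sketched with a small CPU-side reference model. This is an illustration in Python, not HLSL. The 8-lane wave, the lane values, the activity pattern, and the helper names `to_uint4` and `wave_match` are assumptions chosen for the example.

```python
def to_uint4(bits):
    """Split a 128-bit integer mask into four 32-bit words (x, y, z, w)."""
    return tuple((bits >> (32 * i)) & 0xFFFFFFFF for i in range(4))


def wave_match(values, active):
    """CPU model of WaveMatch(): values[i] is lane i's `val`, active[i] its state.

    Returns one uint4-style mask per lane, or None for inactive lanes,
    which do not execute the intrinsic.
    """
    masks = []
    for lane in range(len(values)):
        if not active[lane]:
            masks.append(None)
            continue
        bits = 0
        for other in range(len(values)):
            # Only active lanes holding an equal value contribute a 1 bit.
            if active[other] and values[other] == values[lane]:
                bits |= 1 << other
        masks.append(to_uint4(bits))
    return masks


# Hypothetical 8-lane wave in which lanes 1 and 6 are inactive.
values = [7, 3, 7, 5, 5, 7, 3, 5]
active = [True, False, True, True, True, True, False, True]
masks = wave_match(values, active)
for lane, mask in enumerate(masks):
    print(lane, "inactive" if mask is None else format(mask[0], "08b"))
```

In this model, lanes 0, 2, and 5 share one mask and lanes 3, 4, and 7 share another. The bits for the inactive lanes 1 and 6 stay 0 in every mask, and the `y`, `z`, and `w` words are 0 because the assumed wave width is 8.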

`WaveMultiPrefix*()` is a set of functions which implement
*multi-prefix* operations among the set of active lanes in the current wave.

A multi-prefix operation comprises a set of prefix operations, executed in parallel within subsets of lanes identified with the provided bitmasks. These bitmasks represent partitioning of the set of active lanes in the current wave into N groups (where N is the number of unique masks across all lanes in the wave). N prefix operations are then performed each within its corresponding group. The groups are assumed to be non-intersecting (that is, a given lane can be a member of one and only one group), and bitmasks in all lanes belonging to the same group are required to be the same.

The operations listed below evaluate multiple prefix operations within groups of threads identified by `mask`. Their parameters and return value are as follows:

*\<type\>* can be any of the currently supported
integer or floating point primitive types.

*\<int_type\>* can be any of the currently supported
integer primitive types.

*val* is the value to perform the prefix operation on.

*mask* is a 128-bit bitmask,
representing the partitioning of the current wave into groups of lanes,
as described above.
Bits in the masks at positions beyond the current implementation’s wave width,
or corresponding to inactive lanes, are ignored (assumed to be 0).
If the masks do not form non-intersecting subsets of lanes,
then the values returned by this intrinsic are undefined.
Bitmasks for all lanes belonging to the same group are required to match,
otherwise the results returned by this intrinsic are undefined.
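The two requirements on `mask` can be checked on the CPU with a short sketch like the one below, written in Python as a reference model rather than HLSL. Masks are represented as plain integers, with bit `j` standing for lane `j`. The function name `masks_form_partition` is hypothetical, and the check that a lane's mask contains the lane itself is an assumption drawn from the statement that each lane belongs to exactly one group.

```python
def masks_form_partition(masks, active):
    """Return True if the per-lane masks describe disjoint groups,
    each shared by all of its member lanes."""
    width = len(masks)
    active_bits = sum(1 << lane for lane in range(width) if active[lane])
    for lane in range(width):
        if not active[lane]:
            continue
        # Bits for inactive lanes are ignored, as the text specifies.
        group = masks[lane] & active_bits
        if not (group >> lane) & 1:
            return False  # assumed: a lane's mask names the group it belongs to
        for other in range(width):
            # Every member of the group must carry the same effective mask.
            if (group >> other) & 1 and (masks[other] & active_bits) != group:
                return False
    return True


all_active = [True, True, True, True]
print(masks_form_partition([0b0101, 0b1010, 0b0101, 0b1010], all_active))
print(masks_form_partition([0b0011, 0b0110, 0b0110, 0b1000], all_active))
```

The first call describes two disjoint groups, lanes {0, 2} and lanes {1, 3}. The second has overlapping masks for lanes 0 and 1, which is the case the text declares undefined.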

Returned *\<type\>* is the same type as the input type for `val`.
The result of the prefix operation is computed with
values from prior lanes in the same group only;
it does not include the value from the current lane.
A postfix value would be computed by
applying the corresponding operator between
the result of the prefix operation
and the value passed in to the prefix operation.

`<type> WaveMultiPrefixSum( <type> val, uint4 mask )`

val0 + val1 + val2 …

`<type> WaveMultiPrefixProduct( <type> val, uint4 mask )`

val0 * val1 * val2 …

`uint WaveMultiPrefixCountBits( bool val, uint4 mask )`

(val0 ? 1 : 0) + (val1 ? 1 : 0) + (val2 ? 1 : 0) …

`<int_type> WaveMultiPrefixBitAnd( <int_type> val, uint4 mask )`

val0 & val1 & val2 …

`<int_type> WaveMultiPrefixBitOr( <int_type> val, uint4 mask )`

val0 | val1 | val2 …

`<int_type> WaveMultiPrefixBitXor( <int_type> val, uint4 mask )`

val0 ^ val1 ^ val2 …

The following diagram demonstrates the action of
`WaveMultiPrefixSum()`, assuming an implementation with a wave width of 8 lanes.
Inactive lanes are depicted in gray.

Figure 2. The action of WaveMultiPrefixSum() function.

TODO: insert here

Note how one of the lanes from the orange subset refers to lane 1, which is inactive. This doesn’t affect the result since bits in the mask corresponding to inactive lanes are ignored.
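As a rough illustration of these semantics, the following Python sketch models a multi-prefix operation on the CPU. It is a reference model under stated assumptions, not HLSL: masks are plain integers with bit `j` standing for lane `j`. The lane values and groups are made up for the example. The result for the first lane of each group is assumed to be the operator's identity (0 for sum and OR, 1 for product). The helper name `wave_multi_prefix` is hypothetical.

```python
import operator


def wave_multi_prefix(op, identity, values, masks, active):
    """CPU model of a WaveMultiPrefix*() operation.

    Returns the exclusive prefix per lane (None for inactive lanes).
    """
    results = []
    for lane in range(len(values)):
        if not active[lane]:
            results.append(None)
            continue
        acc = identity
        for prior in range(lane):  # prior lanes only; the current lane is excluded
            # Mask bits that name inactive lanes are ignored.
            if active[prior] and (masks[lane] >> prior) & 1:
                acc = op(acc, values[prior])
        results.append(acc)
    return results


values = [1, 2, 3, 4, 5, 6, 7, 8]
active = [True, False, True, True, True, True, False, True]
group_a = 0b00100101  # lanes 0, 2, 5
group_b = 0b10011000  # lanes 3, 4, 7
# Lane 2's mask also names inactive lane 1; that bit has no effect.
masks = [group_a, 0, group_a | 0b10, group_b, group_b, group_a, 0, group_b]

prefix_sum = wave_multi_prefix(operator.add, 0, values, masks, active)
prefix_product = wave_multi_prefix(operator.mul, 1, values, masks, active)
prefix_or = wave_multi_prefix(operator.or_, 0, values, masks, active)
# Postfix value: apply the operator once more with the lane's own value.
postfix_sum = [None if p is None else p + v for p, v in zip(prefix_sum, values)]
print(prefix_sum)
print(postfix_sum)
```

Each group is scanned independently. Lane 5, for example, only accumulates the values of lanes 0 and 2, and the stray bit for inactive lane 1 in lane 2's mask does not change lane 2's result.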

The `WaveMatch()` and `WaveMultiPrefix*()` intrinsics are designed to work together.
In particular, the masks returned by the `WaveMatch()` intrinsic can be used directly
as group masks in the `WaveMultiPrefix*()` set of intrinsics.

The following illustrates obtaining an equivalent result
to a `WaveMultiPrefixSum` operation using the
`WaveReadLaneFirst` and `WavePrefixSum` intrinsics.
However, using the new wave intrinsics gives hardware more
optimization opportunities.

```
// Given:
int sum = 0, value = ...;
int expr = ...; // Uniform subsets exist for this expr value
// The following:
uint4 mask = WaveMatch(expr);
sum = WaveMultiPrefixSum(value, mask);
// Is equivalent to writing a loop like this:
while (true) {
    // Lanes whose expr matches the first remaining lane form one group.
    if (WaveReadLaneFirst(expr) == expr) {
        // Exclusive prefix sum over only the lanes that entered this branch.
        sum = WavePrefixSum(value);
        break;
    }
}
```
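To make the equivalence concrete, here is a minimal Python sketch that models both forms on the CPU and compares them. It assumes all lanes are active and treats the loop as peeling off one group per iteration, with an exclusive prefix sum taken over only the lanes that entered the branch. The function names and sample data are hypothetical.

```python
def multi_prefix_sum(values, keys):
    """Direct model: exclusive sum over prior lanes holding the same key."""
    return [sum(values[p] for p in range(lane) if keys[p] == keys[lane])
            for lane in range(len(values))]


def loop_prefix_sum(values, keys):
    """Loop model: each iteration handles the group of the first remaining lane."""
    result = [None] * len(values)
    remaining = list(range(len(values)))  # lanes still executing the loop
    while remaining:
        first_key = keys[remaining[0]]  # value broadcast from the first remaining lane
        group = [lane for lane in remaining if keys[lane] == first_key]
        running = 0
        for lane in group:  # exclusive prefix sum over the lanes in the branch
            result[lane] = running
            running += values[lane]
        # Lanes in the group break out; the others iterate again.
        remaining = [lane for lane in remaining if keys[lane] != first_key]
    return result


values = [1, 2, 3, 4, 5, 6, 7, 8]
keys = [7, 3, 7, 5, 5, 7, 3, 5]
print(multi_prefix_sum(values, keys))
print(loop_prefix_sum(values, keys))
```

The loop needs one iteration per distinct key, whereas the multi-prefix form expresses the whole computation in a single operation.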

The following example demonstrates how to coalesce atomic OR operations to a surface with x,y coordinates computed dynamically per lane.

```
uint key = computeHash(x, y); // compute a key for matching
uint4 groupMask = WaveMatch( key );
// firstbithigh returns -1 when no bit set, otherwise < 32,
// so OR will add lane index offset without changing -1.
int4 highLanes = (int4)(firstbithigh(groupMask) | uint4(0, 0x20, 0x40, 0x60));
// The signed max should be the highest lane index in the group.
uint highLane = (uint)max(max(max(highLanes.x, highLanes.y), highLanes.z), highLanes.w);
bool leader = WaveGetLaneIndex() == highLane;
uint result = WaveMultiPrefixBitOr( myValueToOr, groupMask );
if (leader)
    InterlockedOr( mem[key], result | myValueToOr );
```

First, `WaveMatch()` is used to identify sets of threads
which have the same x,y coordinates
(that is, will update the same location in memory).
Then, a single thread from each set is elected to issue a single atomic operation
to memory on behalf of all lanes in the set.
The `WaveMultiPrefixBitOr()` function is used to apply bitwise-OR
reduction within multiple sets of colliding lanes concurrently,
the results of which are then used in the elected threads to issue atomic operations to memory.
Because a prefix result excludes the current lane’s value, the elected thread is the highest lane in each set, and it ORs its own value into the prefix result before issuing the atomic.
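The coalescing pattern can be traced with a small CPU-side model. This Python sketch is an illustration under stated assumptions, not HLSL: all lanes are active, a dictionary stands in for the destination surface, and the function name `coalesced_or`, the keys, and the values are made up for the example.

```python
def coalesced_or(keys, values, memory):
    """Model of the coalescing pattern: one memory update per distinct key.

    Returns the number of updates issued, standing in for atomic operations.
    """
    updates = 0
    for lane in range(len(keys)):
        # Lanes with the same key form one group (the role of the match mask).
        group = [other for other in range(len(keys)) if keys[other] == keys[lane]]
        leader = lane == max(group)  # highest lane index in the group
        prefix = 0
        for prior in group:
            if prior < lane:
                prefix |= values[prior]  # exclusive multi-prefix OR
        if leader:
            # The prefix excludes the leader's own value, so OR it in here.
            memory[keys[lane]] = memory.get(keys[lane], 0) | prefix | values[lane]
            updates += 1
    return updates


keys = [2, 5, 2, 5, 2, 9, 9, 2]
values = [0x01, 0x10, 0x02, 0x20, 0x04, 0x40, 0x80, 0x08]
memory = {2: 0x100}
updates = coalesced_or(keys, values, memory)
print(updates, {k: hex(v) for k, v in sorted(memory.items())})
```

Eight lanes touch three distinct locations, so the model issues three updates instead of eight.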