This document describes Work Graphs, a new feature for GPU-based work generation built on new Shader Model 6.8 DXIL features.
Full documentation of the Work Graphs feature including DirectX runtime, HLSL, and DXIL documentation is available in the Work Graphs Spec on GitHub. This document subsumes that document specifically as it relates to HLSL and the compiler features.
Current language and runtime limitations make generating work on GPU threads insufficient to meet the needs of some workloads. If existing GPU features (like ExecuteIndirect) can’t sufficiently generate work the application must generate the work from the CPU resulting in unnecessary round-tripping between the GPU and CPU. The Work Graphs feature solves this problem by enabling more robust GPU-based work creation APIs.
The Work Graphs feature allows an application to specify a set of tasks as nodes in a graph representing a more complex workload. Each node has a fixed shader which takes one or more input records as input and can produce one or more output records as output.
Node shaders have one of three launch modes:
Thread launch nodes represent an individual thread of work that processes a
single input record. Thread launch nodes have no visible thread group, do not
require the numthreads
attribute, have no access to groupshared memory, and
cannot use group-scope memory or sync barriers.
Broadcasting launch nodes represent a grid of work operating on a single input record. Each input record to a broadcasting node launches a full dispatch grid. The size of the dispatch grid can either be fixed for the node or specified in the input.
Coalescing launch nodes represent a thread group operating on a shared array of input records. The shader declares the maximum number of records per thread group.
Input records represent inputs to node shaders, and output records represent outputs from node shaders. Records can be singular or arrays. Input records can be read-only or read-write, while output records are read-write but uninitialized when created.
Node shaders are represented as entry functions built into library targets. Node shaders have
similar capabilities and execution semantics to compute shaders. Shader entries
annotated as [shader("node")]
are usable as work graph nodes.
As with all previous shader types, a shader entry function may have only one shader stage annotation.
All node shaders except thread launch nodes must specify the thread group size
using the [numthreads(<x>, <y>, <z>)]
attribute.
Node shaders support existing compute shader function annotations: numthreads
and wavesize
.
They also have new annotations unique to node shaders.
Consistent with HLSL grammar, attribute names are case-insensitive. Any attributes that take string arguments, the string argument values are case-sensitive.
[NodeLaunch("<mode>")]
Valid values for node launch mode are broadcasting
, coalescing
, or thread
.
If the NodeLaunch
attribute is not specified on a node entry the default
launch mode is broadcasting
.
[NodeIsProgramEntry]
Indicates a node that may be invoked as an entry directly by the API. Shaders that can receive input records from both inside and outside the work graph (i.e. from a command list) require this attribute. Graphs can have more than one entry shader that receives inputs from outside the graph. This property is implied if no other work graph nodes target this node, so in that case the attribute is optional and may be omitted.
[NodeID("<name>", <index> = 0)]
Use the provided name to represent this node instead of the function name.
If present, the index parameter specifies the index into the named node array.
If this attribute is not present, the function name is the node name,
and the index is 0
.
[NodeLocalRootArgumentsTableIndex(<index>)]
The specified index indicates the record index into the local root arguments table bound to the work graph. If this attribute is not specified and the shader has a local root signature, the index defaults to an unallocated table location.
[NodeShareInputOf("<name>", <index> = 0 )]
Share the inputs for the node specified by name and optional index. If present, the index parameter specifies the index into the named node array. Both nodes must have identical input records, the same launch mode, and identical dispatch grid size (if fixed).
[NodeDispatchGrid(<x>, <y>, <z>)]
Specifies the size of the dispatch grid. Broadcast launch nodes must specify
either the dispatch grid size or maximum dispatch grid size in source. The x
y
and z
parameters individually cannot exceed (2^16)-1 (65535), and x*y*z
cannot exceed (2^24)-1 (16,777,215).
[NodeMaxDispatchGrid(<x>, <y>, <z>)]
Specifies the maximum dispatch grid size when the dispatch grid size is
specified on the input record using the SV_DispatchGrid
semantic. The x
y
and z
parameters individually cannot exceed (2^16)-1 (65535), and x*y*z
cannot
exceed (2^24)-1 (16,777,215).
[NodeMaxRecursionDepth(<count>)]
Specifies the maximum recursion depth for a node. This attribute is required if one of the outputs of a node is the ID of the node itself.
Specific types for input parameters depend on the node launch mode, but all node shaders support two categories of parameters:
Node input objects come in three categories, one for each launch mode, with read-only and read-write variants:
{RW}ThreadNodeInputRecord<RecordTy>
- for thread launch nodes.{RW}DispatchNodeInputRecord<RecordTy>
- for broadcasting launch nodes.{RW}GroupNodeInputRecords<RecordTy>
- for coalescing launch nodes.The pseudo-HLSL code below describes the basic interface of the
{RW}{Thread|Dispatch|Group}NodeInputRecord{s}<RecordTy>
classes:
namespace detail {
///@brief Common interfaces for Thread and Dispatch node input record types.
template <typename RecordTy, bool IsRW> class NodeInputRecordInterface {
/// @brief Get a copy of the underlying record.
RecordTy Get() const;
/// @brief Get a writable reference to the underlying record.
///
/// Only available for RW object variants. The non-const Get() returns a
/// reference to the underlying data.
std::enable_if_t<IsRW, RecordTy>::type &Get();
};
/// @brief Interface for GroupNodeInputRecords and RWGroupNodeInputRecords
template <typename RecordTy, bool IsRW = false> class GroupNodeInputRecordsBase {
/// @brief Returns the number of records that have been coalesced into the
/// current thread group
/// Returned value is in the range [1..._MaxCount_], where _MaxCount_ is
/// specified by the `[MaxRecords(_MaxCount_)]` attribute applied to the
/// parameter declaration.
uint Count() const;
/// @brief Get a copy of the underlying record at the specified index.
RecordTy Get(uint Index) const;
/// @brief Get a writable reference to the underlying record.
/// @param Index The record index to access.
///
/// Only available for RW object variants. The non-const Get() returns a
/// reference to the underlying data.
std::enable_if_t<IsRW, RecordTy>::type &Get(uint Index);
/// @brief Get a copy of the underlying record at the specified index.
/// @param Index The record index to access.
RecordTy operator[](uint Index) const;
/// @brief Get a writable reference to the underlying record.
/// @param Index The record index to access.
///
/// Only available for non-RW object variants. The non-const operator[]
/// returns a reference to the underlying data.
std::enable_if_t<IsRW, RecordTy>::type &operator[](uint Index);
};
} // namespace detail
template <typename RecordTy>
using ThreadNodeInputRecord = detail::NodeInputRecordInterface<RecordTy, false>;
template <typename RecordTy>
using RWThreadNodeInputRecord =
detail::NodeInputRecordInterface<RecordTy, true>;
template <typename RecordTy>
using DispatchNodeInputRecord =
detail::NodeInputRecordInterface<RecordTy, false>;
template <typename RecordTy>
class RWDispatchNodeInputRecord
: public detail::NodeInputRecordInterface<RecordTy, true> {
/// @brief Allows thread groups to coordinate reading and writing to a shared
/// set of records.
/// @returns Returns `false` for thread groups that are not the last to finish
/// and `true` for the last thread group to call this method.
///
/// This method must be called by all threads in a dispatch or not called at
/// all. The call must be in dispatch grid uniform control flow. The callsite
/// must be uniform across all threads in all thread groups in the dispatch
/// grid. This method may be called at most once per thread.
///
/// Any violation of these requirements is undefined behavior.
///
/// This method returns `false` for thread groups that are not the last to
/// finish and these thread groups are not allowed to read or write to the
/// input. Reading or writing the input after this call on a thread that
/// returned `false` is undefined behavior.
///
/// This method returns `true` for the last thread group to finish. That
/// thread group can continue reading and writing to the input.
bool FinishedCrossGroupSharing();
};
template <typename RecordTy>
using GroupNodeInputRecords = detail::GroupNodeInputRecordsBase<RecordTy, false>;
template <typename RecordTy>
using RWGroupNodeInputRecords = detail::GroupNodeInputRecordsBase<RecordTy, true>;
Coalescing launch nodes also accept the EmptyNodeInput
input object for cases
without record data. The pseudo-HLSL interface for EmptyNodeInput
is:
class EmptyNodeInput {
/// @brief Returns the number of records that have been coalesced into the
/// current thread group.
///
/// Returns 1..._MaxCount_, where _MaxCount_ is specified by the
/// `[MaxRecords(_MaxCount_)]` attribute applied to the parameter declaration.
uint Count() const;
};
Broadcasting and Coalescing Launch shaders support a subset of compute shader system value inputs. These have the same types, meanings, and usages as they do for compute shaders.
system value semantic | supported launch modes | description |
---|---|---|
SV_GroupThreadID |
Broadcasting, Coalescing | Thread ID within group |
SV_GroupIndex |
Broadcasting, Coalescing | Flattened thread index within group |
SV_GroupID |
Broadcasting | Group ID within dispatch |
SV_DispatchThreadID |
Broadcasting | Thread ID within dispatch |
Node output objects either allocate records or increment a counter for empty
records. The following pseudo-HLSL defines the interfaces for the NodeOutput
and EmptyNodeOutput
objects:
template <typename RecordTy> class NodeOutput {
/// @brief Allocate a new ThreadNodeOutputRecords for this thread.
/// @returns A handle that collects a set of thread output records.
/// @param NumRecords The number of records to return for the calling thread.
///
/// ThreadNodeOutputRecords are per-thread output records. Each thread can
/// produce a different number of outputs which are each unique per thread.
///
/// Must be called in thread group uniform control flow. The value of
/// `NumRecords` is not required to be uniform. If the value of `NumRecords`
/// is `0`, the object returned is zero-sized and cannot be indexed on that
/// thread.
ThreadNodeOutputRecords<RecordTy> GetThreadNodeOutputRecords(uint NumRecords);
/// @brief Allocate GroupNodeOutputRecords for this thread group.
/// @returns A handle that collects a set of group node output records.
/// @param NumRecords The number of records to return.
///
/// GroupNodeOutputRecords are per-group output records. The output record set
/// is shared across the thread group and the threads work cooperatively to
/// produce the output records.
///
/// Must be called in thread group uniform control flow. The value of
/// `NumRecords` and `this` must be uniform across the thread group. If the
/// value of `NumRecords` is `0`, the object returned is zero-sized and cannot
/// be indexed.
///
/// This method may not be called from _thread launch_ shaders since they do
/// not have a thread group.
GroupNodeOutputRecords<RecordTy> GetGroupNodeOutputRecords(uint NumRecords);
/// @brief Returns true if the specified output node is in the work graph.
bool IsValid() const;
};
class EmptyNodeOutput {
/// @brief Adds `Count` empty output records to the node output, where this
/// `Count` is specified per-thread. The total number added is the sum of
/// `Count` values for each thread in the group.
///
/// Must be called in thread group uniform control flow.
void ThreadIncrementOutputCount(uint Count);
/// @brief Adds `Count` empty output records to the node output, once for the
/// group, instead of summing the value across threads.
///
/// Must be called in thread group uniform control flow. The value of
/// `Count` and `this` must be uniform across the thread group.
void GroupIncrementOutputCount(uint Count);
/// @brief Identifies if the output node is valid to write to.
/// @returns True if the specified output node is in the work graph, or for
/// recursive nodes if the maximum recursion limit has not been reached.
bool IsValid() const;
};
Array variations of the node output objects also exist exposing subscript
operators to index the individual output. The following pseudo-HLSL defines the
interfaces for the NodeOutputArray
and EmptyNodeOutputArray
objects:
namespace detail {
template <typename NodeOutputTy> class NodeOutputArrayBase {
/// @brief Returns the node output for the specified index.
/// @param Index The record index to access.
NodeOutputTy &operator[](uint Index);
};
} // namespace detail
template <typename RecordTy>
using NodeOutputArray = detail::NodeOutputArrayBase<NodeOutput<RecordTy>>;
using EmptyNodeOutputArray = detail::NodeOutputArrayBase<EmptyNodeOutput>;
Each node output can contain zero or more thread or group node output records which feed into other nodes for processing.
The following pseudo-HLSL defines the interfaces for the
ThreadNodeOutputRecords
and GroupNodeOutputRecords
objects:
namespace detail {
template <typename RecordTy> class NodeOutputRecordsBase {
/// @brief Get a copy of the underlying record.
RecordTy &Get(uint Index);
/// @brief Mark the output node as completed.
///
/// Each thread producing an output must call `OutputComplete` at least once.
/// Calling `OutputComplete()` signals to the runtime that the node output
/// memory is finalized. The behavior of writes to the output after this call
/// is undefined.
///
/// Calls to `OutputComplete` must be in thread group uniform control flow
/// otherwise the behavior is undefined.
void OutputComplete();
};
} // namespace detail
template <typename RecordTy>
using ThreadNodeOutputRecords = detail::NodeOutputRecordsBase<RecordTy>;
template <typename RecordTy>
using GroupNodeOutputRecords = detail::NodeOutputRecordsBase<RecordTy>;
Consistent with HLSL grammar attribute names are case-insensitive. Any attributes that take string arguments, the string argument values are case-sensitive.
[MaxRecords(<count>)]
Applies to node inputs in coalescing launch nodes or outputs for any launch mode.
Required for node inputs for coalescing launch nodes, this attribute restricts the maximum number of records per thread group. Implementations are not required to fill to the specified maximum.
When applied to node outputs, this attribute restricts the maximum number of
records produced to the output. When applied to a NodeOutputArray
, the maximum
applies as the sum of all records across the output array, not per-node.
Node outputs require either the MaxRecords
or MaxRecordsSharedWith
attribute.
[MaxRecordsSharedWith(<parameter>)]
This attribute applies to node outputs. The named parameter must have the
MaxRecords
attribute. This attribute and the MaxRecords
attribute are
mutually exclusive.
The node output that this attribute is applied to shares a maximum record allocation with the named node output parameter.
Node outputs require either the MaxRecords
or MaxRecordsSharedWith
attribute.
[NodeID("<name>", <index> = 0)]
This attribute applies to output nodes and defines the name and index of the output node. If not provided, the default index is 0.
If this attribute is not present on an output, the default node ID for the output is the name of the parameter, and the index is the default index (0).
[AllowSparseNodes]
This attribute applies to outputs and allows the work graph to be created even
if there is not a node defined for the specified output. If the output is an
array, each element may or may not have a downstream node defined in the graph.
IsValid()
can be used to determine whether an output node is defined in the
graph.
[NodeArraySize(<count>)]
Specifies the output array size for NodeOutputArray
or EmptyNodeOutputArray
objects.
/// @brief Returns the number of recursion levels remaining against the declared
/// `NodeMaxRecursionDepth`.
///
/// Returns 0 for leaf nodes and if the current node is not recursive.
uint GetRemainingRecursionLevels();
For nodes that recurse, the GetRemainingRecursionLevels()
function returns the
number of remaining recursion levels before reaching the node’s maximum
recursion depth.
enum MEMORY_TYPE_FLAG {
UAV_MEMORY = 0x00000001,
GROUP_SHARED_MEMORY = 0x00000002,
NODE_INPUT_MEMORY = 0x00000004,
NODE_OUTPUT_MEMORY = 0x00000008,
ALL_MEMORY = 0x0000000f,
};
enum SEMANTIC_FLAG {
GROUP_SYNC = 0x00000001,
GROUP_SCOPE = 0x00000002,
DEVICE_SCOPE = 0x00000004,
};
/// @brief Request a barrier for a set of memory types and/or thread group
/// execution sync.
/// @param MemoryTypeFlags Flag bits as defined by MEMORY_TYPE_FLAG. Specifying
/// ALL_MEMORY means all valid memory types given the context.
/// @param SemanticFlags Flag bits as defined by SEMANTIC_FLAG.
///
/// `Barrier` must be called from thread group uniform control flow when
/// `SemanticFlags` includes `GROUP_SYNC`.
void Barrier(uint MemoryTypeFlags, uint SemanticFlags);
/// @brief Request a barrier for just the memory used by an object.
/// @param TargetObject The object or resource which owns the memory to apply
/// the barrier to.
/// @param SemanticFlags Flag bits as defined by SEMANTIC_FLAG.
///
/// The TargetObject parameter can be a particular node input/output record
/// object or UAV resource. Groupshared variables are not currently supported.
///
/// `Barrier` must be called from thread group uniform control flow when
/// `SemanticFlags` includes `GROUP_SYNC`.
void Barrier(Object TargetObject, uint SemanticFlags);
The Work Graphs feature introduces a new more flexible implementation of the memory barrier functions. This function is available in all shader types (including non-node shaders).
The new Barrier
function implements a superset of the existing memory barrier
functions which are still supported (i.e. AllMemoryBarrier{WithGroupSync}()
,
GroupMemoryBarrier{WithGroupSync}()
, DeviceMemoryBarrier{WithGroupSync}()
).
In the context of a node shader, Barrier
enables requesting a memory barrier
on input and/or output record memory specifically, while the implementation is
free to store the data in any memory region.
When specifying ALL_MEMORY
for MemoryTypeFlags
, the compiler will limit
effective flags to the ones available given the context. Otherwise, explicitly
using a memory flag for a memory type that is unavailable given the context
will result in an error.
The pseudo-code below shows implementing the existing HLSL memory barrier
functions using the new Barrier
function.
void AllMemoryBarrier() { Barrier(ALL_MEMORY, DEVICE_SCOPE); }
void AllMemoryBarrierWithGroupSync() {
Barrier(ALL_MEMORY, DEVICE_SCOPE | GROUP_SYNC);
}
void DeviceMemoryBarrier() {
Barrier(UAV_MEMORY, DEVICE_SCOPE);
}
void DeviceMemoryBarrierWithGroupSync() {
Barrier(UAV_MEMORY, DEVICE_SCOPE | GROUP_SYNC);
}
void GroupMemoryBarrier() { Barrier(GROUP_SHARED_MEMORY, GROUP_SCOPE); }
void GroupMemoryBarrierWithGroupSync() {
Barrier(GROUP_SHARED_MEMORY, GROUP_SCOPE | GROUP_SYNC);
}
Note: The new barrier is only available on shader model 6.8 and above. Although there are some cases where there is an equivalent DXIL operation for prior shader models, there is not a complete mapping from one to the other, which would only make it partially available. This design allows for a consistent set of rules that can easily be validated early on in compilation.
Consistent with HLSL grammar attribute names are case-insensitive. Any attributes that take string arguments, the string argument values are case-sensitive.
[NodeTrackRWInputSharing]
If a RWDispatchNodeInputRecord<T>
is used for cross-group sharing and calls
FinishedCrossGroupSharing
, the struct type T
must have the
[NodeTrackRWInputSharing]
attribute applied to it. This allocates memory in
the record allocation to track thread completion.
uint/uint2/uint3/uint16_t/uint16_t2/uint16_t3 SV_DispatchGrid
SV_DispatchGrid
can optionally appear anywhere in a record.
If the record arrives at a broadcasting launch node that doesn’t declare a fixed dispatch grid size via [NodeDispatchGrid(x,y,z)]
, SV_DispatchGrid
becomes the dynamic grid size used to launch at the node. The value has no special significance in other contexts.
This spec is an extensive collaboration between the Microsoft HLSL and Direct3D teams and IHV partners.
Special thanks to Claire Andrews, Amar Patel, and Tex Riddell