Version: v0.11 - May 4, 2026
Today, work in DirectX is primarily conveyed through a low-level intermediate language (IL) known as shader bytecode. The instructions in shader bytecode are closer to an assembly level of abstraction (load, store, add, multiply, etc.), and compiled shaders are associated with an execution model that maps to a specific graphics pipeline stage (vertex, pixel, compute, etc.). This places shader ILs at a level just high enough for drivers to compile down to machine code, but too low to preserve details that are useful for domain-specific optimizations (e.g., fusing neural network layers in ML inference scenarios). This spec introduces extensions to the D3D12 API and DDI to enable compute work at a higher level than shader bytecode without resorting to metacommands.
This doc is organized into the following major sections:
The manner in which clients submit work to DirectX devices tracks the evolution of graphics hardware: fixed-function units in GPUs were progressively exposed as programmable stages, with the logic for each stage encapsulated in a shader program. Shaders have served as the fundamental unit of work on GPUs for nearly 25 years and build on the SIMT execution model, with waves of threads being organized and scheduled either implicitly (graphics shaders) or explicitly (compute shaders). The shader programming model has worked reasonably well for raster graphics, real-time ray tracing, and scoped general purpose compute interleaved with graphics tasks; however, the suitability of this model for the ML domain (specifically inference of deep neural networks, or DNNs) is questionable.
DNNs are computational graphs composed of semi-standardized math operations, or layers, like convolution and generalized matrix multiplication (GEMM). Developers using low-level graphics platforms to implement these graphs (for fine-grained hardware scheduling, memory allocation, interoperability with graphics pipeline, novel layers or network architectures, etc.) must take the route of writing shaders (kernels in ML parlance) for individual network layers; while intuitive, this path leads to a suboptimal platform experience as a result of:
1) Difficulty optimizing individual layers. Writing an efficient GEMM kernel is notoriously difficult, and GPU experts can spend months tuning algorithms. There are numerous popular and active libraries dedicated exclusively to optimizing GEMM for specific GPU architectures, and this challenge is amplified when attempting to write an abstraction that guarantees an efficient implementation across GPUs. Peak performance requires architecture-specific ML-oriented hardware that is difficult to incorporate into a generalized yet low-level shading language like HLSL without compromising on performance or flexibility. A kernel that can’t leverage ML-oriented hardware might result in 10-20% of peak throughput.
2) Difficulty optimizing across layers. Layer-by-layer execution, even with architecture-optimized kernels for each layer, will typically result in memory-bound performance for end-to-end network inference. Newer hardware is tailored to matrix multiplication that can only be fully utilized with a staggeringly large compute-to-memory-access ratio (on the order of hundreds of floating-point ops per byte of DRAM access), so aggressively fusing layers to avoid DRAM memory access is an essential optimization. This is deceptively hard in practice. Ignoring cross-layer optimizations might result in single-digit percentages of peak throughput.
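The compute-to-memory-access argument above can be made concrete with a roofline-style back-of-the-envelope calculation. The sketch below is illustrative only (the functions and numbers are not part of any D3D12 API); it estimates the arithmetic intensity of a GEMM under the simplifying assumption that each matrix touches DRAM exactly once:

```python
# Illustrative roofline-style estimate: arithmetic intensity (FLOPs per byte of
# DRAM traffic) of a GEMM versus the intensity needed to stay compute-bound.

def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: int = 2) -> float:
    """FLOPs per DRAM byte for C[m,n] = A[m,k] @ B[k,n], assuming each matrix
    is read or written from DRAM exactly once (fp16 by default)."""
    flops = 2 * m * n * k                               # one multiply + one add per MAC
    dram_bytes = (m * k + k * n + m * n) * bytes_per_element
    return flops / dram_bytes

def machine_balance(peak_flops: float, dram_bandwidth: float) -> float:
    """Minimum FLOPs/byte required for a kernel to be compute-bound."""
    return peak_flops / dram_bandwidth

# A large square GEMM comfortably exceeds hundreds of FLOPs/byte...
large = gemm_arithmetic_intensity(4096, 4096, 4096)
# ...but a skinny, unfused GEMM (n=1) lands near 1 FLOP/byte: memory-bound.
skinny = gemm_arithmetic_intensity(4096, 1, 4096)
print(f"large GEMM: {large:.0f} FLOPs/byte, skinny GEMM: {skinny:.2f} FLOPs/byte")
```

Fusing layers keeps intermediates out of DRAM entirely, which is why it moves the achievable intensity far more than tuning any single kernel can.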
3) Scope of programmability. The flexibility of programmable shader stages is an invaluable perk when using hardware for creative rendering techniques; however, it can be a burden when developers simply want highly optimized implementations of specific, well-defined math operations. Creativity in the ML domain typically manifests in training and the graph topology, not the implementation of individual graph nodes.
These challenges are reflected in the design of DirectML, which introduced a graph-level API to address them. However, DirectML was built on a mixture of hand-authored HLSL compute shaders and metacommands (opaque implementations of critical and common ML layers). Even with programmatic access to hardware MMA (historically a gap in shaders that is being addressed with Linear Algebra Matrix), maintaining a library of hand-authored compute shaders across several vendors’ GPU architectures is not scalable. The need to scale optimal kernels across drivers and GPUs is what led to metacommands, but metacommands simply introduced a new set of scaling challenges: new layers, fusions, tiling strategies, quantization schemes, data types, tensor layouts, and more. Finally, mixing shaders and metacommands is problematic, since transitions between the two often introduce expensive conversions and cannot be fused.
There is, of course, tremendous value in evolving HLSL to expose ML-oriented hardware, and low-level programmability is essential for writing novel layers or mixing MMA within a traditional shader stage. However, the complexity of implementing a single deep neural network for inference is analogous to the complexity of implementing an entire rendering pipeline; shaders are simply not designed for this scale of work. The only existing alternative, metacommands, is really intended for fixed algorithms (see Appendix: Metacommand Limitations), yet the number of useful metacommand patterns in ML is effectively infinite; metacommands are not designed to be adaptive or to accept arbitrary combinations of layers. Taken to the limit, it almost seems necessary to convey an entire network to a driver as a single metacommand, but this approach tends to leak hardware details upward and inhibit higher-level graph transformations – potentially resulting in a unique graph for each device! There is no room for consolidating common optimizations, granting developers fine-grained control of scheduling and allocation, enabling unified tooling, or interoperating with shaders when a driver doesn’t implement something.
The problem statement is not necessarily unique to neural networks or ML inference, and it may apply to other domains in the future, but the remainder of this doc uses ML to frame a new pathway for conveying compute work to the D3D runtime and drivers. We envision two paths for conveying ML-focused compute work to a driver:
The existing shader program path for handwritten implementations of shallow, fully-connected networks (MLPs) inlined into existing shader stages. MLPs rely solely on matrix multiplication and trivial elementwise layers, are easy to manually fuse, and do not require sophisticated cross-layer optimizations; they are of a realistic scope to implement by hand. This path will expand through new low-level abstractions such as Linear Algebra Matrix.
A new MLIR program path for compiler-generated implementations of deeper networks composed from a richer set of layer types (U-Nets, CNNs, transformers, etc.) that require sophisticated cross-layer optimizations. Unlike DirectML, this path will not resemble a kernel library for layers that sits on top of D3D12’s shader and metacommand features; instead, compiled networks will be expressed at a coarser granularity using a new MLIR-based IR that is a first-class currency in the D3D12 API and DDI.
It’s important to note that MLIR is not specific to machine learning or even any concrete IR; the acronym stands for multi-level IR (not to be confused with machine learning IR), and it is simply infrastructure for expressing and translating domain-specific IRs. This doc is focused on the changes in D3D12 to support the DirectX Compute Graph Compiler (DXCGC), but the APIs introduced to D3D in this doc are generic to MLIR. Any work expressed with MLIR is fully compiled by a driver when converted into an executable MLIR program contained within a D3D12 state object. This process is intentionally similar to shader compile and execute patterns, though there are important differences that will be covered in the following sections.
Before delving into the D3D12 interfaces that support MLIR program compilation and execution, it is helpful to have a high-level understanding of the composition of a deep neural network, how it is represented in MLIR with DXCGC, and how it differs from shader-based work. This section is not a complete deep dive into the compiler and all of its nuances (versioning, lowering, details of declarative patterns, etc.); it is only intended to orient end users on how the compiler fits into the D3D12 APIs and DDIs.
We start with a high-level diagram illustrating the various components (white boxes) and MLIR-based intermediate representations (yellow boxes). DXCGC is a new compiler that accepts dataflow graphs as input (“CGC Input IR”) instead of HLSL. DXCGC produces an optimized MLIR bytecode output (“CGC Output IR”) as a result of high-level optimization and lowering, but it leaves the final hardware codegen as a step for the driver.
graph LR
input_ir["CGC Input IR"]:::mlir
lowering_irs["CGC Lowering IRs"]:::mlir
output_ir["CGC Output IR"]:::mlir
compiler["DXCGC"]:::component
partition_ir["CGC Partition IR"]:::mlir
interpreter["DXCGC Runtime"]:::component
runtime["D3D12 Runtime"]:::component
driver["IHV Driver"]:::component
input_ir --> compiler --> output_ir
compiler <--> lowering_irs <--> runtime <--> driver
output_ir --> interpreter --> partition_ir --> runtime
classDef mlir fill:#ffff55, color:black;
classDef component fill:#ffffff, color:black;
There are four types of MLIR-based IRs to focus on in this document: CGC Input IR, CGC Lowering IRs, CGC Output IR, and CGC Partition IR.
Each IR is represented using a mixture of ops/types/attributes defined in a new set of CGC MLIR dialects (cgc.*, cgc_op.*, etc.). For example, the convolution layer maps to the convolution operation in the cgc_op dialect. The cgc_op dialect is a high-level dialect that contains abstract representations of ML layers, similar to ONNX, and the cgc dialect contains foundational elements like data types and memory views.
The next sections focus on the first three IRs at a high level. A contrived 7-layer network topology, much simpler than most real-world models, is used for illustration, and the text representation of MLIR is slightly simplified for the sake of brevity.
A simple network is visualized below with network layers in blue and resource bindings in green/yellow. This is called a dataflow graph, and it represents the semantic intent of the overall computation rather than the explicit implementation.
graph LR
in[<b>argument</b><br/>'in']:::io;
conv1["<b>cgc_op.convolution</b><br/>'conv1'"]:::layer;
conv1.w["<b>cgc.constant</b><br/>'w1'"]:::const;
conv1.b["<b>cgc.constant</b><br/>'b1'"]:::const;
relu1["<b>cgc_op.relu</b><br/>'relu1'"]:::layer;
add1["<b>cgc_op.add</b><br/>'add1'"]:::layer;
conv2["<b>cgc_op.convolution</b><br/>'conv2'"]:::layer;
conv2.w["<b>cgc.constant</b><br/>'w2'"]:::const;
conv2.b["<b>cgc.constant</b><br/>'b2'"]:::const;
relu2["<b>cgc_op.relu</b><br/>'relu2'"]:::layer;
add2["<b>cgc_op.add</b><br/>'add2'"]:::layer;
upsample["<b>cgc_op.upsample</b><br/>'upsample'"]:::layer;
out[<b>Return Value</b>]:::io;
conv1.w --> conv1;
conv1.b --> conv1;
conv2.w --> conv2;
conv2.b --> conv2;
in --> conv1 --> relu1 --> add1 --> conv2 --> relu2 --> add2 --> upsample --> out;
in --> add1;
add1 --> add2;
classDef layer fill:#66bbff,color:black;
classDef const fill:#bbff66,color:black;
classDef io fill:#ffff66,color:black;
The example network in human-readable textual form is shown below:
cgc.module
{
cgc.entry_point @main(%in: !cgc.tensor<1x1x4x4x!cgc.float32>) -> !cgc.tensor<1x1x8x8x!cgc.float32>
{
// SSA variable references to externally bound named constants as opposed to
// positional arguments of the main function.
%w1 = cgc.constant("w1")
%b2 = cgc.constant("b2")
%w2 = cgc.constant("w2")
%b1 = cgc.constant("b1")
%conv1 = cgc_op.convolution(%in, %w1, %b1) { start_padding=[1,1], end_padding=[1,1], ... }
%relu1 = cgc_op.relu(%conv1)
%add1 = cgc_op.add(%relu1, %in)
%conv2 = cgc_op.convolution(%add1, %w2, %b2) { start_padding=[1,1], end_padding=[1,1], ... }
%relu2 = cgc_op.relu(%conv2)
%add2 = cgc_op.add(%relu2, %add1)
%upsample = cgc_op.upsample(%add2)
cgc.return %upsample
}
}
The basic text format of the MLIR operations presented above is <dialect>.<operation>(<arguments>) {<attributes>} : (<argument types>) -> <return types>. However, the printed MLIR presented in this doc is slightly simplified and omits details on source location, blocks, and regions associated with every MLIR operation. It also hides many types and attributes for brevity.
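Because the textual form above is in SSA form, each operation's operands fully determine a dependency graph, and any topological order of that graph is a valid execution order. A minimal sketch (not DXCGC code) using the example network's def-use edges:

```python
# Recover a valid layer execution order from the SSA def-use edges of the
# example network, using the standard-library topological sorter.
from graphlib import TopologicalSorter

# Each op maps to the values it consumes (from the cgc.entry_point above).
deps = {
    "conv1":    ["in", "w1", "b1"],
    "relu1":    ["conv1"],
    "add1":     ["relu1", "in"],
    "conv2":    ["add1", "w2", "b2"],
    "relu2":    ["conv2"],
    "add2":     ["relu2", "add1"],
    "upsample": ["add2"],
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # every op appears after all of its inputs
```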
The job of DXCGC is to translate the abstract operations from CGC Input IR (convolution, matrix multiply, add, etc.) into a graph of concrete operations (codegen’able implementations) that can be executed on a set of targets. This process is called lowering, and the resulting MLIR is referred to as CGC Output IR. Examples of targets include a hardware driver (the driver target) and a shader-based fallback implementation (the fallback target).
The initial version of DXCGC relies on lowering to target-specific connected subgraphs, which conceptually represent fusions. The full lowering process involves additional graph transformations, and future versions of DXCGC will introduce concrete operations beyond subgraphs, but for the purpose of this doc we will focus exclusively on subgraphs.
To extend our earlier example, let’s say a hardware driver supports fusing conv -> relu -> add as a subgraph. Let’s also say the upsample layer isn’t explicitly implemented by the driver, so it ends up getting lowered into a fallback implementation (e.g., shader implementation). The lowered output would resemble the following:
graph LR
in[<b>argument</b><br/>'in']:::io;
conv1["<b>cgc_op.convolution</b><br/>'conv1'"]:::layer;
conv1.w["<b>cgc.constant</b><br/>'w1'"]:::const;
conv1.b["<b>cgc.constant</b><br/>'b1'"]:::const;
relu1["<b>cgc_op.relu</b><br/>'relu1'"]:::layer;
add1["<b>cgc_op.add</b><br/>'add1'"]:::layer;
conv2["<b>cgc_op.convolution</b><br/>'conv2'"]:::layer;
conv2.w["<b>cgc.constant</b><br/>'w2'"]:::const;
conv2.b["<b>cgc.constant</b><br/>'b2'"]:::const;
relu2["<b>cgc_op.relu</b><br/>'relu2'"]:::layer;
add2["<b>cgc_op.add</b><br/>'add2'"]:::layer;
upsample["<b>cgc_op.upsample</b><br/>'upsample'"]:::layer;
out[<b>Return Value</b>]:::io;
subgraph Driver-Target Subgraph
conv1
relu1
add1
end
subgraph Driver-Target Subgraph
conv2
relu2
add2
end
subgraph Fallback-Target Subgraph
upsample
end
conv1.w --> conv1;
conv1.b --> conv1;
conv2.w --> conv2;
conv2.b --> conv2;
in --> conv1 --> relu1 --> add1 --> conv2 --> relu2 --> add2 --> upsample --> out;
in --> add1;
add1 --> add2;
classDef layer fill:#66bbff,color:black;
classDef const fill:#bbff66,color:black;
classDef io fill:#ffff66,color:black;
DXCGC incorporates subgraphs using a declarative approach: targets tell DXCGC the types of subgraph patterns they support (independent of any specific graph), DXCGC applies these patterns while transforming the Input IR, and each target is then responsible for providing implementations for matched subgraphs at runtime. Target subgraph patterns are not limited to the fixed DAGs of operations shown in this contrived example. For example, a dynamic subgraph pattern could be “convolution followed by an arbitrary sequence of elementwise operations” (so-called epilogue fusion). Furthermore, subgraph patterns can declare constraints on data types, tensor shapes, memory layout, alignment, attribute values, and more. The details of how declarative subgraph patterns work are out of scope for this doc (falling under the “lowering IRs” mentioned earlier); we will only look at the effect of subgraphs on the resulting Output IR.
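To make the epilogue-fusion idea concrete, here is a hypothetical matcher for the dynamic pattern "convolution followed by an arbitrary sequence of elementwise operations". The op names and the matcher itself are illustrative and do not reflect the actual DXCGC declarative-pattern mechanism:

```python
# Hypothetical epilogue-fusion matcher over a linear sequence of op names.
ELEMENTWISE = {"relu", "add", "mul", "clamp"}

def match_conv_epilogue(ops: list[str]) -> int:
    """Return how many leading ops the pattern absorbs (conv plus any run of
    elementwise ops), or 0 if the sequence does not start with a convolution."""
    if not ops or ops[0] != "convolution":
        return 0
    matched = 1
    while matched < len(ops) and ops[matched] in ELEMENTWISE:
        matched += 1
    return matched

# The driver-target fusion from the running example: conv -> relu -> add,
# stopping at upsample (not elementwise).
print(match_conv_epilogue(["convolution", "relu", "add", "upsample"]))  # 3
```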
The Output IR in MLIR text form reveals both the driver and fallback subgraphs as well as how they are connected. Subgraphs appear as externally implemented functions that are associated with a concrete DAG of input IR ops and tensors (the “functional definition”). We also see an “origin” attribute on each subgraph that makes clear which target is responsible for implementing it. Subgraph invocations are organized into partitions, another function-like concept that serves as a bridge between abstract tensors and concrete memory references – we’ll cover this in more detail in the next section. Finally, partition invocations are organized into a program that serves as the main entrypoint for end-to-end execution.
cgc.module
{
// ---------------------------------------------------------------------
// Portion implemented by driver target
// ---------------------------------------------------------------------
cgc.subgraph @convReluAdd attributes {origin = "driverTarget", ...}
{
cgc.functional_definition private @convReluAddFunctional(
%in: !cgc.tensor<1x1x4x4x!cgc.float32>,
%w: !cgc.tensor<constant 1x1x3x3x!cgc.float32>,
%b: !cgc.tensor<constant 1x!cgc.float32>) -> !cgc.tensor<1x1x4x4x!cgc.float32>
{
%0 = cgc_op.convolution (%in, %w, %b) { start_padding=[1,1], end_padding=[1,1], ... }
%1 = cgc_op.relu (%0)
%2 = cgc_op.add (%1, %in)
cgc.return %2
}
}
cgc.partition @driverPartition(
%in: !cgc.memref<1x1x4x4x!cgc.float32>,
%w1: !cgc.memref<constant 1x1x3x3x!cgc.float32>,
%b1: !cgc.memref<constant 1x!cgc.float32>,
%w2: !cgc.memref<constant 1x1x3x3x!cgc.float32>,
%b2: !cgc.memref<constant 1x!cgc.float32>,
%output: !cgc.memref<1x1x4x4x!cgc.float32>,
%scratch: !cgc.memref<1x1x4x4x!cgc.float32>)
{
cgc.invoke @convReluAdd(%in, %w1, %b1, %scratch)
cgc.invoke @convReluAdd(%scratch, %w2, %b2, %output)
cgc.return
}
// ---------------------------------------------------------------------
// Portion implemented by fallback target
// ---------------------------------------------------------------------
cgc.subgraph @upsample attributes {origin="fallbackTarget", ...}
{
cgc.functional_definition private @upsampleFunctional(
%in: !cgc.tensor<1x1x4x4x!cgc.float32>) -> !cgc.tensor<1x1x8x8x!cgc.float32>
{
%0 = cgc_op.upsample2d (%in) { scale_size=[2,2], ... }
cgc.return %0
}
}
cgc.partition @fallbackPartition(
%in: !cgc.memref<1x1x4x4x!cgc.float32>,
%out: !cgc.memref<1x1x8x8x!cgc.float32>)
{
cgc.invoke @upsample(%in, %out)
cgc.return
}
// ---------------------------------------------------------------------
// Orchestration of end-to-end execution managed by runtime
// ---------------------------------------------------------------------
cgc.program @main(
%in: !cgc.memref<1x1x4x4x!cgc.float32>,
%out: !cgc.memref<1x1x8x8x!cgc.float32>,
%scratch: !cgc.memref<128x!cgc.int8>)
{
%w1 = cgc.constant("w1")
%b2 = cgc.constant("b2")
%w2 = cgc.constant("w2")
%b1 = cgc.constant("b1")
%0 = cgc.view_memref %scratch[0] : !cgc.memref<128x!cgc.int8> to !cgc.memref<1x1x4x4x!cgc.float32>
%1 = cgc.view_memref %scratch[64] : !cgc.memref<128x!cgc.int8> to !cgc.memref<1x1x4x4x!cgc.float32>
cgc.invoke @driverPartition(%in, %w1, %b1, %w2, %b2, %0, %1)
cgc.invoke @fallbackPartition(%0, %out)
cgc.return
}
}
There are many details and concepts that are hidden here for simplicity, but the important takeaway is that Output IR is simply a DAG of partition calls reachable through a cgc.program, and each partition is a DAG of subgraph calls. The hierarchy is:
cgc.module (1) : the entire output IR
cgc.program (1) : the entrypoint for end-to-end execution at runtime
cgc.partition (1+) : collection of subgraphs to compile/execute as a unit
cgc.subgraph (1+) : logical fusion of abstract ops from input IR

Partitions bridge the gap between abstract data views (tensors) and concrete data views (memrefs) as well as providing more control over execution scheduling. In this example the output IR comprises two partitions that must be independently executed with different resource bindings (solid arrows). This is visualized below, with the body of the cgc.program acting as a sequence of partition invocations. There is a dependency (dotted arrow) between the two partitions because of shared use of scratch memory.
graph
subgraph Program Arguments
in([%in]):::buffer;
scratch([%scratch]):::buffer;
out([%out]):::buffer;
end
subgraph Program Constants
w1([%w1]):::constant;
b1([%b1]):::constant;
w2([%w2]):::constant;
b2([%b2]):::constant;
end
subgraph driverPartition
cra1["cgc.subgraph (conv+relu+add)"]:::op;
cra2["cgc.subgraph (conv+relu+add)"]:::op;
end
subgraph fallbackPartition
upsample["cgc.subgraph (upsample)"]:::op;
end
in --> driverPartition;
w1 --> driverPartition;
b1 --> driverPartition;
w2 --> driverPartition;
b2 --> driverPartition;
scratch --> fallbackPartition;
scratch --> driverPartition;
out --> fallbackPartition;
driverPartition -.-> fallbackPartition;
classDef op fill:#66bbff,color:black;
classDef buffer fill:#ffff66,color:black;
classDef intermediate fill:#66ffff,color:black;
classDef constant fill:#bbff66,color:black;
classDef memview fill:#ffffff,color:black;
A few important things to note on the output IR:

Intermediate results are aggregated into a scratch buffer (e.g., %scratch) that is bound as an input to the network. More complicated graphs may have multiple scratch allocations, and it’s even possible that non-intermediate inputs are part of the scratch allocation if they aren’t needed after inference.

One critical observation is that DXCGC compilation as shown here depends on subgraph declarations from drivers at runtime, which implies graph compilation is a JIT-only process (as was the case for shader bytecode to machine code for many years). However, the intention is to hoist subgraph patterns from drivers into offline compiler toolchains (see MLIR program precompilation). The preview version of DXCGC does not support AOT compilation, but we aim to support this in the near future.
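The scratch aggregation described above can be sketched as an offline, liveness-based offset assignment; since all sizes and lifetimes are known at compile time, no allocation ever happens at runtime. The planner below is a simplified illustration (sizes and the greedy policy are made up, not DXCGC's actual algorithm):

```python
# Greedy first-fit scratch planner: assign each intermediate tensor an offset,
# reusing a region once the tensor previously occupying it is dead.
def plan_scratch(tensors):
    """tensors: list of (name, size_bytes, first_use_step, last_use_step)."""
    offsets, regions = {}, []   # regions: list of (offset, size, last_use)
    total = 0
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        for i, (off, rsize, rlast) in enumerate(regions):
            if rlast < first and size <= rsize:   # region free and big enough
                offsets[name] = off
                regions[i] = (off, rsize, last)
                break
        else:                                     # no reusable region: grow
            offsets[name] = total
            regions.append((total, size, last))
            total += size
    return offsets, total

# Two 64-byte intermediates with disjoint lifetimes share one 64-byte slot.
offs, total = plan_scratch([("t0", 64, 0, 1), ("t1", 64, 2, 3)])
print(offs, total)  # both at offset 0; total scratch = 64 bytes
```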
Partitions in the output IR demarcate the scope of work that a driver (or other target) is required to compile and execute. Drivers do not process the full MLIR module, only individual partitions. In comparison to the shader programming model, the role of an MLIR program partition is similar to a compute shader pipeline state object (PSO): the D3D application is responsible for compiling, binding, and dispatching at the partition scope. The need for multiple partitions arises from factors such as differing lowering targets (e.g., driver vs. fallback) and the desired granularity of compilation and scheduling.
Earlier, we saw an example where the compiler organized two driver-target subgraph operations into a single partition and a fallback-target subgraph operation into a second partition. The portion of the output IR that the driver actually sees is restricted to the symbols reachable from the partition. In other words, the following “isolated” snippet of IR is what is ultimately passed to the driver when compiling a partition into an executable D3D12 API object:
cgc.module
{
cgc.subgraph @convReluAdd attributes {origin = "driverTarget", ...}
{
cgc.functional_definition private @convReluAddFunctional(
%in: !cgc.tensor<1x1x4x4x!cgc.float32>,
%w: !cgc.tensor<constant 1x1x3x3x!cgc.float32>,
%b: !cgc.tensor<constant 1x!cgc.float32>) -> !cgc.tensor<1x1x4x4x!cgc.float32>
{
%0 = cgc_op.convolution (%in, %w, %b) { start_padding=[1,1], end_padding=[1,1], ... }
%1 = cgc_op.relu (%0)
%2 = cgc_op.add (%1, %in)
cgc.return %2
}
}
cgc.partition @driverPartition(
%in: !cgc.memref<1x1x4x4x!cgc.float32>,
%w1: !cgc.memref<constant 1x1x3x3x!cgc.float32>,
%b1: !cgc.memref<constant 1x!cgc.float32>,
%w2: !cgc.memref<constant 1x1x3x3x!cgc.float32>,
%b2: !cgc.memref<constant 1x!cgc.float32>,
%output: !cgc.memref<1x1x4x4x!cgc.float32>,
%scratch: !cgc.memref<1x1x4x4x!cgc.float32>)
{
cgc.invoke @convReluAdd(%in, %w1, %b1, %scratch)
cgc.invoke @convReluAdd(%scratch, %w2, %b2, %output)
cgc.return
}
}
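The "isolation" step above is essentially symbol reachability: starting from the partition being compiled, collect every symbol it invokes, transitively. A small sketch (the data structures are hypothetical, not the D3D12 API):

```python
# Compute the set of symbols a driver sees when compiling one partition.
def reachable_symbols(invokes: dict[str, list[str]], root: str) -> set[str]:
    """invokes maps each symbol to the symbols it cgc.invoke's."""
    seen, stack = set(), [root]
    while stack:
        sym = stack.pop()
        if sym not in seen:
            seen.add(sym)
            stack.extend(invokes.get(sym, []))
    return seen

# Symbol graph of the running example's output IR.
invokes = {
    "main": ["driverPartition", "fallbackPartition"],
    "driverPartition": ["convReluAdd"],
    "fallbackPartition": ["upsample"],
}
# Isolating @driverPartition pulls in only the subgraphs it actually calls.
print(sorted(reachable_symbols(invokes, "driverPartition")))
```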
Alternatively, DXCGC could have split output IR up into three partitions:
graph
subgraph Program Arguments
out([%out]):::buffer;
scratch([%scratch]):::buffer;
in([%in]):::buffer;
end
subgraph Program Constants
w1([%w1]):::constant;
b1([%b1]):::constant;
w2([%w2]):::constant;
b2([%b2]):::constant;
end
subgraph driverPartition1
cra1["cgc.subgraph (conv+relu+add)"]:::op;
end
subgraph driverPartition2
cra2["cgc.subgraph (conv+relu+add)"]:::op;
end
subgraph fallbackPartition
upsample["cgc.subgraph (upsample)"]:::op;
end
scratch --> fallbackPartition;
scratch --> driverPartition2;
scratch --> driverPartition1;
in --> driverPartition1;
w1 --> driverPartition1;
b1 --> driverPartition1;
w2 --> driverPartition2;
b2 --> driverPartition2;
out --> fallbackPartition;
driverPartition1 -.-> driverPartition2;
driverPartition2 -.-> fallbackPartition;
classDef op fill:#66bbff,color:black;
classDef buffer fill:#ffff66,color:black;
classDef intermediate fill:#66ffff,color:black;
classDef constant fill:#bbff66,color:black;
classDef memview fill:#ffffff,color:black;
Ultimately, both partitionings above are valid, each with pros and cons. For example, the second, finer-grained partitioning is useful if the subgraph ops are expensive to compile: the client may compile all three partitions in parallel on separate threads. With the first partitioning, the driver is required to compile both subgraph ops on a single thread. In general, the granularity of partitioning is a tunable knob that shifts responsibility from the D3D client (many small partitions) to the driver (few large partitions).
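The parallel-compile benefit of fine-grained partitioning can be sketched as follows. The `compile_partition` function is a stand-in for an expensive driver compile, not a real API; only execution has cross-partition dependencies, so all compiles may run concurrently:

```python
# Compile independent partitions on separate threads.
from concurrent.futures import ThreadPoolExecutor
import time

def compile_partition(name: str) -> str:
    time.sleep(0.01)            # stand-in for an expensive driver compile
    return f"{name}: compiled"

partitions = ["driverPartition1", "driverPartition2", "fallbackPartition"]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(compile_partition, partitions))
print(results)
```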
Planning memory in advance is crucial for maximizing performance and minimizing the memory footprint at runtime; as such, there are no dynamic memory allocation operations in the output IR. Concrete operations in the output IR consume memrefs, which are simply views into existing resources. At runtime the D3D client must associate resources with the arguments and constants consumed by each MLIR program partition. As the previous section explained, partitions represent independently executable units of work with their own bindings.
cgc.partition @driverPartition(
%in: !cgc.memref<1x1x4x4x!cgc.float32>,
%w1: !cgc.memref<constant 1x1x3x3x!cgc.float32>,
%b1: !cgc.memref<constant 1x!cgc.float32>,
%w2: !cgc.memref<constant 1x1x3x3x!cgc.float32>,
%b2: !cgc.memref<constant 1x!cgc.float32>,
%output: !cgc.memref<1x1x4x4x!cgc.float32>,
%scratch: !cgc.memref<1x1x4x4x!cgc.float32>)
{
cgc.invoke @convReluAdd(%in, %w1, %b1, %scratch)
cgc.invoke @convReluAdd(%scratch, %w2, %b2, %output)
cgc.return
}
In the optimized MLIR representations (output IR or partition IR), each partition has a flat list of positional arguments for all resource bindings (both inputs and outputs are arguments). These arguments are also called “bind points” in the API, and this example has 7 bind points: %in, %w1, %b1, %w2, %b2, %output, and %scratch.
The D3D client must supply 7 bindings when executing this partition. GPU virtual addresses are used to link the bind points with D3D resources. This topic is covered in more detail later on in the section Resource Binding.
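The positional binding model can be sketched as a simple table builder; the names and structures below are hypothetical (the real mechanism is the D3D12 Resource Binding flow described later), but the invariant is the same: every bind point must be associated with a GPU virtual address, in argument order:

```python
# Build an ordered table of GPU VAs for a partition's positional bind points.
def build_binding_table(bind_points: list[str], bindings: dict[str, int]) -> list[int]:
    """Return GPU VAs ordered by bind point position; all points must be bound."""
    missing = [bp for bp in bind_points if bp not in bindings]
    if missing:
        raise ValueError(f"unbound bind points: {missing}")
    return [bindings[bp] for bp in bind_points]

# The 7 bind points of @driverPartition, bound to made-up GPU VAs.
bind_points = ["in", "w1", "b1", "w2", "b2", "output", "scratch"]
bindings = {bp: 0x10000 + i * 0x1000 for i, bp in enumerate(bind_points)}
table = build_binding_table(bind_points, bindings)
print([hex(va) for va in table])
```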
As noted earlier, compiler targets such as a driver are only responsible for implementing individual partitions. Something (the D3D application) must orchestrate resource allocation, partition compilation, and partition scheduling. This responsibility may be very small (e.g., when there is a single partition), but it might also be substantial when there are many partitions. These may seem like unnecessary steps (why not let the D3D runtime or driver handle it?), but the division of responsibility is intentional and desirable:
We are building an optional DXCGC runtime component that handles direct execution of output IR in a robust way, though it is intended as a starting point and not a requirement for end users. This will greatly simplify the execution flow while giving flexibility to modify or take control where needed.
sequenceDiagram
participant compiler as DXCGC
participant app as D3D Client / Application
participant int as DXCGC Runtime
participant d3d as D3D12
app->>+compiler: CGC Input IR
compiler-->>-app: CGC Output IR
app->>+int: CGC Output IR
int->>+d3d: Compile Flow Described Later
int->>d3d: Execution Flow Described Later
int-->>-app: Command List
app->>d3d: Submit Command List
For convenience this runtime component is built as a library with an API, but we do not envision it as a versioned, opaque DLL that ships in Windows; instead, we intend to ship it as an open-source sample that developers can modify and integrate into their engines as needed. The runtime is a helper layer that is not mandatory for correctly executing CGC Output IR, but it can simplify dependencies for applications that do not want to parse MLIR and handle fallback manually.
The previous section covered DXCGC and its use of MLIR to represent an unoptimized network (“input IR”) as well as an optimized representation (“output IR”). While optimized, the output of DXCGC is still an intermediate representation that must be translated into device-specific machine code before it can be executed. This section explains how the optimized MLIR bytecode is fully compiled and executed using MLIR programs in D3D12. There are similarities between MLIR programs and shader programs, and it may be helpful to compare the two along certain dimensions:
| Role | Shader Program | MLIR Program (for DXCGC) |
|---|---|---|
| Compiler Input | HLSL | CGC Input IR |
| Compiler Output | DXIL | CGC Output/Partition IR |
| Interchange Format | DX Container | MLIR Bytecode |
| IR Versioning | Shader Model | Semantic Version |
| IR Identity | Shader Profile | MLIR Interface |
| Compile API | Generic Program Subobject | MLIR Program Subobject |
| Dispatch API | Commandlist Dispatch | Commandlist DispatchGraph |
We covered the compiler input/output representations for DXCGC IRs in the previous section. This section will focus on the interchange format, versioning, and compile/execute APIs.
DXCGC IRs are transmitted across component boundaries using MLIR bytecode, which allows two components with different versions of MLIR to serialize and deserialize correctly. In order to guarantee this, however, the MLIR itself must comprise dialects that are immutable; it is for this reason that CGC IRs use CGC dialects only, at least at the public interface.
MLIR has a built-in dialect that expresses some of the same concepts in CGC dialects (func.func, module, tensor, and others), and there are also upstream dialects that capture high-level operations similar to cgc_op. These dialects are not directly usable in DXCGC IRs since we cannot guarantee they will not be broken; however, within DXCGC itself (or drivers/targets that have their own version of MLIR/LLVM) it is feasible and encouraged to convert to upstream dialects where it makes sense.
Parsing MLIR bytecode directly without depending on MLIR itself is possible, but it is limited to obtaining top-level information like the names of dialects and operations used. Interpreting the semantics of the encoded dialects requires access to definitions of the dialects, and this ultimately means taking a build-time dependency on MLIR (and thus consuming the whole LLVM repo). As an optional component, we will provide a standalone helper library for lightweight traversal of DXCGC IRs with a simple interface for the convenience of D3D12 applications and PIX. This library will be fully open source so that projects can extend or modify it if needed; there is no need to use an “official” version of this library, since any project that already consumes MLIR might prefer to simply add the DXCGC dialects (also open source) into its codebase. We anticipate that drivers might not use this library at all – some drivers already consume MLIR – but the DXCGC-oriented interfaces may still be useful for reference or ease of use.
DXCGC IRs are composed of CGC dialects that are versioned together using semantic versioning: if any CGC dialect is extended, all CGC dialects are collectively bumped to a new version. To keep things backward compatible, CGC dialects are only ever extended; existing ops/types/attributes are never changed or deleted.
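Because the dialects are extension-only, version compatibility reduces to a simple rule: a consumer built against a given MAJOR.MINOR can read any IR produced at the same MAJOR and an equal-or-lower MINOR. An illustrative check (not a shipped API):

```python
# Semantic-version compatibility under an "extension-only" evolution policy.
def can_consume(consumer: str, producer: str) -> bool:
    """True if a consumer at version `consumer` can read IR produced at
    version `producer` (same MAJOR, producer's MINOR not newer)."""
    c_major, c_minor, _ = (int(x) for x in consumer.split("."))
    p_major, p_minor, _ = (int(x) for x in producer.split("."))
    return c_major == p_major and c_minor >= p_minor

print(can_consume("1.3.0", "1.2.5"))   # older producer: OK
print(can_consume("1.3.0", "1.4.0"))   # producer may have used newer ops: not OK
print(can_consume("2.0.0", "1.9.0"))   # major mismatch: not OK
```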
MLIR bytecode by itself is insufficient to identify what the encoded data represents or how it should be used. For example, how does a component know that a serialized MLIR bytecode captures CGC Input IR or CGC Output IR? The dialects and ops within the MLIR might give some clues, but deducing the semantics from the presence of specific dialects or operations is brittle. For this reason, we use the notion of “MLIR interfaces” to define the role of different types of MLIR flowing through the DirectX stack. Some examples of MLIR interfaces for DXCGC are listed in DXCGC MLIR Interfaces.
Interfaces are associated with a GUID to help the compiler, runtime, drivers, and tools easily and quickly recognize, interpret, and validate any MLIR they might encounter. The GUID may be embedded directly in the MLIR bytecode (e.g., as an attribute of the top-level module), but it may also be stored externally from the MLIR bytecode. The D3D runtime will only permit MLIR bytecode to flow between the D3D client and driver when it recognizes the MLIR interface GUID, though this restriction may be relaxed when executing with developer mode.
MLIR interfaces are immutable once introduced. In other words, the full set of MLIR dialects (and their respective operations, types, etc.) must be static when an interface is introduced. This is similar to how shader models define a set of requirements, and it avoids a moving target for drivers and tools to support. Extensions to any existing MLIR interface require a new interface name and GUID.
A simplified view of MLIR program compilation is presented in this section. It is useful to contrast MLIR compilation with shader compilation, so we start with how compute shaders are compiled (using state objects introduced in DXR and extended to support programs with work graphs; this is largely identical to the classic compute pipeline state object flow but uses a more generic API).
sequenceDiagram
participant compiler as Shader Compiler
participant app as Application
participant d3d as D3D12
participant driver as Driver
app->>+compiler: Shader Source (HLSL)
compiler-->>-app: Shader Bytecode (DXIL/SPIR-V)
app->>+d3d: Shader Bytecode & Root Signature
d3d->>+driver: Shader Bytecode & Root Signature
driver-->>-d3d: ISA
d3d-->>-app: Generic Program (alternatively: PSO)
MLIR compilation with DXCGC is illustrated below, and it is largely identical to how shaders are compiled into state objects. The key difference with MLIR program compilation is the need to loop over partitions in Output IR and compile each separately.
sequenceDiagram
participant compiler as DXCGC
participant app as Application
participant d3d as D3D12
participant driver as Driver
app->>+compiler: CGC Input IR
compiler-->>-app: CGC Output IR
loop each Partition in CGC Output IR
app->>app: Isolate Partition IR
app->>+d3d: Partition IR
d3d->>+driver: Partition IR
driver-->>-d3d: ISA
d3d-->>-app: MLIR Program
end
If there are multiple targets used in the compilation of CGC Input IR, then non-driver partitions need to be compiled using the appropriate APIs (not shown above). For example, a fallback target that provides shader or metacommand implementations of MLIR subgraphs will have partition IR translated to existing D3D12 APIs.
Shaders and MLIR programs can coexist in the same D3D command list, though they use separate interfaces for command recording (Dispatch for compute shaders, DispatchGraph for MLIR programs). Eventually the two paths may interoperate more closely (e.g., shaders calling MLIR programs), but this is out of scope for now.
sequenceDiagram
participant app as Application
participant cmdlist as ID3D12GraphicsCommandList
app->>cmdlist: SetDescriptorHeaps
app->>app: Create descriptors
alt is generic program / PSO
app->>cmdlist: SetComputeRootSignature
app->>cmdlist: SetComputeRootConstants
app->>cmdlist: SetProgram (alternatively: SetPipelineState)
app->>cmdlist: Dispatch
else is MLIR program
app->>cmdlist: SetProgram
app->>cmdlist: DispatchGraph
end
Once a command list is recorded, it can be submitted to a command queue like any other D3D work. The resource binding structure for shaders (generic programs or PSOs) is a root signature. The resource binding structure for MLIR programs is the list of partition bind points in the MLIR, and resource bindings are supplied directly to the DispatchGraph call.
The two primary mechanisms for referencing device memory in D3D12 are GPU virtual addresses (GPUVAs) and descriptors, with the former being a raw pointer and the latter capturing richer semantics about how the memory is utilized or viewed.
Descriptors are essential for shader-based workloads, but they are slightly awkward for compute workloads that don’t use compute shaders: the most appropriate view type for these workloads is a UAV, but some of the fields aren’t entirely applicable (DXGI formats might not match tensor formats, the element-sized offset in FirstElement is clunky, raw UAVs aren’t relevant, etc.). UAVs are also subject to alignment constraints that make them problematic for ML workloads (e.g., eliding joins/gathers by offsetting writes to addresses that may not meet the current 4-byte alignment requirement). In light of the drawbacks of using UAVs, MLIR programs bind through GPUVAs.
Resource bindings are always buffers with unordered access:
- Enhanced Barrier Access: all resource bindings must be accessible with D3D12_BARRIER_ACCESS_COMMON or D3D12_BARRIER_ACCESS_UNORDERED_ACCESS.
- Legacy Barrier State: all resource bindings must be accessible with D3D12_RESOURCE_STATE_COMMON or D3D12_RESOURCE_STATE_UNORDERED_ACCESS.
NOTE: binding directly through GPUVAs also removes the opportunity for recording a command list once and swapping out the bindings with volatile descriptors. In order to enable this “command list reuse” optimization we would need to add an indirection, like binding a GPUVA to a buffer that itself contains the address for the actual data for the bind point. This may be added in the future with a flag like D3D12_MLIR_PROGRAM_BINDING_FLAG_INDIRECT or similar. Additionally, descriptor support may be added for texture access in the future.
MLIR program dispatches (i.e., ID3D12GraphicsCommandList::DispatchGraph) in a command list are by default asynchronous with respect to other work recorded into that same command list: the driver is free to execute an MLIR program immediately and in parallel with other work so long as it respects the semantics of all previously encountered resource barriers and implicitly synchronizing operations (e.g., ClearUnorderedAccessView). In other words, the behavior for synchronizing MLIR program dispatches is identical to the behavior for compute shader dispatches (i.e., ID3D12GraphicsCommandList::Dispatch) with unordered access; the synchronization scope for a DispatchGraph using an MLIR program is D3D12_BARRIER_SYNC_COMPUTE_SHADING.
Inter-dispatch synchronization: Resource access in an MLIR program is dictated by the GPUVAs associated with the MLIR partition’s bind points during a DispatchGraph call. The D3D client is responsible for explicitly recording buffer-specific barriers or global barriers between an MLIR program dispatch and any dependent operations that affect its input resources.
Intra-dispatch synchronization: A compiled MLIR program may logically encapsulate multiple “suboperations” (e.g., multiple CGC subgraph ops); however, from the perspective of the D3D client, this partition is treated as a single atomic unit of work. The driver is responsible for synchronization of suboperations within an MLIR program.
The DX shader compiler performs a bytecode validation step when compiling shader source to bytecode (i.e., HLSL to DXIL) to ensure the emitted instructions conform to the rules of a shader model. Shader bytecode validation is performed in the DXIL Validator, which is a library that is independent of the shader compiler itself. Upon successful validation the validator will compute and store a hash in the 16-byte DX container digest field, and this hash serves as a proxy to indicate to the D3D runtime that the bytecode has been previously validated: the validator and runtime use the same hashing algorithm, so the two should derive the same hash value barring any tampering or skipped validation (the digest would be empty).
For MLIR programs, the strategy is relaxed since MLIR has built-in validation to guarantee the compiler does not emit invalid IR. Additionally, the Partition IR for subgraph workloads is largely dependent on externally defined implementations from the IHV driver target. The driver only needs to validate that it receives subgraphs compatible with what it previously declared, which cannot be rigorously enforced or checked by DXCGC or D3D12.
When enabled, the D3D debug layers can perform some additional runtime validation to catch developer errors. For MLIR programs this validation includes:
- DispatchGraph arguments.
⚠️ This section covers a feature that is planned but not fully implemented.
Advanced Shader Delivery (or ASD) transitions shader bytecode-to-ISA compilation from a runtime step to an offline step. Normally this process requires both the D3D runtime and a hardware driver, but the design calls for the IHV shader compiler to be hoisted out of the driver into a plugin that can be invoked in an offline compile toolchain. At a conceptual level, this offline compile process looks something like the diagram below (not entirely accurate – read the full D3D specs for details):
erDiagram
client["D3D Client"]
toolchain["Offline Compile Toolchain"]
sodb["State Object Database (SODB)"]
psdb["Precompiled Shader Database (PSDB)"]
compiler["IHV Compiler Plugin"]
sodb_entry["SODB Entry"] {
D3D12_STATE_OBJECT_DESC desc
byte* bytecode
}
psdb_entry["PSDB Entry"] {
byte* isa
}
toolchain 1 to 1+ compiler : invokes
client 1 to 1+ sodb : creates
client 1 to 1+ sodb_entry : creates
toolchain 1 to 1+ sodb : compiles
sodb 1 to 1+ sodb_entry : contains
toolchain 1 to 1+ psdb : creates
psdb 1 to 1+ psdb_entry : contains
In a nutshell:
- State object descs and bytecode that would normally be passed to CreateStateObject or CreatePipelineState are bundled up in this SODB.

Offline compilation of MLIR programs would be tremendously advantageous, since the compile time for a DNN may be considerable. This offline toolchain fits well not only for shader precompilation but also for MLIR program precompilation: MLIR program SO descs and bytecode map 1:1 with shader SO/PSO descs and bytecode. DXCGC could trivially build an SODB covering all MLIR partitions in one or more input networks and produce a PSDB with perfect coverage for those networks (unlike game engines, there is no guesswork or large SO/PSO space that needs to be sampled to produce the SODB). However, there are a few issues:
- DXCGC needs declarative subgraphs to lower CGC Input IR to Output IR. In other words, we would need to extend the IHV compiler plugin interface and toolchain to reflect the MLIR capabilities covered later by the D3D12_FEATURE_MLIR_EXCHANGE API.
- DXCGC needs constant preprocessing to occur offline to fully eliminate some runtime initialization overhead. Constant preprocessing isn’t covered in this doc, but it involves transforming model parameters into hardware-optimal layouts. It might not be appropriate to store preprocessed weights in a PSDB, but it could be a valuable step in offline compilation of MLIR programs.
These issues will be explored in an upcoming revision of this doc.
Per D3D12 device interface semantics, these device methods can be called by multiple threads simultaneously.
✅ This is an existing D3D12 method that is unchanged.
HRESULT CheckFeatureSupport(
D3D12_FEATURE Feature,
_Inout_updates_bytes_(FeatureSupportDataSize) void* pFeatureSupportData,
UINT FeatureSupportDataSize
);
This is the generic D3D API for querying feature support and isn’t specific to MLIR programs.
⚠️ This is an existing D3D12 enum that is extended.
enum D3D12_FEATURE
{
...
D3D12_FEATURE_BYTECODE_BYPASS_HASH_SUPPORTED = 57,
...
+ D3D12_FEATURE_D3D12_OPTIONS_MLIR = 68,
+ D3D12_FEATURE_MLIR_EXCHANGE = 69,
+ D3D12_FEATURE_MLIR_INTERFACE_SUPPORT = 70,
};
Three new values are added:
- D3D12_FEATURE_D3D12_OPTIONS_MLIR: for querying general MLIR program support. See D3D12_FEATURE_DATA_D3D12_OPTIONS_MLIR.
- D3D12_FEATURE_MLIR_EXCHANGE: for queries encoded as MLIR. See D3D12_FEATURE_DATA_MLIR_EXCHANGE.
- D3D12_FEATURE_MLIR_INTERFACE_SUPPORT: for querying specific MLIR interface support. See D3D12_FEATURE_DATA_MLIR_INTERFACE_SUPPORT.

🆕 This is a new D3D12 struct.
struct D3D12_FEATURE_DATA_D3D12_OPTIONS_MLIR
{
_Out_ D3D12_MLIR_PROGRAMS_TIER MlirProgramsTier;
};
This struct reports support for MLIR programs (not specific to any MLIR interface).
| Member | Definition |
|---|---|
D3D12_MLIR_PROGRAMS_TIER |
Indicates support for MLIR programs in general (not specific to any MLIR interface). See D3D12_MLIR_PROGRAMS_TIER. |
🆕 This is a new D3D12 enum.
enum D3D12_MLIR_PROGRAMS_TIER
{
D3D12_MLIR_PROGRAMS_TIER_NOT_SUPPORTED = 0,
D3D12_MLIR_PROGRAMS_TIER_1_0 = 10,
D3D12_MLIR_PROGRAMS_TIER_2_0 = 20,
};
Expresses support for MLIR programs. Support is typically additive, with higher tiers being more capable than earlier tiers.
| Member | Definition |
|---|---|
D3D12_MLIR_PROGRAMS_TIER_NOT_SUPPORTED |
No support for MLIR programs on the device. Attempts to create any state objects containing MLIR programs will fail and using related APIs on command lists results in undefined behavior. |
D3D12_MLIR_PROGRAMS_TIER_1_0 |
The device supports MLIR programs with D3D12_MLIR_PROGRAM_BINDING_TYPE_GPU bindings. |
D3D12_MLIR_PROGRAMS_TIER_2_0 |
The device supports MLIR programs with D3D12_MLIR_PROGRAM_BINDING_TYPE_GPU and D3D12_MLIR_PROGRAM_BINDING_TYPE_CPU bindings. |
🆕 This is a new D3D12 struct.
struct D3D12_FEATURE_DATA_MLIR_EXCHANGE
{
GUID MlirInterface;
_Field_size_bytes_full_opt_(InputDataSizeInBytes) const void* pInputData;
_In_ SIZE_T InputDataSizeInBytes;
_Field_size_bytes_full_opt_(*OutputDataSizeInBytes) void* pOutputData;
_Inout_ SIZE_T* OutputDataSizeInBytes;
};
This struct is for exchanging MLIR data (both input and output data in the DX Container format) that conveys ML compiler-specific information including but not limited to:
It is not expected that a D3D12 app would directly invoke this API, and the MLIR interface GUIDs are externally defined to decouple churn in requirements from the D3D API.
| Member | Definition |
|---|---|
MlirInterface |
Unique identifier that captures the semantics and structure of the exchange. See DXCGC MLIR Interfaces for examples. If the runtime does not recognize the GUID, it will fail the API call (before the respective DDI is called) and return E_INVALIDARG; however, this check is bypassed when executing with Windows developer mode active. The driver should return DXGI_ERROR_UNSUPPORTED if it does not support the MLIR interface. |
pInputData |
Pointer to input data (in DX Container format) supplied by client and read by driver. May be null for some queries. Neither the runtime nor drivers may access this pointer once CheckFeatureSupport returns (the D3D client may free it immediately on return). |
InputDataSizeInBytes |
Size in bytes of pInputData. |
pOutputData |
Pointer to output data (in DX Container format) written by driver and read by client. May be null for some queries. Neither the runtime nor drivers may access this pointer once CheckFeatureSupport returns (the D3D client may free it immediately on return). |
OutputDataSizeInBytes |
Pointer to variable to store the size in bytes of pOutputData. This is written by the driver. |
🆕 This is a new D3D12 struct.
struct D3D12_FEATURE_DATA_MLIR_INTERFACE_SUPPORT
{
_In_ UINT NumMlirInterfaces;
_In_reads_(NumMlirInterfaces) const GUID *pMlirInterfacesRequested;
_Out_writes_(NumMlirInterfaces) BOOL *pMlirInterfacesSupported;
};
This struct reports support for specific MLIR interfaces.
| Member | Definition |
|---|---|
NumMlirInterfaces |
The number of MLIR interfaces the client is interested in using. See IR Identity. |
pMlirInterfacesRequested |
An array of GUIDs associated with the MLIR interfaces the client is interested in using. |
pMlirInterfacesSupported |
An array of booleans indicating if the MLIR interface is supported. |
✅ This is an existing D3D12 method that is unchanged.
HRESULT CreateStateObject(
_In_ const D3D12_STATE_OBJECT_DESC* pDesc,
_In_ REFIID riid, // ID3D12StateObject
_COM_Outptr_ void** ppStateObject
);
See the DXR spec for an overview of state objects. MLIR program partitions are encapsulated in an executable state object.
| Parameter | Definition |
|---|---|
pDesc |
Description of the state object to create. |
riid |
__uuidof(ID3D12StateObject) |
ppStateObject |
Returned state object. |
Return: HRESULT |
S_OK for success. E_INVALIDARG, E_OUTOFMEMORY on failure. The debug layer provides detailed status information. |
✅ This is an existing D3D12 struct that is unchanged.
struct D3D12_STATE_OBJECT_DESC
{
D3D12_STATE_OBJECT_TYPE Type;
UINT NumSubobjects;
_In_reads_(NumSubobjects) const D3D12_STATE_SUBOBJECT* pSubobjects;
};
When compiling MLIR program workloads, Type should be set to D3D12_STATE_OBJECT_TYPE_EXECUTABLE with at least one subobject of type D3D12_STATE_SUBOBJECT_TYPE_MLIR_PROGRAM.
| Member | Definition |
|---|---|
Type |
See D3D12_STATE_OBJECT_TYPE. |
NumSubobjects |
Number of subobjects in the pSubobjects array. |
pSubobjects |
Pointer to an array of subobjects. |
✅ This is an existing D3D12 enum that is unchanged.
enum D3D12_STATE_OBJECT_TYPE
{
D3D12_STATE_OBJECT_TYPE_COLLECTION = 0,
D3D12_STATE_OBJECT_TYPE_RAYTRACING_PIPELINE = 3,
D3D12_STATE_OBJECT_TYPE_EXECUTABLE = 4, // Relevant to MLIR programs
};
When creating MLIR program workloads, the state object type must be D3D12_STATE_OBJECT_TYPE_EXECUTABLE.
| Value | Definition |
|---|---|
| D3D12_STATE_OBJECT_TYPE_COLLECTION | Not applicable to MLIR programs. MLIR programs cannot be added to collection state objects. |
| D3D12_STATE_OBJECT_TYPE_RAYTRACING_PIPELINE | Not applicable to MLIR programs. MLIR programs cannot be added to raytracing state objects. |
| D3D12_STATE_OBJECT_TYPE_EXECUTABLE | State object that holds one or more programs, which may include zero or more MLIR programs, zero or more work graphs, and zero or more shader programs. |
✅ This is an existing D3D12 struct that is unchanged.
struct D3D12_STATE_SUBOBJECT
{
D3D12_STATE_SUBOBJECT_TYPE Type;
const void* pDesc;
};
Subobject within a state object. To encapsulate an MLIR program workload, the subobject type should be D3D12_STATE_SUBOBJECT_TYPE_MLIR_PROGRAM and point to a description of type D3D12_MLIR_PROGRAM_DESC.
| Member | Definition |
|---|---|
Type |
See D3D12_STATE_SUBOBJECT_TYPE. |
pDesc |
Pointer to state object description of the specified type. |
⚠️ This is an existing D3D12 enum that is extended.
enum D3D12_STATE_SUBOBJECT_TYPE
{
...
D3D12_STATE_SUBOBJECT_TYPE_DEPTH_STENCIL2 = 30,
- D3D12_STATE_SUBOBJECT_TYPE_MAX_VALID = ( D3D12_STATE_SUBOBJECT_TYPE_DEPTH_STENCIL2 + 1 )
+ D3D12_STATE_SUBOBJECT_TYPE_MLIR_PROGRAM = 31,
+ D3D12_STATE_SUBOBJECT_TYPE_MAX_VALID = ( D3D12_STATE_SUBOBJECT_TYPE_MLIR_PROGRAM + 1 )
};
One new value is added:
- D3D12_STATE_SUBOBJECT_TYPE_MLIR_PROGRAM: specifies a subobject that encapsulates an MLIR program partition.

🆕 This is a new D3D12 struct.
struct D3D12_MLIR_PROGRAM_DESC
{
LPCWSTR ProgramName;
const void* pBytecode;
SIZE_T BytecodeSizeInBytes;
};
| Member | Definition |
|---|---|
ProgramName |
Name to assign to the MLIR program in the state object. This can be used in APIs that need to reference MLIR program definitions, like ID3D12StateObjectProperties1::GetProgramIdentifier(). There can be multiple MLIR programs in a single state object. |
pBytecode |
A pointer to the MLIR program partition as MLIR bytecode wrapped in a DX Container. Neither the runtime nor drivers may access this pointer once CreateStateObject returns (the D3D client may free it immediately on return). |
BytecodeSizeInBytes |
Size of pBytecode in bytes. |
✅ This is an existing D3D12 method that is unchanged.
HRESULT AddToStateObject(
_In_ const D3D12_STATE_OBJECT_DESC* pAddition,
_In_ ID3D12StateObject* pStateObjectToGrowFrom,
_In_ REFIID riid, // ID3D12StateObject
_COM_Outptr_ void** ppNewStateObject
);
AddToStateObject() allows incrementally adding to an existing state object. For example, consider a state object $S_0=\{A,B\}$, where $A$ and $B$ are subobjects, and a second set of subobject descs $\{B,C\}$. This method produces a new state object $S_1=\{A,B,C\}$ that is the union of the subobjects. This interface was introduced in DXR to reduce CPU overhead when creating state objects that are supersets of, or share common state with, existing state objects.
MLIR programs are fully self-contained and do not affect or depend on other subobjects in an executable state object. There is little utility in adding new MLIR programs to an existing state object (or even creating a state object with multiple MLIR programs) beyond API consistency and perhaps application convenience. Since MLIR programs are a type of subobject that is only valid when used within an executable state object, the behavior for extending an executable state object applies to MLIR programs and is unchanged aside from allowing MLIR programs as a valid addition. In summary:
- Clients opt in to AddToStateObject() by specifying the flag D3D12_STATE_OBJECT_FLAG_ALLOW_STATE_OBJECT_ADDITIONS as part of the flags member of a D3D12_STATE_OBJECT_CONFIG subobject.
- The new state object retains all subobjects from pStateObjectToGrowFrom.

For all command list methods, at command list recording the runtime makes a deep copy of the parameters (not including data in GPU memory pointed to via GPU virtual addresses). The application’s CPU memory for the parameters is no longer referenced when the call returns. When the commands actually execute on the GPU timeline, any GPU memory identified by GPU virtual addresses gets accessed, giving the application freedom to change that memory independent of command list recording time.
✅ This is an existing D3D12 method that is unchanged.
void SetProgram(_In_ const D3D12_SET_PROGRAM_DESC* pDesc);
Method for setting a program on a command list, which may be an MLIR program, work graph, compute shader, etc.
| Parameter | Definition |
|---|---|
pDesc |
See D3D12_SET_PROGRAM_DESC. |
⚠️ This is an existing D3D12 struct that is extended.
struct D3D12_SET_PROGRAM_DESC
{
D3D12_PROGRAM_TYPE Type;
union
{
D3D12_SET_GENERIC_PIPELINE_DESC GenericPipeline;
D3D12_SET_RAYTRACING_PIPELINE_DESC RaytracingPipeline;
D3D12_SET_WORK_GRAPH_DESC WorkGraph;
+ D3D12_SET_MLIR_PROGRAM_DESC MlirProgram;
};
};
| Member | Definition |
|---|---|
Type |
Selects the program type to set, and thus which entry in the union to use. See D3D12_PROGRAM_TYPE. |
GenericPipeline |
Not relevant to this spec. |
RaytracingPipeline |
Not relevant to this spec. |
WorkGraph |
Not relevant to this spec. |
MlirProgram |
See D3D12_SET_MLIR_PROGRAM_DESC. |
⚠️ This is an existing D3D12 enum that is extended.
enum D3D12_PROGRAM_TYPE
{
D3D12_PROGRAM_TYPE_GENERIC_PIPELINE = 1,
D3D12_PROGRAM_TYPE_RAYTRACING_PIPELINE = 4,
D3D12_PROGRAM_TYPE_WORK_GRAPH = 5,
+ D3D12_PROGRAM_TYPE_MLIR_PROGRAM = 6
};
One new value is added:
- D3D12_PROGRAM_TYPE_MLIR_PROGRAM: specifies that the program type is an MLIR program.

🆕 This is a new D3D12 struct.
struct D3D12_SET_MLIR_PROGRAM_DESC
{
D3D12_PROGRAM_IDENTIFIER ProgramIdentifier;
D3D12_SET_MLIR_PROGRAM_FLAGS Flags;
};
Describes the program to set as active on the command list.
| Member | Definition |
|---|---|
D3D12_PROGRAM_IDENTIFIER ProgramIdentifier |
ID of the MLIR program to set. This value may be retrieved by looking up the name of an MLIR program in a state object with ID3D12StateObjectProperties1::GetProgramIdentifier. |
D3D12_SET_MLIR_PROGRAM_FLAGS Flags |
See D3D12_SET_MLIR_PROGRAM_FLAGS. |
🆕 This is a new D3D12 enum.
enum D3D12_SET_MLIR_PROGRAM_FLAGS
{
D3D12_SET_MLIR_PROGRAM_FLAG_NONE = 0,
};
This enum is reserved for future use.
✅ This is an existing D3D12 method that is unchanged.
void DispatchGraph(const D3D12_DISPATCH_GRAPH_DESC* pDesc);
The DispatchGraph method may be used to dispatch both work graphs and MLIR programs depending on the active program set with SetProgram. Resource binding is supplied through the pDesc parameter. The synchronization scope associated with DispatchGraph and an MLIR program is D3D12_BARRIER_SYNC_COMPUTE_SHADING.
| Parameter | Definition |
|---|---|
pDesc |
See D3D12_DISPATCH_GRAPH_DESC. |
⚠️ This is an existing D3D12 struct that is extended.
struct D3D12_DISPATCH_GRAPH_DESC
{
D3D12_DISPATCH_GRAPH_MODE Mode;
union
{
D3D12_NODE_CPU_INPUT NodeCPUInput;
D3D12_GPU_VIRTUAL_ADDRESS NodeGPUInput;
D3D12_MULTI_NODE_CPU_INPUT MultiNodeCPUInput;
D3D12_GPU_VIRTUAL_ADDRESS MultiNodeGPUInput;
+ D3D12_MLIR_PROGRAM_BINDINGS MlirProgramBindings;
};
};
This struct supplies resource binding information for a single DispatchGraph call. For MLIR program execution, the mode must be set to D3D12_DISPATCH_MODE_MLIR_PROGRAM and the MlirProgramBindings union member used for supplying bindings.
| Member | Definition |
|---|---|
NodeCPUInput |
Not relevant to this spec. |
NodeGPUInput |
Not relevant to this spec. |
MultiNodeCPUInput |
Not relevant to this spec. |
MultiNodeGPUInput |
Not relevant to this spec. |
MlirProgramBindings |
See D3D12_MLIR_PROGRAM_BINDINGS. |
⚠️ This is an existing D3D12 enum that is extended.
enum D3D12_DISPATCH_MODE
{
D3D12_DISPATCH_MODE_NODE_CPU_INPUT = 0,
D3D12_DISPATCH_MODE_NODE_GPU_INPUT = 1,
D3D12_DISPATCH_MODE_MULTI_NODE_CPU_INPUT = 2,
D3D12_DISPATCH_MODE_MULTI_NODE_GPU_INPUT = 3,
+ D3D12_DISPATCH_MODE_MLIR_PROGRAM = 4
};
One new value is added:
- D3D12_DISPATCH_MODE_MLIR_PROGRAM: indicates that the graph dispatch is an MLIR program, and bindings must be supplied in the MlirProgramBindings field of D3D12_DISPATCH_GRAPH_DESC.

🆕 This is a new D3D12 struct.
struct D3D12_MLIR_PROGRAM_BINDINGS
{
UINT NumBindings;
const D3D12_MLIR_PROGRAM_BINDING* pBindings;
};
This struct is used to supply resource bindings for one execution of an active MLIR program.
| Member | Definition |
|---|---|
NumBindings |
The size of the pBindings array. |
pBindings |
Pointer to an array of bindings. See D3D12_MLIR_PROGRAM_BINDING. Neither the runtime nor drivers may access this pointer once DispatchGraph returns (the D3D client may free it immediately on return). The runtime will not make a copy of this data. |
🆕 This is a new D3D12 struct.
struct D3D12_MLIR_PROGRAM_BINDING
{
UINT BindPointIndex;
D3D12_MLIR_PROGRAM_BINDING_FLAGS Flags;
D3D12_MLIR_PROGRAM_BINDING_TYPE Type;
union
{
D3D12_GPU_VIRTUAL_ADDRESS_RANGE GpuBinding;
D3D12_MLIR_PROGRAM_CPU_BINDING CpuBinding;
};
};
This struct is used to supply bindings for an MLIR program partition bind point. A single binding populates one bind point. Bind points may be fed with either GPU data (typically large tensors or buffers) or CPU data (typically small parameters that don’t affect compilation). If CPU data is supplied, it must be copied during command list recording such that the application’s CPU memory for the parameters is no longer referenced when the call returns.
| Member | Definition |
|---|---|
BindPointIndex |
Index of the bind point to populate. |
Flags |
See D3D12_MLIR_PROGRAM_BINDING_FLAGS. |
Type |
Indicates the type of binding and thus which union field is used. |
GpuBinding |
Pointer to GPU memory. |
CpuBinding |
Pointer to CPU memory. |
🆕 This is a new D3D12 enum.
enum D3D12_MLIR_PROGRAM_BINDING_FLAGS
{
D3D12_MLIR_PROGRAM_BINDING_FLAG_NONE = 0,
};
This enum is reserved for future use.
🆕 This is a new D3D12 enum.
enum D3D12_MLIR_PROGRAM_BINDING_TYPE
{
D3D12_MLIR_PROGRAM_BINDING_TYPE_GPU = 0,
D3D12_MLIR_PROGRAM_BINDING_TYPE_CPU = 1,
};
| Value | Definition |
|---|---|
D3D12_MLIR_PROGRAM_BINDING_TYPE_GPU |
Indicates a binding of GPU data in D3D12_MLIR_PROGRAM_BINDING. |
D3D12_MLIR_PROGRAM_BINDING_TYPE_CPU |
Indicates a binding of CPU data in D3D12_MLIR_PROGRAM_BINDING. |
🆕 This is a new D3D12 struct.
struct D3D12_MLIR_PROGRAM_CPU_BINDING
{
const void* pData;
UINT64 DataSizeInBytes;
};
| Value | Definition |
|---|---|
pData |
Pointer to CPU memory owned by the client. Neither the runtime nor drivers may access this pointer once DispatchGraph returns (the D3D client may free it immediately on return). The runtime will not make a copy of this data. The driver copies/saves this data into command list state. |
DataSizeInBytes |
Size in bytes of pData. |
// Get one partition from output of CGC compiler.
std::vector<std::byte> partitionBytecode = cgcHelper->GetPartitionBytecode(0);
// Executable-type state object with a single MLIR program subobject.
// Might have more than one program-type subobject in the state object...
CD3DX12_STATE_OBJECT_DESC stateObjectDesc(D3D12_STATE_OBJECT_TYPE_EXECUTABLE);
auto mlGraph = stateObjectDesc.CreateSubobject<CD3DX12_MLIR_PROGRAM_SUBOBJECT>();
mlGraph->SetProgramName(L"partition0");
mlGraph->SetBytecode(partitionBytecode.data(), partitionBytecode.size());
// Driver compiles MLIR program partition on current thread.
ID3D12Device14* device = ...;
ComPtr<ID3D12StateObject> stateObject;
THROW_IF_FAILED(device->CreateStateObject(&(*stateObjectDesc), IID_PPV_ARGS(&stateObject)));
// Created earlier (encapsulates at least one MLIR program subobject)
ComPtr<ID3D12StateObject> stateObject = ...;
// Each bind point is bound to its own buffer resource in this example.
std::array<ID3D12Resource*, 5> resources = ...;
// Get program identifier for the MLIR program in the state object.
ComPtr<ID3D12StateObjectProperties1> stateObjectProperties;
THROW_IF_FAILED(stateObject->QueryInterface(IID_PPV_ARGS(&stateObjectProperties)));
auto programId = stateObjectProperties->GetProgramIdentifier(L"partition0");
// Set active program to MLIR program.
D3D12_SET_PROGRAM_DESC programDesc = {};
programDesc.Type = D3D12_PROGRAM_TYPE_MLIR_PROGRAM;
programDesc.MlirProgram.ProgramIdentifier = programId;
programDesc.MlirProgram.Flags = D3D12_SET_MLIR_PROGRAM_FLAG_NONE;
commandList->SetProgram(&programDesc);
// Associate bind points with GPUVAs.
std::array<D3D12_MLIR_PROGRAM_BINDING, 5> bindings = {};
for (uint32_t i = 0; i < bindings.size(); i++)
{
bindings[i].BindPointIndex = i;
bindings[i].Flags = D3D12_MLIR_PROGRAM_BINDING_FLAG_NONE;
bindings[i].Type = D3D12_MLIR_PROGRAM_BINDING_TYPE_GPU;
bindings[i].GpuBinding.StartAddress = resources[i]->GetGPUVirtualAddress();
bindings[i].GpuBinding.SizeInBytes = resources[i]->GetDesc().Width;
}
D3D12_DISPATCH_GRAPH_DESC dispatchDesc = {};
dispatchDesc.Mode = D3D12_DISPATCH_MODE_MLIR_PROGRAM;
dispatchDesc.MlirProgramBindings.NumBindings = bindings.size();
dispatchDesc.MlirProgramBindings.pBindings = bindings.data();
// Dispatch the MLIR program partition.
commandList->DispatchGraph(&dispatchDesc);
This example illustrates a client (like DXCGC) checking for driver support of MLIR programs in general. If the driver doesn’t support MLIR programs, there is no need to check individual MLIR interfaces.
ID3D12Device* device = ...;
D3D12_FEATURE_DATA_D3D12_OPTIONS_MLIR options = {};
if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS_MLIR, &options, sizeof(options))) &&
options.MlirProgramsTier > D3D12_MLIR_PROGRAMS_TIER_NOT_SUPPORTED)
{
// Client can use MLIR programs
}
This example illustrates a client (like DXCGC) checking for driver support of specific MLIR interfaces.
ID3D12Device* device = ...;
// In this example, the client is interested in subgraph-related MLIR interfaces.
std::array<GUID, 5> mlirInterfacesRequested =
{
CGC_SUBGRAPH_PARTITION,
CGC_SUBGRAPH_DECLARATION_REQUEST,
CGC_SUBGRAPH_DECLARATION,
CGC_SUBGRAPH_SPECIALIZATION_REQUEST,
CGC_SUBGRAPH_SPECIALIZATION,
};
std::array<BOOL, mlirInterfacesRequested.size()> mlirInterfacesSupported = {};
D3D12_FEATURE_DATA_MLIR_INTERFACE_SUPPORT support = {};
support.NumMlirInterfaces = mlirInterfacesRequested.size();
support.pMlirInterfacesRequested = mlirInterfacesRequested.data();
support.pMlirInterfacesSupported = mlirInterfacesSupported.data();
if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_MLIR_INTERFACE_SUPPORT, &support, sizeof(support))) &&
std::all_of(mlirInterfacesSupported.begin(), mlirInterfacesSupported.end(), [](BOOL b) { return b; }))
{
// Driver supports subgraph lowering and execution
}
The D3D12_FEATURE_MLIR_EXCHANGE interface is used to discover the declarative transformation capabilities of a D3D12 adapter (the “Get HW Caps & Driver Subgraph Transforms” arrow from an earlier section).
ID3D12Device* device = ...;
SIZE_T outputBufferSize = 0;
D3D12_FEATURE_DATA_MLIR_EXCHANGE exchange =
{
.MlirInterface = CGC_SUBGRAPH_DECLARATION_REQUEST,
.pInputData = nullptr,
.InputDataSizeInBytes = 0,
.pOutputData = nullptr,
.OutputDataSizeInBytes = &outputBufferSize
};
// Determine size of output buffer.
if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_MLIR_EXCHANGE, &exchange, sizeof(exchange))))
{
// Driver doesn't support ML subgraphs. Compiler should use shaders only.
return;
}
// Allocate space for output from driver.
std::vector<std::byte> buffer(outputBufferSize);
exchange.pOutputData = buffer.data();
// Fetch output from driver.
THROW_IF_FAILED(device->CheckFeatureSupport(D3D12_FEATURE_MLIR_EXCHANGE, &exchange, sizeof(exchange)));
// Compiler registers patterns encoded in buffer.
It may seem odd initially that a client must validate something declared and defined by the driver, but this is a consequence of drivers not having full context when declaring transformations: a declared pattern may be optimistic and later rejected under certain circumstances (e.g., a combination of tensor shapes that isn’t supported). It is extremely difficult to express all possible constraints precisely and declaratively, so validation is an unfortunate but necessary step.
ID3D12Device* device = ...;
// This is populated with MLIR from the compiler. This is NOT simply the
// driver-declared subgraph pattern (what was fetched from the driver earlier);
// it is instead a concrete partition with tensor shapes.
std::vector<std::byte> subgraphMLIR = ...;
SIZE_T outputBufferSize = 0;
D3D12_FEATURE_DATA_MLIR_EXCHANGE exchange =
{
.MlirInterface = CGC_SUBGRAPH_SPECIALIZATION_REQUEST,
.pInputData = subgraphMLIR.data(),
.InputDataSizeInBytes = subgraphMLIR.size(),
.pOutputData = nullptr,
.OutputDataSizeInBytes = &outputBufferSize
};
// Determine size of output buffer for response.
if (FAILED(device->CheckFeatureSupport(D3D12_FEATURE_MLIR_EXCHANGE, &exchange, sizeof(exchange))))
{
return;
}
// Allocate space for output from driver.
std::vector<std::byte> buffer(outputBufferSize);
exchange.pOutputData = buffer.data();
// Fetch response from driver.
THROW_IF_FAILED(device->CheckFeatureSupport(D3D12_FEATURE_MLIR_EXCHANGE, &exchange, sizeof(exchange)));
// The compiler scans the driver's response buffer to determine if the subgraph is valid.
This table maps API types and methods related to MLIR programs to their respective DDI types and function pointers. Unchanged interfaces are listed for completeness, but their signatures are not detailed in the sections that follow this table.
| Functionality | API | DDI | Type | Modification |
|---|---|---|---|---|
| MLIR Exchange | ID3D12Device::CheckFeatureSupport | PFND3D12DDI_GETCAPS | Function | ✅ Unchanged |
| MLIR Exchange | D3D12_FEATURE | D3D12DDICAPS_TYPE | Enum | ⚠️ Extended |
| MLIR Exchange | D3D12_FEATURE_MLIR_EXCHANGE | D3D12DDICAPS_TYPE_MLIR_EXCHANGE_0119 | Enum Value | 🆕 New |
| MLIR Exchange | D3D12_FEATURE_DATA_MLIR_EXCHANGE | D3D12DDI_MLIR_EXCHANGE_0119 | Struct | 🆕 New |
| MLIR Program Support | ID3D12Device::CheckFeatureSupport | PFND3D12DDI_GETCAPS | Function | ✅ Unchanged |
| MLIR Program Support | D3D12_FEATURE | D3D12DDICAPS_TYPE | Enum | ⚠️ Extended |
| MLIR Program Support | D3D12_FEATURE_D3D12_OPTIONS_MLIR | D3D12DDI_OPTIONS_DATA_MLIR | Enum Value | 🆕 New |
| MLIR Program Support | D3D12_MLIR_PROGRAMS_TIER | D3D12DDI_MLIR_PROGRAMS_TIER | Enum | 🆕 New |
| MLIR Interface Support | ID3D12Device::CheckFeatureSupport | PFND3D12DDI_GETCAPS | Function | ✅ Unchanged |
| MLIR Interface Support | D3D12_FEATURE | D3D12DDICAPS_TYPE | Enum | ⚠️ Extended |
| MLIR Interface Support | D3D12_FEATURE_MLIR_INTERFACE_SUPPORT | D3D12DDICAPS_TYPE_MLIR_INTERFACE_SUPPORT_0119 | Enum Value | 🆕 New |
| MLIR Interface Support | D3D12_FEATURE_DATA_MLIR_INTERFACE_SUPPORT | D3D12DDI_MLIR_INTERFACE_SUPPORT_0119 | Struct | 🆕 New |
| MLIR Program Creation | ID3D12Device::CreateStateObject | PFND3D12DDI_CREATE_STATE_OBJECT_0054 | Function | ✅ Unchanged |
| MLIR Program Creation | D3D12_STATE_OBJECT_DESC | D3D12DDIARG_CREATE_STATE_OBJECT_0054 | Struct | ✅ Unchanged |
| MLIR Program Creation | D3D12_STATE_OBJECT_TYPE | D3D12DDI_STATE_OBJECT_TYPE | Enum | ✅ Unchanged |
| MLIR Program Creation | D3D12_STATE_SUBOBJECT | D3D12DDI_STATE_SUBOBJECT_0054 | Struct | ✅ Unchanged |
| MLIR Program Creation | D3D12_STATE_SUBOBJECT_TYPE | D3D12DDI_STATE_SUBOBJECT_TYPE | Enum | ⚠️ Extended |
| MLIR Program Creation | D3D12_STATE_SUBOBJECT_TYPE_MLIR_PROGRAM | D3D12DDI_STATE_SUBOBJECT_TYPE_MLIR_PROGRAM | Enum Value | 🆕 New |
| MLIR Program Creation | D3D12_MLIR_PROGRAM_DESC | D3D12DDI_MLIR_PROGRAM_DESC_0119 | Struct | 🆕 New |
| MLIR Program Addition | ID3D12Device::AddToStateObject | PFND3D12DDI_ADD_TO_STATE_OBJECT_0072 | Function | ✅ Unchanged |
| MLIR Program Execution | ID3D12GraphicsCommandList::SetProgram | PFND3D12DDI_SET_PROGRAM_0108 | Function | ✅ Unchanged |
| MLIR Program Execution | D3D12_SET_PROGRAM_DESC | D3D12DDI_SET_PROGRAM_DESC_0108 | Struct | ⚠️ Extended |
| MLIR Program Execution | D3D12_PROGRAM_TYPE | D3D12DDI_PROGRAM_TYPE_0108 | Enum | ⚠️ Extended |
| MLIR Program Execution | D3D12_PROGRAM_TYPE_MLIR_PROGRAM | D3D12DDI_PROGRAM_TYPE_MLIR_PROGRAM_0119 | Enum Value | 🆕 New |
| MLIR Program Execution | D3D12_SET_MLIR_PROGRAM_DESC | D3D12DDI_SET_MLIR_PROGRAM_DESC_0119 | Struct | 🆕 New |
| MLIR Program Execution | D3D12_SET_MLIR_PROGRAM_FLAGS | D3D12DDI_SET_MLIR_PROGRAM_FLAGS_0119 | Enum | 🆕 New |
| MLIR Program Execution | ID3D12GraphicsCommandList::DispatchGraph | PFND3D12DDI_DISPATCH_GRAPH_0108 | Function | ✅ Unchanged |
| MLIR Program Execution | D3D12_DISPATCH_GRAPH_DESC | D3D12DDI_DISPATCH_GRAPH_DESC_0108 | Struct | ⚠️ Extended |
| MLIR Program Execution | D3D12_DISPATCH_MODE | D3D12DDI_DISPATCH_MODE_0108 | Enum | ⚠️ Extended |
| MLIR Program Execution | D3D12_DISPATCH_MODE_MLIR_PROGRAM | D3D12DDI_DISPATCH_MODE_MLIR_PROGRAM_0119 | Enum Value | 🆕 New |
| MLIR Program Execution | D3D12_MLIR_PROGRAM_BINDINGS | D3D12DDI_MLIR_PROGRAM_BINDINGS_0119 | Struct | 🆕 New |
| MLIR Program Execution | D3D12_MLIR_PROGRAM_BINDING | D3D12DDI_MLIR_PROGRAM_BINDING_0119 | Struct | 🆕 New |
| MLIR Program Execution | D3D12_MLIR_PROGRAM_BINDING_FLAGS | D3D12DDI_MLIR_PROGRAM_BINDING_FLAGS_0119 | Enum | 🆕 New |
| MLIR Program Execution | D3D12_MLIR_PROGRAM_BINDING_TYPE | D3D12DDI_MLIR_PROGRAM_BINDING_TYPE_0119 | Enum | 🆕 New |
| MLIR Program Execution | D3D12_MLIR_PROGRAM_CPU_BINDING | D3D12DDI_MLIR_PROGRAM_CPU_BINDING_0119 | Struct | 🆕 New |
There are no new adapter, device, or command list function signatures for MLIR programs. The following functions are relevant to MLIR programs:
typedef struct D3D12DDI_DEVICE_FUNCS_CORE_0119
{
...
PFND3D12DDI_CREATE_STATE_OBJECT_0054 pfnCreateStateObject; // for MLIR Program Creation
PFND3D12DDI_ADD_TO_STATE_OBJECT_0072 pfnAddToStateObject; // for MLIR Program Addition
PFND3D12DDI_GET_PROGRAM_IDENTIFIER_0108 pfnGetProgramIdentifier; // for MLIR Program Execution
} D3D12DDI_DEVICE_FUNCS_CORE_0119;
typedef struct D3D12DDI_COMMAND_LIST_FUNCS_3D_0119
{
...
PFND3D12DDI_SET_PROGRAM_0108 pfnSetProgram; // for MLIR Program Execution
PFND3D12DDI_DISPATCH_GRAPH_0108 pfnDispatchGraph; // for MLIR Program Execution
} D3D12DDI_COMMAND_LIST_FUNCS_3D_0119;
See device methods and command-list methods for details.
typedef enum D3D12DDICAPS_TYPE
{
...
+ D3D12DDI_CAPS_TYPE_MLIR = 1100,
+ D3D12DDICAPS_TYPE_MLIR_EXCHANGE_0119 = 1101,
+ D3D12DDICAPS_TYPE_MLIR_INTERFACE_SUPPORT_0119 = 1102,
} D3D12DDICAPS_TYPE;
Three new values are added to the existing D3D12DDICAPS_TYPE enum used by the existing DDI function PFND3D12DDI_GETCAPS:
- D3D12DDI_CAPS_TYPE_MLIR: corresponds to the caps structure D3D12DDI_OPTIONS_DATA_MLIR.
- D3D12DDICAPS_TYPE_MLIR_EXCHANGE_0119: corresponds to the caps structure D3D12DDI_MLIR_EXCHANGE_0119.
- D3D12DDICAPS_TYPE_MLIR_INTERFACE_SUPPORT_0119: corresponds to the caps structure D3D12DDI_MLIR_INTERFACE_SUPPORT_0119.
typedef struct D3D12DDI_OPTIONS_DATA_MLIR
{
D3D12DDI_MLIR_PROGRAMS_TIER MlirProgramsTier;
} D3D12DDI_OPTIONS_DATA_MLIR;
typedef struct D3D12DDI_MLIR_EXCHANGE_0119
{
GUID MlirInterface;
const void* pInputData;
SIZE_T InputDataSizeInBytes;
void* pOutputData;
SIZE_T* OutputDataSizeInBytes;
} D3D12DDI_MLIR_EXCHANGE_0119;
The lifetime of the memory referenced by pInputData and pOutputData is the scope of the call into PFND3D12DDI_GETCAPS. The driver must not dereference these pointers outside the scope of the DDI call.
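As an illustration of this contract, the sketch below mimics the two-phase size-query convention a driver-side caps handler might follow: report the required size when no output buffer is provided, otherwise fill the caller's buffer, touching every pointer only within the scope of the call. This is standard C++ with hypothetical names (MockExchange, MockGetCaps, kPatterns), not the actual DDI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical blob a driver might return for a declaration request.
static const char kPatterns[] = "dxpatterns { ... }";

// Simplified stand-in for D3D12DDI_MLIR_EXCHANGE_0119 (names illustrative).
struct MockExchange
{
    const void* pInputData = nullptr;
    size_t InputDataSizeInBytes = 0;
    void* pOutputData = nullptr;
    size_t* pOutputDataSizeInBytes = nullptr;
};

// Mock driver handler: when pOutputData is null, report the required size;
// otherwise copy the payload into the caller's buffer. All pointers are
// only dereferenced inside this call, honoring the lifetime rule above.
bool MockGetCaps(MockExchange& e)
{
    if (e.pOutputDataSizeInBytes == nullptr)
        return false; // the size pointer is required in both phases
    if (e.pOutputData == nullptr)
    {
        *e.pOutputDataSizeInBytes = sizeof(kPatterns); // phase 1: size query
        return true;
    }
    if (*e.pOutputDataSizeInBytes < sizeof(kPatterns))
        return false; // caller's buffer is too small
    std::memcpy(e.pOutputData, kPatterns, sizeof(kPatterns)); // phase 2: fill
    return true;
}
```

A client would call the handler twice, exactly as in the CheckFeatureSupport examples earlier: once with a null output buffer to learn the size, then again with an allocated buffer.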
See details in the API equivalent D3D12_FEATURE_DATA_MLIR_EXCHANGE.
See details in the API equivalent D3D12_FEATURE_DATA_D3D12_OPTIONS_MLIR.
typedef enum D3D12DDI_MLIR_PROGRAMS_TIER
{
D3D12DDI_MLIR_PROGRAMS_TIER_NOT_SUPPORTED = 0,
D3D12DDI_MLIR_PROGRAMS_TIER_1_0 = 10,
D3D12DDI_MLIR_PROGRAMS_TIER_2_0 = 20
} D3D12DDI_MLIR_PROGRAMS_TIER;
See details in the API equivalent D3D12_MLIR_PROGRAMS_TIER.
typedef struct D3D12DDI_MLIR_INTERFACE_SUPPORT_0119
{
UINT NumMlirInterfaces;
const GUID* pMlirInterfacesRequested;
BOOL* pMlirInterfacesSupported;
} D3D12DDI_MLIR_INTERFACE_SUPPORT_0119;
See details in the API equivalent D3D12_FEATURE_DATA_MLIR_INTERFACE_SUPPORT.
typedef enum D3D12DDI_STATE_SUBOBJECT_TYPE
{
...
+ D3D12DDI_STATE_SUBOBJECT_TYPE_MLIR_PROGRAM = 34,
} D3D12DDI_STATE_SUBOBJECT_TYPE;
One new value is added to the existing D3D12DDI_STATE_SUBOBJECT_TYPE enum used by the existing DDI function PFND3D12DDI_CREATE_STATE_OBJECT_0054:
- D3D12DDI_STATE_SUBOBJECT_TYPE_MLIR_PROGRAM: corresponds to the subobject structure D3D12DDI_MLIR_PROGRAM_DESC_0119.
typedef struct D3D12DDI_MLIR_PROGRAM_DESC_0119
{
LPCWSTR ProgramName;
const void* pBytecode;
SIZE_T BytecodeSizeInBytes;
} D3D12DDI_MLIR_PROGRAM_DESC_0119;
The lifetime of the memory referenced by pBytecode is the scope of the call into PFND3D12DDI_CREATE_STATE_OBJECT_0054, so the driver must not dereference this pointer outside of this scope. If the driver requires a background compile it must make a copy (the runtime does not keep a copy of the data passed from the application).
See details in the API equivalent D3D12_MLIR_PROGRAM_DESC.
typedef struct D3D12DDI_SET_PROGRAM_DESC_0108
{
D3D12DDI_PROGRAM_TYPE_0108 Type;
union
{
D3D12DDI_SET_GENERIC_PIPELINE_DESC_0108 GenericPipeline;
D3D12DDI_SET_RAYTRACING_PIPELINE_DESC_0108 RaytracingPipeline;
D3D12DDI_SET_WORK_GRAPH_DESC_0108 WorkGraph;
+ D3D12DDI_SET_MLIR_PROGRAM_DESC_0119 MlirProgram;
};
} D3D12DDI_SET_PROGRAM_DESC_0108;
See details in the API equivalent D3D12_SET_PROGRAM_DESC.
typedef enum D3D12DDI_PROGRAM_TYPE_0108
{
D3D12DDI_PROGRAM_TYPE_GENERIC_PIPELINE_0108 = 1,
D3D12DDI_PROGRAM_TYPE_RAYTRACING_PIPELINE_0108 = 4,
D3D12DDI_PROGRAM_TYPE_WORK_GRAPH_0108 = 5,
+ D3D12DDI_PROGRAM_TYPE_MLIR_PROGRAM_0119 = 6
} D3D12DDI_PROGRAM_TYPE_0108;
See details in the API equivalent D3D12_PROGRAM_TYPE.
typedef struct D3D12DDI_SET_MLIR_PROGRAM_DESC_0119
{
D3D12DDI_PROGRAM_IDENTIFIER_0108 ProgramIdentifier;
D3D12DDI_SET_MLIR_PROGRAM_FLAGS_0119 Flags;
} D3D12DDI_SET_MLIR_PROGRAM_DESC_0119;
See details in the API equivalent D3D12_SET_MLIR_PROGRAM_DESC.
typedef enum D3D12DDI_SET_MLIR_PROGRAM_FLAGS_0119
{
D3D12DDI_SET_MLIR_PROGRAM_FLAG_NONE_0119 = 0x0,
// Reserved for possible future extensions
} D3D12DDI_SET_MLIR_PROGRAM_FLAGS_0119;
DEFINE_ENUM_FLAG_OPERATORS( D3D12DDI_SET_MLIR_PROGRAM_FLAGS_0119 )
See details in the API equivalent D3D12_SET_MLIR_PROGRAM_FLAGS.
typedef struct D3D12DDI_DISPATCH_GRAPH_DESC_0108
{
D3D12DDI_DISPATCH_MODE_0108 Mode;
union
{
D3D12DDI_NODE_CPU_INPUT_0108 NodeCPUInput;
D3D12DDI_GPU_VIRTUAL_ADDRESS NodeGPUInput;
D3D12DDI_MULTI_NODE_CPU_INPUT_0108 MultiNodeCPUInput;
D3D12DDI_GPU_VIRTUAL_ADDRESS MultiNodeGPUInput;
+ D3D12DDI_MLIR_PROGRAM_BINDINGS_0119 MlirProgramBindings;
};
} D3D12DDI_DISPATCH_GRAPH_DESC_0108;
See details in the API equivalent D3D12_DISPATCH_GRAPH_DESC.
typedef enum D3D12DDI_DISPATCH_MODE_0108
{
D3D12DDI_DISPATCH_MODE_NODE_CPU_INPUT_0108 = 0,
D3D12DDI_DISPATCH_MODE_NODE_GPU_INPUT_0108 = 1,
D3D12DDI_DISPATCH_MODE_MULTI_NODE_CPU_INPUT_0108 = 2,
D3D12DDI_DISPATCH_MODE_MULTI_NODE_GPU_INPUT_0108 = 3,
+ D3D12DDI_DISPATCH_MODE_MLIR_PROGRAM_0119 = 4
} D3D12DDI_DISPATCH_MODE_0108;
See details in the API equivalent D3D12_DISPATCH_MODE.
typedef struct D3D12DDI_MLIR_PROGRAM_BINDINGS_0119
{
UINT NumBindings;
const D3D12DDI_MLIR_PROGRAM_BINDING_0119* pBindings;
} D3D12DDI_MLIR_PROGRAM_BINDINGS_0119;
The lifetime of the memory referenced by pBindings is the scope of the call into PFND3D12DDI_DISPATCH_GRAPH_0108, so the driver must not dereference this pointer outside of this scope.
See details in the API equivalent D3D12_MLIR_PROGRAM_BINDINGS.
typedef struct D3D12DDI_MLIR_PROGRAM_BINDING_0119
{
UINT BindPointIndex;
D3D12DDI_MLIR_PROGRAM_BINDING_FLAGS_0119 Flags;
D3D12DDI_MLIR_PROGRAM_BINDING_TYPE_0119 Type;
union
{
D3D12DDI_GPU_VIRTUAL_ADDRESS_RANGE GpuBinding;
D3D12DDI_MLIR_PROGRAM_CPU_BINDING_0119 CpuBinding;
};
} D3D12DDI_MLIR_PROGRAM_BINDING_0119;
See details in the API equivalent D3D12_MLIR_PROGRAM_BINDING.
typedef enum D3D12DDI_MLIR_PROGRAM_BINDING_FLAGS_0119
{
D3D12DDI_MLIR_PROGRAM_BINDING_FLAG_NONE_0119 = 0,
} D3D12DDI_MLIR_PROGRAM_BINDING_FLAGS_0119;
See details in the API equivalent D3D12_MLIR_PROGRAM_BINDING_FLAGS.
typedef enum D3D12DDI_MLIR_PROGRAM_BINDING_TYPE_0119
{
D3D12DDI_MLIR_PROGRAM_BINDING_TYPE_GPU_0119 = 0,
D3D12DDI_MLIR_PROGRAM_BINDING_TYPE_CPU_0119 = 1,
} D3D12DDI_MLIR_PROGRAM_BINDING_TYPE_0119;
See details in the API equivalent D3D12_MLIR_PROGRAM_BINDING_TYPE.
typedef struct D3D12DDI_MLIR_PROGRAM_CPU_BINDING_0119
{
const void* pData;
UINT64 DataSizeInBytes;
} D3D12DDI_MLIR_PROGRAM_CPU_BINDING_0119;
The lifetime of the memory referenced by pData is the scope of the call into PFND3D12DDI_DISPATCH_GRAPH_0108, so the driver must not dereference this pointer outside of this scope.
See details in the API equivalent D3D12_MLIR_PROGRAM_CPU_BINDING.
Cross-component exchanges of MLIR data (subgraph transformations, capability checks, etc.) drive the DXCGC compilation process. MLIR interfaces define the semantics and structure of serialized data exchanged between components, and each interface is associated with a GUID. Compilation targets (drivers, apps) must declare support for a subset of these interfaces to participate in the compilation process.
// ------------------------------------------------------------------------
// Compiler input/output data.
// ------------------------------------------------------------------------
// CGC Input IR.
// {7FC4F23F-2474-4AE8-AE7A-2722118B350E}
DEFINE_GUID(CGC_INPUT,
0x7fc4f23f, 0x2474, 0x4ae8, 0xae, 0x7a, 0x27, 0x22, 0x11, 0x8b, 0x35, 0xe);
// CGC Output IR.
// {F53E923F-5300-4EEF-8F1E-ADA03FFF2605}
DEFINE_GUID(CGC_OUTPUT,
0xf53e923f, 0x5300, 0x4eef, 0x8f, 0x1e, 0xad, 0xa0, 0x3f, 0xff, 0x26, 0x5);
// ========================================================================
// Subgraph-related interfaces.
//
// Targets must support all SUBGRAPH_* interfaces if they wish to optimize
// subgraphs (i.e., declare patterns, specialize matches, and receive
// partitions at runtime).
// ========================================================================
// A partition of CGC Output IR comprising one or more subgraphs.
// Targets receive data associated with this interface through the
// CreateStateObject API/DDI when creating an MLIR program.
// {EAA277C3-190D-4049-A7B0-A611B11701B6}
DEFINE_GUID(CGC_SUBGRAPH_PARTITION,
0xeaa277c3, 0x190d, 0x4049, 0xa7, 0xb0, 0xa6, 0x11, 0xb1, 0x17, 0x1, 0xb6);
// ------------------------------------------------------------------------
// The following two interfaces are paired in an MLIR_EXCHANGE.
// inputData = null
// outputData = valid MLIR containing dxpatterns (see subgraph IR spec)
// ------------------------------------------------------------------------
// Compiler-initiated request to get all subgraph transformations from a target.
// {EB53032A-1116-4E71-8309-54347C1E5A26}
DEFINE_GUID(CGC_SUBGRAPH_DECLARATION_REQUEST,
0xeb53032a, 0x1116, 0x4e71, 0x83, 0x9, 0x54, 0x34, 0x7c, 0x1e, 0x5a, 0x26);
// Target's response to SUBGRAPH_DECLARATION_REQUEST.
// {BE441609-6AB8-4FA8-B551-019ED938C298}
DEFINE_GUID(CGC_SUBGRAPH_DECLARATION,
0xbe441609, 0x6ab8, 0x4fa8, 0xb5, 0x51, 0x1, 0x9e, 0xd9, 0x38, 0xc2, 0x98);
// ------------------------------------------------------------------------
// The following two interfaces are paired in an MLIR_EXCHANGE.
// inputData = valid MLIR (see subgraph IR spec)
// outputData = valid MLIR (see subgraph IR spec)
// ------------------------------------------------------------------------
// Compiler-initiated request to specialize a subgraph with concrete shapes.
// {EBED5ED6-2206-457B-8BD0-ABFCC677195B}
DEFINE_GUID(CGC_SUBGRAPH_SPECIALIZATION_REQUEST,
0xebed5ed6, 0x2206, 0x457b, 0x8b, 0xd0, 0xab, 0xfc, 0xc6, 0x77, 0x19, 0x5b);
// Target's response to SUBGRAPH_SPECIALIZATION_REQUEST.
// {F90A981E-0FCD-46D8-8AA1-166994C4DE02}
DEFINE_GUID(CGC_SUBGRAPH_SPECIALIZATION,
0xf90a981e, 0xfcd, 0x46d8, 0x8a, 0xa1, 0x16, 0x69, 0x94, 0xc4, 0xde, 0x2);
Metacommands are intended to be opaque implementations of specific algorithms that conform to fixed signatures (like a C function). MLIR programs are transparent implementations of arbitrary algorithms that conform to dynamic signatures (like a C program). There are a number of issues that make metacommands less than ideal for expressing workloads with dynamic signatures:
- The UMD-specific handle for a metacommand is not exposed through the API as part of ID3D12MetaCommand. This makes it infeasible for an application to serialize or cache metacommands.
- Capability checking (CheckFeatureSupport with D3D12_FEATURE_DATA_QUERY_META_COMMAND) is limited to general queries about the metacommand GUID rather than an instance. Again, there is no way to convey a metacommand instance to the driver.

Memory planning is the process of figuring out where the edges of a graph live in memory. For GPU execution this typically means planning global memory (DRAM), since on-chip memory caches are automatically managed by the hardware (shared memory can be planned if a graph is fused into a single kernel). This process can be done ahead of time (hence the term plan), or it can be done just in time (dynamic allocation). To maximize performance (or minimize memory usage), it’s preferable to figure this out in advance.
Consider the following graph with layers A, B, C, D, and E. For simplicity, all tensors (in, a, b, c, d, out) in the example have the same size in bytes. The interior tensors (a, b, c, d) are called intermediates since they are only temporarily needed while the network is executed; however, the boundary tensors (in, out) usually need to persist in memory after execution completes.
graph LR
input:::io;
output:::io;
A:::layer;
B:::layer;
C:::layer;
D:::layer;
E:::layer;
input --in--> A --a--> B --b--> D --d--> E --"out"--> output;
A --a--> C --c--> E;
classDef layer fill:#66bbff,color:black;
classDef io fill:#ffff66,color:black;
A naive memory plan for this network involves simply allocating unique memory for each tensor. In this case, the memory footprint is the sum of all tensors. This is simple, but it’s obviously impractical for larger networks. A slight extension of this is to page memory in and out as tensors are needed: for example, the memory storing tensor a can be paged in to execute layer A, then paged out once layers B and C complete. This solution, of course, comes at the cost of performance as paging memory across PCI-e is extremely slow.
In practice, most implementations will try to reuse memory locations occupied by intermediate edges once those tensors are no longer needed. Assuming input and output tensors are preallocated (common, but not necessarily required), such a memory plan might look like this:
| Executing Layer | Mem[0] | Mem[1] | Mem[2] | Mem[3] | Mem[4] |
|---|---|---|---|---|---|
| Start | in | out | ? | ? | ? |
| A | in | out | a | ? | ? |
| B | in | out | a | b | ? |
| C | in | out | a | b | c |
| D | in | out | d | b | c |
| E | in | out | d | b | c |
| End | in | out | d | b | c |
Note how Mem[2] is initially occupied by tensor a, then repurposed to store tensor d once layers B and C are complete. This type of plan can be implemented by reference counting edges and doing a graph traversal. A related optimization is exploiting “in-place execution” on layers where the input memory can be repurposed as the output (e.g., elementwise ops). These optimizations are conceptually simple, but the example here glosses over the challenge of memory fragmentation since all tensors are the same size. There are many ways to address fragmentation, but they won’t be explained here.
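The reference-counting plan described above can be sketched in a few lines of standard C++. This is a minimal illustration, not part of the D3D12 API; the Node struct and PlanMemory helper are hypothetical, and all tensors are assumed to be equal-sized (as in the example), so fragmentation is ignored.

```cpp
#include <map>
#include <string>
#include <vector>

// A node consumes some tensors and produces one tensor.
struct Node
{
    std::string name;
    std::vector<std::string> reads;
    std::string writes;
};

// Greedy planner: assign each produced tensor the lowest-index free slot,
// and free a slot once the tensor's last reader has executed. Boundary
// tensors (e.g., in/out) are preallocated and persist after execution.
std::map<std::string, int> PlanMemory(
    const std::vector<Node>& schedule,
    const std::map<std::string, int>& preallocated)
{
    std::map<std::string, int> slotOf = preallocated;
    std::map<std::string, int> refCount;
    for (const Node& n : schedule)
        for (const std::string& t : n.reads)
            ++refCount[t];
    std::vector<bool> inUse(slotOf.size() + schedule.size(), false);
    for (const auto& [t, s] : slotOf) inUse[s] = true;
    for (const Node& n : schedule)
    {
        // Allocate the output tensor in the first free slot (output must
        // not alias a still-live input, so allocate before releasing).
        if (!slotOf.count(n.writes))
        {
            int s = 0;
            while (inUse[s]) ++s;
            inUse[s] = true;
            slotOf[n.writes] = s;
        }
        // Release slots of tensors with no remaining readers.
        for (const std::string& t : n.reads)
            if (--refCount[t] == 0 && !preallocated.count(t))
                inUse[slotOf[t]] = false;
    }
    return slotOf;
}
```

Running this on the example graph (with in and out preallocated to slots 0 and 1) reproduces the table above: tensor d reuses the slot vacated by a.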
A subtle consequence of reusing memory locations in memory planning is the introduction of implicit dependencies that affect concurrency. In the original graph, layer C can run concurrently with nodes B and D; however, the memory plan above effectively makes C a dependency of D given the shared memory location. If D executed before C (or simultaneously with C) then it would overwrite the contents of tensor a.
graph LR
input:::io;
output:::io;
A:::layer;
B:::layer;
C:::layer;
D:::layer;
E:::layer;
input --in--> A --a--> B --b--> D --d--> E --"out"--> output;
A --a--> C --c--> E;
C -."implicit dependency".-> D;
classDef layer fill:#66bbff,color:black;
classDef io fill:#ffff66,color:black;
Was it smart to reuse memory in this example? If the model wouldn’t fit in memory without reusing the memory location for a and d then this is absolutely a good plan. However, perhaps layers B and C only have work to occupy half the GPU: latency could be reduced by running them in parallel. Alternatively, maybe B fully occupies the GPU and it is better to run C and D in parallel. Ultimately, the right choice depends on the network, the hardware, and the optimization goal of minimizing memory or minimizing latency.
Memory planning is an interesting challenge with serious consequences for performance. The intent of this brief introduction is to highlight one of the tasks that is well-suited to a compiler algorithm and challenging for handwritten implementations.
Execution scheduling is the process of figuring out when to execute (dispatch) units of work, like shader programs or MLIR programs. In D3D12 this is done using command lists and queues. Most operations within a command list are asynchronous by default; for example, two back-to-back shader dispatches are assumed to have no dependency and thus can run simultaneously if the hardware has capacity. Barriers are the mechanism to enforce dependencies between work items in a command list, but the legacy barrier APIs make this quite problematic for efficiently handling dependency graphs.
graph LR
input:::io;
output:::io;
A:::layer;
B:::layer;
C:::layer;
D:::layer;
E:::layer;
input --in--> A --a--> B --b--> D --d--> E --"out"--> output;
A --a--> C --c--> E;
classDef layer fill:#66bbff,color:black;
classDef io fill:#ffff66,color:black;
To illustrate the issue with legacy barriers being used for expressing node dependencies, consider that a barrier needs to exist between node pairs A/B, A/C, B/D, C/E, and D/E. If every edge was a separate resource this might look like the following in a command list:
Dispatch(A) // consumes input; produces a
Barrier(a)
Dispatch(B) // consumes a; produces b
Barrier(b)
Dispatch(C) // consumes a; produces c
Barrier(c)
Dispatch(D) // consumes b; produces d
Barrier(d)
Dispatch(E) // consumes d, c; produces out
Barrier(out)
In theory, a driver could detect the implicit dependencies between nodes and schedule them in parallel. For example, the driver could notice that Dispatch(B) and Dispatch(C) depend only on a and thus can run in parallel immediately after Barrier(a). However, this dependency isn’t so clearly articulated as shown in the comments. Furthermore, consider what happens if the intermediate edges are suballocated out of a single buffer resource (especially common to alias scratch memory as shown in the section on memory planning):
Dispatch(A)
Barrier(scratch)
Dispatch(B)
Barrier(scratch)
Dispatch(C)
Barrier(scratch)
Dispatch(D)
Barrier(scratch)
Dispatch(E)
Barrier(out)
In this case, the driver MUST synchronize between every dispatch since it has no way of knowing if the reads and writes to scratch alias the same regions. In other words, Barrier(scratch) in this example is effectively equivalent to a “null” or “global UAV barrier” that forces all UAV accesses to complete before any subsequent commands can commence. DirectML relied heavily on aliasing a single large scratch buffer and used global UAV barriers for enforcing correct execution order in graphs like the one above. The observant reader may notice that the naive translation of resource-specific barriers to global barriers leads to excessive synchronization (e.g., a barrier between B and C, which is not needed). This leads to a forced decision on how to “flatten” the dependency graph by grouping dispatches into barrier boundaries. This DAG has four valid arrangements with global barriers:
1. A, Barrier, B, C, Barrier, D, Barrier, E
2. A, Barrier, C, B, Barrier, D, Barrier, E
3. A, Barrier, B, Barrier, C, D, Barrier, E
4. A, Barrier, B, Barrier, D, C, Barrier, E

Which of the above schedules is optimal? It is hard (or even impossible) for a D3D client to know, and it’s possible that none of them maximizes utilization of the hardware. A full-fledged solution to this problem would require expressing the dependency graph more explicitly, but this is out of scope at the moment. As a consolation, however, it is important to recognize that MLIR program partitions mitigate this to an extent. Consider if nodes B, C, and D are grouped into a partition:
graph LR
input:::io;
output:::io;
A:::layer;
E:::layer;
subgraph Partition
B:::layer;
C:::layer;
D:::layer;
end
input --in--> A --a--> B --b--> D --d--> E --"out"--> output;
A --a--> C --c--> E;
classDef layer fill:#66bbff,color:black;
classDef io fill:#ffff66,color:black;
The resulting command list would be A, Barrier, Partition, Barrier, E, which is significantly better than forcing an arbitrary decision on where to place nodes B/C/D. In the future, we may consider extending the work graph APIs to support the dispatch of an explicit dependency graph. The current design is not directly suitable since joins aren’t supported.
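One valid flattening like the schedules above can be produced by grouping nodes into topological levels, with a single global barrier between levels: every node in level N depends only on nodes in earlier levels, so nodes within a level may run concurrently. This is an illustrative sketch in standard C++ (the LevelSchedule helper is hypothetical), not how a driver or the runtime is required to schedule work.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <vector>

// Group a dependency DAG into "levels" by iterating to a fixed point
// (assumes the graph is acyclic). Input maps each node to its set of
// predecessors; output is one group of node names per barrier boundary.
std::vector<std::vector<std::string>> LevelSchedule(
    const std::map<std::string, std::set<std::string>>& deps)
{
    std::map<std::string, int> level;
    bool changed = true;
    while (changed)
    {
        changed = false;
        for (const auto& [node, preds] : deps)
        {
            int l = 0;
            for (const std::string& p : preds)
                l = std::max(l, level[p] + 1);
            if (level[node] != l) { level[node] = l; changed = true; }
        }
    }
    int maxLevel = 0;
    for (const auto& [node, l] : level) maxLevel = std::max(maxLevel, l);
    std::vector<std::vector<std::string>> groups(maxLevel + 1);
    for (const auto& [node, l] : level) groups[l].push_back(node);
    return groups;
}
```

For the example DAG (A before B and C, B before D, C and D before E), this yields the groups {A}, {B, C}, {D}, {E}, i.e., the first of the four arrangements listed earlier.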
The previous sections on memory planning and execution scheduling subtly hint at two fundamentally different ways to look at computational graphs:
Dataflow graph: nodes represent computation (network layers, subgraphs, shader dispatches, etc.) and edges represent data (tensors, memory references, pointer and offset, etc.). As the name indicates, this type of graph emphasizes the data going into and coming out of nodes; however, it does not explicitly convey dependencies that affect execution order.
Dependency graph: nodes represent computation (or any task really) and inbound edges represent dependencies on other nodes that must complete before the target node can start.
Regardless of the framework or interchange format (ONNX, GGUF, MLIR bytecode, etc.), nearly all ML models are described as dataflow graphs. A dataflow graph is flexible in that it doesn’t prescribe details of how to allocate memory or schedule execution of nodes, so it enables a variety of implementation approaches. A dependency graph description of an ML model is typically a byproduct of locking down many of these decisions and doesn’t give as much wiggle room in an implementation.
| Version | Date | Description |
|---|---|---|
| v0.11 | 05/04/2026 | |